Disseminate: The Computer Science Research Podcast - Vasily Sartakov | CAP-VMs: Capability-Based Isolation and Sharing in the Cloud #19

Episode Date: January 23, 2023

Summary: Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the memory management unit (MMU) enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large trusted computing bases in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for memory capabilities offer new opportunities to implement isolation and sharing at a finer granularity. In this episode, Vasily talks about his work on cVMs, a new VM-like abstraction that uses memory capabilities to isolate application components while supporting efficient data sharing, all without mandating application code to be capability-aware. Listen to find out more!

Links: OSDI paper, Vasily's homepage, Vasily's LinkedIn

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Waudby. I'm delighted to say I'm joined today by Vasily Sartakov, who will be talking about his OSDI paper, CAP-VMs: Capability-Based Isolation and Sharing in the Cloud. He is an advanced research fellow in the Large-Scale Data and Systems group at Imperial College London. Vasily, thanks for joining us on the show. Thank you. Can you start off by setting the scene for the listener and introducing your research? Okay. First of all, I would like to tell you about my background,
Starting point is 00:00:59 what I did before I started my career in research, because it defines my research topics in many aspects. I graduated from university in 2010 with an engineering diploma, and I already had some operating-system-related experience in industry, and my final thesis was devoted to microkernels. Since then, I was involved in various R&D activities in the area of system security, mostly related to microkernels, microkernel frameworks, and microhypervisors.
Starting point is 00:01:29 I was leading a small group of engineers called Ksys Labs. At this time, there was the second, maybe the third, coming of microkernels. seL4 became open source. There were several projects from Germany, from Dresden: Fiasco.OC, L4Re, the Genode framework, microhypervisors, and NOVA. There is actually a paper about it.
Starting point is 00:01:54 It was a quite popular topic in the industry and many people wanted to use microkernels. And our usual task sounded like, hey, we have a huge monolithic software, and we need someone who will partition it and run it on top of a microkernel. They usually needed a TCP stack, network services, obviously device drivers, a feature-rich graphical user interface, et cetera.
Starting point is 00:02:20 And the typical problem was the messaging interface, not only on the technical side, but mostly on the socio-technical side. I will explain. Monolithic software uses jump-and-return interfaces between components. And when you cut them, you need to redevelop this interface: to copy arguments, to redesign opaque pointers, callback tables, etc. This requires engineering effort, but more importantly, someone needs to support these ports. And this is the problem.
Starting point is 00:02:53 People come, people go, and nobody wants to repeat the same partitioning each time the software is updated. This is actually why microkernel frameworks usually have coarse-grained components. The complexity of the interfaces doesn't make fine-grained partitioning practical. Huge components usually have narrow interfaces that are easy to port to messaging and to support later. And let's come back to my research. So in 2017, I moved to Germany and started my research in confidential computing.
Starting point is 00:03:22 And in 2019, I moved to the United Kingdom. In my research, from a technological point of view, I'm looking at how new hardware features, particularly intra-process isolation mechanisms, can be used to efficiently partition and isolate software. Speaking about the operating system, it's not just the layer nearest to the metal. The part I'm interested in doesn't manage hardware, but manages software, and provides
Starting point is 00:03:45 primitives for isolation and sharing. The example is jump-and-return calls. You can use intra-process isolation mechanisms like Intel MPK or CHERI to isolate components and to create IPC primitives that preserve the jump-and-return semantics of calls, as we demonstrated in our ASPLOS paper called CubicleOS. Today, we will discuss the Intravisor and CAP VMs, and they also use hardware-supported mechanisms, such as hardware memory capabilities, to efficiently isolate components and provide efficient primitives for data sharing. Again, isolation and sharing plus novel hardware technologies. With all that in mind, can you describe to the listener what the current
Starting point is 00:04:25 approaches to achieving application compartmentalization in the cloud are today? The key element in compartmentalization is the isolation technology. It defines what you can do with a compartment and, in general, how practical the compartmentalization will be. However, despite the existence of various modern isolation technologies, such as Intel MPK, and SGX, which is also some sort of intra-process isolation with a very special threat model, the overwhelming majority of practical solutions are based on virtual machines, hardware virtualization,
Starting point is 00:05:04 and containers, namespace virtualization. Okay, great. So what are the problems with VMs and containers? Let's have a look at them from the point of view of the tension between isolation and sharing. Virtual machines provide very strong isolation. They are based on the idea that there is no red pill, and if you have two communicating VMs hosted on top of a single server, you should use a TCP/IP stack
Starting point is 00:05:33 and virtual networking for communication. It's very excessive, because the data is in memory already, and all you need is to give reliable access to it. Instead, we use TCP/IP for communication and data exchange. Containers, in turn, virtualize the namespace, and they may use something better than networking,
Starting point is 00:05:53 but the kernel is the lowest common denominator of all containers, including the host kernel, which means that your REST API application includes a USB driver in its trusted computing base, or a duplicated file system, or some components that you don't ever use. And this is the problem. In other words, VMs have very strong isolation and may have a small shared trusted computing base.
Starting point is 00:06:23 For example, the microhypervisor NOVA is very, very small. But they have very slow cross-VM IPC. Containers, on the other hand, have weaker isolation and a huge shared trusted computing base, but relatively fast IPC. And what we actually need
Starting point is 00:06:41 is virtualization with a low shared trusted computing base and fast IPC. Okay, cool. So given the state of play and the current trade-offs you just explained, what is the key idea behind CAP VMs? The idea of the project is that we don't use the MMU for isolation. Indeed, the MMU gives you sharing and isolation, the basic technology. But the MMU defines the characteristics of the compartmentalization. We have processes, we share at page granularity,
Starting point is 00:07:12 and we involve a shared intermediary like the kernel or hypervisor each time we want to perform IPC. So can you explain to the listener what memory capabilities are? Now we come to the basic technology of the project. So let's consider an example. You have a register, you can load any value into it, and you can point this register to any reasonable address in the address space. So you can load and store data via this register
Starting point is 00:07:45 using simple instructions. But in the case of hardware memory capabilities, this register holds not only the address it points to, but also the bounds of the memory that this register can point to in principle. Each time the CPU loads an address into this register, it actually loads a fat pointer.
Starting point is 00:08:07 It combines an address, bounds, and permissions. And so the CPU disallows you to reference data outside of the bounds using this register. These operations are like special loads or special stores. Of course, you can't construct a capability from a random sequence of bytes, and capabilities can be created only from another capability via capability-aware instructions.
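To make this concrete, here is a minimal sketch of that behaviour in CHERI C. It assumes a CHERI Clang toolchain and the pure-capability ABI (discussed below), where every pointer is a capability and the cheri_bounds_set() and cheri_length_get() helpers come from <cheriintrin.h>:

```c
/* Minimal sketch: deriving a capability with narrowed bounds.
 * Assumes CHERI Clang, pure-capability ABI, <cheriintrin.h>. */
#include <cheriintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *buf  = malloc(64);                 /* capability covering 64 bytes */
    char *view = cheri_bounds_set(buf, 16);  /* derived capability: 16 bytes */

    view[0] = 'x';                                    /* in bounds: fine     */
    printf("length = %zu\n", cheri_length_get(view)); /* prints 16           */

    view[16] = 'y';  /* out of bounds: the CPU raises a capability fault     */
    free(buf);
    return 0;
}
```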
What are the main challenges that arise by using memory capabilities as part of a cloud stack?
Starting point is 00:08:39 Speaking about memory capabilities in the context of our paper, we of course speak about the CHERI architecture. And CHERI is a very novel, very new implementation of hardware memory capabilities. It's a hybrid architecture: it combines at the same time the MMU plus capability-based isolation. And CHERI requires porting of software. If you want to get all the features of hardware memory capabilities, such as memory safety, you should port your software
Starting point is 00:09:20 to something called the pure-capability ABI. In this ABI, all registers are capabilities and all instructions that use pointers are capability-aware. In practice, it means that some low-level and system software requires some modification, and people usually don't like to change things that work. So if you have a very huge project, it is highly likely that you will need to change something
Starting point is 00:09:47 in this project to make it work on top of the pure-capability ABI. Generally speaking, as I said, pointers become twice larger. So if you have padding in structures, if you care about alignment and other low-level things, you should change something in your software. However, the compiler and the build system provided by CHERI and the people who developed this project are very clever. And if you don't use low-level, specific things that are about alignment, as I mentioned, you will not see any problems.
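As a hedged illustration of the "pointers become twice larger" point: on 64-bit CHERI, a pure-capability pointer occupies 128 bits (plus an out-of-band validity tag), so struct layouts shift. A sketch:

```c
/* Sketch: how a struct's size and padding change under the
 * pure-capability ABI, where a pointer is a 128-bit capability. */
#include <stdio.h>

struct node {
    int   id;    /* 4 bytes                                          */
    char *name;  /* 8 bytes natively; 16 bytes under pure-capability */
};

int main(void) {
    /* native riscv64:          sizeof(struct node) == 16 (4 + pad + 8)  */
    /* pure-capability CHERI:   sizeof(struct node) == 32 (4 + pad + 16) */
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    return 0;
}
```

Code that copies structs by computed offsets or packs pointers into integers is exactly the kind of low-level thing that needs attention.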
Starting point is 00:10:19 Okay, cool. So what are the key features then of CAP VMs, and how do they go about addressing the challenges you just described, and how do they improve over the pitfalls of VMs and containers? In short, CAP VMs offer lightweight isolation, a private namespace in the form of a deprivileged library operating system,
Starting point is 00:10:51 and fast cross-CAP-VM IPC primitives, which work at byte granularity. The latter is also important.
Starting point is 00:11:08 Of course, you can share memory for communication between virtual machines, but you should think twice before remapping pages. What if you accidentally expose something that shouldn't be exposed? Well, in the case of CAP VMs, you can give access to the data in place, because sharing works at byte granularity. Okay, nice. So let's dive into the details a little bit now of CAP VMs. Can you give the listener an overview or describe the architecture of CAP VMs? CAP VMs are managed by the Intravisor. We sometimes call it a type-C hypervisor. It's a user-level process that uses its own memory to host CAP VMs. A CAP VM is just a binary executing in a constrained way.
Starting point is 00:11:47 Code inside the CAP VM can jump outside the CAP VM only in a controlled way. This is called the host-call interface. And it can access memory outside the CAP VM bounds, also only in a controlled way. Generally speaking, you can execute any code inside a CAP VM, but to run something practical, you of course need a Linux-compatible
Starting point is 00:12:10 environment, and we provide it. So we have a deprivileged LKL, the Linux Kernel Library, for full compatibility with Linux, plus the musl libc. Nice. So how are CAP VMs isolated then?
Starting point is 00:12:28 Now, this is a very interesting part. The isolation is a very important technical part of this project, and we fully rely on the CHERI architecture. We already discussed a bit the pure-capability ABI, where all pointer-related instructions and registers are capabilities, but we use the hybrid ABI. This ABI allows you to use native instructions, where everything is capability-unaware, and selectively use capabilities and capability-aware instructions when necessary.
Starting point is 00:12:55 Also, hybrid code is constrained by something called default capabilities. Internally, all instructions become relative to two capability registers: the default data capability and the program counter capability. Constrained code can't access data or code outside the bounds of the default capabilities, but can do this via capabilities and capability-aware instructions. So this is the mechanism we use for cross-CAP-VM IPC. In other words, default capabilities are part of a thread context. They carve out a fraction of the virtual address space,
Starting point is 00:13:29 which means that you can create multiple compartments defined by pairs of non-intersecting default capabilities. That's how we create multiple CAP VMs inside the single address space of the Intravisor. Nice. So what does the API of a CAP VM look like then? The API is quite minimalistic; we just implemented a proof of concept.
Starting point is 00:13:52 You can instruct the Intravisor to spawn a CAP VM described by several parameters, like the size of the compartment, the binary you want to load inside, the arguments you pass, etc. Also, there is an interface between the Intravisor
Starting point is 00:14:08 and the CAP VM. So mechanisms such as the creation of threads or interaction with I/O can't be implemented inside the CAP VM, and the CAP VM invokes the Intravisor to provide those mechanisms. This is our host call interface. You also can use
Starting point is 00:14:24 capability-based primitives, I mean IPC based on capabilities. We have CAP files, CAP calls, CAP streams. They also use the host call interface.
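As a rough sketch of what such a minimalistic spawn interface might look like, here is a hypothetical C rendering; the names (cvm_spec, intravisor_spawn) are illustrative stand-ins, not the actual Intravisor API:

```c
/* Hypothetical sketch of a CAP VM spawn request; the real
 * Intravisor interface differs. */
#include <stddef.h>

struct cvm_spec {
    size_t      size;    /* size of the compartment                     */
    const char *binary;  /* image loaded inside (e.g. LKL + musl + app) */
    const char *args;    /* arguments passed to the payload             */
};

/* Returns a handle to the newly spawned CAP VM, or -1 on error. */
int intravisor_spawn(const struct cvm_spec *spec);
```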
Starting point is 00:14:39 Cool. So how do CAP VMs then avoid the need for capability-aware application code? Good question. Let's speak about our capability-based IPC mechanisms. Constrained code, even compromised, can't access data and code outside of the bounds defined by the default capabilities.
Starting point is 00:15:12 Also, you can't create default capabilities from nothing or increase the permissions of existing ones. So the code is really constrained. And to invoke functions of the Intravisor via the host call interface, you need capabilities. And CAP VMs have them. The Intravisor stores the capabilities for the host call interface at the moment of creation of a CAP VM. So now we have a CAP VM, and there is a controlled way you can jump outside of this CAP VM. For that, you need to use capability-aware instructions
Starting point is 00:15:46 and you need to have capabilities. So you must use capability-aware code. However, this is the lowest level, like the hardware abstraction layer inside classical VMs. So this code will be pure-capability code anyway. But we want to speak about the application code, which is always native, which means capability-unaware. But capability-unaware application code wants to benefit from the
Starting point is 00:16:13 use of capability-based IPC mechanisms, right? And we have a transition layer, which is pure capability, and it's like a driver. So when you have a connection between two CAP VMs, one is a donor CAP VM and the other one is a recipient CAP VM. The donor informs the Intravisor that it has memory to share, again via a key, while the recipient probes this memory via the key, like the shared memory interface in POSIX, and the Intravisor stores the capability to this shared memory into the memory of the recipient. So now the recipient CAP VM
Starting point is 00:16:57 has a capability to the shared memory, and to use it, it should use capability-aware instructions that work inside the CAP VM kernel, as I said, like a driver. So the native code reads and writes data via a kernel object, say a file. And when it reads this via a system call, in fact, the kernel driver uses capability-aware instructions
Starting point is 00:17:22 to read data via the remote capability, from the remote CAP VM. That's all. So the separation between native and pure-capability code is the kernel interface, the syscall interface.
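A hedged sketch of the donor/recipient flow just described, in hybrid CHERI C: the __capability qualifier is real CHERI C syntax, while host_call_share() and host_call_probe() are hypothetical stand-ins for the host call interface, not the actual functions.

```c
/* Sketch only: byte-granularity sharing between two CAP VMs. */
#include <stddef.h>

/* Hypothetical host calls into the Intravisor. */
void host_call_share(const char *key, void *buf, size_t len);  /* donor     */
void * __capability host_call_probe(const char *key);          /* recipient */

void donor(char *buf, size_t len) {
    /* Expose len bytes in place under a key; no page remapping. */
    host_call_share("log-buffer", buf, len);
}

void recipient(char *local, size_t len) {
    /* The Intravisor has stored a bounded capability for the donor's
     * memory; the library-OS "driver" fetches and uses it. */
    char * __capability remote = host_call_probe("log-buffer");
    for (size_t i = 0; i < len; i++)
        local[i] = remote[i];  /* capability-aware loads from donor memory */
}
```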
Cool. Awesome. So can you talk us through how you went about implementing CAP VMs and how you evaluated them?
Starting point is 00:17:44 What did you compare them against? How did you compare them? And what were the key results from your evaluation? We prototyped the Intravisor and CAP VMs using the CHERI-enabled 64-bit RISC-V platform. It's a development by people from Cambridge
Starting point is 00:17:59 and the CHERI project. And we used an FPGA for benchmarks, also implemented by the CHERI team. Also, we used SiFive HiFive Unmatched boards to test complex multi-threaded services, but obviously without any security guarantees. We needed this platform to generate Docker images. So we have full compatibility
Starting point is 00:18:25 with ordinary systems. So we use Docker to compile and generate images that we use as CAP VMs. And also, we tested our system on top of a multi-core CPU to obtain
Starting point is 00:18:42 performance results for complex applications. Well, generally speaking, of course, our evaluation is performance-based benchmarking. We measured the performance of the IPC primitives we developed, such as CAP files, CAP calls, CAP streams. We compared them with legacy interfaces. And the most important result probably is that if you don't need synchronization
Starting point is 00:19:09 mechanisms in your IPC, you can reach IPC performance close to memory-copy speed, which is very fast. Of course, there is a not-so-brilliant result. If you need synchronization, you will pay for the
Starting point is 00:19:28 synchronization, because you need to involve the host kernel each time you want to acquire a mutex, for example. And this adds some overhead. So if you use small chunks of data and you need synchronization, our proposed mechanism is not always faster than legacy ones, but of course there are cases where our CAP files
Starting point is 00:19:57 and CAP streams significantly outperform legacy interfaces, by two times and more. Yeah, you preempted my next question that I was going to ask you earlier: any situations in which CAP VMs are the wrong or suboptimal choice, and what are the characteristics of those situations? But I guess you covered it there, unless you've got anything else to add on that. CAP VMs, yes, of course, actually many things. CAP VMs offer lightweight isolation and have
Starting point is 00:20:25 a low shared trusted computing base and have fast cross-CAP-VM
Starting point is 00:20:28 capability-based IPC mechanisms. So, of course, we already know that
Starting point is 00:20:34 there are some exceptions, there are some cases where our IPC might be less
Starting point is 00:20:39 efficient than it could be, but the first and second features
Starting point is 00:20:49 are anyway very important. So, of course, if you don't need capability-based IPC, or your use case doesn't benefit from the use of it, you will anyway have strong isolation and a low-TCB infrastructure. Then, another thing is if you want to run more than one application inside a CAP VM, this will become a little bit tricky. But it's a question whether in real life you would run more than one application inside a container or a VM. Because if we look at the practical use of containers, people usually try to spawn
Starting point is 00:21:36 a single service inside a single container or a VM. But anyway, if you want to run more than one application inside a CAP VM, this becomes slightly tricky and will require some modification. So the problem is the transformation of pointers when you pass them between isolated compartments, like a program and the kernel, or between the kernel and the Intravisor. Those pointers should either be capabilities, and then there will be no problem, but the code should be capability-aware; or, if they're integers, so if they're capability-unaware, those integers should be transformed to be valid inside different compartments. And this
Starting point is 00:22:24 works with two nested compartments with the same base. When you have two compartments, one inside another, and they have the same base, it works very well; for an application and its kernel, say. The application can't access the kernel, the kernel can access the application; this works very well. But if you want to add another application, like application, application, kernel, one of the applications will be more privileged than the other one.
Starting point is 00:22:48 This can technically be solved; there is a solution that will allow you to have multiple native applications isolated, with no need for transformation of pointers when you pass them between different compartments. But this solution isn't very optimal. This is a limitation of the architecture. Cool. So how could a malicious user, an attacker, gain control of a CAP VM? What was the threat model you used when you were developing and designing CAP VMs? Very easy. We use native code, and native code is capability-unaware,
Starting point is 00:23:36 and it doesn't benefit from the use of capabilities. So all we have is read, write, execute bits on memory and nothing else. So obviously, if you have an attack based on stack smashing, you can gain control over the application. And in our model, well, we assume that you can get access even to the kernel, which is technically kind of isolated from the application. So, well, you're free to do this. We assume that we have bugs everywhere and our attacker gains
Starting point is 00:24:09 control. However, it's not a problem. What can an attacker do? They can try to access memory outside the borders of the CAP VM, and this is impossible, because all native instructions
Starting point is 00:24:26 are constrained by the default capabilities, and there is no way to increase the permissions of existing default capabilities, because, as I said, you can't cast a sequence of bytes as a capability. So a capability should be created by someone,
Starting point is 00:24:42 and, well, there is no mechanism to create a new capability for a CAP VM inside the CAP VM. So we can't access data outside our CAP VM. The attacker can try to jump outside of the CAP VM. And again, it's impossible, because to jump outside, you need to perform a capability-aware instruction
Starting point is 00:25:07 and you need to use capabilities to jump outside of the CAP VM. And again, all capabilities that a CAP VM has that have execution permission are sealed. They can't be changed. So there is only one way to jump outside of a CAP VM, and this way is controlled.
Starting point is 00:25:32 We will always go to the defined place, and the Intravisor will look at where you're going. And the third point: the attacker can try to access data provided by other CAP VMs. For example, you share data with a CAP VM and provide the capability for that. It's okay. And then you decide to revoke this access. And you may assume that the malicious attacker can still
Starting point is 00:26:06 access this data because it had a capability. Again, our system will prevent this, because the attacker can't store capabilities provided by other parties of the communication. So it sounds very safe and secure.
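A small sketch of why forging fails: the validity tag travels only with complete, capability-aware copies, so smuggling a capability through raw bytes strips it. This assumes CHERI Clang, the pure-capability ABI, and <cheriintrin.h>:

```c
/* Sketch: a capability copied byte-by-byte loses its tag and
 * can no longer be dereferenced. */
#include <cheriintrin.h>
#include <stddef.h>
#include <stdio.h>

int main(void) {
    int x = 42;
    int *p = &x;  /* valid, tagged capability */
    int *q;

    /* Byte-wise copy: ordinary byte stores do not carry the tag. */
    for (size_t i = 0; i < sizeof p; i++)
        ((unsigned char *)&q)[i] = ((unsigned char *)&p)[i];

    printf("tag = %d\n", (int)cheri_tag_get(q)); /* prints 0 */
    printf("%d\n", *q);  /* faults: dereferencing an untagged capability */
    return 0;
}
```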
Starting point is 00:26:22 Brilliant. So where do you see this research having the biggest impact? Who will find CAP VMs the most useful? Who do you think your results are the most relevant for? This is research work, and we asked a research question: what does the cloud stack look like if we had
Starting point is 00:26:46 memory capabilities? So, first of all, we consider the research community as the target audience: hardware-accelerated IPC at byte granularity, deprivileged components with very low overhead for isolation. And we also, well, showed how great hybrid code is. There is a huge part of the community in the area of memory capabilities which discourages people from using hybrid systems. They try to shift the accent more onto pure-capability code, while we use hybrid code. From the industry point of view, obviously any area that uses virtualization,
Starting point is 00:27:39 but the dissemination is very limited by availability. There is the Morello board, through a program supported by the UK government, but I don't know how easy it is for ordinary people to get access to this board. Okay, cool. So what would you say, over the course of working on CAP VMs, is the most interesting lesson that you've learned?
Starting point is 00:28:06 Maybe the most unexpected thing you've learned while working on this project? Well, to be honest, I was very constrained in time, so I had no time for lessons. But at the same time, I was very surprised when we moved everything from QEMU to the FPGA board and everything worked out of the box without any change. That's quite unusual in practice when you do this. The CHERI team has been
Starting point is 00:28:35 doing a great job. On the technical side, there was a moment when I realized that I was actually benchmarking the TLB, not what I expected to benchmark. So low-level benchmarks are very tricky and require a lot of attention. You always must be sure that you're benchmarking what you intend to benchmark, and not something else.
Starting point is 00:28:59 So I got very, very wrong numbers. I didn't understand why in one case I had, let's say, a 2x performance benefit, and in another case 5x, and only after some very long investigation did I accidentally realize it was actually because of the TLB. We have different addresses, and they have different performance impacts. Nice. So obviously progress in research is very non-linear.
Starting point is 00:29:28 From the conception of the initial idea to the actual end, the publication, were there things that you tried that failed, and were there other things that maybe you could explain to the listeners so they don't end up making the same mistakes that you did? This project is complete. I mean, there is a story: design, implementation, benchmarks. And it has a finished set of features.
Starting point is 00:29:59 So, well, people in research know that sometimes, to sell an idea, you need to add several features to make reviewers happy. And after the first rejection, then the second, people add more and more features, making the project more and more complicated, while it doesn't change the original idea. So in the current state of the project, it's incredibly complicated to add something to it from the point of view of features.
Starting point is 00:30:26 Otherwise, you will break something else. For example, some may say that people may want pure-capability CAP VMs, while we have hybrid CAP VMs. Actually, the Intravisor already supports pure-capability CAP VMs, but the use of pure-capability code breaks the revocation story. So you can't revoke shared capabilities using the method we introduced in the paper. So we had several forks in the road. Well, do you want to pass pointers between isolation layers as integers or capabilities? Because this defines whether you need to redevelop the interfaces or can use all of what you have.
Starting point is 00:31:08 Do we want to add capability-aware calls or just use all of what we already have? It's actually how I began today: I spoke about jump-and-return calls at the beginning of our discussion,
Starting point is 00:31:23 and it's exactly the example of the problem and the possible decision. So another question is, do we need revocation of capabilities or not? And finally, are we capable of porting huge software like LKL to a pure-capability architecture? Depending on the decisions, the implementation will be completely different. And the story will also be different. And it's hard to say that all the other decisions are scientifically wrong, especially given that in systems, the scientific criterion is performance.
Starting point is 00:32:03 If one can find a way to easily port LKL to pure capability, or deal with capability revocation in pure-capability code, this would be a great project to get a paper. The revocation problem actually is a very well-known problem in the field of hardware memory capabilities, while porting huge low-level system code is very challenging, but also depends on the resources you have. But, well, maybe people can find an easier way to do this.
Starting point is 00:32:33 Nice. So kind of building on that, I know you said this project is kind of finished, but what do you have planned for future research? We are already working on the next project. It's also related to the future of cloud computing. It uses the concept of CAP VMs. This time we will try to overcome some drawbacks of pure-capability code, but the project is at a very early stage. So, okay, I don't have much information about it. Sure, no problem.
Starting point is 00:33:07 So I know you kind of talked us through your background at the start, but what kind of attracted you to this research area and what do you think is the biggest challenge facing it now? I do like operating systems and I
Starting point is 00:33:24 have been working in this area for many years and it suits my skill set. Also, I like to use fancy hardware features to improve software systems. However, it's relatively complicated to conduct research in this field, the field of operating systems. I want to say that this is challenging not because of the scientific side, but more because of the gap between existing research, industry, and the way the research community accepts
Starting point is 00:33:56 ideas. There is a keynote from ASPLOS'21 by Timothy Roscoe, and he described this problem very well, so I recommend everyone to have a look at it. Long story short, systems in general and operating systems
Starting point is 00:34:11 in particular are very problem-oriented, and that's why we see performance-based justification of scientific results in every single paper. And because the overwhelming amount of software lives in the 90s, with monolithic Linux, people want to see solutions which are
Starting point is 00:34:33 compatible or comparable with Linux. And instead of moving to new technologies invented in the last 20 years, people deal with old things that just work. So researchers don't know whether, on the one hand, to propose yet another solution to problems that have already been solved for many years, which nobody uses, or to try to convince people with very crazy ideas
Starting point is 00:35:01 that can't be compatible with Linux and the expected set of benchmarks and software. Cool, awesome. So, time for the last word now. What's the one key thing you want listeners to take away from your research on
Starting point is 00:35:17 CAP VMs? Hardware memory capabilities introduced in CHERI not only solve problems related to memory safety, but can be used to build low-overhead, highly partitioned cloud systems. Fantastic.
Starting point is 00:35:33 It's not only about clouds. Perfect. Well, we will leave it there. Thanks so much, Vasily. If you are interested in knowing more about his work, we will put links to all the relevant materials in the show notes. And thanks for listening, and we'll see you all next time for some more awesome computer science research. Thank you.
