Disseminate: The Computer Science Research Podcast - Vasily Sartakov | CAP-VMs: Capability-Based Isolation and Sharing in the Cloud #19
Episode Date: January 23, 2023
Summary: Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the memory management unit (MMU) enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large trusted computing bases in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for memory capabilities offer new opportunities to implement isolation and sharing at a finer granularity. In this episode, Vasily talks about his work on cVMs, a new VM-like abstraction that uses memory capabilities to isolate application components while supporting efficient data sharing, all without mandating application code to be capability-aware. Listen to find out more!
Links: OSDI Paper, Vasily's homepage, Vasily's LinkedIn
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
I'm delighted to say I'm joined today by Vasily Sartakov, who will be talking about his OSDI paper,
CapVMs, Capability-Based Isolation and Sharing in the Cloud.
He is an advanced research fellow in the large-scale data and systems group of Imperial College London.
Vasily, thanks for joining us on the show.
Thank you.
Can you start off by setting the scene for the listener and introducing your research?
Okay. First of all, I would like to tell you about my background,
what I did before I started my career in research,
because it defines my research topics in many
aspects.
I graduated from university in 2010 with a diploma of an engineer, and I already had
some operating system-related experience in the industry, and my final thesis was devoted
to microkernels.
Since then, I was involved in various R&D activities in the area of system security,
mostly related to microkernels, microkernel frameworks, microhypervisors.
I was leading a small group of engineers called Ksys Labs.
At this time, there was the second, maybe the third coming of microkernels.
seL4 became open source.
There were several projects from Germany, from Dresden: Fiasco.OC, L4Re, the Genode framework, and the NOVA microhypervisor. There is actually a paper about it.
It was a quite popular topic
in the industry and many people
wanted to use micro kernels.
And our usual task sounded like: hey, we have a huge monolithic piece of software, and we need someone who will partition it and run it on top of a microkernel. They usually need a TCP stack, network services, obviously device drivers, a graphical user interface, et cetera.
And the typical problem was the messaging interface, not only on the technical side, but mostly on the socio-technical one.
I will explain.
Monolithic software uses jump-and-return interfaces between components.
And when you cut them, you need to redevelop this interface to copy arguments, to rework pointers, callback tables, etc.
This requires engineering effort,
but more important, someone needs to support these ports.
And this is the problem.
People come, people go,
nobody wants to repeat the same partitioning each time to update software.
This is actually why microkernel frameworks
usually have coarse-grained components.
The complexity of the interfaces doesn't allow partitioning to be practical.
Huge components usually have narrow interfaces that are easy to port to messaging and to support later.
And let's come back to my research.
So in 2017, I moved to Germany and started my research in confidential computing.
And in 2019, I moved to the United Kingdom. In my research from technological point of view,
I'm looking how new hardware features,
particularly inter-process isolation mechanisms,
can be used to efficiently partition and isolate software.
Speaking about operating systems, the part I'm interested in is the one farthest from the metal: it doesn't manage hardware, but manages software and provides
primitives for isolation and sharing. The example is jump and return calls. You can use
inter-process isolation mechanisms like Intel MPK or CHERI to isolate components and to create IPC primitives that preserve the jump-and-return semantics of calls, as we demonstrated in our ASPLOS paper called CubicleOS.
Today, we will discuss the Intravisor and CAP-VMs, and they also use hardware-supported mechanisms, such as hardware memory capabilities, to efficiently isolate components and provide
efficient primitives for data sharing.
Again, the isolation and sharing plus novel hardware technologies.
With all that in mind, can you describe to the listener what the current approaches to achieving application compartmentalization in the cloud are today?
The key element in compartmentalization is the isolation technology. It defines what you can do with a compartment and, in general,
how practical the compartmentalization
will be.
However, despite the existence of various modern isolation technologies, such as Intel MPK, or SGX, which is also some sort of inter-process isolation with a very special threat model, the overwhelming majority of practical solutions are based on virtual machines (hardware virtualization) and containers (namespace virtualization).
Okay, great. So what are the problems with VMs and containers?
Let's have a look at them from the point of view of the tension between isolation and sharing.
Virtual machines provide very strong isolation.
They are based on the idea that there is no red pill,
and if you have two communicating VMs
hosted on top of a single server,
you should use the TCP/IP stack
and virtual networking for communication.
It's very wasteful, because the data is in memory already, and all you need is to give reliable access to it.
Instead, we use TCP/IP for communication
and data exchange.
Containers, in turn, virtualize namespaces, and they may use something better than networking, but the kernel is the lowest common denominator across all containers, including the host kernel, which means that your REST API application includes a USB driver in its trusted computing base, or a duplicated file system, or some components that you never use.
But this is the problem.
In other words, VMs have very strong isolation and may have a small shared trusted computing base. For example, the NOVA microhypervisor is very, very small, but cross-VM IPC is very slow.
Containers, on the other hand, have weaker isolation and a huge shared trusted computing base, but relatively fast IPC.
And what we actually need is virtualization with a small shared trusted computing base and fast IPC.
Okay, cool.
So given the state of play and the current trade-offs you just explained, what is the key idea behind CapVMs?
The idea of the project is that we don't use MMU for isolation.
Indeed, the MMU gives you sharing and isolation; it is the basic technology.
But the MMU defines the characteristics of the compartmentalization.
We have processes, we share at page granularity,
and we involve a shared intermediary like a kernel or hypervisor
each time when we want to perform IPC.
So, can you explain to the listener what memory capabilities are?
Now we come to our basic technology of the project.
So let's consider an example.
You have a register, and you can load any value into it
and point this register to any reasonable address in the address space.
So you can load and store data via this register
using simple instruction.
But in the case of hardware memory capabilities,
this register has not only the address
this register points to,
but also bounds of the memory
that this register can point to in principle.
Each time when CPU loads an address into this register,
it actually loads a fat pointer.
It has a built-in address, bounds, and permissions.
And so the CPU disallows you to reference data outside of the bounds when you use this register.
These operations are special loads and special stores.
Of course, you can't construct capability
from a random sequence of bytes,
and capabilities can be created only from another capability
via capability-aware instructions.
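For readers who think in code, here is a minimal, purely conceptual C sketch of what a capability carries and of the check the hardware performs on every access. The struct and helper are teaching aids invented for this page, not CHERI's actual compressed encoding, which is manipulated only by capability-aware instructions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Conceptual model of a memory capability ("fat pointer"):
 * address + bounds + permissions + validity tag.
 * Hypothetical layout, NOT the hardware encoding used by CHERI. */
typedef struct {
    uint64_t address;  /* where the capability currently points     */
    uint64_t base;     /* lowest address it may ever reference      */
    uint64_t length;   /* size of the reachable region              */
    uint32_t perms;    /* load/store/execute permission bits        */
    bool     tag;      /* validity bit; cleared if forged/corrupted */
} concept_cap_t;

enum { PERM_LOAD = 1, PERM_STORE = 2, PERM_EXEC = 4 };

/* The check the CPU performs on every capability-relative access. */
static bool cap_access_ok(concept_cap_t c, uint64_t addr,
                          uint64_t size, uint32_t need) {
    return c.tag &&
           (c.perms & need) == need &&
           addr >= c.base &&
           addr + size <= c.base + c.length;
}

int main(void) {
    concept_cap_t buf = { .address = 0x1000, .base = 0x1000,
                          .length = 64, .perms = PERM_LOAD | PERM_STORE,
                          .tag = true };
    printf("in-bounds store ok?  %d\n", cap_access_ok(buf, 0x1008, 8, PERM_STORE));
    printf("out-of-bounds load?  %d\n", cap_access_ok(buf, 0x2000, 8, PERM_LOAD));
    return 0;
}
```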
What are the main challenges that arise from using memory capabilities as part of a cloud stack?
Speaking about memory capabilities in the context of our paper, we of course speak about
the CHERI architecture. And CHERI is a very novel, very new implementation of hardware
memory capabilities. It's a hybrid architecture. It combines at the same time MMU plus capability-based isolation.
And CHERI requires porting of software.
If you want to get all the features of hardware memory capabilities, such as memory safety, you should port your software to something called the pure-capability ABI.
In this ABI, all pointer registers are capabilities and all instructions that use pointers are capability-aware.
In practice, it means that some low-level and system software requires some modification, and people usually don't like to change things that work.
So if you have a very huge project, it is highly likely that you will need to change something in this project to make it work on top of the pure-capability ABI.
Generally speaking, as I said, pointers become twice as large.
So if you have padding in structures,
if you care about alignment and other low-level things,
you should change something in your software.
However, the compiler and the surrounding tooling provided by CHERI, by the people who developed this project, are very clever.
And if you don't use low-level, specific things that depend on alignment, as I mentioned, you will not see any problems.
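As a small illustration of the "pointers become twice as large" point, the struct below typically changes size and padding when built for a pure-capability target where pointers are 16 bytes rather than 8. The numbers in the comment assume a conventional 64-bit ABI versus 16-byte, 16-byte-aligned capabilities, so treat them as indicative rather than exact.

```c
#include <stdio.h>

/* On a conventional 64-bit ABI: 4 (int) + 4 (padding) + 8 (pointer) = 16 bytes.
 * Under a pure-capability ABI with 16-byte, 16-byte-aligned pointers this
 * typically becomes 4 + 12 (padding) + 16 = 32 bytes, so any code that
 * hard-codes offsets, serializes the struct, or assumes 8-byte alignment
 * needs attention when porting. */
struct record {
    int   id;
    char *name;
};

int main(void) {
    printf("sizeof(struct record) = %zu\n", sizeof(struct record));
    printf("sizeof(char *)        = %zu\n", sizeof(char *));
    return 0;
}
```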
Okay, cool. So what are the key features then of CAP VMs and how do they go about addressing
the challenges you just described and how do they improve over the pitfalls of VMs and
containers? In short, CAP VMs offer lightweight
isolation,
a private namespace in the form of a deprivileged library operating system, and fast cross-CAP-VM IPC primitives, which work at byte granularity.
The latter is also important.
Of course, you can share memory for communication between virtual machines, but you should think twice before remapping pages.
What if you accidentally expose something that shouldn't be exposed?
Well, in the case of CAP-VMs, you can give access to the data in place, because sharing works at byte granularity.
Okay, nice.
So let's dive into the details a little bit now of CAPVMs.
Can you give the listener an overview or describe the architecture of CAPVMs?
CAP-VMs are managed by the Intravisor. We sometimes call it a "type-C" hypervisor. It's a user-level process that uses its own memory to host CAP-VMs. A CAP-VM is just a binary executing in a constrained way.
Code inside the CAP VM can jump
outside the CAP VM only in a controlled way.
This is called the host-call interface,
and it also can access memory outside
CAP VM bounds also only in a controlled way.
Generally speaking, you can execute any code inside a CAP-VM,
but to run something practical,
you, of course, need a Linux-compatible environment, and we provide it. So, we have a deprivileged LKL, the Linux Kernel Library, for full compatibility with Linux, plus a musl libc.
Nice. So, how are CapVMs isolated
then? Now, it's a very
interesting part.
The isolation is a very important technical part of this project,
and we fully rely on the CHERI architecture.
We already discussed the pure-capability ABI a bit, where all pointer-related instructions and registers use capabilities, but we use the hybrid ABI.
This ABI allows you to use native instructions when everything is capability unaware
and selectively use capabilities
and capability-aware instructions when it's necessary.
Also, hybrid code is constrained
by something called default capabilities.
Internally, all instructions become relative to two capability registers, one for data and one for code (DDC and PCC in CHERI terms). Constrained code can't access data or code outside the bounds of the default capabilities, but can do so via explicit capabilities and capability-aware instructions.
So this is the mechanism we use for cross-CAP-VM IPC.
In other words, default capabilities are part of a thread context.
They make up a fraction of virtual address space,
which means that you can create multiple compartments
defined by pairs of non-intersecting default capabilities.
That's how we create multiple CAP-VMs
inside a single address space of Intravisor.
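A rough way to picture this: each CAP-VM is a thread context whose default data and code capabilities cover one disjoint slice of the Intravisor's address space. The sketch below models that bookkeeping in plain C with invented names; real default capabilities are installed with capability-aware instructions, not ordinary structs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-compartment bookkeeping: one (base, length) pair for
 * data (DDC) and one for code (PCC). All native loads, stores, and fetches
 * inside the cVM become relative to these, so two cVMs with non-intersecting
 * ranges cannot touch each other even within one virtual address space. */
typedef struct { uintptr_t base; size_t length; } concept_default_cap;

typedef struct {
    concept_default_cap ddc;  /* default data capability     */
    concept_default_cap pcc;  /* program counter capability  */
} concept_cvm;

static bool ranges_disjoint(concept_default_cap a, concept_default_cap b) {
    return a.base + a.length <= b.base || b.base + b.length <= a.base;
}

/* The monitor would only admit a new cVM whose bounds do not
 * intersect any existing compartment. */
static bool can_admit(const concept_cvm *existing, size_t n, concept_cvm next) {
    for (size_t i = 0; i < n; i++)
        if (!ranges_disjoint(existing[i].ddc, next.ddc))
            return false;
    return true;
}

int main(void) {
    concept_cvm a = { .ddc = { 0x10000000, 0x1000000 },
                      .pcc = { 0x10000000, 0x1000000 } };
    concept_cvm b = { .ddc = { 0x20000000, 0x1000000 },
                      .pcc = { 0x20000000, 0x1000000 } };
    printf("can admit b next to a? %d\n", can_admit(&a, 1, b));
    return 0;
}
```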
Nice. So what does the API of CapVM
look like then?
The API is quite minimalistic; it just implements a proof of concept.
You can instruct the Intravisor to spawn a CAP-VM described by several parameters, like the size of the compartment, the binary you want to load inside, the arguments you pass, etc.
Also, there is an interface between the Intravisor and the CAP-VM: mechanisms such as the creation of threads or interaction with I/O can't be implemented inside the CAP-VM, so the CAP-VM invokes the Intravisor to provide those mechanisms. This is our host call interface.
You also can use
capability-based
primitives, I mean
IPC based on capabilities.
We have CAP files, CAP calls, and CAP streams.
They also
use the host call
interface.
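To give a flavour of what such a minimalistic interface could look like, here is a hypothetical C sketch built only from the parameters Vasily lists (compartment size, binary, arguments) and the primitives he names; the identifiers are invented for illustration and are not the actual Intravisor API.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical spawn descriptor reflecting the parameters mentioned in
 * the episode. These names are invented, not the real Intravisor API. */
struct cvm_spawn_args {
    size_t       compartment_size;  /* bytes of Intravisor memory to carve out */
    const char  *binary_path;       /* image to load inside the cVM            */
    char *const *argv;              /* arguments passed to the guest           */
};

/* Illustrative control-plane and host-call prototypes (not implemented here):
 *   int intravisor_spawn(const struct cvm_spawn_args *args);
 *   int hostcall_thread_create(void (*entry)(void *), void *arg);  thread creation
 *   int hostcall_io_request(int channel, void *buf, size_t len);   I/O
 *   int cap_file_open(const char *key, size_t len);                CAP files
 *   int cap_call(int endpoint, void *args, size_t len);            CAP calls
 *   int cap_stream_write(int s, const void *buf, size_t len);      CAP streams
 */

int main(void) {
    char *guest_argv[] = { "redis-server", NULL };
    struct cvm_spawn_args args = {
        .compartment_size = 256u << 20,          /* 256 MiB compartment */
        .binary_path      = "/images/redis.elf",
        .argv             = guest_argv,
    };
    printf("would spawn '%s' in a %zu-byte compartment\n",
           args.binary_path, args.compartment_size);
    return 0;
}
```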
Cool. So how do CAP-VMs then avoid the need for your application code to be capability-aware?
Good question.
Let's speak about our capability-based IPC mechanisms.
Constrained code, even compromised, can't access data and code outside of the bounds defined by the default capabilities.
Also, you can't create default capabilities from nothing
or increase permission of existing ones.
So code is really constrained.
And to invoke functions of Intravisor via the host call interface, you need capabilities.
And CAP VMs have them.
The Intravisor stores the capabilities for the host call interface into the CAP-VM at the moment of its creation.
So now we have a CAP VM and there is a controlled way how you can jump outside of this CAP VM.
For that, you need to use capability-aware instructions
and you need to have capabilities.
So you must use capability-aware code.
However, this is the lowest level, like the hardware abstraction layer inside classical VMs.
So this code will be pure-capability anyway.
But we want to speak about the application code,
which is always native, which means
capability unaware. But capability unaware application code wants to benefit from the
use of capability-based IPC mechanisms, right? And we have a transition layer, which is pure
capability, and it works like a driver. So when you have a connection between two CAP-VMs, one is a donor CAP-VM and the other one is a recipient. The donor informs the Intravisor that it has memory to share under a key, while the recipient probes this memory via the key, like the shared memory interface in POSIX, and the Intravisor stores the capability to this shared memory into the memory of the recipient.
So now the recipient CAP-VM has a capability to the shared memory, and to use it, it should use capability-aware instructions that work inside the CAP-VM's kernel, as I said, like a driver.
So the native code reads and writes data through a kernel object, like a file. And when it reads this via a system call, in fact, the kernel driver uses capability-aware instructions to read data through the remote capability, from the remote CAP-VM.
That's all.
So the separation between native and pure capability code is the kernel interface,
the syscall interface.
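A condensed sketch of that donor/recipient flow, with invented function names standing in for the real host calls and an in-process table standing in for the Intravisor; the point it illustrates is that only the thin kernel-side "driver" would be capability-aware, while the application keeps doing ordinary reads.

```c
#include <stdio.h>
#include <string.h>

/* Toy in-process stand-in for the Intravisor's shared-memory registry.
 * Everything here is hypothetical and only illustrates the flow:
 * the donor publishes a region under a key, the recipient probes the
 * key, and a capability-aware "driver" reads through the result. */
struct share { const char *key; char *region; size_t len; };
static struct share registry[8];
static int registry_count;

/* Donor-side host call: register a region of the donor's memory. */
static int hostcall_share_publish(const char *key, char *region, size_t len) {
    registry[registry_count] = (struct share){ key, region, len };
    return registry_count++;  /* handle standing in for the stored capability */
}

/* Recipient-side host call: look the region up by key. */
static int hostcall_share_probe(const char *key) {
    for (int i = 0; i < registry_count; i++)
        if (strcmp(registry[i].key, key) == 0)
            return i;
    return -1;
}

/* Recipient kernel-side "driver": in a real cVM this would be the only
 * pure-capability code, using capability-aware loads on the remote region;
 * here a memcpy stands in for those loads. */
static long capfile_read(int handle, char *dst, size_t len) {
    if (handle < 0) return -1;
    size_t n = len < registry[handle].len ? len : registry[handle].len;
    memcpy(dst, registry[handle].region, n);
    return (long)n;
}

int main(void) {
    /* Donor cVM: offers part of its own memory under a key. */
    static char donor_region[64] = "hello from the donor cVM";
    hostcall_share_publish("sensor-feed", donor_region, sizeof donor_region);

    /* Recipient cVM application: capability-unaware; it just performs what
     * looks like an ordinary read() behind a syscall. */
    char buf[64] = {0};
    int fd = hostcall_share_probe("sensor-feed");
    capfile_read(fd, buf, sizeof buf - 1);
    printf("recipient read: %s\n", buf);
    return 0;
}
```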
Cool. Awesome.
So can you talk us through how you went about implementing CAPVMs
and how you evaluate them?
What did you compare them against?
How did you compare them?
And what were the key results from your evaluation?
We prototyped the Intravisor and CAP-VMs using a 64-bit RISC-V CHERI-enabled platform. It's a development of people from Cambridge and the CHERI project. And we used an FPGA for benchmarks, also implemented by the CHERI team.
Also, we used SiFive HiFive Unmatched boards to test complex multi-threaded services,
but obviously without any security guarantees.
We needed this platform to generate Docker images.
So we have full compatibility
with ordinary systems.
So we use Docker to compile
and generate images that we use
as CAP-VMs.
And also we tested
our system
on top of
multi-core CPU to obtain
performance results
in complex applications.
Well, generally speaking, of course, our evaluation is a performance-based benchmark.
We measure the performance of the IPC primitives we developed, such as CAP files, CAP calls, CAP streams.
We compare them with legacy interfaces.
And the most important result probably is that
if you don't need synchronization
mechanisms in your IPC, you
can reach performance
of IPC close
to memory copy speed, which
is very fast.
Of course, there is a not so
brilliant result.
If you need synchronization, you will pay for the
synchronization because
you need to involve the host kernel
each time you want to acquire a mutex, for example.
And this adds some overhead. So if you use
small chunks of data
and you need synchronization,
our proposed mechanisms are not always faster than the legacy ones, but of course there are cases where our CAP files and CAP streams significantly outperform legacy interfaces, by two times or more.
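For a sense of what "close to memory-copy speed" means as a yardstick, a baseline like the sketch below is the kind of number such a comparison is made against; it is plain C and not the paper's benchmark harness.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Minimal memcpy-throughput baseline; illustrative only. */
int main(void) {
    const size_t size = 64 * 1024 * 1024;   /* 64 MiB buffer */
    const int iterations = 20;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 0xAB, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++)
        memcpy(dst, src, size);              /* the "ideal" data-sharing cost */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("memcpy throughput: %.2f GiB/s\n",
           (double)size * iterations / secs / (1 << 30));
    free(src);
    free(dst);
    return 0;
}
```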
Yeah, you preempted my next question that I was going to ask you earlier.
Any situations in which CAP VMs are the wrong or suboptimal choice
and what are the characteristics of those situations?
But I guess you covered it there unless you've got anything else to add on that.
CAP-VMs, yes, of course, actually do many things.
CAP-VMs offer lightweight isolation, have a low shared trusted computing base, and have fast cross-CAP-VM capability-based IPC mechanisms.
So, of course, we already know that there are some exceptions, some cases where our IPC might be less efficient than it could be, but the first and the second features are anyway very important.
So, of course, if you don't need capability-based IPC, or your use case doesn't benefit from the use of it, you will anyway have strong isolation and a low-TCB infrastructure. Then another thing is, if you want to run inside a CAP-VM more than one
application, this will become a little bit tricky. It's a question whether, in real life, you would run more than one application inside a single container or VM. Because if we look at the practical use of containers, people usually try to spawn a single service inside a single container or a VM. But anyway, if you want to run more than one application inside a CAP-VM,
this becomes slightly tricky and it will require some modification. So the problem
is the transformation of pointers when you pass them between isolated compartments, like a program and the kernel, or between the kernel and the Intravisor. Those pointers should either be capabilities, and then there will be no problem, but the code has to be capability-aware. Or, if they're integers, so if
they're capability-unaware, those integers should be
transformed to be valid inside different compartments. And this
works with two nested compartments with the same base.
When you have two compartments,
one inside another one and they have the same base,
it works very well, as with an application and its kernel.
Application can't access kernel,
kernel can access application, this works very well.
But if you want to add another application, like application, application, kernel, then one of them will be more privileged; one application will be more privileged than the other one.
This technically can be solved, so there is a solution
that will allow you to have multiple native applications
isolated, and there will be no
need for transformation of pointers when you pass them
between different compartments.
But it isn't very optimal. This is a limitation of the architecture.
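The pointer-transformation issue comes down to a few lines of arithmetic: a capability-unaware program stores addresses as plain integers that are only meaningful relative to its own compartment, so a pointer crossing into a compartment with a different base has to be rebased. A hypothetical sketch, assuming each compartment is just a (base, size) window:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical compartment descriptor: a window of the shared
 * address space defined by its default capability's base and size. */
struct compartment { uintptr_t base; size_t size; };

/* A capability-unaware pointer leaving compartment `from` must be
 * rewritten to be valid inside compartment `to`. With two nested
 * compartments that share the same base (application inside kernel)
 * the offset is unchanged and no rewriting is needed; with different
 * bases every crossing needs this translation. */
static uintptr_t rebase(uintptr_t p, struct compartment from, struct compartment to) {
    uintptr_t offset = p - from.base;   /* position within the source window */
    return (offset < to.size) ? to.base + offset : 0;
}

int main(void) {
    struct compartment kernel = { 0x10000000, 0x4000000 };
    struct compartment app    = { 0x10000000, 0x1000000 };  /* nested, same base */
    struct compartment other  = { 0x20000000, 0x1000000 };  /* different base    */

    uintptr_t p = app.base + 0x1234;
    printf("app -> kernel (same base): %#lx\n", (unsigned long)rebase(p, app, kernel));
    printf("app -> other (rebased):    %#lx\n", (unsigned long)rebase(p, app, other));
    return 0;
}
```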
Cool. So how could a malicious user, an attacker, gain control of a CAP-VM? What was the threat model you used when you were developing and designing CAP-VMs?
Very easy. We use native code, and native code is capability-unaware, so it doesn't benefit from the use of capabilities. So all we have is read, write, execute bits on memory and nothing else.
So obviously, if you have an attack based on stack smashing, you can gain control over the application.
And in our model, well, we assume that you can get access even to the kernel,
which is kind of technically isolated from the application.
So, well, you're free to do this.
We assume that we have bugs everywhere
and our attacker gains
control. However,
it's not a problem
because
what an attacker can do,
they can try to access
memory outside borders of the
CAP VM, and this is impossible
because all native instructions
are constrained by default capabilities
and there is no way to increase
permissions of existing default capabilities
because as I said you can't
cast a sequence of bytes
as a capability
so this capability should be
created by someone
and well there is no mechanism
to create
a new capability for this CAP-VM inside a CAP-VM.
So we can't access data outside our CAP-VM.
The attacker can try to jump outside of the CAP-VM. And again, it's impossible, because to jump outside, you need to perform a capability-aware instruction and you need to use capabilities to jump outside of the CAP-VM.
And again, all capabilities that a CAPVM has,
that have execution permission,
they are sealed.
They can't be changed.
So there is only one way to jump outside of CAP-VM,
and this way is controlled.
We will always go to the defined place,
and the Intravisor will see where you're going.
And the third point,
the attacker can try to access data provided from other CAP VMs.
For example, you share data with another CAP-VM by providing a capability for it.
It's okay.
And then you decided to revoke this access.
And you may assume that a malicious attacker can still access this data because it had the capability. Again, our system will prevent this, because the attacker can't store capabilities provided by other parties of the communication.
Sounds very safe and secure.
Brilliant.
So where do you see this research having the biggest impact?
Who will find CAP-VMs the most useful?
Who do you think your results are the most relevant for?
This is the research work, and we ask a research question.
What does the cloud stack look like if we had memory capabilities? So, first of all, we consider the research community as the target audience: hardware-accelerated IPC at byte granularity, deprivileged components with very low overhead for isolation. And we also, well, have shown how great hybrid code is.
There is a huge part of the community
in the area of memory capabilities,
which discourages people from using hybrid systems. They try to put more emphasis on pure-capability code, while we use hybrid code.
From the industry point of view,
obviously any area that uses virtualization,
but the dissemination is very limited by hardware availability.
There is the Morello board, through a program supported by the UK government, but I don't know how easy it is for ordinary people to get access to this board.
Okay, cool.
So what would you say, over the course of working on CAP-VMs,
what's the most interesting lesson
that you've learned?
Maybe the most unexpected thing you've learned while working on this project?
Well, to be honest, I was very constrained in time, so I had no time for lessons.
But at the same time, I was very surprised when we moved everything from QEMU to the FPGA board
and everything
worked out of box without any
change. So it's quite an unusual
practice when you do this.
The CHERI team has been
doing a great job.
On the technical side, there was
a moment when I realized that
I'm actually benchmarking the TLB, but that's not what I expected to benchmark.
So low-level benchmarks are very tricky and require a lot of attention.
So you always must be sure that you're benchmarking what you're going to benchmark, but not something
else.
So I got very, very, very wrong numbers.
I didn't understand why in one case I have, let's say, 2x performance,
in another case 5x performance benefit,
and only after some very long investigation I accidentally realized
actually it's because of TLB.
We have different addresses, and they have different performance impacts.
Nice.
So obviously progress in research is very non-linear.
From the conception of the initial idea
to the actual end, the publication,
were there things that you tried that failed and were there other things that maybe you
could explain to the listeners
so they don't end up making the same mistakes that you made?
This project is complete.
I mean, there is a story, design implementation benchmark,
and it has a finished set of features.
So, well, people in research know that sometimes
to sell an idea,
you need to add several features to make reviewers happy.
And after the first rejection, the second,
people add more and more features, making the project more and more complicated,
while it didn't change the original idea.
So in the current state of the project,
it's incredibly complicated to add something into it from the point of view of features.
Otherwise, you will break something else.
For example, some may say that people may want pure capability CAP VMs, while we have hybrid CAP VMs.
Actually, the Intravisor already supports pure-capability CAP-VMs, but the use of pure-capability code breaks the revocation story.
So you can't revoke shared capabilities using the method we introduced in the paper.
So we had several forks on the road.
And, well, do you want to pass pointers between isolation layers as integers or as capabilities? Because this defines whether you need to redevelop the interfaces or can use all of what you already have.
Do we want to add capability-aware calls or just use all of what we already have? It's actually how I began today: I talked about jump-and-return calls at the beginning of our discussion, and it's exactly an example of the problem and a possible decision.
So another question is, do we need revocation of capabilities or not?
And finally, are we capable of porting huge software like LKL to a pure-capability architecture?
So depending on the decision, the implementation will be completely different.
And the story as well also will be different.
And it's hard to say that all other decisions are scientifically wrong,
especially given that in systems, the scientific criterion is performance.
If one can find a way to easily port LKL to pure-capability code, or deal with capability revocation in pure-capability code,
this would be a great project to get a paper.
The revocation problem actually is a very well-known problem
in the field of hardware memory capabilities,
while porting of huge low-level system code is very challenging,
but also depends on resources you have.
But, well, maybe people can find an easier way to do this.
Nice. So kind of building on that,
I know you said this project is kind of finished,
but what do you have planned for future research?
We are working already on the next project.
It's also related to the future of cloud computing.
It uses the concept of CAP-VMs. This time we will try to overcome some drawbacks of pure-capability code, but the project is at a very early stage. So, okay, I don't have much information about it yet.
Sure, no problem.
So I know you kind of talked us through your background
at the start, but what kind of attracted
you to this research area
and what do you think is the biggest
challenge facing
it now?
I do like
operating systems and I
have been working in this area for many years and it suits my skill set.
Also, I like to use fancy hardware features to improve software systems.
However, it's relatively complicated to conduct research in this field, the field of operating systems. I want to say that this is challenging not
because of scientific side,
but more about the
gap between existing research and industry, and the way the research
community accepts the
ideas. There is a
keynote from ASPLOS21
by Timothy Roscoe, and
he described this problem very well,
so I recommend to everyone to have a
look at it.
Long story short, the systems
in general and operating systems
in particular are very problem-oriented,
and that's why
we see performance-based
justification of scientific
results in every single paper. And because the overwhelming amount of software lives in the 90s with monolithic Linux, people want to see solutions which are
compatible or comparable with Linux.
And instead of moving to new technologies invented in the last 20 years,
people deal with old things that just work,
but researchers don't know whether, on the one hand, to propose yet another solution to problems that they have already been solving for many years, which nobody uses, or to try to convince people with very crazy ideas
that can't be compatible with Linux
and the expected
set of benchmarks and
software.
Cool, awesome. So, time for
the last word now. What's the one
key thing you want listeners to take
away from your research on
CAP-VMs?
Hardware memory capabilities introduced in CHERI not only solve problems related to memory safety, but can be used to build low-overhead, highly partitioned cloud systems.
Fantastic.
It's not only about clouds!
Perfect.
Well, we will leave it there.
Thanks so much, Vasily.
If you are interested in knowing more about his work,
we will put links to all his relevant materials
in the show notes.
And thanks for listening. And we'll see you all next time for some more awesome computer science research. Thank you.