Disseminate: The Computer Science Research Podcast - Mohamed Alzayat | Groundhog: Efficient Request Isolation in FaaS | #40
Episode Date: September 11, 2023
Summary: Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although efficient, this container reuse has security implications for functions that are invoked on behalf of differently privileged users or administrative domains: bugs in a function's implementation, third-party library, or the language runtime may leak private data from one invocation of the function to subsequent invocations of the same function.
In this episode, Mohamed Alzayat tells us about Groundhog, which isolates sequential invocations of a function by efficiently reverting to a clean state, free from any private data, after each invocation. Tune in to learn more about how Groundhog works and how it improves security in FaaS!
Links: Mohamed's homepage | Groundhog EuroSys'23 paper | Groundhog codebase
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby.
A quick reminder that if you do enjoy the show, please do consider supporting us through Buy Me A Coffee.
It really helps us to continue making the show.
Today, I'm joined by Mohamed Alzayat, who will be telling us everything we need to know about his work on Groundhog: Efficient Request Isolation in FaaS.
This was published recently at EuroSys. Mohamed is a final year PhD student at the Max Planck Institute for Software Systems,
and he's recently joined Amazon. Welcome to the show, Mohamed.
Thanks, Jack. Thanks for inviting me. It's my pleasure to be here.
The pleasure is all ours. Let's jump straight in. So can you tell us a little bit more about
yourself and how you became interested in systems research?
Sure. I'm Mohamed. I'm currently wrapping up my PhD under the supervision of Professors Peter Druschel and Deepak Garg at MPI-SWS and Saarland University. And I have recently
joined the AWS Kernel and Operating Systems team in Dresden, Germany. The discussion today is not affiliated with Amazon and has nothing to do with Amazon,
of course. So my interest in systems research developed over the years. In a sense, I always
wanted to understand how computers, as in software and hardware, worked internally. And that was
basically the driver for me to study computer science at the German University in Cairo.
After that, and during my master's at Saarland University in
Germany, I attended and audited several systems courses such as distributed systems, database
systems, and operating systems, among others. And I enjoyed both the theoretical and practical
aspects of these courses. So I approached Professor Druschel for a master's thesis, and he introduced me
to Professor Garg, and then from there continued
to do a PhD with them. Fantastic. So let's jump into the topic today then. So can you start off
and tell us a little more background: what is FaaS? What is Function as a Service?
Sure. So FaaS, as you mentioned, is an acronym for Function as a Service, and it's an emerging high-level abstraction for event-driven cloud applications.
This abstraction allows tenants to state their application logic as stateless, event-triggered functions, typically written in high-level languages like Python or JavaScript, and then upload them to the FaaS provider and get an endpoint that can be used to invoke these functions on demand.
And FaaS also has an on-demand charge model.
So the tenant only pays for the compute time and memory used during the execution of their functions.
To help make the paradigm more clear, let me briefly describe a typical workflow of deploying and using a FaaS function.
So the tenant or the developer writes
one or more functions and sends the code to the cloud provider. The cloud provider sends back an
endpoint that the tenant can use as part of their services. When an end client uses the tenant's
service, the service would issue a request to that endpoint. The FaaS provider would then forward that request to a provisioned instance of the execution environment with the tenant's code loaded into it, or provision one if none is readily available. The FaaS provider would then let the tenant's function run and do the processing and return the result, and this result would be forwarded by the FaaS provider to the end client. So this is a brief overview of what FaaS is and how it works.
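To make the model concrete, here is a minimal, hypothetical sketch of a stateless function packaged as a native executable, loosely modeled on OpenWhisk-style "native" actions (the event arrives as a JSON string in argv[1], and the result is printed to stdout); real providers each define their own contract, so treat this purely as an illustration:

```c
#include <stdio.h>

/* Hypothetical stateless FaaS handler: no state is meant to survive
 * outside this invocation; the provider supplies the event and
 * collects the output. */
int main(int argc, char **argv) {
    const char *event = argc > 1 ? argv[1] : "{}";
    /* Application logic would parse the event and compute a result;
     * here we simply echo the event back. */
    printf("{\"echo\": %s}\n", event);
    return 0;
}
```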
Awesome. That's really good.
It's a succinct definition with some good examples of how it works.
So cool.
So why is security important in this context then?
And how is it typically achieved today in FaaS?
So, of course, security is important for all systems, but in FaaS specifically there are several aspects. In FaaS, different functions from different developers share the same underlying software and hardware resources that are made available by the cloud provider. So this is one dimension, the fact that they all share
the resources. And the other dimension is that a single function may serve multiple end clients.
And the current FaaS approach to security focuses on the provider isolating function instances from one another in containers or lightweight VMs. So basically, they are isolating the available resources
such that no function instance can have access
to any of the data or resources of other instances.
Okay, cool.
So what's the problem with this approach? It sounds like a pretty clean approach of separating things out and keeping everything secure. So what are the problems with it? I guess this lays the groundwork for the motivation for Groundhog.
Sure.
So this approach is actually fine if all the end clients of a single FaaS instance are from the same trust domain, meaning it's either one client that uses one instance, or a set of clients that share all their data together.
So there are no different administrative or trust domains that use a single function. The problem is that this is not necessarily the case. In many cases, a single FaaS function instance is invoked or triggered on behalf of mutually distrusting clients. The problem here is that bugs in a function's implementation, or in one of the libraries it depends on, might retain confidential data from one request and leak it to a subsequent one. And what we want is to have strong isolation guarantees between different functions and across sequential invocations of the very same function on behalf of different clients.
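To make the risk concrete, here is a minimal, hypothetical example (not from the paper) of how an innocent-looking caching bug in a reused function instance can hand one client's data to the next:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical buggy handler: because the process is reused across
 * invocations, this static "cache" survives from one request to the
 * next and can leak a previous client's data to a later one. */
static char last_result[128];

static const char *handle(const char *user, const char *query) {
    if (query[0] == '\0')
        return last_result;   /* BUG: returns data computed for a previous user */
    snprintf(last_result, sizeof last_result, "%s's private report", user);
    return last_result;
}

int main(void) {
    printf("alice gets: %s\n", handle("alice", "report"));  /* alice's data */
    printf("bob gets:   %s\n", handle("bob", ""));          /* leaks alice's data */
    return 0;
}
```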
Awesome. Yeah, so just going off on a slight tangent there, with the way things currently are at the moment: are there any examples of where this has been exploited, where concurrent, sorry, sequential calls have been used to, I don't know, hack into systems or do any sort of mad, crazy things?
So FaaS is still an emerging paradigm.
However, these kinds of attacks happened on conventional servers.
And in many conventional servers like Apache, for example,
there is always a configuration for doing request
isolation. So Apache has the default pre-fork model, which can be configured to run each request
in an instance and kill the instance afterwards. So the fact that there are no large-scale exploits
that are publicized in the media doesn't rule out the possibility of such an exploit. In fact, it has been mentioned as one of the potential risks in the OWASP security report on the ten most dangerous potential risks in FaaS. Basically, it's called the shared space problem,
where multiple requests share the same space,
whether it be memory or storage.
So if an instance is running
and it serves or it handles multiple requests
one after the other,
then the memory is shared in a sense.
So luckily, there are no large-scale exploits yet,
but we should have systems-
That we know about.
Yes, yes, yes.
But we should have systems that have guarantees by design.
Yes, completely agree.
So kind of a simple way, I guess, to solve this problem in a very coarse way would be to run every function invocation in a fresh container. Why is this a bad idea?
Why is this a bad idea?
Yes, so, right, this is actually a very simple and sound way of enforcing sequential request isolation in FaaS.
However, this is very expensive from a performance point of view.
If we rely on this approach, we would have to deal with what is famously known as the cold start problem in FaaS. Basically, there are a few expensive steps that need to happen
so that the provisioned execution environment is ready to handle a new request.
So first resources must be allocated
and the new execution environment must be instantiated.
Then the language runtime must be initialized.
After that, the static data structures of the function will have to be populated.
And only then the function can get the first request inputs and process the request.
Now, most FaaS functions are short-lived, which means that the relative overhead of preparing the execution environment, compared to the actual execution of the function, would be very high. And this is
the reason cloud providers actually reuse existing execution environments
to serve sequential requests.
So what happens now is that
once a function is triggered by one request,
it's kept alive for a few minutes
so that if another request to the same function arrives,
it can be handled without having to pay
the cold start overhead.
Right, yeah, that makes total sense from an efficiency point of view; I guess that's why they do it. But obviously there are problems, like we said earlier on, so this sets us up perfectly for Groundhog. So tell us a little bit more about Groundhog, and maybe we can kick things off with the design principles you had behind it when you went about coming up with a more efficient solution to this problem.
Groundhog, basically there are two main properties
that we wanted to maintain while designing Groundhog.
So we wanted to preserve the performance benefits of reusing function instances
while at the same time enforcing request isolation.
And the other important thing we had in mind is: how can we design Groundhog such that it can be retrofitted into existing FaaS platforms?
So we wanted Groundhog to be transparent such that it can be plugged into a platform without any modifications required on the function side or on the platform side.
So these were the guiding design principles.
The key idea behind how Groundhog works is basically a simple observation.
So once the function is provisioned, we are 100% sure it has no client data, right? So the function is ready
to get the first request. It has no client data. So the idea is very simple. We take an in-memory
snapshot, essentially saving the warmed-up execution environment state before any confidential data is processed, and then let the processing happen for the first request. After the first request is finished, we can roll back to the snapshotted state off the critical path, basically before the next request arrives, which effectively makes subsequent requests operate on a pristine execution environment that has no confidential data.
So this is the high-level key idea.
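As a toy illustration of that lifecycle, here is a self-contained sketch; it is purely illustrative, since the real system snapshots and restores a whole process's memory and CPU state rather than a single struct:

```c
#include <stdio.h>
#include <string.h>

/* Toy model of Groundhog's lifecycle: snapshot the warmed-up state
 * before any client data arrives, restore it after every request. */
typedef struct { char scratch[64]; int requests_seen; } state_t;

static state_t live, snapshot;

static void handle_request(const char *input) {
    /* During processing, the state absorbs client data. */
    snprintf(live.scratch, sizeof live.scratch, "secret:%s", input);
    live.requests_seen++;
    printf("handled %s (requests_seen=%d)\n", input, live.requests_seen);
}

int main(void) {
    memcpy(&snapshot, &live, sizeof live);     /* warm-up done, no client data yet */
    const char *requests[] = { "alice", "bob" };
    for (int i = 0; i < 2; i++) {
        handle_request(requests[i]);
        memcpy(&live, &snapshot, sizeof live); /* rollback: pristine state again */
    }
    printf("scratch after rollback: \"%s\"\n", live.scratch); /* empty again */
    return 0;
}
```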
Okay, cool. So it's kind of a case of: we get everything up and running before we've done anything with it, so no user data or anything has come in yet. We then take an in-memory snapshot and put it to the side, do the processing we want to do for this function call, and once we've finished, we discard that version and roll back, essentially, to the in-memory state we had before. Then it's as if we're pristine and clean again and we can take the next call. Is that kind of how it works?
Exactly, without the need to re-provision it.
Yeah, awesome. So let's go into the details then. How do you go about achieving this? What's happening under the hood to make this possible?
So Groundhog is implemented as a management process that can control the function execution environment, which is, in our case, a standard Linux process. Groundhog's manager process can interrupt the function's process, that is, the execution environment that runs the function, create a snapshot of the function process's memory and CPU state in Groundhog's internal memory, and instruct the operating system to track any memory modifications within that function's process. Groundhog then lets the function process receive the inputs, do the processing, and return the result. Once the results are returned to the end client, Groundhog interrupts the function again to identify any changes that happened to the memory layout or CPU registers, and rolls these changes back by overwriting them with their original snapshotted versions.
So if we look at the design, we will find that, from the function's point of view, Groundhog is the FaaS platform, and from the platform's point of view, Groundhog is the function, because Groundhog interposes on the communication between the two.
And Groundhog relies on standard Linux facilities like ptrace, the proc file system, and standard soft-dirty-bits tracking to manage the lifecycle of the whole operation: basically, to control the process and to identify and roll back any changes. This design allows Groundhog to be fully transparent to both the function and the platform, modulo the need, of course, for the platform to enable Groundhog. And basically, Groundhog is able to track only the modified memory pages and restore only those pages, resulting in efficient rollbacks. Groundhog does this restoration off the critical path, after the function is done with the processing of the request, which means that the restoration overhead does not significantly affect the end-client latency, because most of the heavy lifting happens after the request is done.
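For the curious, here is a minimal sketch of the soft-dirty-bits mechanism Groundhog builds on. As a simplifying assumption it tracks its own process via /proc/self, whereas Groundhog does the equivalent for the function's pid from the outside; it is Linux-only, error handling is omitted, and reading pagemap flags may require privileges depending on kernel configuration:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define SOFT_DIRTY (1ULL << 55)   /* bit 55 of a pagemap entry */

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);

    /* 1. Clear all soft-dirty bits by writing "4" to clear_refs;
     *    the kernel write-protects the pages to catch future writes. */
    int cr = open("/proc/self/clear_refs", O_WRONLY);
    write(cr, "4", 1);
    close(cr);

    /* 2. Dirty one page. */
    static char buf[1 << 16];
    buf[0] = 'x';

    /* 3. Read that page's 64-bit pagemap entry and test the bit. */
    int pm = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry = 0;
    off_t off = (off_t)((uintptr_t)buf / page_size) * sizeof(uint64_t);
    pread(pm, &entry, sizeof entry, off);
    close(pm);

    printf("page soft-dirty after write: %s\n",
           (entry & SOFT_DIRTY) ? "yes" : "no");
    return 0;
}
```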
Nice, cool. So I'm just picturing this in my head: it's almost like a middleware that sits between the two things, and from both sides it looks like the thing they're expecting. So I wanted to switch to the implementation slightly. How hard was it to create that sort of abstraction between the two things? Was it a difficult engineering process?
Not difficult, no. So basically, Linux has this idea of tasks, or
processes that can be parents of other processes.
And if a process is a parent, it has the ability to read into the memory of the child.
And it has the ability also to interrupt it.
And basically, we can use some tricks to inject system calls into the child to do the operations we want in the child address space.
So it's not particularly hard, no.
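A minimal sketch of that parent/child control, under the assumption that interrupting the child and reading its memory are the interesting parts (the system-call injection Groundhog also uses is considerably more involved and omitted here):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    static int secret = 42;   /* same virtual address in the child after fork */
    pid_t pid = fork();
    if (pid == 0) { pause(); _exit(0); }      /* child: just wait around */

    ptrace(PTRACE_ATTACH, pid, NULL, NULL);   /* interrupt the child */
    waitpid(pid, NULL, 0);                    /* wait until it has stopped */

    int value = 0;                            /* read the child's memory */
    struct iovec local  = { &value,  sizeof value };
    struct iovec remote = { &secret, sizeof value };
    process_vm_readv(pid, &local, 1, &remote, 1, 0);
    printf("read from child: %d\n", value);

    ptrace(PTRACE_DETACH, pid, NULL, NULL);   /* let it run again */
    kill(pid, SIGKILL);
    return 0;
}
```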
That sounds good.
Was it a long implementation effort, or was it not that time-intensive?
Rather time-intensive, actually. So we tried a few things in the hope that we could get better performance, and that was probably the thing that took the most time. Of course, there were also some corner cases that required a lot of debugging, and that also takes time. Basically, while developing, you build something you expect to work, but then you see memory corruptions, and you try to find out why that happens and realize the memory layout is not what you assumed, or there is some flag that you missed, or something like that. But yeah, it was fun.
That sounds good.
Cool. Let's talk some numbers then, because you said that there's no real overhead that gets introduced, no latency from the client's perspective. So can we maybe touch a little bit on performance? But I'm also really interested in finding out how you actually measure the performance of something in terms of its security. Essentially, were you able to empirically measure how secure it was?
So let me answer the second question first. We didn't measure security, because the design guarantees security by the merit of erasing any data, basically by programmatically ensuring that any data that was introduced is rolled back. So there is no need to measure security in this case. But we definitely measured performance intensively.
Yeah, so basically, we evaluated Groundhog on a large set of micro- and macro-benchmarks. The microbenchmarks serve the purpose of validating our hypotheses on where the performance overheads are and how they are correlated with the total memory size and the write set, or dirty set, of a function. And the macro-benchmarks cover a wide variety of use cases, including web applications, data and image processing, and statistical computations, among others. Basically, these allowed us to capture the performance impact on applications that may use Groundhog as a building block for request isolation.
Maybe I can go through the setup of the experiments. The way we set up our experiments is by relying on an open-source platform called Apache OpenWhisk, which is a popular open-source FaaS platform. We deployed OpenWhisk using a two-node deployment, and the reason we did a two-node deployment was to performance-isolate the component we want to measure from everything else. The component we want to measure is what OpenWhisk calls the invoker; this is the component that launches and directly manages the function instances. So we had that on one node, and all other OpenWhisk components on another node.
And in our benchmarks, we compared Groundhog against an insecure baseline that serves one request after the other without any isolation. So basically, this is the standard way of reusing function instances.
And we compared against a copy-on-write approach: instead of tracking the modified pages and overwriting them after the function finishes, there is a simple way of using copy-on-write, which creates a copy of a page just before it is written, done transparently by the operating system.
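As a quick illustration of what transparent copy-on-write means at the OS level, here is a small fork()-based demo; whether Groundhog's copy-on-write baseline is realized exactly this way is not something this sketch claims, it only shows the mechanism:

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Copy-on-write via fork(): the child's writes go to private copies
 * of the pages, so the parent's pristine state is untouched. */
static char state[64] = "pristine";

int main(void) {
    if (fork() == 0) {                    /* child: handles one "request" */
        strcpy(state, "client-secret");   /* CoW: the page is copied here */
        printf("child sees:  %s\n", state);
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: %s\n", state);   /* still "pristine" */
    return 0;
}
```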
We also compared against a secure baseline, which starts a new function instance for each request, but we didn't plot the results for the secure baseline because its latency is just so high. Of course, all the numbers are in the paper, and it would have been easier if we could look at the graphs, but let me give a brief description of the high-level trends, perhaps. So for the microbenchmarks, we implemented two C functions: one that allocates a fixed amount of memory and has each request dirty a percentage of that size, and another function that allocates a varying amount of memory but dirties a fixed number of pages.
Basically, this allows us to see whether Groundhog's overheads are correlated more with the dirtying or with the total memory size. The high-level observation is that Groundhog's overhead on the critical path, basically while the function is processing the request and before it sends the response to the client, is correlated with the number of modified pages, because it keeps track of which pages have been modified. And the restoration overhead is correlated with both: with the total memory size, because we have to scan the whole memory and identify changes in the memory layout, and with the write set, because we have to roll back every modified mapping and every modified page. So these are the high-level trends we have seen in the microbenchmarks.
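For reference, the first microbenchmark presumably has roughly the following shape; this is a hypothetical reconstruction for illustration, not the paper's actual code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a fixed buffer, then dirty a configurable fraction of it
 * per "request", so rollback cost can be studied as a function of the
 * write set while the total memory size stays constant. */
int main(int argc, char **argv) {
    size_t total = 256u << 20;                      /* fixed 256 MiB allocation */
    double frac  = argc > 1 ? atof(argv[1]) : 0.1;  /* fraction to dirty */

    char *buf = malloc(total);
    memset(buf, 1, total);           /* warm-up: touch every page once */

    size_t dirty = (size_t)(total * frac);
    memset(buf, 2, dirty);           /* the per-request write set */

    printf("dirtied %zu of %zu bytes\n", dirty, total);
    free(buf);
    return 0;
}
```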
For the macro-benchmarks, we evaluated Groundhog, as I mentioned, on a wide set of benchmarks: the PyPerformance Python benchmarks, the PolyBench/C benchmarks, and the FaaS-profiler Python and Node.js benchmarks. In all of the benchmarks, Groundhog's end-to-end latency is on par with that of the insecure baseline, the one that serves requests one after the other, but the throughput was impacted by the rollback overhead. The throughput measurements here are a bit pessimistic, because the benchmark saturates the system, which is a worst-case scenario that should never happen in production. Overall, the majority of the PyPerformance and PolyBench/C benchmarks saw little to no noticeable impact on the end-to-end request latency, and on throughput as well, except for the very short benchmarks and the benchmarks with very large write sets. Those benchmarks had a drop in throughput.
For short benchmarks, think of a function that just gets the time and exits, so it takes less than one millisecond. For a one-millisecond function, a drop in throughput means that after each request, Groundhog interrupts the function, scans its memory, identifies any modified pages, rolls them back, and then hands control back to the function to serve the next request. If all of this happens in one millisecond, then that's one millisecond of execution plus one millisecond of rollback per request, which is a 50% drop in throughput. But starting this function from scratch would take at least 100 milliseconds, so it's still a huge improvement. I mean, if Groundhog does it in one millisecond, we have 50% of the throughput, but the alternative is much worse.
But for relatively longer functions, we see a very minimal drop. In some rare cases, when the function has a very high number of dirty pages or a workload that modifies the memory layout heavily, the rollback that analyzes the changes and rolls them back, basically unmapping all the newly added memory mappings, resizing mappings to their proper size, and restoring all the modified pages, sometimes has excessive overhead, and here we see a drop as well. But this is not the common case. It has been noticeable in Node.js specifically, because we are running a vanilla, unmodified Node.js, which has aggressive memory allocation patterns, with some garbage collection triggers that fire based on time.
But yeah, overall,
there is minimal impact to latency and throughput
for the average function, let's say.
Yeah, it sounds great. So for the average sort of use case, it seems almost like a free lunch in those scenarios, right? It was funny when you said that in production the customer shouldn't be doing this or shouldn't be doing that. I mean, they probably will be, right? People do some crazy things. But yeah, you shouldn't be rubbing up against the limits of your resources. But it's interesting. So, just thinking about crazy scenarios: is it possible to have a function that is really short in terms of time, but that also dirties a lot of state, creates a large amount of state? Because those are the two extremes where throughput can drop off, right? When you're doing something really short, it then has to do a big scan through and check everything, which is a lot of work relative to the size of the operation. And at the other end of the spectrum, when you're changing a lot of stuff, it then has to roll all of that back, which is again a lot of work. So is it quite a contrived scenario for both of those extremes to be true at the same time?
It's probably possible, but there is basically a limitation on how much you can do
in a limited amount of time. But of course it's possible, and in some extreme cases, maybe it's cheaper to start a function from scratch.
But basically, the nice thing is that Groundhog can be used transparently, which means it can be used in an opt-in fashion if it's ever adopted by a cloud provider. So as a client, you can go and say: I want to have Groundhog for this function, because we need security here; there is another function that is invoked by only a single client, and basically there is no need for request isolation there; and there is a third function that should be executed once and then get killed, for example.
Yeah, it's a really nice feature that allows you to be granular with respect to what the application requirements are, right? That's a really, really cool feature of it.
Are there any other sorts of scenarios, and we kind of touched on this already, where Groundhog is suboptimal? What are the limitations that might stop it being adopted by a cloud provider, maybe?
Apart from the functions that have very high dirtying rates of memory, which correspond to longer rollbacks, and where in some cases Groundhog, at least the prototype implementation of Groundhog, might not be the optimal solution, there are the known limitations that come with snapshot-based techniques. Namely, a snapshot may capture per-function-instance ephemeral state, such as the time at which the function started, or a pseudo-random number generator that has already been seeded in the initialization phase. So if a pseudo-random number generator has been seeded, and we take the snapshot after it has been seeded, this means that the next pseudo-random number would always be the same, because Groundhog rolls back the pseudo-random number generator's state. Similarly, if a timestamp was taken at the beginning for some reason, then we will always see that stale time, because we are not refreshing the timestamp. So these are sort of known limitations with snapshot-based techniques. These have workarounds, but we haven't implemented them as part of the prototype.
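The pseudo-random number pitfall is easy to demonstrate with a toy simulation, where re-seeding with the same seed stands in for a rollback restoring the PRNG's in-memory state (an illustrative assumption, not Groundhog's actual interface):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const unsigned seed = 42;   /* seeding happened before the snapshot */
    for (int req = 0; req < 3; req++) {
        srand(seed);            /* "rollback" restores the snapshotted PRNG state */
        printf("request %d draws %d\n", req, rand());  /* same value every time */
    }
    return 0;
}
```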
Yeah, cool. And this next question, we were joking about it a little bit before we started recording, is about what's next on the research agenda for Groundhog. Maybe it's these things, but I know you're, well, I'll let you tell the joke.
Yes, so the next thing is to hopefully defend my thesis. But yeah, these are important problems
in the FaaS paradigm. And in fact, snapshot-and-restore techniques have been used, or are being applied, to solve the cold start problem, basically by taking a snapshot of the execution environment so that it can be started faster than by reconstructing the state. And these ephemeral-state issues are the sort of problems that come with techniques that rely on snapshot and restore, and solutions and workarounds are being developed as we chat right now. So this is something that can be a follow-up for Groundhog, in addition to many optimizations and more reasoning about the security guarantees that one gets in FaaS.
Just on another point.
Yeah, first, defend. First, defend.
I just wanted to touch on this.
Because obviously, in a lot of papers, the systems all have names, and I like to know where the name comes from.
Why Groundhog?
So it refers to a movie, Groundhog Day. For the main character, every day basically rolls back and is repeated. Groundhog rolls back memory, so every day is repeated for the function instance, as in the movie. But yeah.
Yeah, I like that.
That's cool.
Awesome.
Yeah, cool.
So my next question is: what sort of impact do you think this work can have? Can it inspire a cloud provider to go and pick this up? What's the scope for impact with Groundhog, do you think?
So I think there are two sides to that.
There's the cloud provider side
and there's the software developer side.
And I would start with the software developer side.
So the very first important thing is the mindset while developing these functions. When working with sensitive client data, developers should keep in mind that the unit of isolation they should consider is not the function or the company or the application; it's each client request and each data item. Second, they have to keep in mind that isolation can be broken for several reasons, one of them being bugs, both in their own code and in the libraries and runtimes they rely on. Then there's the question of best practices for enforcing this client-level isolation. One very conservative, highly granular way is to enforce isolation per request, as Groundhog does. Less granular methods involve identifying sets of clients and routing a group of clients together into an administrative domain that gets served by its own set of functions, for example, these kinds of things.
Awesome, cool. Yeah, it feels like it has the possibility to be really inspiring and impactful going forward. So cool. When you were working on Groundhog, what was the most interesting thing that fell out of working on it, the most interesting lesson that you learned, I guess?
I would say to never optimize early on.
So basically, never try to get
the most optimal version ready.
Rather, start simple,
get your intuitions verified,
and then build the most stupid,
naive implementation that gets the job done,
and then iterate and optimize
afterwards.
Another thing that I
learned is
to fully automate
experiments from day one.
So basically, start
with the automation even
before building the system.
Have a plan for automating everything.
Basically, have all
experiments be able to run with a single Enter on a script.
Nice. So yeah, premature optimization is the root of all evil, but premature automation is not, right? That's what we're saying here: automate as soon as you can. Yeah, that's funny. Awesome. I mean, on the flip side of that then, and maybe it felt like Groundhog Day every day you were working on it, what were the things along the way that you tried doing that failed? What were the war stories?
So one prominent thing that we tried, and that failed, was relying on a newly available
kernel feature for tracking dirty pages, instead of the one we are currently using. So currently we are using something called the soft-dirty bits, where basically the operating system write-protects all the memory of the process, and whenever a write to a memory page happens, there is a page fault. The kernel then does the bookkeeping and sets a bit corresponding to that page to one, so that afterwards one can scan the pages and identify which pages were modified.
The alternative, the new feature, was the userfaultfd (user fault file descriptor) approach, which allows user space to get a notification for every modified page. So we tried working on that and had a full prototype that uses UFFDs, user fault file descriptors. But then the overhead of context switching for each notification was so high that it was cheaper to scan all pages and figure out which pages were modified. So basically, the advantage of the UFFD approach is that you don't need to go through all the memory pages and see which pages have been modified; instead, you just get a notification saying, okay, page X got modified, so you know right away to roll that one back after the request. But lesson learned: it turned out to be more expensive performance-wise, at least with the current implementation of UFFDs.
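For reference, here is a minimal sketch of what setting up that write-protect tracking looks like with userfaultfd. It assumes Linux 5.7 or later, omits all error handling, and stops short of the monitor thread that a real tracker would need to read and service the per-page events:

```c
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    long uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
    ioctl(uffd, UFFDIO_API, &api);           /* negotiate the WP feature */

    size_t len = 4096;
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Register the region and write-protect it: the first write to each
     * page now produces one UFFD_EVENT_PAGEFAULT message on uffd. */
    struct uffdio_register reg = { .range = { (unsigned long)mem, len },
                                   .mode = UFFDIO_REGISTER_MODE_WP };
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    struct uffdio_writeprotect wp = { .range = { (unsigned long)mem, len },
                                      .mode = UFFDIO_WRITEPROTECT_MODE_WP };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    /* A monitor thread would now read(uffd, &msg, sizeof msg) one
     * uffd_msg per dirtied page, record it, and un-protect the page;
     * that per-page round trip is exactly the overhead discussed above. */
    puts("tracking armed");
    return 0;
}
```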
Interesting.
Yeah, I guess there's some scope in the future
for that kind of relationship to change.
But how far down the road did you get
with this sort of approach before you realized,
damn, this is actually not the right thing to do?
Almost after having the full...
Oh, wow.
We had an initial prototype with the soft-dirty bits, and it was almost complete. We had decided to rely on stock kernels, basically to make adopting Groundhog easier, because no one wants to rely on kernel patches and maintain them. And then we realized that a new stock kernel version had been released in which UFFD page write-tracking support is available. So we thought, okay, this will cut our overhead of scanning the pages, let's do it. But yeah, it turned out to be more costly, at least the current implementation turned out to be more costly.
Yeah, but like you said, a lesson learned, I guess.
Well, another lesson learned is not to fully trust APIs if you see smoke sometimes, even if it's the kernel. So we noticed that one of the very old kernel features, the soft-dirty bits, had a bug. Basically, we had all the tracking in place, we were rolling back all the modified pages, and we were still hitting memory corruptions. And it turned out, after doing a binary comparison, that some pages had been modified: we would do the rollback, then compare the original memory with the rolled-back memory, and see that some pages had been modified. Then, debugging and trying to figure out, okay, is that page actually marked as dirtied? It turned out: no. So basically, it turned out to be a bug in the kernel.
Did it get patched, or did you have to work around it?
No, so basically, I asked on the kernel mailing lists, and it turned out to be a bug, so I helped a bit with finding the commit: we had a reproducer, and by bisecting the kernel we found the version in which it was introduced. And then one of the guys who was actively working on that subsystem added a patch, so it's now patched.
How long had it existed there for?
I think it existed for maybe six months or so.
I don't recall the exact date.
It's not too long.
We are talking like, well, six months is still quite a long time, right?
But I mean, it's not like 20 years, right?
But yeah.
Cool.
Yeah, awesome.
I guess we're almost towards the end of the podcast now, but can you maybe tell the listeners about your other research as well, the sort of things you've been working on across your PhD? Obviously Groundhog isn't the only thing you've worked on, so can you give us a flavor of some of the other things you've done?
So there are several directions of research that I worked on, including some work analyzing the impact of network delays
on Bitcoin, for example. But most of my research has been on how we can design, or rather redesign, cloud systems such that they provide additional privacy guarantees by design. An earlier example would be Pacer, which was led by Aastha Mehta. I worked on that one with Aastha, Roberta De Viti, Peter Druschel, Deepak Garg, and Björn Brandenburg, all from MPI-SWS. Pacer was tackling the problem of network I/O side channels. The idea is that basically any shared resource can be used to launch side-channel attacks and snoop on whoever is sharing that resource with the attacker, and the network is no exception. In the cloud, the network card is shared between the tenants of the same host, which means that an attacker can infer the traffic shape of a co-tenant, as we have demonstrated in the paper. And the problem here is that the traffic shape can be used to infer the content of the packets, even if the packets are encrypted, if the data being served is from a public corpus, think YouTube, Wikipedia, and so on. So Pacer basically redesigned the way networking happens in the cloud such that the traffic is forced to follow a predetermined shape. It's an interesting read. It proved to be more challenging than we initially anticipated, but we learned a lot through the process. So perhaps, if someone is interested, they can Google "Pacer: Comprehensive Network Side-Channel Mitigation in the Cloud" by Aastha Mehta.
Yeah, we'll stick that in the show notes.
We'll link all the relevant materials
so they can go and find out if they are interested.
So yeah, going off the back of that, and in a different direction: I'd like to know more about your creative process. How do you go about actually generating these ideas? Because you've worked on quite a few different things, right? And how do you then select which thing to work on?
That's a tough question, actually. I wouldn't claim that I have an established
creative process or an idea generation approach, at least yet. Rather, I just get curious about an
area, try to understand it and see what or where things can break, if there are any gaps or potential improvements, and start from there.
Also, I sometimes have this idea bank, which holds all the things I hear about and find interesting. But more often than not, I never go through them.
But yeah, I wouldn't say I have a principled or proper approach for that; basically, just chat with people, get curious, and learn about something new.
No, that's awesome. I think sometimes with something like a creative process, you almost don't want to formalize it and have it standardized, because that often takes away from the creativity of it, right? You want it to be spontaneous and, like you say, maybe have an idea bank or whatever to look through every now and again. I'm kind of similar to you in that sense: if something interests me and I'm curious, that's often enough to spark an idea. Yeah, that's awesome. So, it's the last question now, Mohamed,
so what's the one thing you want the listener
to take away from this chat today?
I would say: develop the mindset of treating security and privacy as first-class citizens when designing an application.
And when in doubt about the security guarantees, just reach out to the service provider that you're relying on, to make sure that you are getting the guarantees you need to keep your clients' data safe.
Fantastic. That's a great message. Let's end it there.
Thanks again so much, Mohamed, for coming on the show.
It's been a great chat.
If the listener wants to know more about Mohamed's work,
we'll put a link to everything in the show notes
so they can go and find those.
And again, if you do enjoy the show,
please consider supporting us through Buy Me A Coffee.
Like I said earlier, it really helps us to keep making the show.
And yeah, we'll see you all next time
for some more awesome computer science research.