Disseminate: The Computer Science Research Podcast - Mohamed Alzayat | Groundhog: Efficient Request Isolation in FaaS | #40

Episode Date: September 11, 2023

Summary: Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although efficient, this container reuse has security implications for functions that are invoked on behalf of differently privileged users or administrative domains: bugs in a function's implementation, third-party library, or the language runtime may leak private data from one invocation of the function to subsequent invocations of the same function.

In this episode, Mohamed Alzayat tells us about Groundhog, which isolates sequential invocations of a function by efficiently reverting to a clean state, free from any private data, after each invocation. Tune in to learn more about how Groundhog works and how it improves security in FaaS!

Links:
Mohamed's homepage
Groundhog EuroSys'23 paper
Groundhog codebase

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the Computer Science Research Podcast. I'm your host, Jack Waudby. A quick reminder that if you do enjoy the show, please do consider supporting us through Buy Me A Coffee. It really helps us to continue making the show. Today, I'm joined by Mohamed Alzayat, who will be telling us everything we need to know about his work on Groundhog: Efficient Request Isolation in FaaS. This was published recently at EuroSys. Mohamed is a final year PhD student at the Max Planck Institute for Software Systems, and he's recently joined Amazon. Welcome to the show, Mohamed. Thanks, Jack. Thanks for inviting me. It's my pleasure to be here. The pleasure is all ours. Let's jump straight in. So can you tell us a little bit more about
Starting point is 00:01:02 yourself and how you became interested in systems research? Sure. I'm Mohamed. I'm currently wrapping up my PhD under the supervision of Professors Peter Druschel and Deepak Garg at MPI-SWS and Saarland University. And I have recently joined the AWS Kernel and Operating Systems team in Dresden, Germany. The discussion today is not affiliated with Amazon and has nothing to do with Amazon, of course. So my interest in systems research developed over the years. In a sense, I always wanted to understand how computers, as in software and hardware, worked internally. And that was basically the driver for me to study computer science at the German University in Cairo. After that, and during my master's at Saarland University in
Starting point is 00:01:45 Germany, I attended and audited several systems courses such as distributed systems, database systems, and operating systems, among others. And I enjoyed both the theoretical and practical aspects of these courses. So I approached Professor Druschel for a master's thesis, and he introduced me to Professor Garg, and from there I continued to do a PhD with them. Fantastic. So let's jump into the topic today then. So can you start off and tell us a little bit more background about what is FaaS, right? What is Function as a Service? Sure. So FaaS, as you mentioned, is an acronym for Function as a Service. And it's an emerging high-level abstraction for event-driven cloud applications. So this abstraction allows tenants to state their application logic as stateless, event-triggered functions, typically written in high-level languages like Python or JavaScript, and then upload them to the FaaS provider and get an endpoint that can be used to invoke these functions on demand.
Starting point is 00:02:47 And FaaS also has an on-demand charging model. So the tenant only pays for the compute time and memory used during the execution of their functions. To help make the paradigm more clear, let me briefly describe a typical workflow of deploying and using a FaaS function. So the tenant or the developer writes one or more functions and sends the code to the cloud provider. The cloud provider sends back an endpoint that the tenant can use as part of their services. When an end client uses the tenant's service, the service would issue a request to that endpoint. And then the FaaS provider would forward that request to a
Starting point is 00:03:25 provisioned instance of the execution environment with the tenant's code loaded into it, or provision one if none is readily available. The FaaS provider would then let the tenant's function run and do the processing, return the result, and then this result would be forwarded by the FaaS provider to the end client. So this is a brief overview of what FaaS is and how it works. Awesome. That's really good. It's a succinct definition with some good examples of how it works. So cool. So why is security important in this context then?
Starting point is 00:04:00 And how is it typically achieved today in FaaS? So, of course, security is important for all systems, but in FaaS specifically there are several aspects. So in FaaS, different functions from different developers share the same underlying software and hardware resources that are made available by the cloud provider. And so this is one dimension, the fact that they all share the resources. And the other dimension is that a single function may serve multiple end clients. And the current FaaS approach to security focuses on the provider isolating function instances from one another in containers or lightweight VMs. So basically they are isolating the available resources
Starting point is 00:04:47 such that no function instance can have access to any of the data or resources of other instances. Okay, cool. So what's the problem with this approach? It sounds like a pretty clean approach of separating things out and keeping everything secure. So what are the problems with this? And then I guess this kind of lays the groundwork
Starting point is 00:05:06 for the motivation for Groundhog. Sure. So this approach is actually fine if all the end clients of a single FaaS instance are from the same trust domain, meaning it's either one client that uses one instance or a set of clients that share all their data together. So there are no different administrative or trust domains that use a single function. The problem is that this
Starting point is 00:05:33 is not necessarily the case. And in many cases, a single FaaS function instance is invoked or triggered on behalf of mutually distrusting clients. The problem here is that bugs in a function implementation, or one of the libraries it depends on, might retain confidential data from one request and leak it to a subsequent one. And what we want is to have strong isolation guarantees between different functions and across sequential invocations of the very same function on behalf of different clients. Awesome, yeah. So just going off on a slight tangent there, with the way things currently are at the moment, are there any examples of where this has been exploited, and where, like, concurrent, sorry,
Starting point is 00:06:21 sequential calls have been used to, I don't know, hack into systems or do any sort of mad, crazy things? So FaaS is still an emerging paradigm. However, these kinds of attacks have happened on conventional servers. And in many conventional servers like Apache, for example, there is always a configuration for doing request isolation. So Apache has the default prefork model, which can be configured to run each request in an instance and kill the instance afterwards. So the fact that there are no large-scale exploits that are publicized to the media doesn't rule out the possibility of such an
Starting point is 00:07:07 exploit. And in fact, it has been mentioned in the OWASP security report on the 10 most dangerous potential risks in FaaS as one of the potential risks. Basically, it's called the shared space problem, where multiple requests share the same space, whether it be memory or storage. So if an instance is running and it serves or handles multiple requests one after the other, then the memory is shared in a sense.
Starting point is 00:07:39 So luckily, there are no large-scale exploits yet, but we should have systems- That we know about. Yes, yes, yes. But we should have systems that have guarantees by design. Yes, completely agree. So kind of a simple way, I guess, to solve this problem in a very coarse way
Starting point is 00:08:01 would be to run every function activation in a fresh container. Why is this a bad idea? Yes, so, right, this is actually a very simple and sound way of enforcing sequential request isolation in FaaS. However, this is very expensive from a performance point of view. If we rely on this approach, we would have to deal with what is famously known as the cold start problem in FaaS. Basically, there are a few expensive steps that need to happen so that the provisioned execution environment is ready to handle a new request. So first, resources must be allocated and the new execution environment must be instantiated.
Starting point is 00:08:38 Then the language runtime must be initialized. After that, the static data structures of the function will have to be populated. And only then can the function get the first request inputs and process the request. Now, most FaaS functions are short-lived, which means that the relative overhead of preparing the execution environment compared to the actual execution of the function would be very high. And this is the reason cloud providers actually reuse existing execution environments to serve sequential requests. So what happens now is that once a function is triggered by one request,
Starting point is 00:09:14 it's kept alive for a few minutes so that if another request to the same function arrives, it can be handled without having to pay the cold start overhead. Right, yeah, that makes total sense from an efficiency point of view, I guess that's why they do that. So obviously there are problems, like we said earlier on, so this kind of sets us up perfectly for Groundhog. So tell us a little bit more about Groundhog, and then we can maybe kick things off with the design principles you had behind it
Starting point is 00:09:45 when you went about coming up with a more efficient solution to this problem. Groundhog, basically there are two main properties that we wanted to maintain while designing Groundhog. So we wanted to preserve the performance benefits of reusing function instances while at the same time enforcing request isolation. And the other important thing we had in mind is how we can design Groundhog such that it can be retrofitted into existing FaaS platforms. So we wanted Groundhog to be transparent, such that it can be plugged into a platform without any modifications required on the function side or on the platform side.
Starting point is 00:10:28 So these were the guiding design principles. The key idea of how Groundhog works is basically a simple observation. So once the function is provisioned, we are 100% sure it has no client data, right? So the function is ready to get the first request. It has no client data. So the idea is very simple. We take an in-memory snapshot, essentially saving the warmed-up execution environment state before any confidential data is processed, and then let the processing happen for the first request. And after the first request is finished, we can roll back to the snapshotted state off the critical path,
Starting point is 00:11:11 basically before the new request arrives, which effectively makes subsequent requests operate on a pristine execution environment that has no confidential data. So this is the high-level key idea. Okay, cool. So it's kind of a case of we get everything up and running before we've done anything with it,
Starting point is 00:11:30 so we've not had any user data or anything come in yet. We then take an in-memory snapshot, put that to the side, do the sort of processing on top that we want to do with this function call. Then once we've finished, we discard that sort of next version along and we roll back essentially to the in-memory state we had before, and then it's as if we're pristine and clean again and we can take the next call. Is that kind of how it works? Fantastic, without the need to re-provision it. Yeah. Awesome. So let's go into the details then, and how do you go about
Starting point is 00:12:01 achieving this then? So what's happening under the hood to make this possible? So Groundhog is implemented as a management process that can control the function execution environment, which is, in our case, a standard Linux process. Groundhog's managing process can interrupt the function's process, the execution environment that runs the function, create a snapshot of the function process's memory and CPU state in Groundhog's internal memory,
Starting point is 00:12:33 and instruct the operating system to track any memory modifications within that function's process. Groundhog then lets the function process receive the inputs, do the processing, and return the result. Once the results are returned to the end client, Groundhog then interrupts the function again to identify any changes that happened to the memory layout or CPU registers, and rolls back these changes by overwriting any changes with their original snapshotted version. So if we look at the design, we will find that from the function's point of view, Groundhog is the FaaS platform.
Starting point is 00:13:15 And from the platform's point of view, Groundhog is the function, because Groundhog interposes between the communication of both. And Groundhog relies on standard Linux facilities like ptrace, the proc file system, and standard soft-dirty bit tracking to be able to manage the lifecycle of the whole operation, basically control the process, identify and roll back any changes. This design allows Groundhog to be fully transparent to both the function and the platform, modulo the need, of course, for the platform to enable Groundhog. And basically, Groundhog is able to track only the modified memory pages and restore only these pages, resulting in efficient rollbacks.
Starting point is 00:13:54 And Groundhog does this restoration off the critical path, after the function is done with the processing of the request, which means that the restoration overhead does not significantly affect the end client latency, because most of the heavy lifting happens after the request is done. Nice, cool.
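
To make the mechanism just described a little more concrete, here is a rough, hypothetical C sketch of the standard Linux facilities mentioned (ptrace and soft-dirty bit tracking). It is only an illustration of the general approach, not Groundhog's actual code, and error handling is omitted.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stop the function process and capture its CPU registers (x86-64),
 * part of the state a snapshot has to save. */
static void interrupt_and_read_regs(pid_t pid, struct user_regs_struct *regs) {
    ptrace(PTRACE_ATTACH, pid, NULL, NULL);   /* interrupt the child */
    waitpid(pid, NULL, 0);                    /* wait until it is stopped */
    ptrace(PTRACE_GETREGS, pid, NULL, regs);  /* snapshot CPU registers */
}

/* Reset soft-dirty bits: afterwards, any page the process writes to will
 * have bit 55 set in its /proc/<pid>/pagemap entry. */
static void clear_soft_dirty(pid_t pid) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
    int fd = open(path, O_WRONLY);
    write(fd, "4", 1);                        /* "4" clears soft-dirty bits */
    close(fd);
}

/* Check whether the page containing 'addr' was written since the last
 * clear_soft_dirty() call. */
static int page_is_dirty(pid_t pid, uintptr_t addr) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
    int fd = open(path, O_RDONLY);

    uint64_t entry = 0;
    long page_size = sysconf(_SC_PAGESIZE);
    /* pagemap holds one 64-bit entry per virtual page; soft-dirty is bit 55 */
    pread(fd, &entry, sizeof(entry), (off_t)(addr / page_size) * sizeof(entry));
    close(fd);
    return (int)((entry >> 55) & 1);
}

/* A Groundhog-like rollback would walk the mappings listed in
 * /proc/<pid>/maps, call page_is_dirty() per page, copy the snapshotted
 * bytes back for dirty pages (e.g., through /proc/<pid>/mem), clear the
 * soft-dirty bits again, and then resume the process with PTRACE_CONT. */
```
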
Starting point is 00:14:14 So I'm just going to picture this in my head. It's almost like a middleware that sits between the two sorts of things, and from both sides it looks like the thing they're expecting. But yeah, so I wanted to switch onto the implementation slightly there. How hard was it to create that sort of abstraction between the two things? Was it a difficult process,
Starting point is 00:14:35 engineering-wise? Not difficult. So basically, Linux has this idea of tasks or processes that can be parents of other processes. And if a process is a parent, it has the ability to read into the memory of the child, and it also has the ability to interrupt it. And basically, we can use some tricks to inject system calls into the child to do the operations we want in the child's address space. So it's not particularly hard. That sounds good. Was it a long implementation effort
Starting point is 00:15:17 or was it not that time intensive? The opposite, it was quite time intensive. So we tried a few things in the hope that we could get better performance, and that was probably the thing that took more time. Of course, there were also some corner cases that required a lot of debugging, so that also takes time. Basically, while developing, you find something you expect to work, but then you see memory corruptions, and then you try to find out why this happens, and realize basically the memory layout is not...
Starting point is 00:15:52 There is some flag that you missed or something like that. But yeah, it was fun. That sounds good. Cool. Let's talk some numbers then, because you said that there's no sort of real overhead that gets introduced, like latency, from the client's perspective. So can we maybe touch a little bit on performance? But then I'm also really interested in finding out how you actually measure the performance of something in terms of the security of it, essentially.
Starting point is 00:16:17 Like, were you able to sort of empirically measure how secure it was? So let me answer the second part first. We didn't measure security, because basically the new design guarantees security by the merit of erasing any data, or basically programmatically ensuring that any data that was introduced was rolled back. So there is no need to measure security in this case, but we definitely intensively measured performance. Yeah, so basically, we evaluated Groundhog on a large set of micro and macro benchmarks. The micro benchmarks had the goal of validating our hypotheses on where the performance overheads are and how they are correlated with the total memory size and the write set, or the dirty set, of a function. And the macro benchmarks cover a wide variety of use cases,
Starting point is 00:17:16 including web applications, data and image processing, statistical computations, among others. And basically, these allowed us to capture the performance impact on applications that may use Groundhog as a building block for request isolation. Maybe I can go through the setup of the experiments. So the way we set up our experiments is by relying on an open source platform called Apache OpenWhisk. This is a popular open source FaaS platform, and we deployed OpenWhisk using a two-node deployment.
Starting point is 00:17:56 And the reason we did a two-node deployment was to performance-isolate the component we want to measure from everything else. So the component we want to measure is the invoker. In OpenWhisk, it's called the invoker, and this is the component that launches and directly manages the function instances. So we had that on one node
Starting point is 00:18:16 and all other OpenWhisk components on another node. And in our benchmarks, we compared Groundhog against an insecure baseline that serves one request after the other without any isolation. So basically, this is the standard way of reusing function instances. And we compared against a copy-on-write approach. So instead of tracking the modified pages and overwriting them after the function finishes, there is a simple way of using copy-on-write, which basically creates a copy of the page just before it's written. And this is done transparently by the operating system.
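
As an aside, one common way to get this kind of transparent, OS-provided copy-on-write is to fork the warmed-up process for each request; whether this matches the paper's exact copy-on-write baseline is an assumption, so treat the sketch below purely as an illustration of the idea.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative only: fork() gives the child a logically private copy of
 * the parent's memory, and the kernel physically copies a page only when
 * one side writes to it. The parent's pristine, warmed-up state therefore
 * survives whatever the request does in the child. */
void serve_one_request_with_cow(void (*handle_request)(void)) {
    pid_t pid = fork();
    if (pid == 0) {          /* child: runs the request on CoW pages */
        handle_request();
        _exit(0);            /* discard the dirtied copy */
    }
    waitpid(pid, NULL, 0);   /* parent keeps its unmodified memory */
}
```
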
Starting point is 00:18:57 And we also compared against a secure baseline, which starts a new function for each request, but we didn't plot the results for the secure baseline because its latency is just so much higher. So of course, all the numbers are in the paper, and it would have been easier if we could look at the graphs, but let me give a brief description of the high-level trends, perhaps. So for the microbenchmarks, we implemented two C functions: one that allocates a fixed size of memory and has requests that dirty a percentage of that size, and another that allocates a varying amount of memory but dirties a fixed number of pages.
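
To give a feel for what such a microbenchmark could look like, here is a rough, hypothetical sketch; the structure, names, and sizes are made up for illustration and are not the paper's actual benchmark code.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TOTAL_BYTES (256u * 1024 * 1024)   /* illustrative fixed allocation */

static char *heap;

/* Runs once, before the snapshot would be taken: allocate and touch the
 * whole region so it is part of the warmed-up state. */
void init(void) {
    heap = malloc(TOTAL_BYTES);
    memset(heap, 0, TOTAL_BYTES);
}

/* Runs per request: dirty a configurable percentage of the allocation by
 * writing one byte per page, so every touched page must be rolled back. */
void handle_request(int dirty_percent) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t dirty_bytes = (size_t)TOTAL_BYTES * (size_t)dirty_percent / 100;
    for (size_t off = 0; off < dirty_bytes; off += page)
        heap[off] = 1;
}
```
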
Starting point is 00:19:45 So basically, this allows us to see if Groundhog's overheads are correlated more with the dirtying or with the total memory size. And the high-level observation is that Groundhog's overheads on the critical path, basically while the function is processing the request and before it sends the response to the client, are correlated with the number of modified pages, because it keeps track of what pages have been modified. And the restoration overhead is correlated with both: the total memory size,
Starting point is 00:20:19 because we have to scan the total memory and identify changes in the memory layout, and the number of modified pages, because we have to roll back every modified mapping and then every modified page. So these are the high-level trends we have seen in the micro benchmarks. For the macro benchmarks, we evaluated Groundhog, as I mentioned, on a wide set of benchmarks. These benchmarks are the PyPerformance Python benchmark, the PolyBench C benchmark, and the FaaSProfiler Python and Node.js benchmarks.
Starting point is 00:21:10 In all of the benchmarks, Groundhog's end-to-end latency is on par with that of the insecure baseline, the one that serves requests one after the other. But the throughput was impacted by the rollback overhead. So basically, the throughput measurements here are a bit pessimistic, because the benchmark saturates the system, which is the worst-case scenario that should never happen in production. Overall, the majority of the PyPerformance Python and PolyBench C benchmarks saw little to no noticeable impact on the end-to-end request latency and throughput as well, except for the very short benchmarks and the benchmarks with very large write sets. So these benchmarks had a drop in throughput. For short benchmarks, think of a function that just gets the time and exits, so it takes less than one millisecond. So when we are speaking about short functions and a
Starting point is 00:21:55 drop in throughput, for a one millisecond function it means that after each function request, Groundhog would interrupt the function, scan its memory, identify any modified pages, roll them back, and then hand back control to the function to serve the next request. And if all of this happens in one millisecond, then this is a 50% drop in throughput. But starting this function from scratch would take at least 100 milliseconds. So it's still a huge improvement. I mean, if Groundhog does it in one millisecond, we have a 50% throughput drop, but the alternative is much worse.
Starting point is 00:22:33 But for relatively longer functions, we see a very minimal drop. In some rare cases, when the function has a very high number of dirty pages or a workload that modifies the memory layout heavily, the rollback that analyzes the changes and rolls them back (basically it unmaps all the newly added memory maps, resizes memory maps to their proper size, and restores all the modified pages) sometimes has excessive overhead, and here we also see a drop. But this is not the common case. This has been noticeable in
Starting point is 00:23:17 Node.js specifically, because we are running a vanilla, unmodified Node.js, which has aggressive memory allocation patterns with some garbage collection triggers that happen due to time. But yeah, overall, there is minimal impact to latency and throughput
Starting point is 00:23:40 for the average function, let's say. Yeah, it sounds great. So for the average sort of use case, it seems almost like a free lunch in those scenarios, right? It was funny when you said that, in production, the customer shouldn't be doing this or they shouldn't be doing that.
Starting point is 00:23:54 But I mean, they probably will be, right? I mean, people do some crazy things, but yeah, you shouldn't be rubbing up against the limits of your resources, right? But it's interesting. So, just thinking about crazy scenarios, is it possible to have a really short function in terms of time, but also one that actually dirties a lot of state and creates a large amount of
Starting point is 00:24:18 state? Because they're the two sort of extremes of when throughput can drop off, right? When you're doing something really short, it then has to do a big scan through and check everything, and relative to the size of the operation that's a lot of work. And the other end of the spectrum is when you're changing a lot of stuff, you're then going to roll all that back, right, which is again a lot of work. So is it quite a contrived scenario to have both of those two extremes be true at the same time? It's probably possible, but there is a limitation on how much you can do in a limited amount of time. But of course it's possible, and of course in some extreme cases maybe it's cheaper to start a function from scratch, just in some extreme cases.
Starting point is 00:25:08 But basically, the nice thing is that Groundhog can be used transparently, which means it can be used in an opt-in fashion if it's ever adopted by a cloud provider. So basically, as a client, you can go and say, I want to have Groundhog for this function because we need security here; there is another function that is invoked by only a single client, and basically there is no need for request isolation there; there is a third function that should be executed once and then get killed, for
Starting point is 00:25:37 example. Yeah, yeah, it's really a nice feature that allows you to be granular with respect to what the application requirements are, right? So yeah, that's a really, really cool feature of it. Are there any other scenarios, and we kind of touched on this a little already, when Groundhog is sort of suboptimal? What are the limitations that might stop it being adopted by a cloud provider? Maybe, I don't know.
Starting point is 00:26:00 Apart from the functions that have very high dirtying rates of memory, which correspond to longer rollbacks, where in some cases Groundhog might not be the optimal solution, at least the prototype implementation of Groundhog, there are also the limitations that come with snapshot-based techniques. Namely, it may capture per-function-instance ephemeral state, such as the time at which the function started, or a pseudo-random number generator that has already been seeded in the initialization phase. So if a pseudo-random number generator
Starting point is 00:26:42 has been seeded and then we take the snapshot after it has been seeded, this means that the next pseudo-random number would always be the same, because Groundhog rolls back the pseudo-random number generator state. Similarly, if a timestamp was taken at the beginning for some reason, then we will always see that same time even though real time keeps increasing, because we are not refreshing the timestamp. So these are sort of known limitations with snapshot-based techniques. These have workarounds, but we haven't implemented them as part of the prototype. Yeah, cool. And this next question now, we were joking about it a little bit before we started recording, about what's next on the research agenda for Groundhog, and maybe these things would have been the answer, but I know
Starting point is 00:27:35 you're... I'll let you tell the joke. Yes, so the next thing is to hopefully defend my thesis. But yeah, so these are important problems in the FaaS paradigm. And in fact, snapshot and restore techniques have been used, or are being applied, to solve the cold start problem, basically by taking a snapshot of the execution environment so that it can be started faster
Starting point is 00:28:03 than reconstructing the state. And these are the sort of problems that come with the techniques that rely on snapshot and restore. And solutions and workarounds are being developed as we chat right now. So this is something that can be a follow-up for Groundhog in addition to many optimizations and more reasoning about the security guarantees that one gets in FaaS. Just on another point.
Starting point is 00:28:39 Yeah, first defend. First defend. I just wanted to touch on this, because obviously a lot of papers and systems all have names, and I like to know where the name comes from. Why Groundhog? So it refers to a movie, which is Groundhog Day.
Starting point is 00:28:57 So basically, for the actor, every day basically rolls back. So Groundhog rolls back memory, and every day is repeated for the function instance, as in the movie. But yeah. Yeah, I like that.
Starting point is 00:29:14 That's cool. Awesome. Yeah, cool. So my next question is, what sort of impact do you think this work can have then? Can it inspire a cloud provider to go and pick this up? Or yeah, so what's the sort of impact do you think this work can have then? Can it inspire a cloud provider to go and pick this up? Or yeah, so what's the sort of scope for impact with Groundhog,
Starting point is 00:29:30 do you think? So I think there are two sides to that. There's the cloud provider side and there's the software developer side. And I would start with the software developer side. So the very first important thing is the mindset while developing these functions. So when working with sensitive client data, developers should keep in mind that the unit of isolation that they should consider is not the function or the company or the application. It's each client request and each data item.
Starting point is 00:30:11 Second, they have to keep in mind that isolation can be broken for several reasons. One of them is bugs, both in their own code and in the libraries they rely on, and potentially in the language runtime they rely on. So there's the question of best practices for enforcing this client-level isolation. One very conservative, highly granular way is to enforce isolation per request, as Groundhog does. Less granular methods involve identifying sets of clients and routing a group of clients together to an administrative domain that gets served by a set of functions, for example, these kinds of things.
Starting point is 00:31:01 Awesome. Cool. Yeah, it feels like it has got the possibility here to be really inspiring and impactful going forward. So yeah, cool. So when you were working on Groundhog, what was the most interesting thing that fell out of working on it, like what was the most interesting lesson that you learned, I guess? I would say to never optimize early on. So basically, never try to get the most optimal version ready. Rather, start simple, get your intuitions verified,
Starting point is 00:31:38 and then build the most stupid, naive implementation that gets the job done, and then iterate and optimize afterwards. Another thing that I learned is to fully automate experiments from day one.
Starting point is 00:31:56 So basically, start with the automation even before building the system. Have a plan for automating everything. Basically, have all experiments be able to run using a single enter on a script. Nice. So yeah, premature optimization is the root of all evil, but premature automation is not, right? That's what we're saying here, right? Automate as soon as you can. Right, yeah, yeah, cool. That's funny. Awesome. I mean, I'm kind of on the flip
Starting point is 00:32:27 side of that then, and maybe it felt like Groundhog Day every day you were working on it, but what were the things along the way that you tried doing that kind of failed? What were the war stories? So one prominent thing that we tried and failed at was relying on a newly available kernel feature for tracking dirty pages, instead of the one we are currently using. So currently we are using something called the soft-dirty bits, which
Starting point is 00:32:55 basically the operating system protects all the memory of the process, and whenever a write to a memory page happens, there is a page fault. And then the kernel basically does
Starting point is 00:33:11 the bookkeeping and sets to one a bit that corresponds to that page, so that afterwards, one can scan the pages and identify which pages were modified. The alternative, or the new feature, was the userfaultfd (user fault file descriptor) approach, which allows the user space
Starting point is 00:33:31 to get a notification for every modified page. So we tried working on that and had a full prototype that uses UFFDs, userfaultfd file descriptors. But then the overhead of context switching for each notification was so high that it was cheaper to scan all pages and try to figure out which pages were modified. So basically, the advantage of the UFFDs approach is that you don't need to go through all the memory pages
Starting point is 00:34:05 and see which pages have been modified. Instead, you just get a notification saying, okay, page X got modified. So you know right away to roll that one back after the request. But lesson learned, it turned out to be more expensive performance-wise, at least with the current implementation of UFFDs.
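
For a feel of the alternative being described, here is a rough, hypothetical sketch of arming write-protection on a region with userfaultfd on a recent Linux kernel. The constants and struct fields are taken from linux/userfaultfd.h as best understood; the fault-handling thread and error handling are omitted, and this is not the actual Groundhog prototype.

```c
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative only: arm write-protection on a region so that the first
 * write to each page produces a userspace notification (the per-fault
 * context switch that turned out to be the expensive part). */
int main(void) {
    size_t len = 16 * 4096;
    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg;
    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long)region;
    reg.range.len   = len;
    reg.mode        = UFFDIO_REGISTER_MODE_WP;
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp;
    memset(&wp, 0, sizeof(wp));
    wp.range.start = (unsigned long)region;
    wp.range.len   = len;
    wp.mode        = UFFDIO_WRITEPROTECT_MODE_WP;   /* arm protection */
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    /* A separate handler thread would now read struct uffd_msg events
     * from uffd: each write-protect fault names the dirtied page, and the
     * handler un-protects it (mode = 0) so the write can proceed. */
    return 0;
}
```
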
Starting point is 00:34:28 Interesting. Yeah, I guess there's some scope in the future for that kind of relationship to change. But how far down the road did you get with this sort of approach before you realized, damn, this is actually not the right thing to do? Almost after having the full... Oh, wow.
Starting point is 00:34:43 We had an initial prototype with the soft dirty bits, and it was almost complete. And then we realized that a new kernel version was released, a stock kernel version. So we decided to rely on stock kernels, basically to make adopting Groundhog easier because no one wants to rely on kernel patches and maintain them.
Starting point is 00:35:07 So we realized that the new kernel version was released with UFFD page-write tracking support available. So we thought, okay, this will cut our overhead of scanning the pages. Let's do it. But yeah, it turned out to be more costly. At least the current implementation turned out to be more costly.
Starting point is 00:35:32 Yeah, but like you said, a lesson learned, I guess. Well, another lesson learned is not to fully trust APIs if you see smoke sometimes. So even if it's the kernel. So we noticed that one of the very old kernel features, which is the soft-dirty bits, had a bug. So basically, we were doing all the tracking.
Starting point is 00:36:00 We were rolling back all the modified pages and we were still hitting memory corruptions. And it turned out, after doing a binary comparison, that some pages had been modified. So basically, we do the rollback and then compare the original memory with the rolled-back memory, and we see that some pages have been modified. Then debugging and trying to figure out, okay, is that page actually marked as dirtied? And it turned out no. So basically it turned out to be a bug in the kernel. Did it get patched, or did you have to work around it? No. So basically I asked on the kernel mailing lists and it turned
Starting point is 00:36:46 out to be a bug, so I helped a bit with finding the commit by bisecting. Basically, we had a reproducer, and by bisecting the kernel we found the version that it was introduced in, and then one of the guys who was working on that
Starting point is 00:37:02 subsystem actively added a patch, so it's now patched. How long had it existed there for? I think it existed for maybe six months or so. I don't recall the exact date. It's not too long. We are talking like, well, six months is still quite a long time, right? But I mean, it's not like 20 years, right?
Starting point is 00:37:22 But yeah. Cool. Yeah, awesome. I guess we're almost towards the end of the podcast now, but can you maybe tell the listeners about your other research as well, sort of things you've been working on across your PhD? Obviously Groundhog isn't the only thing you've worked on, so yeah, can you give us a flavor of some of the other things that you've done? So there are several directions of research that I worked on, some on analyzing the impact of network delays
Starting point is 00:37:48 on Bitcoin, for example. But most of my research has been on how we can design, or rather redesign, cloud systems such that they provide additional privacy guarantees by design. An earlier example would be Pacer, which was led by Aastha Mehta. And I worked on that one with Aastha, Roberta De Viti, Peter Druschel, Deepak Garg, and Björn Brandenburg, all from MPI-SWS. So Pacer was tackling the problem of network I/O side channels. And the idea is that basically any shared resource can be used to launch side channel attacks and snoop on whoever is sharing that resource with the attacker. And network is no exception. So in the cloud, the network card is shared between the tenants of the same host, which means that an attacker can infer the traffic shape of a co-tenant,
Starting point is 00:38:49 as we have demonstrated in the paper. And the problem here is that the traffic shape can be used to infer the content of the packets, even if the packets are encrypted, if the data being served is from a public corpus, like think YouTube, Wikipedia, and so on. So Pacer basically redesigned the way networking happened in the cloud such that the traffic was forced to follow a predetermined shape. And it's an interesting read. It proved to be more challenging than we initially anticipated,
Starting point is 00:39:28 but we learned a lot through the process. So perhaps if someone is interested, they can Google "Pacer: Comprehensive Network Side-Channel Mitigation in the Cloud" by Aastha Mehta. Yeah, we'll stick that in the show notes. We'll link all the relevant materials so they can go and find out more if they are interested.
Starting point is 00:39:47 So yeah, kind of going off that and sort of like the other way. So I'd like to know more about your creative process. So like how do you go about actually kind of generating these ideas? Because you've worked on quite a few different things, right? And how do you then select which thing to work on? That's a tough question, actually. I wouldn't claim that I have an established creative process or an idea generation approach, at least yet. Rather, I just get curious about an area, try to understand it and see what or where things can break, if there are any gaps or potential improvements, and start from there.
Starting point is 00:40:27 Also, I sometimes have this idea bank, which is all the things I hear about and find interesting. But more often than not, I never go through them. But yeah, so I wouldn't say I have a principled or proper approach for that, but basically just chat with people, get curious, and learn about something new. No, that's awesome. Yeah, I think sometimes with something like a creative process, you almost don't want to formalize it and have it standardized, because that often takes away from the creativity of it, right? You want it to be sort of spontaneous, and like you say, maybe have an idea bank or whatever and look through it every now and again. But often, I don't know, I'm kind of similar to you in that sense, in that if it interests
Starting point is 00:41:17 me and I'm curious, that's often enough to spark an idea or something. Yeah, that's awesome. So it's the last question now, Mohamed: what's the one thing you want the listener to take away from this chat today? I would say to develop the mindset of treating security and privacy as first-class citizens when designing an application.
Starting point is 00:41:45 And when in doubt about the security guarantees, just reach out to the service provider that you're relying on to make sure that you are getting the guarantees you need to make sure that your client's data is safe. Fantastic. That's a great message. Let's end it there. Thanks again so much, Mohamed, for coming on the show Fantastic. That's a great message. Let's end it there. Thanks again so much, Mohamed, for coming on the show.
Starting point is 00:42:07 It's been a great chat. If the listener wants to know more about Mohamed's work, we'll put a link to everything in the show notes so they can go and find those. And again, if you do enjoy the show, please consider supporting us through Buy Me A Coffee. Like I said earlier, it really helps us to keep making the show. And yeah, we'll see you all next time
Starting point is 00:42:25 for some more awesome computer science research.
