Disseminate: The Computer Science Research Podcast - Bogdan Stoica | WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection | #38

Episode Date: August 14, 2023

Concurrency bugs are difficult to detect, reproduce, and diagnose, as they manifest under rare timing conditions. Recently, active delay injection has proven efficient for exposing one such type of bug, thread-safety violations, with low overhead, high coverage, and minimal code analysis. However, how to efficiently apply active delay injection to broader classes of concurrency bugs is still an open question. In this episode, Bogdan Stoica tells us how he answered this question by focusing on MemOrder bugs, a type of concurrency bug caused by incorrect timing between a memory access to a particular object and the object's initialization or deallocation. Tune in to learn about Waffle, a delay injection tool that tailors key design points to better match the nature of MemOrder bugs.

Links: EuroSys'23 Paper, Bogdan's Homepage, Waffle's GitHub Repo

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby. A quick reminder that if you do enjoy the show, please do consider supporting us through Buy Me a Coffee. It really helps us keep making the show. Today, it gives me great pleasure to say I'm going to be joined by Bogdan Stoica, who will be telling us everything we need to know about Waffle: exposing memory ordering bugs efficiently with active delay injection. Bogdan is a PhD student at UChicago. Bogdan, welcome to the show. Thank you. Hi, Jack. Thank you for having me. It's a pleasure. The pleasure is all ours. Let's jump straight in.
Starting point is 00:00:54 So can you tell us more about yourself and maybe how you became interested in systems research? Yeah. So before joining the PhD, I actually had a grownup job as a software engineer, writing software for a company. Imagine that. And I wrote a lot of bugs. I started by fixing my own bugs first in a testing environment. And then I realized that there are bugs in production that users will report. And these were so much more difficult to tackle.
Starting point is 00:01:36 And that made me thinking about, well, wait a minute. How am I writing the code? Should I keep in mind the user? Should I keep in mind resource you know, the user? Should I keep in mind resource constraints? And yeah, that's kind of all started. And I've switched a little bit tracks. I was doing my bachelor's at that time in theoretical computer science.
Starting point is 00:02:02 And I wanted to do cryptography, but I wasn't that good at number theory. So that ship has sailed. And yeah, so I was interacting a lot with users and with the bugs that I was mostly responsible for. And that got me thinking about systems design, about testing, about building tools to help developers sort of reason about their code and how it interacts with the systems that it runs on. It's funny, before I started my PhD, I was kind of at one point interested in cryptography as well. I'd read some book by Simon Singh, I think it was. I can't remember the title of the book.
Starting point is 00:02:41 And it was like the history of cryptography and loads of cool stories around it. And I was like, was like oh this is really cool but then i kind of had the realization that yeah i'm maybe not good enough at number three for this so yeah yeah yeah i mean i i was i was reading a lot of books about that and then i i went to my first class and uh you know there's just uh linear algebra for the first few classes and then the professor said, well, okay, so now that we've gone through the introduction, here's a crash course in number theory. And I was like, wait, what?
Starting point is 00:03:13 So then I was like, yeah, I'm just going across the street to the systems OS class. Right, so we're going to be talking about bugs a lot today. I mean mean i'm also guilty of writing many books as well in various pieces of code and then having to figure out what the hell i've written and where they all come from but we're going to be focusing kind of today more on concurrency books
Starting point is 00:03:35 right so maybe you can tell us more give some background on concurrency books concurrency books right and why are they so damn hard to to detect Right. You know, I've been asking myself the same question, and I probably keep asking myself why they're so hard to detect until the end of my career. Yeah. So just like a high-level view, concurrency bugs, let me see, typically occur when two threads access or try to access shared resources. And this happens without being properly synchronized, the threads, that is, or coordinated. And this might be, you know, by design, the developer forgot to add synchronization. I'm certainly guilty of making this mistake quite often. Or it might just be an oversight of the overall design, or developers
Starting point is 00:04:27 sometimes make assumptions for the code that they don't see. I remember a lot of times I was using, early in my career, I was using these standard libraries and never thinking about, you know, how a particular algorithm is or how a particular API is implemented. To me, it was just like a black box. I was just hash.add and magically the key was added to the hash. I didn't even think about, well, there's thousands, if not tens of thousands of instructions behind that add. So it's all these assumptions that you make and the lack of synchronization or coordination
Starting point is 00:05:04 that create concurrency bugs. Now, why they are hard to detect? Well, timing. If I were to summarize this whole research up to now into bug finding in terms of concurrency bugs, it's finding the right timing. These bugs happen unexpectedly because of unexpected interactions or interleavings of different components of the program. And these unexpected interactions are rare and they need to happen at the right or wrong time to trigger a bug. And you can run a program hundreds of thousands of times and never hit that interleaving.
Starting point is 00:05:49 You can run it in a test environment. You can deploy it in production. And only in rare cases, you will find that bug. But when that bug occurs, it might be catastrophic. It's incredibly difficult to detect, again, because you have to hit that right timing and the right sequence of operations between threads. That's what I mean by term leaving. And yeah, has been the focus of many researchers for the last three, four, five decades.
Starting point is 00:06:19 I don't know, 1965, I think, is the first i i remember in one of the classes the first time they mentioned concurrency bugs as a they coined concurrency bugs as a term cool i mean it's very hard it's good on a tangent quickly i remember just when i was working on some project for my phd like some concurrency control protocol i had this this nasty concurrency bug in there and it was so difficult to observe because as soon as i tried to inject any sort of i don't know profile in there or any sort of debugging it slowed it down enough such that the the bug didn't care anymore so like it needed a really sort of tight race tight timing to kind of to actually to happen it's so hard to observe but anyway that was a yeah like a month trying to debug it and it was oh it drove me insane. It drove me insane.
Starting point is 00:07:06 But yeah, anyway, it's cool. So yeah, so we've touched on concurrency comp books there. But so maybe you can tell us a little bit more about the various different types of concurrency books they actually are. Right. How much time do you have? There's so many patterns.
Starting point is 00:07:22 But I would just mention a few that I think they're not worthy. I would say if you want to tackle concurrency bugs and doing research in this space, you should definitely take a look at deadlock and non-deadlock bugs. And by deadlock, I mean that you have two threads and say thread A holds a resource that thread B needs, and thread B holds a resource that thread A needs. So they're kind of blocking each other, right? And that's, I don't remember concrete numbers on top of my head, but that's a good chunk of the concurrency bugs out there. And then there are non-Deadlock bugs. And these are some of the bugs that the tool, our paper is tackling.
Starting point is 00:08:14 The paper we're talking about today is tackling. And those also kind of get divided into patterns, typically, atomicity violations and order violations. And it's all about expectations for these bugs. So an atomicity violation happens when you have a block of code that should execute atomically, but you forgot to add proper synchronization for it. So another thread could be preempted and could execute right smack in the middle of that block, assuming that the block is executed by a different thread, of course.
Starting point is 00:08:52 And order violations are this type of concurrency bugs where the order of the operation is not the one that is intended. And I think maybe, you know, as a concrete example, think about use after free bugs. You know, you expect to, once you've freed an object, you don't expect it to use it down the line. But if that expectation is breached,
Starting point is 00:09:16 that contract is breached, you will end up with an order violation. And order violations are a focus of our paper. But there are so many other patterns in between and so many other great papers to read that tackle these bugs and these patterns and other patterns. So it's quite a broad field. And yeah, for people that are starting research right now and they want to dip their feet into this pool, yeah, there's plenty of work, plenty of bugs. And I'm pretty sure we won't run off of concurrency bugs. Yeah, I'm sure, especially when I'm writing code.
Starting point is 00:10:01 I mean, I'm going to keep a supply of bugs there for a long time. But anyway, cool. That's why I've quit my software engineering job. I was afraid to contribute to the problem. And I was just dedicating myself to try to atone for my sins and fix as many bugs as I can. Nice. So yeah, like you said there, we're going to be focusing on auto-violation bugs and that class of concurrency bugs.
Starting point is 00:10:30 And a technique that can help us here is active delay injection. So can you tell us kind of what it is and like how it works? Right. So active delay injection is, I guess I would say it's deceptively simple to describe and understand. Essentially, it means it's a technique that allows you to pause threads at strategic times and places in the program
Starting point is 00:10:57 with the end goal of forcing possible buggy interleavings and exposing concurrency bugs in particular. And the nice thing about it is it will report a bug only when it occurs as a consequence of the delay. So obviously there's some code instrumentation involved, but once you have these probes in place and you delay the threads at particular times and locations and the program behaves in a way that it shouldn't, you know, the technique allows you to get that snapshot of where you've paused the threads and
Starting point is 00:11:39 for how long, and it gives you a blueprint of how to reproduce the bug. Now, having said that, it's incredibly hard to figure out the times and places you need to inject delays. I've deliberately used the word strategic, which is very ambiguous because it basically depends on the bug. And we'll see later in our discussion what I mean by that. But yeah, at the high level, this is it. Once you figure out the places where you want to delay the threads and for how long, you just wait for the bugs to happen. Magic.
Starting point is 00:12:19 Magic. Cool. So kind of in this sort of space with this approach and kind of what obviously other than kind of waffle now what was the sort of the state of the art kind of in this area and i think it's i think given it's tsvd i think that's the was the sort of always one of the state of the approach approach so can you tell us a little bit more about that and kind of how that goes about answering this sort of strategic question of when to inject things and for how long. Right.
Starting point is 00:12:50 So there are several activity injection tools that we looked at, and they all have their trade-offs and whatnot. I'll just mention TSVD because you mentioned it, and because TSVD served as inspiration for our work. And TSVD is a tool that tackles a slightly different type of bug. We are looking at order violations. TSVD looks at the subclass of atomicity violations. And it's a small distinction, or at least that's what I thought in the beginning. And the way it works is it simply identifies a set just ignore that or think that, you know,
Starting point is 00:13:46 HashMap.add is an atomic operation. So you can just execute it concurrently with another HashMap operation without, you know, without any unwanted consequences and side effects. And it turns out, obviously, there are side effects and unwanted consequences because these operations are not atomic. And it's found a surprisingly large number of bugs just from this tiny assumption. And, you know, I looked at it and said, well, okay, in terms of location, all the APIs that you can find, you know, from all the standard
Starting point is 00:14:26 library entry points in your code, like map, hash map, array, array list, if you're in Java, whatnot, right? So that doesn't sound too bad. That's doable. I can tap that. And then in terms of timing, it was just, you know, injecting a fixed delay that, you know, was configurable. But the authors figured out that, you know, a particular number, I don't remember, it was 10 milliseconds or something, was good enough to find a lot of bugs, thousands of bugs, to be precise. So, you know, the location constraints were kind of loose. The timing constraints were kind of loose. And I was like, well, you know, this is great. And not only it's easy for me to understand, I was just, this was actually my first PhD project. But it's also easy to replicate, or so I thought. So boy, I was wrong.
Starting point is 00:15:31 It took two, two and a half years to get from those ideas to adapt and translate and adapt that simplicity into another tool that finds another type of bug that relies on the same, you know, active delay injection philosophy, I would say. Yeah. But yeah, so, you know, I'll tell you more as we go along, but suffice to say that, you know, the state of the art wasn't tackling the exact bugs that we go along. But suffice to say that, you know, the state of the art wasn't tackling the exact bugs that we were targeting. And that's what made this tool. And it was great at
Starting point is 00:16:13 finding a slightly different type of bug and finding them in abundance. And we were like, great, let's borrow some ideas and, you know, apply them to ordering violations and see what happens. And yeah, that was two, two and a half years ago. Great. Fair enough. Well, let's dig into it. And so why was it bad for detecting? These are memo order books, right?
Starting point is 00:16:38 You were trying to apply this technique. So why was it so bad? Why didn't it work as well? Right. Right. So it turns out that the two types of bugs have two different properties that make all the difference. So threat safety violations happen when you execute two methods, two APIs concurrently, and those were not meant to execute concurrently because they're not atomic and they're not synchronized. And if you just focus on APIs from standard libraries, they're few and far between. And as I mentioned in the beginning, you don't typically think when you use these standard libraries,
Starting point is 00:17:34 what's the implementation behind it. Or at least I didn't think that much. You just do hash map dot add, then to me that looks like a single, you know, it's easy to make the mistake, think that looks like a single instruction, but there are tens of thousands of instructions behind that one function call. And in order to catch a thread safety violation, you just want to find one of the many interweavings where two of these API, the instructions behind these two APIs execute concurrently. So it's a looser timing constraint.
Starting point is 00:18:13 Now, if you look at order violations, and let's take an example of a use after free bug, what you want to do is execute a particular access to that object after it has been freed. Not during, not close to, it has to be after. And if it's not after, you don't find a bug. So that's one difference. And the second difference is that whereas for thread safety violations, you know, developers
Starting point is 00:18:44 don't overuse these standard libraries. It's mostly standard data structures. They don't, you know, you don't see a hash map dot add every five lines of code or four lines of code, right? But order violations, especially when they relate to memory, they can happen between any, they could potentially happen between any two memory accesses if one of them is either an allocation or a free operation. So the location space is far, far greater for memorder bugs, at least, well, in our experiments, it was 10 to 100x more. So you have to, all of a sudden, you have to deal with an order to magnitude more of locations where you have to inject delay. And then the timing between these operations is more strict.
Starting point is 00:19:41 You don't have like a window for an instruction window for a hash.add and instruction window for a hash.find that you just have to interleave in any way you please. These are two, they're not two individual instructions, but a small, small, much smaller set of instruction, like a handful of instruction that have to execute after another handful of instruction. If you don't get that right, you don't see the bug. So DSVD was not calibrated for this type of strictness. It was calibrated for looser constraints. And it was able to get away with it because, you know,
Starting point is 00:20:21 there are just a lot of threat safety violations out there in the world. And yeah, and it's doing a pretty good job at finding them. Just to summarize this, it's like on one half of things, like actually trying to identify the candidate locations, the state space for the memo order books is a lot bigger, right? It's like it's harder to find where to actually put these things. But then on the other side, for when you're actually trying to work out like how long to inject the the delay for i think that the timing has to be a lot more strict so you've kind of got two opposite if i've
Starting point is 00:20:53 understood that correctly yeah that's that's pretty much the gist of it you you have far more injection sites that you have to now consider and deal with, both in terms of when, you know, where and when and in terms of overhead, because obviously active delay injection does not come for free. It does introduce some overhead. So that's one aspect. And then the second one is, well, for how long are you going to inject that delay? And it turned out that just trying to have one value to rule them all doesn't actually work. And you have to be much more strategic about it. Because that timing constraint for ordering bugs, it's so much stricter than at least threat safety violations
Starting point is 00:21:46 and many other concurrency bugs, I would say. Cool. So let's talk about how we get more strategic and tell us more about WAFL and how it works. And is it an acronym? So like, what does it stand for if it is an acronym? No, it's not an acronym. It's, it's actually a funny story. It actually has a funny story behind it. I, earlier in my, my PhD, I, I decided to have, you know, if, if my, my PhD work won't probably not be spectacular or, or either read by, by many people. But so, so having that in mind, I said, well, that has to have a quirk. So I've decided to name all my tools with, you know, starting with the same letter.
Starting point is 00:22:32 And earlier in my career, I've worked on another tool that started with a W. So then I had to scramble and find a list of names. And I do have a list of names. My next project will also be a tool that starts with W. So there's not an acronym or any other logic behind it that just starts with a particular letter. Okay, cool. Well, my last name's Wardby, right? So that begins with a W. So maybe I can get in your list of possible candidate names for projects. I don't know.
Starting point is 00:22:59 Yeah, I mean, after my current project, I'll definitely consider, I'll add that to my list. Cool, yeah, congratulations. So let's tell us how Waffle works. And so like, how did you go about solving these problems that you found with TSVD? Right. So what we did initially was to, like I said, TSVD is a great tool that we wanted to replicate its end result, finding many, many bugs with minimal adaptation. And I spent many months trying to do that and figuring out what to tweak and what not to tweak. But it turns out it's not just about tweaking,
Starting point is 00:23:40 it's about the underlying assumptions that TSVD makes and the way active delay injection works. So at the end of these several months, unfruitful several months, we took a step back and looked at what's the how at a conceptual level, decompose active delay injection into or deconstruct active delay injection into four design points where you can have various trade-offs. And once we started looking at active delay injection in that way, we started seeing the trade-offs that TSVD was making, but also other tools that we're making. So we knew that this was the right characterization of this technique. And once we had these design points, the next step was to figure out how do we adapt them to match the properties of Memorder bugs. So it was actually a two-pronged process, identifying these key knobs and then trying to adapt it to the properties of the bug that we're targeting. And in a sense, our paper became so much more about this journey rather than the end result and, you know,
Starting point is 00:25:01 finding a few dozen bugs of a particular particular type nice so like how did you go about like implementing waffling so like this is this journey you went on so i guess you go through multiple iterations of it and maybe you could kind of tell us more about that journey a little bit and and how it was how did you implement it i guess as well like it's kind of fold that into there as well yeah so once we identify these uh uh these four four design points and uh you know i'll just maybe it's worthwhile to just go quickly through them and uh i know the audience can read the details in the in the paper but it it essentially breaks down into it it breaks down into, it breaks down to how much, let's say, how much overhead you're willing to spend in your analysis and how fast you want to get the bugs. So if you think of these two as
Starting point is 00:25:56 two, you know, two constraints on as two dimensions, you can easily put all the tools that do delay injection on this axis. And there are a few design points that you can tweak here. You could do a lot of synchronization analysis in the beginning to narrow down the set of locations where you have to inject delay and essentially pruning out any pair of instructions that are synchronized. Because if you try to inject delay to force a different ordering for pairs of instructions that are synchronized, you will not be able to do that. So one design decision was how much synchronization analysis are you willing to pay? And there are various trade-offs. You can read them in the paper. And then the second one, the second design choice here is once you do this synchronization, once you find a few locations, you can already start injecting delays. Or you can just do your synchronization analysis, spit out a set of locations where you want to inject, and then do the injection.
Starting point is 00:27:23 And this affects the number of times you have to run a program to find the bugs. Obviously, if you start injecting at the immediate next opportunity, at least in theory, you could find bugs faster. But this creates all sorts of other problems because desynchronization analysis, they're sensitive of anything that happens, anything that happens in the execution, including the fact that you're slightly perturbing the execution. And that can confuse the analysis itself or make it not work. So, you know, that's another trade-off. And by the way, all these trade-offs, they were actually more or less different implementations of WAFL.
Starting point is 00:28:04 So you actually went through the process of not only identifying these design points, but also producing a version of WAFL that takes a left and then takes a right and then comparing the results. Right. And once you've nailed that down, how much synchronization analysis you want to do and whether or not you want to start injecting early or you want to wait for that analysis to finish. Some of these analysis take a lot of time. Then it becomes a question of how long are you going to delay a thread at that particular location? And yeah, that's a huge space. I won't get into it. We just scratched
Starting point is 00:28:41 the surface there and we're lucky to find a strategy that worked for MemOrder bugs. Might not work for full disclosure, might not work for other types of concurrency bugs. Yeah, and finally, you know, once you start injecting delays in a system, you're perturbing the execution. So as you mentioned at the beginning, you're trying to debug this concurrency bug that you had in your program and you put it through a debugger and that started to slow down some of the threads, which made the problem go away. Well, this is exactly what's happening when you use active delay injection. You could potentially slow down some threads and make some bugs go away. So you have to be strategic about when you're going to inject the delay
Starting point is 00:29:32 and when you're going to say, well, no, I'm not going to inject delay here. I mean, I'm going to wait a little bit and let the other delays play out and see what happens. And anyway, you can read about all of these in the paper. They're great details. And we explain how TSVD approaches
Starting point is 00:29:51 or skips over some of these decisions because they don't make sense for threat safety violations, but they make so much sense for memory bugs. And you asked me about implementation. Basically, this is it. We took all these major, these key design points and implemented the version of our tool that makes one trade-off or the other. And then we evaluated how well it does, not in terms of how many bugs it finds, but rather in terms of why it doesn't find a particular bug. And this is one of the most interesting lessons that I've learned in this
Starting point is 00:30:32 project is sometimes you would think about the metric of success being the number of bugs your strategy finds. But in this case, the metric of success were, well, wait a minute, I know there's a bug here somewhere. I vaguely know the ordering and interleaving. Why is my tool not finding it? What's the root cause? And it's ironic. In a sense, we were debugging our own debugging tool.
Starting point is 00:31:01 How about that? I don't know if that answers know answers uh answers your question or you're talking more about you know the coding details but i'm happy to get into that as well that answers the question really well on the kind of the implementation on the coding side of things i guess like what was the reason for having to do a separate complete implementation for each one rather than having kind of uh one tool that is sort of parameterized basically in different directions, like, okay, having set this to this run,
Starting point is 00:31:30 basically, or why did it have to be separate implementations? Well, because some of these decisions are not actually parameters or values. They're entire different algorithms. The only parameter in all these design points is the delay, the value of the delay or the
Starting point is 00:31:52 amount of the delay. The rest of the things are entire algorithms. For example, we looked at implementing a vector clock-based synchronization analysis. But that's not the only synchronization analysis that you can implement. It's just the one that I was more familiar with. And I could code quickly. But that was a totally separate algorithm.
Starting point is 00:32:19 Or when you've done early analysis, you identified all the locations, you were lucky enough to identify how much delay to inject at each location, or even in which context, then the question becomes, how do you reduce the noise? Because now you've introduced noise in your system and you might, in your execution, you might miss some of the bugs. This injection, this injection i might not have mentioned this but this injection happens at runtime i guess it was obvious from the description but just making sure i i think it was obvious because i'm so invested in this but probably not it probably wasn't obvious right so so this injection happens while the program runs. And any perturbance
Starting point is 00:33:05 is much more likely to hide bugs than expose bugs, expose concurrency bugs. And there are many strategies to deal with this interference. And we implemented a few. Some of them were as easy as flipping coins.
Starting point is 00:33:23 So doing some probabilistic injection. Others were a little bit more, we had to be a little bit more deliberate and strategic. And where we ended up for this particular design point is a combination of three different strategies tied together with a rope and it worked. But yeah, so that's why we had to build different implementations. And we had to make sure that that's the right choice for memorder box, because we're going to write that in a it. But I also wanted to understand why it doesn't work. I think halfway through this project, I had this idea that we can actually have a taxonomy for all the active delay injection tools. and explain away the trade-offs in a systematic way, not just, oh, you know, these are the parameters that work. I don't know why they work, but they work. And find a few dozen bugs, I'm going to write a paper. And in a way, we kind of hope that we gave a recipe
Starting point is 00:34:37 to navigate this active delay injection space. So that's why we wanted to be deliberate and really understand some of these trade-offs. And yeah, for that reason, we needed different tools to run on different sets of benchmarks and data sets and compare results and try to make inferences.
Starting point is 00:35:01 Nice. On the results, you mentioned an interesting point earlier on where it's like you weren't interested in the kind of the the absolute number of bugs you found but more in the case of i know there's a bug there did this tool find it and kind of the specificity of the so i think that's quite specific like the false was it trying to minimize false negatives right and i get a mixed rate which which way around it is but yeah yes you tell tells more about your experiments then and sort of what those look like and what your findings were so when you know you read all this analysis and and and you look at all the the the design has
Starting point is 00:35:36 been as it been put in and you end up in the evaluation section and we we 18 bugs. And, you know, depending on where you stand, you could say, well, 18 bugs, it's great because these concurrency bugs are very difficult to find. And some of the, I would say most of these concurrency bugs that our tool exposes have been in those repositories for months, if not years. Or you can say, well, wait a minute, all this analysis for 18 bugs, it's kind of underwhelming. I don't know where I stand. As the main driver of this project, I'm kind of in between. But, you know, having said that, as you mentioned, I think the more interesting result of our paper is not, or the takeaway message is not the number of bugs per se, but the fact that it finds them so deliberately. And you can explain away why it finds a particular bug. You pick one
Starting point is 00:36:46 from our data set and you can actually explain why it finds that bug in terms of these design decisions and tweaks that we made. But you asked about the experimental setup. We selected 11 C-sharp open source projects that have been widely used and have many stars on GitHub. And we deliberately looked at open source software because we also wanted to understand ourselves why a bug is occurring to see if more or less to understand why our tool is not finding it. And for that, we need access to code. And then there were some concurrency bugs that were already patched after many, many months in these repositories. Some of them had small unit tests that were attempting to reproduce the issue and make sure there's no regressions there. But obviously, since this is a currency bug and you have to hit the timing right, you had to run the test for hundreds of times. We couldn't essentially reproduce anything.
Starting point is 00:37:59 But we ran our tool on these known bugs. And within a short span of time and just executing the program twice, we were able to find all of them. And we were able to find all of them with tests that were in the data set at the time the bug was introduced. So tests that weren't necessarily designed to find the bug, that particular bug, were totally oblivious. They were designed to test something else. We were able to use active delay injection and the test suite to our advantage. And our tool was finding the right timings so that test would fail and expose that particular bug.
Starting point is 00:38:50 And I found that really fascinating because essentially what it all points out to is, well, if only those developers would have had our tool when they were doing in-house testing, they would have caught these bugs. And we also found six bugs that were new. They were unreported. And the two offending operations
Starting point is 00:39:14 were quite a ways away from each other in terms of execution. If you think in terms of milliseconds, actually one of them was seconds apart. So it was almost impossible to simply expose it by, you know, running your test suite and hoping you hit the right timing at some point. So, yeah, the interesting thing that we discovered while doing this, building this tool so deliberately is that it actually finds the interleaving and the timing in a very short amount of time. And it finds it and it makes it deterministic in a sense.
Starting point is 00:40:07 It gives the developer the places where you inject and the actual values, the actual delay values. And developers can then go back and replay what our tool is telling them and they can reproduce the bug over and over again in a deterministic fashion. As opposed to, you know, without using our tool, they just run the test suite and hope for the best.
Starting point is 00:40:35 Yeah, you're flying wide, right? Without the tool, you really sort of, like, you've got nothing there. The tool, I mean, yeah. I mean, I don't know if you had much interaction with the developers in these set of these, behind these repos, where they actually were like, wow, we wish we had this tool know if you had much interaction with the developers of the in these set of these behind these repos, whether they actually were like, wow, we wish we had this tool. Did you get any feedback from them in that sense? Like we we tried, you know, but but we reported obviously we reported the bug and some of them were fixed silently.
Starting point is 00:40:58 Some of them were fixed, you know, by acknowledging we we didn't actively try to pitch our tool, but we did build our tool so that it uses all this, you know, strategies without much manual intervention. So the high level, and probably I should have mentioned this in the beginning, that the kind of the high level idea of our tool is to run without the developer knowing it's there. Essentially, if you think about the development process as you write code and then you ship your code to a build server, it builds and runs some daily tests and then, you know, it commits and ships to production. Our tool is meant to sit on that build server. And whenever you run the test, you run it without our tool and then with our tool trying to figure out, you know, finding these tight timing and interleaving constraints. So the developer doesn't have to know anything about this extra layer of protection that
Starting point is 00:42:03 we built. And we didn't try to actively pitch this, but we did try to reach out to a few development teams without any luck so far. But we're still pressing. You never know. Now the code is open source. You never know. Awesome.
Starting point is 00:42:19 So let's change it first and look at it from the other angle. So what are the limitations here? It sounds great and it sounds almost like a free lunch, but what are the downsides or the scenarios in which it's suboptimal? There are a few
Starting point is 00:42:37 limitations that we observe. Well, one limitation which is obvious before building the tool is one of the design decisions before building the tool is that we don't want to burden developers to write tests for finding specific tests for finding these concurrency bugs. And our sort of goal was to say, well, you know, whatever tests they put in the repo, we're going to repurpose them and force them to find bugs. So our bug finding results, in a way, as a first order approximation, are as good as the test suite that ships with the software. So if you don't have tests covering a particular piece of code
Starting point is 00:43:24 that might harbor a bug, we won't catch that bug. So that's a limitation. But that's a limitation that all of these active delay injection tools have, including TSVD, or at least debugging tools that rely on pre-existing tests. Having said that, more limitations specific to WAFL are related to the abundance of injection sites that WAFL has to deal with. And what I mean by that is simply this. If you have few locations in the program where these memorder bugs can occur, it's so much easier for Waffle to find these bugs. And while you have more and more and more and more locations, it makes it a tad more difficult for Waffle to find these bugs. And the reason is that it has to inject more delays, and injecting more delays introduces more noise.
Starting point is 00:44:27 And as you mentioned in the beginning, more noise usually doesn't help. It just makes the it just hides the bug better. So we haven't found an antidote to this noise. Hopefully people that pick our paper and pick up, you know, where we left off might do a better job. But, you know, if you increase the number of injection sites, our tool will have, will find bugs,
Starting point is 00:45:01 but it will take longer and we'll have a more difficult time. And you might have to tweak a little bit. So I would say that's what made the limitation. And again, it's a trade-off that we made. It's something we sort of figured out early. And we said, well, you know, it's all about the number. It's all about how much overhead you're willing to spend to get rid of that noise versus how fast you want the tool to find the bug and report the bug.
Starting point is 00:45:36 You know, if you can wait weeks, then I could probably write an algorithm that denoises everything, right? Yeah. If you only have a couple of hours, you know, trade-offsoffs it's all about trade-offs it is yes this paper is all about trade-offs we can you know when you ask me what's this paper about we can we could have just ended the discussion there yeah cool so where do you go next in from here then so what's next on the recent agenda for waffle so it um you know trying to build such a tool and compare it with state of the art is a difficult undertaking comparing with a state of the art that finds thousands of bugs it's even more challenging to to navigate and at least get get you know not only get the past reviews and
Starting point is 00:46:27 scrutiny and whatnot, but get it past our own, you know, scrutiny and whether or not we think this is ready to be unleashed upon the world. But so having, you know, so having said that, there is one important lesson that we learn from this delays is a sort of a fault, are great at finding, not only finding bugs, but giving you a way to replicate the bug, which is something that is extremely useful. I would venture to say many of the experimental code analysis tools are not used in production because they don't offer this determinism. They don't allow the developer to say, well, okay, run this tool once. I want to run it again, or I want to replicate the conditions and I want to get the same result. And actually, delay injection and fault injection more broadly gives you this. So that's a worthwhile idea to pursue.
Starting point is 00:47:51 And the second idea that kind of spinned off from this project is the idea of reusing existing tests as opposed to either synthesizing or having, or worse, having developers write tests for their own bugs. Because if they can write tests to find their own bugs, why would they use your tool, right? So what's next for Waffle is that there probably not be a SQL, but what we're working now is a tool that incorporates many insights and ideas that we used in building Waffle. And in particular, this idea of injecting faults strategically and deliberately, and the idea of having a tool that is essentially push button. You don't have to write tests. You don't even know it's there. It's reusing the tests you already written,
Starting point is 00:48:51 and it's repurposing them to find a particular type of bug. And yeah, of course, what comes with this is the fact that, and I guess this is the bigger takeaway of our paper, you know, whatever technique you're using, there are all these trade-offs. And the way to approach these trade-offs is to first figure out the properties of the bug you're targeting and then have the trade-offs match the properties. And that's the path of least resistance
Starting point is 00:49:17 when building this code analysis and debugging tools. Nice. I'd just like to kind of, there's a few questions that I've kind of, I have sketched down here, but I feel like you've answered kind of a few of them at various different points. So I'll kind of skip those. And I want to jump straight into
Starting point is 00:49:35 how you actually go about generating these ideas. Like what's your creative process like? And how do you then say, okay, this is like the thing I want to work on for the next two and a half years or whatever because it's quite a large undertaking right when you pick a project so yeah how do you go about i guess one generating ideas and two then selecting what to work on that's actually an interesting and hard question um i i i i think uh early in my career i was um
Starting point is 00:50:01 i was uh reading a lot of or my approach was reading a lot of research papers and trying to find a gap in the, or a limitation and jump on it, which is not inherently bad. I definitely recommend people, you know, or especially students that are just starting their PhD to definitely do that. Reading papers, it's really important. But that's kind of a drunken sailor strategy. You know, you jump from one limitation to another. They might not correlate with each other.
Starting point is 00:50:39 They might not be in the same family. They might require totally different techniques. And there might not be, you know, where still they might not be worthwhile pursuing. There might be a limitation that, you know, you're not going to solve anytime soon. And it might be crowded, a crowded space. So I kind of moved away from, so that was kind of in the beginning. And it helped me a lot to ground myself in this space of code analysis space, I would say, to know the techniques, to know the papers, the groups that work on various approaches.
Starting point is 00:51:12 But I kind of moved away from reading research papers to, or moving back, I would say, from reading papers to reading bug reports. And it sounds counterintuitive, but I cannot stress this enough. In order to build code analysis tools or select a particular problem to work with, nowadays, I just go and read bug reports. And I do have a vague idea of the type of bug that I want to tackle next, right? Like
Starting point is 00:51:48 order violations, which is generic enough. And then I go and read a bunch of bug reports to try to figure out why they happened. Is there a pattern there? Is there some signal? Is there some signal in how they occur? Is there a signal on how they're fixed? Are they fixed? Are they ignored? Some concurrency bugs are ignored. They're not that they don't affect the software in a way which elicits fixing them, right? Sometimes it's more expensive to fix them in terms of overheads and efficiency and perform overall performance, end-to-end performance than not fixing them. So that's how I go about. I kind of have a vague idea of my next bug.
Starting point is 00:52:31 Usually it's preempted by the previous paper that I was writing or previous tool that I was working on. Some of them are my own bugs, like I said in the beginning, trying to atone for my sins, and then go read a lot of bug reports to to try to extract some some patterns and uh yeah doing this you can do this automatically of course but uh where's the fun in that yeah yeah you you um you have to do this kind of manual work and go and read the code and and and figure out for yourself and explain to yourself why the bug happened and you do that for a few dozen bugs and uh usually at least for me i i i
Starting point is 00:53:11 see a trend then i jump on it and then i spent two and a half years building the tool amazing that's a great that's a great answer to that question cool well we're at um time for the the last the last word now bogdan so like what's's the one takeaway you want the listener to get from this podcast today? Debugging is hard. So don't beat yourself up if you have bugs and you cannot find them. And using code analysis tools is also difficult. But have patience. Many of our experimental, many of these tools are written by PhD students.
Starting point is 00:53:51 So I guess in a sense, for the listeners that want to do research in this space, it's rewarding in the end. It's frustrating. The road ahead might seem uncertain, but the end goal is rewarding. And there is nothing more satisfying than having your tool find bugs that have been in a code base for months or years and having those bugs fixed. And for those listeners that want to use tools, I mean, I would say, be a little bit more patient. Most of these tools are experimental. Some of them, I totally agree, they're clunky to use. But I think, and especially for those listeners that are software engineers, and, you know, they're practitioners, I think academia would benefit so much more from a tighter collaboration between the two. And you mentioned, you asked me at some point if we try to have developers use Waffle.
Starting point is 00:54:56 We would have liked to work with a few developers that would use it and give us the feedback and we could have improved it. But still waiting, there's still time. But that's kind of the two takeaways. Don't be afraid to work on the space because it's incredibly rewarding when you fix bugs. And it's a worthwhile effort. There are a lot of bugs out there and somebody should fix them.
Starting point is 00:55:30 And yeah, for practitioners, please use our tools, even if sometimes they're clunky. We would appreciate the feedback. Thank you.
