Disseminate: The Computer Science Research Podcast - Rohan Padhye & Ao Li | Fray: An Efficient General-Purpose Concurrency JVM Testing Platform | #66
Episode Date: October 6, 2025

In this episode of Disseminate: The Computer Science Research Podcast, guest host Bogdan Stoica sits down with Ao Li and Rohan Padhye (Carnegie Mellon University) to discuss their OOPSLA 2025 paper: "Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM".

We dive into:
- Why concurrency bugs remain so hard to catch -- even in "well-tested" Java projects.
- The design of Fray, a new concurrency testing platform that outperforms prior tools like JPF and rr.
- Real-world bugs discovered in Apache Kafka, Lucene, and Google Guava.
- The gap between academic research and industrial practice, and how Fray bridges it.
- What's next for concurrency testing: debugging tools, distributed systems, and beyond.

If you're a Java developer, systems researcher, or just curious about how to make software more reliable, this conversation is packed with insights on the future of software testing.

Links & Resources:
- The Fray paper (OOPSLA 2025)
- Fray on GitHub
- Ao Li's research
- Rohan Padhye's research

Don't forget to like, subscribe, and hit the 🔔 to stay updated on the latest episodes about cutting-edge computer science research.

#Java #Concurrency #SoftwareTesting #Fray #OOPSLA2025 #Programming #Debugging #JVM #ComputerScience #ResearchPodcast

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the Computer Science Research Podcast.
Hi, and welcome to Disseminate, the Computer Science Research Podcast.
I'm your host, still not Jack Waudby.
You may recognize me from the previous episode.
I'm, of course, yours truly, Bogdan Stoica, but you can call me both.
That's a mouthful.
Today, I'm joined by two guests, Rohan Padhye, who's an assistant professor at CMU,
where he leads the PASTA research lab, and Ao Li, a final-year PhD
student advised by Rohan.
They're here to talk about software testing woes, concurrency bugs,
and their most recent paper published at OOPSLA this year,
Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM.
Leo is also on the tenure-track job market this year,
so please check out his research and papers.
Reach out to him if you're on a hiring committee or if you know your department is hiring.
Rohan, Leo, welcome to the show.
Thank you very much.
Glad to be here.
Thank you.
So let's dig in.
I am always curious when we get new guests on the show to talk about how they actually found their way into their research area.
So for you guys, that is software reliability more generically.
So how did you end up in software reliability?
and why did you choose CMU, and why PASTA?
Yeah, so I've always been interested in doing research on anything to do with software,
mainly because it's easy to imagine myself as the customer or the end user for the research that I do,
because I build tools for software developers and I've been doing software development for a long time.
Specifically, the area of program analysis is something that I've been looking into for,
I think over 12 years now.
I think 12 years ago, I took a class called program analysis when I was a student in a master's program back in India.
And here I am today at CMU as a professor teaching a class called program analysis.
So this is something that I really enjoy doing.
And I've also worked at various different companies in different roles and have seen the kinds of code that actually runs in production and powers many of our software systems that we all rely on.
And so I sort of understand the importance
and impact of research that can improve software reliability.
CMU is really special.
You know, it's obviously a well-known university in computer science overall,
but specifically in the area of sort of software analysis,
it's very special because we have our own PhD program in software engineering.
CMU is also weird because we have many different departments within the School of Computer Science,
one of which, the one that I'm in, is called the Software and Societal Systems Department.
And so we essentially have the largest collection of researchers in the field of
sort of software analysis and engineering, probably anywhere in the world.
So it's really amazing, and we also get to work with amazing students, such as Leo,
on these kinds of fun projects, and that's what's really exciting.
Yeah, so for me, like, I started to work on software-related research starting in my undergrad,
where I was working with Karin, who was a professor at the University of Calgary.
And so we were building static analysis tools for Swift.
So that was a really fun experience because, for one thing, I got a chance to experience what the real-world bugs and issues are.
And also, I got a chance to contribute to open-source projects.
So I think that actually motivated me to continue to work in this area.
And also, I chose CMU.
Part of the reason is that CMU is a really big school,
where you can work on many different areas
and you can find experts in pretty much every domain.
So for me, I'm especially interested in software analysis,
but I also want to apply software analysis to real-world applications.
So I am currently co-advised by two professors:
one is Rohan Padhye, who's an expert in program analysis,
and I'm also advised by Vyas Sekar, who is an expert in
systems and network research.
So I think there are many places at CMU where you can actually find experts in pretty much every domain.
So that makes me think CMU is a great place to do this kind of research.
Sounds like you both have a similar path, working on program analysis during undergrad or during
your master's, then working in industry and on applied, more applied research projects,
continuing this line of research in your PhD and beyond.
So I'm curious: your work focuses on, or at least in recent years
has been focusing on, a particular type of issue, concurrency bugs,
which have been studied for decades.
You've both been working in this area for quite some time.
So why are concurrency bugs still a problem in industry today,
given all these extensive efforts,
and what's still out there to fix?
That's a great question, yeah.
And I think there's a two-part answer here.
Firstly, you know, concurrency bugs are just like harder than bugs in the sequential programs
for humans, for developers to a sort of reason about, mainly because of the way we're sort
of used to thinking about, you know, code.
We write code as, you know, like sequential lines in a file.
We sort of are used to imagining code executing as like traces of instruction by instruction.
concurrency bugs often arise due to, you know,
race conditions between, you know, various components interacting
in a way that's really hard for humans to imagine.
And therefore, they're rare.
They often escape testing,
unless you sort of are looking for them specifically.
And also, if you actually do encounter them,
they're really hard to debug because the order of,
the order of interaction between components
is not always very easy to sort of control or reason about.
That's from, you know, the point of view of the, you know,
the source of these bugs.
But as you said, like, you know,
researchers have been studying techniques
to, you know, detect and discover
and reproduce these bugs for decades.
Why are we still facing these issues today
in sort of most real-world applications?
Well, I think there's a little bit of nuance
to this aspect.
This is my personal opinion,
and I think an opinion that Leo also shares.
There's been a lot of work in academia
on developing algorithms
for, you know, finding various kinds of race-condition bugs,
everything from data races in multi-threaded code to, you know, race conditions in distributed systems.
But the real world code is complex. It's messy.
And often the challenge in finding concurrency bugs is not just about having the right algorithm
that can search the space in an efficient way, but it might sometimes be just having platforms
that allow you to run any of these algorithms in the first place.
a lot of concurrent programs rely on third-party libraries.
They rely on fancy features such as asynchronous computation and futures.
They might be running distributed systems.
They might depend on external services.
They might be stateful.
And a lot of the academic work sort of makes, in some cases,
simplifying assumptions about how these algorithms can run,
under what context they can run, when they will actually work.
And often the engineering of the practical tools is sort of left as an exercise for the
reader, let's say it that way. And so, you know, engineers who work on, you know, who build
real systems sometimes still find it hard to use any of this research that we have as a community
been doing for many decades in actually finding these sorts of race condition bugs in their
code. And so part of our motivation for this work was to bridge this gap. Yeah, that's a really
nice point. I think there's one thing I want to mention, which is that in many cases,
when we think about building a system or designing a new programming language,
In most cases, we think about usability, we think about performance.
But we don't really think about debuggability and testability at the very beginning.
So in many cases, testability and debuggability are actually an afterthought.
After we have already achieved the performance, after we have achieved the programming features,
then we think about how we can debug them, how we can test them.
So concurrency is actually a really good example
where people start to use concurrency to chase performance.
But after they achieve their performance goal,
they notice, hey, actually those concurrent programs
are extremely hard to debug and test.
So this is also my research goal.
I hope in the future we can bridge this gap
by helping developers think about how they can debug and test these kinds of programs more easily.
Awesome.
This resonates so much with my experience of having a grown-up job before joining my PhD program
and working in industry as a software engineer.
Leo, I think you hit the nail on the head.
Debuggability is an afterthought, oftentimes.
I would argue performance as well, but that's a separate discussion, maybe a separate
podcast episode. And yeah, I think, as a community, not only the research community but also the software
engineering community and the systems community, we need to start thinking about debuggability as built
into the system design process. And as you both mentioned, maybe have this as a platform,
or at least have these as abstractions that developers, and researchers
that are building research prototypes, can reason about.
Which brings me to my next question: is this the core idea behind Fray,
your most recent system that got published at OOPSLA this year?
The idea of Fray is very simple.
We just want to provide a general-purpose,
push-button tool for concurrency testing on the JVM.
The reason we picked the JVM is that we already had
lots of experience with Java and the JVM platform,
so we feel most comfortable about this.
And so we actually had a very simple goal for this paper,
for this project.
I see.
Rohan, I think, you know, for you, having worked in industry
and now being in academia,
you've seen both sides of the problem:
the struggle that researchers have with coming up with these tools,
making them more usable with fewer false positives and whatnot,
and then software engineers struggling to make sense of the output of the tools.
How does this new project bridge this gap?
I know, you know, from skimming through the paper
that Fray actually has real users.
And as Leo mentioned,
finds bugs in open source software
that software engineers care about.
Yes, absolutely.
And I think right from the outset,
our goal was to make the tool
that's sort of easy to use.
And so we upfront had very strict requirements
of not imposing a lot of manual burden
on users to set up their projects
to be able to use whatever solution
we're building for finding concurrency bugs.
So the way that Fray works is that
it just works off your existing unit tests.
So we actually sort of surveyed
a lot of real-world software projects,
things like Apache Kafka and Apache Lucene,
the Google Guava libraries, et cetera.
And we saw that it's actually quite common
for Java developers to write unit tests
that do have some sort of multi-threading in them.
So for example, I mean, I can sort of simplify this and say,
let's say you have a parallel sorting algorithm.
There are unit tests which would just say, you know, spawn a bunch of threads, you know, run your parallel sort,
wait for all the threads to join and then check that the result is, you know, it's a sorted list, right?
And so it's sort of a simple, you know, concurrent tests, but traditionally the way that these users have been checking the correctness of these concurrent algorithms is simply to run these unit tests multiple times, you know, sometimes called stress testing.
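To make this concrete, here is a minimal, self-contained sketch (plain Java, with invented names, not code from the paper) of the kind of concurrency unit test Rohan describes: spawn a few threads, join them, and then check that the result is sorted.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class ParallelSortTest {

    // Each worker sorts its own half of the shared array in place.
    static void sortHalf(int[] data, int from, int to) {
        Arrays.sort(data, from, to);
    }

    // Merge the two sorted halves into a fresh array.
    static int[] merge(int[] data, int mid) {
        int[] out = new int[data.length];
        int i = 0, j = mid, k = 0;
        while (i < mid && j < data.length) {
            out[k++] = data[i] <= data[j] ? data[i++] : data[j++];
        }
        while (i < mid) out[k++] = data[i++];
        while (j < data.length) out[k++] = data[j++];
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] data = ThreadLocalRandom.current().ints(1_000, 0, 10_000).toArray();
        int mid = data.length / 2;

        // Spawn two threads that sort disjoint halves concurrently.
        Thread left = new Thread(() -> sortHalf(data, 0, mid));
        Thread right = new Thread(() -> sortHalf(data, mid, data.length));
        left.start();
        right.start();
        left.join();   // wait for both workers before checking the result
        right.join();

        int[] result = merge(data, mid);

        // The property under test: the merged array must be sorted.
        for (int k = 1; k < result.length; k++) {
            if (result[k - 1] > result[k]) {
                throw new AssertionError("result is not sorted at index " + k);
            }
        }
        System.out.println("sorted OK");
    }
}
```

Running such a test many times in a row is exactly the "stress testing" approach mentioned above; it only exposes a bug if the operating system happens to schedule the threads in an unlucky order.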
So what we wanted to do is to leverage these sorts of existing tests, right, and not have developers, you know, write any sort of new
information in terms of like manual annotations or having to learn a new DSL. We just wanted
to take these existing tests and say we have a tool that will allow you to run this test
in a way that you can specify how the various threads in your application would interleave
and you can run one or more search algorithms. So it could be to systematically search for
all possible interleavings, which really doesn't scale, or maybe just do a random search,
or run one of these fancy algorithms that academics have been developing over the decades,
like partial order sampling
or some other well-known algorithms.
And so all of the, you know,
the magic is sort of under the hood.
From the user's point of view,
they're just running their existing tests.
They're pressing a button that says run them with Fray.
And then if we find some thread interleaving
that causes their test to fail,
either their assertion fails,
they get an unexpected exception,
or their code just hangs with the deadlock.
Then we give them a replay file that says,
here's a file with which you can actually
reproduce exactly the same behavior
that you saw when running with Fray.
And so that would rerun their existing test with a specific sequence of
thread interleavings that will show them how that buggy behavior manifests. And so we found this
paradigm to be very natural for users to leverage. And that, I think,
helped us and the tool gain traction really quickly with real engineers. So what you're saying
is you built a concurrency debugging platform as a service, right? Yeah, something like that.
Yeah, no, that's that's much, much needed in both in industry and in our research community.
One of the woes in working on concurrency bugs and trying to build on others' tools is, well, of course,
concurrency bugs are non-deterministic, but it turns out that the bug-finding tools are also non-deterministic and hard to work with.
And it sounds like Fray is trying to bring some much-needed predictability to this space.
You mentioned real users and open source software that the community cares about. But before
we get into the practicality and usability of Fray, I wanted to take a step back and maybe tell the listeners
a bit about what type of concurrency bugs you tackle. Is it all types of concurrency bugs or a very
particular subset? And, you know, as a person that has written software programs before
and is familiar with having to write tests (you also mentioned, Rohan, that developers usually
write tests with some concurrency built in), why aren't those sufficient? So I guess,
what type of bugs does Fray tackle, and why can't they be surfaced by the existing infrastructure that software engineers have built?
So with Fray we're looking for, you know, race condition bugs, which broadly defined just means some undesirable behavior that manifests under some
thread interleavings, but not necessarily all of them. We don't actually have any sort of special logic or semantics to say we can detect a specific kind of bug in your program.
We expect that the existing tests that the developers have written have some sort of, you know, assertions or some property that they're checking that ensures that the software is
behaving correctly. But from experience, the kinds of bugs that we actually do end up uncovering
deal with things like atomicity issues or ordering violations. So atomicity issues are those
where a program is maybe performing a set of operations that ideally should all occur without
any other interference. But in practice, those operations could actually interleave with
other threads that are interacting with shared memory. A classic example of this is,
even if you use something like a ConcurrentHashMap in Java,
which is a thread-safe data structure,
a common sort of buggy pattern is where you might check,
is there an entry for this hash map at some key,
and if it does not exist, maybe I will write a new entry.
But in between these two operations,
another thread could come in and interleave
and sort of manipulate the hash map
in a way that you actually get undesirable effects.
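As a minimal illustration of this check-then-act pattern (the class and method names here are ours, invented for illustration, not taken from the bugs Fray found):

```java
import java.util.concurrent.ConcurrentHashMap;

public class CheckThenActExample {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    // Buggy: each individual map call is thread-safe, but the check and the
    // write are not atomic together. Two threads can both see "no entry",
    // both write 1, and an increment is lost.
    void incrementBuggy(String key) {
        if (!counts.containsKey(key)) {
            counts.put(key, 1);
        } else {
            counts.put(key, counts.get(key) + 1);   // read-modify-write race as well
        }
    }

    // Fixed: push the whole read-modify-write into a single atomic map operation.
    void incrementAtomic(String key) {
        counts.merge(key, 1, Integer::sum);
    }
}
```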
The other thing is things like ordering violations,
which means that applications expect a certain sequence
of operations to be run in a certain order, but actually there might be some sort of
interleavings where you actually end up with a violation of whatever protocol the application
expects. And this often arises because the developer may have had some assumption or
understanding of what state the program is in after a certain sequence of operations, but
maybe that state either gets corrupted or you sort of end up having a state transition that was
not in your original protocol design. And so often these are like design bugs, which are
not very trivial to fix.
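A tiny sketch of such an ordering assumption, with one way of making it explicit (the names are illustrative; the right fix always depends on the application's actual protocol):

```java
import java.util.concurrent.CountDownLatch;

public class OrderingExample {
    private static String config;                            // set by the init thread
    private static final CountDownLatch ready = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread init = new Thread(() -> {
            config = "loaded";                               // the step the worker depends on
            ready.countDown();                               // publish "initialization done"
        });
        Thread worker = new Thread(() -> {
            try {
                // Buggy version: read `config` directly and assume init already ran.
                // Correct version: wait on the latch so the order is guaranteed.
                ready.await();
                System.out.println(config.length());         // safe only after await()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        init.start();
        init.join();
        worker.join();
    }
}
```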
We sort of distinguish these kinds of bugs from simple data races.
So a lot of academic work has focused on this data race detection.
So, you know, just for the benefit of our listeners, we can
precisely define a data race as, you know, a situation where you have more than one thread
accessing the same shared memory location, like, let's say, you know, a field of an object
or a global variable.
And at least one of those threads is performing a write operation.
So you have like a read-write or a write-write conflict.
And there's no synchronization between these threads, right?
So there's no use of locks or no use of special keywords like volatile.
And these kinds of data races are bad for many reasons.
And there's a lot of research focused on trying to find these data races.
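For listeners who want the textbook picture, here is a minimal example of such an unsynchronized conflict (two threads doing an unsynchronized read-modify-write on a shared field):

```java
public class DataRaceExample {
    // Shared field with no locks and no `volatile`: two threads write and
    // read it concurrently, so this is a data race.
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++;               // unsynchronized read-modify-write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Often prints less than 200000 because increments are lost.
        System.out.println(counter);
    }
}
```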
In Fray, we took the conscious decision to say we don't actually care about data races.
We're actually trying to find, we're trying to find concurrency bugs and programs that
otherwise may be well synchronized.
And there's a couple of reasons for this.
One is that there are lots of data race detection tools out there.
And secondly, it's also not that hard to sort of fix data race bugs, right?
If you find the data race, usually the solution is add some locks,
add some synchronization, and you're sort of done.
And so we were targeting these harder bugs, like protocol-level bugs
or where the issues are usually more in the design aspect.
And these are trickier for developers to reason about and to debug and to fix.
And so our tool sort of helps bridge that gap.
I see.
So, and feel free, both of you, to chime in here.
To me, it's a bit surprising every time I read about concurrency bugs in Java.
Because Java has such a wide array of features that help developers write and reason about concurrency in their program.
But still, at least in my limited interaction with, you know, C, C#, Java, Python,
I don't feel that there are fewer concurrency bugs in one programming language than in another.
Why do you think that despite all these new features and useful features in Java,
we still face similar concurrency bugs that I used to face when I was writing C?
I think one reason is that concurrency bugs manifest because of timing.
And it's really hard for a developer to reason about timing,
especially when they interact with large and complex systems. So for many bugs Fray found, it was actually, like, one thread is touching a state,
while another thread, which is very far from the current thread (by far, I mean, like, the file
that implements that thread is not located in the same package or in the same class,
and also the time the other thread is created is very far from the
current thread), is touching it too. So, like, understanding the complex interleaving across different
threads in, you know, a complex system requires a developer to memorize, like, all kinds of
interactions across files, across threads. So I feel this part is very challenging. And another
thing is that this is non-deterministic, meaning that after I implement my feature, I can run the
program, I can run the test, but the bug will not, like, appear immediately.
It may only appear when your server is under a high workload, when your CPU is saturated,
or when you run it on very special hardware.
So I think all of these factors add to the complexity of debugging a concurrent program.
And if I can add to that, Bogdan, you mentioned the word performance.
some time ago.
And while Frey doesn't specifically look for performance bugs,
I think one of the reasons why we have so many concurrency bugs out there,
even in, you know, languages that have nice abstractions for writing current code,
is that developers often end up doing really crazy things to squeeze performance out
because there's a trade-off between, you know, clean abstractions and performance in many cases.
And so if you want performance, you sometimes have to, you know,
write a little bit more low-level code
than what some of the high-level libraries
provide. The kind of APIs
that high-level libraries provide. And so a lot
of the bugs that we've seen have been in sort of these
infrastructure layer libraries, things like
Kafka or like Google Guava, which is a set
of libraries, which are trying to provide
higher-level functionality to their clients,
but they do so using
slightly more high-performance
algorithms that deal with, like, low-level
primitives. And that's sort of where you run
into the same kind of trap that you might
run into if you were, you know, trying to write
concurrent C code, as you said.
Right, right.
So timing issues being non-deterministic and trying to squeeze performance are definitely,
definitely factors that contribute to having more concurrency bugs.
And I feel that as you both mentioned earlier, that, you know,
Frey is not actually trying to specifically trigger a particular pattern or a particular bug,
but rather be a little bit more broad.
Having said that, Fray finds different categories of bugs.
And I'm curious if, you know,
if you could first maybe talk about the high-level design of Fray
and what's the process of finding bugs:
how does Fray help developers find bugs?
Maybe not necessarily in terms of techniques,
but rather in terms of abstractions.
So, yeah, actually, again,
like we are trying to make Fray push-button
and easy to use. And to achieve this goal,
we put in lots of effort to make Fray easy to use.
And currently, if your project uses Gradle or Maven,
we already provide a Gradle and a Maven plugin,
and it will set up the Fray tool for you.
and it will set up the Frey tool for you.
And the only thing you need to do is that if you're using JUnit, you can replace the test annotation
with a concurrency-test annotation.
And if you are using another testing framework, we also provide a wrapper so that you can launch
Fray with a simple method call.
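Based only on the description above (swap the JUnit @Test annotation for a concurrency-test annotation), a usage sketch might look roughly like the following. The annotation name and the import path are assumptions made for illustration, so check the Fray documentation and README for the actual API.

```java
// Hypothetical sketch: a JUnit-style test whose @Test annotation has been
// replaced by Fray's concurrency-test annotation. The import below is a
// placeholder, not the real package name.
import org.example.fray.ConcurrencyTest;   // placeholder import for illustration

import java.util.concurrent.ConcurrentHashMap;

public class CacheTest {

    @ConcurrencyTest   // Fray would explore thread interleavings of this test body
    void concurrentPutIfAbsent() throws InterruptedException {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        Runnable writer = () -> {
            if (!map.containsKey("k")) {   // check-then-act pattern under test
                map.put("k", 1);
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        a.join();
        b.join();
        if (map.get("k") != 1) {
            throw new AssertionError("unexpected value: " + map.get("k"));
        }
    }
}
```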
And the way Fray works is that Fray will instrument your application with small locks.
And those locks will cooperate with each other so that
only one thread runs at a time.
This allows us to exhaustively explore all
possible thread interleavings in your application.
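The hand-off pattern Ao describes (only one application thread runs at a time, with a central scheduler choosing who advances to its next scheduling point) can be sketched with plain semaphores. This is not Fray's actual implementation, just a minimal illustration of the general mechanism under that assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Semaphore;

public class CooperativeSchedulerSketch {

    static class Worker {
        final Semaphore mayRun = new Semaphore(0);   // scheduler -> worker
        final Semaphore yielded = new Semaphore(0);  // worker -> scheduler
        volatile boolean finished = false;
    }

    // Called by instrumented code at every scheduling point
    // (e.g. before a lock acquire or a shared-memory access).
    static void schedulingPoint(Worker self) {
        self.yielded.release();                 // hand control back to the scheduler
        self.mayRun.acquireUninterruptibly();   // wait until we are picked again
    }

    public static void main(String[] args) {
        List<Worker> workers = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            Worker w = new Worker();
            workers.add(w);
            final int id = i;
            new Thread(() -> {
                w.mayRun.acquireUninterruptibly();  // wait for the first grant
                for (int step = 0; step < 3; step++) {
                    System.out.println("thread " + id + " step " + step);
                    schedulingPoint(w);             // yield at each step
                }
                w.finished = true;
                w.yielded.release();                // final hand-back to the scheduler
            }).start();
        }

        Random rnd = new Random(42);                // fixed seed: the schedule is replayable
        List<Worker> runnable = new ArrayList<>(workers);
        while (!runnable.isEmpty()) {
            Worker next = runnable.get(rnd.nextInt(runnable.size())); // pick one thread
            next.mayRun.release();                  // let it run to its next point
            next.yielded.acquireUninterruptibly();  // wait until it yields or finishes
            if (next.finished) runnable.remove(next);
        }
    }
}
```

Because the scheduler, not the OS, decides who runs next, recording the sequence of choices (or the random seed) is enough to replay an interleaving deterministically, which is what enables the replay files discussed earlier.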
But as we mentioned before,
systematically exploring all possible thread interleavings
is really hard and really time-consuming,
especially for real-world concurrent programs.
So we also provide more advanced search algorithms,
like PCT, POS, and SURW. Those are all well-studied
concurrency testing algorithms, and they are proven to be more efficient in terms of bug finding.
So in this case, we believe finding bugs is more valuable than proving correctness,
because we're positioning Fray as a useful bug-finding tool, similar to fuzzing.
So once a bug is found, a user can deterministically replay that bug.
So we also actually implemented a debugger plugin inside IntelliJ,
and a user can visualize the entire timeline of each thread
and can understand how different threads interact with each other
in a way that leads to the final bug or final failure.
So, yeah, we are actually doing more at the current stage
to make Fray easier to use,
and we are constantly improving the user experience of using Fray.
Awesome.
You mentioned in the very beginning,
and this is super useful.
Thank you.
Thank you for walking through the use cases of Fray.
But you mentioned something very interesting.
It was actually two things,
but I'll start with the one that I find most interesting.
You said that, you know,
concurrency testing is always
trying to balance the tension between coverage and state explosion.
And whenever we're trying to explore thread interleavings, you end up with state explosion.
Could you tell us a little bit more about the trade-offs you had to make
in order to balance this tension?
Yes.
So we actually made lots of efforts trying to reduce the search space,
actually starting with the data-race-free assumption.
This is already an optimization we wanted to make
to reduce the search space.
Once we have the data-race-free assumption,
we only need to perform thread interleaving
before concurrency primitives.
So this actually aggressively reduces the search space for us,
but it still provides a soundness
and completeness guarantee for us,
meaning that, like, every time
Fray finds a bug, this bug can appear
in a real-world scenario in the
original program, and
for every possible
concurrency bug in the
real-world application,
Fray can find some
thread interleaving that triggers
that bug. So this is one
thing we did to reduce the
search space, but even with that,
systematically exploring
thread interleavings is still challenging.
Like, for example, for many Lucene and Kafka bugs,
even running one iteration takes a couple of seconds.
And think about the search space:
there are around a million thread interleaving points
in one single test run.
But we still want to find bugs faster.
Doing a random walk is, like, the baseline we have:
at every point you can randomly pick a thread to run.
But as we all know, concurrency testing is actually a crowded area
and has been studied for many years.
And there are already very nice algorithms that are proven to be more effective in finding bugs;
it's just that there's no practical framework for developers to use them.
So now we actually bring those algorithms to the
developers, and developers can just use those algorithms to test their concurrent programs and
find bugs much more efficiently. Even better, we actually designed Fray in a way that makes implementing
a new concurrency testing algorithm very easy. So we provide a uniform interface for developers
to implement them. We already implemented many of them, so we know that implementing a new
algorithm is actually very easy inside Fray.
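To illustrate what "a uniform interface plus pluggable algorithms" can look like, here is a hypothetical sketch. Fray's real interface will differ; this only shows that, in this style of framework, a new search algorithm essentially boils down to choosing the next thread to run.

```java
import java.util.List;
import java.util.Random;

// Hypothetical scheduling-strategy interface; names and shape are assumptions
// for illustration, not Fray's actual API.
interface SchedulingStrategy {
    // Given the currently runnable threads, choose the one to run next.
    int pickNext(List<Integer> runnableThreadIds);

    // Called when a new exploration (test iteration) starts.
    default void reset() {}
}

// Random walk: the simplest baseline mentioned above.
class RandomWalkStrategy implements SchedulingStrategy {
    private final Random rnd;

    RandomWalkStrategy(long seed) {
        this.rnd = new Random(seed);   // a fixed seed keeps the schedule replayable
    }

    @Override
    public int pickNext(List<Integer> runnableThreadIds) {
        return runnableThreadIds.get(rnd.nextInt(runnableThreadIds.size()));
    }
}
```

Algorithms like PCT or POS would then be further implementations of the same interface, each encoding a different policy for which thread gets to advance.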
Awesome. So Fray is not only a bug-finding platform where you as researchers have already
implemented several techniques that could trigger concurrency bugs, but it's built in such a way
that you allow developers to write their own random walks if they want, or their own fault
injection, I'm assuming, or delay injection, whatever they feel is
appropriate for their scenario, which I think is great.
It's something that I guess both Leo and Rohan were mentioning in the beginning:
not only the usability, but the flexibility of those platforms, and the fact that developers can now
implement something that they know, or that is much easier for them to understand, or maybe
very specific to their scenario, which I guess makes Fray a very versatile platform.
So Fray uses a collection of algorithms and techniques to surface all these concurrency bugs.
But I'm curious, how do you come up with the workload to run the programs and get them into a faulty state?
What are you using for inputs?
Are you relying on existing tests, or are you hoping that developers will write their own tests using Fray?
Can you walk us through this?
Yeah, so this actually goes back to the motivation of this project.
So as Rohan had mentioned, we explored these popular concurrent Java programs:
Kafka, Lucene, and Guava.
There are already plenty of concurrent unit tests running in their CI pipelines every day.
So our goal was not to provide a framework so that developers can write
new concurrency tests. Our goal is just to implement a platform that can reuse those
existing concurrent tests. And it turns out these concurrency tests are actually sufficient
to test many concurrent features inside the existing frameworks; they were just lacking a better
testing framework that could surface those bugs. So in our evaluation, we found more than 18
bugs just by rerunning all the existing concurrency tests in those large-scale applications.
So that's what we did for Fray.
Yeah, if you want to, I can provide some numbers from our paper.
We actually saw that there were 2,600 or so unit tests across the three projects I mentioned
before, Apache Kafka, Apache Lucene, and Google Guava.
These are widely used, you know, sort of software libraries or frameworks.
and 2,600 of their unit tests had some sort of concurrency in them.
So they spawned some threads, did some work, joined the threads, and checked the result.
And these pass in, you know, CI usually every day, as Leo mentioned.
But with Fray, we found that there were 363 of these tests,
so that's a pretty large fraction, I guess somewhere between 10 and 15% of
these tests, that could actually fail given a certain thread interleaving,
which we can now, you know, deterministically replay and show the developers:
a specific interleaving under which either the assertion fails, or you get an unexpected
null-pointer exception, or you get a deadlock and the whole test hangs. And so this was definitely
very surprising to the developers, which is why, you know, we got some traction in having many of
these bugs fixed. So I mean, there were lots of tests failing, but I think the root causes, as Leo
mentioned, were around 18 unique bugs that we filed in these repositories. But going
back to your original question, Bogdan, these tests already were pretty good,
And the developers wrote these tests as a way to cover various functionality in their code.
And so they sort of pick their own loads, so to say.
And Fray just helps, you know, find the right interleaving that can maybe find some bug.
What could be challenging is if you actually end up having a load that's intentionally designed to be like a stress load, right?
Where, if you're spawning like tens of thousands of threads or you're, you know, running a very large workload in a loop,
that can actually become a little bit problematic for Fray,
because the state space to search would really explode. And so we sort of focus
on these unit tests, not end-to-end system-level stress tests, for searching over
interleavings. Thank you, Rohan, for mentioning this. Actually, you know,
one of my next questions was about Fray's impact, but I'll hold off on that a little
bit. You both mentioned two interesting things, and I guess, you know, this is interesting to me,
but I hope listeners also find it interesting.
The fact that these software applications
already ship with tests
that are meant to exercise concurrent parts
of the application.
So developers are aware of at least some bug
or failure patterns
that could surface in their application
while it runs in production.
So they try to get ahead of this
and mitigate it.
Still, you know, writing tests is not sufficient.
And I think this is exactly what Leo was mentioning,
a little bit in the beginning,
that it's just incredibly, incredibly challenging
to reason about concurrency bugs by just looking at the code.
You have to keep in mind all the various timing and interleavings
that could happen, could occur in practice.
And, of course, that is something very complicated to do
in a complex software.
You also mentioned that Fray finds a lot of bugs, but they map to about 20, you said, about 20 fundamental root causes.
I would be curious, you know, both Rohan and Leo, can you tell us a little bit more about these root causes, what is surprising about them, and why do you feel, you know, developers are still falling for these pitfalls, if you will?
So, yeah, as you had mentioned, to be honest, there's nothing new about
these bugs; those bugs are already well-studied. Like atomicity violations, where
basically the developer assumes that a certain operation is atomic, but apparently it's
not, and when two threads are touching the same operation, bad things happen.
And there are also order dependencies, where, like, developers assume there is a certain order
in which specific operations will happen.
But in fact, if you run them in a concurrent setting, the order is not always guaranteed.
I think the surprising thing we found is that you don't really need a sophisticated
search algorithm to find many bugs.
Many bugs can surface even with a random walk
with a very small number of iterations
(here, by iteration I mean different interleavings).
It's just that if you run them in a normal CI environment,
you don't have access to this controlled concurrency testing environment.
You're running that on a normal Linux machine, with the normal kernel scheduler, and that bug won't appear,
just because how the kernel schedules your program is different from how a controlled concurrency testing framework will schedule your program.
So this also highlights the importance of having this sort of framework: even though we have so many other methods and techniques and search algorithms,
having this kind of framework is still important to surface that many bugs.
Yeah, I agree with everything that Leo said.
I think one other sort of surprising code pattern that I saw in some of the bugs that we found
was this notion of spurious wake-ups. Essentially, if you have code that is
suspending a thread, either using APIs like, you know, Object.wait in Java, or it could be Thread.park, or it could be
Thread.sleep, there's often an assumption that some developers make that once the thread
sort of wakes up from this blocking call, the state has transitioned to something good,
like the thread was woken up because, you know, there was a resource that it was waiting
for and now that resource is available, and it proceeds to do something with that assumption.
But it could be the case that the thread actually is woken up before the resource was made
available. And this is a very nuanced situation, because it requires sort of understanding
how the suspend and resume APIs for things like, you know,
Object wait/notify or Thread park/unpark in Java actually work.
It turns out that the JVM doesn't really give you many guarantees about when these
suspended threads wake up. And so you actually, in theory, could wake up from these pauses
even before the resources that you were waiting for were actually made available.
And so the correct code pattern to use is always to check, after your thread is
woken up, that you are in the state that you expect to be in. So simply
having a check or an assertion would have prevented some of these errors.
But in the absence of those checks, you know, the code just keeps going.
And often, you know, it might just corrupt some data or try to read from a null pointer,
and the program would usually crash.
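The standard defensive pattern Rohan alludes to is to re-check the condition in a loop after every wake-up rather than assuming the resource is ready. A minimal sketch with a single-slot hand-off cell:

```java
public class BoundedCell {
    private Object value;                       // the "resource" a consumer waits for

    // Correct pattern: re-check the condition in a loop after every wake-up,
    // because wait() may return spuriously or before the resource is ready.
    public synchronized Object take() throws InterruptedException {
        while (value == null) {                 // a loop, NOT an "if": guards against spurious wake-ups
            wait();
        }
        Object v = value;
        value = null;
        notifyAll();                            // wake producers waiting for an empty cell
        return v;
    }

    public synchronized void put(Object v) throws InterruptedException {
        while (value != null) {                 // same looped check on the producer side
            wait();
        }
        value = v;
        notifyAll();                            // wake consumers waiting for a value
    }
}
```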
I see. Yeah. So I think this is one of the features that I love the most about
Fray and reading your OOPSLA paper:
the fact that it proves time and time again that you don't need
overly complicated algorithms to find these concurrency bugs, and that
simple algorithms can find many of these bugs. And at the same time, it also reminds us
how difficult it is, going back again to what you both said in the beginning, going back
to this challenge of reasoning about time and timing
in your concurrent application.
And I think, you know,
this is what makes Fray so attractive
and so useful.
And I just have to mention this for the listeners.
Fray not only finds bugs
that happen in open source software
that developers care about,
it actually started to build a bit of a community around it.
I peeked a little bit on Leo's website
before this podcast,
and I saw a bunch of
user testimonials.
So maybe, Leo, can you tell us a little bit more about this ongoing adoption effort,
if you will, going from a research prototype and a research paper to an open-source
tool that attracts a community around it?
Yeah, definitely.
So I'm always interested in building tools for real-world developers.
And I learned a lot from developers as well.
So this is also one of the goals of the Fray project itself.
We are not only building a concurrency testing framework for the research community;
we are also building a tool that developers can use.
So that's also an ongoing effort.
So we released Fray as an open-source project,
and we also encourage developers to use Fray,
so we wrote a couple of tutorials and blog posts
about how Fray can be used to test concurrent programs
and how developers can integrate Fray
into their testing pipeline.
And also, let me add a quick call for contributions here.
If you are interested in Fray
and you want to contribute, definitely
email us or, like,
browse the GitHub issues and grab anything you can help with.
That is definitely welcome.
And listeners will have the opportunity to do that.
If you go to the episode page, you will find all the relevant links: to the paper, to GitHub,
to Leo and Rohan's web pages where you can contact them, and, yeah, join the Fray, if you will.
Rohan, how do you decide after all this, you know, after all this experience both in academia
and industry and working in software reliability for so long, how do you decide a research
prototype or research idea is ready to ship as a tool that software engineers can actually use?
Yeah, so that's a good question, especially because, you know, as researchers and academics,
We sometimes, you know, fall into this trap of, you know, iterating on something until, you know, we think it's perfect, and we set really high bars for ourselves and never get there.
So I think it does require a balance.
I mean, you know, the more pragmatic answer is we obviously, you know, we obviously have like publication deadlines.
And so we, you know, at some point we need to, you know, find like a logical point for the project to be considered as, you know, mature, ready to write up, ready to publish, let's say, evaluation results.
But I think the question you're asking is more from the point of view of how do we actually ship the software to users and say, hey, you know, it's ready to be used.
I think it requires several steps, right?
I mean, in our case, we, you know, we had some early versions, like almost like beta versions that we, you know, we made sure they work on real world code.
I think our main target was that we wanted it to be able to work, you know, in a push-button fashion on large-scale software targets like Apache Kafka, which is not at all trivial to do.
And then also once we're able to do this for a few projects,
we also checked it on, you know, a sort of unseen validation set, right?
Does the tool also work on other kinds of targets
that we haven't explicitly considered when developing the tool?
I think that's a good benchmark to know that the tool is ready,
that we don't have to keep, you know, iterating on our tool
with every new target that we try to evaluate on.
And once we sort of reach some convergence,
where every new target we consider our tool runs on just fine,
I think that's good to go.
But I think the other thing also is that the students in my group spent a lot of time
interacting with open-source developers.
Like, for example, Leo spent a lot of time debugging all of the issues found by Fray in open-source software,
you know, creating, you know, easy to understand, you know, bug reports,
providing more information to the developers to get those bugs fixed.
I believe we also gave, you know, a talk at some companies that were supporting this open source development.
And I think I would encourage other, you know, researchers in academia who care about, you know, the use of their artifacts to also consider these things that are not sort of strictly required from an academic point of view, right?
Like, you know, you don't really get as many, you know, beans, let's say, to count (for those who count beans on folks' CVs) beyond your research publication.
But if you do care about the community, if you do care about your tool being useful, spend the time actually interacting with, you know, the open-source community, get feedback, you know,
write blog posts, go and give talks at developer conferences and meetups, participate in interviews for podcasts hosted by, you know, eminent publishers of tech media, and things like that, right?
And just to make sure that, you know, you get feedback.
And I think as long as you're open to that feedback, as long as you're open to allocating some time and iterating on the artifact, even after
the, you know, the paper is published and you move on to other projects, I think that's
important. And then that's helped us in not only this project, but other projects as well.
So thank you, Rohan, for saying this. I think many of the listeners will take all of this
advice to heart. And it's certainly something that I felt in my own research that, you know,
it's nice to have a paper accepted and counting beans, as you put it. But it's also
equally important, not necessarily to show others your impact, but at least to have an
idea yourself of what kind of impact the tools you're building have. And this kind of guides you,
maybe this could guide your next move or your next iteration of your research. I'm curious what
you think about our research community and how our research community can better support these
efforts, because, as you say, we love counting beans, and that's not
necessarily bad, I think it's just a fact of life. But I'm curious, from your perspective, can our
community do more, and what should that more be? Yeah, I think that's a great question,
and it's an important discussion in general. I think valuing, you know, contributions
beyond just the traditional academic beans of things like publications and citations
is really important, and our community should strive to do that.
And I have seen changes, especially in like the systems community.
There is, in at least some subsets of the community, a lot of respect for, you know,
researchers who are actually able to deliver on artifacts and sort of show real-world impact.
I know this is true at CMU for sure.
Like, you know, in our processes, everything from faculty hiring,
you know, to promotions and tenure, you know, the university has a policy of taking into account
all different kinds of impact, not just, you know, traditional research publications. And I think
that also stands by the recommendations of, you know, the Computing Research Association and
some other sort of standards. So I think as long as, you know, the community embraces this kind of
work, I think we will make more progress, you know, moving forward not just as a field in terms of science,
but also in terms of acceptance by the actual, you know, consumers of our scientific innovations,
in this case, you know, software engineers or even the end users of the software systems
who care about the reliability of these systems.
That's great to hear: that not only, you know, individual researchers care more and more
about these efforts, but also, you know, hiring committees and promotion committees are taking
into account impact more and more. And hopefully this will spark or encourage more groups
to at least publish their artifacts and have their artifacts out there.
And you mentioned that the systems and software engineering communities and other sub-communities in
computer science have these ongoing efforts: for every conference,
there is a parallel track where you can get your artifact evaluated, and that definitely
helps a lot. So, which brings me to my next question: what is the next step in developing and
productizing Fray? Will there be more extensions, follow-ups, maybe replicating the same ideas for other
types of bugs? What's next? Yeah, so there are many things. I may have a different vision
than Rohan, but I can start with mine. One thing I noticed
is not about
bug finding, but about debugging.
Even though Fray found so many
bugs, for each of them
debugging still takes effort.
It's just because the software is complex,
the interleaving is complex,
and reproducing
certain bugs takes
many steps,
and even
letting the developer fully understand
what happened behind the scenes is still challenging.
So this means
we need
to design better debugging algorithms here. What I mean is similar to fuzzing,
where we have input minimization, right? So maybe for concurrency tests, we also need to have some
sort of input or scheduling minimization for concurrent programs to help the
developers to debug them. On the other hand, there are also UI and UX challenges: we have
debuggers, we have
IDEs for sequential
programs, but we don't really have
a debugger or an IDE
for concurrent programs.
So we are also thinking about
how we can build
better visualization tools for
developers to debug concurrent
programs. So yeah, this is
something I feel is important.
Yeah, I totally agree with
the debuggability aspects. I think that
can be improved. Another thing that we are
sort of excited about as sort of
the next step beyond Fray
is thinking beyond this single-process
multi-threaded program,
which is what Fray currently targets,
but also looking at distributed applications,
so things like, you know,
Apache ZooKeeper or Cassandra.
These distributed systems are all sort of still within Java,
and they are still
susceptible to, like, race condition
issues.
Often these race conditions
are rooted not just in
inter-thread interleaving, but
also in interleaving across, like, messages over the network, you know, through the distributed
system. And so we have some ongoing, you know, work on trying to use some of the core ideas
from Fray to also test distributed systems. So, you know, we can apply these, you know,
concurrency testing algorithms not just on thread interleavings, but also on the interleavings
of network messages. And that's sort of a work in progress. So, you know, maybe we'll come
back on this podcast once, you know, those results are a bit more mature.
And yeah, that's the direction I'm very excited about.
I also like that Leo started his answer by saying that he might have a different vision than me.
I think he's now going on the tenure track job market.
So this is very apt for him to have his own independent vision.
And I also would be excited to see how he starts his own research group and where he takes things from there.
I'll keep a close eye on your lab and Leo's new lab,
and I'm sure you'll have more than a handful of offers beginning next year.
And I'm also excited for what his vision is about debugging in general
and how developers and researchers should think about finding bugs when building software.
And this is an excellent segue to sort of my last set of questions,
which revolve around creative process.
And I think many listeners and myself included
always struggle with coming up with new interesting ideas
that a community cares about
or finding research problems that you can actually tackle
in a few years or a grand cycle, right?
And it doesn't take a dozen years to solve.
Or you have to wait for a new,
a new math to be invented to solve.
So I was curious, I'm curious, how both of you,
how do you come up with these ideas beyond, you know,
just reading papers and maybe using your industry experience?
What's your creative process?
How do you scale a problem,
to select a problem for an appropriate project,
PhD project-wise scale?
I just hire amazing students
and I
delegate the task to them,
so maybe Leo can
explain this. The whole Fray project was, I think,
born out of Leo's observations
with
concurrent programming in Java and
debugging. He's very tuned in
to things that the
open source community is working on and
he's good at identifying
the gap between
academic research and
practice. And so
a lot of the main motivation was treating, you know, applicability to developer targets
as, like, a first-class research problem, which is what Fray really does, right?
We didn't invent any new algorithms or, you know, research techniques.
We're just building a new platform.
That entire sort of process and that entire line of thinking, you know, came out of, you know,
Leo's experience, and I'm lucky to work with amazing students such as him.
So I'll let Leo answer this in more detail.
Yeah, thank you, Rohan. Thank you. So actually, when I started working on this kind of research, I also started by designing new algorithms. So I did my internship at Amazon three years ago, where we were designing a better concurrency testing framework inside P. So P is a framework for modeling and testing distributed systems. And because it's testing distributed system models, it has full control over
the message processing, concurrency, and, like, execution.
So that was cool, and that was a very fun experience.
And that also led me to learn that controlled concurrency testing is a really powerful tool for
debugging (sorry, for testing) concurrent applications.
But if you look at the real world, people write C, C++, Rust, Java, C#.
They don't really write P.
They only model their application in P,
but when they write the real-world application,
they still use these commercial or, like, real-world languages.
But we don't see this kind of thing exist in these popular languages.
So even though there are many research papers
talking about this, developers are still using stress testing.
Developers still just write some unit tests and hope that
will help them to find the concurrency bug.
So having been exposed to the power of, like,
controlled concurrency testing, and also having observed this gap between the academic research,
which we all know about, and the real world, is what helped us to solve this problem.
Right. So we all know that controlled concurrency testing is a good thing, and the goal is just to bring controlled concurrency testing to the developers.
Awesome. Awesome. So basically what you're both saying is, in order to find these interesting problems that the community cares about, you really have to be attuned to both what happens in, kind of, the research, academic space,
but also what happens in the software engineering space,
what software developers are actually facing,
and what the gaps between theory and practice are, if you will.
And I think that's something many of the listeners can relate to.
I definitely can relate to it, having had a grown-up job in the distant past.
And then, going into research, I definitely felt this gap acutely.
And I'm happy that there are researchers out there that are trying to tackle this gap in their research every day.
Leo, Rohan, thank you so much for joining us today.
I'm really excited to see what's next on your research agenda.
And I'm more than happy to have you again on the podcast with the next project or the next set of bugs you're going to tackle.
Guys, thank you so much.
Thank you very much, Bogdan, for inviting us.
This was a blast.
Yeah, thank you.
Thank you.
I'm very happy to be here.
Thank you.
