Disseminate: The Computer Science Research Podcast - Rohan Padhye & Ao Li | Fray: An Efficient General-Purpose Concurrency JVM Testing Platform | #66
Episode Date: October 6, 2025

In this episode of Disseminate: The Computer Science Research Podcast, guest host Bogdan Stoica sits down with Ao Li and Rohan Padhye (Carnegie Mellon University) to discuss their OOPSLA 2025 paper: "Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM".

We dive into:
- Why concurrency bugs remain so hard to catch -- even in "well-tested" Java projects.
- The design of Fray, a new concurrency testing platform that outperforms prior tools like JPF and rr.
- Real-world bugs discovered in Apache Kafka, Lucene, and Google Guava.
- The gap between academic research and industrial practice, and how Fray bridges it.
- What's next for concurrency testing: debugging tools, distributed systems, and beyond.

If you're a Java developer, systems researcher, or just curious about how to make software more reliable, this conversation is packed with insights on the future of software testing.

Links & Resources:
- The Fray paper (OOPSLA 2025)
- Fray on GitHub
- Ao Li's research
- Rohan Padhye's research

Don't forget to like, subscribe, and hit the 🔔 to stay updated on the latest episodes about cutting-edge computer science research.

#Java #Concurrency #SoftwareTesting #Fray #OOPSLA2025 #Programming #Debugging #JVM #ComputerScience #ResearchPodcast

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the Computer Science Research Podcast.
Hi, and welcome to Disseminate, the Computer Science Research Podcast.
I'm your host, still not Jack Waudby.
You may recognize me from the previous episode.
I'm, of course, yours truly, Bogdan Stoica, but you can call me both.
That's a mouthful.
Today, I'm joined by two guests, Rohan Padhye, who's an assistant professor at CMU,
where he leads the PASTA research lab, and Ao Li, a final-year PhD
student advised by Rohan.
They're here to talk about software testing woes, concurrency bugs,
and their most recent paper published at OOPSLA this year,
Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM.
Leo is also on the tenure-track job market this year,
so please check out his research and papers.
Reach out to him if you're on a hiring committee or if you know your department is hiring.
Rohan, Leo, welcome to the show.
Thank you very much.
Glad to be here.
Thank you.
So let's dig in.
I am always curious when we get new guests on the show to talk about how they actually found their way into their research area.
So for you guys, that is software reliability more generically.
So how did you end up in software reliability?
and why did you choose CMU, and why PASTA?
Yeah, so I've always been interested in doing research on anything to do with software,
mainly because it's easy to imagine myself as the customer or the end user for the research that I do,
because I build tools for software developers and I've been doing software development for a long time.
Specifically, the area of program analysis is something that I've been looking into for,
I think over 12 years now.
I think 12 years ago, I took a class called program analysis when I was a student in a master's program back in India.
And here I am today at CMU as a professor teaching a class called program analysis.
So this is something that I really enjoy doing.
And I've also worked at various different companies in different roles and have seen the kinds of code that actually runs in production and powers many of our software systems that we all rely on.
And so I sort of understand the importance
and impact of research that can improve software reliability.
CMU is really special.
You know, it's obviously a well-known university in computer science overall,
but specifically in the area of sort of software analysis,
it's very special because we have our own PhD program in software engineering.
CMU is also weird because we have many different departments within the School of Computer Science,
one of which, the one that I'm in, is called the Software and Societal Systems Department.
And so we essentially have the largest collection of researchers in the field of
sort of software analysis and engineering, probably anywhere in the world.
So it's really amazing, and we also get to work with amazing students, such as Leo,
on these kinds of fun projects, and that's what's really exciting.
Yeah, so for me, like, I started to work on software-related research starting in my undergrad,
where I was working with Karin, who was a professor at the University of Calgary.
And so we were building static analysis tools for Swift.
So that was a really fun experience because, for one thing, I got a chance to experience what the real-world bugs and issues are.
And also, I got a chance to contribute to open-source projects.
So I think that actually motivated me to continue to work in this area.
And also, I chose CMU.
Part of the reason is that CMU is a really big school,
where you can work on many different areas
and you can find experts in pretty much every domain.
So for me, I'm especially interested in software analysis,
but I also want to apply software analysis to real-world applications.
So I am currently co-advised by two professors:
one is Rohan Padhye, who's an expert in program analysis,
and I'm also advised by Vyas Sekar, who is an expert in
systems and network research.
So I think there are many places at CMU where you can actually find experts in pretty much every domain.
So that makes me think CMU is a great place to do this kind of research.
Sounds like you both have a similar path, working on program analysis during undergrad or during
your master's, then working in industry and on applied, more applied research projects,
continuing this line of research in your PhD and beyond.
So I'm curious: your work focuses on, or at least in recent years
has been focusing on, a particular type of issue, concurrency bugs,
which have been studied for decades.
You've both been working in this area for quite some time.
So why are concurrency bugs still a problem in industry today,
given all these extensive efforts,
and what's still out there to fix?
That's a great question, yeah.
And I think there's a two-part answer here.
Firstly, you know, concurrency bugs are just like harder than bugs in the sequential programs
for humans, for developers to a sort of reason about, mainly because of the way we're sort
of used to thinking about, you know, code.
We write code as, you know, like sequential lines in a file.
We sort of are used to imagining code executing as like traces of instruction by instruction.
concurrency bugs often arise due to, you know,
race conditions between, you know, various components interacting
in a way that's really hard for humans to imagine.
And therefore, they're rare.
They often escape testing,
unless you sort of are looking for them specifically.
And also, if you actually do encounter them,
they're really hard to debug because the order of,
the order of interaction between components
is not always very easy to sort of control or reason about.
That's from, you know, the point of view of the, you know,
the source of these bugs.
But as you said, like, you know,
researchers have been studying techniques
to, you know, detect and discover
and reproduce these bugs for decades.
Why are we still facing these issues today
in sort of most real-world applications?
Well, I think there's a little bit of nuance
to this aspect.
This is my personal opinion,
and I think an opinion that Leo also shares.
There's been a lot of work in academia
on developing algorithms
for, you know, finding various kinds of race-condition bugs,
everything from data races in multi-threaded code to, you know, race conditions in distributed systems.
But the real world code is complex. It's messy.
And often the challenge in finding concurrency bugs is not just about having the right algorithm
that can search the space in an efficient way, but it might sometimes be just having platforms
that allow you to run any of these algorithms in the first place.
a lot of concurrent programs rely on third-party libraries.
They rely on fancy features such as asynchronous computation and futures.
They might be running distributed systems.
They might depend on external services.
They might be stateful.
And a lot of the academic work sort of makes, in some cases,
simplifying assumptions about how these algorithms can run,
under what context they can run, when they will actually work.
And often the engineering of the practical tools is sort of left as an exercise for the
reader, let's say it that way. And so, you know, engineers who work on, you know, who build
real systems sometimes still find it hard to use any of this research that we have as a community
been doing for many decades in actually finding these sorts of race condition bugs in their
code. And so part of our motivation for this work was to bridge this gap. Yeah, that's a really
nice point. I think there's one thing I want to mention, which is that in many cases,
when we think about building a system or designing a new programming language,
In most cases, we think about usability, we think about performance.
But we don't really think about debuggability and testability at the very beginning.
So in many cases, testability and debuggability are actually an afterthought.
After we have already achieved the performance, after we have achieved the programming features,
then we think about how we can debug them, how we can test them.
So concurrency is actually a really good example
where people start to use concurrency to chase performance.
But after they achieve their performance goal,
they notice, hey, actually those concurrent programs
are extremely hard to debug and test.
So this is also my research goal.
I hope in the future we can bridge this gap
by helping developers think about how they can debug and test these kinds of programs more easily.
Awesome.
This resonates so much with my experience of having a grown-up job before joining my PhD program
and working in industry as a software engineer.
Leo, I think you hit the nail on the head.
Debuggability is an afterthought, oftentimes.
I would argue performance as well, but that's a separate discussion, maybe a separate
podcast episode. And yeah, I think, as a community, not only the research community but also the software
engineering community and the systems community, we need to start thinking about debuggability as built
into the system design process. And as you both mentioned, maybe have this as a platform,
or at least have these as abstractions that developers, and researchers
that are building research prototypes, can reason about.
Which brings me to my next question: is this the core idea behind Fray,
your most recent system that got published at OOPSLA this year?
The idea of Fray is very simple.
We just want to provide a general-purpose,
push-button tool for concurrency testing on the JVM.
The reason we picked the JVM is that we already had
lots of experience with Java and the JVM platform,
so we feel most comfortable about this.
And so we actually had a very simple goal for this paper,
for this project.
I see.
Rohan, I think, you know, for you, having worked in industry
and now being in academia,
you've seen both sides of the problem:
the struggle that researchers have with coming up with these tools,
making them more usable with fewer false positives and whatnot,
and then software engineers struggling to make sense of the output of the tools.
How does this new project bridge this gap?
I know, you know, from skimming through the paper
that Fray actually has real users.
And as Leo mentioned,
finds bugs in open source software
that software engineers care about.
Yes, absolutely.
And I think right from the outset,
our goal was to make the tool
that's sort of easy to use.
And so we upfront had very strict requirements
of not imposing a lot of manual burden
on users to set up their projects
to be able to use whatever solution
we're building for finding concurrency bugs.
So the way that Fray works is that
it just works off your existing unit tests.
So we actually sort of surveyed
a lot of real-world software projects,
things like Apache Kafka and Apache Lucene,
the Google Guava libraries, et cetera.
And we saw that it's actually quite common
for Java developers to write unit tests
that do have some sort of multi-threading in them.
So for example, I mean, I can sort of simplify this and say,
let's say you have a parallel sorting algorithm.
There are unit tests which would just say, you know, spawn a bunch of threads, you know, run your parallel sort,
wait for all the threads to join and then check that the result is, you know, it's a sorted list, right?
And so it's sort of a simple, you know, concurrent tests, but traditionally the way that these users have been checking the correctness of these concurrent algorithms is simply to run these unit tests multiple times, you know, sometimes called stress testing.
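To make this concrete, here is a minimal, self-contained sketch (plain Java, with invented names, not code from the paper) of the kind of concurrency unit test Rohan describes: spawn a few threads, join them, and then check that the result is sorted.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class ParallelSortTest {

    // Each worker sorts its own half of the shared array in place.
    static void sortHalf(int[] data, int from, int to) {
        Arrays.sort(data, from, to);
    }

    // Merge the two sorted halves into a fresh array.
    static int[] merge(int[] data, int mid) {
        int[] out = new int[data.length];
        int i = 0, j = mid, k = 0;
        while (i < mid && j < data.length) {
            out[k++] = data[i] <= data[j] ? data[i++] : data[j++];
        }
        while (i < mid) out[k++] = data[i++];
        while (j < data.length) out[k++] = data[j++];
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] data = ThreadLocalRandom.current().ints(1_000, 0, 10_000).toArray();
        int mid = data.length / 2;

        // Spawn two threads that sort disjoint halves concurrently.
        Thread left = new Thread(() -> sortHalf(data, 0, mid));
        Thread right = new Thread(() -> sortHalf(data, mid, data.length));
        left.start();
        right.start();
        left.join();   // wait for both workers before checking the result
        right.join();

        int[] result = merge(data, mid);

        // The property under test: the merged array must be sorted.
        for (int k = 1; k < result.length; k++) {
            if (result[k - 1] > result[k]) {
                throw new AssertionError("result is not sorted at index " + k);
            }
        }
        System.out.println("sorted OK");
    }
}
```

Running such a test many times in a row is exactly the "stress testing" approach mentioned above; it only exposes a bug if the operating system happens to schedule the threads in an unlucky order.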
So what we wanted to do is to leverage these sorts of existing tests, right, and not have developers, you know, write any sort of new
information in terms of like manual annotations or having to learn a new DSL. We just wanted
to take these existing tests and say we have a tool that will allow you to run this test
in a way that you can specify how the various threads in your application would interleave
and you can run one or more search algorithms. So it could be to systematically search for
all possible interleavings, which really doesn't scale, or maybe just do a random search,
or run one of these fancy algorithms that academics have been developing over the decades,
like partial order sampling
or some other well-known algorithms.
And so all of the, you know,
the magic is sort of under the hood.
From the user's point of view,
they're just running their existing tests.
They're pressing a button that says run them with Fray.
And then if we find some thread interleaving
that causes their test to fail,
either their assertion fails,
they get an unexpected exception,
or their code just hangs with the deadlock.
Then we give them a replay file that says,
here's a file with which you can actually
reproduce exactly the same behavior
that you saw when running with Fray.
And so that would rerun their existing test with a specific sequence of
thread interleavings that will show them how that buggy behavior manifests. And so we found this
paradigm to be very natural for users to leverage. And that, I think,
helped us and the tool gain traction really quickly with real engineers. So what you're saying
is you built a concurrency debugging platform as a service, right? Yeah, something like that.
Yeah, no, that's that's much, much needed in both in industry and in our research community.
One of the woes in working on concurrency bugs and trying to build on others' tools is, well, of course,
concurrency bugs are non-deterministic, but it turns out that the bug-finding tools are also non-deterministic and hard to work with.
And it sounds like Fray is trying to bring some much-needed predictability to this space.
You mentioned real users and open source software that the community cares about. But before
we get into the practicality and usability of Fray, I wanted to take a step back and maybe tell the listeners
a bit about what type of concurrency bugs you tackle. Is it all types of concurrency bugs or a very
particular subset? And, you know, as a person that has written software programs before
and is familiar with having to write tests (you also mentioned, Rohan, that developers usually
write tests with some concurrency built in), why aren't those sufficient? So I guess,
what type of bugs does Fray tackle, and why can't they be surfaced by the existing infrastructure that software engineers have built?
So with Fray we're looking for, you know, race condition bugs, which broadly defined just means some undesirable behavior that manifests under some
thread interleavings, but not necessarily all of them. We don't actually have any sort of special logic or semantics to say we can detect a specific kind of bug in your program.
We expect that the existing tests that the developers have written have some sort of, you know, assertions or some property that they're checking that ensures that the software is
behaving correctly. But from experience, the kinds of bugs that we actually do end up uncovering
deal with things like atomicity issues or ordering violations. So atomicity issues are those
where a program is maybe performing a set of operations that ideally should all occur without
any other interference. But in practice, those operations could actually interleave with
other threads that are interacting with shared memory. A classic example of this is,
even if you use something like a ConcurrentHashMap in Java,
which is a thread-safe data structure,
a common sort of buggy pattern is where you might check,
is there an entry for this hash map at some key,
and if it does not exist, maybe I will write a new entry.
But in between these two operations,
another thread could come in and interleave
and sort of manipulate the hash map
in a way that you actually get undesirable effects.
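As a minimal illustration of this check-then-act pattern (the class and method names here are ours, invented for illustration, not taken from the bugs Fray found):

```java
import java.util.concurrent.ConcurrentHashMap;

public class CheckThenActExample {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    // Buggy: each individual map call is thread-safe, but the check and the
    // write are not atomic together. Two threads can both see "no entry",
    // both write 1, and an increment is lost.
    void incrementBuggy(String key) {
        if (!counts.containsKey(key)) {
            counts.put(key, 1);
        } else {
            counts.put(key, counts.get(key) + 1);   // read-modify-write race as well
        }
    }

    // Fixed: push the whole read-modify-write into a single atomic map operation.
    void incrementAtomic(String key) {
        counts.merge(key, 1, Integer::sum);
    }
}
```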
The other thing is things like ordering violations,
which means that applications expect a certain sequence
of operations to be run in a certain order, but actually there might be some sort of
interleavings where you actually end up with a violation of whatever protocol the application
expects. And this often arises because the developer may have had some assumption or
understanding of what state the program is in after a certain sequence of operations, but
maybe that state either gets corrupted or you sort of end up having a state transition that was
not in your original protocol design. And so often these are like design bugs, which are
not very trivial to fix.
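A tiny sketch of such an ordering assumption, with one way of making it explicit (the names are illustrative; the right fix always depends on the application's actual protocol):

```java
import java.util.concurrent.CountDownLatch;

public class OrderingExample {
    private static String config;                            // set by the init thread
    private static final CountDownLatch ready = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread init = new Thread(() -> {
            config = "loaded";                               // the step the worker depends on
            ready.countDown();                               // publish "initialization done"
        });
        Thread worker = new Thread(() -> {
            try {
                // Buggy version: read `config` directly and assume init already ran.
                // Correct version: wait on the latch so the order is guaranteed.
                ready.await();
                System.out.println(config.length());         // safe only after await()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        init.start();
        init.join();
        worker.join();
    }
}
```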
We sort of distinguish these kinds of bugs from simple data races.
So a lot of academic work has focused on this data race detection.
So, you know, just for the benefit of our listeners, we can
precisely define a data race as, you know, a situation where you have more than one thread
accessing the same shared memory location, like, let's say, you know, a field of an object
or a global variable.
And at least one of those threads is performing a write operation.
So you have like a read-write or a write-write conflict.
And there's no synchronization between these threads, right?
So there's no use of locks or no use of special keywords like volatile.
And these kinds of data races are bad for many reasons.
And there's a lot of research focused on trying to find these data races.
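For listeners who want the textbook picture, here is a minimal example of such an unsynchronized conflict (two threads doing an unsynchronized read-modify-write on a shared field):

```java
public class DataRaceExample {
    // Shared field with no locks and no `volatile`: two threads write and
    // read it concurrently, so this is a data race.
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++;               // unsynchronized read-modify-write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Often prints less than 200000 because increments are lost.
        System.out.println(counter);
    }
}
```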
In Fray, we took the conscious decision to say we don't actually care about data races.
We're actually trying to find, we're trying to find concurrency bugs and programs that
otherwise may be well synchronized.
And there's a couple of reasons for this.
One is that there are lots of data race detection tools out there.
And secondly, it's also not that hard to sort of fix data race bugs, right?
If you find the data race, usually the solution is add some locks,
add some synchronization, and you're sort of done.
And so we were targeting these harder bugs, like protocol-level bugs
or where the issues are usually more in the design aspect.
And these are trickier for developers to reason about and to debug and to fix.
And so our tool sort of helps bridge that gap.
I see.
So, and feel free, both of you, to chime in here.
To me, it's a bit surprising every time I read about concurrency bugs in Java.
Because Java has such a wide array of features that help developers write and reason about concurrency in their program.
But still, at least in my limited interaction with, you know, C, C#, Java, Python,
I don't feel that there are fewer concurrency bugs in one programming language than in another.
Why do you think that despite all these new features and useful features in Java,
we still face similar concurrency bugs that I used to face when I was writing C?
I think one reason is that concurrency bugs manifest because of timing.
And it's really hard for a developer to reason about timing,
especially when they interact with large and complex systems. So for many bugs Fray found, it was actually, like, one thread is touching a state,
while another thread, which is very far from the current thread (by far, I mean, like, the file
that implements that thread is not located in the same package or in the same class,
and also the time the other thread is created is very far from the
current thread), is touching it too. So, like, understanding the complex interleaving across different
threads in, you know, a complex system requires a developer to memorize, like, all kinds of
interactions across files, across threads. So I feel this part is very challenging. And another
thing is that this is non-deterministic, meaning that after I implement my feature, I can run the
program, I can run the test, but the bug will not, like, appear immediately.
It may only appear when your server is under a high workload, when your CPU is saturated,
or when you run it on very special hardware.
So I think all of these factors add to the complexity of debugging a concurrent program.
And if I can add to that, Bogdan, you mentioned the word performance.
some time ago.
And while Frey doesn't specifically look for performance bugs,
I think one of the reasons why we have so many concurrency bugs out there,
even in, you know, languages that have nice abstractions for writing current code,
is that developers often end up doing really crazy things to squeeze performance out
because there's a trade-off between, you know, clean abstractions and performance in many cases.
And so if you want performance, you sometimes have to, you know,
write a little bit more low-level code
than what some of the high-level libraries
provide. The kind of APIs
that high-level libraries provide. And so a lot
of the bugs that we've seen have been in sort of these
infrastructure layer libraries, things like
Kafka or like Google Guava, which is a set
of libraries, which are trying to provide
higher-level functionality to their clients,
but they do so using
slightly more high-performance
algorithms that deal with, like, low-level
primitives. And that's sort of where you run
into the same kind of trap that you might
run into if you were, you know, trying to write
concurrent C code, as you said.
Right, right.
So timing issues being non-deterministic and trying to squeeze performance are definitely,
definitely factors that contribute to having more concurrency bugs.
And I feel that as you both mentioned earlier, that, you know,
Frey is not actually trying to specifically trigger a particular pattern or a particular bug,
but rather be a little bit more broad.
Having said that, Fray finds different categories of bugs.
And I'm curious if, you know,
if you could first maybe talk about the high-level design of Fray
and what's the process of finding bugs:
how does Fray help developers find bugs?
Maybe not necessarily in terms of techniques,
but rather in terms of abstractions.
So, yeah, actually, again,
like we are trying to make Fray push-button
and easy to use. And to achieve this goal,
we put in lots of effort to make Fray easy to use.
And currently, if your project uses Gradle or Maven,
we already provide a Gradle and a Maven plugin,
and it will set up the Fray tool for you.
and it will set up the Frey tool for you.
And the only thing you need to do is that if you're using JUnit, you can replace the test annotation
with a concurrency-test annotation.
And if you are using another testing framework, we also provide a wrapper so that you can launch
Fray with a simple method call.
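Based only on the description above (swap the JUnit @Test annotation for a concurrency-test annotation), a usage sketch might look roughly like the following. The annotation name and the import path are assumptions made for illustration, so check the Fray documentation and README for the actual API.

```java
// Hypothetical sketch: a JUnit-style test whose @Test annotation has been
// replaced by Fray's concurrency-test annotation. The import below is a
// placeholder, not the real package name.
import org.example.fray.ConcurrencyTest;   // placeholder import for illustration

import java.util.concurrent.ConcurrentHashMap;

public class CacheTest {

    @ConcurrencyTest   // Fray would explore thread interleavings of this test body
    void concurrentPutIfAbsent() throws InterruptedException {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        Runnable writer = () -> {
            if (!map.containsKey("k")) {   // check-then-act pattern under test
                map.put("k", 1);
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        a.join();
        b.join();
        if (map.get("k") != 1) {
            throw new AssertionError("unexpected value: " + map.get("k"));
        }
    }
}
```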
And the way Fray works is that Fray will instrument your application with small locks.
And those locks will cooperate with each other so that
only one thread runs at a time.
This allows us to exhaustively explore all
possible thread interleavings in your application.
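The hand-off pattern Ao describes (only one application thread runs at a time, with a central scheduler choosing who advances to its next scheduling point) can be sketched with plain semaphores. This is not Fray's actual implementation, just a minimal illustration of the general mechanism under that assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Semaphore;

public class CooperativeSchedulerSketch {

    static class Worker {
        final Semaphore mayRun = new Semaphore(0);   // scheduler -> worker
        final Semaphore yielded = new Semaphore(0);  // worker -> scheduler
        volatile boolean finished = false;
    }

    // Called by instrumented code at every scheduling point
    // (e.g. before a lock acquire or a shared-memory access).
    static void schedulingPoint(Worker self) {
        self.yielded.release();                 // hand control back to the scheduler
        self.mayRun.acquireUninterruptibly();   // wait until we are picked again
    }

    public static void main(String[] args) {
        List<Worker> workers = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            Worker w = new Worker();
            workers.add(w);
            final int id = i;
            new Thread(() -> {
                w.mayRun.acquireUninterruptibly();  // wait for the first grant
                for (int step = 0; step < 3; step++) {
                    System.out.println("thread " + id + " step " + step);
                    schedulingPoint(w);             // yield at each step
                }
                w.finished = true;
                w.yielded.release();                // final hand-back to the scheduler
            }).start();
        }

        Random rnd = new Random(42);                // fixed seed: the schedule is replayable
        List<Worker> runnable = new ArrayList<>(workers);
        while (!runnable.isEmpty()) {
            Worker next = runnable.get(rnd.nextInt(runnable.size())); // pick one thread
            next.mayRun.release();                  // let it run to its next point
            next.yielded.acquireUninterruptibly();  // wait until it yields or finishes
            if (next.finished) runnable.remove(next);
        }
    }
}
```

Because the scheduler, not the OS, decides who runs next, recording the sequence of choices (or the random seed) is enough to replay an interleaving deterministically, which is what enables the replay files discussed earlier.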
But as we mentioned before,
systematically exploring all possible thread interleavings
is really hard and really time-consuming,
especially for real-world concurrent programs.
So we also provide more advanced search algorithms,
like PCT, POS, and SURW. Those are all well-studied
concurrency testing algorithms, and they are proven to be more efficient in terms of bug finding.
So in this case, we believe finding bugs is more valuable than proving correctness,
because we're positioning Fray as a useful bug-finding tool, similar to fuzzing.
So once a bug is found, a user can deterministically replay that bug.
So we also actually implemented a debugger plugin inside IntelliJ,
and a user can visualize the entire timeline of each thread
and can understand how different threads interact with each other
in a way that leads to the final bug or final failure.
So, yeah, we are actually doing more at the current stage
to make Fray easier to use,
and we are constantly improving the user experience of using Fray.
Awesome.
You mentioned in the very beginning,
and this is super useful.
Thank you.
Thank you for walking through the use cases of Fray.
But you mentioned something very interesting.
It was actually two things,
but I'll start with the one that I find most interesting.
You said that, you know,
concurrency testing is always
trying to balance the tension between coverage and state explosion.
And whenever we're trying to explore thread interleavings, you end up with state explosion.
Could you tell us a little bit more about the trade-offs you had to make
in order to balance this tension?
Yes.
So we actually made lots of efforts trying to reduce the search space,
actually starting with the data-race-free assumption.
This is already an optimization we wanted to make
to reduce the search space.
Once we have the data-race-free assumption,
we only need to perform thread interleaving
before concurrency primitives.
So this actually aggressively reduces the search space for us,
but it still provides a soundness
and completeness guarantee for us,
meaning that, like, every time
Fray finds a bug, this bug can appear
in a real-world scenario in the
original program, and
for every possible
concurrency bug in the
real-world application,
Fray can find some
thread interleaving that triggers
that bug. So this is one
thing we did to reduce the
search space, but even with that,
systematically exploring
thread interleavings is still challenging.
Like, for example, for many Lucene and Kafka bugs,
even running one iteration takes a couple of seconds.
And think about the search space:
there are around a million thread interleaving points
in one single test run.
But we still want to find bugs faster.
Doing a random walk is, like, the baseline we have:
at every point you can randomly pick a thread to run.
But as we all know, concurrency testing is actually a crowded area
and has been studied for many years.
And there are already very nice algorithms that are proven to be more effective in finding bugs;
it's just that there's no practical framework for developers to use them.
So now we actually bring those algorithms to the
developers, and developers can just use those algorithms to test their concurrent programs and
find bugs much more efficiently. Even better, we actually designed Fray in a way that makes implementing
a new concurrency testing algorithm very easy. So we provide a uniform interface for developers
to implement them. We already implemented many of them, so we know that implementing a new
algorithm is actually very easy inside Fray.
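To illustrate what "a uniform interface plus pluggable algorithms" can look like, here is a hypothetical sketch. Fray's real interface will differ; this only shows that, in this style of framework, a new search algorithm essentially boils down to choosing the next thread to run.

```java
import java.util.List;
import java.util.Random;

// Hypothetical scheduling-strategy interface; names and shape are assumptions
// for illustration, not Fray's actual API.
interface SchedulingStrategy {
    // Given the currently runnable threads, choose the one to run next.
    int pickNext(List<Integer> runnableThreadIds);

    // Called when a new exploration (test iteration) starts.
    default void reset() {}
}

// Random walk: the simplest baseline mentioned above.
class RandomWalkStrategy implements SchedulingStrategy {
    private final Random rnd;

    RandomWalkStrategy(long seed) {
        this.rnd = new Random(seed);   // a fixed seed keeps the schedule replayable
    }

    @Override
    public int pickNext(List<Integer> runnableThreadIds) {
        return runnableThreadIds.get(rnd.nextInt(runnableThreadIds.size()));
    }
}
```

Algorithms like PCT or POS would then be further implementations of the same interface, each encoding a different policy for which thread gets to advance.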
Awesome. So Fray is not only a bug-finding platform where you as researchers have already
implemented several techniques that could trigger concurrency bugs, but it's built in such a way
that you allow developers to write their own random walks if they want, or their own fault
injection, I'm assuming, or delay injection, whatever they feel is
appropriate for their scenario, which I think is great.
It's something that I guess both Leo and Rohan were mentioning in the beginning:
not only the usability, but the flexibility of those platforms, and the fact that developers can now
implement something that they know, or that is much easier for them to understand, or maybe
very specific to their scenario, which I guess makes Fray a very versatile platform.
So Fray uses a collection of algorithms and techniques to surface all these concurrency bugs.
But I'm curious, how do you come up with the workload to run the programs and get them into a faulty state?
What are you using for inputs?
Are you relying on existing tests, or are you hoping that developers will write their own tests using Fray?
Can you walk us through this?
Yeah, so this actually goes back to the motivation of this project.
So as Rohan had mentioned, we explored these popular concurrent Java programs:
Kafka, Lucene, and Guava.
There are already plenty of concurrent unit tests running in their CI pipelines every day.
So our goal was not to provide a framework so that developers can write
new concurrency tests. Our goal is just to implement a platform that can reuse those
existing concurrent tests. And it turns out these concurrency tests are actually sufficient
to test many concurrent features inside the existing frameworks; they were just lacking a better
testing framework that could surface those bugs. So in our evaluation, we found more than 18
bugs just by rerunning all the existing concurrency tests in those large-scale applications.
So that's what we did for Fray.
Yeah, if you want to, I can provide some numbers from our paper.
We actually saw that there were 2,600 or so unit tests across the three projects I mentioned
before, Apache Kafka, Apache Lucene, and Google Guava.
These are widely used, you know, sort of software libraries or frameworks.
and 2,600 of their unit tests had some sort of concurrency in them.
So they spawned some threads, did some work, joined the threads, and checked the result.
And these pass in, you know, CI usually every day, as Leo mentioned.
But with Fray, we found that there were 363 of these tests,
so that's a pretty large fraction, I guess somewhere between 10 and 15% of
these tests, that could actually fail given a certain thread interleaving,
which we can now, you know, deterministically replay and show the developers:
a specific interleaving under which either the assertion fails, or you get an unexpected
null-pointer exception, or you get a deadlock and the whole test hangs. And so this was definitely
very surprising to the developers, which is why, you know, we got some traction in having many of
these bugs fixed. So I mean, there were lots of tests failing, but I think the root causes, as Leo
mentioned, were around 18 unique bugs that we filed in these repositories. But going
back to your original question, Bogdan, these tests already were pretty good,
And the developers wrote these tests as a way to cover various functionality in their code.
And so they sort of pick their own loads, so to say.
And Fray just helps, you know, find the right interleaving that can maybe find some bug.
What could be challenging is if you actually end up having a load that's intentionally designed to be like a stress load, right?
Where, if you're spawning like tens of thousands of threads or you're, you know, running a very large workload in a loop,
that can actually become a little bit problematic for Fray,
because the state space to search would really explode. And so we sort of focus
on these unit tests, not end-to-end system-level stress tests, for searching over
interleavings. Thank you, Rohan, for mentioning this. Actually, you know,
one of my next questions was about Fray's impact, but I'll hold off on that a little
bit. You both mentioned two interesting things, and I guess, you know, this is interesting to me,
but I hope listeners also find it interesting.
The fact that these software applications
already ship with tests
that are meant to exercise concurrent parts
of the application.
So developers are aware of at least some bug
or failure patterns
that could surface in their application
while it runs in production.
So they try to get ahead of this
and mitigate it.
Still, you know, writing tests is not sufficient.
And I think this is exactly what Leo was mentioning,
a little bit in the beginning,
that it's just incredibly, incredibly challenging
to reason about concurrency bugs by just looking at the code.
You have to keep in mind all the various timing and interleavings
that could happen, could occur in practice.
And, of course, that is something very complicated to do
in a complex software.
You also mentioned that Fray finds a lot of bugs, but they map to about 20, you said, about 20 fundamental root causes.
I would be curious, you know, both Rohan and Leo, can you tell us a little bit more about these root causes, what is surprising about them, and why do you feel, you know, developers are still falling for these pitfalls, if you will?
So, yeah, as you had mentioned, to be honest, there's nothing new about
these bugs; those bugs are already well-studied. Like atomicity violations, where
basically the developer assumes that a certain operation is atomic, but apparently it's
not, and when two threads are touching the same operation, bad things happen.
And there are also order dependencies, where, like, developers assume there is a certain order
in which specific operations will happen.
But in fact, if you run them in a concurrent setting, the order is not always guaranteed.
I think the surprising thing we found is that you don't really need a sophisticated
search algorithm to find many bugs.
Many bugs can surface even with a random walk
with a very small number of iterations
(here, by iteration I mean different interleavings).
It's just that if you run them in a normal CI environment,
you don't have access to this controlled concurrency testing environment.
You're running that on a normal Linux machine, with the normal kernel scheduler, and that bug won't appear,
just because how the kernel schedules your program is different from how a controlled concurrency testing framework will schedule your program.
So this also highlights the importance of having this sort of framework: even though we have so many other methods and techniques and search algorithms,
having this kind of framework is still important to surface that many bugs.
Yeah, I agree with everything that Leo said.
I think one other sort of surprising code pattern that I saw in some of the bugs that we found
was this notion of spurious wake-ups. Essentially, if you have code that is
suspending a thread, either using APIs like, you know, Object.wait in Java, or it could be Thread.park, or it could be
Thread.sleep, there's often an assumption that some developers make that once the thread
sort of wakes up from this blocking call, the state has transitioned to something good,
like the thread was woken up because, you know, there was a resource that it was waiting
for and now that resource is available, and it proceeds to do something with that assumption.
But it could be the case that the thread actually is woken up before the resource was made
available. And this is a very nuanced situation, because it requires sort of understanding
how the suspend and resume APIs for things like, you know,
Object wait/notify or Thread park/unpark in Java actually work.
It turns out that the JVM doesn't really give you many guarantees about when these
suspended threads wake up. And so you actually, in theory, could wake up from these pauses
even before the resources that you were waiting for were actually made available.
And so the correct code pattern to use is always to check, after your thread is
woken up, that you are in the state that you expect to be in. So simply
having a check or an assertion would have prevented some of these errors.
But in the absence of those checks, you know, the code just keeps going.
And often, you know, it might just corrupt some data or try to read from a null pointer,
and the program would usually crash.
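The standard defensive pattern Rohan alludes to is to re-check the condition in a loop after every wake-up rather than assuming the resource is ready. A minimal sketch with a single-slot hand-off cell:

```java
public class BoundedCell {
    private Object value;                       // the "resource" a consumer waits for

    // Correct pattern: re-check the condition in a loop after every wake-up,
    // because wait() may return spuriously or before the resource is ready.
    public synchronized Object take() throws InterruptedException {
        while (value == null) {                 // a loop, NOT an "if": guards against spurious wake-ups
            wait();
        }
        Object v = value;
        value = null;
        notifyAll();                            // wake producers waiting for an empty cell
        return v;
    }

    public synchronized void put(Object v) throws InterruptedException {
        while (value != null) {                 // same looped check on the producer side
            wait();
        }
        value = v;
        notifyAll();                            // wake consumers waiting for a value
    }
}
```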
I see. Yeah. So I think this is one of the features that I love the most about
Fray and reading your OOPSLA paper:
the fact that it proves time and time again that you don't need
overly complicated algorithms to find these concurrency bugs, and that
simple algorithms can find many of these bugs. And at the same time, it also reminds us
how difficult it is, going back again to what you both said in the beginning, going back
to this challenge of reasoning about time and timing
in your concurrent application.
And I think, you know,
this is what makes Fray so attractive
and so useful.
And I just have to mention this for the listeners.
Fray not only finds bugs
that happen in open source software
that developers care about,
it actually started to build a bit of a community around it.
I peeked a little bit on Leo's website
before this podcast,
and I saw a bunch of
user testimonials.
So maybe, Leo, can you tell us a little bit more about this ongoing adoption effort,
if you will, going from a research prototype and a research paper to an open-source
tool that attracts a community around it?
Yeah, definitely.
So I'm always interested in building tools for real-world developers.
And I learned a lot from developers as well.
So this is also one of the goals of the Fray project itself.
We are not only building a concurrency testing framework for the research community;
we are also building a tool that developers can use.
So that's also an ongoing effort.
So we released Fray as an open-source project,
and we also encourage developers to use Fray,
so we wrote a couple of tutorials and blog posts
about how Fray can be used to test concurrent programs
and how developers can integrate Fray
into their testing pipeline.
And also, let me add a quick call for contributions here.
If you are interested in Fray
and you want to contribute, definitely
email us or, like,
browse the GitHub issues and grab anything you can help with.
That is definitely welcome.
And listeners will have the opportunity to do that.
If you go to the episode page, you will find all the relevant links: to the paper, to GitHub,
to Leo and Rohan's web pages where you can contact them, and, yeah, join the Fray, if you will.
Rohan, how do you decide after all this, you know, after all this experience both in academia
and industry and working in software reliability for so long, how do you decide a research
prototype or research idea is ready to ship as a tool that software engineers can actually use?
Yeah, so that's a good question, especially because, you know, as researchers and academics,
We sometimes, you know, fall into this trap of, you know, iterating on something until, you know, we think it's perfect, and we set really high bars for ourselves and never get there.
So I think it does require a balance.
I mean, you know, the more pragmatic answer is we obviously, you know, we obviously have like publication deadlines.
And so we, you know, at some point we need to, you know, find like a logical point for the project to be considered as, you know, mature, ready to write up, ready to publish, let's say, evaluation results.
But I think the question you're asking is more from the point of view of how do we actually ship the software to users and say, hey, you know, it's ready to be used.
I think it requires several steps, right?
I mean, in our case, we, you know, we had some early versions, like almost like beta versions that we, you know, we made sure they work on real world code.
I think our main target was that we wanted it to be able to work, you know, in a push-button fashion on large-scale software targets like Apache Kafka, which is not at all trivial to do.
And then also once we're able to do this for a few projects,
we also checked it on, you know, a sort of unseen validation set, right?
Does the tool also work on other kinds of targets
that we haven't explicitly considered when developing the tool?
I think that's a good benchmark to know that the tool is ready,
that we don't have to keep, you know, iterating on our tool
with every new target that we try to evaluate on.
And once we sort of reach some convergence,
where every new target we consider our tool runs on just fine,
I think that's good to go.
But I think the other thing also is that the students in my group spent a lot of time
interacting with open-source developers.
Like, for example, Leo spent a lot of time debugging all of the issues found by Fray in open-source software,
you know, creating, you know, easy to understand, you know, bug reports,
providing more information to the developers to get those bugs fixed.
I believe we also gave, you know, a talk at some companies that were supporting this open source development.
And I think I would encourage other, you know, researchers in academia who care about, you know, the use of their artifacts to also consider these things that are not sort of strictly required from an academic point of view, right?
Like, you know, you don't really get as many, you know, beans, let's say, to count (for those who count beans on folks' CVs) beyond your research publication.
But if you do care about the community, if you do care about your tool being useful, spend the time actually interacting with, you know, the open-source community, get feedback, you know,
write blog posts, go and give talks at developer conferences and meetups, participate in interviews for podcasts hosted by, you know, eminent publishers of tech media, and things like that, right?
And just to make sure that, you know, you get feedback.
And I think as long as you're open to that feedback, as long as you're open to allocating some time and iterating on the artifact, even after
the, you know, the paper is published and you move on to other projects, I think that's
important. And then that's helped us in not only this project, but other projects as well.
So thank you, Rohan, for saying this. I think many of the listeners will take all of this
advice to heart. And it's certainly something that I felt in my own research that, you know,
it's nice to have a paper accepted and counting beans, as you put it. But it's also
equally important, not necessarily to show others your impact, but at least to have an
idea yourself of what kind of impact the tools you're building have. And this kind of guides you,
maybe this could guide your next move or your next iteration of your research. I'm curious what
you think about our research community and how our research community can better support these
efforts, because, as you say, we love counting beans, and that's not
necessarily bad, I think it's just a fact of life. But I'm curious, from your perspective, can our
community do more, and what should that more be? Yeah, I think that's a great question,
and it's an important discussion in general. I think valuing, you know, contributions
beyond just the traditional academic beans of things like publications and citations
is really important, and our community should strive to do that.
And I have seen changes, especially in like the systems community.
There is, in at least some subsets of the community, a lot of respect for, you know,
researchers who are actually able to deliver on artifacts and sort of show real-world impact.
I know this is true at CMU for sure.
Like, you know, in our processes, everything from faculty hiring,
you know, to promotions and tenure, you know, the university has a policy of taking into account
all different kinds of impact, not just, you know, traditional research publications. And I think
that also stands by the recommendations of, you know, the Computing Research Association and
some other sort of standards. So I think as long as, you know, the community embraces this kind of
work, I think we will make more progress, you know, moving forward not just as a field in terms of science,
but also in terms of acceptance by the actual, you know, consumers of our scientific innovations,
in this case, you know, software engineers or even the end users of the software systems
who care about the reliability of these systems.
That's great to hear: that not only, you know, individual researchers care more and more
about these efforts, but also, you know, hiring committees and promotion committees are taking
into account impact more and more. And hopefully this will spark or encourage more groups
to at least publish their artifacts and have their artifacts out there.
And you mentioned that the systems and software engineering communities and other sub-communities in
computer science have these ongoing efforts: for every conference,
there is a parallel track where you can get your artifact evaluated, and that definitely
helps a lot. So, which brings me to my next question: what is the next step in developing and
productizing Fray? Will there be more extensions, follow-ups, maybe replicating the same ideas for other
types of bugs? What's next? Yeah, so there are many things. I may have a different vision
than Rohan, but I can start with mine. One thing I noticed
is not about
bug finding, but about debugging.
Even though Fray found so many
bugs, for each of them
debugging still takes effort.
It's just because the software is complex,
the interleaving is complex,
and reproducing
certain bugs takes
many steps,
and even
letting the developer fully understand
what happened behind the scenes is still challenging.
So this means
we need
to design better debugging algorithms here. What I mean is similar to fuzzing,
where we have input minimization, right? So maybe for concurrency tests, we also need to have some
sort of input or scheduling minimization for concurrent programs to help the
developers to debug them. On the other hand, there are also UI and UX challenges: we have
debuggers, we have
IDEs for sequential
programs, but we don't really have
a debugger or an IDE
for concurrent programs.
So we are also thinking about
how we can build
better visualization tools for
developers to debug concurrent
programs. So yeah, this is
something I feel is important.
Yeah, I totally agree with
the debuggability aspects. I think that
can be improved. Another thing that we are
sort of excited about as sort of
the next step beyond Fray
is thinking beyond this single-process
multi-threaded program,
which is what Fray currently targets,
but also looking at distributed applications,
so things like, you know,
Apache ZooKeeper or Cassandra.
These distributed systems are all sort of still within Java,
and they are still
susceptible to, like, race condition
issues.
Often these race conditions
are rooted not just in
inter-thread interleaving, but
also in interleaving across, like, messages over the network, you know, through the distributed
system. And so we have some ongoing, you know, work on trying to use some of the core ideas
from Fray to also test distributed systems. So, you know, we can apply these, you know,
concurrency testing algorithms not just on thread interleavings, but also on the interleavings
of network messages. And that's sort of a work in progress. So, you know, maybe we'll come
back on this podcast once, you know, those results are a bit more mature.
And yeah, that's the direction I'm very excited about.
I also like that Leo started his answer by saying that he might have a different vision than me.
I think he's now going on the tenure track job market.
So this is very apt for him to have his own independent vision.
And I also would be excited to see how he starts his own research group and where he takes things from there.
I'll keep a close eye on your lab and Leo's new lab,
and I'm sure you'll have more than a handful of offers beginning next year.
And I'm also excited for what his vision is about debugging in general
and how developers and researchers should think about finding bugs when building software.
And this is an excellent segue to sort of my last set of questions,
which revolve around creative process.
And I think many listeners and myself included
always struggle with coming up with new interesting ideas
that a community cares about
or finding research problems that you can actually tackle
in a few years or a grand cycle, right?
And it doesn't take a dozen years to solve.
Or you have to wait for a new,
a new math to be invented to solve.
So I was curious, I'm curious, how both of you,
how do you come up with these ideas beyond, you know,
just reading papers and maybe using your industry experience?
What's your creative process?
How do you scale a problem,
to select a problem for an appropriate project,
PhD project-wise scale?
I just hire amazing students
and I
delegate the task to them,
so maybe Leo can
explain this. The whole Fray project was, I think,
born out of Leo's observations
with
concurrent programming in Java and
debugging. He's very tuned in
to things that the
open source community is working on and
he's good at identifying
the gap between
academic research and
practice. And so
a lot of the main motivation was treating, you know, applicability to developer targets
as, like, a first-class research problem, which is what Fray really does, right?
We didn't invent any new algorithms or, you know, research techniques.
We're just building a new platform.
That entire sort of process and that entire line of thinking, you know, came out of, you know,
Leo's experience, and I'm lucky to work with amazing students such as him.
So I'll let Leo answer this in more detail.
Yeah, thank you, Rohan. Thank you. So actually, when I started working on this kind of research, I also started by designing new algorithms. So I did my internship at Amazon three years ago, where we were designing a better concurrency testing framework inside P. So P is a framework for modeling and testing distributed systems. And because it's testing distributed system models, it has full control over
the message processing, concurrency, and, like, execution.
So that was cool, and that was a very fun experience.
And that also led me to learn that controlled concurrency testing is a really powerful tool for
debugging (sorry, for testing) concurrent applications.
But if you look at the real world, people write C, C++, Rust, Java, C#.
They don't really write P.
They only model their application in P,
but when they write the real-world application,
they still use these commercial or, like, real-world languages.
But we don't see this kind of thing exist in these popular languages.
So even though there are many research papers
talking about this, developers are still using stress testing.
Developers still just write some unit tests and hope that
will help them to find the concurrency bug.
So having been exposed to the power of, like,
controlled concurrency testing, and also having observed this gap between the academic research,
which we all know about, and the real world, is what helped us to solve this problem.
Right. So we all know that controlled concurrency testing is a good thing, and the goal is just to bring controlled concurrency testing to the developers.
Awesome. Awesome. So basically what you're both saying is, in order to find these interesting problems that the community cares about, you really have to be attuned to both what happens in, kind of, the research, academic space,
but also what happens in the software engineering space,
what software developers are actually facing,
and what the gaps between theory and practice are, if you will.
And I think that's something many of the listeners can relate to.
I definitely can relate to it, having had a grown-up job in the distant past.
And then, going into research, I definitely felt this gap acutely.
And I'm happy that there are researchers out there that are trying to tackle this gap in their research every day.
Leo, Rohan, thank you so much for joining us today.
I'm really excited to see what's next on your research agenda.
And I'm more than happy to have you again on the podcast with the next project or the next set of bugs you're going to tackle.
Guys, thank you so much.
Thank you very much, Bogdan, for inviting us.
This was a blast.
Yeah, thank you.
Thank you.
I'm very happy to be here.
Thank you.
