Microsoft Research Podcast - Abstracts: November 4, 2024

Episode Date: November 4, 2024

In their 2024 SOSP paper, researchers explore a common, though often undertested, software system issue: retry bugs. Research manager Shan Lu and PhD candidate Bogdan Stoica share how they're combining traditional program analysis and LLMs to address the challenge.

Read the paper

Transcript
Starting point is 00:00:00 Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. Today I'm talking to Dr. Shan Lu, a Senior Principal Research Manager at Microsoft Research, and Bogdan Stoica, also known as Bo, a doctoral candidate in computer science at the University of Chicago. Shan and Bogdan are co-authors of a paper called If at First You Don't Succeed, Try, Try Again: Insights and LLM-Informed
Starting point is 00:00:46 Tooling for Detecting Retry Bugs in Software Systems. And this paper was presented at this year's Symposium on Operating Systems Principles, or SOSP. Shan and Bo, thanks for joining us on Abstracts today. Thank you. Thanks for having us. Shan, let's kick things off with you. Give us a brief overview of your paper. What problem or issue does it address, and why should we care about it? Yeah, so basically, from the title, we are looking at retry bugs in software systems. What retry means is that people may not realize that for big software, like the systems that run at Microsoft, all kinds of unexpected failures, software failures or hardware failures, may happen. So just to make our software systems robust, there's often a retry mechanism built in.
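A minimal sketch of the kind of retry mechanism being described, in Python. The wrapper below is illustrative only, assuming a generic operation and a simple backoff policy; it is not code from the paper or from the systems it studies, and even this small version hides the decisions (attempt limits, backoff, whether re-execution is safe) that make retry easy to get wrong.

```python
import time

def with_retry(operation, max_attempts=3, backoff_seconds=1.0,
               retryable=(ConnectionError, TimeoutError)):
    """Re-execute `operation` when a transient error is raised.

    Illustrative sketch: real systems must also decide whether the
    operation is idempotent, cap total waiting time, and clean up any
    partial state before retrying, which is where retry bugs tend to hide.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
```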
Starting point is 00:01:38 So if something unexpected happens, a task, a request, a job will be re-executed. And what this paper talks about is that it's actually very difficult to implement this retry mechanism correctly. So in this paper, we do a study to understand what typical retry problems are, and we offer a solution for detecting these problems. Bo, this clearly isn't a new problem. What research does your paper build on, and how does your research challenge or add to it? Right. So retry is a well-known mechanism and is widely used. And retry bugs in particular have been identified in other papers as root causes for all sorts of failures, but they have never been studied as a standalone class of bugs. And what I mean by that is nobody looked into why it is so difficult to implement retry. What are the symptoms that occur when you don't implement retry correctly? What are
Starting point is 00:02:46 the causes of why developers struggle to implement retry correctly? We built on a few key bug finding ideas that have been looked at by other papers, but never in this context. We use fault injection, and we repurpose existing unit tests to trigger these types of bugs, as opposed to asking developers to write specialized tests to trigger retry bugs. So we're kind of making the developer's job easier in a sense. And in this pipeline, we also rely on large language models to augment the program and code analysis that goes behind the fault injection and the reuse of existing tests. Have large language models not been utilized much in this arena? I want to say that, you know, actually this work was started about two years ago. And at that time, large language models were really in their infancy. And people were just starting to explore what large language models could do. Things that we were able to do before,
Starting point is 00:04:11 like, you know, finding bugs, can now be replicated by using large language models. But at that time, we were not very happy because, you know, just using a large language model to do something people were already able to do using traditional program analysis, I mean, it seems cool, right, but it does not add new functionality. So I would say what is new, at least when we started this project, is we were really thinking, hey, is there anything, right,
Starting point is 00:04:41 is there some program analysis, is there some bug finding that we were not able to do using traditional program analysis but actually can be enabled by a large language model? And so that was, you know, what I feel was novel, at least, you know, when we worked on this. But of course, you know, large language models are a field that is moving so fast. People are, you know, finding new ways to use them every day. So, yeah. Right. Well, in your paper, you say that retry functionality is commonly under-tested and thus prone to problems slipping into production. Why would it be under-tested if it's such a problem? So, testing retry is difficult because what you need is to simulate the system-wide conditions
Starting point is 00:05:28 that lead to retry. That often means simulating external transient errors that might happen on the system that runs your application. And to do this during testing and capture it in a small unit test is difficult. I think actually Bogdan said this very well. It's like, why do we need to retry? It's when an unexpected failure happens, right? And this is, like Bogdan mentioned, external transient errors, such as my network card suddenly does not work, right? And this may occur, you know, only for, say, one second, and then it goes back on.
Starting point is 00:06:05 But this one second may cause some job to fail and need a retry. So during normal testing, these kinds of unexpected things rarely, rarely happen, if at all. And it's also difficult to simulate. That's why it's just not well tested. Well, Shan, let's talk about methodology. Talk a bit about how you tackled this work and why you chose the approach you did for this particular problem. Yeah, so I think this work includes two parts. One is a systematic study.
Starting point is 00:06:39 We studied several big open-source systems to see whether there are retry-related problems in these real systems. Of course, there are. And then we did a very systematic categorization to understand the common characteristics. And the second part is about detection. In the detecting part, we actually used a hybrid of techniques: traditional static program analysis and large language model-enabled program analysis. In this case, imagine we just ask a large language model, hey, tell us, is there any retry implemented in this code?
Starting point is 00:07:26 If there is, where is it? And then, as Bogdan mentioned, we repurposed unit tests to help us execute the part of the code where the large language model tells us there may be a retry. And in addition to that, we also used fault injection, which means we simulate those transient external environmental failures, such as network failures, that would very rarely occur on their own. Well, Bo, I love the part in every paper where the researchers say, and what we found was. So tell us, what did you find? Well, we found that implementing retry is difficult and complex.
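The fault injection plus repurposed-unit-test idea described here can be sketched roughly as follows in Python. The generic injector is the only concrete part; `myapp`, its `network.send` call, and the test scenario are hypothetical stand-ins for an application's own code and tests, not the paper's actual tooling.

```python
from unittest import mock

def make_flaky(real_call, failures=1, exc=ConnectionError):
    """Wrap a callable so its first `failures` invocations raise a transient error.

    This is fault injection in miniature: the dependency fails once and then
    recovers, which is exactly the window where retry logic either saves the
    request or misbehaves.
    """
    state = {"remaining": failures}

    def flaky(*args, **kwargs):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise exc("injected transient failure")
        return real_call(*args, **kwargs)

    return flaky

def test_fetch_record_survives_one_transient_error():
    import myapp  # hypothetical application under test, with an existing unit-test scenario

    # Reuse the existing test scenario, but make one external call fail transiently.
    with mock.patch.object(myapp.network, "send", make_flaky(myapp.network.send)):
        # A correct retry implementation should absorb the injected failure;
        # a buggy one may crash, hang, or duplicate the request.
        assert myapp.fetch_record("key-1") is not None
```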
Starting point is 00:08:14 We not only found new bugs, because yes, that was kind of the end goal of the paper, but also tried to understand why these bugs are happening. As Shan mentioned, we started this project with a bug study. We looked at retry bugs across eight to 10 applications that are widely popular, widely used, and that the community is actively contributing to. And the experience of both users and developers, if we can condense it into an answer to 'what do you think about retry?', is that, yeah, they're frustrated because it's a simple mechanism,
Starting point is 00:08:51 but there are so many pitfalls that you have to be aware of. So I think that's the biggest takeaway. Another takeaway is that when I was thinking about bug finding tools, I had this somewhat myopic view of, you know, you instrument at the program statement level, you figure out relationships between different lines of code and anti-patterns, and then you build your tools to find those anti-patterns. Well, with retry, this kind of gets thrown out the window because retry is a mechanism.
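A rough, hypothetical illustration of that point, with none of the code below taken from the studied systems: the retry policy, the retry loop, and the cleanup each live in a different function, so no single statement is the bug, and spotting a problem means reasoning across all of them.

```python
import time

class RetryPolicy:
    """Retry bookkeeping lives here, away from the code that uses it."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.attempts = 0

    def should_retry(self):
        self.attempts += 1
        return self.attempts < self.max_attempts

def submit_job(job, policy, send):
    """The retry loop lives here..."""
    while True:
        try:
            return send(job)
        except ConnectionError:
            cleanup_partial_state(job)  # ...and the cleanup lives in yet another function.
            if not policy.should_retry():
                raise
            time.sleep(1)

def cleanup_partial_state(job):
    # If this forgets to undo something a failed attempt already did,
    # the mistake is only visible by reading all three functions together.
    job.pop("partial_result", None)
```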
Starting point is 00:09:26 It's not just one line of code. It is multiple lines of code that span multiple functions, multiple methods, and multiple files. And you need to think about retry holistically to find these issues. And that's one of the reasons we used large language models, because traditional static analysis or traditional program analysis cannot capture this. And, you know, large language models turn out to be actually great at this task. And we tried to harness the, I would say, fuzzy code comprehension capabilities of large language models to help us find retry bugs. Well, Shan, research findings are important, but real-world impact is the ultimate goal here. So who will this research help most and why?
Starting point is 00:10:17 Yeah, that's a great question. I would consider several groups of people. One is, hopefully, you know, people who actually build and design real systems will find our study interesting. I hope it resonates with them about those difficulties in implementing retry, because we studied a set of systems, and there was a little bit of comparison about how different retry mechanisms are actually used in different systems. And you can actually see that, you know, these different mechanisms, you know, they have pros and cons, and we have a little bit of, you know, suggestion about what might be good practice. That's the first group. The second group is, our tool actually did find, I would say,
Starting point is 00:11:03 a relatively large number of retry problems in the latest version of every system we tried. And we found these problems, right, by repurposing existing unit tests. So I hope our tool will be used, you know, in the field by, you know, maybe being integrated with future unit testing so that our future systems will become more robust. And I guess the third type of, you know, audience I feel may benefit from reading our work, knowing our work, is people who are thinking about how to use large language models. And as I mentioned, I think the takeaway is that a language model can replace some of the things we were able to do using traditional program analysis. And it can do more, right, for those fuzzy code comprehension-related things. Because for traditional program analysis, we need to precisely describe what we want.
Starting point is 00:12:03 Like, oh, I need a loop. I need a write statement, right? A large language model is imprecise by nature. And that imprecision sometimes actually matches the type of things we're looking for. Interesting. Well, both of you have just sort of addressed nuggets of this research. And so the question that I normally ask now is, if there's one thing you want our listeners
Starting point is 00:12:30 to take away from the work, what would it be? So let's give it a try and say, okay, in a sentence or less, if I'm reading this paper and it matters to me, what's my big takeaway? What is my big aha that this research helps me with? So the biggest takeaway of this paper is not to be afraid to integrate large language models into your bug finding or testing pipelines. And I'm saying this knowing full well how imprecise large language models can be. But as long as you can trust but verify, as long as you have a way of checking what these models are outputting, you can effectively insert them into your testing framework.
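One rough way to picture the trust-but-verify pattern being described, assuming some `ask_llm` helper that sends a prompt to a model and returns its text (the helper, the prompt, and the sanity check below are illustrative assumptions, not the paper's pipeline): the model's fuzzy answer is only acted on after a cheap, conventional check confirms it points at plausible code.

```python
import json
import re

def find_retry_candidates(source_code, ask_llm):
    """Ask a model where retry logic might live, then verify before trusting it.

    `ask_llm` is a hypothetical callable: prompt string in, model text out.
    The verification step is an ordinary, precise check that bounds how much
    damage an imprecise answer can do.
    """
    prompt = (
        "Does this code implement a retry mechanism? If so, reply with a "
        "JSON list of the line numbers involved.\n\n" + source_code
    )
    try:
        candidate_lines = json.loads(ask_llm(prompt))
    except (ValueError, TypeError):
        return []  # unusable answer: report nothing rather than guess

    lines = source_code.splitlines()
    verified = []
    for n in candidate_lines:
        if isinstance(n, int) and 1 <= n <= len(lines):
            # Cheap sanity check: the reported line should at least look
            # loop- or error-handling-related before we spend effort on it,
            # for example by injecting faults along that path.
            if re.search(r"\b(for|while|except|catch|retry)\b", lines[n - 1]):
                verified.append(n)
    return verified
```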
Starting point is 00:13:17 And I think this paper is showing one use case and brings us closer to having it integrated more ubiquitously. Well, Shan, let's finish up with ongoing research challenges and open questions in this field. I think you've both alluded to the difficulties that you face. Tell us what's up next on your research agenda in this field. Yeah, so for me personally, I mean, I learned a lot from this project, and particularly this idea of leveraging a large language model but also having a way to validate its results. I'm actually working on how to leverage large language models to verify the correctness of code, code that may be generated by a large language model itself. So it's not exactly, you know, a follow-up of this work, but I would say, at a, you know,
Starting point is 00:14:15 philosophical level, it is something that is along this line of, you know, leveraging large language models, leveraging their creativity, leveraging sometimes, you know, their imprecision, but having a way, you know, to control it, to verify it. That's what I'm working on now. Bo, you're finishing up your doctorate. What's next on your agenda? So we're thinking of, as Shan mentioned, exploring further what large language models can do in this bug finding and testing arena and harvesting their imprecision. I think there are a lot of great problems that traditional code analysis has tried to tackle, but it was difficult. So in that regard, we're looking at performance issues and how large language models can help identify and
Starting point is 00:15:08 diagnose those issues, because my PhD was, up until this point, mostly focused on correctness. And I think performance inefficiencies are a much wider field, with a lot of exciting problems. And they do have this inherent imprecision and fuzziness to them that large language models also have. So I hope that combining the two imprecisions maybe gives us something a little bit more precise. Well, this is important research and very, very interesting. Shan Lu, Bogdan Stoica, thanks for joining us today. And to our listeners, thanks for tuning in. If you're interested in learning more about this paper, you can find a link at aka.ms forward slash abstracts, and you can also find it on
Starting point is 00:15:57 the SOSP website. See you next time on Abstracts.
