Disseminate: The Computer Science Research Podcast - Andrew Quinn | Debugging the OmniTable Way | #16
Episode Date: January 2, 2023
Summary: Debugging is time-consuming, accounting for roughly 50% of a developer's time. In this episode Andrew Quinn tells us about the OmniTable, an abstraction that captures all execution state as a... large queryable data table. In his research Andrew has built a query model around an OmniTable that supports SQL to simplify debugging. An OmniTable decouples debugging logic from the original execution, which SteamDrill, Andrew's prototype, uses to reduce the performance overhead of debugging (SteamDrill queries are an order of magnitude faster than existing debugging tools).
Links: Andrew's Homepage | Debugging the OmniTable Way OSDI '22 Paper | SteamDrill GitHub Repo
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
I'm delighted to say I'm joined today by Andrew Quinn, who will be talking about his OSDI paper, Debugging the Omnitable Way.
Andrew is an assistant professor at the University of California, Santa Cruz, and his research focuses on the intersection between operating systems, architecture, programming languages, and software engineering. Andrew, thanks for joining us on the show.
Cool. Yeah, happy to be here.
So let's dive in. So to quote your homepage, you say that most of your work is about finding bugs,
or is about bugs, finding them, understanding them, and fixing them. So how did you end up
researching in this area? Yeah, that's an interesting question. You know, I think there's obviously sort of multiple causes
for why somebody is interested in a specific area. And it's really a confluence of a bunch of stuff.
I think if I had to boil it down and say, okay, what one main thing really spurred me looking in this direction?
Right after undergrad, I worked in industry for a little over a year. And I was a,
I guess we called it L3 support. And so I spent a lot of time interacting with customers who had
a problem with this legacy application I worked on. And every once in a while, I'd do occasional
enhancements to that application, so maybe I'd add some new feature that some customer wanted. And so I was kind of underemployed in this role. I had a lot of time to think about how I could do this task better, not just from a "how would I be a better engineer" angle, but systematically: how could we design better tools for this job? And that really got me started thinking a lot, reading papers, and trying to understand what the research community said was the best way to do these tasks. And so then when I started
grad school, I mean, I didn't go in saying, Oh, I'm going to work on debugging, but I definitely
went in. And I was very probably predisposed to think of these problems as the interesting
problems. And so, you know, the first papers I read that I really liked were about debugging or testing and just sort of
steamrolled from there. You know, a couple other, I guess, small factors or, you know,
just kind of like lucky scenarios, like the first project I worked on in grad school was something
my advisor suggested to me and we worked on together.
And it wasn't strictly about debugging, but it was debugging adjacent.
And so it was easy for me to take that project and turn towards more thinking about debugging as a topic.
Yeah, and I just had a lot of freedom.
I mean, I don't know, I worked with the right people who gave me a lot of freedom to kind of define that type of research agenda, especially as I got more senior.
So, yeah, I think these types of questions can be really hard to answer. It's definitely not one cause.
It's definitely a bunch of things leading into what your interests are and why you work on them.
But there's, I guess, a little bit of my story describing how I got here.
Fantastic. So let's get stuck into the paper then. So can you describe to the listeners
what an Omnitable is? Yeah. So an Omnitable is an abstraction for an execution of your software. So
in particular, it models a particular execution of your program, so a particular instance of the process, as this gigantic database-style table.
And the aim is to include everything you could possibly want to know about that program. That's
our goal, is to include in this one giant table. So what that looks like is we have a column in
this database-style table for every single byte of memory, and a column for, let's say, every single
register, and some additional metadata, like a time column that tells you when did some event happen.
And then there's a row in this table for every single instruction that's executed by your
program. And so you can imagine it's this sort of gigantic entity that represents everything, all of the state that your program ever reached and ever had.
And the idea then is when you want to debug that application, when you want to say
something about what happened, you write a SQL query on top of that table to pull out the
particular state you're interested in observing, whatever that state might be.
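For illustration only, a query over such a table might look roughly like the sketch below; the table and column names are hypothetical, invented for this example rather than taken from the paper.

    -- Hypothetical sketch of an OmniTable query (table and column names invented).
    -- Conceptually: one row per executed instruction, one column per register and
    -- per byte of memory, plus a time column.
    SELECT time,
           instruction_pointer,
           reg_rax,
           mem_140737338085904              -- one byte of memory, as a column
    FROM   omnitable
    WHERE  instruction_pointer = 4198964    -- 0x401234, acts like a breakpoint filter
    ORDER  BY time;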
Kind of what's the motivation behind this approach?
I know you touched on it a little bit there, but why did you kind of end up on this, taking
this approach to debugging?
And how does it compare to other debugging tools, maybe?
That's interesting.
So, you know, before I worked on the OmniTable, I had this paper. It's called Cluster-Fueled Debugging. And Cluster-Fueled
Debugging looked at how you could use really large scale compute clusters. So imagine sort of
thousands of machines, and tap into all of that power, in order to understand what went wrong
when you had a particularly hairy or complex bug. And I won't bore you with the details of exactly
how that worked. It turns out we're
using some of those ideas in this Omnitable paper as well. But essentially, what wound up happening
is you had to rewrite the way you would specify debugging code. And you had to think really hard
as a developer about the fact that your code was going to be parallelized. So in essence, we were able to make
debugging much faster. But the trade-off, the cost was that you as the developer had to think a lot
about complexity. You had to deal with the complexity of thinking about parallel programming.
And I was always very dissatisfied with that as the trade-off. We were aiming to improve the
latency, so I was fine with
it. But I was always dissatisfied in the back of my mind thinking we have to be able to do better,
right? Like we can do better than this. And the Omni table was really kind of my attempt to do
better, right? So how can we get the improvements in terms of performance, maybe even do better
than what I had done before, but also retain sort of the simplicity of debugging we might have today, or maybe even do
better in terms of simplifying the complexity that it takes to debug programs. As I got more and more
involved in it, I realized that this trade-off actually is fairly fundamental. Lots and lots of
tools either choose to have high overhead or high complexity. And sort of you can taxonomize, and I taxonomize in the paper briefly, prior tools that fall into this space.
So there's things like, you know,
any of the tools that I've worked on
parallelizing debugging tasks.
So I brought up Cluster-Fueled Debugging.
There's other examples.
There's also other examples of tools
that use really low level program analyses.
And I would classify those as
low overhead, but very high complexity, because as a developer, you have to contend with thinking
really hard, how do I make this analysis work? And then on the other hand, there's a lot of tools
that kind of like the OmniTable have proposed higher level languages for debugging. So these
are things like Fay or EndoScope or Pivot Tracing. And a lot of those tools are giving you this higher-level way of thinking about debugging, but
they often don't do anything about the performance overhead.
And so, to contend with performance overhead, they either just expose that you're sort of stuck with paying really high costs for your debugging, or they just don't let you ask expensive questions.
Right.
And so that's sort of where we wound up eventually leading to the Omnitable was how do we balance
these two things?
How do we balance complexity and performance overhead and get the best of both worlds?
Can you kind of walk us through maybe then the kind of the query model for Omnitable?
And so you mentioned that you have like a SQL-style language.
What was the extension you made to kind of SQL to better support the query?
Yeah.
So one of the things was that we were a bit dogmatic, almost, about this idea that everything should boil down to being a query over this OmniTable. And in particular, we didn't want to add other sort of,
I won't call them hacks, but for lack of a better word,
hacks into the query model.
So we didn't want to have this collection of tables
that were sort of magically deduced.
Like I didn't want to have a functions table
that included all the functions executed in your program
where that was derived externally from
the OmniTable. I wanted that to be a view on top of the OmniTable. The rationale being that
from a systems perspective, if everything boils down to a query on top of an OmniTable,
then it's much easier for us to build. The optimization story is straightforward. Everything
gets optimized by the same optimizer. So there are these sort of nice systems things that happen if you stick to this idea that everything is going to
be built on top of this low level table. Okay. So that's where we started from.
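As a rough illustration of that design principle, a higher-level table such as the set of function entry events could, in principle, be defined as a view over the raw OmniTable rather than maintained separately. Everything below is a hypothetical sketch with invented names, not SteamDrill's actual schema.

    -- Hypothetical sketch: deriving a higher-level view from the raw OmniTable
    -- (names and addresses invented for illustration).
    CREATE VIEW function_entries AS
    SELECT time,
           instruction_pointer AS entry_pc
    FROM   omnitable
    WHERE  instruction_pointer IN (4198400, 4203648);  -- entry addresses from debug symbols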
And that's a little bit contrary. Like a lot of people instead defined views they thought were
interesting. And that's kind of the way they would try to design their system. So this is a little
bit of a radical approach in terms of systems design. But then you ask, well, okay, so the problem then is like, well, there's certain
things that I do, like, I often find myself wanting to ask a question about, you know,
if I have the start of a function defined by some particular instruction in my Omni table,
and I have the return or the end of that function defined by another instruction, how do I match those together when I'm dealing with things like
recursion, right? I need to somehow model a function stack, but especially your listeners
who maybe come from more of a database background, they probably understand, well, it's very hard to
model recursion with standard SQL. We don't have the ability to do that. So that was one of the things
we did. So I built this particular type of join. I called it a stack join that allows you to express
the idea of sort of rows being pushed onto a stack and rows being popped off of a stack as they match.
And that allows you to model the function stack, which is really critical for debugging.
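For illustration, matching function entries to their returns with such a join might look roughly like the sketch below. The STACK JOIN keyword and the rest of the syntax are invented for this example; the paper defines SteamDrill's actual operator.

    -- Hypothetical sketch of a stack-style join (syntax invented for illustration):
    -- entry rows are pushed onto a stack, and each return row pops the most recent
    -- unmatched entry, which is what makes recursion match up correctly.
    SELECT entry.time            AS entry_time,
           ret.time              AS return_time,
           ret.time - entry.time AS duration
    FROM   omnitable AS entry
    STACK JOIN omnitable AS ret
      ON   entry.instruction_pointer = 4198400   -- first instruction of foo()
      AND  ret.instruction_pointer   = 4198652;  -- foo()'s return instruction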
That was maybe one of the few extensions to the language that is required from a what-you-can-compute perspective. There's a lot of other additions we made that are just convenient. So we made a way to say, okay, if I had some event, I want to know the next event that happened that matches in some
way. Well, with SQL and sub queries, you can make that happen. It's just really awkward. And so we
said, hey, we're doing this all the time. Let's give you some syntactic sugar so that you can
model this notion of the next event or the previous event. And then the final thing, we added some
nice syntactic sugar around some user-defined
functions where we allowed you to do traversals of data structures and things. And there we're just
using kind of pre-existing user-defined function interfaces, but we're giving you nice ways of
expressing that around like, hey, here's a pointer-based data structure. I want you to
calculate the transitive closure of that
pointer-based data structure so I can traverse my entire B-tree when I'm doing this query,
whatever that might be. In your paper, you give a really nice case study of a developer
that's trying to diagnose some performance problem in Redis. Could you maybe walk us
through how the developer would approach solving this issue or debugging this issue, should I say, with Omnitable and what the sort of debugging queries would look
like to solve this issue? Yeah, so I think this is a good opportunity to describe the usage model
and to talk about one thing I haven't talked about at all, which is
Omnitable queries are designed to be offline as opposed to online.
And so what the Redis developer in this use case, you know, starts with, or what you,
the person, the listener who wants to debug their code, you start by capturing an instance
of your program that has the problem that you want.
You can think about that as like, you're basically getting
the Omni table for some problem that you had initially, whatever it might've been.
And we use a technique called deterministic record and replay in order to do this.
But that's sort of step one as a developer. And then once you do that, you have this Omni table
that's exposed to you. And you can sort of think of it as, you know,
just querying that existing Omni table at all times. So that's actually really convenient,
because it means when you have multiple queries, and there'll be multiple queries in this use case,
each of those queries occurs over the same existing Omni table. You're not sort of calculating a new execution
and performing sort of this extraction,
analyzing a new Omni table.
You're always analyzing the initial Omni table
that showed whatever that bug was.
So that's kind of the, I guess, step one.
But let me talk about the use case a little bit,
this Redis bug.
So in particular, we're imagining a scenario,
and it's based on a bug that was reported in Redis, where a developer is deploying Redis as
an LRU cache in front of a backend service. So just imagine in front of some sort of database,
whatever it might be. And the bug is that over time, they're seeing the end-to-end latency of
that deployment, not necessarily just of Redis, but of the whole deployment, that end-to-end latency
is slowly increasing.
And the developer using existing telemetry, so just existing logs or things you might
have, notices that the increase in latency or an average end-to-end latency is correlating
with sort of a higher and higher percentage of requests being processed by that backend service. So in other words, the cache is sort
of getting worse at caching things over time. And okay, if you think about these types of
problems, right, it's such a high-level symptom to start with. And if you were trying to debug this,
it actually might not even necessarily be a bug for starters, right?
Like in principle, the workload could be changing in certain ways such that you would expect this to happen.
So it's a really hard problem to start staring at when you have a symptom in mind and you're not even totally certain that that symptom is indicative of this particular bug. So anyhow, so the developer starts
and their first query is just trying to approximate
the working set size of their cache.
They're trying to get a sense of how many unique items
are being accessed by the cache over time.
And they write this query that uses a windowed aggregation
on top of all of the queries against their Redis cache.
You can see that paper for more details.
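A rough sketch of what that first query could look like is below. The view and column names are invented for illustration (the paper writes it against SteamDrill's own views), but the shape is the same: a windowed aggregation that counts distinct keys per time bucket.

    -- Hypothetical sketch of the working-set query (names invented): count the
    -- distinct keys requested from the cache in each one-minute window.
    SELECT FLOOR(time / 60000000000) AS minute,        -- nanosecond timestamps, bucketed
           COUNT(DISTINCT key)       AS working_set_size
    FROM   redis_requests                              -- assumed view over the OmniTable
    GROUP  BY FLOOR(time / 60000000000)
    ORDER  BY minute;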
What's interesting about this query is we show in the paper a comparison with how you would maybe write this query if you were to use a tool like GDB or any sort of logging-based tool that you might use today. And you find that this one query, seemingly innocuous,
has a lot of kind of interesting trade-offs in it,
dealing with the performance complexity trade-off.
So it's easy to make this query in GDB run really slowly,
but it's very hard to make this query run quickly.
And so there's a lot of complexity in making the query
run reasonably fast. So it's kind of a poster child for the trade-off we were talking about.
So that's the first query. You know, what the developer learns at this time is, oh, actually,
the working set size isn't really changing. And so this is starting to point, oh, there is a bug
here, right? So the first query is just kind of saying like, well, is there actually a bug in Redis or not?
OK, so then once they figure out that there's a sort of fairly constant working set size, the second thing the developer wants to figure out is, OK, well, how many items are actually in my cache over time?
Right. So they're just trying to track like maybe there's some internal problem that's going on.
Who knows exactly why this is happening?
But maybe I'm sort of losing items.
Maybe the size of each item is increasing.
And so I'm able to store fewer.
You know, who knows?
And there, what the developer learns is that the number of items in the cache is actually decreasing over time, which is kind of surprising.
Because the same query that they wrote also shows the number of bytes in each item is not decreasing.
So you shouldn't see this decrease in size if the number of bytes in each item is not decreasing.
That means that basically the size of your cache is decreasing over time.
And that doesn't really make any sense.
So, you know, the next thing the developer does is they write a query that essentially looks at, it's kind of like a use
after free query, but it's customized specifically for this Redis scenario. And so because Redis and
lots of servers today don't clean up memory on termination, you can't just use an off-the-shelf
use after free detection tool because everything is a leak.
And so if there were a memory leak that was causing this problem in the cache, you sort of would say, okay, well, all of my memory is leaking.
A good portion of my memory is leaking in Redis.
So they write this customized query that uses this new view.
It's sort of complex.
It actually ends up being like roughly 13 lines
of a query, but essentially they're tracking for each allocation site that eventually leaks memory
at each point in time, how much memory has it leaked by that point in time. And they're using
that as a way of basically saying, okay, which allocation sites are leading to a growth in the
amount of memory that I leak over time, as opposed to staying constant in the amount of memory that they leak over time. The idea being, okay, if it's growing, then that's the leak, right? And if it's constant, then okay, you're technically leaking, but it's not getting worse.
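As a rough illustration, a query in that spirit might look like the sketch below; the allocations and frees views and all names are invented for the example, and the actual query in the paper (the roughly 13-line one Andrew mentions) is written against SteamDrill's own views.

    -- Hypothetical sketch (names invented, and it ignores pointer reuse): for each
    -- allocation site, track the cumulative bytes it has leaked by each point in
    -- time. An allocation with no matching free during the recording counts as leaked.
    SELECT a.alloc_site,
           a.time,
           SUM(a.size) OVER (PARTITION BY a.alloc_site ORDER BY a.time) AS leaked_bytes_so_far
    FROM   allocations AS a               -- assumed view over the OmniTable
    LEFT JOIN frees AS f
           ON f.ptr = a.ptr AND f.time > a.time
    WHERE  f.ptr IS NULL                  -- never freed during the recording
    ORDER  BY a.alloc_site, a.time;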
So finally, the fourth query, and then eventually the fifth query, are really just about looking at the output of the third query.
It shows you that the problem is these objects that should be reference counted are what's being leaked.
And so the fourth and fifth query, and I won't get into the details there, are really just then tracking basic invariants about those reference-counted objects.
So, you know, when I see, do my reference counts ever get larger than
I expect them to get? Do they ever get small? You know, do they ever get negative? Those types of
invariants and they're just tracking those types of things. And they eventually discover that
there's this bug in their reference count when certain objects are shared across two threads.
So it kind of continues to be a hairy bug all the way down.
But, you know, hopefully that gives you somewhat of a sense of the types of queries you run
where you're thinking about how do I think about my problem in terms of these types of
SQL aggregations?
We found that those are super, super powerful tools.
And so almost all of the queries I described are using some form of SQL aggregation to reason about what's happening inside of my program: what's happening in the general case, what's happening in this exceptional case,
you know, tracking how things progress over time, that sort of thing.
Awesome. Yeah, I really like the idea of bringing SQL into this sort of domain, and leveraging a language that people are probably already familiar with. So the next question is, we've touched on it so far, this trade-off in complexity versus performance. Obviously, naively materializing an OmniTable seems like it would come with a significant performance overhead. So how do you go about mitigating that potential performance overhead?
Yeah, when I describe this paper to people, I describe this problem as like the
elephant in the room. You know, usually we get about to this point in the conversation,
and they sort of sit back and they're like, well, it's great what you're telling me.
But you told me that it was too expensive to use my debugging tools today. And then you told me
you're going to record everything that ever happened in the program. That surely is more expensive than what I started with. And so,
yeah, absolutely. Right. So the Omni table, you can just sort of back of the envelope,
calculate this. You're talking about probably exabytes of data that you're recording for most
reasonable applications, because it's the whole memory space in terms of the number of columns, and it's the number of instructions executed per row. So run a couple of seconds of computation, go write it down: it's petabytes, and it approaches exabytes within a minute of computation or something like that. So, yeah, we can't
store it, right? So everything I've been telling you is impossible. So what we do instead is we have this technique that we call lazy materialization.
And the idea behind lazy materialization is that we're going to wait until a query actually needs a byte of OmniTable state before we
realize and actually calculate that byte. How do we make that possible? Well, the idea is to turn
to this technique I alluded to called deterministic record and replay. And so what we do is during
this initial execution, when you first discover or first capture the bug, whatever that
bug might be, this technique called deterministic record and replay just records a log of all the
non-deterministic events that are passed to your program. And that log of non-deterministic events
is much smaller than all of the state of your program. In fact, a prior work out of
my lab group at Michigan, when I was at Michigan, had shown that you could store, depending on the exact usage model of the user and the computers, but at least for desktop applications, an average year's worth of these logs for the
average user on like a single terabyte drive. So we're just building
on top of that existing work. That system was called Arnold. And then what happens is when you
actually query an Omnitable, we then turn your query into dynamic instrumentation that we execute
on top of that recording. And so that dynamic instrumentation, essentially, if you asked for
a particular state at a particular point in time, you can kind of imagine that as like saying,
we create a printf statement that extracts that particular byte of information you wanted
at the particular points in time that you specified. And then we rerun that. And that's
how we then actually calculate the OmniTable query after the fact, sort of lazily, as we say.
Oh, fantastic. So are there any other additional techniques you use?
I mean, the prototype is called Steam Drill, right? That's how this is implemented.
Are there any other techniques that you kind of pair up with the deterministic record and replay?
Yeah. And that's really where one of the, I think,
really interesting consequences of doing this lazy materialization is that the process of
deciding sort of where to introduce this instrumentation, it's very open, right? So
if you think from a database perspective, you can think of the
re-execution and all of this as being basically a query planning problem where you're deciding
Steam Drill, which is our prototype, is deciding when do I want to run this dynamic instrumentation?
How do I actually want to instrument the program to do this dynamic instrumentation? Do I want to
leverage many machines? How exactly do I want to do this? Right. And so in particular, we basically used two techniques. The first one is we reused my work on cluster-fueled debugging. So we used thousands of cores to split this instrumentation task across many, many machines and try to figure out what this state was that you're interested in.
That was more or less just using the state of the art that's out there.
And the novelty here, the probably novel approach was we took this idea that we called
sort of multi-replay query resolution, which is yet another mouthful. But the idea was rather than, you know, take your query and calculate it all
over a single execution of the replay, right? So you could imagine converting your query,
running it all on top of one replay, all in one go, and then being done. Okay. What we found,
though, is that for these queries, especially some of the more interesting ones that are
joining
tables that are maybe saying, I want to know all the instructions that were executed only in slow
invocations of your functions or something, right? In those types of queries, what we find is there's
these different parts of the query logic that are way more expensive than other parts of the query
logic. And if we can calculate the less expensive parts of your query
sort of early on, then we can often filter the amount of the more expensive thing that we have
to go calculate. And so the idea is, okay, we'll split your query into these different
chunks basically, and we'll execute the less expensive things first and use those to try to
filter how much of the more expensive stuff we actually have to calculate and extract later on. And we do that in part that
only works because we have this replay. And every time you execute the replay, the idea of
determinism is you get exactly the same OmniTable out. And so as a result, we can always sort of delay and recalculate exactly the state that we would have gotten earlier, because we're building on top of that deterministic record and replay as a primitive. So that's the novel technique that we used to accelerate these tasks.
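A rough way to picture that splitting is below, with invented names and the two passes written out by hand; in SteamDrill the query planner does this division automatically rather than the developer writing separate passes.

    -- Hypothetical sketch of multi-replay query resolution (names invented).
    -- Pass 1, run over a cheap replay: find only the slow invocations.
    CREATE TABLE slow_calls AS
    SELECT call_id, entry_time, return_time
    FROM   function_calls                       -- assumed view over the OmniTable
    WHERE  name = 'handle_request'
      AND  return_time - entry_time > 1000000;  -- e.g. slower than 1 ms, in nanoseconds

    -- Pass 2, run over a second replay with heavier instrumentation: extract
    -- instruction-level rows, but only inside the time ranges pass 1 identified.
    SELECT o.time, o.instruction_pointer
    FROM   omnitable AS o
    JOIN   slow_calls AS s
      ON   o.time BETWEEN s.entry_time AND s.return_time;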
Yeah. Awesome. So how did you actually implement SteamDrill? What's it built on top of? Is it totally handmade from nothing, or did you build on top of existing infrastructure?
Right, right. So, you know,
there's a lot of components, of course. The record and replay is built on top of Arnold, which was this record and replay system that I mentioned earlier. And then there's an instrumentation-specific component. That instrumenter is actually
custom for SteamDrill, basically, so that was sort of our own build. And then the query
language, the optimizer, most of this stuff that looks like a database is all built on top of Spark.
And so we had some extensions we had to make to Spark to support our new query operators I talked
about earlier. We had to hook into how the Catalyst optimizer works, et cetera.
But we were able to re-leverage a lot of what they had in Spark to support a good number
of operators and things like that.
How long did it take you to develop it?
Oh, man, that's a loaded question.
Years. So I had started this project, and I probably wasn't developing in earnest in the very beginning, but there was a HotOS version of this paper
that came out in 2019. That was in May of 2019, and just before that,
I had started enough sort of hacking and building
versions of the prototype that eventually became Steam Drill. I had started working on that at
around that time. As it turned out, there were a lot more changes that I had to make that I
initially sort of didn't expect. So I mentioned, for example, we have our own instrumenter for
Steam Drill. I didn't think I was going to have to do that. And I thought I was going to be able to use something off the shelf.
It turned out that the existing tools either came with really high overhead or they couldn't
support arbitrary queries like what we wanted to support.
And so, yeah, I mean, it wound up being a really large implementation effort, in part
just because of things I didn't expect or, you know, problems I didn't
expect to have. And there were a lot of sort of features I started to add and I realized after a
couple of months, so maybe I didn't really need that or, you know, that doesn't really help the
story. So it was a long process, but in the end, you know, I think I'm pretty happy with the
work, and the implementation is open source. You can go find it on GitHub. The Eidetic system, that's the Arnold work, is out there. There's a standard repo for that that I sort of support. And then the SteamDrill repo is also out there. You can find that through my GitHub page as well.
Yeah, we'll link it all in the show notes so the listeners can go and find those.
I guess my next question is,
how did you go about evaluating Steam Drill
and what were the questions you were trying to answer
in your evaluation specifically?
Yeah, that's a good question, right?
So we had these two claims.
We were going to try to improve complexity
and we were going to try to improve latency.
And so that naturally kind of leads to the two types of questions that you asked. Right.
How did we improve complexity? How did we improve latency to evaluate the complexity question?
We went through a bunch of case studies, not unlike the Redis one I described earlier. So we actually looked at five different case studies,
starting from a bug where we found the bug in like a bugzilla type bug report. And then we
walked through what the Omnitable queries would be. And so from those five case studies, we wound
up with 14 queries in total, because each bug has more than one query, right? Like the Redis example I described earlier had five; many of them had two or three, something like that. And we then looked
at those queries and we said, okay, so what do we compare against? And we thought, okay, the de facto
debugging tool is GDB. And so we said, okay, well, if we can easily say this is the OmniTable query, what's the equivalent in
GDB?
And we looked at GDB has these Python bindings that you can think of as being scriptable
versions of GDB statements.
So you can say set a breakpoint on certain points in the program.
You can perform some computation, like use a Python dictionary to count the number of times
you got to certain spots, whatever it might be. And so we're like, okay, that kind of models this idea
of what if I use GDB to solve these bugs instead? So we took the same queries we implemented for
the OmniTable and we implemented them in GDB. And then we compared, using metrics from the software engineering community, what the complexity gain is from using the OmniTable instead of GDB. And we used a few different
metrics. So the first thing we did is just count the number of lines of code. And we found, okay,
OmniTable queries on average require like 3.7 times fewer lines than their GDB script
counterparts. And in certain cases we see, in fact, you know, way higher than that. So there's
at least one query out there where it's more than an order of magnitude fewer lines. We also counted
the number of terms in an abstract syntax tree. That's kind of like saying how many sort of units
of thing did I have to
think about when I built my system or built my query. And there we saw again, improvements this
time on average, about five times fewer terms. And then we also use this aggregate metric to
measure complexity that's called the Halstead complexity, which looks at properties of your
abstract syntax tree to estimate how long it would take a developer to write that correctly.
And the metrics that it uses are things like how often is a term reused versus how many unique terms do I have?
So things like that. And there we found that on average, the OmniTable scripts would require 2.7 times less development time,
according to this Halstead metric. So together, any one of those points in the space is maybe not
the best argument, but the totality of information tells us that these are in fact less complex than
their GDB counterparts. And we also show in the paper, I mentioned earlier this
example, like the first Redis query, the side-by-side comparison. So we show a couple of
those comparisons and we kind of let the reader be the judge. Do you think these seem more or less
complex? And certainly the PC and most of the people I've talked to seem to think, okay, this
does look less complex. So that's the complexity side. For the latency side, you just calculate these things,
you measure them. And again, we compared against GDB and we just showed, okay, we didn't actually
calculate them over all 14 of those queries. We chose a subset of representative ones.
But for that subset of interesting queries, how much faster is the OmniTable than GDB? And we found that on average,
it's almost two orders of magnitude faster
to calculate your queries using the Omni table
instead of using these GDB scripts.
So yeah, we saw massive gains in efficiency
and still, you know, I think reasonable gains
in complexity against what's the de facto tool
that most of us are using,
as far as I can tell, when we develop.
Fantastic. There's some good sound bites in there, some good quotes I can take out. So yeah, brilliant. I guess you've covered
kind of what the key results are there, but is there anything else, like specific things, you wanted to emphasize?
I don't know. I mean, I'm not sure if I have any other, like, key results.
I mean, I think there's a broader one, I guess. You know, when I told people I wanted to build this tool, right, like in 2019 or whatever, there were a lot of people that were like, oh... I mean, even when we started this conversation and you described this model, people were like, oh, that seems impossible. So maybe one of the key results of the paper that I sort of think back on is, you know, it's actually possible, and then sometimes it's even more efficient, to support this massive model of what your computation is. In some ways the one key result might be that, hey, this query model is actually realizable. That there exists a system that does this might itself be a key result of the work.
Fantastic. Yeah, the next question I'm going to ask is, are there any
situations in which this approach is the wrong way to go, or is it always better than the existing approaches people use? And what are the general limitations of it, I guess?
Right. So, yeah, there's two things I would say. You know, if you're faced with a really simple bug,
like it's not that uncommon for me to have a bug in my software
where like adding a line or adding one or two printf statements
to my code kind of shows me what went wrong, you know.
And for those types of really simple problems,
I'm not sure if it's worth it to get out the whole shebang of how this works. Some of that could be improved by integrating better into IDEs. We made no effort to integrate into an IDE; this is a command-line tool. You're writing your queries on the command line, and you're sort of interacting with it that way. If we made a better effort to integrate this into Visual Studio or something, it's possible some of that burden would go away and this might be useful for simpler bugs. But at least right now, you know, don't reach for this when you've got your, like, easy-to-understand problem.
So that's, like, from a practicality perspective.
There's also a set of limitations, and again, I think your listeners, assuming they're coming from a database background or they have some database knowledge, will appreciate this. Even though we added this stack join, right, to handle functions, there are still things you can't express. Say you're in a program and you want to say, all right,
I'm at some particular point. I have some, you know, x equals two. Okay. And then I want to say, well, what are all of the future instructions that that assignment ever impacted?
This is like a transitive closure information flow question.
And SQL can't do that type of unbounded traversal
over a database table like that.
So we need some new language construct,
either by adding extensions to what we have
or by starting with something like Datalog, I don't know,
something totally different than SQL
that allows us to express those types of transitive closures
to support those types of questions.
So there's definitely things that you probably can't do.
Now, we haven't found that those are, like,
drastically standing in the way of our ability to debug,
but certainly they could.
Yeah, a point I just wanted to add is that I know there's an upcoming new,
well, there's also an extension to SQL that's going to make it better
for doing graph-style queries, SQL PGQ, which is due to come out, right?
And I think that there is also some effort to have, like,
a standardized graph query language called GQL, which is kind of in progress.
So maybe there might be something kind of there that might be more appropriate for this problem, right?
Yeah, that's really interesting. I mean, you could imagine modeling the like assignments
of variables throughout your program as a graph, and then using some sort of like graph query
language on top of that, right? And I think that's a great example of where, you know,
I can imagine continuing to work in this space
by trying to add back, you know, some of these features
as views or embeddings on top of the Omni table, for sure.
Are there any applications that you kind of can't use it for,
or is it pretty much you can use it for any application, right?
Yeah, so we developed this entirely for binaries, or for C and C++, so everything is designed around supporting C and C++ programs. There's this side of the OmniTable, and I
didn't talk a lot about this earlier but we have this notion of these views that you build on top of the Omni table.
And those views allow you to express your debugging in terms of like higher level language constructs instead of on top of like memory addresses.
And our tooling right now, like I mentioned, sort of only supports C++ for those things.
It's unclear to me at this time, and there's a little bit of a research question, whether or not you could build those types of views for dynamic languages, where the mapping from something like, what is the line of source code?
Like, where does that line of source code exist?
What instruction at the machine level
does that exist at? That connection can change over time if you're running in the JVM or something.
And it's not immediately obvious to me whether or not we could support that type of interface
and where the limitations would be. Mostly because I'm not an expert in how the JVM does that type of, what's the word, optimization and sort of re-optimization.
So I don't really know exactly how that works.
So I could imagine needing a set of extensions to SQL
to support those types of dynamic languages, essentially.
Cool.
As a software developer, we touched on usability a second ago. How easy would it be to integrate Steam Drill into my development workflow?
Right. So, you know, I've seen some work that's trying to use SQL-like tools for doing debugging, and that gives me hope that this style of debugging, at the level of the OmniTable contribution, would be something that developers would be fairly ready, hopefully, to add into their debugging today. In terms of the tech stack that builds SteamDrill, probably the hardest initial hurdle to get over is using deterministic record and replay.
And the system we're building on top of now is a custom operating system.
It's a custom kernel to support this feature.
And so you'd probably have to make pretty invasive changes to use that custom kernel to really, you know, make that work. So because of
that, the implementation we have today, the actual artifact, is great if you're a PhD student. It's maybe great if you're like an enthusiast who hacks your own kernel, you're happy to get in, and recompiling a kernel and running a custom kernel doesn't scare you.
But I think for a lot of people,
that's kind of outside the boundary
of what they would normally be using
for standard workflows.
And so I think, unfortunately,
where we are right now with the tech stack
is probably not, you know,
it's not quite there yet
for your average run-of-the-mill software developer
to just download it and roll with it.
Cool. So what was the most interesting thing, or maybe perhaps the most unexpected lesson, that you learned while working on this project?
Yeah. So I think, to me, the stuff I learned most actually was sort of wildly non-technical.
And so, you know, I think retrospectively, you know, I mentioned I worked on this project, obviously, for years.
And I mean, at least two years and probably closer to three.
When I started this project right in the beginning, my advisor had left for sabbatical, and then he sort of never came back. I mean,
the paper has a set of, you know, there's four authors on the paper. And at one point in time,
you could, with a straight face, say all of them were my advisor at different points of this project. And there actually was a fourth person who you could have said was my advisor at a certain point in this project. So I learned a lot about leading a project in a way that I had never really fully
had to lead before. And I learned a lot about, you know, managing to stay on top of a schedule,
like a lot of the things that maybe I think we don't pay enough attention to in the research community as really important parts of like doing great research.
And a lot of my own sort of soul searching and working through, like, bumpy transitions between different people, you know, finding a way to get feedback from a lot of different people over time.
Finding a way to make the 30 minute meeting that I had with my advisor who left for industry.
I only got 30 minutes of his time every week.
How do I make the most of that 30 minutes?
And those types of things, I think, were, you know, really made me a better researcher.
And that's part of why I think I'm able today to do what I can do.
But it's very bumpy during the project.
I think I learned more about that than
I think I learned technically, probably in this project, which isn't to say I didn't learn
technical things, but just that's what I would focus on. Did that experience sort of shape your
decision to become an assistant professor? Or was that something you always wanted to do, to continue in academia? Did that experience, and sort of the resilience and the things you learned across it about how to manage a complex project over a long period of time, sort of decide, oh yeah, I want to stick at this and continue in this line?
I think it certainly was a trial. I mean, I came into grad school thinking I wanted to be an assistant professor, so it's not the reason I
decided to go be one.
It certainly would have been a really great time to learn that I didn't want to be an
assistant professor.
And, you know, instead, here I am.
So obviously, it told me something about, you know, sticking it out.
I think it's made me a much better assistant professor for no other reason than the fact
that I've worked with like four different styles of advising.
So I've seen a lot more things than I would have otherwise.
And I think I'm more prepared than I would have been otherwise to lead projects and help students in different ways.
So I view the whole process as a big part of of really shaping my style, really shaping where I am.
And so, yeah, I mean, I think it was really instrumental in certainly in what I do now,
if not instrumental in my career path, certainly instrumental in that side of how I run a lab and
how I interact with students, et cetera. Brilliant. So you mentioned that it's very
bumpy. What were the things that you tried that kind of failed across this journey?
Maybe you could touch on those a little bit.
Well, I mentioned this slightly earlier,
but I spent a lot of time sort of adding features that wound up being like totally useless.
So I spent a lot of time trying to add new features
that seemed like they'd be really convenient to have.
And in hindsight, just really didn't matter.
And in some cases, they didn't matter because I was kind of approaching the problem backwards
in the beginning. I was thinking from the perspective of build the model and then find
the use cases. And if you make the model good enough, then you'll have all the use cases, right? And especially for this work where it wound up being that like naive SQL wasn't
enough, I had to do other things. What I eventually learned after like a year of working on this
project was I needed to totally flip my way of thinking. I needed to start from saying,
here are the things I want to do. Here are the things I need this model
for. What are the implementation techniques and features that I need to make those things work
well? And so you had to start from the perspective of use cases rather than starting from the
perspective of the system. And I actually knew that ahead of time. My advisor always started
from the perspective of use cases. When I started my PhD,
I just hadn't internalized that until I was the one really leading the project, right? So that's a great example. I spent so much time working on features that you won't find in
the paper and you won't find in the artifact because they just didn't matter. And those,
I would say those are kind of like probably the biggest things. The other thing that I did is, and I mentioned this, that Steam Drill has its own instrumentation framework for how we instrument code. And I needed that because the existing frameworks were either really slow, or they couldn't support adding arbitrary instrumentation to an execution. I dragged my feet for a really long time on that
task and I've tried really hard not to do it. And I think if I would have just from the beginning
accepted the reality that I needed to build my own, it really wouldn't have been that painful
and I would have gotten through it much faster. So that's a little bit of like a, you know, I think
don't be afraid of hard work. And I was really afraid of spending the week
really banging my head against the wall because that code is really hard. You're dealing with
low-level assembly code, you're adding functions to it. It looks like very challenging work. But sometimes you
just have to kind of sit down and be willing to put your nose to the grindstone and do the work.
Again, ironically, this is something my first advisor taught me. And I just did not know it until I really had to lead a project and do it. So
that's so funny. Anyhow, so I think those are like probably the two best examples of places where I
really was wrong, and then I eventually became right. And the nice thing is I kind of learned
both technically things that I needed to do and then also kind of
like life lessons afterwards. So those were sort of really great spots in the project, looking back.
Awesome. So where do you go from here? What's next for the OmniTable, for SteamDrill? And I know you're working on many other projects as well, so what's next for your research agenda?
Yeah, so we'll start with what's next for the OmniTable and SteamDrill. So there's a few
things that I want to do. One of the things about the paper as it stands today, we're
looking specifically at debugging a single execution of your program and all the programs
are also, they're multi-threaded, but they're all single-machine programs. And so what I'd like to do is take this model further. There are sort of two straightforward directions for extending the OmniTable, I think. One is, what if I extend this to a distributed system?
What if I have a collection of Omnitables for each of the nodes? And how do I debug that? How
do I think about what a query over a distributed OmniTable looks like? What am I
looking for? Those types of things. So that's one thing we've been thinking about as a group.
Another thing is, you know, maybe I'm not looking for distributed systems, but instead I have lots
of different Omni tables. If each one's an execution of a program, I now have, let's say,
thousands of times I've run my program. I hold on to all
those Omni tables and now I'm writing queries over all of the past executions of my program.
So what do those queries look like, right? Like what can I learn and how do I make that efficient
to query the universe of times when I've executed a program? So those I think are like what's next
for the Omni table. One extra thing I'll mention,
we're also thinking about, I mentioned that it's very hard today, you have this custom kernel,
it's very hard for you to start using Steam Drill the tool. We've been spending a fair bit of time
thinking about how we can make that easier, that burden easier to get into using record and replay
in the particular way that we want to be able to use record and replay.
And we're not the only ones thinking about that.
There's like dozens and dozens of papers, and I'm skipping citing all of them and not doing my academic due diligence.
But the point is, I think that there's more work to be done there to ease the burden of
adopting some of these tools, and we're looking into doing that as well.
So that's what's next for the Omni table.
More broadly, like what am I working on today?
In some ways, I've sort of pulled back a little bit
from just working on software reliability.
I'm working a lot on,
I'm doing some work on disaggregated computing
where people are thinking about what happens
when some of my memory for my system might be far away,
you know, like literally far away from the rest of my compute and maybe some other memory. How do
I take advantage of that to make systems still perform well? And one of the things I've enjoyed
about that work is being able to use some of the stuff I learned about inspecting programs,
maybe not debugging them, but certainly inspecting them. A lot of the tooling that the Omni table winds up being about
is about profiling, understanding what happens inside applications. And a lot of what I'm doing
in this disaggregated space is about profiling and understanding what's going on inside an execution,
inside a program to understand how to take better advantage of them. So I'm actually surprised at
this point, how much carryover there is
across the techniques and the ideas from the two spaces.
So that's like kind of the other thing that I've been spending
a fair bit of time on recently.
Awesome.
Just rolling it back to the distributed system sort of limitation.
I would find that tool super useful in my day-to-day life
because, trying to debug, we have like the Raft consensus protocol, which
is kind of a core component of the stack here.
And trying to debug that is an absolute nightmare.
So any tool that could make that process less soul-destroying
would be gladly appreciated from me.
Absolutely.
Yeah, I mean, I think to sort of build on that,
I think we're moving towards a world in which, you know, first of all, most applications already are distributed. Not most, I shouldn't say most. And we're also moving towards disaggregation of hardware, and those kind of look like distributed systems as well. And so over time, I think adding that component
of being able to use this system across multiple memory spaces and understand how those things
interact is a really useful and important contribution. So definitely I see that as a,
yeah, definitely something we're looking at.
So cool.
Kind of penultimate question.
Now, what do you think is the biggest challenge in your research area now?
Yeah, this is a really interesting question.
I don't know about the biggest challenge, right? That's such a tough question, but I'll describe a challenge that I think is important.
And that has to do with understanding what's called like a software oracle. So a lot of work
in testing, and I would argue the OmniTable would benefit from this as well, relies on some notion of correctness in order to say whether a program
execution was, you know, correct. Did I find some bug with this test? Or did that test
fail? And that process of defining oracles today is typically pretty complex, and it's often a bit
ad hoc. And so the way that works today, a lot of times is we go out and we survey a bunch of bugs,
we find common patterns of what those
bugs look like, and then we use that to derive our oracles. And I think we can do a better job.
I think we can look at potentially languages or structures to do a better job of letting developers
specify correctness of their programs a little bit easier and maybe a little bit more of a
straightforward way so that we can then,
as testing tools or debugging tools, find those problems for them and help give them more clues
in terms of what went wrong. So I think that's maybe the problem. I don't want to say the biggest
challenge, but a big challenge and a big problem in the reliability space today. That's what I'd say.
Brilliant. Right. So it's time for the last word now. What's what I'd say. Brilliant, right.
So it's time for the last word now.
What's the one key thing you want the listeners today
to take away from Omnitable
and from your research in general?
Oh, man.
You know, it's really hard.
I think what I would say,
and this gets back to when you asked me, like, what's the key result of the OmniTable? And I sort of was like, oh, you know, I had already said the numerical results, right? Like, all the results of, oh, you know, we're this much better. The key result in some ways, like when I tell people the story, is that, okay,
when I started telling you this, you didn't think it was possible.
And by the end, you kind of believed it was possible.
And that, that tells you that there's this like big vision that seems not quite realizable. And then
you realize it at the end. And in an abstract sense, I would like to read a lot more research
papers. Oh man, now it sounds like I'm so conceited, but I would like to read a lot
more research papers where you start the paper thinking like this doesn't seem possible.
And you finish the paper being like, wow, that was possible. And so the key takeaway, I guess, if I was to encourage, like,
grad students or people who are interested in computer systems research: go do stuff that doesn't seem possible, you know. And I think that's kind of where... I don't know, I do distinctly remember, at this conference I presented an early version of this paper at, a colleague, and I won't name who it was, but someone I very much deeply admire, and they're a really awesome person, they said, wow, this seems really, really ambitious. And at the time I said to them, like, I don't know, it seems pretty straightforward. It's just building on my work, you know, it'll be fine, right? And I was totally wrong, of course. But yeah, looking back, I'm really glad, I'm really proud, that I went through that and still sort of stuck with the problem that seemed kind of really ambitious to start as the last part of your thesis, when presumably you want to graduate soon. So yeah, I think that's maybe the key takeaway from the research, which I know isn't a technical idea at all. It's a bigger idea than that. But I think that's the impact research can have, sort of non-technically, right? Go do stuff that seems not possible.
That's amazing. Well, I think we will end it there.
Thanks so much, Andrew.
That was brilliant.
And if you're interested in knowing more about Andrew's work,
all the links to all the relevant materials
we've put in the show notes,
and we will see you next time
for more awesome computer science research.
Yeah, thanks, Jack.