SemiWiki.com - Video EP7: The impact of Undo’s Time Travel Debugging with Greg Law

Starting point is 00:00:00 Hello, my name is Daniel Nenni, the founder of SemiWiki, the open forum for semiconductor professionals. Welcome to the Semiconductor Insiders video series, where we take 10 minutes to discuss leading edge semiconductor design challenges with industry experts. My guest today is Greg Law, CEO of Undo. He's a C++ debugging expert, well-known conference speaker, and the founder of Undo. Welcome to Semiconductor Insiders, Greg. Hey, very happy to be here. Thanks for having me.

Starting point is 00:00:30 So, Greg, for the last 12 years, you have worked with all the EDA tool vendors and some of their customers building semiconductors. What trends and common challenges are you seeing across the board? Hey, well, I think it's really just the pressure to deliver on time. The pressure to deliver quality designs on time just gets ever greater, right? And I think we've seen that particularly in the last few years: with all the advances that we're seeing, there's ever more pressure for a faster design cycle, getting those designs to customers. But they've got to be quality as well, right? It's not just speed, it's also the quality.

Starting point is 00:01:05 Right, and how does Undo help with this problem of engineering teams spending far too much time debugging? Yeah, sure. So, I mean, some of the shifts we're seeing in terms of the approach to getting these designs out quicker and at higher quality are really through the shift left, okay, and getting more of the development done earlier

Starting point is 00:01:25 in the cycle. And that means virtual prototyping, okay, it means modeling of architectures, often just in pure software before we try and turn them into gates. And really all of these approaches are essential now. They've changed from being perhaps what the leading

Starting point is 00:01:41 companies were doing maybe five years ago to now it's like table stakes. You have to be doing this stuff just to keep up. And where Undo comes into that is that this is becoming increasingly about software. So our background is we come from a space of helping software companies. So it was the EDA vendors in the early days. In fact, they were some of the first companies that we started to work with. Since then, we've worked with all of the enterprise software people that you might expect, right? So Amazon, Bloomberg, and Cisco, and people like this. And then what we're seeing just in the last maybe year or two

Starting point is 00:02:16 is increasing numbers of silicon companies are becoming our customers, right? So not just the EDA companies, but the customers of the EDA companies too. And that's because of this shift left, which means that like more and more of the development effort is really being done in software, right? Through virtual prototyping and all the other stuff. And so the same stuff about Undo that helps these customers, helps these software companies to produce better code faster is now just becoming super relevant to silicon companies

Starting point is 00:02:45 as well. And, you know, a lot of these silicon companies are systems companies, right? So Apple and Tesla, I mean, they make the whole system. So there's a huge amount of software behind it. Absolutely. Yeah, yeah. So it's, yeah, indeed a lot of that vertical integration. But, you know, all of the silicon companies now produce

Starting point is 00:03:01 a lot of software. But what's changed in the last few years is that it's also the silicon teams who are actually producing more software. Not that they're shipping that software, right? Because it's a C model of the design or peripherals, or it's virtual prototyping, right? That software never leaves the building, but it is essential to developing the latest version of the chip. So could you translate that into the kind of business impact engineering leaders can

Starting point is 00:03:26 make with your solution? Yeah, yeah, yeah. So I mean, the context of the problem here is what we just spoke about, right? So, you know, hardware and silicon developers are spending up to 50% of their time debugging these models and these SystemC designs. So, you know, it could be a model of a Verilog implementation, or increasingly these days it could be SystemC, right, just implementing the chip directly in, you know, C++ basically. And then what we help with at Undo is to allow the software

Starting point is 00:04:01 engineers to see exactly what their code did, right? Rather than just what they thought it was going to do. And we do that through a kind of three-phase approach, right? So that's record, then replay, and then finally resolve the issue. So you record the execution. So that might be through high-level synthesis. It might be a SystemC design, or it might be a model of the chip you're making. It might be a model of the peripheral.

Starting point is 00:04:24 And you run that code, and when it does something that you weren't expecting, you have a recording of it. And this recording allows you then to replay the execution right down to the line-by-line level. The developer can now see every single line of code that executed and every piece of data for every line, and they can wind back to any point in

Starting point is 00:04:45 the program execution. So it's like complete information about what that piece of software was doing, which then makes that third step of resolution just, you know, super straightforward. Right? There's no longer any of this "But how did that happen? How did I get here? I think maybe..." You can just see exactly what happened and resolve these issues much faster, and therefore get much more coverage, much faster implementation of what you're doing. Okay, so that's what we do, right? That's what this time travel debugging approach is

Starting point is 00:05:18 and what that means really in terms of how that looks to, whether it's to the engineer or to the development organization, the traditional one is at the top here, right? Where we have this loop and you go around this loop, like the code's not doing what I thought it was doing. Typically what people do is like add more logging, right? Or they get, maybe they get a dump or something.

Starting point is 00:05:35 The most common actually is just add another printf, run it again, let's see what happened that time. And then, you know, step by step, I go around this loop an indeterminate number of times. It's one of the problems actually: I've got no idea how many times I'm going to go around this loop until I've finished or maybe even given up. Now at the bottom, you've got the time travel debugging approach, which is just this straight line.
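
To make the idea of recording an execution and then winding back through it more concrete, here is a toy sketch in Python. It is an illustration only and is not how Undo works or what its interface looks like: the `Recording` class, its methods, and the variable names `fifo_depth` and `credits` are all hypothetical. It keeps an append-only log of variable writes, rebuilds state at any step by replaying the log, and walks backwards to find where a value was last changed.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Write:
    step: int    # position in the recorded execution
    name: str    # variable that was written
    value: Any   # value it was given
    where: str   # hypothetical label for the code location


class Recording:
    """Toy execution recording: an append-only log of variable writes."""

    def __init__(self) -> None:
        self.writes: List[Write] = []

    def record(self, name: str, value: Any, where: str) -> None:
        self.writes.append(Write(len(self.writes), name, value, where))

    def state_at(self, step: int) -> Dict[str, Any]:
        """Replay the log up to and including `step` to rebuild state."""
        state: Dict[str, Any] = {}
        for w in self.writes[: step + 1]:
            state[w.name] = w.value
        return state

    def last_write(self, name: str, before: int) -> Optional[Write]:
        """Walk backwards from `before` to the most recent write of `name`."""
        for w in reversed(self.writes[:before]):
            if w.name == name:
                return w
        return None


# Record a short, made-up run of a model.
rec = Recording()
rec.record("fifo_depth", 4, "init")
rec.record("credits", 4, "init")
rec.record("credits", 3, "send_packet")
rec.record("fifo_depth", 0, "reset_handler")  # the unexpected change
rec.record("credits", 2, "send_packet")

# At the end of the run fifo_depth is 0; find out where that came from.
culprit = rec.last_write("fifo_depth", before=len(rec.writes))
print(culprit.where, culprit.step)
```

The point of the sketch is only that, once every write is in the log, the question "who changed this?" becomes a lookup rather than another round of adding print statements and rerunning.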

Starting point is 00:05:54 Okay. You take the program recording, you replay it, you resolve it. There's no iteration, no going around again for new builds and more information. Everything you need is just there. So it's not just faster, it's so much more predictable. And that's really kind of how that looks. And then what that means in terms of the benefits that people see, which was really the question you were asking, so I'm going to get to that, is understanding what really happened, and knowing what really happened,

Starting point is 00:06:20 rather than trying to make these guesses, which in turn lets the engineer root cause what's really happened just with ease. They can even trace back through the code flow, through the data flow. I've got some piece of data, some signal or some variable that's in a state that I didn't expect. I can just wind the tape straight back to where that last got changed. And I can keep on doing that, keep following the chain back to the ultimate root cause, which is especially valuable when it comes to intermittent failures. Okay, everybody has these very expansive regression test suites now. And those regression tests, they're not always 100%, okay? Sometimes you might get a failure,

Starting point is 00:07:06 you know, one in 100, one in 1,000. If you can get a recording of that, just capture it just once, then the intermittent bug problem basically goes away. We had a customer just recently who was struggling because less than 50% of their clean regression runs would just run, you know, all green, even if you hadn't made any changes. And using this, they got that up into the high 90s, like 97, 98% just, you know,

Starting point is 00:07:31 reliable green runs, which then has a big cultural impact, right? Because down at that level of 50% failure, or even when 70, 80% is green, the engineers stop trusting the tests. Okay, and they see a failure and they say, well, it probably wasn't my fault. There's this inherent kind of flakiness in my test suite. I'll just run it again, see if I get lucky next time. Yeah, okay, off I go.
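
As a rough feel for the "one in 100, one in 1,000" failure rates mentioned above, the short Python sketch below estimates how many runs it takes to see such a failure even once. It assumes, purely for illustration, that each run fails independently with a fixed probability; real flaky tests may not behave that way, and the function names are made up for this example.

```python
import math


def expected_runs_until_failure(failure_rate: float) -> float:
    """Mean number of runs until the first failure (geometric distribution)."""
    return 1.0 / failure_rate


def runs_to_capture(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one failure in n runs) >= confidence.

    Assumes independent runs: P(no failure in n runs) = (1 - failure_rate) ** n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


for rate in (0.01, 0.001):
    print(
        f"failure rate {rate}: "
        f"about {expected_runs_until_failure(rate):.0f} runs on average, "
        f"{runs_to_capture(rate)} runs for a 95% chance of seeing it once"
    )
```

Under these assumptions, reproducing a rare failure on demand takes hundreds or thousands of runs, which is why capturing it the one time it does occur matters so much.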

Starting point is 00:07:52 And that just then becomes this self-fulfilling prophecy and just gets worse and worse. And then the other final thing I just wanna comment on here is the ability for collaboration. So these days, the systems we build are so complex, no single human can get them all in their head. It's a collaborative effort. And if you look at how engineers collaborate on resolving any kind of issue, you'll often get a long trail of comments on your GitHub or Bugzilla or whatever it is you're

Starting point is 00:08:22 tracking this in. "It could be this," and they're asking questions and people are jumping in. Now you can take a recording. And one of the nice things about these recordings is that they're portable. So you can take that recording, give it to your colleague and say, hey, I've seen this thing at, like, time six minutes, 14 seconds, and that looks weird to me.

Starting point is 00:08:38 Can you explain what's going on there? And it's just a big kind of collaboration win, especially as we work increasingly not all in the same office, right? So it's not so easy these days to say, hey, come and take a look at this, and pull up a chair and work on it together. Sometimes you get to do that,

Starting point is 00:08:54 but a lot of times people are remote and they're asynchronous. And so the collaboration you get through recordings is a big win. That's great. So Greg, how do companies generally engage with Undo? People always want to take this and try it for themselves. We work through it. We collaborate closely with customers

Starting point is 00:09:15 while they are going through what we call the bug hunt process, where one of the common reactions we get is, well, this sort of sounds great, but surely this is too good to be true. And people want to see it working for real as they engage with us. So yeah, there's a number of different ways that we kind of go about that. It will depend exactly on the customer. And then what they're looking for is to validate not just does the technology work, because yeah, it's kind of cool. But we don't want something that's just cool and a neat trick.

Starting point is 00:09:46 We want to understand that it really has the business impact. Right. And so what we're looking to do in that evaluation process is demonstrate these three key points. Right. So the first is that, using this technology, customers can get to market faster, right? They can reduce the time taken to produce these complex SoCs, ASICs and the rest,

Starting point is 00:10:07 and not just getting to market faster, but doing it more productively. Okay, so getting to market faster with the same or sometimes even fewer resources, certainly no extra resources. And even better than that, the kind of third leg of the stool here is that not only do you get out faster with better productivity, what you get out the end is also better. You get improved quality. One of our silicon customers was explaining to me the other day how they make a model of a chip in C++ before, or really in parallel with, the RTL team turning it into silicon. And they run these workloads on the model, kind of trying to characterize them.

Starting point is 00:10:47 And previously, they were able to get like 50% coverage of the workloads they wanted to cover on the model, because they just couldn't get the weird little differences out of the model enough to run all of the workloads. Now, with Undo, with time travel, they can get 80, 90% coverage. So you have a better understanding of how that silicon is going to perform when it ships.

Starting point is 00:11:12 So it's these three key business impacts: faster time to market, more productive, and better quality of what you do ship. That's what we're looking to demonstrate in the process when we engage with a customer. Once you've demonstrated all those things, then it's a bit of a no-brainer really to then adopt the technology. Great conversation, Greg. Thank you for your time. We will see you next at the Design Automation Conference.

Starting point is 00:11:36 That concludes our video. Thank you for watching and have a nice day.
