Embedded - 473: Math Is Not the Answer
Episode Date: March 21, 2024

Philip Koopman joined us to talk about how modulo 255 vs 256 makes a huge difference in checksum error detection, how to get the most out of your checksum or CRC, and why understanding how they work is worth the effort.

Philip has recently published Understanding Checksums and Cyclic Redundancy Checks. He's better known for Better Embedded System Software as well as his two books about safety and autonomous vehicles: The UL 4600 Guidebook: What to Include in an Autonomous Vehicle Safety Case and How Safe Is Safe Enough?: Measuring and Predicting Autonomous Vehicle Safety.

Phil's YouTube page has a number of videos with great visuals to go along with his books. He also has three(!) blogs: Safe Autonomy, Better Embedded System SW, and Checksum and CRC Central (including a post on checksum speed comparison). Currently, Phil is a professor at Carnegie Mellon University (his page there). You can follow him on LinkedIn.

Elecia read (and gave 2.5 stars to) Symmetry: A Journey into the Patterns of Nature by Marcus du Sautoy: "Interesting but uneven, I kept reading to find out what horrible things math profs do to their children in the name of fun. Worth it when I finally got to a small section with Claude Shannon (and Richard Hamming). It didn't help with this podcast but it was neat."

Nordic Semiconductor empowers wireless innovation by providing hardware, software, tools and services that allow developers to create the IoT products of tomorrow. Learn more about Nordic Semiconductor at nordicsemi.com, check out the DevAcademy at academy.nordicsemi.com and interact with the Nordic Devzone community at devzone.nordicsemi.com.
Transcript
Welcome to Embedded.
I am Elecia White, alongside Christopher White.
Our guest this week is Philip Koopman, author of a new book,
Understanding Checksums and Cyclic Redundancy Checks.
Now wait, before you wander off, there's a lot more we're going to talk about and it's going to be very fun. Cyclic redundancy checks are quite fun.
Hello, Philip. Welcome back to the show. Thanks for having me. I was going to jump
in right there, but I'll be snarky later. We're good. Could you tell us about yourself as if we
met on, I don't know, open house day at CMU?
You're starting with a hard question because I've done like a whole bunch of things. So,
here's the short version. And we're going to come back to my career and maybe that's fine.
I was in the military. I was in industry. I'm currently sort of ending my cycle at university,
shifting over to being a consultant. And I've been doing embedded systems and embedded networks for a really long time,
and also safety in a variety of places.
So I worked for the Navy.
I worked for United Technologies.
I was a CPU designer at Harris Semiconductor.
I've been teaching at Carnegie Mellon University.
And I stopped counting at 200 design reviews for industries. You know, get in a car, get on a plane, go look at some embedded stuff, give them some ideas how to improve.
I stopped counting at 200, and that's got to be 10 years ago.
So, you know, done a lot of stuff.
How's that?
Well, then all this stuff leads to a couple of pretty interesting books about what you've done.
Yeah, yeah. I've been busy lately. There's the orange book, as I call it, which is the Better Embedded System Software, which is now, what, almost more than 10 years old, but still going fine. And that's what I learned in my first 90 design reviews. And I've learned a lot more
since then, but that's not the book we're here to talk about today. Actually, are you thinking
about doing a second edition for that? I have to decide. I need, as you well know, Elecia,
putting out a book is a somewhat traumatic experience. So I have to recover. And it turns
out the second time is no less traumatic. It turns out. That is correct. You think you're going to go in and just do a second
edition and make a few edits and then suddenly you've rewritten the whole book and added 100
pages. Yes. And at this point, I don't know that a second edition is the right thing because that
book has held up remarkably well. I mean, the hardcover printing sold out.
And then so it's now paperback and Amazon and doing way better than I expected.
But it hasn't changed that much.
And so now I think if I were to do a second edition, it might be better to leave that as it is and do something else that is more helpful rather than putting a lot of effort in just replowing the same ground.
I didn't modify much from my book.
I added stuff.
I moved it around.
I made the instructions better.
And at some point, when you're an inch thick, maybe it's time for another book, right?
Okay.
Are you ready for a lightning round?
Sure. Lay it on me.
Okay. Least favorite polynomial.
0xEDB88320. That's my story. I'm sticking to it.
On a scale of 1 to 10, where 1 is a Flintstones car and 10 is hovercars of the future. Where are we with self-driving?
Well,
you know, like
four or five.
I don't know.
That scale is kind of hard to calibrate.
It's a hard scale.
So four or five could be, you know, just
awful or mediocre.
And that's
exactly the issue. I was not sure.
We can come back.
At some point, if you want to come back and talk about
that, we certainly can, but
goodness knows I've done enough podcasts on that
and this is a chance for a brush of
fresh air for me. Favorite course
to teach? The one I'm teaching
now, which is
half embedded code quality,
one quarter safety, one quarter security.
What is your preferred programming language?
I've been using C and C++ because I'm old. I've stopped counting at 50 programming languages.
I'll use whatever is thrown at me. But really, an embedded, that's still what's out
there in embedded. Rust may come by someday, but it's going to take a while. And I'm kind of
language agnostic. Whatever it takes, it's fine. Favorite bagel place in Pittsburgh?
Oh, well, I guess that Brugger's does pretty well for me, but there's a lot of other places you can go, boutique places.
So it depends what mood I'm in.
Do you have a favorite fictional robot?
Well, it's a tie.
So for some reasons, it's Huey, Dewey, and Louie from Silent Running,
partly because they're obscure and partly because that's what I named my Roombas.
Okay.
But I would be remiss if I did not mention Johnny Cab,
from the first movie, right?
He's hysterical.
You know, we hope you enjoyed the ride.
What movie was that?
Total Recall.
Thank you.
Yes.
1990.
Yes.
Yes.
I'd like to thank our sponsor this week, Nordic Semiconductor,
which specializes in ultra-low-power wireless communication.
They have a wide technology portfolio, including Bluetooth Low Energy,
Low Energy Audio, Bluetooth Mesh, Thread, Matter, Cellular IoT, and Wi-Fi. They have thousands of customers worldwide with
40% market share in Bluetooth Low Energy. They have nearly 2 million systems-on-chip produced
every day. Sounds like a system that you can't go wrong with. So please check out nordicsemi.com. And if you would like to win
one of the Power Profiling Kit 2s
that our sponsor is so generously handing out,
please send us an email that is show at embedded.fm
and tell us what your favorite PPK2 feature or spec is.
Again, thank you to Nordic
for sponsoring this week's show.
Okay, let's get to the real questions.
Sure.
You mentioned the Orange Book,
Better Embedded System Software,
and that's been around a while.
That was like 2010, I think,
the hardcover.
Yeah, it's been a while.
And now you have a book that just recently came out.
And it's about CRCs and checksums.
Isn't that a done sort of field?
Well, it's pretty niche-y.
I'll save you having to say that.
Well, it goes back to the 60s, I guess.
And while the mechanics of CRCs are done, the understanding of them took decades.
And there's still some open questions as it turns out.
I said, okay, I think I understand enough now to write the book.
And then I got, there's like a thing in chapter 10, it's like, you know what? All the stuff that's been written
in the last 20 years sort of missed the important part, and I don't know the answer either,
so we're going to leave that part open. This is the kind of thing that takes decades to figure out.
But to be sure, inside the book are the answers to all the questions I've really heard.
And there's one or two niche, really, really narrow places that are still open.
But for most people's purposes, the answers are all there.
And it really takes a lot of work to nail it down because there's so much going on.
And really, there's a lot of folklore.
There's a lot of misinformation.
There's a lot of people just saying, oh, this will be fine.
And no, it's not. So, there's just a lot of that. I mean,
inventing the technology was very cool and done a long time ago. Understanding it has taken a long time.
I mean, don't you just take your list of parameters about how long you're willing for your check bits to be and how Christopher keeps trying to interrupt and how long it takes to calculate and how long it takes to verify. You just take all these
parameters and you shove them in and you find out which CRC is best. Doing this every show now.
Can we start with a definition of what a CRC or checksum is for people who might not be familiar?
Because I want people to understand what we're talking about
before we ask which elliptical curve equation
is the right one to use for some...
Okay, let's start with data word and...
In fairness, so you don't beat up Christopher,
I was going to go there too, so we're good.
Okay, so let's start with what is a CRC and why would I use one?
I want to start earlier. Let's start with a
checksum and then talk about CRCs because they're really different technology. Yes, right. Checksums
are easy. Let's all do checksums. Well, yeah, actually not. Depends which one. All right. So
let's start at the basics. You got a piece of data and things go wrong in the world. And this is how I got involved because I do safety, I do dependability. Data rots. You know, the bits go bad, bits rot, they flip, whatever. You get a bad bit in a data transmission, you have a piece of memory and it suffers a bit flip before you read it back. And so data can accumulate faults,
especially in embedded systems which have long operational cycles and noisy environments.
You're going to get bit errors in your network messages and in data storage. And we may talk
more about network messages, but the principles are pretty much the same. And so if you have a
piece of data and a bit goes bad, you really want to know that it's gone bad.
And so the idea is you take this piece of data you care about and you add another piece of data next to it, which is sort of a summary, a digest.
And if it's gone bad,
this little digest thing next to it will tell you it went bad.
So you can do a consistency check between the data and the little digest, the checksum.
And the checksum will say, you know what?
That data isn't what it's supposed to be because I have a different answer. And when you store the data, you store the data and you run a computation across the data
that crunches it down into a little digest, and you store the digest as well.
And then you send it or you put it in memory, whatever.
When you pull them back out, you take the data, you run the same computation,
and you compare the result to the stored digest.
If they're different, you know something went bad.
And so the idea of a checksum is to give you a high probability way of detecting faults.
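(To make that concrete, here's a minimal C sketch of the store-then-verify flow just described. The digest8 function is a placeholder byte-sum digest, purely for illustration, not any particular algorithm from the book.)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder digest: add up all the bytes mod 256. Illustrative only --
   the episode explains why plain mod-256 addition is a weak choice. */
static uint8_t digest8(const uint8_t *data, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

int main(void) {
    uint8_t msg[] = "sensor frame";
    uint8_t stored = digest8(msg, sizeof msg);  /* store the digest next to the data */

    msg[3] ^= 0x04;                             /* simulate a bit rotting in memory */

    /* Recompute and compare: a mismatch means the data (or the digest) went bad. */
    if (digest8(msg, sizeof msg) != stored)
        puts("fault detected");
    else
        puts("data looks consistent (or we got unlucky)");
    return 0;
}
```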
I mean, you could do something really silly,
like you could take a blob of data and then you could make a copy of it,
but that doesn't tell you which one's right.
So you have to make another copy so you can vote between the three.
Oh, and it's even worse than that.
If you make two or three copies, the typical faults are the bigger the data is,
the more probability something will go wrong.
Exactly.
So making two copies and comparing is like the worst idea ever
because you just doubled the chance one of them will be wrong.
And as you said, you don't know which
one. So then you send three
and if two compare,
probably that's the right one.
But what if all three are different?
50 copies and you average them.
Yeah, and probably that's not
how you want to go. And the
point of checksums, and I'm going to use
checksums generically sometimes, so I include CRCs in them, right? The point of checksums as a generic is to do better
by saying, well, I've got a thousand bits or a few kilobits of data. You can tell I'm an
embedded network guy. You've got a few kilobits of data and I have eight bits or 16 bits,
and that's really all I need to, with very high probability, detect that something went wrong in the hundreds or thousands of bits.
That's the idea.
So you have these 16, 32 bits that are checking your kilobytes of data, but it only tells you if something's wrong.
It doesn't tell you what's wrong.
That's error correcting. So error detection is something's wrong. It's yes, no question.
Actually, it's a yes and probably no question. Right. Because there's some probability you got
unlucky and there's a fault in the check value that compensates for the fault in the data value.
And so a lot of why you would use one checksum or another or one CRC or another is to reduce that residual probability of an undetected fault.
But yeah, it's telling you it's wrong. Now, you can do better if you have a higher-powered, I'm using that term loosely, a higher-powered check computation.
You can actually do error correction, but this book doesn't go there.
Okay.
And the reason is there's all these books on coding theory.
It's like the world arguably does not need yet another book on coding theory, okay?
There's plenty of books on coding theory.
Aficionados may differ, and I respect that.
That's fine.
It's not what I've done for a living.
But they all have like this one page with some dense math saying, here's CRCs.
Now let's get to the cool stuff.
And if you're doing embedded networks, you care about the CRCs and checksums, and that page of math doesn't do you much good, right?
And so this is the book that takes that page or two, expands it to a book, and calls all of those guys out on all the hand-waving they're doing that actually doesn't give you the answers you need.
Does your book answer the question of unbalanced zeros or ones?
I mean, I always thought that if you're dealing with like Flash, it's not a 50-50 probability when a bit goes bad, whether it's going to zero or one.
That's right.
It does in two different dimensions.
So what I do, at the beginning of the book, it runs through all the algorithms.
And we can sort of run down the list if you want at some point.
But it runs down single sum and dual sum.
And I invented a new checksum writing this book, the Kotman checksum, because my wife said I had to name it after me.
And that's a true story.
And CRCs.
And they all have different performance tradeoffs.
But then I have some chapters saying, what about this?
What about that?
What about the other thing?
And one of the sections is, well, what about if it's not all zeros or all ones, right?
And for some of the checksums, the data values matter.
For any of the checksums that involve integer addition, the data values matter.
And the more ones or zeros you have, if it's all zeros or all ones, the checksums actually do really well.
And the worst case is actually a 50% bit mix.
And it doesn't matter which way they fail.
A 50% bit mix is the worst case, which was kind of interesting.
For Koopman checksums and CRCs, the data values don't matter at all. What matters is how many bits have been flipped, not the direction of the bit
flips or the data values. And the technical reason is integer addition is done with
carry bits, and the carry bits are affected by the data values. But CRCs are all XOR operations,
and so the fault pattern is completely unrelated to the data pattern.
Okay, wait a minute. Say that again.
The fault pattern in CRCs, whether you can detect your fault or not,
has absolutely nothing to do with the data values.
But with checksums?
With addition-based checksums, it does,
because in CRCs, it's about bit mixing.
And I can try and verbally sketch that for listeners who haven't heard it.
But actually, let me sort of quickly sketch them out so we have something to work with, okay?
The general classes are, they're single addition checksums.
We take all the,
I'm going to call them chunks to make it obvious
that I don't want to get too much
into the specific terminology.
You have a big data word,
you break it up into chunks,
like a chunk might be 32 bits
or 16 bits or 8 bits,
but it doesn't have to be.
You have data where you break it up
into chunks and you add up
all the chunks using integer addition.
And what do you get at the end is your checks
on value.
And whether or not that detects faults
depends on not only
where the bits are, but
the carry patterns during the addition,
if that makes sense.
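(A minimal sketch of that single-sum idea in C, assuming 16-bit chunks; the chunk size is just an example.)

```c
#include <stdint.h>
#include <stddef.h>

/* Single-sum checksum: break the data word into chunks and add them up
   with ordinary integer addition. Whether a fault gets detected depends
   not just on which bits flipped but on the carry patterns during the
   addition, which depend on the data values themselves. */
uint16_t single_sum16(const uint16_t *chunks, size_t n) {
    uint16_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += chunks[i];  /* carries mix bits, but the top-bit carry-out is thrown away */
    return sum;
}
```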
Yeah, because you're getting
mixed. If you XOR,
if you just XOR all the chunks,
what you get is a bitwise XOR called a longitudinal redundancy check, LRC.
And that one can be fooled by pairs of bits in the same location.
But if you're adding, you get carries and you get better mixing because the carry bits sort of mix the values across.
They smear them across the width.
And so it performs better.
You get better fault detection.
But you only get the fault detection if the values smear.
And that all depends on the bit values, right?
And so depending which bit values were faulted,
you will either get a detectable or undetectable two-bit error, for example.
And whether it's zero to one, one to zero, it doesn't matter.
The direction doesn't matter, it's just that the values matter.
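(And here's a sketch of the XOR variant Phil contrasts it with: the LRC, whose weakness is exactly the paired-bit case he mentions.)

```c
#include <stdint.h>
#include <stddef.h>

/* Longitudinal redundancy check (LRC): bitwise XOR of all the chunks.
   Each bit column is independent, so two flips in the SAME bit position
   of two different chunks cancel out and go undetected -- no carries,
   no mixing, and no dependence on the data values. */
uint8_t lrc8(const uint8_t *chunks, size_t n) {
    uint8_t x = 0;
    for (size_t i = 0; i < n; i++)
        x ^= chunks[i];
    return x;
}
```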
I expect we'll get into this, but is that something that
...
How do you decide
which way to go?
Do you statistically analyze your data?
Or is that just not something
that people really need to look at?
No, CRCs are always better.
Right, that's the question I'm asking.
So, I have a lot of cores on some servers that have spent a lot of time running Monte Carlo simulations.
Okay.
Okay, and you know, for me, a billion simulations is just getting warmed up, okay?
Because it takes that to get the, it's really amazing how many simulations it takes to get a smooth curve, okay? And so these things, this book took a year,
and a lot of that had 8, 16, 24 cores
just running flat out for the whole time
to make the graphs in the book.
And so I have a curve in the book
that says for each type of checksum,
if you have 128, I think it was 128-bit data word,
just to keep it tractable,
if there's zero one bits, it does this.
If there's one one bit, it does this.
If there's two one bits, you get where this is going, right?
And there's these interesting curves that they're worse in the center and really nice at the ends.
And so that's how I know it's 50%.
Yeah.
And so a thing is, I don't actually try and do the mathematical analysis.
I do a lot of, I do some, but not a lot.
I do mostly empirical analysis
because you look at the graph and you say,
oh, I get it.
You can look at the math and well, maybe,
but if you look at the graph, it's like,
oh, it's obvious.
Yeah, that's what's going on.
And so a lot of the book has graphs in it.
I was kind of shocked a few nights ago
I had started reading your book, and my pre-bed book is Symmetry by Marcus du Sautoy, I think is his name.
And it's about group theory.
And I was really surprised to see error identification and error correcting codes in group theory.
Now, I know I didn't prep you for this question at all,
so I'm basically asking you to think on your feet,
but how is group theory related to this?
Can't go there.
I'm not a math guy, shockingly, perhaps.
Christopher.
I'm not that heavy into math.
Let Christopher take that one.
Yeah, let me think back to my abstract algebra class.
You were the one who said elliptical curve, so you're on the hook here.
Let me cast my memory back to 1994 and nope.
Group theory is very large in error correcting codes.
That's all I can tell you.
I don't remember why.
Because you just looked it up on your phone.
I watched you type it into Wikipedia.
No, I was checking to make sure I turned something off.
No, I don't
have an answer for you. I'm sorry.
But this whole area of
computer science
is partially engineering
because you do have to deal with all these trade-offs
of speed of computation
and size of your checksum
versus size of your data.
And not only that, electrical engineering, because what kind of wire, and it doesn't matter in networking so much because you never know what kind of wire you're over, but
in some cases, okay, I'm on.
Other than it's a bad one.
Right. You're on a high-speed serial link, and then there's encoding things with high-speed
serial links where they want to balance the bits. So that might influence your choice. Anyway, yes.
That's also in the chapters, right? Yes.
So, the thing is that math isn't actually the answer here.
Let me write down that as a show title.
That's a good show title. So, you know what? Let me go through the different types,
and then I'll- Math is the question.
I'm in a better position to explain why math isn't the answer, okay? So there's single-sum checksums.
There's also dual-sum checksums.
And many people have heard of a Fletcher checksum or an Adler checksum.
Okay, that's what we're talking about, where you actually compute two running sums at the same time.
And one sum is just the same as a single sum, but the second is the sum of the values of the running sum.
So you're saying, you know,
the value A is chunk one plus chunk two plus chunk three,
but value B is B plus A, B plus A, B plus A.
And what that does is it implicitly multiplies the chunks by their position.
And that makes you resistant to two-bit faults
because it's hard.
Until you sort of wrap around,
you can't get a two-bit fault
that can fool both the current sum
and the sum multiplied by its integer position
in the chunk stream.
Okay.
Okay, so that's the deal with Fletcher checksum.
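(A textbook-style C sketch of those dual running sums, in the spirit of Fletcher-16. Production code defers the mod for speed; this version reduces at every step to keep the structure obvious.)

```c
#include <stdint.h>
#include <stddef.h>

/* Dual-sum (Fletcher-style) checksum over 8-bit chunks. sumA is the plain
   running sum; sumB accumulates the running sumA, which implicitly weights
   each chunk by its position in the stream -- that's what catches the
   two-bit faults a single sum misses. Mod 255, not 256: see below. */
uint16_t fletcher16(const uint8_t *data, size_t n) {
    uint32_t sumA = 0, sumB = 0;
    for (size_t i = 0; i < n; i++) {
        sumA = (sumA + data[i]) % 255;
        sumB = (sumB + sumA) % 255;
    }
    return (uint16_t)((sumB << 8) | sumA);  /* pack the two sums into the check value */
}
```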
And something I found out is that some of the intuition about the modulist is wrong.
So the difference between a Fletcher and an Adler,
Fletcher is, I'm going to use, we're just going to use 8-bit because the numbers are small, but it scales.
If you just add mod 256, you know, just integer add,
it turns out you get some really nasty behavior on the top bit because you throw away the carry out.
And so what you want to do is you want to wrap around the carry out by doing mod 255 or one's complement.
It's basically one's complement addition.
And so a Fletcher checksum uses one's complement addition on both the A sum and the B sum. And I've seen, and actually there are standardized implementations
that use mod 256 instead of mod 255,
and they have horrible performance. Really important
to use one's complement.
Why?
Because when you're adding up a bunch of integers,
remember I said that bit mixing is helping you
from the carries?
If you just do 8-bit integer adds,
you throw away
the carryout,
don't you?
Sure.
And throwing away
the carryout
doubles
the vulnerability
to bit flips
in the top bit,
it turns out.
Okay, okay, okay.
I kind of get that, yeah.
And for the Fletcher
checksum in particular,
or a dual-sum checksum, that implicit multiplication starts losing information if you don't wrap that carry bit back.
And so it becomes vulnerable, basically it's instantly vulnerable to 2-bit faults if you use mod 256 instead of 255.
You don't get the 100% detection of two-bit faults.
You lose it basically instantly.
So it's terrible.
And people make that mistake all the time.
So these are the kind of things in our books,
like don't make this mistake.
And here's the graph that shows you why.
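(The mod-255 point in two lines of C: a one's-complement add folds the carry-out back in instead of discarding it. A sketch, with the usual caveat that 0xFF and 0x00 are both representations of zero in one's complement.)

```c
#include <stdint.h>

/* One's-complement ("end-around carry") addition of two bytes. Doing the
   add in a wider register and folding the carry back in is equivalent to
   summing mod 255, so the information in the top-bit carry-out is kept
   rather than thrown away as a plain mod-256 add would do. */
static uint8_t add_ones_complement(uint8_t a, uint8_t b) {
    uint16_t s = (uint16_t)a + b;
    s = (s & 0xFF) + (s >> 8);  /* fold the carry-out back into the low byte */
    return (uint8_t)s;
}
```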
Sorry, I'm just wondering how often I've made that mistake
and whether or not I should go and fix it.
And I mean, I don't know that I would have.
Yeah. And the thing is, there's no reason that anyone would even know this unless someone did
the analysis, showed them the picture. And that's why I wrote the book. It's because there's so
many people unknowingly making these things that seem, oh, it's no big deal. And it's terrible,
right? And unless someone shows you why, it's no big deal. And it's terrible, right? And unless someone
shows you why, there's no way to know. And how much are you showing with the graphs
versus showing with the math? Very little on the math. So I decided
not to use an equation editor. This entire book was written without an equation editor.
And it isn't like I have really ASCII art math or something silly like that, right?
It's basically the math is all on one line with superscript and subscript type font, and that's it.
And I did that on purpose because there are all these math books, and some folks are fine with the math, and that's great.
Those books are already written.
But some folks, and I'm actually one of them, I look at the math and my eyes glaze over, and I have to actually understand what's going on with the bits.
Then I go back and say, oh, that's what the math was trying to tell me.
Yeah, there are many times when reading the code is easier than reading the math.
Yeah, so I read the math and I understand the math.
That's great, but it's not how I think.
And so this approach in this book is, all right, here's the basics of the math,
but I'm not going to throw pages of math at you.
I'm actually going to run examples.
And so for the dual sum checksum where I said there's implicit multiplication
because of the way the addition works,
I actually run an example so you can see what I mean by that.
And I say, you know, this is added, this is added, this, and look at the end.
You know, at the end, there's four nines and there's three eights and there's two sevens.
And like, oh, look at that.
It was multiplied by the position.
Isn't that cool?
And you have a good use of color in the book.
I liked that because it's hard to follow these things.
I started doing that because trying to do monochrome graphs that you can understand is really painful,
right? And it's like, and there's a lot of, I actually don't know how many pictures,
you know, dozens. And it depends what you mean by a picture at this point, okay? There's a lot of,
you know, I spend a lot of time, it's like, I really want to make this explaining things rather
than just saying, look how smart I am because I can do math, right?
It's not, that's pointless for a book,
or at least I think so.
So a lot of time explaining a lot of pictures,
a lot of graphs, and then the examples,
I use color so you can see where the,
you know, there's a five.
Where'd that five come from?
Oh, that's a green five.
And there's a green five.
Guess what?
Those are the same five moving through.
So I used a lot
of color on the interior to try and make it easier. And then sometimes the word in the text
is also green. It's like, oh, so those are all the same thing. I tried to do that. And apparently it
came out okay because you like it. I wasn't sure it was an experiment, but that's what I was trying
to do. I just found when I read it with the color coding, it was just so much easier for me to follow what was going on and make sure it all
lined up.
So much of CRCs especially
is about
following
a bit through a pattern.
Yes, yes.
And I did that with color. Yes, it is.
There's a step and another step and another
step and I tried to use color to make it
more obvious what's going on with the steps.
That's right.
Because I couldn't put video in, right?
That's a whole different deal.
I mean, you do have a YouTube channel.
So I do have a YouTube channel
and one of the things I did was I made a video
showing how CRCs work,
which forgive me for staying on task,
but you got the dual-sum CRCs,
I'm sorry, dual-sum checksums.
And then you have the Koopman checksum, which is this thing that's actually a remainder after division in integers,
except I found a clever way to structure it so it looks like an iterated addition,
but it's actually doing long division and computing the modulus. It's kind of a funky thing, but it works really
well. And it has much, much better fault detection than the dual-sum checksums. It's a little more
expensive, but not as expensive as the CRC. Then there's CRC. So, Koopman checksums are remainder after division in integers.
CRCs are remainder after division in polynomials over Galois field 2, which is all the field theory I know, at least as much as I'm going to admit.
Galois is my least favorite mathematician.
I'm not going to run down a field.
I understand. Abstract algebra is one of the coolest fields, and I don't understand why you hate it so much.
I don't hate it.
I'm reading a book about group theory.
I'm doing okay on that.
Yeah, but you hate reading it.
Good for you.
That's not bedtime reading in my book.
But for some people, it is, and that's all good.
Oh, it's pop-sci.
It's not taxing.
So, for those people who's like, what are they going on about Galois?
Think Boolean algebra.
Okay, Galois field two, for our purposes, is Boolean algebra, and the rules are simple, and this is how CRCs work.
Remember in grade school, probably, you learned how to do long division.
Some readers are still traumatically scarred from that perhaps.
I think they do it a different way now that involves
some... Calculators?
No, no. There's the
new math they teach which has all sorts of techniques
that we don't understand. Hey, you know what?
I had new math when I was a kid too.
New math has been around a long time.
Okay. I guess
I just said old math.
I had old math. I had old new math, okay?
All right. Long division, you draw a little thing.
They used to teach it back when people dialed phones with rotary dials and it made clicking
sounds and all that other stuff. But if you look it up on Wikipedia, it's there.
So you have long division.
And CRCs are actually a clever hardware-based idea.
You can do it in software, of course,
but you're emulating some hardware that did long division in hardware.
And it was over Galois field two.
And what that means is instead of addition or subtraction, use XOR.
That's it.
So you do long division and you say, all right, you know, I've got the dividend and I got the divisor and I put the divisor under the dividend and I do a subtraction.
And if it fits, I go with it.
And if it doesn't fit, I undo it and I shift over one digit and do the next long division.
The only difference is you're doing all the digits are zero or one.
You're doing XOR instead of add,
and also XOR instead of subtract.
It's all XOR.
And it's zeros and ones.
And that's it.
It's long division.
And you do the long division of the ginormous data word thousands of bits
by your 8-bit or 9-bit or 10-bit, doesn't
really care how many, or 16-bit divisor, which is the CRC polynomial. We'll probably come back to
that. And you just walk it across, you do the math, and when you get a remainder, that's the
result of the CRC calculation. That's the essence of CRCs. It's the remainder after long division
using the rule that addition and subtraction are XOR
and all the digits are binary.
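(Here's that division as a bitwise CRC-8 sketch in C. The polynomial 0x07 is just a common example constant, not a recommendation; picking that constant well is much of what the book is about. Note the data byte is XORed in at the top of the register, which is the usual software shortcut rather than literal long division with a prepended seed, a distinction Phil comes back to below.)

```c
#include <stdint.h>
#include <stddef.h>

#define POLY 0x07  /* example CRC-8 polynomial x^8 + x^2 + x + 1, top bit implicit */

/* Bitwise CRC: long division over GF(2), where both "add" and "subtract"
   are XOR. If the top bit of the register is set, the divisor "fits", so
   we shift and XOR it off; otherwise we just shift to the next bit. */
uint8_t crc8(const uint8_t *data, size_t n, uint8_t seed) {
    uint8_t crc = seed;                   /* initial shift-register value */
    for (size_t i = 0; i < n; i++) {
        crc ^= data[i];                   /* bring in the next byte of the dividend */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80)
                crc = (uint8_t)(crc << 1) ^ POLY;  /* divisor fits: shift and XOR */
            else
                crc <<= 1;                         /* doesn't fit: just shift */
        }
    }
    return crc;                           /* the remainder is the CRC */
}
```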
Watching your video about this,
it starts out and it shows the long division,
which made sense to me.
And this whole Galois 2 thing didn't penetrate.
I mean, that was unrelated. It's just XOR.
You have to understand what fields are.
Fields are pretty simple to understand, but anyway,
if you don't know what a field is, then that's not going to make any sense.
It doesn't matter. You can just push that to the side.
And then you redid it
as the hardware would implement it,
and I think my brain exploded.
At least a little bit.
I'm pretty sure that's what the leakage from my ear was.
Well, I started out as a CPU designer, so, you know, I think that way more normally, I guess. I don't know.
But the point, people have seen the shift. The shift in XOR registers, right? It's a bunch of
bits and there's a feedback and it shifts in XORs. And the thing to take away is that that hardware mechanism is doing division with XORs. It's doing the division
algorithm in hardware. So that shift in XOR is identical to long division. That's the thing to
get out of it. And it doesn't matter which way you do it, they're doing exactly the same computation.
And it took me a long while to understand why that was really true.
And so part of the reason I made the video was to give a graphical explanation to convince myself
that no one was blowing smoke. And in doing that, I realized that the CRC algorithms you're likely
to download from the internet
don't actually work that way quite.
They don't actually work quite that way.
So there you go.
They almost work that way, but not quite.
So am I not supposed to be just downloading these?
You can download it from the internet,
but if you try and do the polynomial division by hand,
you're likely to get a different answer
in some circumstances.
As long as both sides of my communication mechanism
get the same answer, I'm fine, right?
That's right.
That's right.
And it has to do with the initial seed value
is the difference.
The initial seed value is,
if you do a pure division,
you have to put the initial seed in front of the data,
but no one wants to do that because it wastes a lot of time. So they put the initial seed in
in parallel. The book has some details, but it turned out, I was trying to do the division and
compare it to algorithms and getting different answers. And it took me a while to figure out
that the software actually does a different thing. Now, the good news is the different thing has exactly the same fault detection properties. But one of the things I do
in the book is I show you the math way, and then I show you the way people really do it. And I give
code examples for each one. And then I send some, if you download the code, you get unit tests,
and the unit tests actually prove that they come up with the same answer if you know what's going on.
I'm going to take a little step back here since we've gotten pretty far into the details and ask a listener question.
Sure.
From Doug G.
What is the best practice?
Do you put the CRC in the data to be CRC'd?
And do you put something blank in,
like FFF or 000,
or do you put the CRC outside your data block?
Ah, okay.
So there's a couple things going on there.
There's whether or not you...
Let's talk about the hardware shift register version, okay?
Because some people find that easier to visualize.
When you're computing
a CRC, you have to reset the shift register before you start feeding bits into it. And do you reset
it to zero or do you reset it to all ones? You know, what do you do with that? You can do anything
you want and it does not affect the fault detection other than a non-zero value makes it easy to see
if a bunch of zeros have been stuck in front by accident,
which some communication networks have that failure mode.
Then you run the computation.
At the end, you have a value.
So you do initial value of whatever you want.
Non-zero is probably better, but it's not a huge deal.
You don't go back and rebuild your gizmo
just because you didn't do that.
And you crunch some data, and you come up with an answer, you put it next to it.
And the question here is, when you're checking the CRC, what do you do?
Well, the purely mathematical way, in fact, includes the check value in the computation.
So you run the data through the shift register and you run the check value through the shift register and you check for zero at the end.
And if you're building hardware, you can do that.
If you're building software, it turns out that whether you shift a bunch more times to see if the result is zero, including the check value, or you just do the data and compare it to the check value, it's the same computation
that makes no difference. So the answer is it makes no difference. Take your pick.
There's no difference in fault detection. So I can have data, my structure that has all of my data
that I'm serializing and sending over a pathway. I can either have this CRC on the inside and set
it to zero or all ones when I'm doing the calculation, or I can have it on the outside and not set it at all.
It depends on which software algorithm you use, but the high-speed algorithms don't care what you set it to.
As long as I do the same on both sides.
As long as you do the same on both sides.
But the high-speed algorithms actually do not care at all whether you pre-initialize the data in your code word or not.
They don't care.
They ignore it.
And there's some code in the book that shows how that works and then explains why that's so.
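(The two equivalent checks, sketched against the crc8() above with a zero seed; for that particular software variant, running the received check value through the division leaves a zero residue.)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

uint8_t crc8(const uint8_t *data, size_t n, uint8_t seed);  /* from the earlier sketch */

/* Option 1: recompute over the data only and compare to the received check value. */
bool check_by_compare(const uint8_t *data, size_t n, uint8_t rx_crc) {
    return crc8(data, n, 0x00) == rx_crc;
}

/* Option 2: run the received check value through the division too, and test
   for the constant residue (zero here). Same fault detection either way. */
bool check_by_residue(const uint8_t *frame, size_t n_with_crc) {
    return crc8(frame, n_with_crc, 0x00) == 0x00;
}
```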
Okay, I have a question.
You mentioned at the top of the show you do a lot of uh embedded networks was what got you into this
Yeah. And networking is primarily still TCP/IP, which has a checksum, which is fairly,
I mean, it was established probably a very long time ago, back in the Flintstones cars era. I think RFC 1000 or something. Is that checksum a good checksum?
Not particularly. I think you said one of your listeners wanted to know what's the worst protocol checksum, and it's the internet checksum. The one that's in the most widespread use that's the worst is the internet checksum, because it's a ones
complement addition, and they could
have done so much better
with very little
extra compute cost. I mean, at least
it's not a twos complement. I mean, that's a start,
but it's... I mean, very little
extra compute cost back
when they were deciding it was probably a little
more than compared to
now. They could have had none.
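(For reference, a sketch of the Internet checksum being criticized here, in the style of RFC 1071: a one's-complement sum of 16-bit words, folded and complemented. It assumes an even byte count and glosses over the byte-order handling a real stack needs.)

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum sketch: sum the data as 16-bit words into a wide
   accumulator, fold the carries back in (one's-complement addition),
   then complement. Cheap, but weak compared to the alternatives. */
uint16_t inet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);  /* end-around carry fold */
    return (uint16_t)~sum;
}
```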
Well, there were proposals to change it, right?
Wasn't there an alternate?
I remember there was some alternate, like they were going to switch to Fletcher or something.
Yeah, and switching to Fletcher would have been great.
Actually, you can do better than Fletcher, it turns out. There's a thing called a dual X checksum,
which didn't exist until I started writing this book. Let me go into that one because that one's
kind of cool. It's simple to explain. When you're doing a Fletcher checksum, let's say you're doing
a 32-bit Fletcher checksum, you're running two 16-bit sums next to each other. You have the A sum that takes the data 16 bits at a time,
and you have the B sum,
which is just the sum of the A sums
wrapped around, you know, cycled back into the B.
And so you're doing 16-bit additions,
and you may be doing it with a 32-bit CPU,
but, you know, you have to do this pair
of 16-bit additions each time.
And then you have to do a mod operation
to reduce it, to not lose the carry bit.
And it turns out that if you run a Fletcher checksum
32 bits at a time,
so you have a 16-bit addition,
you add a 32-bit chunk in,
and then you mod 65,535, which is 2 to the k minus 1 for 16 bits.
If you add in 32 bits and then do the mod,
you can process the data twice as fast.
And the cool part is, Fletcher checksums give you,
it's called Hamming Distance 3.
You can detect all 1-bit and two-bit
errors, and then there's some threes you don't
get. If you do
processing 32 bits at a
time, not only are you
faster, but you get
that Hamming distance three out to
twice the length.
So it's faster, and it gives you better
fault detection by processing
32 bits at a time.
And the book has an explanation for what it is, but it's like, that's just amazing that it works that way.
And I've never heard anyone talk about that possibility until I did some work last year and sort of stumbled into this.
So they could have done that and it would run, it's like 1.5 times as much time.
It's not even twice as slow.
It's pretty close to the same speed as one's complement and gets you all two-bit errors, all two-bit faults, which addition doesn't give you.
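(A sketch of the mechanism as described: keep the same two 16-bit one's-complement running sums, but feed the data in 32 bits at a time. This is an illustration of the idea, an assumption about the details rather than the book's reference dual X implementation, so consult the book for the real definition.)

```c
#include <stdint.h>
#include <stddef.h>

/* Dual running sums mod 65535 (2^16 - 1), fed 32-bit chunks instead of
   16-bit ones. Per the discussion above, this processes data faster and
   extends the Hamming distance 3 coverage to longer data words than a
   plain Fletcher checksum. Sketch only -- see the book for the real thing. */
uint32_t dual_sum_32at(const uint32_t *chunks, size_t n) {
    uint64_t sumA = 0, sumB = 0;
    for (size_t i = 0; i < n; i++) {
        sumA = (sumA + chunks[i]) % 65535;  /* one 32-bit add, then reduce mod 2^16-1 */
        sumB = (sumB + sumA) % 65535;
    }
    return (uint32_t)((sumB << 16) | sumA);
}
```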
That's so unintuitive, and yet I have wandered around my life
thinking, oh, checksums
and CRCs are pretty intuitive.
You just make sure that the bits are
what they're supposed to be, and
now you're telling me that
I don't need to spend a whole lot more
processing and a whole lot more
communication data
length to get better
effects?
That's right.
That's like something for nothing.
Very suspicious.
Shockingly close, which is why,
I mentioned the billions of simulations to make sure I didn't make a mistake.
I'm like, really? Really?
I'm not going to believe this one
until I beat on it for a few weeks of CPU time.
How do I ask this question?
Is this a field that's been kind of ignored for a while?
It's interesting to me because
it's interesting to think how much
computation power is going on all the time computing checksums.
Every single packet of data that's happening on the internet
is having its checksum checked.
And that's a lot of stuff.
That's a lot of computing power devoted to just adding numbers
and seeing if something went wrong.
Yeah, right.
And doing it badly, apparently.
Lately, I've been thinking about efficiencies
that we're just kind of ignoring in the world.
From waste heat to all these things.
And this sounds like an inefficiency that's fairly large,
although now it's all computed in custom ASICs
and maybe it doesn't matter that much
to get any more efficiency because...
Well, in embedded systems, it's still a lot of stuff.
Right, right, yeah, yeah.
But Christopher, it's actually worse
because there are cases where you could do dramatically better
at zero cost.
Right. Those are the ones,
that's why I got in this field. I found these
places where with zero cost
whatsoever other than
everyone agreeing to change their protocol.
Which is a big ask.
Which is a big ask, but that's a
social ask. It's not a
CPU time ask.
Right. And it turns out new protocols get invented all the time.
And so there's actually a lot at work here. And you asked me what my least favorite polynomial
was, and the IEEE 802.3 polynomial is the one I gave to you in hex notation, okay?
Because it's actually pretty terrible.
And people have known
for decades you can do better.
And as you point out,
it's hard to change that.
But at the time, they picked it.
They picked it for reasons, and that's
fine, because that was the best they knew,
and that's fine. But you can do
so much better
than that at no compute cost just by changing a constant in your code.
And boom, you're dramatically better.
And so that's what fascinated me about this topic.
It's sort of what you're saying is that you can do so much better for low cost or sometimes no cost, and you just sort of have to have the knowledge to be able to do that.
What kind of witchery is this? It's mathematics, isn't it?
Well, it's kind of math, but it's kind of not. That's the part. So, you asked if it's a broad
field. There are very, very, very few people who are specialists. There's a lot of hobbyists and
a lot of people who are sort of interested, which is cool. I get, you'd be surprised how often I get emails
on this topic from people who just think it's cool
to get into this.
And that's great.
I mean, it's a fascinating area.
There's very few people who've really gotten to be pros
and most of them did so in previous decades.
So they're not active anymore.
There are certainly knowledgeable people around this
in coding theory to be sure, but the coding theory guys spend most of their
time on other things, not on this. So there's not a lot of people working on it, right? And
encryption is probably where the overlap in math is, and everyone will just go do that.
Yeah, you can make a lot more money with encryption. That's fine. Yeah, this is a hobby. There's no big checksum lobby
cutting me checks to, you know.
Are you in the pocket of big checksum?
I'm not, sir.
Well, wait, actually,
that's a good question.
That one?
No, I don't think so.
You're bringing back
to congressional testimony
that I don't want to relive,
so we're just not going to go there.
Why?
Why do this book?
I mean, it's actually turned out
to be way more interesting than I expected
for a 700-page book about CRCs.
Well, that's the Kindle version.
The print one is half that size.
Oh, good.
Just so nobody's scared off.
So nobody's scared off.
Your book came up at 700-something on Kindle.
Oh, really?
Yeah, yeah.
Oh, so they're about the same length.
Yeah.
It's like 370-something.
You know, it's under 400 pages printed.
It is, I mean, it's a dense, detailed book.
The introduction is really good.
And then there are some sticky parts that definitely my brain had to go through multiple times to get, which is a hard problem.
But you didn't just do this for fun.
How did you come about this?
Actually, kind of, yeah.
Oh, okay.
Well, never mind.
I have an odd sense of fun.
What can I say?
So let me give you the history of how I got sort of sucked into this area, because I didn't
wake up one day and say,
I think I'm going to be the king of checksums.
You know, it's like,
it was not my life plan when I started this.
Please do not
use the show title King of Checksums.
I don't want to have to live it down.
I know how her brain works.
No, no, I wasn't going to be King of
Checksums. It was going to be, I'm Going to Be
the King of Checksums.
But that would be false because it's not what I did.
So what happened was I was trying to get into embedded networks
because I thought they were pretty cool.
And there's a constant reinvention of embedded networks.
So the thing about embedded networks is everyone makes their own network.
Why?
It's been going on for decades.
And the reason is because they need optimizations that they can't get.
Right?
So I'm not talking reinventing the internet.
Yeah, yeah, yeah.
That's a whole different thing.
But, you know, you've got a little bit 8-bit micro,
and you've got a sensor, and SPI doesn't quite do it.
Although you should use it if you can.
Or using SPI,
but what's the protocol
for transferring data on top of it, right?
Yes, yes.
I had this problem last week
and I made something stupid up, yes.
Yeah, or using controller area network
and you got these header bits,
but what pattern to use in the header bits?
So embedded networks
are continually reinvented
and that's not going away anytime soon.
It just is, okay?
Yes, you should use a standard network if one fits, but often it doesn't.
Or sometimes you get folks like, we need a better network for general aviation on top of CAN.
And there's a committee that's working on that that I talk to once in a while.
So this is going on, and so I was
contacted by some folks in both automotive and trains. It turns out there was a big checksum
cartel in Switzerland in the 90s at ABB Research. There's this guy, I never got a chance to meet him,
Funk, who actually did a lot
of the CRC work, had found the optimum CRCs up to 16 bits and published it, right?
And so he and this cabal of, I'm having fun here, but this cabal of embedded network guys
were actually the folks who were behind a lot of the different networks, the automotive
networks and the train networks.
It's all the same handful of guys at ABB Research, near as I could tell.
And they hired me out for a second opinion on a train control network.
And I also was doing automotive work as part of the self-driving car stuff.
And so one of the things we did was we just came up with Monte Carlo simulations
to see if the CRCs were really working as well as they thought they were.
And so Monte Carlo, you take a random network data packet and you throw some random corruption at it and you see whether the checksum catches the fault or not.
Rinse, repeat, right?
Rinse, repeat, a lot.
A lot.
Yes, a lot.
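(The rinse-and-repeat loop in miniature, reusing the crc8() sketch from earlier: build a random packet, inject a random two-bit fault, and count the corruptions the check value fails to catch. rand() stands in for a real RNG, and the packet length and fault count are arbitrary choices for the sketch.)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

uint8_t crc8(const uint8_t *data, size_t n, uint8_t seed);  /* from the earlier sketch */

#define LEN 64  /* packet length in bytes -- arbitrary for the sketch */

int main(void) {
    unsigned long trials = 1000000, undetected = 0;
    uint8_t pkt[LEN], bad[LEN];
    for (unsigned long t = 0; t < trials; t++) {
        for (int i = 0; i < LEN; i++)
            pkt[i] = (uint8_t)rand();               /* random packet */
        memcpy(bad, pkt, LEN);
        for (int f = 0; f < 2; f++)                 /* inject a 2-bit fault */
            bad[rand() % LEN] ^= (uint8_t)(1u << (rand() % 8));
        /* undetected = packet really changed, but the check value matches */
        if (memcmp(bad, pkt, LEN) != 0 &&
            crc8(bad, LEN, 0) == crc8(pkt, LEN, 0))
            undetected++;
    }
    printf("undetected: %lu of %lu trials\n", undetected, trials);
    return 0;
}
```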
There's a few permutations.
It's one of the really spiky noisy curves.
You don't need too many.
If you want that curve to be – so some of the curves in the books are a little bit spiky,
and that tells you how incredibly painful that one was to get to converge.
Most of them are smooth.
My CPU is just, you know,
I have an eight core, my desktop's an eight core and I have an older eight core in the next room
and they're just begging me for mercy
at the end of this book.
Okay.
And so you do that.
And so we did that.
And I just sort of got fascinated by the question of,
well, is that the best one they could have used?
And the answer was, the only way to know is to try them all.
And that one takes, you know, age of the universe is just getting started if you do it the brute force way.
And so the book has a section, which is probably what melted your brain, Elecia, about how I sped that up by many orders of magnitude.
And it's okay for that to be tough sledding.
But I thought I would write it down so someone else can replicate the work.
Because this is the first time I've ever written down what I actually did.
That's why it's there.
Cool.
So that's how I got sucked into it.
And then when TT Tech was doing TTP and the FlexRay Consortium was doing FlexRay
because they didn't want to pay royalties, the TTP people.
They all came to me to look at their checksums and their protocols.
It was pretty funny.
And I just sort of got into being the CRC guy.
I got some funding from the FAA to work on CRCs and checksums because they're like, well, is the airplane going to fall out of the sky because of the wrong checksum?
And it turns out nobody knew the answer.
So I got a little tiny bit of research money from them.
I teamed up with some guys at Honeywell Aviation,
and we co-authored, they led, but I co-authored
an aviation embedded network handbook for airplanes.
So this has really been my journey through embedded networks,
but while I was at it,
the CRC checksum thing kept coming up and coming up and coming up. And I was the guy who had
acquired the knowledge over time to be able to answer it. So here I am decades later.
I'm at the university now. I won't be there that much longer. Time to retire to do something else.
And it's a good chance to just take all the stuff in my head
and get it down on paper so that it's there
because my guess is 10 years from now, 20 years from now,
people are still going to want to know about checksums and CRCs
and it'll be there for them to learn from my journey.
And there is something really nice about a book.
Instead of trying to learn each piece from the internet, having it laid out in order with terminology that's consistent.
Yes, that I didn't realize how much of it I knew, but didn't know that I knew.
That's part of it.
But, you know, writing the book is also a journey for the author.
Yes. So, the bunch of pieces I hadn't put together either until I wrote the book. And that was
actually part of the satisfaction, was getting it straight in my own head and then trying to
help the reader get it straight in theirs. Okay. So, you wrote that book, but you've
written a couple of others since you were last on the show. Yeah, I guess I've been busy. That's fair.
How Safe is Safe Enough? Measuring and Predicting Autonomous Vehicle Safety
and the UL4600 Guidebook, What to Include in an Autonomous Vehicle Safety Case.
So autonomous vehicles. Osau wanted me to ask about this.
What is the state of autonomous driving?
Where should it be heading?
And where is it heading now?
Maybe, pun intended, we're not sure.
And basically just said that autonomous vehicles are more interesting than CRCs.
But I'm not as sure any longer because...
I'm incredibly bored by autonomous vehicles, but yeah.
I mean, it's a hard problem. But I didn't know CRCs were interesting at all.
So, you know.
Well, if I convince you there's someone interesting, then that's a win.
I think they're fascinating in a very geeky way, right?
There's that.
They really are.
And that they need to be simulated via Monte Carlo in order to find the best path is odd.
The thing that gets me is the lack of, I think it kind of wrinkled my brain a little bit,
and the question about is the TCPIP checksum a good checksum goes to that.
It's like, these are really important, and it doesn't seem like people have really been thinking rigorously about them for a long time.
And so it makes me happy.
It goes back to Greg Wilson saying, maybe we should try testing this stuff that we all agree is true.
That's fair.
Let me go back and another piece of why I wrote it.
The reason I stayed with the research is I kept turning up.
I was originally way way back, decades ago, I was going to write a book on embedded network protocols because I found an interesting subject.
It's a way to focus your mind to start a research program.
And I tried writing one.
And what I found out was every time I said, oh, everyone says to do this and this, is there a reason why that's a good idea?
The answer is no,
they're just making it up. It's like, which CRC polynomial? So this is the pattern of the
feedback of the XORs. Which one should you use? Well, the one that's in the standard. And some
of the ones in the standard are just awful. And some of them are really good. The ones that came
out from those ABB guys in Switzerland, they made good choices.
The CAN polynomial
is really good.
The Train Communication Network,
those guys knew
what they were doing.
But that's like
the only place I've gone
where I've seen
impressive results.
The rest has just been
like, really?
You decide to use that one?
Are you kidding me?
Parenthetically,
this is a similar thing
that bugs me
about machine learning.
Getting back to self-driving cars.
There's a lot of stuff in deep neural networks and machine learning where it's just like, well, we need an activation function for these things to make it nonlinear.
Let's try this.
And they try that.
It's like, oh, it works all right.
And then they just stick with it.
It's like, wait a minute.
So there's an important difference there, okay?
I'm an engineer, not a mathematician.
So empirical evidence works for me.
I don't have to have a proof.
And we never really answered this question.
We're diving all over the place, but hopefully the listeners are okay with this.
The reason I say math isn't enough is when you're talking about detecting all two-bit errors and three-bit errors with the CRC, you can prove that mathematically.
Although I give an intuitive bit motion-based argument rather than the math in my book, and then I point to the proofs.
But once it's detecting all four-bit errors and five-bit errors, which you need for safety critical, the math doesn't help you.
The only way to find one that detects all five-bit errors is to try them all,
all possible five-bit errors, and see how it turns out. And that's fine. I'm okay with empirical.
I'm not saying you need to go to first principles. But stopping and resting on laurels is what bugs
me. It's like, maybe there's something better. Well, the difference is, historically, there
wasn't enough compute power to try all five-bit errors.
Yeah, that's fair.
And so, the work I've been doing in the last 10 years or whatever is I actually found algorithmic speedups that I can actually guarantee, no, really, I've looked at every single possible 5-bit error, 5-bit fault over 64K bits.
No, really, I've done it. And over 32k bits, no, really, I've done it.
Even though if you do the math,
it's not possible.
The book explains the algorithmic speedups
that let you do it.
And I can prove that this has it
as opposed to,
ah, it looks okay, let's go with it, right?
And so if you empirically have shown
by brute force evaluation
or something unquestionable,
then I'm okay not having the theory. But you're right that what has really happened is people have just sort of said,
this looks okay to me and we're using it because they used it. And there's no rigorous evaluation,
it just becomes folklore. Yeah, that's why I got into this, because I kept seeing all this folklore and stuff, like Numerical Recipes.
You guys know Numerical Recipes, right?
Numerical Recipes in C, version 2.
In C and Fortran.
Yes.
So, second edition has a typographical error in one of the polynomials for their CRC chapter.
And people have used that for thousands of years.
That's right.
And if you look at the third edition, there's a little footnote thanking me for correcting them on that.
So that's the kind of stuff that bugs me.
I was just finding this stuff all over the place.
And, you know, it isn't like...
But like what are you supposed to do as a software developer?
You're going to look at that and you're not going to know.
There's no way for you to know that that's wrong, right?
That's right.
And so I decided it was a worthwhile use of my time sort
of as a hobby. You know, I've got a little bit of research funding, but trust me, it's mostly
been a hobby, right? Like I said, there's no big Checksum... Checksum World Tour. Yeah, it's not the Checksum World Tour. You know, most of it's been a hobby. I'm very appreciative of the sponsors who did give me support when they could.
But, you know, a lot of it's been on my own time, which is fine.
I'm not complaining.
It's great.
It's an interest, but that's why I call it a hobby.
Part of my hobby was, you know, it's been decades and people have just been getting it wrong on folklore.
Why don't we write down the right answers so people now know the right answer?
Simple as that.
I mean, I thought people knew the right answer, and I was just using their code.
Yeah, yeah, exactly.
Well, in fairness, they thought they knew the right answer, too.
But only because that was what they learned, too.
Exactly.
Well, who's to fault?
Who's the original person at fault here?
Let's go find out.
There's no fault.
There's a bunch of engineers who had to make decisions on limited knowledge.
And the knowledge has changed and the decisions haven't.
That's very typical of a lot of things, isn't it?
Speaking of which, we'd be remiss if we didn't answer Osau's actual question.
Oh, are we going back to self-driving cars?
Well, you asked the question,
so you should probably let him answer it.
What is the state of autonomous driving?
Where should it be heading?
And where is it heading now?
And I know you have some very good YouTube videos
on this question.
The answer is look at my YouTube channel,
but I'm happy to give the short answer.
The short answer is the industry's selling it on "we're busy saving lives," but the reality is nobody really knows how that's going to turn out.
And I didn't say they aren't, but clearly the evidence isn't there yet. Fatality rates are about once per 100 million miles, round numbers, so there's a bunch more tens of millions of miles to go before we know how it turns out.
Now, the industry would say we're really confident it'll be fine.
It's like, well, yeah, but there's some assumptions in the models that I don't happen to agree with.
And we'll see.
Maybe they're right, maybe they're wrong.
I don't know.
Some of the people who are part of that industry are not particularly, as you would say, trustworthy.
Well, and 16-year-olds are also very confident about their driving capabilities.
I wasn't.
I tended to get on the freeway the wrong direction, so I knew I was a screw-up.
That's fair, but the 100 million miles includes the drunks and the teenagers.
Exactly.
And that's where they set the bar, right?
Yeah.
Okay, to be clear.
So, we're going to see how it turns out.
And I know some people think I don't like the technology.
Well, if I didn't like the technology, why would I have written the international standard that I wrote?
Technically, I originated it, which means I dumped a couple hundred pages of standard into a process.
And then the industry helped clean it up,
and then it was a real standard, consensus standard.
I invested a lot of time in that.
I wouldn't have written two books on this topic if I didn't like it.
I like the technology.
What I don't like is dangerous technology, and there's a big difference.
It does feel like, to me, that there was a great deal of hype for,
what year is it now? 2024. My God. There was a great deal of hype, say, seven years ago, six, seven, eight years ago. And everybody was like, this is just going to take off. What are we going to do with all these truck drivers who can't drive?
I remember hearing that, yes. And it's going to affect the economy in all these amazing ways. And then it just kind of plateaued as well.
It's pretty good at lane keeping.
And as long as you pay attention, you know, you can relax when you're commuting a little bit more.
But as soon as you have a construction zone or an emergency vehicle, it all just goes kerplooey.
I'm not going to weigh in on that part.
It's all fair.
I'm not going to weigh in on that.
Let me give you high-level perspective.
I've been doing self-driving cars as long as I've been doing checksums and CRCs.
Okay?
So, I started in, I think it was 97.
So, in 1995, Carnegie Mellon had a vehicle, NavLab 5, that went across the U.S.
There's a segment from D.C. to Pittsburgh,
but Pittsburgh to San Diego,
98-point-something percent hands off the wheel.
In 1995, 98% hands off the wheel.
And the alums of that project all love to say,
and ever since, we've been working on that last 2%.
Yeah, yeah, yeah.
And that probably wasn't using DNNs
because that was impossible back then, right?
It was totally different technology.
It was, yeah, it was a long time ago. It was a laptop and a camera doing optical flow is my recollection.
Optical flow, okay.
Okay. And that's good enough to keep you in the lane 98% of the time. It's the 2% that's a bear, right?
If you're alive 98% of the time and dead 2%, well, that's a real problem.
Safety lives at 99.99999999%, you know.
And statistics struggle to get you that many nines without problems.
Statistics struggle with that many nines.
And machine learning is a statistical approach.
So how do you think that's going to work out?
I mean, that's why it's so hard.
And I think that's something people don't get about machine learning,
especially today, even with LLMs and stuff.
It's all statistics.
And all the problems inherent in statistics are there in greater magnitude because you're doing billions of things.
I mean, if 90% works for you, you're great.
If 99% works for you, it's a challenge, but maybe they can pull it out.
But, you know, airplane falling out of the sky, there's an official FAA number for that.
Airplane falls out of the sky once every billion, with a B, hours. Ten to the minus nine per hour.
And that's per hour.
That's a lot of seconds in an hour.
Okay?
Statistics doesn't really get you there, right?
You need something else.
I mean, you can't just do simulation.
No, you can't simulate enough, right?
That's right.
Well, that's why.
So for CRCs, I said I did Monte Carlo simulation for the other checksums and stuff.
But for CRCs, I actually evaluated every single bit pattern, because simulations won't find the one bad pattern. You know, take, let's say, 32K bits, five at a time. How many combinations of that? And suppose it's exactly one of them with an undetected fault. You're not going to find that with Monte Carlo simulation, now are you? Check back after the heat death of the universe.
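For scale, a quick back-of-the-envelope check of that combination count (our arithmetic, not a figure quoted from the book):

```python
from math import comb

n_bits = 32 * 1024            # a 32K-bit codeword
patterns = comb(n_bits, 5)    # every possible 5-bit error pattern
print(f"{patterns:.2e}")      # about 3.1e20 patterns

# If exactly one pattern is undetected, Monte Carlo sampling at a billion
# random trials per second would need roughly this long to hit it once:
years = patterns / 1e9 / 3.15e7
print(f"about {years:,.0f} years")   # on the order of 10,000 years
```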
That's right. And so, I actually do that calculation. It's an impressively large number
in the book. And so that's why I switched over to no, no, we checked them all because that's
the only way to know. That's it. You got to check them all. Like Pokemon. Yes. Oh, okay. So yes,
informally, the folks who go out and test and something goes wrong and they fix it and they test and they go out, it's like they're going out playing Pokemon Go and trying to collect them all.
And the problem is there's an infinite number of Pokemon.
And that's why it takes so long for self-driving cars to get scalable.
It's as simple as that. But in CRCs, Elecia's writing that down, I know.
In CRCs, the results in the table in the appendix, those are what I think are the best ones, or at least really good ones, right? Those are based on, no, really, I caught all the Pokemon.
All right. She's writing that down as a title, too.
I'm saving so much for this. Now we've got so many titles. Now we're just going to...
And we are pretty much out of time.
Phil, do you have any thoughts you'd like to leave us with?
I'll tell you really at the high level, let me say this.
A lot of this is about engineering.
So what I found with the experience of the book was that math only got me part of the way through and the rest of it had to be engineering and other techniques that weren't just mathematical analysis.
That's part one.
And part two is everyone wants to know what's the best CRC.
And the answer is it depends.
But more generally, maybe a checksum is what you need, because one of the graphs in the book is the speed versus time tradeoff, right? If this is how much CPU power you have and this is how much data you have,
this is the algorithm you should use to get the best bang for buck.
And so if you think you know the optimal answer to a general problem,
probably you don't understand the question.
Engineering is never about the one
optimal answer. Engineering is almost always about trade-offs. And the one thing I try and do this
book, but also when I teach, when I do research, is to help people understand the trade-offs so
they can make an informed choice and pick the answer that's right for them.
If you would like to pick up Philip Koopman's book, Understanding Checksums and Cyclic Redundancy Checks, it is out in all forms.
Our guest has been Philip Koopman, professor at Carnegie Mellon University. He's also the author of two books about autonomous vehicle safety and the classic Better Embedded System Software.
He also has three blogs, one about each subject here: Checksum and CRC Central, Safe Autonomy, and Better Embedded System SW.
Thanks, Philip. This was really quite entertaining.
Thanks for having me on.
And you know what? I never thought I'd be on a podcast for Checksums and CRCs.
I know, it's amazing.
So thanks for making my day.
Thank you to Christopher for producing and co-hosting.
Thank you to our Patreon listener Slack group for questions.
And of course,
thank you for listening.
You can always contact us at show at embedded.fm or at the contact link on
embedded.fm.
Don't forget that there is a whole page of show notes,
which will include links to Phil's blogs and his books.
And now a quote to leave you with from, well, let's just go with Koopman.
Safety isn't about working right most of the time.
Safety is about the rare cases where it doesn't work properly.