Embedded - 473: Math Is Not the Answer
Episode Date: March 21, 2024

Philip Koopman joined us to talk about how modulo 255 vs 256 makes a huge difference in checksum error detection, how to get the most out of your checksum or CRC, and why understanding how they work is worth the effort.

Philip has recently published Understanding Checksums and Cyclic Redundancy Checks. He's better known for Better Embedded System Software as well as his two books about safety and autonomous vehicles: The UL 4600 Guidebook: What to Include in an Autonomous Vehicle Safety Case and How Safe Is Safe Enough?: Measuring and Predicting Autonomous Vehicle Safety.

Phil's YouTube page has a number of videos with great visuals to go along with his books. He also has three(!) blogs: Safe Autonomy, Better Embedded System SW, and Checksum and CRC Central (including a post on checksum speed comparison). Currently, Phil is a professor at Carnegie Mellon University (his page there). You can follow him on LinkedIn.

Elecia read (and gave 2.5 stars to) Symmetry: A Journey into the Patterns of Nature by Marcus du Sautoy: "Interesting but uneven, I kept reading to find out what horrible things math profs do to their children in the name of fun. Worth it when I finally got to a small section with Claude Shannon (and Richard Hamming). It didn't help with this podcast but it was neat."

Nordic Semiconductor empowers wireless innovation by providing hardware, software, tools and services that allow developers to create the IoT products of tomorrow. Learn more about Nordic Semiconductor at nordicsemi.com, check out the DevAcademy at academy.nordicsemi.com and interact with the Nordic Devzone community at devzone.nordicsemi.com.
Transcript
Welcome to Embedded.
I am Elecia White, alongside Christopher White.
Our guest this week is Philip Koopman, author of a new book,
Understanding Checksums and Cyclic Redundancy Checks.
Now wait, before you wander off, there's a lot more we're going to talk about and it's going to be very fun. Cyclic redundancy checks are quite fun.
Hello, Philip. Welcome back to the show. Thanks for having me. I was going to jump
in right there, but I'll be snarky later. We're good. Could you tell us about yourself as if we
met on, I don't know, open house day at CMU?
You're starting with a hard question because I've done like a whole bunch of things. So,
here's the short version. And we're going to come back to my career and maybe that's fine.
I was in the military. I was in industry. I'm currently sort of ending my cycle at university,
shifting over to being a consultant. And I've been doing embedded systems and embedded networks for a really long time,
and also safety in a variety of places.
So I worked for the Navy.
I worked for United Technologies.
I was a CPU designer at Harris Semiconductor.
I've been teaching at Carnegie Mellon University.
And I stopped counting at 200 design reviews for industries. You know, get in a car, get on a plane, go look at some embedded stuff, give them some ideas how to improve.
I stopped counting at 200, and that's got to be 10 years ago.
So, you know, done a lot of stuff.
How's that?
Well, then all this stuff leads to a couple of pretty interesting books about what you've done.
Yeah, yeah. I've been busy lately. There's the orange book, as I call it, which is the Better Embedded System Software, which is now, what, almost more than 10 years old, but still going fine. And that's what I learned in my first 90 design reviews. And I've learned a lot more
since then, but that's not the book we're here to talk about today. Actually, are you thinking
about doing a second edition for that? I have to decide. I need, as you well know, Elecia,
putting out a book is a somewhat traumatic experience. So I have to recover. And it turns
out the second time is no less traumatic. It turns out. That is correct. You think you're going to go in and just do a second
edition and make a few edits and then suddenly you've rewritten the whole book and added 100
pages. Yes. And at this point, I don't know that a second edition is the right thing because that
book has held up remarkably well. I mean, the hardcover printing sold out.
And then so it's now paperback and Amazon and doing way better than I expected.
But it hasn't changed that much.
And so now I think if I were to do a second edition, it might be better to leave that as it is and do something else that is more helpful rather than putting a lot of effort in just replowing the same ground.
I didn't modify much from my book.
I added stuff.
I moved it around.
I made the instructions better.
And at some point, when you're an inch thick, maybe it's time for another book, right?
Okay.
Are you ready for a lightning round?
Sure. Lay it on me.
Okay. Least favorite polynomial.
0xEDB88320. That's my story. I'm sticking to it.
On a scale of 1 to 10, where 1 is a Flintstones car and 10 is hovercars of the future. Where are we with self-driving?
Well,
you know, like
four or five.
I don't know.
That scale is kind of hard to calibrate.
It's a hard scale.
So four or five could be, you know, just
awful or mediocre.
And that's
exactly the issue. I was not sure.
We can come back.
At some point, if you want to come back and talk about
that, we certainly can, but
goodness knows I've done enough podcasts on that
and this is a chance for a brush of
fresh air for me. Favorite course
to teach? The one I'm teaching
now, which is
half embedded code quality,
one quarter safety, one quarter security.
What is your preferred programming language?
I've been using C and C++ because I'm old. I've stopped counting at 50 programming languages.
I'll use whatever is thrown at me. But really, an embedded, that's still what's out
there in embedded. Rust may come by someday, but it's going to take a while. And I'm kind of
language agnostic. Whatever it takes, it's fine. Favorite bagel place in Pittsburgh?
Oh, well, I guess that Brugger's does pretty well for me, but there's a lot of other places you can go, boutique places.
So it depends what mood I'm in.
Do you have a favorite fictional robot?
Well, it's a tie.
So for some reasons, it's Huey, Dewey, and Louie from Silent Running,
partly because they're obscure and partly because that's what I named my Roombas.
Okay.
But I would be remiss if I did not mention Johnny Cab,
from the first movie, right?
He's hysterical.
You know, we hope you enjoyed the ride.
What movie was that?
Total Recall.
Thank you.
Yes.
1990.
Yes.
Yes.
I'd like to thank our sponsor this week, Nordic Semiconductor,
which specializes in ultra-low-power wireless communication.
They have a wide technology portfolio, including Bluetooth Low Energy,
Low Energy Audio, Bluetooth Mesh, Thread, Matter, Cellular IoT, and Wi-Fi. They have thousands of customers worldwide with
40% market share in Bluetooth Low Energy. They have nearly 2 million systems-on-chip produced
every day. Sounds like a system that you can't go wrong with. So please check out nordicsemi.com. And if you would like to win
one of the Power Profiling Kit 2s
that our sponsor is so generously handing out,
please send us an email that is show at embedded.fm
and tell us what your favorite PPK2 feature or spec is.
Again, thank you to Nordic
for sponsoring this week's show.
Okay, let's get to the real questions.
Sure.
You mentioned the Orange Book,
Better Embedded System Software,
and that's been around a while.
That was like 2010, I think,
the hardcover.
Yeah, it's been a while.
And now you have a book that just recently came out.
And it's about CRCs and checksums.
Isn't that a done sort of field?
Well, it's pretty niche-y.
I'll save you having to say that.
Well, it goes back to the 60s, I guess.
And while the mechanics of CRCs are done, the understanding of them took decades.
And there's still some open questions as it turns out.
I said, okay, I think I understand enough now to write the book.
And then I got, there's like a thing in chapter 10, it's like, you know what? All the stuff that's been written
in the last 20 years sort of missed the important part, and I don't know the answer either,
so we're going to leave that part open. This is the kind of thing that takes decades to figure out.
But to be sure, inside the book are the answers to all the questions I've really heard.
And there's one or two niche, really, really narrow places that are still open.
But for most people's purposes, the answers are all there.
And it really takes a lot of work to nail it down because there's so much going on.
And really, there's a lot of folklore.
There's a lot of misinformation.
There's a lot of people just saying, oh, this will be fine.
And no, it's not. So, there's just a lot of that. I mean,
inventing the technology was very cool and done a long time ago. Understanding it has taken a long time.
I mean, don't you just take your list of parameters about how long you're willing for your check bits to be and how Christopher keeps trying to interrupt and how long it takes to calculate and how long it takes to verify. You just take all these
parameters and you shove them in and you find out which CRC is best. Doing this every show now.
Can we start with a definition of what a CRC or checksum is for people who might not be familiar?
Because I want people to understand what we're talking about
before we ask which elliptical curve equation
is the right one to use for some...
Okay, let's start with data word and...
In fairness, so you don't beat up Christopher,
I was going to go there too, so we're good.
Okay, so let's start with what is a CRC and why would I use one?
I want to start earlier. Let's start with a
checksum and then talk about CRCs because they're really different technology. Yes, right. Checksums
are easy. Let's all do checksums. Well, yeah, actually not. Depends which one. All right. So
let's start at the basics. You got a piece of data and things go wrong in the world. And this is how I got involved because I do safety, I do dependability. Data rots. You know, the bits go bad, bits rot, they flip, whatever. You get a bad bit in a data transmission, you have a piece of memory and it suffers a bit flip before you read it back. And so data can accumulate faults,
especially in embedded systems which have long operational cycles and noisy environments.
You're going to get bit errors in your network messages and in data storage. And we may talk
more about network messages, but the principles are pretty much the same. And so if you have a
piece of data and a bit goes bad, you really want to know that it's gone bad.
And so the idea is you take this piece of data you care about and you add another piece of data next to it, which is sort of a summary, a digest.
And if it's gone bad,
this little digest thing next to it will tell you it went bad.
So you can do a consistency check between the data and the little digest, the checksum.
And the checksum will say, you know what?
That data isn't what it's supposed to be because I have a different answer. And when you store the data, you store the data and you run a computation across the data
that crunches it down into a little digest, and you store the digest as well.
And then you send it or you put it in memory, whatever.
When you pull them back out, you take the data, you run the same computation,
and you compare the result to the stored digest.
If they're different, you know something went bad.
And so the idea of a checksum is to give you a high probability way of detecting faults.
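(To make that concrete, here's a minimal C sketch of the store-then-verify flow just described. The digest8 function is a placeholder byte-sum digest, purely for illustration, not any particular algorithm from the book.)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder digest: add up all the bytes mod 256. Illustrative only --
   the episode explains why plain mod-256 addition is a weak choice. */
static uint8_t digest8(const uint8_t *data, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

int main(void) {
    uint8_t msg[] = "sensor frame";
    uint8_t stored = digest8(msg, sizeof msg);  /* store the digest next to the data */

    msg[3] ^= 0x04;                             /* simulate a bit rotting in memory */

    /* Recompute and compare: a mismatch means the data (or the digest) went bad. */
    if (digest8(msg, sizeof msg) != stored)
        puts("fault detected");
    else
        puts("data looks consistent (or we got unlucky)");
    return 0;
}
```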
I mean, you could do something really silly,
like you could take a blob of data and then you could make a copy of it,
but that doesn't tell you which one's right.
So you have to make another copy so you can vote between the three.
Oh, and it's even worse than that.
If you make two or three copies, the typical faults are the bigger the data is,
the more probability something will go wrong.
Exactly.
So making two copies and comparing is like the worst idea ever
because you just doubled the chance one of them will be wrong.
And as you said, you don't know which
one. So then you send three
and if two compare,
probably that's the right one.
But what if all three are different?
50 copies and you average them.
Yeah, and probably that's not
how you want to go. And the
point of checksums, and I'm going to use
checksums generically sometimes, so I include CRCs in them, right? The point of checksums as a generic is to do better
by saying, well, I've got a thousand bits or a few kilobits of data. You can tell I'm an
embedded network guy. You've got a few kilobits of data and I have eight bits or 16 bits,
and that's really all I need to, with very high probability, detect that something went wrong in the hundreds or thousands of bits.
That's the idea.
So you have these 16, 32 bits that are checking your kilobytes of data, but it only tells you if something's wrong.
It doesn't tell you what's wrong.
That's error correcting. So error detection is something's wrong. It's yes, no question.
Actually, it's a yes and probably no question. Right. Because there's some probability you got
unlucky and there's a fault in the check value that compensates for the fault in the data value.
And so a lot of why you would use one checksum or another or one CRC or another is to reduce that residual probability of an undetected fault.
But yeah, it's telling you it's wrong. Now, you can do better if you have a higher-powered, I'm using that term loosely, a higher-powered check computation.
You can actually do error correction, but this book doesn't go there.
Okay.
And the reason is there's all these books on coding theory.
It's like the world arguably does not need yet another book on coding theory, okay?
There's plenty of books on coding theory.
Aficionados may differ, and I respect that.
That's fine.
It's not what I've done for a living.
But they all have like this one page with some dense math saying, here's CRCs.
Now let's get to the cool stuff.
And if you're doing embedded networks, you care about the CRCs and checksums, and that page of math doesn't do you much good, right?
And so this is the book that takes that page or two, expands it to a book, and calls all of those guys out on all the hand-waving they're doing that actually doesn't give you the answers you need.
Does your book answer the question of unbalanced zeros or ones?
I mean, I always thought that if you're dealing with like Flash, it's not a 50-50 probability when a bit goes bad, whether it's going to zero or one.
That's right.
It does in two different dimensions.
So what I do, at the beginning of the book, it runs through all the algorithms.
And we can sort of run down the list if you want at some point.
But it runs down single sum and dual sum.
And I invented a new checksum writing this book, the Kotman checksum, because my wife said I had to name it after me.
And that's a true story.
And CRCs.
And they all have different performance tradeoffs.
But then I have some chapters saying, what about this?
What about that?
What about the other thing?
And one of the sections is, well, what about if it's not all zeros or all ones, right?
And for some of the checksums, the data values matter.
For any of the checksums that involve integer addition, the data values matter.
And the more ones or zeros you have, if it's all zeros or all ones, the checksums actually do really well.
And the worst case is actually a 50% bit mix.
And it doesn't matter which way they fail.
A 50% bit mix is the worst case, which was kind of interesting.
For Koopman checksums and CRCs, the data values don't matter at all. What matters is how many bits have been flipped, not the direction of the bit
flips or the data values. And the technical reason is integer addition is done with
carry bits, and the carry bits are affected by the data values. But CRCs are all XOR operations,
and so the fault pattern is completely unrelated to the data pattern.
Okay, wait a minute. Say that again.
The fault pattern in CRCs, whether you can detect your fault or not,
has absolutely nothing to do with the data values.
But with checksums?
With addition-based checksums, it does,
because in CRCs, it's about bit mixing.
And I can try and verbally sketch that for listeners who haven't heard it.
But actually, let me sort of quickly sketch them out so we have something to work with, okay?
The general classes are, they're single addition checksums.
We take all the,
I'm going to call them chunks to make it obvious
that I don't want to get too much
into the specific terminology.
You have a big data word,
you break it up into chunks,
like a chunk might be 32 bits
or 16 bits or 8 bits,
but it doesn't have to be.
You have data where you break it up
into chunks and you add up
all the chunks using integer addition.
And what do you get at the end is your checks
on value.
And whether or not that detects faults
depends on not only
where the bits are, but
the carry patterns during the addition,
if that makes sense.
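(A minimal sketch of that single-sum idea in C, assuming 16-bit chunks; the chunk size is just an example.)

```c
#include <stdint.h>
#include <stddef.h>

/* Single-sum checksum: break the data word into chunks and add them up
   with ordinary integer addition. Whether a fault gets detected depends
   not just on which bits flipped but on the carry patterns during the
   addition, which depend on the data values themselves. */
uint16_t single_sum16(const uint16_t *chunks, size_t n) {
    uint16_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += chunks[i];  /* carries mix bits, but the top-bit carry-out is thrown away */
    return sum;
}
```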
Yeah, because you're getting
mixed. If you XOR,
if you just XOR all the chunks,
what you get is a bitwise XOR called a longitudinal redundancy check, LRC.
And that one can be fooled by pairs of bits in the same location.
But if you're adding, you get carries and you get better mixing because the carry bits sort of mix the values across.
They smear them across the width.
And so it performs better.
You get better fault detection.
But you only get the fault detection if the values smear.
And that all depends on the bit values, right?
And so depending which bit values were faulted,
you will either get a detectable or undetectable two-bit error, for example.
And whether it's zero to one, one to zero, it doesn't matter.
The direction doesn't matter, it's just that the values matter.
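(And here's a sketch of the XOR variant Phil contrasts it with: the LRC, whose weakness is exactly the paired-bit case he mentions.)

```c
#include <stdint.h>
#include <stddef.h>

/* Longitudinal redundancy check (LRC): bitwise XOR of all the chunks.
   Each bit column is independent, so two flips in the SAME bit position
   of two different chunks cancel out and go undetected -- no carries,
   no mixing, and no dependence on the data values. */
uint8_t lrc8(const uint8_t *chunks, size_t n) {
    uint8_t x = 0;
    for (size_t i = 0; i < n; i++)
        x ^= chunks[i];
    return x;
}
```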
I expect we'll get into this, but is that something that
...
How do you decide
which way to go?
Do you statistically analyze your data?
Or is that just not something
that people really need to look at?
No, CRCs are always better.
Right, that's the question I'm asking.
So, I have a lot of cores on some servers that have spent a lot of time running Monte Carlo simulations.
Okay.
Okay, and you know, for me, a billion simulations is just getting warmed up, okay?
Because it takes that to get the, it's really amazing how many simulations it takes to get a smooth curve, okay? And so these things, this book took a year,
and a lot of that had 8, 16, 24 cores
just running flat out for the whole time
to make the graphs in the book.
And so I have a curve in the book
that says for each type of checksum,
if you have 128, I think it was 128-bit data word,
just to keep it tractable,
if there's zero one bits, it does this.
If there's one one bit, it does this.
If there's two one bits, you get where this is going, right?
And there's these interesting curves that they're worse in the center and really nice at the ends.
And so that's how I know it's 50%.
Yeah.
And so a thing is, I don't actually try and do the mathematical analysis.
I do a lot of, I do some, but not a lot.
I do mostly empirical analysis
because you look at the graph and you say,
oh, I get it.
You can look at the math and well, maybe,
but if you look at the graph, it's like,
oh, it's obvious.
Yeah, that's what's going on.
And so a lot of the book has graphs in it.
I was kind of shocked a few nights ago
I had started reading your book, and my pre-bed book is Symmetry by Marcus du Sautoy, I think is his name.
And it's about group theory.
And I was really surprised to see error identification and error correcting codes in group theory.
Now, I know I didn't prep you for this question at all,
so I'm basically asking you to think on your feet,
but how is group theory related to this?
Can't go there.
I'm not a math guy, shockingly, perhaps.
Christopher.
I'm not that heavy into math.
Let Christopher take that one.
Yeah, let me think back to my abstract algebra class.
You were the one who said elliptical curve, so you're on the hook here.
Let me cast my memory back to 1994 and nope.
Group theory is very large in error correcting codes.
That's all I can tell you.
I don't remember why.
Because you just looked it up on your phone.
I watched you type it into Wikipedia.
No, I was checking to make sure I turned something off.
No, I don't
have an answer for you. I'm sorry.
But this whole area of
computer science
is partially engineering
because you do have to deal with all these trade-offs
of speed of computation
and size of your checksum
versus size of your data.
And not only that, electrical engineering, because what kind of wire, and it doesn't matter in networking so much because you never know what kind of wire you're over, but
in some cases, okay, I'm on.
Other than it's a bad one.
Right. You're on a high-speed serial link, and then there's encoding things with high-speed
serial links where they want to balance the bits. So that might influence your choice. Anyway, yes.
That's also in the chapters, right? Yes.
So, the thing is that math isn't actually the answer here.
Let me write down that as a show title.
That's a good show title. So, you know what? Let me go through the different types,
and then I'll- Math is the question.
I'm in a better position to explain why math isn't the answer, okay? So there's single-sum checksums.
There's also dual-sum checksums.
And many people have heard of a Fletcher checksum or an Adler checksum.
Okay, that's what we're talking about, where you actually compute two running sums at the same time.
And one sum is just the same as a single sum, but the second is the sum of the values of the running sum.
So you're saying, you know,
the value A is chunk one plus chunk two plus chunk three,
but value B is B plus A, B plus A, B plus A.
And what that does is it implicitly multiplies the chunks by their position.
And that makes you resistant to two-bit faults
because it's hard.
Until you sort of wrap around,
you can't get a two-bit fault
that can fool both the current sum
and the sum multiplied by its integer position
in the chunk stream.
Okay.
Okay, so that's the deal with Fletcher checksum.
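(A textbook-style C sketch of those dual running sums, in the spirit of Fletcher-16. Production code defers the mod for speed; this version reduces at every step to keep the structure obvious.)

```c
#include <stdint.h>
#include <stddef.h>

/* Dual-sum (Fletcher-style) checksum over 8-bit chunks. sumA is the plain
   running sum; sumB accumulates the running sumA, which implicitly weights
   each chunk by its position in the stream -- that's what catches the
   two-bit faults a single sum misses. Mod 255, not 256: see below. */
uint16_t fletcher16(const uint8_t *data, size_t n) {
    uint32_t sumA = 0, sumB = 0;
    for (size_t i = 0; i < n; i++) {
        sumA = (sumA + data[i]) % 255;
        sumB = (sumB + sumA) % 255;
    }
    return (uint16_t)((sumB << 8) | sumA);  /* pack the two sums into the check value */
}
```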
And something I found out is that some of the intuition about the modulist is wrong.
So the difference between a Fletcher and an Adler,
Fletcher is, I'm going to use, we're just going to use 8-bit because the numbers are small, but it scales.
If you just add mod 256, you know, just integer add,
it turns out you get some really nasty behavior on the top bit because you throw away the carry out.
And so what you want to do is you want to wrap around the carry out by doing mod 255 or one's complement.
It's basically one's complement addition.
And so a Fletcher checksum uses one's complement addition on both the A sum and the B sum. And I've seen, and actually there are standardized implementations
that use mod 256 instead of mod 255,
and they have horrible performance. Really important
to use one's complement.
Why?
Because when you're adding up a bunch of integers,
remember I said that bit mixing is helping you
from the carries?
If you just do 8-bit integer adds,
you throw away
the carryout,
don't you?
Sure.
And throwing away
the carryout
doubles
the vulnerability
to bit flips
in the top bit,
it turns out.
Okay, okay, okay.
I kind of get that, yeah.
And for the Fletcher
checksum in particular,
or a dual-sum checksum, that implicit multiplication starts losing information if you don't wrap that carry bit back.
And so it becomes vulnerable, basically it's instantly vulnerable to 2-bit faults if you use mod 256 instead of 255.
You don't get the 100% detection of two-bit faults.
You lose it basically instantly.
So it's terrible.
And people make that mistake all the time.
So these are the kind of things in our books,
like don't make this mistake.
And here's the graph that shows you why.
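(The mod-255 point in two lines of C: a one's-complement add folds the carry-out back in instead of discarding it. A sketch, with the usual caveat that 0xFF and 0x00 are both representations of zero in one's complement.)

```c
#include <stdint.h>

/* One's-complement ("end-around carry") addition of two bytes. Doing the
   add in a wider register and folding the carry back in is equivalent to
   summing mod 255, so the information in the top-bit carry-out is kept
   rather than thrown away as a plain mod-256 add would do. */
static uint8_t add_ones_complement(uint8_t a, uint8_t b) {
    uint16_t s = (uint16_t)a + b;
    s = (s & 0xFF) + (s >> 8);  /* fold the carry-out back into the low byte */
    return (uint8_t)s;
}
```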
Sorry, I'm just wondering how often I've made that mistake
and whether or not I should go and fix it.
And I mean, I don't know that I would have.
Yeah. And the thing is, there's no reason that anyone would even know this unless someone did
the analysis, showed them the picture. And that's why I wrote the book. It's because there's so
many people unknowingly making these things that seem, oh, it's no big deal. And it's terrible,
right? And unless someone shows you why, it's no big deal. And it's terrible, right? And unless someone
shows you why, there's no way to know. And how much are you showing with the graphs
versus showing with the math? Very little on the math. So I decided
not to use an equation editor. This entire book was written without an equation editor.
And it isn't like I have really ASCII art math or something silly like that, right?
It's basically the math is all on one line with superscript and subscript type font, and that's it.
And I did that on purpose because there are all these math books, and some folks are fine with the math, and that's great.
Those books are already written.
But some folks, and I'm actually one of them, I look at the math and my eyes glaze over, and I have to actually understand what's going on with the bits.
Then I go back and say, oh, that's what the math was trying to tell me.
Yeah, there are many times when reading the code is easier than reading the math.
Yeah, so I read the math and I understand the math.
That's great, but it's not how I think.
And so this approach in this book is, all right, here's the basics of the math,
but I'm not going to throw pages of math at you.
I'm actually going to run examples.
And so for the dual sum checksum where I said there's implicit multiplication
because of the way the addition works,
I actually run an example so you can see what I mean by that.
And I say, you know, this is added, this is added, this, and look at the end.
You know, at the end, there's four nines and there's three eights and there's two sevens.
And like, oh, look at that.
It was multiplied by the position.
Isn't that cool?
And you have a good use of color in the book.
I liked that because it's hard to follow these things.
I started doing that because trying to do monochrome graphs that you can understand is really painful,
right? And it's like, and there's a lot of, I actually don't know how many pictures,
you know, dozens. And it depends what you mean by a picture at this point, okay? There's a lot of,
you know, I spend a lot of time, it's like, I really want to make this explaining things rather
than just saying, look how smart I am because I can do math, right?
It's not, that's pointless for a book,
or at least I think so.
So a lot of time explaining a lot of pictures,
a lot of graphs, and then the examples,
I use color so you can see where the,
you know, there's a five.
Where'd that five come from?
Oh, that's a green five.
And there's a green five.
Guess what?
Those are the same five moving through.
So I used a lot
of color on the interior to try and make it easier. And then sometimes the word in the text
is also green. It's like, oh, so those are all the same thing. I tried to do that. And apparently it
came out okay because you like it. I wasn't sure it was an experiment, but that's what I was trying
to do. I just found when I read it with the color coding, it was just so much easier for me to follow what was going on and make sure it all
lined up.
So much of CRCs especially
is about
following
a bit through a pattern.
Yes, yes.
And I did that with color. Yes, it is.
There's a step and another step and another
step and I tried to use color to make it
more obvious what's going on with the steps.
That's right.
Because I couldn't put video in, right?
That's a whole different deal.
I mean, you do have a YouTube channel.
So I do have a YouTube channel
and one of the things I did was I made a video
showing how CRCs work,
which forgive me for staying on task,
but you got the dual-sum CRCs,
I'm sorry, dual-sum checksums.
And then you have the Koopman checksum, which is this thing that's actually a remainder after division in integers,
except I found a clever way to structure it so it looks like an iterated addition,
but it's actually doing long division and computing the modulus. It's kind of a funky thing, but it works really
well. And it has much, much better fault detection than the dual-sum checksums. It's a little more
expensive, but not as expensive as the CRC. Then there's CRC. So, Koopman checksums are remainder after division in integers.
CRCs are remainder after division in polynomials over Galois field 2, which is all the field theory I know, at least as much as I'm going to admit.
Galois is my least favorite mathematician.
I'm not going to run down a field.
I understand. Abstract algebra is one of the coolest fields, and I don't understand why you hate it so much.
I don't hate it.
I'm reading a book about group theory.
I'm doing okay on that.
Yeah, but you hate reading it.
Good for you.
That's not bedtime reading in my book.
But for some people, it is, and that's all good.
Oh, it's pop-sci.
It's not taxing.
So, for those people who's like, what are they going on about Galois?
Think Boolean algebra.
Okay, Galois field two, for our purposes, is Boolean algebra, and the rules are simple, and this is how CRCs work.
Remember in grade school, probably, you learned how to do long division.
Some readers are still traumatically scarred from that perhaps.
I think they do it a different way now that involves
some... Calculators?
No, no. There's the
new math they teach which has all sorts of techniques
that we don't understand. Hey, you know what?
I had new math when I was a kid too.
New math has been around a long time.
Okay. I guess
I just said old math.
I had old math. I had old new math, okay?
All right. Long division, you draw a little thing.
They used to teach it back when people dialed phones with rotary dials and it made clicking
sounds and all that other stuff. But if you look it up on Wikipedia, it's there.
So you have long division.
And CRCs are actually a clever hardware-based idea.
You can do it in software, of course,
but you're emulating some hardware that did long division in hardware.
And it was over Galois field two.
And what that means is instead of addition or subtraction, use XOR.
That's it.
So you do long division and you say, all right, you know, I've got the dividend and I got the divisor and I put the divisor under the dividend and I do a subtraction.
And if it fits, I go with it.
And if it doesn't fit, I undo it and I shift over one digit and do the next long division.
The only difference is you're doing all the digits are zero or one.
You're doing XOR instead of add,
and also XOR instead of subtract.
It's all XOR.
And it's zeros and ones.
And that's it.
It's long division.
And you do the long division of the ginormous data word thousands of bits
by your 8-bit or 9-bit or 10-bit, doesn't
really care how many, or 16-bit divisor, which is the CRC polynomial. We'll probably come back to
that. And you just walk it across, you do the math, and when you get a remainder, that's the
result of the CRC calculation. That's the essence of CRCs. It's the remainder after long division
using the rule that addition and subtraction are XOR
and all the digits are binary.
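(Here's that division as a bitwise CRC-8 sketch in C. The polynomial 0x07 is just a common example constant, not a recommendation; picking that constant well is much of what the book is about. Note the data byte is XORed in at the top of the register, which is the usual software shortcut rather than literal long division with a prepended seed, a distinction Phil comes back to below.)

```c
#include <stdint.h>
#include <stddef.h>

#define POLY 0x07  /* example CRC-8 polynomial x^8 + x^2 + x + 1, top bit implicit */

/* Bitwise CRC: long division over GF(2), where both "add" and "subtract"
   are XOR. If the top bit of the register is set, the divisor "fits", so
   we shift and XOR it off; otherwise we just shift to the next bit. */
uint8_t crc8(const uint8_t *data, size_t n, uint8_t seed) {
    uint8_t crc = seed;                   /* initial shift-register value */
    for (size_t i = 0; i < n; i++) {
        crc ^= data[i];                   /* bring in the next byte of the dividend */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80)
                crc = (uint8_t)(crc << 1) ^ POLY;  /* divisor fits: shift and XOR */
            else
                crc <<= 1;                         /* doesn't fit: just shift */
        }
    }
    return crc;                           /* the remainder is the CRC */
}
```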
Watching your video about this,
it starts out and it shows the long division,
which made sense to me.
And this whole Galois 2 thing didn't penetrate.
I mean, that was unrelated. It's just XOR.
You have to understand what fields are.
Fields are pretty simple to understand, but anyway,
if you don't know what a field is, then that's not going to make any sense.
It doesn't matter. You can just push that to the side.
And then you redid it
as the hardware would implement it,
and I think my brain exploded.
At least a little bit.
I'm pretty sure that's what the leakage from my ear was.
Well, I started out as a CPU designer, so, you know, I think that way more normally, I guess. I don't know.
But the point, people have seen the shift. The shift in XOR registers, right? It's a bunch of
bits and there's a feedback and it shifts in XORs. And the thing to take away is that that hardware mechanism is doing division with XORs. It's doing the division
algorithm in hardware. So that shift in XOR is identical to long division. That's the thing to
get out of it. And it doesn't matter which way you do it, they're doing exactly the same computation.
And it took me a long while to understand why that was really true.
And so part of the reason I made the video was to give a graphical explanation to convince myself
that no one was blowing smoke. And in doing that, I realized that the CRC algorithms you're likely
to download from the internet
don't actually work that way quite.
They don't actually work quite that way.
So there you go.
They almost work that way, but not quite.
So am I not supposed to be just downloading these?
You can download it from the internet,
but if you try and do the polynomial division by hand,
you're likely to get a different answer
in some circumstances.
As long as both sides of my communication mechanism
get the same answer, I'm fine, right?
That's right.
That's right.
And it has to do with the initial seed value
is the difference.
The initial seed value is,
if you do a pure division,
you have to put the initial seed in front of the data,
but no one wants to do that because it wastes a lot of time. So they put the initial seed in
in parallel. The book has some details, but it turned out, I was trying to do the division and
compare it to algorithms and getting different answers. And it took me a while to figure out
that the software actually does a different thing. Now, the good news is the different thing has exactly the same fault detection properties. But one of the things I do
in the book is I show you the math way, and then I show you the way people really do it. And I give
code examples for each one. And then I send some, if you download the code, you get unit tests,
and the unit tests actually prove that they come up with the same answer if you know what's going on.
I'm going to take a little step back here since we've gotten pretty far into the details and ask a listener question.
Sure.
From Doug G.
What is the best practice?
Do you put the CRC in the data to be CRC'd?
And do you put something blank in,
like FFF or 000,
or do you put the CRC outside your data block?
Ah, okay.
So there's a couple things going on there.
There's whether or not you...
Let's talk about the hardware shift register version, okay?
Because some people find that easier to visualize.
When you're computing
a CRC, you have to reset the shift register before you start feeding bits into it. And do you reset
it to zero or do you reset it to all ones? You know, what do you do with that? You can do anything
you want and it does not affect the fault detection other than a non-zero value makes it easy to see
if a bunch of zeros have been stuck in front by accident,
which some communication networks have that failure mode.
Then you run the computation.
At the end, you have a value.
So you do initial value of whatever you want.
Non-zero is probably better, but it's not a huge deal.
You don't go back and rebuild your gizmo
just because you didn't do that.
And you crunch some data, and you come up with an answer, you put it next to it.
And the question here is, when you're checking the CRC, what do you do?
Well, the purely mathematical way, in fact, includes the check value in the computation.
So you run the data through the shift register and you run the check value through the shift register and you check for zero at the end.
And if you're building hardware, you can do that.
If you're building software, it turns out that whether you shift a bunch more times to see if the result is zero, including the check value, or you just do the data and compare it to the check value, it's the same computation
that makes no difference. So the answer is it makes no difference. Take your pick.
There's no difference in fault detection. So I can have data, my structure that has all of my data
that I'm serializing and sending over a pathway. I can either have this CRC on the inside and set
it to zero or all ones when I'm doing the calculation, or I can have it on the outside and not set it at all.
It depends on which software algorithm you use, but the high-speed algorithms don't care what you set it to.
As long as I do the same on both sides.
As long as you do the same on both sides.
But the high-speed algorithms actually do not care at all whether you pre-initialize the data in your code word or not.
They don't care.
They ignore it.
And there's some code in the book that shows how that works and then explains why that's so.
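(The two equivalent checks, sketched against the crc8() above with a zero seed; for that particular software variant, running the received check value through the division leaves a zero residue.)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

uint8_t crc8(const uint8_t *data, size_t n, uint8_t seed);  /* from the earlier sketch */

/* Option 1: recompute over the data only and compare to the received check value. */
bool check_by_compare(const uint8_t *data, size_t n, uint8_t rx_crc) {
    return crc8(data, n, 0x00) == rx_crc;
}

/* Option 2: run the received check value through the division too, and test
   for the constant residue (zero here). Same fault detection either way. */
bool check_by_residue(const uint8_t *frame, size_t n_with_crc) {
    return crc8(frame, n_with_crc, 0x00) == 0x00;
}
```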
Okay, I have a question.
You mentioned at the top of the show you do a lot of uh embedded networks was what got you into this
Yeah. And networking is primarily still TCP/IP, which has a checksum, which is fairly,
I mean, it was established probably a very long time ago, back in the Flintstones cars era. I think RFC 1000 or something. Is that checksum a good checksum?
Not particularly. I think you said one of your listeners wanted to know what's the worst protocol checksum, and it's the internet checksum. The one that's in the most widespread use that's the worst is the internet checksum, because it's a ones
complement addition, and they could
have done so much better
with very little
extra compute cost. I mean, at least
it's not a twos complement. I mean, that's a start,
but it's... I mean, very little
extra compute cost back
when they were deciding it was probably a little
more than compared to
now. They could have had none.
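(For reference, a sketch of the Internet checksum being criticized here, in the style of RFC 1071: a one's-complement sum of 16-bit words, folded and complemented. It assumes an even byte count and glosses over the byte-order handling a real stack needs.)

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum sketch: sum the data as 16-bit words into a wide
   accumulator, fold the carries back in (one's-complement addition),
   then complement. Cheap, but weak compared to the alternatives. */
uint16_t inet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);  /* end-around carry fold */
    return (uint16_t)~sum;
}
```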
Well, there were proposals to change it, right?
Wasn't there an alternate?
I remember there was some alternate, like they were going to switch to Fletcher or something.
Yeah, and switching to Fletcher would have been great.
Actually, you can do better than Fletcher, it turns out. There's a thing called a dual X checksum,
which didn't exist until I started writing this book. Let me go into that one because that one's
kind of cool. It's simple to explain. When you're doing a Fletcher checksum, let's say you're doing
a 32-bit Fletcher checksum, you're running two 16-bit sums next to each other. You have the A sum that takes the data 16 bits at a time,
and you have the B sum,
which is just the sum of the A sums
wrapped around, you know, cycled back into the B.
And so you're doing 16-bit additions,
and you may be doing it with a 32-bit CPU,
but, you know, you have to do this pair
of 16-bit additions each time.
And then you have to do a mod operation
to reduce it, to not lose the carry bit.
And it turns out that if you run a Fletcher checksum
32 bits at a time,
so you have a 16-bit addition,
you add a 32-bit chunk in,
and then you mod 65,535, which is 2 to the k minus 1 for 16 bits.
If you add in 32 bits and then do the mod,
you can process the data twice as fast.
And the cool part is, Fletcher checksums give you,
it's called Hamming Distance 3.
You can detect all 1-bit and two-bit
errors, and then there's some threes you don't
get. If you do
processing 32 bits at a
time, not only are you
faster, but you get
that Hamming distance three out to
twice the length.
So it's faster, and it gives you better
fault detection by processing
32 bits at a time.
And the book has an explanation for what it is, but it's like, that's just amazing that it works that way.
And I've never heard anyone talk about that possibility until I did some work last year and sort of stumbled into this.
So they could have done that and it would run, it's like 1.5 times as much time.
It's not even twice as slow.
It's pretty close to the same speed as one's complement and gets you all two-bit errors, all two-bit faults, which addition doesn't give you.
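(A sketch of the mechanism as described: keep the same two 16-bit one's-complement running sums, but feed the data in 32 bits at a time. This is an illustration of the idea, an assumption about the details rather than the book's reference dual X implementation, so consult the book for the real definition.)

```c
#include <stdint.h>
#include <stddef.h>

/* Dual running sums mod 65535 (2^16 - 1), fed 32-bit chunks instead of
   16-bit ones. Per the discussion above, this processes data faster and
   extends the Hamming distance 3 coverage to longer data words than a
   plain Fletcher checksum. Sketch only -- see the book for the real thing. */
uint32_t dual_sum_32at(const uint32_t *chunks, size_t n) {
    uint64_t sumA = 0, sumB = 0;
    for (size_t i = 0; i < n; i++) {
        sumA = (sumA + chunks[i]) % 65535;  /* one 32-bit add, then reduce mod 2^16-1 */
        sumB = (sumB + sumA) % 65535;
    }
    return (uint32_t)((sumB << 16) | sumA);
}
```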
That's so unintuitive, and yet I have wandered around my life
thinking, oh, checksums
and CRCs are pretty intuitive.
You just make sure that the bits are
what they're supposed to be, and
now you're telling me that
I don't need to spend a whole lot more
processing and a whole lot more
communication data
length to get better
effects?
That's right.
That's like something for nothing.
Very suspicious.
Shockingly close, which is why,
I mentioned the billions of simulations to make sure I didn't make a mistake.
I'm like, really? Really?
I'm not going to believe this one
until I beat on it for a few weeks of CPU time.
How do I ask this question?
Is this a field that's been kind of ignored for a while?
It's interesting to me because
it's interesting to think how much
computation power is going on all the time computing checksums.
Every single packet of data that's happening on the internet
is having its checksum checked.
And that's a lot of stuff.
That's a lot of computing power devoted to just adding numbers
and seeing if something went wrong.
Yeah, right.
And doing it badly, apparently.
Lately, I've been thinking about efficiencies
that we're just kind of ignoring in the world.
From waste heat to all these things.
And this sounds like an inefficiency that's fairly large,
although now it's all computed in custom ASICs
and maybe it doesn't matter that much
to get any more efficiency because...
Well, in embedded systems, it's still a lot of stuff.
Right, right, yeah, yeah.
But Christopher, it's actually worse
because there are cases where you could do dramatically better
at zero cost.
Right. Those are the ones,
that's why I got in this field. I found these
places where with zero cost
whatsoever other than
everyone agreeing to change their protocol.
Which is a big ask.
Which is a big ask, but that's a
social ask. It's not a
CPU time ask.
Right. And it turns out new protocols get invented all the time.
And so there's actually a lot at work here. And you asked me what my least favorite polynomial
was, and the IEEE 802.3 polynomial is the one I gave to you in hex notation, okay?
Because it's actually pretty terrible.
And people have known
for decades you can do better.
And as you point out,
it's hard to change that.
But at the time, they picked it.
They picked it for reasons, and that's
fine, because that was the best they knew,
and that's fine. But you can do
so much better
than that at no compute cost just by changing a constant in your code.
And boom, you're dramatically better.
And so that's what fascinated me about this topic.
It's sort of what you're saying is that you can do so much better for low cost or sometimes no cost, and you just sort of have to have the knowledge to be able to do that.
What kind of witchery is this? It's mathematics, isn't it?
Well, it's kind of math, but it's kind of not. That's the part. So, you asked if it's a broad
field. There are very, very, very few people who are specialists. There's a lot of hobbyists and
a lot of people who are sort of interested, which is cool. I get, you'd be surprised how often I get emails
on this topic from people who just think it's cool
to get into this.
And that's great.
I mean, it's a fascinating area.
There's very few people who've really gotten to be pros
and most of them did so in previous decades.
So they're not active anymore.
There are certainly knowledgeable people around this
in coding theory to be sure, but the coding theory guys spend most of their
time on other things, not on this. So there's not a lot of people working on it, right? And
encryption is probably where the overlap in math is, and everyone will just go do that.
Yeah, you can make a lot more money with encryption. That's fine. Yeah, this is a hobby. There's no big checksum lobby
cutting me checks to, you know.
Are you in the pocket of big checksum?
I'm not, sir.
Well, wait, actually,
that's a good question.
That one?
No, I don't think so.
You're bringing back
to congressional testimony
that I don't want to relive,
so we're just not going to go there.
Why?
Why do this book?
I mean, it's actually turned out
to be way more interesting than I expected
for a 700-page book about CRCs.
Well, that's the Kindle version.
The print one is half that size.
Oh, good.
Just so nobody's scared off.
So nobody's scared off.
Your book came up at 700-something on Kindle.
Oh, really?
Yeah, yeah.
Oh, so they're about the same length.
Yeah.
It's like 370-something.
You know, it's under 400 pages printed.
It is, I mean, it's a dense, detailed book.
The introduction is really good.
And then there are some sticky parts that definitely my brain had to go through multiple times to get, which is a hard problem.
But you didn't just do this for fun.
How did you come about this?
Actually, kind of, yeah.
Oh, okay.
Well, never mind.
I have an odd sense of fun.
What can I say?
So let me give you the history of how I got sort of sucked into this area, because I didn't
wake up one day and say,
I think I'm going to be the king of checksums.
You know, it's like,
it was not my life plan when I started this.
Please do not
use the show title King of Checksums.
I don't want to have to live it down.
I know how her brain works.
No, no, I wasn't going to be King of
Checksums. It was going to be, I'm Going to Be
the King of Checksums.
But that would be false because it's not what I did.
So what happened was I was trying to get into embedded networks
because I thought they were pretty cool.
And there's a constant reinvention of embedded networks.
So the thing about embedded networks is everyone makes their own network.
Why?
It's been going on for decades.
And the reason is because they need optimizations that they can't get.
Right?
So I'm not talking reinventing the internet.
Yeah, yeah, yeah.
That's a whole different thing.
But, you know, you've got a little bit 8-bit micro,
and you've got a sensor, and SPI doesn't quite do it.
Although you should use it if you can.
Or using SPI,
but what's the protocol
for transferring data on top of it, right?
Yes, yes.
I had this problem last week
and I made something stupid up, yes.
Yeah, or using controller area network
and you got these header bits,
but what pattern to use in the header bits?
So embedded networks
are continually reinvented
and that's not going away anytime soon.
It just is, okay?
Yes, you should use a standard network if one fits, but often it doesn't.
Or sometimes you get folks like, we need a better network for general aviation on top of CAN.
And there's a committee that's working on that that I talk to once in a while.
So this is going on, and so I was
contacted by some folks in both automotive and trains. It turns out there was a big checksum
cartel in Switzerland in the 90s at ABB Research. There's this guy, I never got a chance to meet him,
Funk, who actually did a lot
of the CRC work, had found the optimum CRCs up to 16 bits and published it, right?
And so he and this cabal of, I'm having fun here, but this cabal of embedded network guys
were actually the folks who were behind a lot of the different networks, the automotive
networks and the train networks.
It's all the same handful of guys at ABB Research, near as I could tell.
And they hired me out for a second opinion on a train control network.
And I also was doing automotive work as part of the self-driving car stuff.
And so one of the things we did was we just came up with Monte Carlo simulations
to see if the CRCs were really working as well as they thought they were.
And so Monte Carlo, you take a random network data packet and you throw some random corruption at it and you see whether the checksum catches the fault or not.
Rinse, repeat, right?
Rinse, repeat, a lot.
A lot.
Yes, a lot.
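(The rinse-and-repeat loop in miniature, reusing the crc8() sketch from earlier: build a random packet, inject a random two-bit fault, and count the corruptions the check value fails to catch. rand() stands in for a real RNG, and the packet length and fault count are arbitrary choices for the sketch.)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

uint8_t crc8(const uint8_t *data, size_t n, uint8_t seed);  /* from the earlier sketch */

#define LEN 64  /* packet length in bytes -- arbitrary for the sketch */

int main(void) {
    unsigned long trials = 1000000, undetected = 0;
    uint8_t pkt[LEN], bad[LEN];
    for (unsigned long t = 0; t < trials; t++) {
        for (int i = 0; i < LEN; i++)
            pkt[i] = (uint8_t)rand();               /* random packet */
        memcpy(bad, pkt, LEN);
        for (int f = 0; f < 2; f++)                 /* inject a 2-bit fault */
            bad[rand() % LEN] ^= (uint8_t)(1u << (rand() % 8));
        /* undetected = packet really changed, but the check value matches */
        if (memcmp(bad, pkt, LEN) != 0 &&
            crc8(bad, LEN, 0) == crc8(pkt, LEN, 0))
            undetected++;
    }
    printf("undetected: %lu of %lu trials\n", undetected, trials);
    return 0;
}
```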
There's a few permutations.
It's one of the really spiky noisy curves.
You don't need too many.
If you want that curve to be – so some of the curves in the books are a little bit spiky,
and that tells you how incredibly painful that one was to get to converge.
Most of them are smooth.
My CPU is just, you know,
I have an eight core, my desktop's an eight core and I have an older eight core in the next room
and they're just begging me for mercy
at the end of this book.
Okay.
And so you do that.
And so we did that.
And I just sort of got fascinated by the question of,
well, is that the best one they could have used?
And the answer was, the only way to know is to try them all.
And that one takes, you know, age of the universe is just getting started if you do it the brute force way.
And so the book has a section, which is probably what melted your brain, Elecia, about how I sped that up by many orders of magnitude.
And it's okay for that to be tough sledding.
But I thought I would write it down so someone else can replicate the work.
Because this is the first time I've ever written down what I actually did.
That's why it's there.
Cool.
So that's how I got sucked into it.
And then when TT Tech was doing TTP and the FlexRay Consortium was doing FlexRay
because they didn't want to pay royalties, the TTP people.
They all came to me to look at their checksums and their protocols.
It was pretty funny.
And I just sort of got into being the CRC guy.
I got some funding from the FAA to work on CRCs and checksums because they're like, well, is the airplane going to fall out of the sky because of the wrong checksum?
And it turns out nobody knew the answer.
So I got a little tiny bit of research money from them.
I teamed up with some guys at Honeywell Aviation,
and we co-authored, they led, but I co-authored
an aviation embedded network handbook for airplanes.
So this has really been my journey through embedded networks,
but while I was at it,
the CRC checksum thing kept coming up and coming up and coming up. And I was the guy who had
acquired the knowledge over time to be able to answer it. So here I am decades later.
I'm at the university now. I won't be there that much longer. Time to retire to do something else.
And it's a good chance to just take all the stuff in my head
and get it down on paper so that it's there
because my guess is 10 years from now, 20 years from now,
people are still going to want to know about checksums and CRCs
and it'll be there for them to learn from my journey.
And there is something really nice about a book.
Instead of trying to learn each piece from the internet, having it laid out in order with terminology that's consistent.
Yes, that I didn't realize how much of it I knew, but didn't know that I knew.
That's part of it.
But, you know, writing the book is also a journey for the author.
Yes. So, the bunch of pieces I hadn't put together either until I wrote the book. And that was
actually part of the satisfaction, was getting it straight in my own head and then trying to
help the reader get it straight in theirs. Okay. So, you wrote that book, but you've
written a couple of others since you were last on the show. Yeah, I guess I've been busy. That's fair.
How Safe is Safe Enough? Measuring and Predicting Autonomous Vehicle Safety
and the UL4600 Guidebook, What to Include in an Autonomous Vehicle Safety Case.
So autonomous vehicles. Osau wanted me to ask about this.
What is the state of autonomous driving?
Where should it be heading?
And where is it heading now?
Maybe, pun intended, we're not sure.
And basically just said that autonomous vehicles are more interesting than CRCs.
But I'm not as sure any longer because...
I'm incredibly bored by autonomous vehicles, but yeah.
I mean, it's a hard problem. But I didn't know CRCs were interesting at all.
So, you know.
Well, if I convince you there's someone interesting, then that's a win.
I think they're fascinating in a very geeky way, right?
There's that.
They really are.
And that they need to be simulated via Monte Carlo in order to find the best path is odd.
The thing that gets me is the lack of, I think it kind of wrinkled my brain a little bit,
and the question about is the TCPIP checksum a good checksum goes to that.
It's like, these are really important, and it doesn't seem like people have really been thinking rigorously about them for a long time.
And so it makes me happy.
It goes back to Greg Wilson saying, maybe we should try testing this stuff that we all agree is true.
That's fair.
Let me go back and another piece of why I wrote it.
The reason I stayed with the research is I kept turning up.
I was originally way way back, decades ago, I was going to write a book on embedded network protocols because I found an interesting subject.
It's a way to focus your mind to start a research program.
And I tried writing one.
And what I found out was every time I said, oh, everyone says to do this and this, is there a reason why that's a good idea?
The answer is no,
they're just making it up. It's like, which CRC polynomial? So this is the pattern of the
feedback of the XORs. Which one should you use? Well, the one that's in the standard. And some
of the ones in the standard are just awful. And some of them are really good. The ones that came
out from those ABB guys in Switzerland, they made good choices.
The CAN polynomial
is really good.
The Train Communication Network,
those guys knew
what they were doing.
But that's like
the only place I've gone
where I've seen
impressive results.
The rest has just been
like, really?
You decide to use that one?
Are you kidding me?
Parenthetically,
this is a similar thing
that bugs me
about machine learning.
Getting back to self-driving cars.
There's a lot of stuff in deep neural networks and machine learning where it's just like, well, we need an activation function for these things to make it nonlinear.
Let's try this.
And they try that.
It's like, oh, it works all right.
And then they just stick with it.
It's like, wait a minute.
So there's an important difference there, okay?
I'm an engineer, not a mathematician.
So empirical evidence works for me.
I don't have to have a proof.
And we never really answered this question.
We're diving all over the place, but hopefully the listeners are okay with this.
The reason I say math isn't enough is when you're talking about detecting all two-bit errors and three-bit errors with the CRC, you can prove that mathematically.
Although I give an intuitive bit motion-based argument rather than the math in my book, and then I point to the proofs.
But once it's detecting all four-bit errors and five-bit errors, which you need for safety critical, the math doesn't help you.
The only way to find one that detects all five-bit errors is to try them all,
all possible five-bit errors, and see how it turns out. And that's fine. I'm okay with empirical.
I'm not saying you need to go to first principles. But stopping and resting on laurels is what bugs
me. It's like, maybe there's something better. Well, the difference is, historically, there
wasn't enough compute power to try all five-bit errors.
Yeah, that's fair.
And so, the work I've been doing in the last 10 years or whatever is I actually found algorithmic speedups that I can actually guarantee, no, really, I've looked at every single possible 5-bit error, 5-bit fault over 64K bits.
No, really, I've done it. And over 32k bits, no, really, I've done it.
Even though if you do the math,
it's not possible.
The book explains the algorithmic speedups
that let you do it.
And I can prove that this has it
as opposed to,
ah, it looks okay, let's go with it, right?
And so if you empirically have shown
by brute force evaluation
or something unquestionable,
then I'm okay not having the theory. But you're right that what has really happened is people have just sort of said,
this looks okay to me and we're using it because they used it. And there's no rigorous evaluation,
it just becomes folklore. Yeah, that's why I got into this, because I kept seeing all this folklore and stuff, like Numerical Recipes.
You guys know Numerical Recipes, right?
Numerical Recipes in C, version 2.
In C and Fortran.
Yes.
So, second edition has a typographical error in one of the polynomials for their CRC chapter.
And people have used that for thousands of years.
That's right.
And if you look at the third edition, there's a little footnote thanking me for correcting them on that.
So that's the kind of stuff that bugs me.
I was just finding this stuff all over the place.
And, you know, it isn't like...
But like what are you supposed to do as a software developer?
You're going to look at that and you're not going to know.
There's no way for you to know that that's wrong, right?
That's right.
And so I decided it was a worthwhile use of my time sort
of as a hobby. You know, I've got a little bit of research funding, but trust me, it's mostly
been a hobby, right? Like I said, there's no big Checksum... Checksum World Tour. Yeah, it's not the Checksum World Tour. You know, most of it's been a hobby. I'm very appreciative of the sponsors who did give me support when they could.
But, you know, a lot of it's been on my own time, which is fine.
I'm not complaining.
It's great.
It's an interest, but that's why I call it a hobby.
Part of my hobby was, you know, it's been decades and people have just been getting it wrong on folklore.
Why don't we write down the right answers so people now know the right answer?
Simple as that.
I mean, I thought people knew the right answer, and I was just using their code.
Yeah, yeah, exactly.
Well, in fairness, they thought they knew the right answer, too.
But only because that was what they learned, too.
Exactly.
Well, who's to fault?
Who's the original person at fault here?
Let's go find out.
There's no fault.
There's a bunch of engineers who had to make decisions on limited knowledge.
And the knowledge has changed and the decisions haven't.
That's very typical of a lot of things, isn't it?
Speaking of which, we'd be remiss if we didn't answer Osau's actual question.
Oh, are we going back to self-driving cars?
Well, you asked the question,
so you should probably let him answer it.
What is the state of autonomous driving?
Where should it be heading?
And where is it heading now?
And I know you have some very good YouTube videos
on this question.
The answer is look at my YouTube channel,
but I'm happy to give the short answer.
The short answer is the industry's selling it on "we're busy saving lives," but the reality is nobody really knows how that's going to turn out.
And I didn't say they aren't, but clearly the evidence isn't there yet. Fatality rates are about once per 100 million miles, round numbers, so there's a bunch more tens of millions of miles to go before we know how it turns out.
Now, the industry would say we're really confident it'll be fine.
It's like, well, yeah, but there's some assumptions in the models that I don't happen to agree with.
And we'll see.
Maybe they're right, maybe they're wrong.
I don't know.
Some of the people who are part of that industry are not particularly, as you would say, trustworthy.
Well, and 16-year-olds are also very confident about their driving capabilities.
I wasn't.
I tended to get on the freeway the wrong direction, so I knew I was a screw-up.
That's fair, but the 100 million miles includes the drunks and the teenagers.
Exactly.
And that's where they set the bar, right?
Yeah.
Okay, to be clear.
So, we're going to see how it turns out.
And I know some people think I don't like the technology.
Well, if I didn't like the technology, why would I have written the international standard that I wrote?
Technically, I originated it, which means I dumped a couple hundred pages of standard into a process.
And then the industry helped clean it up,
and then it was a real standard, consensus standard.
I invested a lot of time in that.
I wouldn't have written two books on this topic if I didn't like it.
I like the technology.
What I don't like is dangerous technology, and there's a big difference.
It does feel like, to me, that there was a great deal of hype for,
what year is it now? 2024. My God. There was a great deal of hype, say, seven years ago, six, seven, eight years ago. And everybody was like, this is just going to take off. What are we going to do with all these truck drivers who can't drive?
I remember hearing that, yes. And it's going to affect the economy in all these amazing ways. And then it just kind of plateaued as well.
It's pretty good at lane keeping.
And as long as you pay attention, you know, you can relax when you're commuting a little bit more.
But as soon as you have a construction zone or an emergency vehicle, it all just goes kerplooey.
I'm not going to weigh in on that part.
It's all fair.
I'm not going to weigh in on that.
Let me give you high-level perspective.
I've been doing self-driving cars as long as I've been doing checksums and CRCs.
Okay?
So, I started in, I think it was 97.
So, in 1995, Carnegie Mellon had a vehicle, NavLab 5, that went across the U.S.
There's a segment from D.C. to Pittsburgh,
but Pittsburgh to San Diego,
98-point-something percent hands off the wheel.
In 1995, 98% hands off the wheel.
And the alums of that project all love to say,
and ever since, we've been working on that last 2%.
Yeah, yeah, yeah.
And that probably wasn't using DNNs
because that was impossible back then, right?
It was totally different technology.
It was, yeah, it was a long time ago. It was a laptop and a camera doing optical flow is my recollection.
Optical flow, okay.
Okay. And that's good enough to keep you in the lane 98% of the time. It's the 2% that's a bear, right?
If you're alive 98% of the time and dead 2%, well, that's a real problem.
Safety lives at 99.99999999%, you know.
And statistics struggle to get you that many nines without problems.
Statistics struggle with that many nines.
And machine learning is a statistical approach.
So how do you think that's going to work out?
I mean, that's why it's so hard.
And I think that's something people don't get about machine learning,
especially today, even with LLMs and stuff.
It's all statistics.
And all the problems inherent in statistics are there in greater magnitude because you're doing billions of things.
I mean, if 90% works for you, you're great.
If 99% works for you, it's a challenge, but maybe they can pull it out.
But, you know, airplane falling out of the sky, there's an official FAA number for that.
Airplane falls out of the sky once every billion, with a B, hours. Ten to the minus nine per hour.
And that's per hour.
That's a lot of seconds in an hour.
Okay?
Statistics doesn't really get you there, right?
You need something else.
I mean, you can't just do simulation.
No, you can't simulate enough, right?
That's right.
Well, that's why.
So for CRCs, I said I did Monte Carlo simulation for the other checksums and stuff.
But for CRCs, I actually evaluated every single bit pattern, because simulations won't find the one bad pattern. You know, take, let's say, 32K bits, five at a time. How many combinations of that? And suppose it's exactly one of them with an undetected fault. You're not going to find that with Monte Carlo simulation, now are you? Check back after the heat death of the universe.
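For scale, a quick back-of-the-envelope check of that combination count (our arithmetic, not a figure quoted from the book):

```python
from math import comb

n_bits = 32 * 1024            # a 32K-bit codeword
patterns = comb(n_bits, 5)    # every possible 5-bit error pattern
print(f"{patterns:.2e}")      # about 3.1e20 patterns

# If exactly one pattern is undetected, Monte Carlo sampling at a billion
# random trials per second would need roughly this long to hit it once:
years = patterns / 1e9 / 3.15e7
print(f"about {years:,.0f} years")   # on the order of 10,000 years
```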
That's right. And so, I actually do that calculation. It's an impressively large number
in the book. And so that's why I switched over to no, no, we checked them all because that's
the only way to know. That's it. You got to check them all. Like Pokemon. Yes. Oh, okay. So yes,
informally, the folks who go out and test and something goes wrong and they fix it and they test and they go out, it's like they're going out playing Pokemon Go and trying to collect them all.
And the problem is there's an infinite number of Pokemon.
And that's why it takes so long for self-driving cars to get scalable.
It's as simple as that. But in CRCs, Elecia's writing that down, I know.
In CRCs, the results in the table in the appendix, those are what I think are the best ones, or at least really good ones, right? Those are based on, no, really, I caught all the Pokemon.
All right. She's writing that down as a title, too.
I'm saving so much for this. Now we've got so many titles. Now we're just going to...
And we are pretty much out of time.
Phil, do you have any thoughts you'd like to leave us with?
I'll tell you really at the high level, let me say this.
A lot of this is about engineering.
So what I found with the experience of the book was that math only got me part of the way through and the rest of it had to be engineering and other techniques that weren't just mathematical analysis.
That's part one.
And part two is everyone wants to know what's the best CRC.
And the answer is it depends.
But more generally, maybe a checksum is what you need, because one of the graphs in the book is the speed versus time tradeoff, right? If this is how much CPU power you have and this is how much data you have,
this is the algorithm you should use to get the best bang for buck.
And so if you think you know the optimal answer to a general problem,
probably you don't understand the question.
Engineering is never about the one
optimal answer. Engineering is almost always about trade-offs. And the one thing I try and do this
book, but also when I teach, when I do research, is to help people understand the trade-offs so
they can make an informed choice and pick the answer that's right for them.
If you would like to pick up Philip Koopman's book, Understanding Checksums and Cyclic Redundancy Checks, it is out in all forms.
Our guest has been Philip Koopman, professor at Carnegie Mellon University. He's also the author of two books about autonomous vehicle safety and the classic Better Embedded System Software.
He also has three blogs, one about each subject here: Checksum and CRC Central, Safe Autonomy, and Better Embedded System SW.
Thanks, Philip. This was really quite entertaining.
Thanks for having me on.
And you know what? I never thought I'd be on a podcast for Checksums and CRCs.
I know, it's amazing.
So thanks for making my day.
Thank you to Christopher for producing and co-hosting.
Thank you to our Patreon listener Slack group for questions.
And of course,
thank you for listening.
You can always contact us at show at embedded.fm or at the contact link on
embedded.fm.
Don't forget that there is a whole page of show notes,
which will include links to Phil's blogs and his books.
And now a quote to leave you with from, well, let's just go with Koopman.
Safety isn't about working right most of the time.
Safety is about the rare cases where it doesn't work properly.