Two's Complement - How Fast Is Fast?
Episode Date: February 14, 2026
Ben interviews Matt with a deceptively simple question: make my program go fast. 44 minutes later, robot dogs are falling over, Grace Hopper's wire makes an appearance, and Matt still hasn't gotten the job.
Transcript
I'm Matt Godbolt.
And I'm Ben Rady.
And this is Two's Complement, a programming podcast.
Hi, Matt.
How are you doing?
I'm doing pretty well.
We've just had a moment of exasperation, haven't we?
Yeah, we're probably talking over the intro.
Probably not at this point, but maybe earlier.
Maybe earlier.
I don't know.
I, yeah, I'll fix it in post as I always try to do, or increasingly less.
In fact, I try.
I think I'm kidding myself.
that the charm is all about, like, the fact that a dog's barking in the background and there are children screaming across the road, and, um, lip smacks and other things.
Um, it's kind of ASMR, actually, isn't it? No, yeah, ASMR. So recently there's been a lot of talk about the Dutch manufacturing company that makes the lithographic systems for chip manufacture, and they're called ASML.
And I just can't help but get the two things mixed up.
They're very different things.
But that's not what we came here to talk about today, is it?
It's not. It's not.
I have an idea.
I have an idea.
I have a thing.
Tell me about your thing.
Whoa.
Let's say, careful now.
Let's say for the sake of argument,
I have a program.
Stop it, phone.
There we have more charm.
More charm, that is.
Just in the next room is Ben's phone, his unmuted phone.
Uh-huh.
I have a program and I want it to run fast.
How would I do that?
Oh, wow.
That is a great question.
And we only have around about 30 minutes, 40 minutes to talk about it.
So I've got a feeling.
So immediately this is going to be our first 10-part series.
Maybe so.
I don't know.
I don't know.
This is one of those questions because I think obviously,
deliberately you have asked me a leading and vague question.
Because what do you mean by, in fact, each of those words, right?
We really ought to define each of the words that you just used in that sentence.
My program, make my program fast.
Right.
So I think 'make', we can, yeah, take that to mean cause it to become. 'My program':
You have something that you wish to go fast.
Right.
So the only word we really need to define, then, is 'fast'.
And I'm pretty sure you don't mean stopping eating for a period of time.
Right. Well, I think you could also define
my program because in this hypothetical
thing, I haven't written anything yet.
Oh, okay. It's like I want
to write a program.
It doesn't exist, but
I want the execution
of that program to be fast.
Fast. Again.
For some definition of fast that we will arrive at shortly.
Right. And so, yes, not no eating,
not also unmoving,
which is the weird other case of, you know, stuck fast,
which really is the opposite of this.
One of those auto-oxymoronic words, you know, like inflammable
or something like that.
You mean quick, to go quickly.
All right, so let's talk about what that could mean, right?
The instant you say that, and because of where
what we've been doing professionally for a long time
and for what I've been doing most of my career,
I think of low latency as fast.
Like if you decide you need to do a thing,
if your program needs to do a thing and it
discovers that that need, it should act upon it as quickly as possible having discovered that
need.
That is what I mean by sort of low latency.
It reacts to a real-time thing as quickly as possible.
And I think a lot of embedded systems work in this world.
A lot of UI, in fact, is this, right?
If you said, hey, I want my UI to be responsive, that's about latency, right?
But that's one form of fast.
But there's a very real, very important other kind of fast, which is:
I am writing a distributed message-passing system, a massive one, right?
I'm dealing with an enormous volume of information, and if I was slow at doing that, eventually I would back up and I would never keep up with the incoming flow.
Well, it doesn't matter if individual messages take a few seconds to go through; maybe that's not fast, or maybe that is fast at that scale of network, right?
but you're talking about a
throughput problem, you know, or bank clearing is another one.
They've kind of got a bit of latency.
You know, you beep your card when you're at the coffee shop, and it goes beep.
And so you're waiting six or seven seconds, and that's a completely fine human amount
of time.
You know, it's an eternity for computers.
But on the other hand, there are several hundred million people in that same five-second
period tapping their card on their reader at the same time.
So there's a lot of things to go.
So those things.
So we've got latency and throughput.
And are there any other aspects?
I think for the purposes of this conversation, latency and throughput are the two
dimensions of fast, which we should consider. Okay. I think you could maybe come up with some other ones,
but again, we have 30 minutes. Well, I mean, again, we can make this into more, but so, so.
That's true. Do you want to, which one do you want to start with? I think we should start with
latency. Okay. Okay. So, so then we get into this sort of amazing set of questions that we get
into when we're dealing with computers, because how fast is fast, right? How quick is quick?
Like we just said, if I'm a human waiting at a sales kiosk, if it takes me 30 seconds to pay for
something, that is not fast. If it takes me three seconds, it's probably okay.
If it takes 0.3 seconds, I won't notice that anything happened, really.
There's a beep and I'm out of there.
And any other order of magnitude less than that, like 0.03 seconds or whatever, is immaterial to me at that point.
And it doesn't matter.
But you and I work in a world where, for at least some of the things that we have done or been adjacent to,
we are talking about maybe seven or eight orders of magnitude faster than those three seconds.
We're talking hundreds of nanoseconds, hundreds of microseconds.
And I have to remind myself that a nanosecond is a billionth of a second, right?
Sometimes we throw these words around.
I mean, maybe this is just, again, you and I in the weird world that we live in.
But like nanoseconds are kind of a completely normal measurement for a, you know,
3 gigahertz computer, you know, a few clock cycles is a few nanoseconds.
That's completely normal.
Yeah, that's reasonable.
And then, you know, there's good old Grace Hopper's famous bit of wire:
a nanosecond is a foot of copper wire. That's a nanosecond.
You're like, oh, shucks, man, you know, that's the speed of light.
The fastest thing that there is can only go a couple of feet in the time it takes me to add
two 64-bit numbers together.
That is insane, frankly.
It is insane.
You know, whenever I think about that, I just think about Grace's wire, just twisted around
in the inside of the CPU and all the places where it would need to go in order for the light
to move in that way to add those two numbers.
And it just makes my brain melt.
Yeah, it is, it is quite, quite something.
And the fact that, yeah, we've taken these vast machines,
and we've crushed them down into the size of my, you know, thumbnail.
And somehow snaking around in there are many, many, many, many hundreds of feet of cable.
As you say, each individual electron is, well, they don't go all the way down.
It's complicated.
right?
Yeah.
Let's stop pretending that we can talk about that as well.
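Grace Hopper's number checks out with a quick back-of-the-envelope calculation. A Python sketch; the only input is the speed of light in a vacuum:

```python
# Grace Hopper's demo: how far does light travel in one nanosecond?
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # metres per second, in vacuum

def light_distance_per_ns() -> float:
    """Distance light covers in one nanosecond, in feet."""
    metres = SPEED_OF_LIGHT_M_PER_S * 1e-9  # ~0.3 m
    return metres / 0.3048                  # metres -> feet

print(f"Light travels {light_distance_per_ns():.2f} feet per nanosecond")
# -> Light travels 0.98 feet per nanosecond
```

Roughly a foot, just as the wire says; signals in actual copper travel a bit slower than this, so her foot of wire is, if anything, generous.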
So, yeah, we've got many orders of magnitude here.
So, you know, we can break it down into things that need to be measured in nanoseconds,
in microseconds, in milliseconds, and in seconds.
And, you know, that's, I think probably after that, I think, you know, the latency of something,
maybe, I mean, maybe you still can measure it in minutes, right?
I mean, let's think about, like, you know, publishing a video on YouTube.
I've just updated my video on YouTube and I click the button to publish it,
and it takes how many minutes to transcode it
to all the different formats that YouTube does
before it becomes visible,
maybe that is measured in minutes,
and maybe I do care about that latency.
But I think the thing that I can confidently talk about,
and maybe you as well,
is the sub-second orders of magnitude of latency.
Yeah.
So let's talk about what can you do in that amount of time.
Right, right.
That's a good starting point of like,
Not only, you know, if you say I would like my program to be fast,
without even knowing what your program is,
I've got some bounds of what you could reasonably expect to do
in hundreds of nanoseconds, hundreds of microseconds, or hundreds of milliseconds, right?
Right, right, right.
On commodity hardware, if we're just talking about things like a regular PC
or a decent server in a data center or a laptop or something like that,
they're all within spitting distance of each other.
So if we think of latency as sort of like the time between when
an input is provided and an output is produced.
Yeah.
Then immediately, down in the, like, tens of nanoseconds range, you're sort of like,
what input could you possibly consume and what
output could you possibly produce in that span of time, right?
Because it's got to go into the computer and then come out.
So like maybe a theoretical example here would be like a robotics application where you have
some like active control of a system or or, you know, a set of servos or something like that
where you're trying to like, you know, imagine like the, you know, like the robotic dogs type
thing where you're just like, you know, taking like some information from like the sensor on a,
on a on a robot and then feeding it back and using that to adjust.
Like that's already going to be up probably in the like low milliseconds range probably.
I mean, I don't know.
I'm not an expert in robots.
But no, but I mean, you think about things that do, like, auto-balancing, or, you know, drones.
Like that can't be milliseconds because it's already upside down by the time a millisecond of wind has caught it.
So it must be a little faster than that.
But, you know, maybe we're talking microseconds at that level.
It seems not totally unreasonable to have some sensor update and put something on a bus internal to a system on a chip kind of thing that could then go, oh, wait a second.
I need to speed up this rotor a little bit more to account for that.
Right.
Yeah.
Because a lot of those outputs can be sort of like targets, right?
So they don't, it's not necessarily that it needs to go like all the way out to like a servo or something.
It's just kind of like, okay, I've gotten this input and now I need to adjust, you know, my targeted angle here.
Right.
And there's maybe even some other external system that is like actually making the adjustments to the physical objects.
But like you might want that target to update on the order of, you know, tens of mics or hundreds of mics.
just because when you do,
you get sort of like a smoother,
better prediction of like where the arm needs to be
or something like that.
Right, right, right, right.
And yeah, so, I mean,
let's sort of break it down to consumer hardware,
it is difficult to get a message in and out of a computer
in, you know, I mean, like you just do a ping, right?
You ping a local network, a computer on your network.
And it's less than a millisecond,
but it's, you know, still hundreds and hundreds of microseconds,
probably to go through commodity hardware.
And for the kind of weird stuff that you and I have been involved with,
there is some very specialist hardware that happens,
which we can't really talk about,
on the far end into the nanosecond range of things.
So I suppose if we're talking about things that most folks could reasonably get,
put their fingers on a keyboard and program,
be it a little embedded system like an Arduino or an ARM-type thing,
or indeed a program like I'm typing on a computer and I'm expecting having pressed a key press
to see the key appear in the terminal, that kind of thing,
We've kind of picked our range now.
We're talking probably hundreds of mics to low millies.
Yeah, okay.
So I've got a program.
Yeah, so we're down.
Yeah, that's right.
I forget where we were now.
It's going to read some input delivered from outside the computer,
and this is a commodity computer, you know.
Yep.
Yeah, yeah.
And then it's going to produce some output that is also going to leave the computer, right?
And we're thinking that this is going to have to occur.
It's going to take at least hundreds of mics for that to occur.
I think so.
It's like the upper bound of how fast we can make it, for my definition of fast.
I think so.
I mean, you know, there are exotic techniques and things that you can even use on commodity hardware
that are a little bit more exotic, like DPDK, which is a sort of kernel-bypass networking thing you can fiddle with.
And if you put a second network card in your computer, you can go faster.
But yeah, I think let's say, yeah, already,
whatever you're doing, if you're responding to an external network event,
you're doing any kind of reaction to it and sending it back out again.
We're in the microsecond region, but probably hundreds of microsecond region,
would be my instinct.
And maybe I'm wrong.
Maybe I've forgotten how computers are these days.
No, we already said normal computers, right?
Normal computers.
Yeah.
I mean, actually, I was telling you before we started recording,
I'm surrounded with networking gear and computers and things and AI agents running things
as I'm setting up.
I'm actually upgrading my home network to like two and a half gig because that's actually
something that is not totally unaffordable for the home these days now.
So maybe my ideas about what commodity hardware is are out of date.
Yeah.
So my first question would be when I'm writing my program, how do I make sure?
Because my assumption here is that the input here is unpredictable.
It's not like you're going to start the program and immediately be provided the input.
The program is running.
There's something outside of the computer that creates this input, and then the program has to react and create an output.
So my first question to you is, how do I make sure that there's the minimum possible latency
between the generation of that signal and the code in my program running?
Something something, Epo.
Right?
No, exactly right.
No, well, yes, yes and no.
I mean, that goes into the DPDK sort of thing I alluded to.
So, yeah, you know, a traditional server program that is maybe blocked on a socket.
So you've got a socket.
You've got, you know, let's say UDP just to make things maybe easier.
I don't know, right?
You've got a blocking call to receive, which means that your process has gone to sleep,
told the operating system, hey, anything interesting coming in on,
this port, this UDP port, I'd like to know, please.
And then I've gone to sleep.
And I've been descheduled.
And now I've just been chucked in the list of all the things that are currently,
all the processes that are currently asleep and have nothing to do.
So that's where we are now.
Pack it comes in, goes through the standard kernel process of saying,
well, who's this for?
Oh, it's port 9, 2007.
That is registered to this process.
So we're going to write it into a buffer now rather than discard it,
or rather we'll keep it probably around in a buffer that exists.
First of all, is there space in that buffer?
Yes, there is.
Okay, this is something you configure for each socket,
how much space that you can have,
which gets complicated if you have multiple processes
listening on the same port.
Yeah.
Anyway, and then we go, well, who cares about this?
Right?
We've put it into the buffer now.
We've copied the data.
Who cares?
Oh, there's a process.
Okay, well, this process is waiting on this.
So we should mark it as ready.
This process is now ready.
to be scheduled.
And then you're done.
And then as part of the whole
you're-done process, the kernel goes,
well, I guess I'm now finished,
hand back to the user process.
Now, who should be run?
And maybe there was something else that was going on
on that CPU when the packet came in.
And that other process is actually still higher
priority in whatever queuing fairness
there is, in which case,
it will merrily do whatever it was doing before
until it runs out of the time slice
that it had allocated it.
And then it goes, oh, someone else has go now.
Oh, look, this thing's ready.
So what I'm sort of getting at here is there is a lot of things that happen between that packet arriving even in your kernel and your process being woken up to do anything about it.
And that's a hugely variable amount of work too.
And you can definitely reduce the amount of work,
the amount of other things that the kernel is trying to do on a CPU, so that
your process is more likely to be woken up.
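The blocking receive path just described can be sketched in a few lines of Python; here both ends run in one process over loopback, and the port is whatever the OS hands out:

```python
import socket

# The "receiver": a traditional blocking UDP socket. recvfrom() parks the
# process in the kernel until a datagram arrives for this port.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = rx.getsockname()[1]

# The "sender", standing in for the outside world.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"hello", ("127.0.0.1", port))

# The process is descheduled here until the kernel copies the packet into
# the socket buffer and marks us runnable -- the whole path just described.
data, sender = rx.recvfrom(2048)
print(data)  # b'hello'

rx.close()
tx.close()
```

Everything between that `sendto` and the return of `recvfrom` is the kernel work and scheduling described above, and it's the variable part of the latency.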
There are schedulers that let you set these things that are priorities and stuff.
But the way that I intuitively do these type of things is using some kind of,
yeah, like we said, DPDK, which is this sort of networking layer that lets you use shared memory with the network card.
And then instead of the kernel being involved at all, you just continuously read a piece of memory that the network card will write directly to,
and it says, hey, I've got a packet for you, and then you just pick it up.
And then essentially your latency goes down to how fast a PCI bus transaction can notify that a piece of memory has changed and you pick it up and off you go.
Right.
But that is still fairly esoteric.
So it seems like that's not a fair thing.
But I guess it's the difference between, right, if the program part that we've not even started talking about now is itself going to take
tens of milliseconds because of the innate amount of work that it has to do. And so the upper bound
is dominated by its processing time. Then maybe the effort and all the work to put in shaving
tens of microseconds off on the interrupt handling time isn't worth it. Right. It's a lot of
engineering effort. It's a lot of complexity. It's a lot of non-standard nonsense. But if we wanted
to make something go as fast as possible, those are the kind of things that I would take to doing is
to take the kernel out of the equation and burn a CPU core,
just spinning, doing nothing other than saying,
is there anything to do?
How about now?
How about now?
How about now?
In a tight loop.
And potentially reschedule everything else so they can't run on the same CPU as me.
There are ways that you can do this with CPU sets, with isolation trickery,
isolcpus on the kernel command line.
There's a lot of weird things you can do.
Anyway, right.
And then you can get it down to,
in the limit single digit microseconds between a packet arriving and you going,
okay, I have work to do.
Mm-hmm.
Okay.
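A real kernel-bypass setup like DPDK needs special drivers, but the burn-a-core, how-about-now polling loop can be mimicked with an ordinary non-blocking socket. A Python sketch that illustrates the loop shape only, not DPDK's actual API or its latency:

```python
import socket

# Non-blocking receiver: the kernel never puts us to sleep.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))          # let the OS pick a free port
rx.setblocking(False)
port = rx.getsockname()[1]

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"tick", ("127.0.0.1", port))

# Busy-poll: "is there anything to do? how about now? how about now?"
# A real low-latency system pins this loop to an isolated core.
polls = 0
while True:
    polls += 1
    try:
        data, _ = rx.recvfrom(2048)
        break                      # okay, I have work to do
    except BlockingIOError:
        continue                   # nothing yet; spin again immediately

print(data, polls)

rx.close()
tx.close()
```

The trade is explicit: one core runs flat out doing mostly nothing, in exchange for reacting to a packet without waiting to be scheduled.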
So if we assume that the work that needs to be done is relatively simple, right?
You're going to do some ads, you're going to do some multiplies,
you're going to try to structure your code so that you're really only doing ads and multiplies,
and then you're going to produce some output that is purely a function of that input and whatever
state is held in memory.
What we've got is an ALU as a service.
You know, we are, have you seen this?
There's a guy who's written a
microservices-based CPU emulator where every single instruction is
handled by a different microservice,
each of which is written in a different language.
And the thing runs.
Oh, my God, that's amazing.
Isn't it cool?
I love that people think of these things.
You think you've been in microservice hell?
You should see this.
Yeah, you've got like literally 256 different things and a RAM, you know, memory service
so that everyone can agree on this.
It's cool.
But so you're talking about something like that.
Let's just say, you know, it is.
In fact, honestly, unless you want to steer it somewhere specific, I'm going to suggest some things,
and you can tell me if where you're trying to steer it is right or not.
But, like, something like a Redis memory cache is, like, one of these things.
It's like something comes in.
I look it up in RAM and I'm like, yes, I've got it.
Here it is.
It's like a small piece of data, you know, a key-value
store that is just, you know, like a cache, right? You're not doing very much work.
It's not as banal as like I'm adding the three numbers you've given to me and giving them back
to you. It is, you can imagine there is actual value in that, but it's still. Yeah. I'm trying to draw a line
about the other parts of the computer that you're going to be talking to. And what I'm trying to say by
this is that you can't ignore memory, right? It's not simply a function of the data that you've read in,
But you don't have to go beyond that, right?
Okay.
So let's use, like, an in-memory cache, like a Redis,
without the disc backing type stuff.
Right, right.
Yeah.
Yeah, yeah, that's perfect.
Okay.
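Stripped to its essentials, that kind of in-memory key-value cache is just this. A toy sketch; the class and method names here are ours for illustration, not Redis's actual protocol:

```python
class InMemoryCache:
    """A toy Redis-style key-value store: everything lives in RAM,
    so a lookup is a hash-map probe plus a memory read -- no disk."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Purely a function of the request and in-memory state.
        return self._data.get(key)

cache = InMemoryCache()
cache.set("user:42", b"matt")
print(cache.get("user:42"))  # b'matt'
print(cache.get("user:99"))  # None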
So yeah, you're looking, I mean, at that point,
without going into the extremely fun and interesting rabbit
hole of how you actually do the lookup and what data structures you use,
that is probably the key thing:
you must consider the data structures you use.
But the interesting thing is that with a latency lens, you're not looking at data structures
in quite the same way as you do when you are doing, like, CS-whatever, I don't know what course numbers
they use. But when you're doing your data structures courses and you're looking at big O notation.
Right, right. You know, big O of N, whatever, n log n for your, you know, so like a not
unreasonable data structure
for a
key value store
is something like a balanced red black tree
right? It's O(log n)
to look things up in it,
whatever. Another one is a B tree
or a B plus tree or one of those kind of trees.
You know, hash maps are other ones as well. And you know,
obviously hash maps do have, in theory, O(1).
So you'd probably want to pick that, going purely by big O notation or whatever.
But the other ones seem reasonable as well.
Well, like, but the dominating factor with something as straightforward as the thing you've just described, where we're literally just looking something up in the key-value store, is going to be reading from RAM.
Right.
Now, if you're talking about something that's so small it fits into L1, L2 or L3, which are of the order of tens of K in the L1 case up to several megabytes at L3.
And, you know, bigger than that on bigger servers and whatnot.
But we're still talking like not all that much information.
Can you describe for our listeners, he said, covering for his own ignorance, what L1, L2 and L3 actually are?
Oh, I'm so sorry.
So there are layers of caches.
Caches are expensive, both in terms of the amount of space they take up on a chip and how fast they can be. You know, going back to our Grace Hopper wire, the bigger something is, the longer it takes to get to the
other side of it.
So the way that CPUs
prevent us from having to read
from the very slow RAM,
which we'll talk about in a second,
is that we have on-chip caches,
and they are layered
so that you have a very small,
incredibly fast,
level one cache,
which is usually not very large,
tens of kilobytes kind of area.
Then you have a layer two cache,
which is maybe a meg,
maybe hundreds of K,
that kind of area,
and then a layer 3 cache,
which is maybe tens of megs,
maybe even a little more than that,
and each of them is slightly further away,
like literally physically.
They're still on the actual thumb-sized chip
inside the machine.
They're further away,
but it takes longer to look through them
to find the data.
And so if you can fit something in L1,
it's kind of almost free.
I'm going to put air quotes over this.
We're talking somewhere in the region of,
you know, five or six cycles to read from L1,
and that includes all of the other things that are, like, the fixed cost of reading memory: doing the lookup for the logical address that you've decided to read from and then turning it into the actual physical address so that it goes to the correct chip.
You know, different processes think that address 2000 has different data in it, right?
Yeah, yeah, yeah.
And something has to do that translation.
And that's not nothing at this scale; when we're talking about, you know, a 3 gigahertz machine, a third of a nanosecond is a clock cycle.
That lookup itself takes time, right? And so, you know, we need to be doing something.
So that's about the limit of it, and there are some other really complicated things that the poor
memory system has to do because of the weird out-of-order engine, but we'll ignore that for now. So five or six
cycles for, like, an L1. And now, I'm going to forget this off the top of my head, I really should have
brought this up beforehand, you know, but, like, low tens of cycles for L2, maybe
late tens, you know, 80, 90 cycles maybe for L3? No, I don't think it's that much now; someone will correct me. So, you know,
order-of-magnitude-wise,
it's still
probably only
tens of nanoseconds,
but then you go out to RAM and it can be
hundreds of nanoseconds if you actually need to go out to RAM.
So those are the layers.
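Putting those ballpark figures in one place: at 3 GHz a cycle is a third of a nanosecond, so the levels translate roughly like this. The cycle counts below are the rough, from-memory numbers above, not any particular chip's spec:

```python
CLOCK_GHZ = 3.0  # a nominal 3 GHz machine: one cycle is 1/3 ns

def cycles_to_ns(cycles: float) -> float:
    """Convert a cycle count to nanoseconds at the nominal clock."""
    return cycles / CLOCK_GHZ

# Rough figures as discussed in the conversation; real numbers vary by chip.
approx_latency_cycles = {
    "L1":  5,    # "five or six cycles"
    "L2":  15,   # "low tens"
    "L3":  60,   # somewhere in the tens, well under the "80, 90" guess
    "RAM": 300,  # out to memory: "hundreds of nanoseconds" territory
}

for level, cycles in approx_latency_cycles.items():
    print(f"{level}: ~{cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```

The point is the shape, not the exact values: each level is roughly an order of magnitude further away than the last.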
The other thing that's interesting is that
usually on chips, the L1 and maybe
the L2, are unique to the CPU
core that you're running on.
So you just get a copy of your own
and then that's yours and no one can do anything with it
with a massive asterisk and footnote
for older CPUs, but maybe
won't go there. And then the L3 typically is shared amongst the other cores that are physically
on the same chip as you. And so there's some contention, and there are obviously conversations
that have to happen between the cores to say, hey, I'm reading that. Oh, I was just writing to it.
Oh, okay, well, we need to make sure that we do this in the right order. So there's a little bit more
complexity there. But yeah, so if we're still talking in single, you know, measurable, human
countable numbers of cycles, keeping it inside the L1 or L2 is great. L3 is okay. And if we have to go to RAM,
we have to be a bit careful about things.
And at that point,
two things are important to know
about the way cache's work.
One is that
the whole
theory behind a cache is that
having read a byte of memory,
it's really, really likely you're going to read the bytes
that are around that byte in memory.
So the fundamental
unit of work from
the memory system on the
CPU to the caches and beyond is
a sequence of bytes,
which is usually 64 or maybe 128 bytes long.
And that's kind of like the atom that gets shipped back and forth.
And some chips even do things where they say,
well, you asked for this block of 64 bytes,
I'll just get the next one and maybe the one after that while we're going,
because we might as well having just done that.
And so this sort of spatial usage pattern is incredibly important for latency
because if you're following a linked list, as you may be,
if you just have a naive hash map,
suddenly you're jumping randomly around in memory
and the poor system can't make a guess.
But if you just have a damn array of everything
and just go, well, it's expensive.
It's an order N operation for me to search through these things,
but I'm going to run through them linearly.
Suddenly, it may be more worthwhile from a cache standpoint,
especially if a cache miss is hundreds of cycles.
Well, I can do a lot of work in a hundred cycles
on an out-of-order execution machine that has 12 execution units, right?
So why don't I just do tons of work and hope that that's quicker than going to RAM?
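The array-versus-pointer-chasing trade can be sketched like so; note that Python's interpreter overhead hides most of the hardware effect, so this only shows the two access patterns, not the real timing difference you'd see in C:

```python
# Two layouts for the same data. In a language like C, the array scan is
# cache-friendly (sequential 64-byte lines the prefetcher can guess),
# while the linked list chases pointers to effectively random addresses.

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

values = list(range(1000))  # contiguous: walk it front to back

# Build the equivalent linked list of the same values.
head = None
for v in reversed(values):
    head = Node(v, head)

def sum_array(xs):
    total = 0
    for x in xs:                 # sequential access: prefetcher's best friend
        total += x
    return total

def sum_list(node):
    total = 0
    while node is not None:      # pointer chase: each hop may miss cache
        total += node.value
        node = node.next
    return total

print(sum_array(values), sum_list(head))  # 499500 499500: same answer
```

Same answer, same O(n), completely different memory behaviour, which is exactly the dimension big O notation doesn't see.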
So anyway, popping the stack all the way back to keep the latency as low as possible,
you're going to have to think about this.
And so you might want to make some concessions where the distribution of your latencies needs to be taken into account.
Now, if you're going to say, well, most of the time, like 80% of the
stuff we're reading is hot.
Like it's going to be commonly used again.
In which case, I'm going to use a data structure, which is fine.
And that 80% fits inside, it doesn't matter what I'm doing,
whether it's a balanced tree or whatever.
It fits inside L3, let's say.
But the 20% that's outside of it is just outside.
It's going to be around.
We're just going to have to take the hit every time,
and accept the fact that it's incredibly expensive because we're using a balanced tree.
So I guess now we turn around and say, like, it's not just how fast does it go,
it's, like, what is the shape of your distribution?
Is it better to have the average case fast and the bad case awful?
Right.
Or is it better to bound the bad case and say, well, actually, I'm going to use a linear search every single time.
And I'm going to say that even in the fast case, where, you know,
it would have been faster just to check, I don't know, the first one, or to pack them in a different way,
you know, I'm still paying the cost to do 16 compares
Yeah, yeah, yeah.
every time and just go, well, that costs me a little bit more than one.
Of course, it does.
But it makes the worst case go away.
Right.
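That pay-for-16-compares-every-time idea might look like this: a hypothetical sketch with a fixed-size table and deliberately no early exit, so the best and worst cases cost the same:

```python
# A fixed-size slot table: lookups always scan all 16 slots, so best,
# average, and worst case do the same work. We give up the early-exit
# win in exchange for a flat, bounded latency distribution.
NUM_SLOTS = 16

slots = [None] * NUM_SLOTS  # each entry is (key, value) or None

def put(key, value):
    for i in range(NUM_SLOTS):
        if slots[i] is None or slots[i][0] == key:
            slots[i] = (key, value)
            return
    raise RuntimeError("table full")  # bounded capacity is part of the deal

def get(key):
    found = None
    # Deliberately no early `return`: always do all 16 compares.
    for entry in slots:
        if entry is not None and entry[0] == key:
            found = entry[1]
    return found

put("a", 1)
put("b", 2)
print(get("a"), get("missing"))  # 1 None
```

In real low-latency code this shape also plays nicely with SIMD compares, but even the scalar version makes the worst case identical to the best.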
So, yeah, sorry, I've just jumped around a bit, as I've sort of realized
the enormity of the question you've asked.
That was the intention of the question.
But yes, whenever you're talking about latency, you need to not, it's not like,
oh, it's 10 microseconds of latency.
No, it's 10 microseconds of latency in what case, what percentage of the time?
You know, oftentimes you talk about, like, P95, P99, P99.999 numbers:
what percentage of the responses are going to be less than some particular target time.
And you may have multiple layers of that, right?
Like, you may have a maximum threshold that you need, and, like, a P99.99, and then, like, a P95, and then, like, a median, right?
And you might want to structure things, you know, differently based on whatever those targets might be.
So in our case, sort of this hypothetical, you know, robot dog type plugged into a PC, perhaps,
that's receiving the signal. Maybe what we'd want to say here, as one fork of the tree to go down,
is actually the thing that I care about is the worst case latency, right?
I want to, because if the latency gets too high, the dog is going to fall over, and that will be
bad, right? It's a very common thing for like, you know, control boards and things. And, you know,
this is why in those situations, typically, embedded systems that have those kinds of very strong
requirements will not use regular operating systems. They use, you know, real-time operating
systems that can give you hard deadlines for things happening within a certain amount of time
and or do no preemption at all. So it's not like a case that your process will have its CPU time
taken away from it. It's like, you know, it's your responsibility to
keep running until you get to the end and then yield to the next guy.
but, you know, this is a huge problem. As computers have gotten more complicated, it becomes
harder and harder to reason about what an upper bound could be, you know, picking up my 6502
in its box over here. Right. It's a little easier. It's a little easier. You know, it's like,
I can tell you exactly how long any single thing is going to take all the time. It's a
completely deterministic and it's well documented.
Right.
But we have to reverse engineer what the hell an Intel chip is doing in order to understand
sort of maybe what the worst case is in that particular situation.
Right.
You know, hey, what happens if, you know, we are trying, we have an L3 cache miss at the same time
as another CPU core, like a physically distant core, has an L3 cache miss,
and both go for it at the same time.
How does that work?
Is there some really weird edge case where, you know, in the fabric it times out after some amount of time?
I don't know, but it's hard to find out.
But if I need to give you an actual guarantee, like, someone will get run over if the brakes do not come on when I put my foot on the brake, then maybe I can't, well, I certainly can't think about the world the same way, and I certainly can't use certain, like, CPUs and designs and operating systems like that.
So that's, but.
Yeah.
So if I come to you and I'm asking you this question of, how do I make this program fast?
and you go through all of these follow-up questions,
and what it turns out that I'm asking for here is,
how do I make sure that the round-trip latency
of this program that I have running on a just regular commodity PC
is not going to ever be higher than 10 microseconds?
The answer is you can't.
I think that's a fair, yeah, the nearest people I can think of
that have to deal with that specific problem,
are people who make digital audio workstations for consumer PCs, because they have a genuine latency bound: if you're playing a keyboard and there's a one-second delay between pressing the key and hearing the note change, that is not good for anybody.
And so the only way that you can get that is to make everything as low latency as possible.
But if you go too far and you don't have enough buffering between.
you and the set of audio that's coming out,
then if you for whatever reason can't keep up,
you're going to get a massive pop or a crack or a click
as suddenly you have to catch up
and just drop a whole bunch of samples that you were generating.
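The trade-off being described, where a bigger buffer survives scheduling hiccups but adds audible delay, is easy to put rough numbers on. A hedged sketch, assuming a 48 kHz sample rate; the buffer sizes are just examples:

```python
# Toy calculation of the audio buffering trade-off: each frame of
# buffer adds 1/sample_rate seconds between generating a sample and
# hearing it, but a bigger buffer tolerates longer stalls before the
# dreaded pop or click. 48 kHz and these sizes are assumptions.

SAMPLE_RATE_HZ = 48_000

def buffer_latency_ms(frames):
    return frames / SAMPLE_RATE_HZ * 1000

for frames in (48, 256, 1024):
    print(f"{frames:5d} frames -> {buffer_latency_ms(frames):6.2f} ms added latency")
```

So a 1024-frame buffer at 48 kHz already costs over 20 ms, which is getting into territory a keyboard player can feel.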
So I mean, I don't know how they do it, honestly.
In fact, I do know people who have worked on them
and we should perhaps consider getting them on to talk about this kind of stuff
because it's a fascinating problem.
But yeah, anyway.
But back to your...
Yeah.
So the result of this, you can't.
So now it's like, okay.
Yeah, right.
It's like you're putting constraints on this problem that you can't really satisfy.
For example, you'd probably be better off using some specialized hardware rather than
trying to do this on a commodity machine.
If you truly have a constraint that, you know, of let's just say 10 microseconds or something like that.
To an extent, I mean, obviously microcontrollers have been around nearly as long as computers, because they've been embedded in things; that's the original design.
But that is why the brake disc controller in my car
is not controlled by the one PC
that is like running the AV display in the middle of the panel, right?
It is like a very specialist piece of software
running on a very specialist operating system,
which probably isn't an operating system
by anyone else's definition of operating system.
It's just a library you link against that sets up the interrupt vectors and hands you a few niceties, and good luck.
Yeah.
Yeah.
But that's why.
It's doing nothing but monitoring the brake disc, because then finally you can potentially say, all right, we can upper-bound this stuff.
We do know what the maximum latency could be.
And I can sit and work it out on paper because it's an ARM. This is an in-order ARM and it only has an L1 cache.
And even if it's whatever, you know, there are things that you can say about it and you can
bound it that way.
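That kind of on-paper bound for a simple in-order core might look something like this. Every number here (clock speed, cycle counts, miss penalty) is invented for the sketch; a real analysis would use the core's documented timings.

```python
# Back-of-the-envelope worst-case latency bound for a simple
# in-order core with a single cache level. Every constant here is
# a made-up assumption, not a real chip's datasheet number.

CLOCK_HZ = 100_000_000      # hypothetical 100 MHz microcontroller
MISS_PENALTY_CYCLES = 20    # assumed cost of an L1 miss

def worst_case_us(instructions, max_cycles_per_insn, max_misses):
    cycles = instructions * max_cycles_per_insn + max_misses * MISS_PENALTY_CYCLES
    return cycles / CLOCK_HZ * 1_000_000

# A 5000-instruction control loop, at most 2 cycles per instruction,
# and at most 100 possible cache misses:
print(worst_case_us(5000, 2, 100), "microseconds, worst case")
```

The point is not the exact numbers but that, on such a core, every term in the sum is knowable, which is exactly what you lose on a modern out-of-order machine.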
Okay.
So we've got our inputs.
maybe we've had some discussions about our constraints here and we're like,
all right, well, we can't do this to a maximum level of 10 microseconds.
That's insane.
We could maybe do something where our like P95 is 100.
Yeah.
Less than that.
Yeah.
Something like that.
Yeah.
We're making this up as we go along.
Yeah.
And then our maximum is something much more reasonable like a second.
Right.
It's like it's never going to take more than a second.
If it does, then we have a bug and we should fix it, right?
As opposed to, yeah, that's how the system works, right?
So now we've gotten our message from the outside world.
We've gone and looked up the value that we've cached.
Hopefully that's in the L1 cache, and hopefully we've done some smart things to make sure that it's there.
But, you know, again, this is why we have these ranges of latency.
how do we write out the response?
Do we have the same sort of operating system packet issue
when we're trying to write this thing back out?
I mean, yes.
If we're talking about standard, like, I mean,
we keep changing what it is we're doing.
Is it a dog that's falling over?
Is it a UDP packet going out again?
But, you know, like, if we're going to send a packet back out again,
then, yes, there's a similar dance on the way out
through a traditional operating system
where you write to a bit of memory that you can read and write in user space, and then you say, hey, send it, please.
And the kernel goes, well, I need to send it.
I can't hand this bit of memory to the network card
because this bit of memory is just wherever you decided to put it.
Maybe on the stack, whatever.
And certainly the DMA engine for your network card
does not have carte blanche to read any piece of memory anywhere in the whole system
for all the obvious reasons of separation of concerns
and to do with some of the ways that they work internally.
So the kernel has to copy it somewhere else, into a buffer where it knows it can hand the address of that to the network card.
And then it returns, potentially returns back to you and says, yes, it's sent, even though it isn't.
It's sat in a buffer.
Time will pass.
Eventually, the kernel will decide to schedule sending it.
I mean, probably will actually send it immediately, all things considered, unless the network card is already busy, doing something else.
And then the network card will be notified through some mechanism, and that's relatively expensive. It requires a ping across the network.
Sorry, I say a network.
It is a network. The PCI bus itself, or however your network card is plugged into your computer, is effectively yet another network.
And there's another hop and there are complicated things going on there.
So the message is addressed to it.
The network card goes, oh, you say you have a packet for me.
How quaint.
How lovely.
Okay, I'll go and get the address from the ring buffer.
Here it is.
Okay, and I'll start streaming it out.
And so that process can be relatively latent too.
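The transmit path just walked through (copy into a kernel buffer, report it as sent, and only later does the card actually pull it off a ring) can be caricatured in a few lines. This is purely a toy model; no real driver or descriptor ring is anywhere near this simple.

```python
# Toy model of the transmit path described above: the "kernel" copies
# the user's buffer, posts a descriptor to a ring, and reports success
# even though nothing has hit the wire yet. A simulated "card" drains
# the ring later. Entirely illustrative.

from collections import deque

class ToyNic:
    def __init__(self):
        self.ring = deque()   # descriptors waiting for the card
        self.wire = []        # packets actually transmitted

    def kernel_send(self, user_buf):
        kbuf = bytes(user_buf)    # copy out of "user" memory
        self.ring.append(kbuf)    # post descriptor and ring the doorbell
        return True               # reported as sent, but only queued

    def card_poll(self):
        # the card, asynchronously and later, DMAs and transmits
        while self.ring:
            self.wire.append(self.ring.popleft())

nic = ToyNic()
nic.kernel_send(b"robot dog command")
print("on the wire yet?", nic.wire)   # still empty: it's sat in a buffer
nic.card_poll()
print("after the card runs:", nic.wire)
```

The gap between `kernel_send` returning and `card_poll` running is where the extra, hard-to-bound latency lives.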
and so again, you're less likely to suffer from the scheduling randomness that the kernel introduces, because it's not like something asynchronous happened and now the kernel has to pick you amongst all the things it has to do next. It's like you've given it a piece of work and it's very obvious what it should be doing with that work,
but there still is some redundant copying of data
there is still some messaging to an asynchronous
process, which is the network card itself.
The network card can be like, yes, sure, fine.
And then it could be busy doing something for all we know, right?
You know, I don't know what magic network cards are doing.
Maybe it's, you know, a virtual network card that doesn't even exist as a physical thing.
And then really it's pretending to be a network card.
But in fact, it is in a data center on a shared piece of infrastructure.
And what you're really doing is talking to the hypervisor, which then goes, oh, cool, a packet
from one of my many virtual machines.
Yeah, hypervised children, right.
So I'm kind of, I was maybe teeing this up a little bit, but this is like, yeah,
somebody comes along and says, hey, Ben, you made this program to control this robot,
and you did it in a very silly way.
Why are you using a UDP network?
That is not necessary.
Are there other devices that one could use?
I mean, like a USB-C connection or something else that would be available to, you know, just a regular consumer commercial computer.
Yeah, I don't know too much about the ins and outs of how the kernel interacts with the USB.
I know that the underlying USB protocol and the thing you talk with the hub or whoever you're
talking to on the end of it is sort of a negotiated thing.
You can ask for, you negotiate bandwidth.
You negotiate, I don't think you negotiate latency.
I think you can negotiate isochronous transfers. That is, repeating things.
So like if you've got a webcam plugged in, it can say, hey, I need to send this much data just all the time, and you get sort of time on the USB bus to be able to send packets of that data.
But I don't know about latency.
And obviously, there are things like USB keyboards and mice. And gamers obviously use USB mice and whatever, but we're still talking almost human levels of latency there, you know, millisecond, sub-millisecond, I'm sure.
So, yeah, I don't really know.
The best guess I would be able to have is not to use something like USB, because again, there are layers of indirection between you and the actual device, even if you're talking the USB protocol through the kernel somehow.
But if you had, like, GPIO, which is general purpose IO pins, on one of the more microcontrollery type systems, where you can literally just say, if I write to this memory address, well, it's not really a memory address, whatever data I put there is the highs and lows of these eight pins, right? And at that point, you're off to the races, right. Or, you know, say a serial interface, which, although serial is incredibly slow and awful, is also something you're kind of directly attached to, and you can say, like, I'm just driving it myself. Now, I'm sure that's no longer true now I say it out loud. I'm sure that it's virtualized through some mechanism in the kernel, and so it's no longer the case that if you open /dev/tty0, then, if you get the okay, you are the only person talking to that now. And still it would suffer from, like, you know, if you read from it, then you would block and the kernel would put you to sleep
until a byte came in. So even that
doesn't make sense. So GPIO
is the nearest thing we have to
the sit in a tight loop and wait kind of
approach that we were talking about
before. You know, for both input and output, you could say like
hey, you know, I'm doing a
quiz show, whoever presses the button
first kind of thing.
And there are eight buttons and they go into the eight bits
inside my controller and I just sit in a tight loop reading from it.
Yeah. Obviously there are better hardware ways of doing that,
but I'm just sort of thinking out loud.
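The tight-loop quiz-buzzer idea can be sketched as a spin loop over an 8-bit port. `read_port` here is a stand-in for a memory-mapped register read; the simulated samples are made up for the example.

```python
# Sketch of the tight-loop GPIO polling idea: spin reading an 8-bit
# "port" until any of the eight button bits goes high, then report
# which button won. read_port stands in for what would really be a
# memory-mapped register read on a microcontroller.

def first_button_pressed(read_port):
    while True:
        bits = read_port() & 0xFF
        if bits:
            # isolate the lowest set bit; its index is the button number
            return (bits & -bits).bit_length() - 1

# Simulated register reads: nothing, nothing, then button 3 goes high.
samples = iter([0x00, 0x00, 0x08])
print("winner: button", first_button_pressed(lambda: next(samples)))
```

On real hardware, of course, an interrupt or hardware latch would be the more robust way to capture which button changed first.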
Yeah.
Yeah, I don't know.
Did you have an idea?
You've been driving this in a direction.
I'm wondering if you have a solution in mind that I have not got to.
No, no, no.
By the way, have I got the job or?
Well, you're past the next round, but I'm going to have to, yeah.
We're now, we're now 14 minutes in.
So, yeah, this is now, do I not have questions for you?
we've done far too many interviews haven't we yeah yeah yeah yeah I mean I think that I mean we made the full
kind of round trip and I think that's good I think the other and if there is a part two of this I think
the part two should be okay now what happens when it's the actual like calculation that is the part
that needs to quote unquote go fast right yeah yeah yeah like the IO no longer dominates as it did in this
example and it's basically just this exercise and how do you take this signal, make
sure that you don't accidentally make it slow when it's being processed in the CPU.
That's an interesting one.
Yeah.
Now it's, okay, this is going to take a material amount of time and we're trying to make it as fast as we can.
Yeah, which is probably what most folks think of when you say, can I make my code go fast,
is the code part, not all the other stuff around the outside, but often that can dominate.
Yeah, no, that would be an interesting one.
So yeah, okay, well, and then we've got to talk about throughput at some point.
Maybe that will lead into throughput because, you know, I think that is part of it.
But we'll see.
Well, all right.
Yeah.
Rather mysteriously then.
And now you and I have to look each other in the eye and go, we will remember that the next episode we record will be the continuation of this one.
And we won't just leave our poor audience hanging.
Right.
So I apologize in advance if we do.
But I think this has been very interesting.
Yeah.
That's good.
Our listener agrees.
And we will, I will see you.
Next time, my friend.
Next time.
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt.
Find the show transcript and notes at www.org.
Contact us on Mastodon.
We are at twoscomplement at hachyderm.io.
Our theme music is by Inverse Phase.
Find out more at InversePhase.com.
