Two's Complement - How Fast Is Fast?
Episode Date: February 14, 2026
Ben interviews Matt with a deceptively simple question: make my program go fast. 44 minutes later, robot dogs are falling over, Grace Hopper's wire makes an appearance, and Matt still hasn't gotten the job.
Transcript
I'm Matt Godbolt.
And I'm Ben Rady.
And this is Two's Complement, a programming podcast.
Hi, Matt.
How are you doing?
I'm doing pretty well.
We've just had a moment of exasperation, haven't we?
Yeah, we're probably talking over the intro.
Probably not at this point, but maybe earlier.
Maybe earlier.
I don't know.
I, yeah, I'll fix it in post as I always try to do, or increasingly less.
In fact, I try.
I think I'm kidding myself.
that the charm is all about, like, the fact that a dog's barking in the background and there are children screaming across the road, and, um, lip smacks and other things.
Um, it's kind of ASMR, actually, isn't it? No, yeah, ASMR. So recently there's been a lot of talk about the Dutch manufacturing company that makes the lithographic systems for chip manufacture, and they're called ASML.
And I just can't help but get the two things mixed up.
They're very different things.
But that's not what we came here to talk about today, is it?
It's not. It's not.
I have an idea.
I have an idea.
I have a thing.
Tell me about your thing.
Whoa.
Let's say, careful now.
Let's say for the sake of argument,
I have a program.
Stop it, phone.
There we have more charm.
More charm, that is.
Just in the next room is Ben's phone, his unmuted phone.
Uh-huh.
I have a program and I want it to run fast.
How would I do that?
Oh, wow.
That is a great question.
And we only have around about 30 minutes, 40 minutes to talk about it.
So I've got a feeling.
So immediately this is going to be our first 10-part series.
Maybe so.
I don't know.
I don't know.
This is one of those questions because I think obviously,
deliberately you have asked me a leading and vague question.
Because what do you mean by, in fact, each of those words, right?
We really ought to define each of the words that you just used in that sentence.
My program, make my program fast.
Right.
So I think 'make', we can, yeah, take that to mean cause it to become. 'My program':
You have something that you wish to go fast.
Right.
So the only word we really need to define, then, is 'fast'.
And I'm pretty sure you don't mean stopping eating for a period of time.
Right. Well, I think you could also define
my program because in this hypothetical
thing, I haven't written anything yet.
Oh, okay. It's like I want
to write a program.
It doesn't exist, but
I want the execution
of that program to be fast.
Fast. Again.
For some definition of fast that we will arrive at shortly.
Right. And so, yes, not no eating,
not also unmoving,
which is the weird other case of, you know, stuck fast,
which really is the opposite of this.
One of those auto-oxymoronic words, you know, like inflammable
or something like that.
You mean quick, to go quickly.
All right, so let's talk about what that could mean, right?
The instant you say that, and because of where
what we've been doing professionally for a long time
and for what I've been doing most of my career,
I think of low latency as fast.
Like if you decide you need to do a thing,
if your program needs to do a thing and it
discovers that that need, it should act upon it as quickly as possible having discovered that
need.
That is what I mean by sort of low latency.
It reacts to a real-time thing as quickly as possible.
And I think a lot of embedded systems work in this world.
A lot of UI, in fact, is this, right?
If you said, hey, I want my UI to be responsive, that's about latency, right?
But that's one form of fast.
But there's a very real, very important other kind of fast, which is:
I am writing a distributed message-passing system, a massive one, right?
I'm dealing with an enormous volume of information, and if I was slow at doing that, eventually I would back up and I would never keep up with the incoming flow.
Well, it doesn't matter if individual messages take a few seconds to go through; maybe that's not fast, or maybe that is fast at that scale of network, right?
but you're talking about a
throughput problem, you know, or bank clearing is another one.
They've kind of got a bit of latency.
You know, you beep your card when you're at the coffee shop, and it goes beep.
And so you're waiting six or seven seconds, and that's a completely fine human amount
of time.
You know, it's an eternity for computers.
But on the other hand, there are several hundred million people in that same five-second
period tapping their card on their reader at the same time.
So there's a lot of things to go.
So those things.
So we've got latency and throughput.
And are there any other aspects?
I think for the purposes of this conversation, latency and throughput are the two
dimensions of fast, which we should consider. Okay. I think you could maybe come up with some other ones,
but again, we have 30 minutes. Well, I mean, again, we can make this into more, but so, so.
That's true. Do you want to, which one do you want to start with? I think we should start with
latency. Okay. Okay. So, so then we get into this sort of amazing set of questions that we get
into when we're dealing with computers, because how fast is fast, right? How quick is quick?
Like we just said, if I'm a human waiting at a sales kiosk, if it takes me 30 seconds to pay for
something, that is not fast. If it takes me three seconds, it's probably okay.
If it takes 0.3 seconds, I won't notice that anything happened, really.
There's a beep and I'm out of there.
And any other order of magnitude less than that, like 0.03 seconds or whatever, is immaterial to me at that point.
And it doesn't matter.
But you and I work in a world where, for at least some of the things that we have done or been adjacent to,
we are talking about maybe seven or eight orders of magnitude faster than those three seconds.
We're talking hundreds of nanoseconds, hundreds of microseconds.
And I have to remind myself that a nanosecond is a billionth of a second, right?
Sometimes we throw these words around.
I mean, maybe this is just, again, you and I in the weird world that we live in.
But like nanoseconds are kind of a completely normal measurement for a, you know,
3 gigahertz computer, you know, a few clock cycles is a few nanoseconds.
That's completely normal.
Yeah, that's reasonable.
And then, you know, there's good old Grace Hopper's famous bit of wire:
a nanosecond is a foot of copper wire. That's a nanosecond.
You're like, oh, shucks, man, you know, that's the speed of light.
The fastest thing that there is can only go a couple of feet in the time it takes me to add
two 64-bit numbers together.
That is insane, frankly.
It is insane.
You know, whenever I think about that, I just think about Grace's wire, just twisted around
in the inside of the CPU and all the places where it would need to go in order for the light
to move in that way to add those two numbers.
And it just makes my brain melt.
Yeah, it is, it is quite, quite something.
And the fact that, yeah, we've taken these vast machines,
and we've crushed them down into the size of my, you know, thumbnail.
And somehow snaking around in there are many, many, many, many hundreds of feet of cable.
As you say, each individual electron is, well, they don't go all the way down.
It's complicated.
right?
Yeah.
Let's stop pretending that we can talk about that as well.
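Grace Hopper's number checks out with a quick back-of-the-envelope calculation. A Python sketch; the only input is the speed of light in a vacuum:

```python
# Grace Hopper's demo: how far does light travel in one nanosecond?
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # metres per second, in vacuum

def light_distance_per_ns() -> float:
    """Distance light covers in one nanosecond, in feet."""
    metres = SPEED_OF_LIGHT_M_PER_S * 1e-9  # ~0.3 m
    return metres / 0.3048                  # metres -> feet

print(f"Light travels {light_distance_per_ns():.2f} feet per nanosecond")
# -> Light travels 0.98 feet per nanosecond
```

Roughly a foot, just as the wire says; signals in actual copper travel a bit slower than this, so her foot of wire is, if anything, generous.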
So, yeah, we've got many orders of magnitude here.
So, you know, we can break it down into things that need to be measured in nanoseconds,
in microseconds, in milliseconds, and in seconds.
And, you know, that's, I think probably after that, I think, you know, the latency of something,
maybe, I mean, maybe you still can measure it in minutes, right?
I mean, let's think about, like, you know, publishing a video on YouTube.
I've just updated my video on YouTube and I click the button to publish it,
and it takes how many minutes to transcode it
to all the different formats that YouTube does
before it becomes visible,
maybe that is measured in minutes,
and maybe I do care about that latency.
But I think the thing that I can confidently talk about,
and maybe you as well,
is the sub-second orders of magnitude of latency.
Yeah.
So let's talk about what can you do in that amount of time.
Right, right.
That's a good starting point of like,
Not only, you know, if you say I would like my program to be fast,
without even knowing what your program is,
I've got some bounds of what you could reasonably expect to do
in hundreds of nanoseconds, hundreds of microseconds, or hundreds of milliseconds, right?
Right, right, right.
On commodity hardware, if we're just talking about things like a regular PC
or a decent server in a data center or a laptop or something like that,
they're all within spitting distance of each other.
So if we think of latency as sort of like the time between when
an input is provided and an output is produced.
Yeah.
Then immediately, down in the, like, tens of nanoseconds range, you're sort of like,
what input could you possibly consume and what
output could you possibly produce in that span of time, right?
Because it's got to go into the computer and then come out.
So like maybe a theoretical example here would be like a robotics application where you have
some like active control of a system or or, you know, a set of servos or something like that
where you're trying to like, you know, imagine like the, you know, like the robotic dogs type
thing where you're just like, you know, taking like some information from like the sensor on a,
on a on a robot and then feeding it back and using that to adjust.
Like that's already going to be up probably in the like low milliseconds range probably.
I mean, I don't know.
I'm not an expert in robots.
But no, but I mean, you think about things that do, like, auto-balancing, or, you know, drones.
Like that can't be milliseconds because it's already upside down by the time a millisecond of wind has caught it.
So it must be a little faster than that.
But, you know, maybe we're talking microseconds at that level.
It seems not totally unreasonable to have some sensor update and put something on a bus internal to a system on a chip kind of thing that could then go, oh, wait a second.
I need to speed up this rotor a little bit more to account for that.
Right.
Yeah.
Because a lot of those outputs can be sort of like targets, right?
So they don't, it's not necessarily that it needs to go like all the way out to like a servo or something.
It's just kind of like, okay, I've gotten this input and now I need to adjust, you know, my targeted angle here.
Right.
And there's maybe even some other external system that is like actually making the adjustments to the physical objects.
But like you might want that target to update on the order of, you know, tens of mics or hundreds of mics.
just because when you do,
you get sort of like a smoother,
better prediction of like where the arm needs to be
or something like that.
Right, right, right, right.
And yeah, so, I mean,
let's sort of break it down to consumer hardware,
it is difficult to get a message in and out of a computer
in, you know, I mean, like you just do a ping, right?
You ping a local network, a computer on your network.
And it's less than a millisecond,
but it's, you know, still hundreds and hundreds of microseconds,
probably to go through commodity hardware.
And for the kind of weird stuff that you and I have been involved with,
there is some very specialist hardware that happens,
which we can't really talk about,
on the far end into the nanosecond range of things.
So I suppose if we're talking about things that most folks could reasonably get,
put their fingers on a keyboard and program,
be it a little embedded system like an Arduino or an ARM-type thing,
or indeed a program like I'm typing on a computer and I'm expecting having pressed a key press
to see the key appear in the terminal, that kind of thing,
We've kind of picked our range now.
We're talking probably hundreds of mics to low millies.
Yeah, okay.
So I've got a program.
Yeah, so we're down.
Yeah, that's right.
I forget where we were now.
It's going to read some input delivered from outside the computer,
and this is a commodity computer, you know.
Yep.
Yeah, yeah.
And then it's going to produce some output that is also going to leave the computer, right?
And we're thinking that this is going to have to occur.
It's going to take at least hundreds of mics for that to occur.
I think so.
It's like the upper bound of how fast we can make it, for my definition of fast.
I think so.
I mean, you know, there are exotic techniques and things that you can even use on commodity hardware
that are a little bit more exotic, like DPDK, which is a sort of kernel-bypass networking thing you can fiddle with.
And if you put a second network card in your computer, you can go faster.
But yeah, I think let's say, yeah, already,
whatever you're doing, if you're responding to an external network event,
you're doing any kind of reaction to it and sending it back out again.
We're in the microsecond region, but probably hundreds of microsecond region,
would be my instinct.
And maybe I'm wrong.
Maybe I've forgotten how computers are these days.
No, we already said normal computers, right?
Normal computers.
Yeah.
I mean, actually, I was telling you before we started recording,
I'm surrounded with networking gear and computers and things and AI agents running things
as I'm setting up.
I'm actually upgrading my home network to like two and a half gig because that's actually
something that is not totally unaffordable for the home these days now.
So maybe my ideas about what commodity hardware is are out of date.
Yeah.
So my first question would be when I'm writing my program, how do I make sure?
Because my assumption here is that the input here is unpredictable.
It's not like you're going to start the program and immediately be provided the input.
The program is running.
There's something outside of the computer that creates this input, and then the program has to react and create an output.
So my first question to you is, how do I make sure that there's the minimum possible latency
between the generation of that signal and the code in my program running?
Something something, Epo.
Right?
No, exactly right.
No, well, yes, yes and no.
I mean, that goes into the DPDK sort of thing I alluded to.
So, yeah, you know, a traditional server program that is maybe blocked on a socket.
So you've got a socket.
You've got, you know, let's say UDP just to make things maybe easier.
I don't know, right?
You've got a blocking call to receive, which means that your process has gone to sleep,
told the operating system, hey, anything interesting coming in on,
this port, this UDP port, I'd like to know, please.
And then I've gone to sleep.
And I've been descheduled.
And now I've just been chucked in the list of all the things that are currently,
all the processes that are currently asleep and have nothing to do.
So that's where we are now.
Pack it comes in, goes through the standard kernel process of saying,
well, who's this for?
Oh, it's port 9, 2007.
That is registered to this process.
So we're going to write it into a buffer now rather than discard it,
or rather we'll keep it probably around in a buffer that exists.
First of all, is there space in that buffer?
Yes, there is.
Okay, this is something you configure for each socket,
how much space that you can have,
which gets complicated if you have multiple processes
listening on the same port.
Yeah.
Anyway, and then we go, well, who cares about this?
Right?
We've put it into the buffer now.
We've copied the data.
Who cares?
Oh, there's a process.
Okay, well, this process is waiting on this.
So we should mark it as ready.
This process is now ready.
to be scheduled.
And then you're done.
And then as part of the whole
you're-done process, the kernel goes,
well, I guess I'm now finished,
hand back to the user process.
Now, who should be run?
And maybe there was something else that was going on
on that CPU when the packet came in.
And that other process is actually still higher
priority in whatever queuing fairness
there is, in which case,
it will merrily do whatever it was doing before
until it runs out of the time slice
that it had allocated it.
And then it goes, oh, someone else has go now.
Oh, look, this thing's ready.
So what I'm sort of getting at here is there is a lot of things that happen between that packet arriving even in your kernel and your process being woken up to do anything about it.
And that's a hugely variable amount of work too.
And you can definitely reduce the amount of work,
the amount of other things that the kernel is trying to do on a CPU, so that
your process is more likely to be woken up.
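The blocking receive path just described can be sketched in a few lines of Python; here both ends run in one process over loopback, and the port is whatever the OS hands out:

```python
import socket

# The "receiver": a traditional blocking UDP socket. recvfrom() parks the
# process in the kernel until a datagram arrives for this port.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = rx.getsockname()[1]

# The "sender", standing in for the outside world.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"hello", ("127.0.0.1", port))

# The process is descheduled here until the kernel copies the packet into
# the socket buffer and marks us runnable -- the whole path just described.
data, sender = rx.recvfrom(2048)
print(data)  # b'hello'

rx.close()
tx.close()
```

Everything between that `sendto` and the return of `recvfrom` is the kernel work and scheduling described above, and it's the variable part of the latency.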
There are schedulers that let you set these things that are priorities and stuff.
But the way that I intuitively do these type of things is using some kind of,
yeah, like we said, DPDK, which is this sort of networking layer that lets you use shared memory with the network card.
And then instead of the kernel being involved at all, you just continuously read a piece of memory that the network card will write directly to,
and it says, hey, I've got a packet for you, and then you just pick it up.
And then essentially your latency goes down to how fast a PCI bus transaction can notify that a piece of memory has changed and you pick it up and off you go.
Right.
But that is still fairly esoteric.
So it seems like that's not a fair thing.
But I guess it's the difference between, right, if the program part that we've not even started talking about now is itself going to take
tens of milliseconds because of the innate amount of work that it has to do. And so the upper bound
is dominated by its processing time. Then maybe the effort and all the work to put in shaving
tens of microseconds off on the interrupt handling time isn't worth it. Right. It's a lot of
engineering effort. It's a lot of complexity. It's a lot of non-standard nonsense. But if we wanted
to make something go as fast as possible, those are the kind of things that I would take to doing is
to take the kernel out of the equation and burn a CPU core,
just spinning, doing nothing other than saying,
is there anything to do?
How about now?
How about now?
How about now?
In a tight loop.
And potentially reschedule everything else so they can't run on the same CPU as me.
There are ways that you can do this with CPU sets, with isolation trickery,
isolcpus on the kernel command line.
There's a lot of weird things you can do.
Anyway, right.
And then you can get it down to,
in the limit single digit microseconds between a packet arriving and you going,
okay, I have work to do.
Mm-hmm.
Okay.
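A real kernel-bypass setup like DPDK needs special drivers, but the burn-a-core, how-about-now polling loop can be mimicked with an ordinary non-blocking socket. A Python sketch that illustrates the loop shape only, not DPDK's actual API or its latency:

```python
import socket

# Non-blocking receiver: the kernel never puts us to sleep.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))          # let the OS pick a free port
rx.setblocking(False)
port = rx.getsockname()[1]

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"tick", ("127.0.0.1", port))

# Busy-poll: "is there anything to do? how about now? how about now?"
# A real low-latency system pins this loop to an isolated core.
polls = 0
while True:
    polls += 1
    try:
        data, _ = rx.recvfrom(2048)
        break                      # okay, I have work to do
    except BlockingIOError:
        continue                   # nothing yet; spin again immediately

print(data, polls)

rx.close()
tx.close()
```

The trade is explicit: one core runs flat out doing mostly nothing, in exchange for reacting to a packet without waiting to be scheduled.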
So if we assume that the work that needs to be done is relatively simple, right?
You're going to do some ads, you're going to do some multiplies,
you're going to try to structure your code so that you're really only doing ads and multiplies,
and then you're going to produce some output that is purely a function of that input and whatever
state is held in memory.
What we've got is an ALU as a service.
You know, we are, have you seen this?
There's a guy who's written a
microservices-based CPU emulator where every single instruction is
handled by a different microservice,
each of which is written in a different language.
And the thing runs.
Oh, my God, that's amazing.
Isn't it cool?
I love that people think of these things.
You think you've been in microservice hell?
You should see this.
Yeah, you've got like literally 256 different things and a RAM, you know, memory service
so that everyone can agree on this.
It's cool.
But so you're talking about something like that.
Let's just say, you know, it is.
In fact, honestly, unless you want to steer it somewhere specific, I'm going to suggest some things,
and you can tell me if where you're trying to steer it is right or not.
But, like, something like a Redis memory cache is, like, one of these things.
It's like something comes in.
I look it up in RAM and I'm like, yes, I've got it.
Here it is.
It's like a small piece of data, you know, a key-value
store that is just, you know, like a cache, right? You're not doing very much work.
It's not as banal as like I'm adding the three numbers you've given to me and giving them back
to you. It is, you can imagine there is actual value in that, but it's still. Yeah. I'm trying to draw a line
about the other parts of the computer that you're going to be talking to. And what I'm trying to say by
this is that you can't ignore memory, right? It's not simply a function of the data that you've read in,
But you don't have to go beyond that, right?
Okay.
So let's use, like, an in-memory cache, like a Redis,
without the disc backing type stuff.
Right, right.
Yeah.
Yeah, yeah, that's perfect.
Okay.
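Stripped to its essentials, that kind of in-memory key-value cache is just this. A toy sketch; the class and method names here are ours for illustration, not Redis's actual protocol:

```python
class InMemoryCache:
    """A toy Redis-style key-value store: everything lives in RAM,
    so a lookup is a hash-map probe plus a memory read -- no disk."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Purely a function of the request and in-memory state.
        return self._data.get(key)

cache = InMemoryCache()
cache.set("user:42", b"matt")
print(cache.get("user:42"))  # b'matt'
print(cache.get("user:99"))  # None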
So yeah, you're looking, I mean, at that point,
without going into the extremely fun and interesting rabbit
hole of how you actually do the lookup and what data structures you use,
that is probably the key thing:
you must consider the data structures you use.
But the interesting thing is that with a latency lens, you're not looking at data structures
in quite the same way as you do when you are doing, like, CS-whatever, I don't know what course numbers
they use. But when you're doing your data structures courses and you're looking at big O notation.
Right, right. You know, big O of N, whatever, n log n for your, you know, so like a not
unreasonable data structure
for a
key value store
is something like a balanced red black tree
right? It's O(log n)
to look things up in it,
whatever. Another one is a B tree
or a B plus tree or one of those kind of trees.
You know, hash maps are other ones as well. And you know,
obviously hash maps do have, in theory, O(1).
So you'd probably want to pick that, going purely by big O notation or whatever.
But the other ones seem reasonable as well.
Well, like, but the dominating factor with something as straightforward as the thing you've just described, where we're literally just looking something up in the key-value store, is going to be reading from RAM.
Right.
Now, if you're talking about something that's so small it fits into L1, L2 or L3, which are of the order of tens of K in the L1 case up to several megabytes at L3.
And, you know, bigger than that on bigger servers and whatnot.
But we're still talking like not all that much information.
Can you describe for our listeners, he said, covering for his own ignorance, what L1, L2 and L3 actually are?
Oh, I'm so sorry.
So there are layers of caches.
Caches are expensive, both in terms of the amount of space they take up on a chip and how fast they can be. You know, going back to our Grace Hopper wire, the bigger something is, the longer it takes to get to the
other side of it.
So the way that CPUs
prevent us from having to read
from the very slow RAM,
which we'll talk about in a second,
is that we have on-chip caches,
and they are layered
so that you have a very small,
incredibly fast,
level one cache,
which is usually not very large,
tens of kilobytes kind of area.
Then you have a layer two cache,
which is maybe a meg,
maybe hundreds of K,
that kind of area,
and then a layer 3 cache,
which is maybe tens of megs,
maybe even a little more than that,
and each of them is slightly further away,
like literally physically.
They're still on the actual thumb-sized chip
inside the machine.
They're further away,
but it takes longer to look through them
to find the data.
And so if you can fit something in L1,
it's kind of almost free.
I'm going to put air quotes over this.
We're talking somewhere in the region of,
you know, five or six cycles to read from L1,
and that includes all of the other things that are, like, the fixed cost of reading memory: doing the lookup for the logical address that you've decided to read from and then turning it into the actual physical address so that it goes to the correct chip.
You know, different processes think that address 2000 has different data in it, right?
Yeah, yeah, yeah.
And something has to do that translation.
And that's not nothing at this scale; when we're talking about, you know, a 3 gigahertz machine, a third of a nanosecond is a clock cycle.
That lookup itself takes time, right? And so, you know, we need to be doing something.
So that's about the limit of it, and there are some other really complicated things that the poor
memory system has to do because of the weird out-of-order engine, but we'll ignore that for now. So five or six
cycles for, like, an L1. And now, I'm going to forget this off the top of my head, I really should have
brought this up beforehand, you know, but, like, low tens of cycles for L2, maybe
late tens, you know, 80, 90 cycles maybe for L3? No, I don't think it's that much now; someone will correct me. So, you know,
order-of-magnitude-wise,
it's still
probably only
tens of nanoseconds,
but then you go out to RAM and it can be
hundreds of nanoseconds if you actually need to go out to RAM.
So those are the layers.
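Putting those ballpark figures in one place: at 3 GHz a cycle is a third of a nanosecond, so the levels translate roughly like this. The cycle counts below are the rough, from-memory numbers above, not any particular chip's spec:

```python
CLOCK_GHZ = 3.0  # a nominal 3 GHz machine: one cycle is 1/3 ns

def cycles_to_ns(cycles: float) -> float:
    """Convert a cycle count to nanoseconds at the nominal clock."""
    return cycles / CLOCK_GHZ

# Rough figures as discussed in the conversation; real numbers vary by chip.
approx_latency_cycles = {
    "L1":  5,    # "five or six cycles"
    "L2":  15,   # "low tens"
    "L3":  60,   # somewhere in the tens, well under the "80, 90" guess
    "RAM": 300,  # out to memory: "hundreds of nanoseconds" territory
}

for level, cycles in approx_latency_cycles.items():
    print(f"{level}: ~{cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```

The point is the shape, not the exact values: each level is roughly an order of magnitude further away than the last.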
The other thing that's interesting is that
usually on chips, the L1 and maybe
the L2, are unique to the CPU
core that you're running on.
So you just get a copy of your own
and then that's yours and no one can do anything with it
with a massive asterisk and footnote
for older CPUs, but maybe
won't go there. And then the L3 typically is shared amongst the other cores that are physically
on the same chip as you. And so there's some contention, and there are obviously conversations
that have to happen between the cores to say, hey, I'm reading that. Oh, I was just writing to it.
Oh, okay, well, we need to make sure that we do this in the right order. So there's a little bit more
complexity there. But yeah, so if we're still talking in single, you know, measurable, human
countable numbers of cycles, keeping it inside the L1 or L2 is great. L3 is okay. And if we have to go to RAM,
we have to be a bit careful about things.
And at that point,
two things are important to know
about the way cache's work.
One is that
the whole
theory behind a cache is that
having read a byte of memory,
it's really, really likely you're going to read the bytes
that are around that byte in memory.
So the fundamental
unit of work from
the memory system on the
CPU to the caches and beyond is
a sequence of bytes,
which is usually 64 or maybe 128 bytes long.
And that's kind of like the atom that gets shipped back and forth.
And some chips even do things where they say,
well, you asked for this block of 64 bytes,
I'll just get the next one and maybe the one after that while we're going,
because we might as well having just done that.
And so this sort of spatial usage pattern is incredibly important for latency
because if you're following a linked list, as you may be,
if you just have a naive hash map,
suddenly you're jumping randomly around in memory
and the poor system can't make a guess.
But if you just have a damn array of everything
and just go, well, it's expensive.
It's an order N operation for me to search through these things,
but I'm going to run through them linearly.
Suddenly, it may be more worthwhile from a cache standpoint,
especially if a cache miss is hundreds of cycles.
Well, I can do a lot of work in a hundred cycles
on an out-of-order execution machine that has 12 execution units, right?
So why don't I just do tons of work and hope that that's quicker than going to RAM?
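The array-versus-pointer-chasing trade can be sketched like so; note that Python's interpreter overhead hides most of the hardware effect, so this only shows the two access patterns, not the real timing difference you'd see in C:

```python
# Two layouts for the same data. In a language like C, the array scan is
# cache-friendly (sequential 64-byte lines the prefetcher can guess),
# while the linked list chases pointers to effectively random addresses.

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

values = list(range(1000))  # contiguous: walk it front to back

# Build the equivalent linked list of the same values.
head = None
for v in reversed(values):
    head = Node(v, head)

def sum_array(xs):
    total = 0
    for x in xs:                 # sequential access: prefetcher's best friend
        total += x
    return total

def sum_list(node):
    total = 0
    while node is not None:      # pointer chase: each hop may miss cache
        total += node.value
        node = node.next
    return total

print(sum_array(values), sum_list(head))  # 499500 499500: same answer
```

Same answer, same O(n), completely different memory behaviour, which is exactly the dimension big O notation doesn't see.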
So anyway, popping the stack all the way back to keep the latency as low as possible,
you're going to have to think about this.
And so you might want to make some concessions where the distribution of your latencies needs to be taken into account.
Now, if you're going to say, well, most of the time, like 80% of the
stuff we're reading is hot.
Like it's going to be commonly used again.
In which case, I'm going to use a data structure, which is fine.
And that 80% fits inside, it doesn't matter what I'm doing,
whether it's a balanced tree or whatever.
It fits inside L3, let's say.
But the 20% that's outside of it is just outside.
It's going to be around.
We're just going to have to take the hit every time,
and accept the fact that it's incredibly expensive because we're using a balanced tree.
So I guess now we turn around and say, like, it's not just how fast does it go,
it's, like, what is the shape of your distribution?
Is it better to have the average case fast and the bad case awful?
Right.
Or is it better to bound the bad case and say, well, actually, I'm going to use a linear search every single time.
And I'm going to say that even in the fast case, where, you know,
it would have been faster just to check, I don't know, the first one, or to pack them in a different way,
you know, I'm still paying the cost to do 16 compares
Yeah, yeah, yeah.
every time and just go, well, that costs me a little bit more than one.
Of course, it does.
But it makes the worst case go away.
Right.
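That pay-for-16-compares-every-time idea might look like this: a hypothetical sketch with a fixed-size table and deliberately no early exit, so the best and worst cases cost the same:

```python
# A fixed-size slot table: lookups always scan all 16 slots, so best,
# average, and worst case do the same work. We give up the early-exit
# win in exchange for a flat, bounded latency distribution.
NUM_SLOTS = 16

slots = [None] * NUM_SLOTS  # each entry is (key, value) or None

def put(key, value):
    for i in range(NUM_SLOTS):
        if slots[i] is None or slots[i][0] == key:
            slots[i] = (key, value)
            return
    raise RuntimeError("table full")  # bounded capacity is part of the deal

def get(key):
    found = None
    # Deliberately no early `return`: always do all 16 compares.
    for entry in slots:
        if entry is not None and entry[0] == key:
            found = entry[1]
    return found

put("a", 1)
put("b", 2)
print(get("a"), get("missing"))  # 1 None
```

In real low-latency code this shape also plays nicely with SIMD compares, but even the scalar version makes the worst case identical to the best.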
So, yeah, sorry, I've just jumped around a bit, as I've sort of realized
the enormity of the question you've asked.
That was the intention of the question.
But yes, whenever you're talking about latency, you need to not, it's not like,
oh, it's 10 microseconds of latency.
No, it's 10 microseconds of latency in what case, what percentage of the time?
You know, oftentimes you talk about, like, P95, P99, P99.999 numbers:
what percentage of the responses are going to be less than some particular target time.
And you may have multiple layers of that, right?
Like, you may have a maximum threshold that you need, and, like, a P99.99, and then, like, a P95, and then, like, a median, right?
And you might want to structure things, you know, differently based on whatever those targets might be.
So in our case, sort of this hypothetical, you know, robot dog type plugged into a PC, perhaps,
that's receiving the signal. Maybe what we'd want to say here, as one fork of the tree to go down,
is actually the thing that I care about is the worst case latency, right?
I want to, because if the latency gets too high, the dog is going to fall over, and that will be
bad, right? It's a very common thing for like, you know, control boards and things. And, you know,
this is why in those situations, typically, embedded systems that have those kinds of very strong
requirements will not use regular operating systems. They use, you know, real-time operating
systems that can give you hard deadlines for things happening within a certain amount of time
and or do no preemption at all. So it's not like a case that your process will have its CPU time
taken away from it. It's like, you know, it's your responsibility to
keep running until you get to the end and then yield to the next guy.
but, you know, this is a huge problem. As computers have gotten more complicated, it becomes
harder and harder to reason about what an upper bound could be, you know, picking up my 6502
in its box over here. Right. It's a little easier. It's a little easier. You know, it's like,
I can tell you exactly how long any single thing is going to take all the time. It's a
completely deterministic and it's well documented.
Right.
But we have to reverse engineer what the hell an Intel chip is doing in order to understand
sort of maybe what the worst case is in that particular situation.
Right.
You know, hey, what happens if, you know, we are trying, we have an L3 cache miss at the same time
as another CPU core, like a physically distant core, has an L3 cache miss,
and both go for it at the same time.
How does that work?
Is there some really weird edge case where, you know, in the fabric it times out after some amount of time?
I don't know, but it's hard to find out.
But if I need to give you an actual guarantee, like, someone will get run over if the brakes do not come on when I put my foot on the brake, then maybe I can't, well, I certainly can't think about the world the same way, and I certainly can't use certain, like, CPUs and designs and operating systems like that.
So that's, but.
Yeah.
So if I come to you and I'm asking you this question of, how do I make this program fast?
and you go through all of these follow-up questions,
and what it turns out that I'm asking for here is,
how do I make sure that the round-trip latency
of this program that I have running on a just regular commodity PC
is not going to ever be higher than 10 microseconds?
The answer is you can't.
I think that's a fair, yeah, the nearest people I can think of
that have to deal with that specific problem,
are people who make digital audio workstations for consumer PCs, because they have a genuine latency bound: if you're playing a keyboard and there's a one-second delay between pressing the key and hearing the note change, that is not good for anybody.
And so the only way that you can get that is to make everything as low latency as possible.
But if you go too far and you don't have enough buffering between.
you and the set of audio that's coming out,
then if you for whatever reason can't keep up,
you're going to get a massive pop or a crack or a click
as suddenly you have to catch up
and just drop a whole bunch of samples that you were generating.
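The trade-off being described, where a bigger buffer survives scheduling hiccups but adds audible delay, is easy to put rough numbers on. A hedged sketch, assuming a 48 kHz sample rate; the buffer sizes are just examples:

```python
# Toy calculation of the audio buffering trade-off: each frame of
# buffer adds 1/sample_rate seconds between generating a sample and
# hearing it, but a bigger buffer tolerates longer stalls before the
# dreaded pop or click. 48 kHz and these sizes are assumptions.

SAMPLE_RATE_HZ = 48_000

def buffer_latency_ms(frames):
    return frames / SAMPLE_RATE_HZ * 1000

for frames in (48, 256, 1024):
    print(f"{frames:5d} frames -> {buffer_latency_ms(frames):6.2f} ms added latency")
```

So a 1024-frame buffer at 48 kHz already costs over 20 ms, which is getting into territory a keyboard player can feel.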
So I mean, I don't know how they do it, honestly.
In fact, I do know people who have worked on them
and we should perhaps consider getting them on to talk about this kind of stuff
because it's a fascinating problem.
But yeah, anyway.
But back to your...
Yeah.
So the result of this, you can't.
So now it's like, okay.
Yeah, right.
It's like you're putting constraints on this problem that you can't really satisfy.
For example, you'd probably be better off using some specialized hardware rather than
trying to do this on a commodity machine.
If you truly have a constraint that, you know, of let's just say 10 microseconds or something like that.
To an extent, I mean, obviously microcontrollers have been around nearly as long as computers, because they've been embedded in things; that's the original design.
But that is why the brake disc controller in my car
is not controlled by the one PC
that is like running the AV display in the middle of the panel, right?
It is like a very specialist piece of software
running on a very specialist operating system,
which probably isn't an operating system
by anyone else's definition of operating system.
It's just a library you link against that sets up the interrupt vectors and hands you a few niceties, and good luck.
Yeah.
Yeah.
But that's why.
It's doing nothing but monitoring the brake disc, because then finally you can potentially say, all right, we can upper-bound this stuff.
We do know what the maximum latency could be.
And I can sit and work it out on paper because it's an ARM. This is an in-order ARM and it only has an L1 cache.
And even if it's whatever, you know, there are things that you can say about it and you can
bound it that way.
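That kind of on-paper bound for a simple in-order core might look something like this. Every number here (clock speed, cycle counts, miss penalty) is invented for the sketch; a real analysis would use the core's documented timings.

```python
# Back-of-the-envelope worst-case latency bound for a simple
# in-order core with a single cache level. Every constant here is
# a made-up assumption, not a real chip's datasheet number.

CLOCK_HZ = 100_000_000      # hypothetical 100 MHz microcontroller
MISS_PENALTY_CYCLES = 20    # assumed cost of an L1 miss

def worst_case_us(instructions, max_cycles_per_insn, max_misses):
    cycles = instructions * max_cycles_per_insn + max_misses * MISS_PENALTY_CYCLES
    return cycles / CLOCK_HZ * 1_000_000

# A 5000-instruction control loop, at most 2 cycles per instruction,
# and at most 100 possible cache misses:
print(worst_case_us(5000, 2, 100), "microseconds, worst case")
```

The point is not the exact numbers but that, on such a core, every term in the sum is knowable, which is exactly what you lose on a modern out-of-order machine.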
Okay.
So we've got our inputs.
maybe we've had some discussions about our constraints here and we're like,
all right, well, we can't do this to a maximum level of 10 microseconds.
That's insane.
We could maybe do something where our like P95 is 100.
Yeah.
Less than that.
Yeah.
Something like that.
Yeah.
We're making this up as we go along.
Yeah.
And then our maximum is something much more reasonable like a second.
Right.
It's like it's never going to take more than a second.
If it does, then we have a bug and we should fix it, right?
As opposed to, yeah, that's how the system works, right?
So now we've gotten our message from the outside world.
We've gone and looked up the value that we've cached.
Hopefully that's in the L1 cache, and hopefully we've done some smart things to make sure that it's there.
But, you know, again, this is why we have these ranges of latency.
how do we write out the response?
Do we have the same sort of operating system packet issue
when we're trying to write this thing back out?
I mean, yes.
If we're talking about standard, like, I mean,
we keep changing what it is we're doing.
Is it a dog that's falling over?
Is it a UDP packet going out again?
But, you know, like, if we're going to send a packet back out again,
then, yes, there's a similar dance on the way out
through a traditional operating system
where you write to a bit of memory that you can read and write in user space, and then you say, hey, send it, please.
And the kernel goes, well, I need to send it.
I can't hand this bit of memory to the network card
because this bit of memory is just wherever you decided to put it.
Maybe on the stack, whatever.
And certainly the DMA engine for your network card
does not have carte blanche to read any piece of memory anywhere in the whole system
for all the obvious reasons of separation of concerns
and to do with some of the ways that they work internally.
So the kernel has to copy it somewhere else, into a buffer where it knows it can hand the address of that to the network card.
And then it returns, potentially returns back to you and says, yes, it's sent, even though it isn't.
It's sat in a buffer.
Time will pass.
Eventually, the kernel will decide to schedule sending it.
I mean, probably will actually send it immediately, all things considered, unless the network card is already busy, doing something else.
And then the network card will be notified through some mechanism, and that's relatively expensive. It requires a ping across the network.
Sorry, I say a network.
It is a network. The PCI bus itself, or however your network card is plugged into your computer, is effectively yet another network.
And there's another hop and there are complicated things going on there.
So the message is addressed to it.
The network card goes, oh, you say you have a packet for me.
How quaint.
How lovely.
Okay, I'll go and get the address from the ring buffer.
Here it is.
Okay, and I'll start streaming it out.
And so that process can be relatively latent too.
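The transmit path just walked through (copy into a kernel buffer, report it as sent, and only later does the card actually pull it off a ring) can be caricatured in a few lines. This is purely a toy model; no real driver or descriptor ring is anywhere near this simple.

```python
# Toy model of the transmit path described above: the "kernel" copies
# the user's buffer, posts a descriptor to a ring, and reports success
# even though nothing has hit the wire yet. A simulated "card" drains
# the ring later. Entirely illustrative.

from collections import deque

class ToyNic:
    def __init__(self):
        self.ring = deque()   # descriptors waiting for the card
        self.wire = []        # packets actually transmitted

    def kernel_send(self, user_buf):
        kbuf = bytes(user_buf)    # copy out of "user" memory
        self.ring.append(kbuf)    # post descriptor and ring the doorbell
        return True               # reported as sent, but only queued

    def card_poll(self):
        # the card, asynchronously and later, DMAs and transmits
        while self.ring:
            self.wire.append(self.ring.popleft())

nic = ToyNic()
nic.kernel_send(b"robot dog command")
print("on the wire yet?", nic.wire)   # still empty: it's sat in a buffer
nic.card_poll()
print("after the card runs:", nic.wire)
```

The gap between `kernel_send` returning and `card_poll` running is where the extra, hard-to-bound latency lives.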
and so again, you're less likely to suffer from the scheduling randomness that the kernel introduces, because it's not like something asynchronous happened and now the kernel has to pick you amongst all the things it has to do next. It's like you've given it a piece of work and it's very obvious what it should be doing with that work,
but there still is some redundant copying of data
there is still some messaging to an asynchronous
process, which is the network card itself.
The network card can be like, yes, sure, fine.
And then it could be busy doing something for all we know, right?
You know, I don't know what magic network cards are doing.
Maybe it's, you know, a virtual network card that doesn't even exist as a physical thing.
And then really it's pretending to be a network card.
But in fact, it is in a data center on a shared piece of infrastructure.
And what you're really doing is talking to the hypervisor, which then goes, oh, cool, a packet
from one of my many virtual machines.
Yeah, hypervised children, right.
So I'm kind of, I was maybe teeing this up a little bit, but this is like, yeah,
somebody comes along and says, hey, Ben, you made this program to control this robot,
and you did it in a very silly way.
Why are you using a UDP network?
That is not necessary.
Are there other devices that one could use?
I mean, like a USB-C connection or something else that would be available to, you know, just a regular consumer commercial computer.
Yeah, I don't know too much about the ins and outs of how the kernel interacts with the USB.
I know that the underlying USB protocol and the thing you talk with the hub or whoever you're
talking to on the end of it is sort of a negotiated thing.
You can ask for, you negotiate bandwidth.
You negotiate, I don't think you negotiate latency.
I think you can negotiate isochronous transfers. That is, repeating things.
So like if you've got a webcam plugged in, it can say, hey, I need to send this much data just all the time, and you get sort of time on the USB bus to be able to send packets of that data.
But I don't know about latency.
And obviously, there are things like USB keyboards and mice. And gamers obviously use USB mice and whatever, but we're still talking almost human levels of latency there, you know, millisecond, sub-millisecond, I'm sure.
So, yeah, I don't really know.
The best guess I would be able to have is not to use something like USB, because again, there are layers of indirection between you and the actual device, even if you're talking the USB protocol through the kernel somehow.
But if you had, like, GPIO, which is general purpose IO pins, on one of the more microcontrollery type systems, where you can literally just say, if I write to this memory address, well, it's not really a memory address, whatever data I put there is the highs and lows of these eight pins, right? And at that point, you're off to the races, right. Or, you know, say a serial interface, which, although serial is incredibly slow and awful, is also something you're kind of directly attached to, and you can say, like, I'm just driving it myself. Now, I'm sure that's no longer true now I say it out loud. I'm sure that it's virtualized through some mechanism in the kernel, and so it's no longer the case that if you open /dev/tty0, then, if you get the okay, you are the only person talking to that now. And still it would suffer from, like, you know, if you read from it, then you would block and the kernel would put you to sleep
until a byte came in. So even that
doesn't make sense. So GPIO
is the nearest thing we have to
the sit in a tight loop and wait kind of
approach that we were talking about
before. You know, for both input and output, you could say like
hey, you know, I'm doing a
quiz show, whoever presses the button
first kind of thing.
And there are eight buttons and they go into the eight bits
inside my controller and I just sit in a tight loop reading from it.
Yeah. Obviously there are better hardware ways of doing that,
but I'm just sort of thinking out loud.
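The tight-loop quiz-buzzer idea can be sketched as a spin loop over an 8-bit port. `read_port` here is a stand-in for a memory-mapped register read; the simulated samples are made up for the example.

```python
# Sketch of the tight-loop GPIO polling idea: spin reading an 8-bit
# "port" until any of the eight button bits goes high, then report
# which button won. read_port stands in for what would really be a
# memory-mapped register read on a microcontroller.

def first_button_pressed(read_port):
    while True:
        bits = read_port() & 0xFF
        if bits:
            # isolate the lowest set bit; its index is the button number
            return (bits & -bits).bit_length() - 1

# Simulated register reads: nothing, nothing, then button 3 goes high.
samples = iter([0x00, 0x00, 0x08])
print("winner: button", first_button_pressed(lambda: next(samples)))
```

On real hardware, of course, an interrupt or hardware latch would be the more robust way to capture which button changed first.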
Yeah.
Yeah, I don't know.
Did you have an idea?
You've been driving this in a direction.
I'm wondering if you have a solution in mind that I have not got to.
No, no, no.
By the way, have I got the job or?
Well, you're past the next round, but I'm going to have to, yeah.
We're now, we're now 14 minutes in.
So, yeah, this is now, do I not have questions for you?
we've done far too many interviews haven't we yeah yeah yeah yeah I mean I think that I mean we made the full
kind of round trip and I think that's good I think the other and if there is a part two of this I think
the part two should be okay now what happens when it's the actual like calculation that is the part
that needs to quote unquote go fast right yeah yeah yeah like the IO no longer dominates as it did in this
example and it's basically just this exercise and how do you take this signal, make
sure that you don't accidentally make it slow when it's being processed in the CPU.
That's an interesting one.
Yeah.
Now it's, okay, this is going to take a material amount of time and we're trying to make it as fast as we can.
Yeah, which is probably what most folks think of when you say, can I make my code go fast,
is the code part, not all the other stuff around the outside, but often that can dominate.
Yeah, no, that would be an interesting one.
So yeah, okay, well, and then we've got to talk about throughput at some point.
Maybe that will lead into throughput because, you know, I think that is part of it.
But we'll see.
Well, all right.
Yeah.
Rather mysteriously then.
And now you and I have to look each other in the eye and go, we will remember that the next episode we record will be the continuation of this one.
And we won't just leave our poor audience hanging.
Right.
So I apologize in advance if we do.
But I think this has been very interesting.
Yeah.
That's good.
Our listener agrees.
And we will, I will see you.
Next time, my friend.
Next time.
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt.
Find the show transcript and notes at www.org.
Contact us on Mastodon.
We are at twoscomplement at hachyderm.io.
Our theme music is by Inverse Phase.
Find out more at InversePhase.com.
