TLB Hit 💥 - Episode 3: __builtin_expect(!!(x), 0)
Episode Date: April 19, 2021

* What is it we know statically?
* What's effectively discoverable only at runtime?
* How do we tell "the machine" (compiler and/or hardware):
  * Things we *know* to be true...
  * Things we *expect* to ...be true...
  * Things we *expect* to be true but *want to do something about* when it's not...
  * Things we have no idea about!
* How do we collect that information that we have no idea about?
* What happens if we're wrong? What if the workload is inherently irregular?
  * In the limit, random noise could drive our control decisions!
* We talked a bit about precise vs imprecise exceptions before and automatic reordering, and we've mentioned vector machines and auto-vectorization.
  * All falls into the broader theme here, but we're always questioning what we can actually cover in an hour...
  * We'll try to give it a go for a subset of these things!
  * Time is often the limiting factor.
* The episode title is the thing that we often macro define as `#define UNLIKELY`
  * In C/C++ code you might say "this control condition is unlikely", and say `if (UNLIKELY(condition))`
  * In C++20 there was added these `[[likely]]` and `[[unlikely]]` annotations that do the same thing, but with more square brackets, so clearly better!
 Transcript
Hey Chris, it's been a while since we've recorded a podcast, so how about we talk about something like `__builtin_expect(!!(x), 0)`? Yep, that sounds good. And standard disclaimer at the start of our podcasts: we only know so much. We'll try not to say things that are wrong, and we'll follow up with errata, corrections, and clarifications in the show notes as we find out about things. And we do the lifelong-learners thing: we're always trying to push the boundaries of things that we know about and have conversations about what's interesting, to develop and hone our understanding.
    
All right. We kind of have a broad theme here, and this is really a first episode in a really broad series of: what is it that we know statically? What's effectively discoverable only at runtime? And how do we tell the compiler and/or the hardware things we know to be true, things we expect to be true, things we expect to be true but want to do something about if they're not, and things we have no idea about? So how do we collect that information that we have really no idea about, and what happens if we're wrong? What if the workload is just inherently irregular? In the limit, imagine random noise driving our control decisions. We talked a bit about precise versus imprecise exceptions before, and automatic reordering, and I think we've mentioned vector machines and auto-vectorization.
    
And so all of that should fall into the broader theme here. But, you know, we're always really questioning what you can actually cover in an hour. Let's try to give it a go, at least for a subset of these things. Yeah, the time is often the limiting factor. So as our lead-in into the world of discussion: the episode title here is the thing that we macro into `UNLIKELY`. So in C or C++ code, you might say, ah, this control condition is unlikely, like `if (UNLIKELY(condition))`. And in C++20 there were added these double-bracket `[[likely]]` and `[[unlikely]]` annotations that do the same thing, but with more square brackets, so it's clearly better. Right, it's really better.

The idea is you want to be able to annotate your code and say something's likely or unlikely, but what does it even make sense to annotate? Syntactically, in C++ with these new annotations, you can apply them in a few places, but really what you're trying to do is affect the assembly you get out of your compiler, right? So from the C++, your compiler does stuff, and then the assembly is what you're trying to affect with likely or unlikely. We'd probably want to bias something like the test instructions, the conditional instructions, or something like that. But really, in C++ your code doesn't look like those instructions you have in assembly, and it's not what your compiler IR really looks like either. Would it make sense to apply these annotations to, say, if, else-if, and else statements only? Or maybe to switch cases as well? Or should they be applicable to any boolean at all, or even to any temporary boolean-like value? So if you have some temporary value that you're computing and passing to a function directly, without saving it into an explicit bool variable, maybe you want to say that it's likely or unlikely. And really, what do they even mean in parallel execution? Say you have some SIMD stuff, so you have, say, four values: what does it mean to say it's likely? Does it mean all four are likely? I don't know. What C++ decided was that those attributes can be applied in C++20 to any non-declaration statement, as well as labels. So we'll try to dig into that a bit today.
                                         
    
Yeah. And I think it's interesting. We go through this mental-model exercise sometimes of translating C++ into C, as if we were the classic Cfront compiler, or C into assembly, as if we were a kind of naive or straightforward C compiler. And so what would these annotations turn into in the resulting assembly is kind of what it comes back to. If you were writing assembly by hand, you might say, oh, this is my common-case condition, how do I make this as fast as possible? So somehow we're just trying to hint that attribute to the compiler: this is what I'm going for, compiler. Right, right, right. We're really looking at what the likelihood is of something happening, and how often we expect something to be true. If you look at the `__builtin_expect` documentation in GCC, it says the compiler is going to assume it'll be right about 90% of the time. Who knows where 90% comes from? I don't know, someone pulled that out of somewhere and decided 90% is what it means. So what that means is, in a loop, it'll affect unrolling of the loop, right? The loop will assume, hey, one time out of ten the loop will exit.
                                         
But if you want more fine-grained control, better granularity, you can use another intrinsic, `__builtin_expect_with_probability`. That takes the same parameters as `__builtin_expect`, but you can also give it a double value in the range zero to one, which allows you finer-grained control over the probability that something will be true or not. And what's funny is, I played with it in GCC and it lets you pass in interesting values like minus zero, as well as a NaN or whatever else. It's kind of funny, whereas Clang is really strict about [0, 1], which I find kind of disappointing. Clang doesn't like fun, I guess. But really, at the end of the day, the annotation is trying to let you put a probability on the control transfer that wasn't generally available in the language syntax.
                                         
Right. And I think this ties back into a broader question of what annotations even mean. The big question I always have about annotations is: what do they mean for your program, semantically? If they're just hints that the compiler is free to act on or not, it's free to discard them. That's kind of how you end up with a `register` or `inline` keyword that in the modern era does basically nothing. Really, we're trying to provide value to the user, but we're also trying not to violate the principle of least astonishment, which is always walking this little tightrope. So when a person upgrades their compiler and it doesn't consider something the same way it used to, maybe their annotations are discarded, and they hit a performance cliff just for upgrading their compiler. That kind of violates the principle of least astonishment we usually think about. So to be robust in the face of what the annotations are telling you, you'd want to provide some way of reasoning about what must be the case, or what should be the case, when you put an annotation. And that goes back to the idea of semantics, right? At the end of the day it's: what mental framework can I use to reason about this annotation I'm putting? That's how you can help solve for the principle of least astonishment: if that framework exists, the programmer or user has a chance of thinking about it properly. And that brings up the idea of performance cliffs.
                                         
With scalar compilers, compilers for scalar C code like we're traditionally used to, the performance cliffs tend to be a little lower than with, say, an accelerator program. And there's always this interesting question. For the machine learning compiler that I worked on, where slowdowns could cost you several X in your computation, like 2x or 3x or even more: you don't want to expose the internals too much, because that slows down the progress, the evolution of your compiler. But you also don't want to astonish the user by changing the way you think about these things in a way that would effectively discard their annotations. So there's a lot of ways that annotations are used in compilers.
                                         
    
There are things like auto-vectorization annotations, or pragmas, that don't necessarily port equivalently between environments. And there are even cool compiler projects that have taken new approaches: Halide made a physical incarnation of the functional code a first-class citizen, called the schedule. And databases classically have these query planners, which can talk about the mapping of their logical form into a physical form, in terms of how they execute. Yeah, yeah. Halide was a really cool paper when I read it, and it got productized; at least, you know, Google and Adobe are both using a derivative of it. So it's a really cool project. Yeah, it's used in a bunch of different things all over the place. And it's a really great kernel generator, you know, for making these high-performance kernels and trading off parallelization versus rematerialization. I think Jonathan Ragan-Kelley has this really good presentation about the trade-off space of things. It's kind of aspect-oriented programming, if people have heard of that before, where the challenge is: how do I refer to a piece of the program and describe its attributes in a way that composes, but also stays relevant in the face of refactoring? And that's super hard, right? Because the compiler is effectively a tool that's automatically refactoring and reworking things for you under the hood. But if I change my program, how do I know what's stable, even in the face of refactoring and changes in my program?
                                         
Right. So why do we even branch? Programs often have branches around stuff it doesn't really make sense to do, either because we're trying to be more efficient or because it would cause an exception. And you can imagine a world where no value causes exceptions, and then you want to select between different computed values. So it's largely a question of efficiency: which one do you want to do, throw away computation or not? Sometimes you might want to compute two possibilities eagerly and just select one of them, and that might be more effective than branching around to the one you do want to pick.
                                         
Really, if you think about it at the assembly level, pretty much all ISAs have the idea of conditional selection, right? So instead of having branches in assembly, you can have a conditional-select instruction. For example, x86 has CMOV, and ARMv8 has CSEL. And before that, ARM could conditionalize pretty much every instruction it had. Even in Thumb it had IT blocks, where you could conditionalize up to four subsequent instructions after the IT block. So that was kind of an interesting way of, instead of having just branches, saying: do this thing conditionally, or select this thing and move the result there, or something like that. RISC-V, on the other hand, doesn't have a CMOV, so you have to branch around everything, and it creates these really small sequences of instructions where you branch around stuff all the time. So it's kind of an interesting tradeoff that they made there.
                                         
Yeah, and it's interesting. I know that they've worked on things like instruction fusion, so in the front end you'll detect these little sequences. I'm not sure to what extent it'll detect CMOV-looking sequences; I'd have to go back and look at it. But it's also interesting what kind of peepholes you can do in the front end that basically reconstruct these CMOVs.
                                         
Speaking of predication and conditional moves and stuff: flags registers are often used as funny ways to communicate properties from the data stream, the data operations that are going on, to the control side of the machine. So you have these flags registers where, if you do a subtraction and the result is zero, maybe the zero flag gets set. That's effectively a predicate within this bigger flags register you have access to, versus actually having explicit predicate registers. So that's an interesting contrast you see in computer architecture. Flags registers notoriously can create these weird dependencies, because they alias together a bunch of properties, and they create this sideband context where, as instructions execute in a thread, these flags are kind of going in and out from individual instructions, and they may or may not be used. So at some point we'll have to do a "fun with EFLAGS" episode just about all the fun and interesting things to do with flags registers. Yeah, I bet we can spend more than an hour talking about flags in general.
                                         
So I think this also ties back into a general theme of control on von Neumann style machines. You can imagine machines different from the ones we have today; maybe you could make a pure dataflow machine. I'm actually not super familiar with the Lisp machines that people had created, but they may be less von Neumann style, with a big backing store that data is being put back against, and more dataflow. But as with everything, there's kind of this explore/exploit trade-off, and what we as an industry discovered over time was that we could tune the heck out of imperative instructions: executing little compute instructions against a register set, and doing loads and stores out of a big backing store, or at least a virtually large backing store, in this von Neumann style. You can imagine machines that operate differently, and sometimes people talk about this idea of "non-von," for non-von Neumann machines. But on the kind of machines we have, you control what instruction stream feeds the processor by redirecting it around to start fetching from other places, using these branching or jumping instructions, right? With an unconditional branch, you say exactly where to go start fetching from. With a conditional branch, you say: start fetching from this other place, but only if this thing is true. And with an indirect branch, you say: fetch from this data-dependent location that I placed in a register. So that's very closely tying a full data value from the data stream into the control side of the machine. In some ways, predication and indirect branches are both cases of the data side of the computation feeding the control side. But it's also important to note that the machine has more knowledge of the possibilities when it's got conditional branches, since there are really only two possibilities for where it might go. Usually the jump amount in a conditional branch is relative to the current program counter, and the program counter is known to the machine, so it's really just selecting between those two places for a conditional branch.
                                         
                                         Right, and so that takes us to loops, really.
                                         
                                         And the classic notion is that most of your program's execution time
                                         
                                         is probably spent in the loop somewhere.
                                         
                                         Otherwise, what would your program really do?
                                         
                                         And, you know, back of the envelope wise, you kind of have like straight line instructions on a gigahertz processor.
                                         
                                         And, you know, in one millisecond at a gigahertz, that's something like a million instructions, which at four bytes each is kind of a four megabyte binary.
                                         
                                         So it's a good amount of instructions that you're running. So it pays to think about
                                         
    
                                         what a loop actually is, right? So the kind of branching that it does, well, loops usually just
                                         
                                         branch backwards. And when you branch forward, you're really just doing that for conditional
                                         
                                         things. And when you branch backward, it's probably what's called a loop back edge, right?
                                         
                                         So it has a negative delta.
                                         
                                         So it goes backwards in memory when you do that.
                                         
                                         And usually loops have a trip count or something like that.
                                         
                                         And that goes into the flags and maybe it's counting down.
                                         
                                         So going towards zero or something like that,
                                         
    
                                         one at a time or whatever.
                                         
                                         And for example, imagine you're processing,
                                         
                                         say some RGBA value through a loop.
                                         
                                         Well, you can predict taken for the back edge always,
                                         
                                         but if you do that, you're doing RGBA,
                                         
                                         well, you'll mispredict 25% of the time,
                                         
                                         which is actually pretty bad, right?
                                         
                                         It's not the type of loop you really wanna have
                                         
    
                                         in your program.
                                         
                                         So why would you go with likely and unlikely?
                                         
                                         Well, what's even the point of the annotations at all?
                                         
                                         Well, what you're trying to do is you're trying to say,
                                         
                                         I want to optimize for the common path, really.
                                         
                                         The way I see it when I put these likely, unlikely annotations,
                                         
                                         I would say, I think this is going to be the common path.
                                         
                                         So you want to bias the code generation so that the likely code
                                         
    
                                         is really going straight line and the branches are never taken.
                                         
                                         And the reason you want to do that isn't really to help the branch predictor at all. Some ISAs have branch hints inside the instructions
                                         
                                         themselves, say like Itanium had that, but I don't think that really matters nowadays. You're not
                                         
                                         trying to help the branch predictor. What you're trying to do is you're trying to tell the compiler
                                         
                                         that it should generate code to improve the execution. For example, it could lay out the code
                                         
                                         so that each instruction fetch will fetch cache lines
                                         
                                         that are going to be fully executed.
                                         
                                         So as you have branches,
                                         
    
                                         the likely taken branches all kind of follow each other
                                         
                                         and you just never take any of the unlikely branches.
                                         
                                         So when you fetch a cache line one after the other,
                                         
                                         you just fall through to the ones that are taken.
                                         
                                         And the other thing you might want to do
                                         
                                         is you want to think about cold code paths as well as out of line code. What you're
                                         
                                         doing with the likely unlikely annotations is you're telling the compiler, this isn't really
                                         
                                         likely, so don't make it bigger, don't make it faster, just make it small. And if it happens,
                                         
    
                                         I'll take the hit on the extra instructions that I might need to execute. So speculate this won't
                                         
                                         happen. You don't want it to hoist
                                         
                                         a bunch of unlikely computation into likely code,
                                         
                                         because that would just increase the number of instructions
                                         
                                         that you execute without actually increasing
                                         
                                         the useful work that's being performed by the CPU.
                                         
                                         Yeah, and that feeds into this general notion
                                         
                                         of work efficiency.
                                         
    
                                         Even in CPU speculation,
                                         
                                         what we're doing is we're trying to fill execution
                                         
                                         units with speculative instructions, but it also burns power to run them, right? And the compiler
                                         
                                         is also reasoning about what to put on the hot path, but it's just reasoning about it statically
                                         
                                         ahead of this execution time, whereas the CPU is doing it at runtime, right? And it could be useful.
                                         
                                         It could end up burning power and taking
                                         
                                         up lots of micro architectural slots in the data path that could be used for useful computation.
                                         
                                         So if you had an oracle, it would tell you, oh, you know, you can speculate on this and it'll
                                         
    
                                         definitely help you make forward progress, but we don't have those oracles. Those are the
                                         
                                         idealizations. So if I wanted to increase my instructions per clock without getting more
                                         
                                         useful work done, I'd do it another way. It's actually fun to figure out how to write these
                                         
                                         power viruses for an ISA. So how do you max out the power consumption? And in order to do something
                                         
                                         like that, you'd really need to write assembly to very precisely control the machine. But ultimately,
                                         
                                         you probably want to optimize code size differently for a real program if something is very unlikely, right? Imagine you were writing the assembly again. You would put it on
                                         
                                         a cold path. You'd generate smaller code, as you said. You'd force extra spills to happen when
                                         
                                         you entered that path. So I'd keep my registers in place for the common case. And if that actually
                                         
    
                                         happened, then I'd go spill a bunch of stuff to the stack and patch it up and make it look the
                                         
                                         way I wanted to. So I let the straight line code in the common case have all the registers.
                                         
                                         So there's also a trade-off here of we're going to be grossing up code and we could be wrong. So
                                         
                                         sometimes you might hear in code review, if you're trying to write general purpose code that had a
                                         
                                         lot of these likely or unlikely annotations, that it might be premature optimization and you could
                                         
                                         actually be wrong about your annotation.
                                         
                                         So this counterbalance here is that you're grossing up
                                         
                                         your code with things you suspect to be the case,
                                         
    
                                         but we all have this limited foresight.
                                         
                                         So ideally we wouldn't need to say anything
                                         
                                         that would turn out to be wrong.
                                         
                                         And instead we could just key off the evidence
                                         
                                         that we observed instead of using our intuition here.
                                         
                                         In some cases, when you are wrong,
                                         
                                         you have to clean up the speculative things that you did.
                                         
                                         Say if you changed state that wasn't in a register,
                                         
    
                                         I guess that's at the CPU level
                                         
                                         more than at the compiler level,
                                         
                                         but it's something you have to do when the CPU is wrong,
                                         
                                         when it speculates incorrectly.
                                         
                                         Say we were logging things as like verbose logging,
                                         
                                         and that only got activated
                                         
                                         when the verbose logging mode was on.
                                         
                                         We could actually put into our standard functions or logging macros, the likely or unlikely annotations. In this case,
                                         
    
                                         it's unlikely that I'm doing verbose logging in production. And that helps the compiler know which
                                         
                                         slow paths are unlikely to be executed throughout the code base, and it can push those into cold
                                         
                                         code. Right. And when we talk about that, like, you know, you're trying to figure out, hey, you put
                                         
                                         these annotations, are they ever true or not? At some point you realize, well, shouldn't
                                         
                                         just profile guided optimizations in the compiler take care of all of this, right? Is it a magic
                                         
                                         wand that will solve all of it? And in theory, yes, it can totally solve this issue. You wouldn't
                                         
                                         need to annotate anything. But realistically, what I've seen is very few projects turn on PGO
                                         
                                         correctly, right? So some are gonna maybe just collect data once
                                         
    
                                         or every few months or whatever,
                                         
                                         and then use those data they've collected
                                         
                                         for years down the line, right?
                                         
                                         So they just update it every so often and just whatever.
                                         
                                         They use whatever previous data they had.
                                         
                                         They keep changing the code
                                         
                                         and the data aren't really connected
                                         
                                         with what the new program does anymore.
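For a rough picture of what "turning PGO on correctly" involves, here's clang's instrumentation-based flow as a sketch -- the flags are clang/LLVM's, the file and binary names are made up for illustration, and the hard part is making the workload in the second step actually representative:

```shell
# Build an instrumented binary that records edge/branch counts.
clang -O2 -fprofile-instr-generate app.c -o app-instrumented
# Run it on a representative workload; raw profile data is written out.
LLVM_PROFILE_FILE=app.profraw ./app-instrumented representative-input
# Merge and index the raw profiles into the form the compiler consumes.
llvm-profdata merge -output=app.profdata app.profraw
# Rebuild with the profile driving block layout and inlining decisions.
clang -O2 -fprofile-instr-use=app.profdata app.c -o app
```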
                                         
    
                                         Whereas others will just kind of collect PGO data
                                         
                                         on all the unit tests, right? And the thing with unit tests is, by definition, they test the
                                         
                                         odd cases more than the common ones. And so sure, you got profile data on your unit tests, and it's
                                         
                                         cool because your unit tests might run faster once you compile them, but you've optimized for the
                                         
                                         wrong thing because I bet most of your users aren't running your unit tests, right? They're running, you know, something else. They're not running into
                                         
                                         all the corner cases unless you have really weird users. So in fact, with actual good profiles,
                                         
                                         you wouldn't need these likely and unlikely annotations. So you'd have to set up a really
                                         
                                         good pipeline to run representative samples of what it is that your program does to use PGO
                                         
    
                                         correctly. But at the same time, you look at this in programs where performance really matters,
                                         
                                         where performance is a feature, right? Not just something you kind of sprinkle on at the end.
                                         
                                         It's actually kind of good documentation to say this is likely, this is not likely.
                                         
                                         But it's only good documentation, as any documentation goes, when you get it right.
                                         
                                         If you get it wrong, it's not very good documentation.
                                         
                                         So as you were saying, it'd be cool to have
                                         
                                         PGO tell you if your annotations are wrong.
                                         
    
                                         So if I had these annotations inside my code
                                         
                                         and I had a really good PGO instrumentation,
                                         
                                         PGO could come in and tell me, OK, well, here's
                                         
                                         what I figured out about your program.
                                         
                                         And by the way, you were completely wrong over there. Or it
                                         
                                         could say, by the way, I never executed this branch that you told me would be likely or unlikely.
                                         
                                         And that's interesting because it tells you, you don't have good PGO coverage, right? Not just you
                                         
                                         were wrong with the coverage you provided me, but you didn't even try to cover this, right? So that
                                         
    
                                         would be a super cool insight into the code, where the programmer says what they think is important and they get a reality check on it, right? So as a documentation thing, when performance is a feature
                                         
                                         in your program, that'd be super cool. Yeah. And I will say one really neat thing that they do in
                                         
                                         hardware design is they have this idea of cover points. So basically ahead of time, you think
                                         
                                         about what scenarios are important for you to cover in terms of your verification of the piece of hardware that you're making. But you could also have these performance-oriented cover
                                         
                                         points, right, where you describe conditions that you definitely want to be collecting profiles for
                                         
                                         because you think they're important to see as your benchmarks run over your program,
                                         
                                         right? So there's this interesting idea that you could do performance-based cover points.
                                         
                                         And so I also remember that Firefox got
                                         
    
                                         some good mileage out of its profile-guided optimization builds. And there's another system
                                         
                                         that I thought was really cool: it's called GWP, which is short for Google-Wide
                                         
                                         Profiling. And it has this system called AutoFDO. And there's a paper on this, which is a really
                                         
                                         neat paper, where it keeps a continuous pipeline going, collecting information from applications that are
                                         
                                         running throughout Google. And then it actually describes in the paper how performance drops off
                                         
                                         as a function of time using stale data. So if you capture the data for the symbols in a binary on a
                                         
                                         given day, and then you wait a couple of weeks later, the code is constantly changing and evolving.
                                         
                                         So the effectiveness of that data you collected drops off over time. And it actually shows how effective it remains as a function of time, which is really
                                         
    
                                         cool. The GWP thing is really neat because it's running on a live system and collecting data on
                                         
                                         a system that's being continuously changed as well. So that's a really neat paper to read.
                                         
                                         And another one I heard about, I don't think I've ever read about this, but I've heard a few people come in and say,
                                         
                                         well, actually, when I use these likely unlikely annotations,
                                         
                                         I completely lie to the compiler because I,
                                         
                                         the person who says this, have an exceptional system.
                                         
                                         And they say, well, I wanna optimize
                                         
                                         for the unlikely thing, right?
                                         
    
                                         So because that's the one that matters.
                                         
                                         So I'm gonna tell the compiler,
                                         
                                         this is likely when actually,
                                         
                                         that's the completely unlikely thing that never happens.
                                         
                                         But if it happens, I want the compiler to make it extremely fast.
                                         
                                         And I've heard the argument from a handful of people who either do something like high-frequency trading
                                         
                                         or folks who do kind of safety systems where, say, like some emergency brake or firing missiles or whatever
                                         
                                         is a thing that you want to do really quickly.
                                         
    
                                         After taking a nap, of course.
                                         
                                         Right. After taking a nap. I thought that was interesting kind of mind bend in the way
                                         
                                         people approach stuff, right? Because the code that ought to never execute,
                                         
                                         well, if it does execute, it ought to execute really fast when it unfortunately
                                         
                                         has to execute. Or, you know, if you're doing HFT, when unfortunately
                                         
                                         you're trying to make money really fast
                                         
                                         or whatever it is that they do.
                                         
                                         It's interesting to think when we go back to PGO,
                                         
    
                                         well, if you use PGO and you had tests
                                         
                                         that biased towards the unlikely thing
                                         
                                         that you want to make fast,
                                         
                                         then maybe PGO would be a better option there.
                                         
                                         I don't know.
                                         
                                         Again, speaking about HFT,
                                         
                                         I don't know if options are a thing
                                         
                                         we should be talking about, but whatever.
                                         
    
                                         It's good to have options.
                                         
                                         Right. You just have to call on them sometimes.
                                         
                                         I think it's also interesting to go back to this, I guess, the theme of what are the simplest possible solutions in the space here?
                                         
                                         So our episode title is this unlikely annotation.
                                         
                                         And it's potentially a relatively simple solution compared to something involved like setting up an FDO pipeline or something like this.
                                         
                                         And I'm sorry, did we define FDO?
                                         
                                         Feedback directed optimization.
                                         
                                         I forget if we did.
                                         
    
                                         So there's a spectrum of feedback directed optimization, which goes from offline collection of profiles to, you know, just in time observation of your code.
                                         
                                         And it all lives in this kind of feedback directed spectrum. So there's something
                                         
                                         on the spectrum, which is the programmer has something in their head and they type it into
                                         
                                         the program because they believe that it's the case. That's somewhere on this feedback directed
                                         
                                         spectrum. It's actually the programmer's thinking intent. But we go back to what if you have no
                                         
                                         branch predictor at all in your hardware and you've got a conditional branch in your CPU front end,
                                         
                                         what are the simplest things you can do?
                                         
                                         So one is to foist it onto the C programmer and you know,
                                         
    
                                         it's all about balance of probabilities.
                                         
                                         You could do simple things like just predict randomly which way the branch is
                                         
                                         going to go.
                                         
                                         And if you had a uniform distribution of taken and not taken in the source
                                         
                                         program, you'd have a 50% chance of getting it right.
                                         
                                         So that makes a nice mental baseline for comparison.
                                         
                                         But I think it's pretty clear
                                         
                                         that you could do even better than that,
                                         
    
                                         whether foisting it onto the user
                                         
                                         or randomly predicting whether it's taken
                                         
                                         or not taken as your two options.
                                         
                                         Yeah, absolutely.
                                         
                                         And really what we're trying to get to
                                         
                                         when we talk about branches
                                         
                                         is that hardware isn't instantaneous, right?
                                         
                                         Like instructions
                                         
    
                                         take a little while to execute. And the problem is you have a branch and you don't know which way to
                                         
                                         go, right? The CPU is going to see a branch and then be like, oh, where should I go now? And
                                         
                                         that's why these likely and unlikely annotations are useful, because they help you say
                                         
                                         which way it might go and then try to bias the CPU. The compiler says, hey, let's help the CPU
                                         
                                         figure out where to go.
                                         
                                         And if you go back to say like the MIPS days
                                         
                                         and the Hennessy and Patterson book,
                                         
                                         they added delay slots to MIPS to do something like that, right?
                                         
    
                                         So branches have what's called delay slots,
                                         
                                         which say: while you figure out where the branch is going to go,
                                         
                                         well, maybe I have a few more instructions that you can munch on, right?
                                         
                                         So right after the branch, there's, say, an extra instruction that you can execute as an extra thing.
                                         
                                         So you know you're going to take a branch.
                                         
                                         You're going to be like, oh, hey, I should figure out where this branch is going to go.
                                         
                                         But hey, while I'm figuring out this branch,
                                         
    
                                         let me eat those extra instructions after the branch.
                                         
                                         And delay slots are a way to hide the latency
                                         
                                         to send the conditional predicate to the instruction fetch unit.
                                         
                                         And it effectively says
                                         
                                         like there's, you know, some instructions that execute unconditionally associated with every
                                         
                                         conditional branch, because otherwise, if I don't do that, I'm going to have dead space, right? So
                                         
                                         it's kind of acknowledging like, I'm going to need a few instructions to do stuff afterwards.
                                         
                                         It's baking it into the ISA. Like I'm going to need one slot
                                         
    
                                         or two slots or three slots of work to do after you tell me that you're going to do a branch.
                                         
                                         It always looks really strange when you see an instruction in the program under the branch,
                                         
                                         but conceptually you can think of it as being kind of paired up with the branch, right? It's
                                         
                                         branch and then also do these other things before branching. Right. And so just as a reminder,
                                         
                                         so if we're taking our conditional branch, then we're going to the target. So like you said, the instruction fetcher
                                         
                                         has to kind of Heisenfetch from these two possible locations. One is that the branch is not taken, so we
                                         
                                         just fall through to the next instruction, or we have to redirect where we're fetching from to the
                                         
                                         other location. And there's a little memory in your system that it's trying to read the next
                                         
    
                                         instruction from. And in the simplest possible case, it just has one port for you to go fetch an instruction out.
                                         
                                         And so you have to figure out the address and then go fetch that address from this little memory.
                                         
                                         And so now CPUs have this fundamental problem when they're trying to be as simple as possible,
                                         
                                         that they live in this poised state where you're fetching from either the next instruction, or that thing arbitrarily far away, or misaligned, or switching processor modes, or various forms of things that can happen when
                                         
                                         you jump. But when you collapse this wave function, it's only going to do one of those.
                                         
                                         So, but then here's the problem for modern CPUs. So we kind of talked about, okay, when you're
                                         
                                         going for max simplicity, what are you going to do? In a modern CPU, the branches could
                                         
                                         conceptually make you seize up and wait
                                         
    
                                         until you're sure in the look before you leap kind of style. And then you'd look for something
                                         
                                         like delay slots. But alternatively, we could do the, it's easier to ask for forgiveness than
                                         
                                         permission thing. And then we just keep running from there to wherever we think we're going to
                                         
                                         branch to, and use special retirement hardware to prevent those results from landing in
                                         
                                         committed architectural state.
                                         
                                         So the kind of thing you would see in a debugger if you stopped it on a given program counter
                                         
                                         or if you were stepping instructions.
                                         
                                         That's your architecturally visible and committed state.
                                         
    
                                         So effectively, there's a little bit of microtransactionality happening inside where you have these faculties
                                         
                                         inside of the CPU pipeline that say, all right,
                                         
                                         I did this, but I'm not sure it's definitely what has to happen. I did it because I think it's
                                         
                                         probably going to have to happen, but I'm not sure. And so, you know, those have to be committed
                                         
                                         via the retirement hardware. So we get tons of performance out of the instruction stream on
                                         
                                         modern CPUs by speculating past branches that would otherwise paralyze us in place. And we run the
                                         
                                         stuff we expect we're going to have to run. So this is the classic notion of ILP mining. So ILP
                                         
                                         stands for instruction level parallelism. So we're churning through the instruction stream and
                                         
    
                                         finding things that we can run concurrently on our execution paths along branches we expect we're
                                         
                                         going to take in order to reduce the overall execution
                                         
                                         time witnessed by the user for running the overall program. And, you know, a classic conundrum is as
                                         
                                         your CPU pipeline gets really deep, this kind of becomes very difficult or doesn't work at some
                                         
                                         point. So your pipeline depth means you have potentially a bunch of branches in flight
                                         
                                         simultaneously. So let's say just for random
                                         
                                         number purposes, let's say my pipeline depth was 40 and my basic block size was five. I'd have eight
                                         
                                         branches in flight at any given time. And so even if I can predict each one with 90% accuracy, now I have
                                         
                                         a 0.9-to-the-eighth chance of predicting all those branches properly. So I'll have a bunch
                                         
                                         of wasted work going back to that notion of wasted work
                                         
                                         that we talked about before.
                                         
                                         So there's some number of branches
                                         
                                         you can do before you're back at a 50-50 probability
                                         
                                         of some misprediction.
                                         
                                         Yeah, and you mentioned retirement hardware.
                                         
                                         And what is that even?
                                         
    
                                         Is that where compiler engineers go when they're done doing
                                         
                                         compilers or something?
                                         
                                         Well, not really. Basically, you know, you have a bunch of extra registers and caches and
                                         
                                         whatever else inside the hardware, and they're not supposed to be visible from the ISA.
                                         
                                         So these extra registers, you know, say you're on ARMv8 or whatever: you have, say,
                                         
                                         32-ish registers, 31 really, and the other ones in the actual hardware aren't visible.
                                         
                                         You might have twice as many or something more. They're not supposed to be visible.
                                         
                                         We've learned in the last three, four years that they are visible through timing,
                                         
                                         but you know what I mean. Let's ignore that the timing is a thing. But the hardware
                                         
                                         keeps track of all these extra registers and caches and whatever to remember which ones
                                         
                                         are speculative, right? Which ones are registers that hold values that haven't really been
                                         
                                         fully committed, and which ones hold what's usually called architectural state.
                                         
                                         And what it's going to do, it's going to throw away missed speculation and roll back to the
                                         
                                         known good state.
                                         
                                         So say, you know, say you're on ARM or whatever, and you have 31 registers, there might
                                         
    
                                         be, you know, twice as many. And it's not a set, like which one is real and which one isn't. It's,
                                         
                                         you know, through register renaming, you kind of figure out which ones are the quote unquote real
                                         
                                         registers right now or not. And it's the same thing with caches, right? Like caches are just
                                         
                                         this extra level of caching that allow you to have that speculation. And basically retirement just tracks
                                         
                                         what the point of no return is
                                         
                                         in the reorderable instruction sequence.
                                         
                                         I'm really oversimplifying here.
                                         
                                         And again, like this is another one
                                         
    
                                         we could do a slew of new episodes on the topic,
                                         
                                         but it's really interesting to think about this
                                         
                                         when we talk about how
                                         
                                         from a programming language perspective,
                                         
                                         you're trying
                                         
                                         to bias what the compiler is going to do to generate specific assembly instructions to then
                                         
                                         have the hardware run them and speculate on the things that are going to happen. There's really
                                         
                                         layers of speculations in there and layers of abstraction. Even when you talk about assembly,
                                         
    
                                         there's so much abstraction in it. Yeah, it's amazing that assembly is really like a little interpreted language that gets unrolled into these micro instructions under the hood.
                                         
                                         That's always a pretty fascinating thing.
                                         
                                         And I think the point of no return is interesting because it shows the transactionality that we were talking about earlier. The point of no return usually says an exception cannot happen beyond this point; like,
                                         
                                         everything up to this point has been committed to architectural state, I know that the machine definitely got here.
                                         
                                         So it's kind of the commit point that's running ahead through the program. And so there's also
                                         
                                         a bunch of different kinds of machines that are cited through the literature, and there is kind
                                         
                                         of a notion of brawny versus wimpy cores we're talking about here, with things
                                         
                                         like, you know, branch delay slots and trying to keep hardware as simple as possible versus big
                                         
    
                                         reordering hardware and speculation and prediction and things but even in the modern era for embedded
                                         
                                         systems or systems that have less heavyweight hardware per thread, having the knowledge that
                                         
                                         programmers have in their head, or that, you know, profiling determined,
                                         
                                         can be very helpful. So things like GPUs, for example, they amortize their control hardware
                                         
                                         over some number of threads that they hope will be running in tandem. So they're kind of trying to
                                         
                                         get as much information about whether the threads will run in unison as possible.
                                         
                                         So in order to be Turing complete, which is kind of our baseline level of expressiveness
                                         
                                         that we expect out of machines today, you need data to be able to influence control somehow.
                                         
    
                                         So basically you need to be able to do a while loop or an if or something like this, right?
                                         
                                         While loop effectively. So most systems will have to consider how the data is going to interact with
                                         
                                         the control side of the machine at some level. And to go back to the brawny versus wimpy core thing, there's some amazing hardware
                                         
                                         that's been landed or at least been proposed for beefy machines.
                                         
                                         So one example that I really loved was this idea of perceptrons.
                                         
                                         And that was the early use of neural networks for branch prediction.
                                         
                                         So ultimately, you just have to predict one bit given some history of what branches have done. And you have this little
                                         
                                         neural network doing a matrix multiply that has to determine this. And of course, there's all
                                         
    
                                         these notions that come in of the size and how power hungry it is and the importance of latency.
                                         
                                         We need the answer quickly to which way this branch is going to go so that we know where to
                                         
                                         start fetching from. It can't take too long. Otherwise, we're trying to figure out how to hide the latency of the thing that determines
                                         
                                         how to hide the latency.
                                         
                                         And so that's kind of, it kind of moves towards what are state of the art branch predictors,
                                         
                                         which we'll have to talk about in some future episode.
                                         
                                         And effectively what we're doing here is we're forming things that to me look like VLIW bundles.
                                         
                                         So you kind of take code that's in a big sequence, and then you figure out which ones are really going to run, and then you turn them horizontal to run as micro operations against
                                         
    
                                         all these data paths that you have. And you do that on the fly using the natural size of basic
                                         
                                         blocks that exist and the reservation station scoreboarding algorithms that exist inside of
                                         
                                         the machine. And so there's also interesting things like how do you form regions out of these
                                         
                                         little traces of code that get executed in the literature. That ties into things like the reorder
                                         
                                         buffer size. So CPUs have this big set of instructions that they can reorder to execute.
                                         
                                         Reorder buffering and region formation are related concepts. There's also challenges of dispatching
                                         
                                         out of a really big reorder buffer. And there's things like the Apple M1 silicon, which has been pushing the limits here. And going back to historical papers,
                                         
                                         there's cool things like the trace cache in the Pentium 4 that really, I think, shows the spectrum
                                         
    
                                         between the approaches, because that was actually a hardware structure that was storing micro-op
                                         
                                         decoded tracelets that it strung together using all hardware primitives, which is pretty cool.
                                         
                                         Yeah. And there's so much more to talk about here. There's so many other things, for example,
                                         
                                         you know, vector predicated masks, SIMT, vector threading, Cray style run ahead, and, you know,
                                         
                                         branch target prediction for indirect branches. How does branch prediction, you know, take address
                                         
                                         bits and track the history? What's the aliasing in the subset of bits that you use to track branches?
                                         
                                         How do you use local and global history tables in branch prediction schemes?
                                         
                                         And so much more, like hybrid global local branch predictor mixes,
                                         
    
                                         virtuals as the ultimate control structure and relying on de-virtualization.
                                         
                                         So how do you bias the virtual dispatch with de-virtualization?
                                         
                                         So there's a lot
                                         
                                         that we haven't covered here. And again, so many more episodes to go. But really, we've been talking
                                         
                                         about __builtin_expect, and we haven't really talked about __builtin_assume, right? Because
                                         
                                         __builtin_expect is just saying, yeah, expect this thing, but this other thing might happen as well.
                                         
                                         If you talk about __builtin_assume, well, it's really saying, assume that this is going to
                                         
                                         happen. If it doesn't happen, then all bets are off, just do whatever you want with my program, I don't really care,
                                         
                                         it's totally undefined behavior if that happens. And __builtin_assume is another interesting thing
                                         
                                         to use in there, right? There's certain programs where people have assert in one
                                         
                                         compilation mode and then assume in the other one, and I've heard kind of extreme wars about whether that's a valid thing
                                         
                                         to do or not, depending on what the background of the programmer is. So it's kind of an interesting
                                         
                                         trade-off to do there. Expect and assume are definitely kind of hot button topics for different
                                         
                                         people. Yeah. And it's interesting to say, hey, compiler, you can really take this as a given.
                                         
                                         And then of course things can be wrong. And so then the question is how
                                         
                                         wrong do things go when one of your axioms is untrue. It's a funny, funny idea.
                                         
    
                                         And so at the end of the day, I think to kind of summarize a lot of this,
                                         
                                         scalar code really wants to have this notion of control inside where it's kind of moving around
                                         
                                         the instruction that's being executed, changing the program counter to point at different places inside of your program.
                                         
                                         But we really only have these crude ways of asking for things to happen.
                                         
                                         And so usually we hope the compiler will have some way of doing the right thing, but often
                                         
                                         it has no way of knowing what the right thing is at compile time, given just the information
                                         
                                         presented to it in the program at large.
                                         
                                         Yeah, so I guess the compiler just has to expect the unexpected.
                                         
    
                                         Yeah, like if somebody was trying to predict how often our podcasts come out,
                                         
                                         that's basically impossible.
                                         
                                         That's true.
                                         
                                         I mean, unless they have a side channel,
                                         
                                         it would be a really good source of entropy for random number generators
                                         
                                         or something like that.
                                         
                                         Nice.
                                         
                                         All right, well, that's our episode.
                                         
    
                                         So thanks for hanging out, Chris.
                                         
                                         And I hope I won't make predictions,
                                         
                                         but I think we'll do another episode in the near future.
                                         
                                         I expect that we will.
                                         
                                         Great.
                                         
                                         Thanks.
                                         
                                         Later.
                                         
