TLB Hit 💥 - Episode 3: __builtin_expect(!!(x), 0)
Episode Date: April 19, 2021

* What is it we know statically?
* What's effectively discoverable only at runtime?
* How do we tell "the machine" (compiler and/or hardware):
  * Things we *know* to be true...
  * Things we *expect* to ...be true...
  * Things we *expect* to be true but *want to do something about* when it's not...
  * Things we have no idea about!
* How do we collect that information that we have no idea about?
* What happens if we're wrong? What if the workload is inherently irregular?
  * In the limit, random noise could drive our control decisions!
* We talked a bit about precise vs imprecise exceptions before and automatic reordering, and we've mentioned vector machines and auto-vectorization.
  * All falls into the broader theme here, but we're always questioning what we can actually cover in an hour...
  * We'll try to give it a go for a subset of these things!
  * Time is often the limiting factor.
* The episode title is the thing that we often macro define as `#define UNLIKELY`
  * In C/C++ code you might say "this control condition is unlikely", and say `if (UNLIKELY(condition))`
  * In C++20 there was added these `[[likely]]` and `[[unlikely]]` annotations that do the same thing, but with more square brackets, so clearly better!
 Transcript
Hey Chris, it's been a while since we've recorded a podcast, so how about we talk about something like `__builtin_expect(!!(x), 0)`? Yep, that sounds good. And standard disclaimer at the start of our podcasts: we only know so much. We'll try not to say things that are wrong, and we'll follow up with errata, corrections, and clarifications in the show notes as we find out about things. And we do the lifelong-learners thing: we're always trying to push the boundaries of things that we know about and have conversations about what's interesting, to develop and hone our understanding.
    
All right. We kind of have a broad theme here, and this is really a first episode in a really broad series of: what is it that we know statically? What's effectively discoverable only at runtime? And how do we tell the compiler and/or the hardware things we know to be true, things we expect to be true, things we expect to be true but want to do something about if they're not, and things we have no idea about? So how do we collect that information that we have really no idea about, and what happens if we're wrong? What if the workload is just inherently irregular? In the limit, imagine random noise driving our control decisions. We talked a bit about precise versus imprecise exceptions before, and automatic reordering, and I think we've mentioned vector machines and auto-vectorization.
    
And so all of that should fall into the broader theme here. But, you know, we're always really questioning what you can actually cover in an hour. Let's try to give it a go, at least for a subset of these things. Yeah, the time is often the limiting factor. So as our lead-in into the world of discussion: the episode title here is the thing that we macro into `UNLIKELY`. So in C or C++ code, you might say, ah, this control condition is unlikely, like `if (UNLIKELY(condition))`. And in C++20 there were added these double-bracket `[[likely]]` and `[[unlikely]]` annotations that do the same thing, but with more square brackets, so it's clearly better. Right, it's really better.

The idea is you want to be able to annotate your code and say something's likely or unlikely, but what does it even make sense to annotate? Syntactically, in C++ with these new annotations, you can apply them in a few places, but really what you're trying to do is affect the assembly you get out of your compiler, right? So from the C++, your compiler does stuff, and then the assembly is what you're trying to affect with likely or unlikely. We'd probably want to bias something like the test instructions, the conditional instructions, or something like that. But really, in C++ your code doesn't look like those instructions you have in assembly, and it's not what your compiler IR really looks like either. Would it make sense to apply these annotations to, say, if, else-if, and else statements only? Or maybe to switch cases as well? Or should they be applicable to any boolean at all, or even to any temporary boolean-like value? So if you have some temporary value that you're computing and passing to a function directly, without saving it into an explicit bool variable, maybe you want to say that it's likely or unlikely. And really, what do they even mean in parallel execution? Say you have some SIMD stuff, so you have, say, four values: what does it mean to say it's likely? Does it mean all four are likely? I don't know. What C++ decided was that those attributes can be applied in C++20 to any non-declaration statement, as well as labels. So we'll try to dig into that a bit today.
                                         
    
Yeah. And I think it's interesting. We go through this mental-model exercise sometimes of translating C++ into C, as if we were the classic Cfront compiler, or C into assembly, as if we were a kind of naive or straightforward C compiler. And so what would these annotations turn into in the resulting assembly is kind of what it comes back to. If you were writing assembly by hand, you might say, oh, this is my common-case condition, how do I make this as fast as possible? So somehow we're just trying to hint that attribute to the compiler: this is what I'm going for, compiler. Right, right, right. We're really looking at what the likelihood is of something happening, and how often we expect something to be true. If you look at the `__builtin_expect` documentation in GCC, it says the compiler is going to assume it'll be right about 90% of the time. Who knows where 90% comes from? I don't know, someone pulled that out of somewhere and decided 90% is what it means. So what that means is, in a loop, it'll affect unrolling of the loop, right? The loop will assume, hey, one time out of ten the loop will exit.
                                         
But if you want more fine-grained control, better granularity, you can use another intrinsic, `__builtin_expect_with_probability`. That takes the same parameters as `__builtin_expect`, but you can also give it a double value in the range zero to one, which allows you finer-grained control over the probability that something will be true or not. And what's funny is, I played with it in GCC and it lets you pass in interesting values like minus zero, as well as a NaN or whatever else. It's kind of funny, whereas Clang is really strict about [0, 1], which I find kind of disappointing. Clang doesn't like fun, I guess. But really, at the end of the day, the annotation is trying to let you put a probability on the control transfer that wasn't generally available in the language syntax.
                                         
Right. And I think this ties back into a broader question of what annotations even mean. The big question I always have about annotations is: what do they mean for your program, semantically? If they're just hints that the compiler is free to act on or not, it's free to discard them. That's kind of how you end up with a `register` or `inline` keyword that in the modern era does basically nothing. Really, we're trying to provide value to the user, but we're also trying not to violate the principle of least astonishment, which is always walking this little tightrope. So when a person upgrades their compiler and it doesn't consider something the same way it used to, maybe their annotations are discarded, and they hit a performance cliff just for upgrading their compiler. That kind of violates the principle of least astonishment we usually think about. So to be robust in the face of what the annotations are telling you, you'd want to provide some way of reasoning about what must be the case, or what should be the case, when you put an annotation. And that goes back to the idea of semantics, right? At the end of the day it's: what mental framework can I use to reason about this annotation I'm putting? That's how you can help solve for the principle of least astonishment: if that framework exists, the programmer or user has a chance of thinking about it properly. And that brings up the idea of performance cliffs.
                                         
With scalar compilers, compilers for scalar C code like we're traditionally used to, the performance cliffs tend to be a little lower than with, say, an accelerator program. And there's always this interesting question. For the machine learning compiler that I worked on, where slowdowns could cost you several X in your computation, like 2x or 3x or even more: you don't want to expose the internals too much, because that slows down the progress, the evolution of your compiler. But you also don't want to astonish the user by changing the way you think about these things in a way that would effectively discard their annotations. So there's a lot of ways that annotations are used in compilers.
                                         
    
There are things like auto-vectorization annotations, or pragmas, that don't necessarily port equivalently between environments. And there are even cool compiler projects that have taken new approaches: Halide made a physical incarnation of the functional code a first-class citizen, called the schedule. And databases classically have these query planners, which can talk about the mapping of their logical form into a physical form, in terms of how they execute. Yeah, yeah. Halide was a really cool paper when I read it, and it got productized; at least, you know, Google and Adobe are both using a derivative of it. So it's a really cool project. Yeah, it's used in a bunch of different things all over the place. And it's a really great kernel generator, you know, for making these high-performance kernels and trading off parallelization versus rematerialization. I think Jonathan Ragan-Kelley has this really good presentation about the trade-off space of things. It's kind of aspect-oriented programming, if people have heard of that before, where the challenge is: how do I refer to a piece of the program and describe its attributes in a way that composes, but also stays relevant in the face of refactoring? And that's super hard, right? Because the compiler is effectively a tool that's automatically refactoring and reworking things for you under the hood. But if I change my program, how do I know what's stable, even in the face of refactoring and changes in my program?
                                         
Right. So why do we even branch? Programs often have branches around stuff it doesn't really make sense to do, either because we're trying to be more efficient or because it would cause an exception. And you can imagine a world where no value causes exceptions, and then you want to select between different computed values. So it's largely a question of efficiency: which one do you want to do, throw away computation or not? Sometimes you might want to compute two possibilities eagerly and just select one of them, and that might be more effective than branching around to the one you do want to pick.
                                         
Really, if you think about it at the assembly level, pretty much all ISAs have the idea of conditional selection, right? So instead of having branches in assembly, you can have a conditional-select instruction. For example, x86 has CMOV, and ARMv8 has CSEL. And before that, ARM could conditionalize pretty much every instruction it had. Even in Thumb it had IT blocks, where you could conditionalize up to four subsequent instructions after the IT block. So that was kind of an interesting way of, instead of having just branches, saying: do this thing conditionally, or select this thing and move the result there, or something like that. RISC-V, on the other hand, doesn't have a CMOV, so you have to branch around everything, and it creates these really small sequences of instructions where you branch around stuff all the time. So it's kind of an interesting tradeoff that they made there.
                                         
Yeah, and it's interesting. I know that they've worked on things like instruction fusion, so in the front end you'll detect these little sequences. I'm not sure to what extent it'll detect CMOV-looking sequences; I'd have to go back and look at it. But it's also interesting what kind of peepholes you can do in the front end that basically reconstruct these CMOVs.
                                         
Speaking of predication and conditional moves and stuff: flags registers are often used as funny ways to communicate properties from the data stream, the data operations that are going on, to the control side of the machine. So you have these flags registers where, if you do a subtraction and the result is zero, maybe the zero flag gets set. That's effectively a predicate within this bigger flags register you have access to, versus actually having explicit predicate registers. So that's an interesting contrast you see in computer architecture. Flags registers notoriously can create these weird dependencies, because they alias together a bunch of properties, and they create this sideband context where, as instructions execute in a thread, these flags are kind of going in and out from individual instructions, and they may or may not be used. So at some point we'll have to do a "fun with EFLAGS" episode just about all the fun and interesting things to do with flags registers. Yeah, I bet we can spend more than an hour talking about flags in general.
                                         
So I think this also ties back into a general theme of control on von Neumann style machines. You can imagine machines different from the ones we have today; maybe you could make a pure dataflow machine. I'm actually not super familiar with the Lisp machines that people had created, but they may be less von Neumann style, with a big backing store that data is being put back against, and more dataflow. But as with everything, there's kind of this explore/exploit trade-off, and what we as an industry discovered over time was that we could tune the heck out of imperative instructions: executing little compute instructions against a register set, and doing loads and stores out of a big backing store, or at least a virtually large backing store, in this von Neumann style. You can imagine machines that operate differently, and sometimes people talk about this idea of "non-von," for non-von Neumann machines. But on the kind of machines we have, you control what instruction stream feeds the processor by redirecting it around to start fetching from other places, using these branching or jumping instructions, right? With an unconditional branch, you say exactly where to go start fetching from. With a conditional branch, you say: start fetching from this other place, but only if this thing is true. And with an indirect branch, you say: fetch from this data-dependent location that I placed in a register. So that's very closely tying a full data value from the data stream into the control side of the machine. In some ways, predication and indirect branches are both cases of the data side of the computation feeding the control side. But it's also important to note that the machine has more knowledge of the possibilities when it's got conditional branches, since there are really only two possibilities for where it might go. Usually the jump amount in a conditional branch is relative to the current program counter, and the program counter is known to the machine, so it's really just selecting between those two places for a conditional branch.
                                         
                                         Right, and so that takes us to loops, really.
                                         
                                         And the classic notion is that most of your program's execution time
                                         
                                         is probably spent in the loop somewhere.
                                         
                                         Otherwise, what would your program really do?
                                         
                                         And, you know, back of the envelope wise, you kind of have like straight line instructions on a gigahertz processor.
                                         
                                         And, you know, in one millisecond at a gigahertz, that's something like a million instructions, which at four bytes each is kind of a four megabyte binary.
                                         
                                         So it's a good amount of instructions that you're running. So it pays to think about
                                         
    
                                         what a loop actually is, right? So the kind of branching that it does, well, loops usually just
                                         
                                         branch backwards. And when you branch forward, you're really just doing that for conditional
                                         
                                         things. And when you branch backward, it's probably what's called a loop back edge, right?
                                         
                                         So it has a negative delta.
                                         
                                         So it goes backwards in memory when you do that.
                                         
                                         And usually loops have a trip count or something like that.
                                         
                                         And that goes into the flags and maybe it's counting down.
                                         
                                         So going towards zero or something like that,
                                         
    
                                         one at a time or whatever.
                                         
                                         And for example, imagine you're processing,
                                         
                                         say some RGBA value through a loop.
                                         
                                         Well, you can predict taken for the back edge always,
                                         
                                         but if you do that, you're doing RGBA,
                                         
                                         well, you'll mispredict 25% of the time,
                                         
                                         which is actually pretty bad, right?
                                         
                                         It's not the type of loop you really wanna have
                                         
    
                                         in your program.
                                         
                                         So why would you go with likely and unlikely?
                                         
                                         Well, what's even the point of the annotations at all?
                                         
                                         Well, what you're trying to do is you're trying to say,
                                         
                                         I want to optimize for the common path, really.
                                         
                                         The way I see it when I put these likely, unlikely annotations,
                                         
                                         I would say, I think this is going to be the common path.
                                         
                                         So you want to bias the code generation so that the likely code
                                         
    
                                         is really going straight line and the branches are never taken.
                                         
                                         And the reason you want to do that isn't really to help the branch predictor at all. Some ISAs have branch hints inside the instructions
                                         
                                         themselves, say like Itanium had that, but I don't think that really matters nowadays. You're not
                                         
                                         trying to help the branch predictor. What you're trying to do is you're trying to tell the compiler
                                         
                                         that it should generate code to improve the execution. For example, it could lay out the code
                                         
                                         so that each instruction fetch will fetch cache lines
                                         
                                         that are going to be fully executed.
                                         
                                         So as you have branches,
                                         
    
                                         the likely taken branches all kind of follow each other
                                         
                                         and you just never take any of the unlikely branches.
                                         
                                         So when you fetch a cache line one after the other,
                                         
                                         you just fall through to the ones that are taken.
                                         
                                         And the other thing you might want to do
                                         
                                         is you want to think about cold code paths as well as out of line code. What you're
                                         
                                         doing with the likely unlikely annotations is you're telling the compiler, this isn't really
                                         
                                         likely, so don't make it bigger, don't make it faster, just make it small. And if it happens,
                                         
    
                                         I'll take the hit on the extra instructions that I might need to execute. So speculate this won't
                                         
                                         happen. You don't want it to hoist
                                         
                                         a bunch of unlikely computation into likely code,
                                         
                                         because that would just increase the number of instructions
                                         
                                         that you execute without actually increasing
                                         
                                         the useful work that's being performed by the CPU.
                                         
                                         Yeah, and that feeds into this general notion
                                         
                                         of work efficiency.
                                         
    
                                         Even in CPU speculation,
                                         
                                         what we're doing is we're trying to fill execution
                                         
                                         units with speculative instructions, but it also burns power to run them, right? And the compiler
                                         
                                         is also reasoning about what to put on the hot path, but it's just reasoning about it statically
                                         
                                         ahead of this execution time, whereas the CPU is doing it at runtime, right? And it could be useful.
                                         
                                         It could end up burning power and taking
                                         
                                         up lots of micro architectural slots in the data path that could be used for useful computation.
                                         
                                         So if you had an oracle, it would tell you, oh, you know, you can speculate on this and it'll
                                         
    
                                         definitely help you make forward progress, but we don't have those oracles. Those are the
                                         
                                         idealizations. So if I wanted to increase my instructions per clock without getting more
                                         
                                         useful work done, I'd do it another way. It's actually fun to figure out how to write these
                                         
                                         power viruses for an ISA. So how do you max out the power consumption? And in order to do something
                                         
                                         like that, you'd really need to write assembly to very precisely control the machine. But ultimately,
                                         
                                         you probably want to optimize code size differently for a real program if something is very unlikely, right? Imagine you were writing the assembly again. You would put it on
                                         
                                         a cold path. You'd generate smaller code, as you said. You'd force extra spills to happen when
                                         
                                         you entered that path. So I'd keep my registers in place for the common case. And if that actually
                                         
    
                                         happened, then I'd go spill a bunch of stuff to the stack and patch it up and make it look the
                                         
                                         way I wanted to. So I let the straight line code in the common case have all the registers.
                                         
                                         So there's also a trade-off here of we're going to be grossing up code and we could be wrong. So
                                         
                                         sometimes you might hear in code review, if you're trying to write general purpose code that had a
                                         
                                         lot of these likely or unlikely annotations, that it might be premature optimization and you could
                                         
                                         actually be wrong about your annotation.
                                         
                                         So this counterbalance here is that you're grossing up
                                         
                                         your code with things you suspect to be the case,
                                         
    
                                         but we all have this limited foresight.
                                         
                                         So ideally we wouldn't need to say anything
                                         
                                         that would turn out to be wrong.
                                         
                                         And instead we could just key off the evidence
                                         
                                         that we observed instead of using our intuition here.
                                         
                                         In some cases, when you are wrong,
                                         
                                         you have to clean up the speculative things that you did.
                                         
                                         Say if you changed state that wasn't in a register,
                                         
    
                                         I guess that's at the CPU level
                                         
                                         more than at the compiler level,
                                         
                                         but it's something you have to do when the CPU is wrong,
                                         
                                         when it speculates incorrectly.
                                         
                                         Say we were logging things as like verbose logging,
                                         
                                         and that only got activated
                                         
                                         when the verbose logging mode was on.
                                         
                                         We could actually put into our standard functions or logging macros, the likely or unlikely annotations. In this case,
                                         
    
                                         it's unlikely that I'm doing verbose logging in production. And that helps the compiler know which
                                         
                                         slow paths are unlikely to be executed throughout the code base, and it can push those into cold
                                         
                                         code. Right. And when we talk about that, like, you know, you're trying to figure out, hey, you put
                                         
                                         these annotations, are they ever true or not? At some point you realize, well, shouldn't
                                         
                                         just profile guided optimizations in the compiler take care of all of this, right? Is it a magic
                                         
                                         wand that will solve all of it? And in theory, yes, it can totally solve this issue. You wouldn't
                                         
                                         need to annotate anything. But realistically, what I've seen is very few projects turn on PGO
                                         
                                         correctly, right? So some are gonna maybe just collect data once
                                         
    
                                         or every few months or whatever,
                                         
                                         and then use those data they've collected
                                         
                                         for years down the line, right?
                                         
                                         So they just update it every so often and just whatever.
                                         
                                         They use whatever previous data they had.
                                         
                                         They keep changing the code
                                         
                                         and the data aren't really connected
                                         
                                         with what the new program does anymore.
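For a rough picture of what "turning PGO on correctly" involves, here's clang's instrumentation-based flow as a sketch -- the flags are clang/LLVM's, the file and binary names are made up for illustration, and the hard part is making the workload in the second step actually representative:

```shell
# Build an instrumented binary that records edge/branch counts.
clang -O2 -fprofile-instr-generate app.c -o app-instrumented
# Run it on a representative workload; raw profile data is written out.
LLVM_PROFILE_FILE=app.profraw ./app-instrumented representative-input
# Merge and index the raw profiles into the form the compiler consumes.
llvm-profdata merge -output=app.profdata app.profraw
# Rebuild with the profile driving block layout and inlining decisions.
clang -O2 -fprofile-instr-use=app.profdata app.c -o app
```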
                                         
    
                                         Whereas others will just kind of collect PGO data
                                         
                                         on all the unit tests, right? And the thing with unit tests is, by definition, they test the
                                         
                                         odd cases more than the common ones. And so sure, you got profile data on your unit tests, and it's
                                         
                                         cool because your unit tests might run faster once you compile them, but you've optimized for the
                                         
                                         wrong thing because I bet most of your users aren't running your unit tests, right? They're running, you know, something else. They're not running into
                                         
                                         all the corner cases unless you have really weird users. So in fact, with actual good profiles,
                                         
                                         you wouldn't need these likely and unlikely annotations. So you'd have to set up a really
                                         
                                         good pipeline to run representative samples of what it is that your program does to use PGO
                                         
    
                                         correctly. But at the same time, you look at this in programs where performance really matters,
                                         
                                         where performance is a feature, right? Not just something you kind of sprinkle on at the end.
                                         
                                         It's actually kind of good documentation to say this is likely, this is not likely.
                                         
                                         But it's only good documentation, as any documentation goes, when you get it right.
                                         
                                         If you get it wrong, it's not very good documentation.
                                         
                                         So as you were saying, it'd be cool to have
                                         
                                         PGO tell you if your annotations are wrong.
                                         
    
                                         So if I had these annotations inside my code
                                         
                                         and I had a really good PGO instrumentation,
                                         
                                         PGO could come in and tell me, OK, well, here's
                                         
                                         what I figured out about your program.
                                         
                                         And by the way, you were completely wrong over there. Or it
                                         
                                         could say, by the way, I never executed this branch that you told me would be likely or unlikely.
                                         
                                         And that's interesting because it tells you, you don't have good PGO coverage, right? Not just you
                                         
                                         were wrong with the coverage you provided me, but you didn't even try to cover this, right? So that
                                         
    
                                         would be a super cool insight into the code, where the programmer says what they think is important and they get a reality check on it, right? So as a documentation thing, when performance is a feature
                                         
                                         in your program, that'd be super cool. Yeah. And I will say one really neat thing that they do in
                                         
                                         hardware design is they have this idea of cover points. So basically ahead of time, you think
                                         
                                         about what scenarios are important for you to cover in terms of your verification of the piece of hardware that you're making. But you could also have these performance-oriented cover
                                         
                                         points, right, where you describe conditions that you definitely want to be collecting profiles for
                                         
                                         because you think they're important to see as your benchmarks run over your program,
                                         
                                         right? So there's this interesting idea that you could do performance-based cover points.
                                         
                                         And so I also remember that Firefox got
                                         
    
                                         some good mileage out of its profile-guided optimization builds. And there's another system
                                         
                                         that I thought was really cool: it's called GWP, which is short for Google-Wide
                                         
                                         Profiling. And it has this system called AutoFDO. And there's a paper on this, which is a really
                                         
                                         neat paper, where it keeps a continuous pipeline going, collecting information from applications that are
                                         
                                         running throughout Google. And then it actually describes in the paper how performance drops off
                                         
                                         as a function of time using stale data. So if you capture the data for the symbols in a binary on a
                                         
                                         given day, and then you wait a couple of weeks later, the code is constantly changing and evolving.
                                         
                                         So the effectiveness of that data you collected drops off over time. And it actually shows how effective it remains as a function of time, which is really
                                         
    
                                         cool. The GWP thing is really neat because it's running on a live system and collecting data on
                                         
                                         a system that's being continuously changed as well. So that's a really neat paper to read.
                                         
                                         And another one I heard about, I don't think I've ever read about this, but I've heard a few people come in and say,
                                         
                                         well, actually, when I use these likely unlikely annotations,
                                         
                                         I completely lie to the compiler because I,
                                         
                                         the person who says this, have an exceptional system.
                                         
                                         And they say, well, I wanna optimize
                                         
                                         for the unlikely thing, right?
                                         
    
                                         So because that's the one that matters.
                                         
                                         So I'm gonna tell the compiler,
                                         
                                         this is likely when actually,
                                         
                                         that's the completely unlikely thing that never happens.
                                         
                                         But if it happens, I want the compiler to make it extremely fast.
                                         
                                         And I've heard the argument from a handful of people who either do something like high-frequency trading
                                         
                                         or folks who do kind of safety systems where, say, like some emergency brake or firing missiles or whatever
                                         
                                         is a thing that you want to do really quickly.
                                         
    
                                         After taking a nap, of course.
                                         
                                         Right. After taking a nap. I thought that was interesting kind of mind bend in the way
                                         
                                         people approach stuff, right? Because the code that ought to never execute,
                                         
                                         well, if it does execute, it ought to execute really fast when it unfortunately
                                         
                                         has to execute. Or, you know, if you're doing HFT, when unfortunately
                                         
                                         you're trying to make money really fast
                                         
                                         or whatever it is that they do.
                                         
                                         It's interesting to think when we go back to PGO,
                                         
    
                                         well, if you use PGO and you had tests
                                         
                                         that biased towards the unlikely thing
                                         
                                         that you want to make fast,
                                         
                                         then maybe PGO would be a better option there.
                                         
                                         I don't know.
                                         
                                         Again, speaking about HFT,
                                         
                                         I don't know if options are a thing
                                         
                                         we should be talking about, but whatever.
                                         
    
                                         It's good to have options.
                                         
                                         Right. You just have to call on them sometimes.
                                         
                                         I think it's also interesting to go back to this, I guess, the theme of what are the simplest possible solutions in the space here?
                                         
                                         So our episode title is this unlikely annotation.
                                         
                                         And it's potentially a relatively simple solution compared to something involved like setting up an FDO pipeline or something like this.
                                         
                                         And I'm sorry, did we define FDO?
                                         
                                         Feedback directed optimization.
                                         
                                         I forget if we did.
                                         
    
                                         So there's a spectrum of feedback directed optimization, which goes from offline collection of profiles to, you know, just in time observation of your code.
                                         
                                         And it all lives in this kind of feedback directed spectrum. So there's something
                                         
                                         on the spectrum, which is the programmer has something in their head and they type it into
                                         
                                         the program because they believe that it's the case. That's somewhere on this feedback directed
                                         
                                         spectrum. It's actually the programmer's thinking intent. But we go back to what if you have no
                                         
                                         branch predictor at all in your hardware and you've got a conditional branch in your CPU front end,
                                         
                                         what are the simplest things you can do?
                                         
                                         So one is to foist it onto the C programmer and you know,
                                         
    
                                         it's all about balance of probabilities.
                                         
                                         You could do simple things like just predict randomly which way the branch is
                                         
                                         going to go.
                                         
                                         And if you had a uniform distribution of taken and not taken in the source
                                         
                                         program, you'd have a 50% chance of getting it right.
                                         
                                         So that makes a nice mental baseline for comparison.
                                         
                                         But I think it's pretty clear
                                         
                                         that you could do even better than that,
                                         
    
                                         whether foisting it onto the user
                                         
                                         or randomly predicting whether it's taken
                                         
                                         or not taken as your two options.
                                         
                                         Yeah, absolutely.
                                         
                                         And really what we're trying to get to
                                         
                                         when we talk about branches
                                         
                                         is that hardware isn't instantaneous, right?
                                         
                                         Like instructions
                                         
    
                                         take a little while to execute. And the problem is you have a branch and you don't know which way to
                                         
                                         go, right? The CPU is going to see a branch and then be like, oh, where should I go now? And
                                         
                                         that's why these likely and unlikely annotations are useful, because they help you say
                                         
                                         which way it might go and then try to bias the CPU. The compiler says, hey, let's help the CPU
                                         
                                         figure out where to go.
                                         
                                         And if you go back to say like the MIPS days
                                         
                                         and the Hennessy and Patterson book,
                                         
                                         they added delay slots to MIPS to do something like that, right?
                                         
    
                                         So branches have what's called delay slots,
                                         
                                         which say: while you figure out where the branch is going to go,
                                         
                                         well, maybe I have a few more instructions that you can munch on, right?
                                         
                                         So right after the branch, there's, say, an extra instruction that you can execute as an extra thing.
                                         
                                         So you know you're going to take a branch.
                                         
                                         You're going to be like, oh, hey, I should figure out where this branch is going to go.
                                         
                                         But hey, while I'm figuring out this branch,
                                         
    
                                         let me eat those extra instructions after the branch.
                                         
                                         And delay slots are a way to hide the latency
                                         
                                         to send the conditional predicate to the instruction fetch unit.
                                         
                                         And it effectively says
                                         
                                         like there's, you know, some instructions that execute unconditionally associated with every
                                         
                                         conditional branch, because otherwise, if I don't do that, I'm going to have dead space, right? So
                                         
                                         it's kind of acknowledging like, I'm going to need a few instructions to do stuff afterwards.
                                         
                                         It's baking it into the ISA. Like I'm going to need one slot
                                         
    
                                         or two slots or three slots of work to do after you tell me that you're going to do a branch.
                                         
                                         It always looks really strange when you see an instruction in the program under the branch,
                                         
                                         but conceptually you can think of it as being kind of paired up with the branch, right? It's
                                         
                                         branch and then also do these other things before branching. Right. And so just as a reminder,
                                         
                                         so if we're taking our conditional branch, then we're going to the target. So like you said, the instruction fetcher
                                         
                                         has to kind of Heisenfetch from these two possible locations. One is that the branch is not taken, so we
                                         
                                         just fall through to the next instruction, or we have to redirect where we're fetching from to the
                                         
                                         other location. And there's a little memory in your system that it's trying to read the next
                                         
    
                                         instruction from. And in the simplest possible case, it just has one port for you to go fetch an instruction out.
                                         
                                         And so you have to figure out the address and then go fetch that address from this little memory.
                                         
                                         And so now CPUs have this fundamental problem when they're trying to be as simple as possible,
                                         
                                         that they live in this poised state where you're fetching from either the next instruction, or that thing arbitrarily far away, or misaligned, or switching processor modes, or various forms of things that can happen when
                                         
                                         you jump. But when you collapse this wave function, it's only going to do one of those.
                                         
                                         So, but then here's the problem for modern CPUs. So we kind of talked about, okay, when you're
                                         
                                         going for max simplicity, what are you going to do? In a modern CPU, the branches could
                                         
                                         conceptually make you seize up and wait
                                         
    
                                         until you're sure in the look before you leap kind of style. And then you'd look for something
                                         
                                         like delay slots. But alternatively, we could do the, it's easier to ask for forgiveness than
                                         
                                         permission thing. And then we just keep running from there to wherever we think we're going to
                                         
                                         branch to, and use special retirement hardware to prevent those results from landing in
                                         
                                         committed architectural state.
                                         
                                         So the kind of thing you would see in a debugger if you stopped it on a given program counter
                                         
                                         or if you were stepping instructions.
                                         
                                         That's your architecturally visible and committed state.
                                         
    
                                         So effectively, there's a little bit of microtransactionality happening inside where you have these faculties
                                         
                                         inside of the CPU pipeline that say, all right,
                                         
                                         I did this, but I'm not sure it's definitely what has to happen. I did it because I think it's
                                         
                                         probably going to have to happen, but I'm not sure. And so, you know, those have to be committed
                                         
                                         via the retirement hardware. So we get tons of performance out of the instruction stream on
                                         
                                         modern CPUs by speculating past branches that would otherwise paralyze us in place. And we run the
                                         
                                         stuff we expect we're going to have to run. So this is the classic notion of ILP mining. So ILP
                                         
                                         stands for instruction level parallelism. So we're churning through the instruction stream and
                                         
    
                                         finding things that we can run concurrently on our execution paths along branches we expect we're
                                         
                                         going to take in order to reduce the overall execution
                                         
                                         time witnessed by the user for running the overall program. And, you know, a classic conundrum is as
                                         
                                         your CPU pipeline gets really deep, this kind of becomes very difficult or doesn't work at some
                                         
                                         point. So your pipeline depth means you have potentially a bunch of branches in flight
                                         
                                         simultaneously. So let's say just for random
                                         
                                         number purposes, let's say my pipeline depth was 40 and my basic block size was five. I'd have eight
                                         
                                         branches in flight at any given time. And so even if I can predict each one with 90% accuracy, now I have
                                         
                                         a 0.9-to-the-eighth chance of predicting all those branches properly. So I'll have a bunch
                                         
                                         of wasted work going back to that notion of wasted work
                                         
                                         that we talked about before.
                                         
                                         So there's some number of branches
                                         
                                         you can do before you're back at a 50-50 probability
                                         
                                         of some misprediction.
                                         
                                         Yeah, and you mentioned retirement hardware.
                                         
                                         And what is that even?
                                         
    
                                         Is that where compiler engineers go when they're done doing
                                         
                                         compilers or something?
                                         
                                         Well, not really. Basically, you know, you have a bunch of extra registers and caches and
                                         
                                         whatever else inside the hardware, and they're not supposed to be visible from the ISA.
                                         
                                         So these extra registers, you know, say you're on ARMv8 or whatever: you have, say,
                                         
                                         32-ish registers, 31 really, and the other ones in the actual hardware aren't visible.
                                         
                                         You might have twice as many or something more. They're not supposed to be visible.
                                         
                                         We've learned in the last three, four years that they are visible through timing,
                                         
                                         but you know what I mean. Let's ignore that the timing is a thing. But the hardware
                                         
                                         keeps track of all these extra registers and caches and whatever to remember which ones
                                         
                                         are speculative, right? Which ones are registers that hold values that haven't really been
                                         
                                         fully committed, and which ones hold what's usually called architectural state.
                                         
                                         And what it's going to do, it's going to throw away missed speculation and roll back to the
                                         
                                         known good state.
                                         
                                         So say, you know, say you're on ARM or whatever, and you have 31 registers, there might
                                         
    
                                         be, you know, twice as many. And it's not a set, like which one is real and which one isn't. It's,
                                         
                                         you know, through register renaming, you kind of figure out which ones are the quote unquote real
                                         
                                         registers right now or not. And it's the same thing with caches, right? Like caches are just
                                         
                                         this extra level of caching that allow you to have that speculation. And basically retirement just tracks
                                         
                                         what the point of no return is
                                         
                                         in the reorderable instruction sequence.
                                         
                                         I'm really oversimplifying here.
                                         
                                         And again, like this is another one
                                         
    
                                         we could do a slew of new episodes on the topic,
                                         
                                         but it's really interesting to think about this
                                         
                                         when we talk about how
                                         
                                         from a programming language perspective,
                                         
                                         you're trying
                                         
                                         to bias what the compiler is going to do to generate specific assembly instructions to then
                                         
                                         have the hardware run them and speculate on the things that are going to happen. There's really
                                         
                                         layers of speculations in there and layers of abstraction. Even when you talk about assembly,
                                         
    
                                         there's so much abstraction in it. Yeah, it's amazing that assembly is really like a little interpreted language that gets unrolled into these micro instructions under the hood.
                                         
                                         That's always a pretty fascinating thing.
                                         
                                         And I think the point of no return is interesting because it shows the transactionality that we were talking about earlier. The point of no return usually says an exception cannot happen beyond this point; like,
                                         
                                         everything up to this point has been committed to architectural state, I know that the machine definitely got here.
                                         
                                         So it's kind of the commit point that's running ahead through the program. And so there's also
                                         
                                         a bunch of different kinds of machines that are cited through the literature, and there is kind
                                         
                                         of a notion of brawny versus wimpy cores we're talking about here, with things
                                         
                                         like, you know, branch delay slots and trying to keep hardware as simple as possible versus big
                                         
    
                                         reordering hardware and speculation and prediction and things but even in the modern era for embedded
                                         
                                         systems or systems that have less heavyweight hardware per thread, having the knowledge that
                                         
                                         programmers have in their head, or that, you know, profiling determined,
                                         
                                         can be very helpful. So things like GPUs, for example, they amortize their control hardware
                                         
                                         over some number of threads that they hope will be running in tandem. So they're kind of trying to
                                         
                                         get as much information about whether the threads will run in unison as possible.
                                         
                                         So in order to be Turing complete, which is kind of our baseline level of expressiveness
                                         
                                         that we expect out of machines today, you need data to be able to influence control somehow.
                                         
    
                                         So basically you need to be able to do a while loop or an if or something like this, right?
                                         
                                         While loop effectively. So most systems will have to consider how the data is going to interact with
                                         
                                         the control side of the machine at some level. And to go back to the brawny versus wimpy core thing, there's some amazing hardware
                                         
                                         that's been landed or at least been proposed for beefy machines.
                                         
                                         So one example that I really loved was this idea of perceptrons.
                                         
                                         And that was the early use of neural networks for branch prediction.
                                         
                                         So ultimately, you just have to predict one bit given some history of what branches have done. And you have this little
                                         
                                         neural network doing a matrix multiply that has to determine this. And of course, there's all
                                         
    
                                         these notions that come in of the size and how power hungry it is and the importance of latency.
                                         
                                         We need the answer quickly to which way this branch is going to go so that we know where to
                                         
                                         start fetching from. It can't take too long. Otherwise, we're trying to figure out how to hide the latency of the thing that determines
                                         
                                         how to hide the latency.
                                         
                                         And so that's kind of, it kind of moves towards what are state of the art branch predictors,
                                         
                                         which we'll have to talk about in some future episode.
                                         
                                         And effectively what we're doing here is we're forming things that to me look like VLIW bundles.
                                         
                                         So you kind of take code that's in a big sequence, and then you figure out which ones are really going to run, and then you turn them horizontal to run as micro operations against
                                         
    
                                         all these data paths that you have. And you do that on the fly using the natural size of basic
                                         
                                         blocks that exist and the reservation station scoreboarding algorithms that exist inside of
                                         
                                         the machine. And so there's also interesting things like how do you form regions out of these
                                         
                                         little traces of code that get executed in the literature. That ties into things like the reorder
                                         
                                         buffer size. So CPUs have this big set of instructions that they can reorder to execute.
                                         
                                         Reorder buffering and region formation are related concepts. There's also challenges of dispatching
                                         
                                         out of a really big reorder buffer. And there's things like the Apple M1 silicon, which has been pushing the limits here. And going back to historical papers,
                                         
                                         there's cool things like the trace cache in the Pentium 4 that really, I think, shows the spectrum
                                         
    
                                         between the approaches, because that was actually a hardware structure that was storing micro-op
                                         
                                         decoded tracelets that it strung together using all hardware primitives, which is pretty cool.
                                         
                                         Yeah. And there's so much more to talk about here. There's so many other things, for example,
                                         
                                         you know, vector predicated masks, SIMT, vector threading, Cray style run ahead, and, you know,
                                         
                                         branch target prediction for indirect branches. How does branch prediction, you know, take address
                                         
                                         bits and track the history? What's the aliasing in the subset of bits that you use to track branches?
                                         
                                         How do you use local and global history tables in branch prediction schemes?
                                         
                                         And so much more, like hybrid global local branch predictor mixes,
                                         
    
                                         virtuals as the ultimate control structure and relying on de-virtualization.
                                         
                                         So how do you bias the virtual dispatch with de-virtualization?
                                         
                                         So there's a lot
                                         
                                         that we haven't covered here. And again, so many more episodes to go. But really, we've been talking
                                         
                                         about __builtin_expect, and we haven't really talked about __builtin_assume, right? Because
                                         
                                         __builtin_expect is just saying, yeah, expect this thing, but this other thing might happen as well.
                                         
                                         If you talk about __builtin_assume, well, it's really saying, assume that this is going to
                                         
                                         happen. If it doesn't happen, then all bets are off, just do whatever you want with my program, I don't really care,
                                         
                                         it's totally undefined behavior if that happens. And __builtin_assume is another interesting thing
                                         
                                         to use in there, right? There's certain programs where people have assert in one
                                         
                                         compilation mode and then assume in the other one, and I've heard kind of extreme wars about whether that's a valid thing
                                         
                                         to do or not, depending on what the background of the programmer is. So it's kind of an interesting
                                         
                                         trade-off to do there. Expect and assume are definitely kind of hot button topics for different
                                         
                                         people. Yeah. And it's interesting to say, hey, compiler, you can really take this as a given.
                                         
                                         And then of course things can be wrong. And so then the question is how
                                         
                                         wrong do things go when one of your axioms is untrue. It's a funny, funny idea.
                                         
    
                                         And so at the end of the day, I think to kind of summarize a lot of this,
                                         
                                         scalar code really wants to have this notion of control inside where it's kind of moving around
                                         
                                         the instruction that's being executed, changing the program counter to point at different places inside of your program.
                                         
                                         But we really only have these crude ways of asking for things to happen.
                                         
                                         And so usually we hope the compiler will have some way of doing the right thing, but often
                                         
                                         it has no way of knowing what the right thing is at compile time, given just the information
                                         
                                         presented to it in the program at large.
                                         
                                         Yeah, so I guess the compiler just has to expect the unexpected.
                                         
    
                                         Yeah, like if somebody was trying to predict how often our podcasts come out,
                                         
                                         that's basically impossible.
                                         
                                         That's true.
                                         
                                         I mean, unless they have a side channel,
                                         
                                         it would be a really good source of entropy for random number generators
                                         
                                         or something like that.
                                         
                                         Nice.
                                         
                                         All right, well, that's our episode.
                                         
    
                                         So thanks for hanging out, Chris.
                                         
                                         And I hope I won't make predictions,
                                         
                                         but I think we'll do another episode in the near future.
                                         
                                         I expect that we will.
                                         
                                         Great.
                                         
                                         Thanks.
                                         
                                         Later.
                                         
