TLB Hit 💥 - Episode 3: __builtin_expect(!!(x), 0)
Episode Date: April 19, 2021

* What is it we know statically?
* What's effectively discoverable only at runtime?
* How do we tell "the machine" (compiler and/or hardware):
  * Things we *know* to be true...
  * Things we *expect* to ...be true...
  * Things we *expect* to be true but *want to do something about* when it's not...
  * Things we have no idea about!
* How do we collect that information that we have no idea about?
* What happens if we're wrong? What if the workload is inherently irregular?
  * In the limit, random noise could drive our control decisions!
* We talked a bit about precise vs imprecise exceptions before and automatic reordering, and we've mentioned vector machines and auto-vectorization.
  * All falls into the broader theme here, but we're always questioning what we can actually cover in an hour...
  * We'll try to give it a go for a subset of these things!
  * Time is often the limiting factor.
* The episode title is the thing that we often macro define as `#define UNLIKELY` (see the sketch below)
  * In C/C++ code you might say "this control condition is unlikely", and say `if (UNLIKELY(condition))`
  * In C++20 there was added these `[[likely]]` and `[[unlikely]]` annotations that do the same thing, but with more square brackets, so clearly better!
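For reference, a minimal sketch of what that macro usually looks like (exact macro names vary by codebase, and `first_byte` is just an illustrative function):

```cpp
// A common shape for the macro in the episode title. The !!(x) normalizes any
// scalar or pointer value to 0 or 1, and the second argument to
// __builtin_expect is the value we expect the expression to take.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Illustrative use: the null check is the rare error path.
int first_byte(const char *p) {
  if (UNLIKELY(p == nullptr)) {
    return -1;
  }
  return p[0];
}
```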
Transcript
Hey Chris, it's been a while since we've recorded a podcast, so how about we talk about something
like __builtin_expect, bang bang x, comma zero. Yep, that sounds good. And standard disclaimer
at the start of our podcasts, we only know so much. We'll try not to say things that are wrong
and we'll follow up with errata corrections and clarifications in the show notes as we find out
about things. And we do the lifelong learners thing. We're always trying to push the boundaries
of things that we know about
and have conversations about what's interesting
to develop and hone our understanding.
All right.
We kind of have a broad theme here.
And this is really a first episode
in a really broad series of, you know,
what is it that we know statically?
What's effectively discoverable only at runtime?
And how do we tell the compiler
and/or the hardware: things we know to be true, things that we expect to be true, things that we
expect to be true but want to do something about if it's not, and things that we have no idea about?
So how do we collect that information that we have really no idea about? And what happens if we're
wrong? What if the workload is just inherently irregular?
And, you know, in the limit, imagine like random noise,
it could drive our control decisions.
So, you know, we talked a bit about precise versus imprecise exceptions before
and, you know, automatic reordering.
And I think we've mentioned vector machines and auto vectorization.
And so all of that should fall into the broader theme here.
But, you know, we're always really questioning what you can actually cover in an hour. Let's try to give
it a go at least for a subset of these things. Yeah. The time is often the limiting factor.
So as our lead-in into the world of discussion, the episode title here is the thing that we macro-define
as UNLIKELY. So in C or C++ code, you might say, ah, this control condition is unlikely, like if
you do if some unlikely condition. And in C++20, there was added these double bracket likely and
double bracket unlikely annotations that do the same thing, but with more square brackets. So it's
clearly better. Right. It's really better. The idea is you want to be able to annotate your code and
say something's likely or unlikely, but what does it even make sense to annotate? Well, syntactically in C++ with these new annotations,
you can apply those in a few places, but really what you're trying to do is you're trying to
affect the assembly that you would get out of your compiler, right? So from the C++,
your compiler does stuff, and then the assembly is what you're trying to affect with likely or
unlikely. But we'd probably want to bias something like the test instructions, the conditional
instructions, or something like that. But really in C++, your code doesn't look like those
instructions that you have in assembly. It's really not what your compiler IR really looks
like either. Would it make sense to apply these annotations to say if, else if, and else statements
only? Or maybe to switch cases as well
or should they be applicable to any boolean at all, or even to any temporary boolean-like value? Right, so
if you have like some temporary value that you're computing and passing that to a function directly
without saving it into an explicit bool variable, maybe you want to say that this is likely or
unlikely. And really, what do they even mean in parallel execution? Say you have like some SIMD
stuff, so say you have like four values. What does it mean to say it's likely? Does it mean all four are likely?
I don't know. And really what C++ decided was that those attributes can be applied in C++20
to any non-declaration statement as well as labels. So we'll try to dig into that a bit today.
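To make that concrete, here's a small illustrative sketch of the places the C++20 attributes can syntactically go: on statements and on labels, including case labels.

```cpp
// Illustrative only: [[likely]] / [[unlikely]] on statements and labels (C++20).
int classify(int x) {
  if (x < 0) [[unlikely]] {      // on the statement controlled by the if
    return -1;
  }

  switch (x % 3) {
    [[likely]] case 0:           // on a label
      return 10;
    case 1:
      [[unlikely]] return 11;    // on an ordinary (non-declaration) statement
    default:
      return 12;
  }
}
```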
Yeah. And I think it's interesting. We go through this mental model exercise sometimes of translating C++ into C as if we were the classic Cfront compiler, or C into assembly
as if we were kind of a naive or straightforward C compiler. And so what would these annotations
turn into in the resulting assembly is kind of what it comes back to. If you were writing assembly
by hand, you might say, oh, this is my
common case condition. How do I make this as fast as possible? So somehow it's just trying to hint
that attribute to the compiler, like this is what I'm going for compiler. Right, right, right. We're
really looking at like what the likelihood is of something happening and how often do we expect
something to be true? The documentation, if you look at the __builtin_expect documentation in GCC, it says that the compiler is going to assume that it'll be like
90% accurate. Who knows where 90% comes from? I don't know. Someone pulled that out of somewhere
and decided 90% is what it means. So what that means is in a loop, it'll affect unrolling of
the loop, right? So the loop will assume like, hey, one time out of 10, the loop will exit.
But if you want to have more fine grain control, better granularity, you can use some other
intrinsic, such as __builtin_expect_with_probability. And that'll take the same parameters as __builtin_expect,
but you can also give it a double value in the range zero to one. It'll allow you to have
finer-grained control over the probability that something will be true or not. And what's funny is I played with it in GCC
and it lets you pass in interesting values
like minus zero as well as a NaN or whatever else.
It's kind of funny,
whereas Clang is really strict about zero one,
which I find kind of disappointing.
Clang doesn't like fun, I guess.
But really at the end of the day,
the annotation is trying to allow you
to put a probability on the control transfer
that wasn't generally available in the language syntax.
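The finer-grained intrinsic mentioned there is GCC's `__builtin_expect_with_probability` (also supported by recent Clang). A minimal sketch, with a made-up 0.75 just to show the shape:

```cpp
// Same shape as __builtin_expect, plus a probability in the range [0.0, 1.0].
long mostly_nonzero(long x) {
  // "x is nonzero about 75% of the time" (the 0.75 is only an example value).
  if (__builtin_expect_with_probability(x != 0, 1, 0.75)) {
    return x * 2;
  }
  return 0;
}
```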
Right.
And I think this ties back into a broader question of what do annotations even mean?
So the big question that I always have about annotations is what does it mean for your
program semantically?
So if it's kind of hints that the compiler is free to do something about or not do anything
about, it's free to discard them.
That's kind of
how you end up with a register or inline keyword that in the modern era does basically nothing.
Really, we're trying to provide value to the user, but we're also not trying to violate the principle
of least astonishment, which is always walking this little tightrope. So when the person upgrades
their compiler and it doesn't consider something the same way as it used to, maybe your annotations are discarded.
And so you hit this performance cliff for upgrading your compiler.
So that kind of violates the principle of least astonishment we usually think about.
So to be robust in the face of what the annotations are telling you, you'd want to provide some way of reasoning about what must be the case when you put an annotation or what should be the case.
And that goes back to the idea of semantics, right? At the end of the day, it's kind of what
mental framework can I use to reason about this annotation I'm putting? And that's the way you can
help solve for the principle of least astonishment. If that framework exists for reasoning about it,
the programmer or user has a chance of thinking about it properly. And so that mentions a little bit the idea of performance cliffs.
So with scalar compilers,
compilers for scalar C code,
like we're traditionally used to,
the performance cliffs tend to be a little lower
than if you had maybe an accelerator program.
And there's always this interesting question.
So for the machine learning compiler that I worked on,
where slowdowns could cost you several X
in your computation, like 2X or 3X or even more.
You don't want to expose the internals too much because it slows down the progress,
the evolution of your compiler. But you also don't want to astonish the user by
changing the way that you think about these things in a way that would effectively discard
their annotations. So there's a lot of ways that annotations are used in compilers.
There's things like auto vectorization annotations that don't necessarily port equivalently between environments, or pragmas. And there's even cool compiler projects that have taken new approaches, like Halide, which made the physical incarnation of the functional code a first-class citizen called the schedule.
And databases classically have these query planners, which can talk about the mapping of their logical form
into a physical form, like in terms of how they execute.
Yeah, yeah.
Halide is a really cool paper when I read it
and it's really cool.
They got productized, at least, you know,
Google and Adobe are both using a derivative of that.
So it's a really cool project.
Yeah, it's used in a bunch of different things
all over the place.
And it's a really great kernel generator, you know, for making these high performance kernels and trading off the parallelization versus rematerialization. There's always this, I think, Jonathan Ragan-Kelley has this really good presentation about the three-space of things. It's kind of aspect oriented programming, if people have heard of that before, where the challenge is how do I refer to a piece of the program and describe its attributes in a way that composes, but also stays relevant in the face of refactoring?
And that's super hard, right?
Because the compiler is effectively a tool that's automatically refactoring and reworking things for you under the hood. But if I change my program, you know, how do I know what's stable,
even in the face of refactoring and changes in my program?
Right. So why don't we even branch?
Programs often have branches around stuff that don't really make sense to do,
either because we're trying to be more efficient or because it would cause exceptions.
And you can imagine a world where no value causes exceptions.
And then you want to select between different computed
values.
So it's largely a question of efficiency, which one
it is that you want to do, whether you throw out computation or not.
So sometimes you might want to compute two possibilities
eagerly and just select one of them.
And that might be more effective than just branching
around the ones that you do want to pick.
Really, if you think about at the assembly level, pretty much all ISAs have the idea of conditional selection,
right? So instead of having branches in assembly, you can have just a conditional selects
instruction. So for example, x86 has CMOV, ARMv8 has CSEL. And before that, ARM could
conditionalize pretty much every instruction that it had. And even in Thumb, it had IT blocks where
you could conditionalize up to four subsequent instructions after the IT block. So that was kind
of an interesting way to, instead of having just branches, just saying like, do this thing
conditionally or select this thing and move the results there or something like that. And even
RISC-V, it doesn't have CMOV, so you have to branch around everything. And it creates these
really small sequences of instructions where you have to branch around stuff all the time.
So it's kind of an interesting tradeoff that they made there.
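As a concrete example of the select-versus-branch idea: a function shaped like the one below is a classic candidate for a conditional select, and in my experience GCC and Clang at -O2 typically lower it to CMOV on x86-64 or CSEL on AArch64 rather than a branch, though that's ultimately up to the compiler (and on base RISC-V you'd expect a short branch instead).

```cpp
// Ternary selects are easy for compilers to turn into conditional-select
// instructions: both inputs are cheap, so compute both and pick one,
// with no control transfer needed.
int clamp_to_floor(int v, int floor_value) {
  return v < floor_value ? floor_value : v;
}
```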
Yeah, and it's interesting.
I know that they've worked on things like instruction fusion.
So in the front end, you'll kind of detect these little sequences.
I'm not sure the extent to which it'll detect things that are CMOV-looking sequences.
I'd have to go back and look at it, but it's also interesting.
What kind of peepholes can you do in the front end
that basically reconstruct these CMOVs.
Speaking of predication and conditional moves and stuff,
flags registers are often used as funny ways
to communicate properties from the data stream,
the data operations that are going on
to the control side of the machine.
And so you have these flags registers,
where if you do a subtraction and the result is zero, maybe the zero flag will be set. That's effectively a predicate within this
bigger flags register that you have access to versus actually having like explicit predicate
registers. So that's an interesting contrast you see in computer architecture. Flags registers
kind of notoriously can create these weird dependencies because they alias together a
bunch of properties and they kind of create this sideband context that as instructions execute in a thread these
flags are kind of going in and out from individual instructions and they may or may not be used
So at some point we'll have to do a Fun with EFLAGS episode just about all the fun and
interesting things to do with flags registers. Yeah, I bet we can spend more than an hour talking about flags in general.
So I think also this ties back into a general theme of control on von Neumann style machines.
So you can imagine different machines, the ones that we have today.
Maybe you could make a pure data flow machine.
I'm actually not super familiar with the Lisp machines that people had created, but they
may be less von Neumann style with a big backing store that data is being put back against and more
data flow. But as with everything, there's kind of this explore-exploit trade-off. And what we,
as an industry, discovered over time was we could tune the heck out of imperative instructions,
executing little compute instructions against a register set and doing
loads and stores out of a big backing store, or at least a virtually large backing store in this
von Neumann style. You can imagine these machines that operate differently. And sometimes people
talk about this idea of non-von for like non-von Neumann machines. But the kind of machines that
we have, you control what instruction stream feeds the processor by redirecting it around to start
fetching from other places using these branching or jumping instructions, right? And with an
unconditional branch, you say exactly where you go start fetching from. With a conditional branch,
you say, start fetching from this other place, but only if this thing is true. And with an indirect
branch, you say, fetch from this data dependent location that I placed in a register. So that's
very closely tying a full data value from the data stream into the control side of the machine. So in some ways,
the predication versus indirect branches are both cases of the data side of the computation feeding
that control side. But it's also important to note that the machine has more knowledge of the
possibilities when it's got conditional branches, since there's really only two possibilities of where it might go.
Usually the jump amount in a conditional branch is relative to the current program counter,
and the program counter is known to the machine,
so it's really just selecting between those two places for a conditional branch.
Right, and so that takes us to loops, really.
And the classic notion is that most of your program's execution time
is probably spent in the loop somewhere.
Otherwise, what would your program really do?
And, you know, back of the envelope wise, you kind of have like straight line instructions on a gigahertz processor.
And, you know, in one millisecond they can run something like a million instructions, which at around four bytes each is kind of a four megabyte binary.
So it's a good amount of instructions that you're running. So it pays to think about
what a loop actually is, right? So the kind of branching that it does, well, loops usually just
branch backwards. And when you branch forward, you're really just doing that for conditional
things. And when you branch backward, it's probably what's called a loop back edge, right?
So it has a negative delta.
So it goes backwards in memory when you do that.
And usually loops have a trip count or something like that.
And that goes into the flags and maybe it's counting down.
So going towards zero or something like that,
one at a time or whatever.
And for example, imagine you're processing,
say some RGBA value through a loop.
Well, you can predict taken for the back edge always,
but if you do that, since you're doing RGBA, a four-iteration loop,
well, you'll mispredict 25% of the time,
which is actually pretty bad, right?
It's not the type of loop you really wanna have
in your program.
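A toy version of that RGBA example (the function and names are made up): the inner loop trips exactly four times per pixel, so a static "back edge always taken" guess is wrong once per pixel, i.e. roughly 25% of the predictions on that branch.

```cpp
// The inner loop runs 4 times per pixel (R, G, B, A). Predicting its back edge
// as always-taken is right 3 times and wrong once per pixel: ~25% mispredict
// on that branch if nothing smarter is predicting it.
void scale_pixels(unsigned char *pixels, int num_pixels, int scale) {
  for (int p = 0; p < num_pixels; ++p) {
    for (int c = 0; c < 4; ++c) {
      int v = pixels[p * 4 + c] * scale / 256;
      pixels[p * 4 + c] = static_cast<unsigned char>(v);
    }
  }
}
```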
So why would you go with likely and unlikely?
Well, what's even the point of the annotations at all?
Well, what you're trying to do is you're trying to say,
I want to optimize for the common path, really.
The way I see it when I put these likely, unlikely annotations,
I would say, I think this is going to be the common path.
So you want to bias the code generation so that the likely code
is really going straight line and the branches are never taken.
And the reason you want to do that isn't really to help the branch predictor at all. Some ISAs have branch hints inside the instructions
themselves, say like Itanium had that, but I don't think that really matters nowadays. You're not
trying to help the branch predictor. What you're trying to do is you're trying to tell the compiler
that it should generate code to improve the execution. For example, it could lay out the code
so that each instruction fetch will fetch cache lines
that are going to be fully executed.
So as you have branches,
the likely taken branches all kind of follow each other
and you just never take any of the unlikely branches.
So when you fetch a cache line one after the other,
you just fall through to the ones that are taken.
And the other thing you might want to do
is you want to think about cold code paths as well as out of line code. What you're
doing with the likely unlikely annotations is you're telling the compiler, this isn't really
likely, so don't make it bigger, don't make it faster, just make it small. And if it happens,
I'll take the hit on the extra instructions that I might need to execute. So speculate this won't
happen. You don't wanna compile it
or hoist a bunch of unlikely computation into likely code,
because that would just increase the number of instructions
that you execute without actually increasing
the useful work that's being performed by the CPU.
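A hedged sketch of that layout idea (the attribute spellings are the GCC/Clang ones; the names are made up): the rare path is marked cold and kept out of line so the hot loop stays small and straight-line.

```cpp
#include <cstdio>

// Cold, out-of-line handler: optimized for size, not speed.
__attribute__((cold, noinline))
static void report_bad_record(long index) {
  std::fprintf(stderr, "bad record at %ld\n", index);
}

long sum_records(const int *records, long n) {
  long total = 0;
  for (long i = 0; i < n; ++i) {
    if (__builtin_expect(records[i] < 0, 0)) {  // rare: corrupt record
      report_bad_record(i);
      continue;
    }
    total += records[i];  // the hot, straight-line work
  }
  return total;
}
```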
Yeah, and that feeds into this general notion
of work efficiency.
Even in CPU speculation,
what we're doing is we're trying to fill execution
units with speculative instructions, but it also burns power to run them, right? And the compiler
is also reasoning about what to put on the hot path, but it's just reasoning about it statically
ahead of this execution time, whereas the CPU is doing it at runtime, right? And it could be useful.
It could end up burning power and taking
up lots of micro architectural slots in the data path that could be used for useful computation.
So if you had an oracle, it would tell you, oh, you know, you can speculate on this and it'll
definitely help you make forward progress, but we don't have those oracles. Those are the
idealizations. So if I wanted to increase my instructions per clock without getting more
useful work done, I'd do it another way. It's actually fun to figure out how to write these
power viruses for an ISA. So how do you max out the power consumption? And in order to do something
like that, you'd really need to write assembly to very precisely control the machine. But ultimately,
you probably want to optimize code size differently for a real program if something is very unlikely, right? Imagine you were writing the assembly again. You would put it on
a cold path. You'd generate smaller code, as you said. You'd force extra spills to happen when
you entered that path. So I'd keep my registers in place for the common case. And if that actually
happened, then I'd go spill a bunch of stuff to the stack and patch it up and make it look the
way I wanted to. So I let the straight line code in the common case have all the registers.
So there's also a trade-off here of we're going to be grossing up code and we could be wrong. So
sometimes you might hear in code review, if you're trying to write general purpose code that had a
lot of these likely or unlikely annotations, that it might be premature optimization and you could
actually be wrong about your annotation.
So this counterbalance here is that you're grossing up
your code with things you suspect to be the case,
but we all have this limited foresight.
So ideally we wouldn't need to say anything
that would turn out to be wrong.
And instead we could just key off the evidence
that we observed instead of using our intuition here.
In some cases, when you are wrong,
you have to clean up the speculative things that you did.
Say if you change the state that wasn't in a register,
I guess that's at the CPU level
more than at the compiler level,
but it's something you have to do when the CPU is wrong,
it speculates incorrectly.
Say we were logging things as like verbose logging,
and that only got activated
when the verbose logging mode was on.
We could actually put into our standard functions or logging macros, the likely or unlikely annotations. In this case,
it's unlikely that I'm doing verbose logging in production. And that helps the compiler know what
slow paths are unlikely to be executed throughout the code base, and it can push those into cold
code.
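A sketch of that verbose-logging idea (the macro and flag names are invented for illustration): the flag check is annotated as unlikely, so the formatting and I/O can get pushed into cold code everywhere the macro is used.

```cpp
#include <cstdio>

bool g_verbose_logging = false;  // hypothetical global; off in production

#define VLOG(...)                                       \
  do {                                                  \
    if (__builtin_expect(!!(g_verbose_logging), 0)) {   \
      std::fprintf(stderr, __VA_ARGS__);                \
    }                                                   \
  } while (0)

void process_item(int item) {
  VLOG("processing item %d\n", item);  // nearly free when verbose is off
  // ... the actual work ...
}
```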
Right. And when we talk about that, like, you know, you're trying to figure out, hey, you put these annotations, are they ever true or not? At some point you realize, well, shouldn't
just profile guided optimizations in the compiler take care of all of this, right? Is it a magic
wand that will solve all of it? And in theory, yes, it can totally solve this issue. You wouldn't
need to annotate anything. But realistically, what I've seen is very few projects turn on PGO
correctly, right? So some are gonna maybe just collect data once
or every few months or whatever,
and then use those data they've collected
for years down the line, right?
So they just update it every so often and just whatever.
They use whatever previous data they had.
They keep changing the code
and the data aren't really connected
with what the new program does anymore.
Whereas others will just kind of collect PGO data
on all the unit tests, right? And the thing with unit tests is by definition, they test the
odd cases more than the common ones. And so sure, you got profile data on your unit tests, and it's
cool because your unit tests might run faster once you compile them, but you've optimized for the
wrong thing because I bet most of your users aren't running your unit tests, right? They're running, you know, something else. They're not running into
all the corner cases unless you have really weird users. So in fact, with actual good profiles,
you wouldn't need these likely and unlikely annotations. So you'd have to set up a really
good pipeline to run representative samples of what it is that your program does to use PGO
correctly. But at the same time, you look at this in programs where performance really matters,
where performance is a feature, right? Not just something you kind of sprinkle on at the end.
It's actually kind of good documentation to say this is likely, this is not likely.
But it's only good documentation, as any documentation goes, when you get it right.
If you get it wrong, it's not very good documentation.
So as you were saying, it'd be cool to have
PGO tell you if your annotations are wrong.
So if I had these annotations inside my code
and I had a really good PGO instrumentation,
PGO could come in and tell me, OK, well, here's
what I figured out about your program.
And by the way, you were completely wrong over there. Or it
could say, by the way, I never executed this branch that you told me would be likely or unlikely.
And that's interesting because it tells you, you don't have good PGO coverage, right? Not just you
were wrong with the coverage you provided me, but you didn't even try to cover this, right? So that'd
be a super cool insight into the code where the programmer says what they think is important and they get a reality check on it, right? So as a documentation thing, when performance is a feature
in your program, that'd be super cool. Yeah. And I will say one really neat thing that they do in
hardware design is they have this idea of cover points. So basically ahead of time, you think
about what scenarios are important for you to cover in terms of your verification of the piece of hardware that you're making. But you could also have these performance-oriented cover
points, right, where you describe conditions that you definitely want to be collecting profiles for
because you think they're important to see in your benchmarks and them running over your program,
right? So there's this interesting idea that you could do performance-based cover points.
And so I also remember that Firefox got
some good mileage out of its profile-guided optimization builds. And there's another system
that I thought was really cool, which is, it's called GWP, which is short for Google-Wide
Profiling. And it has the system called AutoFDO. And there's a paper on this, which is a really
neat paper, where it keeps a continuous pipeline going, collecting information from applications that are
running throughout Google. And then it actually describes in the paper how performance drops off
as a function of time using stale data. So if you capture the data for the symbols in a binary on a
given day, and then you wait a couple of weeks later, the code is constantly changing and evolving.
So the effectiveness of that data you collected drops off over time. And it actually shows how effective it remains as a function of time, which is really
cool. The GWP thing is really neat because it's running on a live system and collecting data on
a system that's being continuously changed as well. So that's a really neat paper to read.
And another one I heard about, I don't think I've ever read about this, but I've heard a few people come in and say,
well, actually, when I use these likely/unlikely annotations, I completely lie to the compiler because I,
I completely lie to the compiler because I,
the person who says this, have an exceptional system.
And they say, well, I wanna optimize
for the unlikely thing, right?
So because that's the one that matters.
So I'm gonna tell the compiler,
this is likely when actually,
that's the completely unlikely thing that never happens.
But if it happens, I want the compiler to make it extremely fast.
And I've heard the argument from a handful of people who either do something like high-frequency trading
or folks who do kind of safety systems where, say, like some emergency brake or firing missiles or whatever
is a thing that you want to do really quickly.
After taking a nap, of course.
Right. After taking a nap. I thought that was an interesting kind of mind bend in the way
people approach stuff, right? Because the code that ought to never execute,
well, when it does execute, if it does execute, it ought to execute really fast when unfortunately
has to execute. Or, you know, if you're doing HFT when unfortunately
you're trying to make money really fast
or whatever it is that they do.
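To make that "lie to the compiler" trick concrete, a hypothetical sketch (the handler functions are invented): the emergency path is statistically rare, but it's the one whose latency matters, so it's annotated as the likely branch anyway.

```cpp
void fire_emergency_brake() { /* hypothetical: latency-critical action */ }
void normal_tick()          { /* hypothetical: the actual common case  */ }

void control_loop_step(bool emergency_detected) {
  if (emergency_detected) [[likely]] {  // rare in practice, critical when true
    fire_emergency_brake();
  } else {
    normal_tick();
  }
}
```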
It's interesting to think when we go back to PGO,
well, if you use PGO and you had tests
that biased towards the unlikely thing
that you want to make fast,
then maybe PGO would be a better option there.
I don't know.
Again, speaking about HFT,
I don't know if options are a thing
we should be talking about, but whatever.
It's good to have options.
Right. You just have to call on them sometimes.
I think it's also interesting to go back to this, I guess, the theme of what are the simplest possible solutions in the space here?
So our episode title is this unlikely annotation.
And it's potentially a relatively simple solution compared to something involved like setting up an FDO pipeline or something like this.
And I'm sorry, did we define FDO?
Feedback directed optimization.
I forget if we did.
So there's a spectrum of feedback directed optimization, which goes from offline collection of profiles to, you know, just in time observation of your code.
And it all lives in this kind of feedback directed spectrum. So there's something
on the spectrum, which is the programmer has something in their head and they type it into
the program because they believe that it's the case. That's somewhere on this feedback directed
spectrum. It's actually the programmer's thinking intent. But we go back to what if you have no
branch predictor at all in your hardware and you've got a conditional branch in your CPU front end,
what are the simplest things you can do?
So one is to foist it onto the C programmer and you know,
it's all about balance of probabilities.
You could do simple things like just predict randomly which way the branch is
going to go.
And if you had a uniform distribution of taken and not taken in the source
program, you'd have a 50% chance of getting it right.
So that makes a nice mental baseline for comparison.
But I think it's pretty clear
that you could do even better than either of those two options:
foisting it onto the user,
or randomly predicting whether it's taken
or not taken.
Yeah, absolutely.
And really what we're trying to get to
when we talk about branches
is that hardware isn't instantaneous, right?
Like instructions
take a little while to execute. And the problem is you have a branch and you don't know which way to
go, right? The CPU is going to see a branch and then be like, oh, where should I go now? And
that's why these likely and unlikely annotations are useful because it helps you try to figure out
which way it might go and then try to bias the CPU. The compiler says, hey, let's help the CPU
figure out where to go.
And if you go back to say like the MIPS days
and the Hennessy and Patterson book,
they added delay slots to MIPS to do something like that, right?
So branches have what's called delay slots,
which allows you, while you figure out where the branch is going to go,
well, maybe to have a few more instructions that you can munch on, right?
So right after the branch, there's, say, an extra instruction
that you can execute as an extra thing.
So you know you're going to take a branch.
You're going to be like, oh, hey, I should figure out where this branch is going to go.
But hey, while I'm figuring out this branch,
let me eat those extra instructions after the branch.
And delay slots are a way to hide the latency
to send the conditional predicate to the instruction fetch unit.
And it effectively says
like there's, you know, some instructions that execute unconditionally associated with every
conditional branch, because otherwise, if I don't do that, I'm going to have dead space, right? So
it's kind of acknowledging like, I'm going to need a few instructions to do stuff afterwards.
It's baking it into the ISA. Like I'm going to need one slot
or two slots or three slots of work to do after you tell me that you're going to do a branch.
It always looks really strange when you see an instruction placed right under the branch,
but conceptually you can think of it as being kind of paired up with the branch, right? It's
branch and then also do these other things before branching. Right. And so just as a reminder,
so if we're taking our conditional branch, then we're going to the target. So like you said, the instruction fetcher
has to kind of Heisenfetch from these two possible locations. One is the branch is not taken. So we
just fall through to the next instruction, or we have to redirect where we're fetching from to the
other location. And there's a little memory in your system that it's trying to read the next
instruction from. And in the simplest possible case, it just has one port for you to go fetch an instruction out.
And so you have to figure out the address and then go fetch that address from this little memory.
And so now CPUs have this fundamental problem when they're trying to be as simple as possible,
that they live in this superposed state where you're fetching from either the next instruction or that thing arbitrarily far away or misaligned or switching processor modes or various forms of things that can happen when
you jump. But when you collapse this wave function, it's only going to do one of those.
So, but then here's the problem for modern CPUs. So we kind of talked about, okay, when you're
going for max simplicity, what are you going to do? In a modern CPU, the branches could
conceptually make you seize up and wait
until you're sure in the look before you leap kind of style. And then you'd look for something
like delay slots. But alternatively, we could do the, it's easier to ask for forgiveness than
permission thing. And then we just keep running from there to wherever we think we're going to
branch to and use special retirement hardware to prevent those results from landing into
committed architectural states.
So the kind of thing you would see in a debugger if you stopped it on a given program counter
or if you were stepping instructions.
That's your architecturally visible and committed state.
So effectively, there's a little bit of microtransactionality happening inside where you have these faculties
inside of the CPU pipeline that say, all right,
I did this, but I'm not sure it's definitely what has to happen. I did it because I think it's
probably going to have to happen, but I'm not sure. And so, you know, those have to be committed
via the retirement hardware. So we get tons of performance out of the instruction stream on
modern CPUs by speculating past branches that would otherwise paralyze us in place. And we run the
stuff we expect we're going to have to run. So this is the classic notion of ILP mining. So ILP
stands for instruction level parallelism. So we're churning through the instruction stream and
finding things that we can run concurrently on our execution paths along branches we expect we're
going to take in order to reduce the overall execution
time witnessed by the user for running the overall program. And, you know, a classic conundrum is as
your CPU pipeline gets really deep, this kind of becomes very difficult or doesn't work at some
point. So your pipeline depth means you have potentially a bunch of branches in flight
simultaneously. So let's say just for random
number purposes, let's say my pipeline depth was 40 and my basic block size was five. I'd have eight
branches in flight at any given time. And so even if I can predict each one with 90% accuracy, now I have
a 0.9-to-the-eighth chance, something like 43%, of predicting all those branches properly. So I'll have a bunch
of wasted work going back to that notion of wasted work
that we talked about before.
So there's some number of branches
you can do before you're back at a roughly 50-50 probability
of some misprediction.
Yeah, and you mentioned retirement hardware.
And what is that even?
Is that where compiler engineers go when they're done doing
compilers or something?
Well, not really. Basically, you know, you
have a bunch of extra registers and caches and whatever else inside the hardware, and they're not
supposed to be visible from the ISA. So these extra registers, you know, like say you're
on ARMv8 or whatever, you have say like 32-ish registers, 31, and the other ones in the actual
hardware aren't visible. You might have
twice as many or something more. They're not supposed to be visible. We've learned in the last
three, four years that they are visible through timing, but you know what I mean. Let's ignore
that the timing is a thing. But the hardware keeps track of all these extra registers and
caches and whatever to remember which ones are speculative, right? Which ones are registers that hold values that they haven't really fully committed to
and which don't hold those, which ones of these hold what's usually called architectural
state.
And what it's going to do, it's going to throw away the mis-speculation and roll back to the
known good state.
So say, you know, say you're on ARM or whatever, and you have 31 registers, there might
be, you know, twice as many. And it's not a set, like which one is real and which one isn't. It's,
you know, through register renaming, you kind of figure out which ones are the quote unquote real
registers right now or not. And it's the same thing with caches, right? Like caches are just
this extra level of caching that allow you to have that speculation. And basically retirement just tracks
what is it that's the point of no return
in the reorderable instruction sequence.
I'm really oversimplifying here.
And again, like this is another one
we could do a slew of new episodes on the topic,
but it's really interesting to think about this
when we talk about how
from a programming language perspective,
you're trying
to bias what the compiler is going to do to generate specific assembly instructions to then
have the hardware run them and speculate on the things that are going to happen. There's really
layers of speculations in there and layers of abstraction. Even when you talk about assembly,
there's so much abstraction in it. Yeah, it's amazing that assembly is really like a little interpreted language that gets unrolled into these micro instructions under the hood.
That's always a pretty fascinating thing.
And I think the point of no return is interesting because it shows the transactionality that we were talking about earlier of the point of no return usually says an exception cannot happen beyond this point like
this point has all been committed to architectural state i know that the machine definitely got here
so it's kind of the commit point that's running ahead through the program and so there's also
a bunch of different kinds of machines that are cited through the literature and there is kind
of a notion of brawny versus wimpy cores we're talking about here, with things
like, you know, branch delay slots and trying to keep hardware as simple as possible, versus big
reordering hardware and speculation and prediction and things. But even in the modern era, for embedded
systems or systems that have less heavyweight hardware per thread having the knowledge that
programmers know in their head or that, you know, profiling determined
can be very helpful. So things like GPUs, for example, they amortize their control hardware
over some number of threads that they hope will be running in tandem. So they're kind of trying to
get as much information about whether the threads will run in unison as possible.
So in order to be Turing complete, which is kind of our baseline level of expressiveness
that we expect out of machines today, you need data to be able to influence control somehow.
So basically you need to be able to do a while loop or an if or something like this, right?
While loop effectively. So most systems will have to consider how the data is going to interact with
the control side of the machine at some level. And to go back to the brawny versus wimpy core thing, there's some amazing hardware
that's been landed or at least been proposed for beefy machines.
So one example that I really loved was this idea of perceptrons.
And that was the early use of neural networks for branch prediction.
So ultimately, you just have to predict one bit given some history of what branches have done. And you have this little
neural network doing a matrix multiply that has to determine this. And of course, there's all
these notions that come in of the size and how power hungry it is and the importance of latency.
We need the answer quickly to which way this branch is going to go so that we know where to
start fetching from. It can't take too long. Otherwise, we're trying to figure out how to hide the latency of the thing that determines
how to hide the latency.
And so that's kind of, it kind of moves towards what are state of the art branch predictors,
which we'll have to talk about in some future episode.
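For the curious, here's a toy sketch of the perceptron-predictor idea in the style of Jiménez and Lin: one small weight vector per table entry, a global history of ±1 outcomes, predict on the sign of a dot product, and train on mispredictions or low-confidence predictions. Real designs sweat the table sizing, latency, and power budgets; this deliberately does not.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

class PerceptronPredictor {
  static constexpr int kHistoryLen = 16;
  static constexpr int kTableSize = 1024;  // power of two, indexed by PC bits
  // Common training-threshold heuristic from the perceptron-predictor papers.
  static constexpr int kThreshold = static_cast<int>(1.93 * kHistoryLen + 14);

  // weights_[entry][0] is a bias weight; the rest pair with history bits.
  std::array<std::array<int, kHistoryLen + 1>, kTableSize> weights_{};
  std::array<int, kHistoryLen> history_{};  // +1 taken, -1 not taken (0 until warm)

 public:
  // Returns the prediction; the raw dot product (confidence) comes back via out_sum.
  bool Predict(uint64_t pc, int *out_sum) const {
    const auto &w = weights_[pc % kTableSize];
    int y = w[0];
    for (int i = 0; i < kHistoryLen; ++i) y += w[i + 1] * history_[i];
    *out_sum = y;
    return y >= 0;  // predict taken when the sum is non-negative
  }

  // Call once the branch resolves, passing the sum returned by Predict().
  void Train(uint64_t pc, bool taken, int sum) {
    const int t = taken ? 1 : -1;
    auto &w = weights_[pc % kTableSize];
    const bool mispredicted = (sum >= 0) != taken;
    if (mispredicted || std::abs(sum) <= kThreshold) {
      w[0] += t;
      for (int i = 0; i < kHistoryLen; ++i) w[i + 1] += t * history_[i];
    }
    // Shift the resolved outcome into the global history.
    for (int i = kHistoryLen - 1; i > 0; --i) history_[i] = history_[i - 1];
    history_[0] = t;
  }
};
```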
And effectively what we're doing here is we're forming things that to me look like VLIW bundles.
So you kind of take code that's in a big sequence, and then you figure out which ones are really going to run, and then you turn them horizontal to run as micro operations against
all these data paths that you have. And you do that on the fly using the natural size of basic
blocks that exist and the reservation station scoreboarding algorithms that exist inside of
the machine. And so there's also interesting things like how do you form regions out of these
little traces of code that get executed in the literature. That ties into things like the reorder
buffer size. So CPUs have this big set of instructions that they can reorder to execute.
Reorder buffering and region formation are related concepts. There's also challenges of dispatching
out of a really big reorder buffer. And there's things like the Apple M1 silicon, which has been pushing the limits here. And going back to historical papers,
there's cool things like the trace cache in the Pentium 4 that really, I think, shows the spectrum
between the approaches, because that was actually a hardware structure that was storing micro-op
decoded tracelets that it strung together using all hardware primitives, which is pretty cool.
Yeah. And there's so much more to talk about here. There's so many other things, for example,
you know, vector predicated masks, SIMT, vector threading, Cray style run ahead, and, you know,
branch target prediction for indirect branches. How do branch predictors, you know, take address
bits and track the history? What's the aliasing and the subset of bits that you use to track branches?
How do you do local and global history tables in your branch prediction schemes?
And so much more, like hybrid global local branch predictor mixes,
virtuals as the ultimate control structure and relying on de-virtualization.
So how do you bias the virtual dispatch with de-virtualization?
So there's a lot
that we haven't covered here. And again, so many more episodes to go, but really we've been talking
about __builtin_expect, and we haven't really talked about __builtin_assume, right? Because
__builtin_expect is just saying, yeah, expect this thing, but this other thing might happen as well.
If you talk about __builtin_assume, well, it's really saying, assume that this is going to
happen. If it doesn't happen, then all bets are off, just like do whatever you want with my program, I don't really care,
it's totally undefined behavior if that happens. And __builtin_assume is another interesting thing
to use in there, right? There's certain programs where people have assert in one
compilation mode and then assume in the other one, and I've heard kind of extreme wars about whether that's a valid thing
to do or not, depending on what the background of the programmer is. So it's kind of an interesting
trade-off to do there. Expect and assume are definitely kind of hot button topics for different
people. Yeah. And it's interesting to say, hey, compiler, you can really take this as a given.
And then of course things can be wrong. And so then the question is how
wrong do things go when one of your axioms is untrue. It's a funny, funny idea.
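For reference, a sketch of the assert-in-one-mode, assume-in-the-other pattern they mention (the macro name is invented; `__builtin_assume` is the Clang spelling, and the `__builtin_unreachable()` form is a common way to express the same promise elsewhere). Whether this is a good idea is exactly the debate above: if the condition is ever false in a release build, behavior is undefined.

```cpp
#include <cassert>

#ifndef NDEBUG
  // Debug builds: actually check the claim, abort loudly if it's false.
  #define CHECK_OR_ASSUME(cond) assert(cond)
#elif defined(__clang__)
  // Release builds with Clang: promise the optimizer the claim holds.
  #define CHECK_OR_ASSUME(cond) __builtin_assume(cond)
#else
  // Release builds elsewhere: the same promise, spelled differently.
  #define CHECK_OR_ASSUME(cond) \
    do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

int divide(int numerator, int denominator) {
  CHECK_OR_ASSUME(denominator != 0);  // the optimizer may assume this everywhere
  return numerator / denominator;
}
```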
And so at the end of the day, I think to kind of summarize a lot of this,
scalar code really wants to have this notion of control inside where it's kind of moving around
the instruction that's being executed, changing the program counter to point at different places inside of your program.
But we really only have these crude ways of asking for things to happen.
And so usually we hope the compiler will have some way of doing the right thing, but often
it has no way of knowing what the right thing is at compile time, given just the information
presented to it in the program at large.
Yeah, so I guess the compiler just has to expect the unexpected.
Yeah, like if somebody was trying to predict how often our podcasts come out,
that's basically impossible.
That's true.
I mean, unless they have a side channel,
it would be a really good source of entropy for random number generators
or something like that.
Nice.
All right, well, that's our episode.
So thanks for hanging out, Chris.
And I hope I won't make predictions,
but I think we'll do another episode in the near future.
I expect that we will.
Great.
Thanks.
Later.