Two's Complement - CPUs are Clever
Episode Date: August 19, 2021
Matt and Ben discuss the kinds of things modern CPUs do behind the scenes. Branch prediction, caching, speculation, out-of-order processing, hyper-threading, register renaming... Lots of things most people don't need to know. Matt gets overly excited, and Ben channels Larry King to try and keep him on track.
Transcript
I'm Matt Godbolt.
And I'm Ben Rady.
And this is Two's Complement, a programming podcast.
Hey, Ben.
Hey, Matt.
How you doing?
Good.
So I've been thinking recently there's an awful lot of
cool things that processors do that I don't think people are aware of except when maybe they come
along and ruin their day because they hear things in the press like Spectre or Meltdown or other
side channel attacks. These scary sounding things that probably aren't quite as scary as you think.
But there's a lot of technology
and a lot of very cool and clever development
that's gone into CPUs.
And we have a podcast
and I love talking about this stuff.
So I figured let's just write down
Matt talks about CPUs
and then see what comes from that.
There's a whole bunch of things that CPUs do that I don't know about.
So I feel like I'm going to learn a lot in this podcast.
So many years ago, actually, I had a lightning talk, which was something like five things
you didn't realize that your CPU does for you.
Oh, yeah.
It was a fun one to do because it was for an audience of people that weren't necessarily
low-level engineers, as indeed I'm sure many of our listeners are not necessarily interested in all the low-level things.
Oh, both listeners. Sorry, there are two.
We have at least two.
Exactly.
I tried to come up with this format that I thought was interesting because they were only five minutes long anyway. But just phrasing it in terms of your CPU, even if you have only the vaguest understanding of
like assembly code, machine code, or whatever's actually going on from the point at which you're
typing your code to the point at which something useful happens in your computer, you've probably
got this idea of instructions that move registers around like MOV EAX comma 123, that kind of thing.
Or if you're on ARM, something similar.
Or, you know, you've maybe done some, you know,
RISC thing at university when you've been forced
to take some kind of assembly course.
And you kind of have a rough idea about the operations
that a CPU can perform.
I think, okay, there's an opcode identified
by probably an integer for each of these instructions.
And they take some number of arguments, not totally unlike a function call, but they take some number of arguments.
It's a great way of thinking of it.
And they are things like, you know, move this data from memory into this register.
Take this register and this register and add them together and put the result in this register. And that's sort of my, you know, very basic mental model of how
assembly works. And all of the basic operations of the CPU are represented by those commands.
So I think that's a really useful way of thinking about the way the computer works. You mentioned,
you know, like the arguments are like the parameters to the instruction as if it's like a function call. And I think the only thing I add to that when I kind
of mentally model it is like the registers are global variables, right? At the end of the day,
the CPU only has global variables, which is pretty horrible and pretty nasty.
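A tiny Python sketch of that mental model, purely illustrative: registers live in a global dict, and each "instruction" is just a function taking operands. The instruction names and register set here are made up for the example, not any real ISA.

```python
# A toy "register machine": registers are global state, and each
# instruction is a function call taking operands -- the mental model above.
REGS = {"EAX": 0, "EBX": 0, "ECX": 0}
MEM = [0] * 256

def mov(dst, value):       # MOV EAX, 123 -- put a value in a register
    REGS[dst] = value

def load(dst, addr):       # MOV EAX, [addr] -- copy a value in from memory
    REGS[dst] = MEM[addr]

def add(dst, src):         # ADD EAX, EBX -- dst = dst + src
    REGS[dst] = REGS[dst] + REGS[src]

mov("EAX", 123)
mov("EBX", 7)
add("EAX", "EBX")
print(REGS["EAX"])         # 130
```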
But so that one of those five things I was talking about that you didn't know your CPU does is that
your CPU is kind of a JIT compiler. It doesn't actually run that directly. It only maintains the illusion, right? The instruction set architecture,
the ISA, the machine code that you're giving to it, is an API boundary. And provided the CPU
appears to do what you would expect it to do when you do MOV EAX comma 1234, it is up to the CPU how it achieves that goal, right? It's a completely... it's an abstraction layer, right?
What goes on underneath is completely up to the CPU. Now, you might reasonably think, like I do, because my mental model is sort of sat back in the 1970s, 1980s of CPUs, that there is a little box that says EAX somewhere inside the PC, and when I do MOV EAX comma 123, then the number 123 goes into that box, right? That's what happens, right? But that's not what's happening at all anymore. Unfortunately,
things have gotten far more complicated and like the first thing that a modern CPU will do is it
will take that expression effectively that you give it this instruction and it will devolve it into smaller instructions called micro operations and those micro operations
are what the actual core inside is going to work on now in the case of that mov instruction we were
just talking about it probably doesn't have anything more to do than than uh that operation
it's like the operation is about as small as you're going to get.
But when you're doing something like move from memory,
there are several things that need to happen.
And while we're here, let's just talk about a pet peeve of mine.
Why on earth does it move?
It's always move, isn't it?
It's mov this, comma that.
There's no moving.
It's not like the old value has gone away.
It's copy, right?
When I say move the value one two three into eax then
I haven't actually just deleted the value 123 from the outside world, right? It's a copy. But it's like... move. Yeah, move. I can't even think of the right word without saying the word move. But anyway. So if you're, say, acting on memory, the Intel
instructions give you an awful lot of flexibility to, say, do an add instruction where one of the operands is a register and the other
operand is some complicated expression that says, read from this memory address offset by this other
register multiplied by four, because you're going to use it to index into, like, integers,
and each integer is four long. There's kind of three things going on there. One is the actual addition.
Another one is the calculation of which memory address was it that you actually wanted to go
and read from. And I'm not talking about virtual memory here. I'm literally talking about the
instruction was ESI plus 100 plus EDI times four. And you're like, well, that's a complicated thing.
And then third of all, there's the actual going to fetch the memory that is at that address.
And so those really are split into three operations.
And it's not just to make it simpler on the architecture underneath.
The core thing that's happening is that those micro operations can be broken apart and then executed separately and individually. Which means that if, for example, the memory unit is free but the adding unit is busy doing something else, then you can go off and do the memory calculation, go and fetch that, and then sort of hold on to that temporary result. And then when the adder becomes free, you can just do the add. You don't have to wait for all three units to be ready together so that you can do the operation.
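A rough sketch of that decomposition, with a completely made-up micro-op format; real Intel micro-ops are undocumented and look nothing like this.

```python
# Hypothetical decomposition of one x86 instruction into micro-ops.
# The instruction: add eax, [esi + 100 + edi*4]
# These tuples are invented purely to show the three separable pieces of work.
micro_ops = [
    ("calc_addr", "tmp0", ("esi", 100, "edi", 4)),  # tmp0 = esi + 100 + edi*4
    ("load",      "tmp1", ("tmp0",)),               # tmp1 = memory[tmp0]
    ("add",       "eax",  ("eax", "tmp1")),         # eax  = eax + tmp1
]

# Because these are separate, the address calculation and the load can run
# while the adder is busy with other work; only the final add has to wait.
for op in micro_ops:
    print(op)
```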
Interesting. So is this like an extension of Moore's law running out, where it's like they couldn't make the processor any faster, so they just started giving us more cores, and now they're like, well, what if we take the operations and we do those in parallel too? Is that just it?
It is exactly that. Yeah. Except that the sequence is the other way around, really. You know, as Moore's law
was sort of reaching the end and as we sort of hit the sort of four gigahertz ish wall instead of making cpus more and more and more uh fast in terms of
clock speed the cunningness went along multiple dimensions and parallelism is something that
silicon does like it's hard to stop it from doing the parallelism in fact all of like chip design
is trying to get rid of it from being parallel where you don't want it to be and handling the crazy race conditions.
I mean, I've never done silicon, but in terms of FPGA design stuff, it's always the timing.
It's always the synchronization points.
And if you think you've got problems when you've got multiple threads accessing stuff, try imagining you've got 2 million individual gates that are all doing their own thing so um yeah the sort of the first axis i mean so back in the day
we're going to go back to like 6502 era then um it took multiple cpu uh cycles multiple clock
cycles to execute a single cpu instruction So, you know, the classic fetch the instruction,
decode the instruction, execute the instruction, usually, and then sometimes they have a retire
stage or a write back stage and a retire or something like that. Those are kind of like
the pipeline stages that an instruction went through. And it went through them one at a time,
which was cool. But the obvious observation is it's a pipeline. You can be fetching instructions one cycle at a time,
and then you can be decoding the next one and executing the one after. Oh, right. Sorry.
The other way around. The classic pipeline, which is fantastic. And it meant that a lot of things
that were taking five cycles to complete, if you imagine a five cycle pipeline, well,
for every individual instruction, it does indeed take five cycles to get
from one end of the pipeline out the other but every single clock clock of the tick tick of the
clock you're doing one more piece of useful work just like um you know a modern car production line
it might take two weeks for a car to be assembled from a bag of parts that are put in
one end and come out the other for any one car so if you were to like spray paint that car red and
then it would have to wait five weeks but every day a whole new car rolls off the end of the
production line and so you've got this latency throughput trade-off that's going on and so we
were already taking steps in that direction even like so i said the 6502 grabs these things
actually as it happens the 6502 had a
very, very simple pipeline where on the
end of the last tick of
an instruction, it
actually was fetching the next
instruction, which had some interesting,
if you're ever going to emulate it, it had some interesting side effects,
but we've done that to death.
So there you're
getting sort of parallelization
out of a single stream of instructions, which is great.
What tends to happen, though, in those cases is that that only works if you know what's coming next. It's like when you finally get to the last assembly step, you look at that part of the instructions and it says, oh no, make this one red. And you're like, oh, but it's already been all the way through and we're nearly at the end.
And the analogy I'm trying to make there is that's like a branch that's taken or not taken.
You've made a decision and it takes a few cycles for the decision-making process to come to its conclusion and say, ah, actually, we need to
be doing something different here. But the pipeline is full of the other instructions
that would have happened if you had just carried on in a straight line. So at that point, what used
to happen is you'd have bubbles in the pipeline. So those fetched and decoded instructions that weren't going to be executed were just flushed away, thrown away. And then the
whole thing would start again and so it would take several clock cycles before useful work could be
done again and that's what made branches expensive back in the day and that's cool so what next
happened was the concept of making a guess hey if we can make a guess at where the branch is going to
go ahead of time then we can steer that decode that fetcher and a decoder in a particular direction
and then as long as when we get to do the execute stage we check our guess if we were right hey
no harm no foul we carry on like the instructions that we've fetched and decoded are ready to go
we don't have to have this bubble in the pipeline. And if we get it wrong, it's no worse than how we were before: we have to throw away the pipeline and flush it.
Yeah.
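A toy Python model of that flush cost. The five stage names, single-issue fetch, and resolve-at-execute behaviour are simplifying assumptions, not any real CPU; the point is only that every misprediction throws away in-flight work and costs extra cycles.

```python
# Toy 5-stage pipeline: each cycle every in-flight instruction advances one
# stage, and one new instruction is fetched. A mispredicted branch, once it
# reaches execute, flushes everything younger -- that's the "bubble".
STAGES = ["fetch", "decode", "execute", "writeback", "retire"]

def run(program, mispredicted=frozenset()):
    in_flight = []                       # (instruction_index, stage_number)
    next_fetch, retired, cycles = 0, 0, 0
    exec_stage = STAGES.index("execute")
    while retired < len(program):
        cycles += 1
        survivors = []
        for idx, stage in in_flight:     # advance everything one stage
            if stage + 1 == len(STAGES):
                retired += 1
            else:
                survivors.append((idx, stage + 1))
        in_flight = survivors
        # A mispredicted branch reaching execute flushes all younger work.
        for idx, stage in list(in_flight):
            if idx in mispredicted and stage == exec_stage:
                in_flight = [(i, s) for i, s in in_flight if i <= idx]
                next_fetch = idx + 1     # refetch down the correct path
        if next_fetch < len(program):
            in_flight.append((next_fetch, 0))
            next_fetch += 1
    return cycles

program = list(range(40))
print(run(program))                              # smooth, fully pipelined
print(run(program, mispredicted={5, 15, 25, 35}))  # extra cycles per flush
# Real pipelines are 15-20+ stages deep, so each flush costs far more.
```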
Now, as it happened, about the time that branch predictors were coming in and becoming more and more important, it was actually the case that the pipelines were getting longer and longer and longer, for boring technical reasons to do with actually speeding up
the clock speeds overall.
Obviously, if you're making something go faster and faster and faster,
it's better to have smaller and smaller pieces of work to do,
which in this pipeline analogy means having simpler and simpler steps
in the pipeline, but potentially more of them to achieve the ultimate goal.
So now we're looking at latencies of the orders of tens of cycles
between the fetch that picked up the instruction
and the branch instruction actually running and determining,
yes, we are going to take the branch,
or no, we're not going to take the branch.
So the interesting side effect of having a branch predictor, and we could talk to death about how they work, and they are amazingly, amazingly good these days, is that
what the core the sort of like the engine that's that's churning out all of the actual work the
stream of micro operations remember we've turned these these instructions into micro operations
it's just given a sequence with no branches in it,
effectively, of instructions of work to do. And it can be filling that, that could be a huge amount
of work. It can be hundreds ahead of itself, potentially, you know, with a long enough
pipeline. But also, what that gives it an opportunity to do is to sort of do, and this
is where I'm going to go back to the analogy I made of like a jit compiler it's sort of just in time compilation it can do some analysis of the stream of instructions
and it can say this particular instruction that's later in the stream doesn't depend on an earlier
instruction so and this is all happening in the silicon this is happening this is in the hardware
completely in hardware absolutely wow. And so effectively,
a DAG is being generated of all the interdependencies between instructions.
And instead of running the instructions strictly in the order that you wrote them down,
it is running these micro operations when they are ready to be run.
So in the case of, say, a divide, divide is a classical example of a really lengthy instruction
so i'm dividing two numbers together it's going to take a good dozen to two dozen cycles before i
get the integer result of that divide meanwhile the next instruction was just like yeah mov eax
comma 12 because i'm setting up to do some other piece of work right well that instruction can run
it can jump the queue, because there are no observable side effects of the divide that it needs.
It's not like it's moving the result of the divide in that case, correct?
Right, it's just moving some other unrelated thing. But
even if the next instruction was moving that that instruction would be blocked and the the cpu would
be continuing to look down the stream of instructions for something that finally didn't
depend on anything that was currently busy.
Right. And it's just amazing to think that that can be done in silicon, quickly.
But wait, there's more. One of the problems that you hit very quickly, if you're looking for more work to do, if you're looking for these dependency chains, these DAG
nodes that don't relate to each other so that
you can get as much parallel work being done as possible, is that very quickly you get bottlenecked
on those registers, those global variables. If you imagine, again, going back to 6502,
you basically only had one register. It was the A register. The X and Y registers were over there
as well, and they were used for indexing. but let's assume that you had something akin to only one register and the you
know we issued that divide right so that divide happened and the result's going to go into the
accumulator and then the next instruction was to say and store that result in some memory location
and then the next instruction was and now load a different value into the a register because i need
to go and do some additions okay right well at that point, I have to stop because I can't write into that accumulator register because the divide's using it right now.
So we grind to a halt again.
What happens as the instructions, the micro-operations flow through from the decoding unit into this out-of-order system
is that as well as looking at the dag of dependencies between them it's also rewriting
those registers that you gave it and it's using tons of internal temporary register names so in
the example i just gave that first divide um that's going to write into like a sub zero like
with a little subscript zero but as soon as i do mov a value into a i've completely
obliterated what used to be there i've just you know i moved another value into it and at that
point the the um the cpu goes well that's a different a from the last a so i last used a
sub zero now i'm going to use a sub one this is a trick that compilers will do themselves right as
compilers are looking at stuff they'll be like well i can reuse this or i cannot reuse that but it's happening in silicon so now we've got a sub zero and a sub one and now
suddenly they're two independent instruction flows again, and now they can be parallelized. And this is, again, parallelism within a single thread of execution on the CPU. And that's so cool, right?
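A minimal sketch of the renaming idea in Python, using an invented three-field micro-op format; real renamers are hardware structures tracking far more state than this.

```python
# Toy register renamer: every time an architectural register is written, it
# gets a fresh physical register, so an unrelated re-use of "A" no longer
# has to wait for the slow divide that also wrote "A". Purely illustrative.
from itertools import count

def rename(ops):
    """ops: list of (dest_reg, source_regs) using architectural names."""
    phys = count()                 # endless supply of physical registers
    latest = {}                    # architectural name -> current physical
    renamed = []
    for dest, srcs in ops:
        srcs_p = tuple(latest.get(s, f"{s}_init") for s in srcs)
        dest_p = f"p{next(phys)}"  # brand-new physical register per write
        latest[dest] = dest_p
        renamed.append((dest_p, srcs_p))
    return renamed

# A = divide(A, B); a store needs that A; then A is overwritten for new work.
ops = [
    ("A", ("A", "B")),   # slow divide writes A
    ("M", ("A",)),       # store depends on the divide's A
    ("A", ()),           # mov A, 12 -- a completely new A
    ("A", ("A", "C")),   # add using the new A
]
for before, after in zip(ops, rename(ops)):
    print(before, "->", after)
# The last two ops use a different physical register from the divide's, so
# the out-of-order core can run that chain without waiting for the divide.
```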
Is it fair to say that this is just like another classic example of the solution to so many problems in computing being another layer of indirection?
It's sort of like we have these backward compatibility reasons why we can't just change all these instructions, right?
But we don't want the hardware to look like this anymore, right?
Right, right, right.
So now what are we going to do?
Okay, well, we'll just pretend like we're executing these instructions, and underneath the covers we can do whatever we want. And like, that effect is, you know, sort of like a classic programmer trick, right? Like, I have some API that the guy before me wrote, and I hate it, but I'm stuck with it, so underneath I'll swap in something different that's much faster or much more efficient or whatever, or, you know, fits the architecture that I have, or is cheaper. And
that's kind of... so now assembly is just basically like an API at this point, and we're just using it to drive it all. And as you say, it sort of has to be backwards compatible back to like 1970s era code,
because yeah so that's a really interesting observation. And it's one that Intel themselves made
and they decided, well, what would it look like?
And I apologize for those who probably know
a lot more about this than me,
but they decided to investigate
what would it look like if we actually switched to something
which was more amenable to easy execution on the CPU,
but doesn't require all this trickery.
And they came up with the Itanium
which was a very long
instruction word chip with
thousands of registers that you could actually access
or hundreds of registers that you could access.
And by very long instruction word
that means that each
instruction, I'm putting little air quotes here,
is actually a bundle of multiple
instructions that will run concurrently.
Oh, interesting.
So they exposed, they peeled back the abstraction layer
and they said, what if we just let you run
what is essentially our microcode directly?
Kind of, yes, yeah, exactly.
And they said, let's invest in smart compilers
that can work out how to rearrange and pack
the code that you've written into these instructions
yeah yeah but it turned out to be a dead end either the compilers just weren't up to the
task and given how good compilers are i suspect that's not it or else what we don't get with that
um approach is any kind of dynamism right there's a reason why i kind of use a jit as an example
because a jit has extra knowledge at the time it has the dynamic environment like if you look at
how the jre works and i'm switching out analogies left right and center here but like it's very hard
to statically decide whether or not something should be done one way or another but it could
be trivial to make that decision when you have dynamic information in front of you.
Like, which of these two branches is more likely to be taken?
I don't know.
Sure.
Is it the same way that I went the last 80 times?
Yes.
Well, then this is now the more optimal route to take.
Whereas staring at code ahead of time, you've only got so much information.
And especially when you add complexities like, you know, I alluded to the divide taking forever.
Sometimes those divides are dependent on the data that's being put into them and so you've got like a data dependency like even if you're dividing two 64-bit numbers it can take
fewer cycles if the numbers are smaller or bigger like actual magnitude of the numbers and so there's
a huge amount of dynamism that comes another thing that's very hard to predict is memory latency.
We've got so many layers of caching now
because RAM has never really kept up with the speed of the CPU.
So you don't know if something's going to be in cache or out of cache.
And so statically trying to predict that
and kind of padding your VLIW code,
it's really, it's almost, I'd say it's impossible.
Got it.
So is it the case
that sort of like breaking the abstraction here has has then constrained what you can do
underneath the abstraction uh and and therefore made it less efficient because now you're you're
making promises... you're trying to make promises that may not hold up in the dynamic environment of actually running the code.
Right, right.
Like, you're sort of implying that certain things will be fast or
certain things will be slow or certain things will be more memory efficient when they're not
necessarily because it just depends but if you've if you've pulled back that abstraction now the the
the compilers in this case which are are sort of the things producing this, now sort of like can't really – they're just sort of too smart for their own good.
Is that a reasonable way to think about it or am I oversimplifying it?
No, I think so.
And, you know, like whereas in a software JIT situation, which can also take advantage of this information, like I said, you know, like you can work these – you can observe the pattern of code and kind of make a decision to say well i'm going to
re-optimize this particular area of code um i don't think that even a software jit that's
producing code for say itanium could observe enough information at the sort of micro level
to take full advantage of this. You know, nowadays the CPUs have hundreds of out-of-order buffer entries, so they can be like two or three hundred instructions ahead
of themselves so if there's that one slow memory load that you're stuck with as long as it can find
enough information sorry enough useful work to do
that can be predicted which are a lot of ifs but again all these things are getting very clever
then that one slow load isn't going to block you from doing too much work you're going to be able
to tear through and in fact there's another sort of cunning trick that um so switching away back
from this Itanium pre-compiled solution to the problem.
So thinking about what a single CPU actually looks like, we've got the front end, which is
this thing that's going to do the fetch, the branch prediction. It's going to fetch, it's going to
decode and generate a bunch of micro operations. And then there's going to be some kind of magical
renaming, which is the thing
i alluded to where it rewrites what registers you're actually using to be internal ones you
can't even address so i can't even say the name of these they're totally internal to the cpu yeah
then they kind of put in the temp directory of registers it's exactly that that's exactly what
it is yeah but those get all thrown into a big pot of like here's all the instructions i would
ideally like you to run and now the out of order core is going to just pick off the ones that can be run.
And then finally, they get completed in order again.
There's like a big master list of like, well, this is where we're good to.
These things have actually happened.
Anything below this mark is like, may have completed out of order,
but we have to make it look like it happens in order.
And that's where, for example,
even if it's run ahead of itself,
but there was a page fault or a seg fault on an instruction,
which maybe takes a while for it to discover,
the seg fault doesn't happen
until the instruction that caused the seg fault retires,
and then it aborts everything that happened after it.
So that's kind of how that magical time-traveling aspect happens.
Building a software emulation of this seems like years of work to me.
It's just mind-blowing that there's physical hardware
that is making all these kinds of decisions.
That's just incredible.
It's really quite an astounding achievement.
And the most interesting thing to me is that this information is hard to come by intel
deliberately doesn't try to tell you very much about what's going on underneath this layer
because they want to have reserved the right to change it quite reasonably their api and contract
sort of ends with you at the the isa at the the machine code level right right well yeah i do the
same thing, exactly. Like, you know, but don't look under the covers, because I might want to change my mind later. Exactly. But, you know, Hyrum's Law, the law that any observable side effect will be used by somebody and become relied
on by somebody comes into play but yeah the interesting thing is all this information is
is inferred from some information that's in the the manuals that you get from intel but just an awful lot of reverse engineering by a community of dedicated engineers who are using the performance counters and seeing
when things tick up you know how how often does a front end re-steer you know intel doesn't really
tell me what that particular counter is measuring, other than that you should try and keep it as low as possible,
but i can run my code and sort of see under what conditions that happens and i can start
building a theory. And the people... there's Travis Downs and Agner Fog, Peter Cordes. There's a few really interesting people. I mean, like Agner Fog, for example, he's a Danish professor
of anthropology which is like the most strange hobby to have. But I mean, who am I to throw stones in that respect?
But yeah,
it's really interesting that this is not as well understood,
even among the people who are really,
really interested in it.
I mean,
is it,
are there intellectual property reasons why I can imagine?
There are security considerations.
Yeah,
that too.
You know,
so just to sort of,
I've mentioned about Spectre and Meltdown at the very beginning as being the scary reason why people might think of this.
Well, the mechanism by which they attack is absolutely core to that thing I was talking about.
Like when a fault happens, it sort of erases everything that happened after the fault.
But we know that several hundred instructions might have actually run based on speculating that the fault wouldn't happen.
And then the fault did happen.
Now, as I said, this retirement sequence happens strictly in order.
So until an instruction has retired, it's as if it didn't really happen.
So we can throw that away.
So you're like, well, how can possibly this be a problem, right?
Although we may have done some work beyond the fault um it will get thrown away and so like it's not visible
to me i can't see past it right that is unfortunately not true because also in this system
is the cpu data and uh and instruction caches those can't be undone.
Any changes that happen in them,
and now I'm not talking about data
that's changed inside the caches
because it won't actually have been committed,
but if somebody has read from main memory
at a particular address,
that will now be in the cache,
that value will be in the cache.
And now I can, after that fault has happened,
I can go back and say,
how quickly can I read this memory address?
Oh, it only took three cycles.
Therefore it must be in the L1 cache.
And the only reason it must be in the L1 cache
is because my little targeted attack
that I did over here
caused the speculation system
to get way ahead of itself
and pull in some information that was in the cache.
Now I don't know what was in that information,
but if I can contrive a situation in which the address of the read corresponds to a secret piece
of information that I shouldn't otherwise have access to, I now have a way to sneak the values
out by just doing timing on the cache. And this is a very big topic, and I'm really not doing it
a proper service here. But that kind of mechanism there, there's like the dirty secret is the undo buffer that we have inside the instructions for like, whoops, I didn't mean to do those things.
Can't undo the side effects of cache, things being read into the cache.
So this is like you have some data structure where the address in memory tells you something important
about the data itself you know like a a hash or something like that where it's like oh well
you know this value gets written to index zero this one gets written index one index two so forth
if in if if you know that uh yeah i guess it's almost like a, what is that called?
A bloom filter?
There are some things.
I mean, you can actually do a much more simple thing.
Because if you can control the code that's going to run.
And so, for example, the canonical example of that is the safety mechanisms around a, say, a JavaScript JIT engine.
Right? You're doing an array access,
and you want to, you go out of your way
to read within the bounds of the array
over and over and over and over and over and over again.
And now the branch predictor has learned
that you never ever trip outside of that check
that you know must be in the JIT somewhere
that says make sure they don't read outside of this array right I've made myself a 256k array and I've sat there reading a million
times out of it. And so now, as far as the CPU is concerned, it's essentially cold to deal with the exceptional case
where inside the JS engine it's going to throw an exception to say out of bounds access or it's
going to return like the undefined or whatever it's going to do. But it has to do something different. Now, I happen to know that my buffer,
at some giant offset from my buffer,
if it were to actually try and read there,
then I'd be reading some protected piece of memory
that I shouldn't otherwise be able to access,
like something outside of my sandbox.
And so what I'm going to do is I'm going to do my million reads inside the array, and then I'm going to suddenly read out of index. And then as soon as I've got that value
out which i won't which will never actually happen but the speculation will get that far
um i will then take that value and immediately use it to read one of 256 different cache lines
and then I hope
that I can do that before the
branch predictor goes whoops I didn't mean to
go this way and then it all
gets undone and then I get my
undefined return and I go oh cool
so you're using the value that you read out
to do another read
right exactly
which now will
show up in your
analysis essentially of like oh this one was really fast.
So, like, I'm going to read some value that I don't know out of memory.
And let's say it's three.
But you don't know that, right?
And then you're going to read whatever value you read's index into the memory, right?
And that turns out to be three.
And now three is fast.
Exactly right.
So now you know three is fast, and the secret value is three.
That's right.
And yeah, exactly that.
And there are different ways of achieving it.
And that's called a side channel attack.
It's a side channel.
It's like a way of exfiltrating information that isn't through sort of the architectural state of the computer.
Got it.
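A toy, purely illustrative simulation of that timing trick. The "cache" here is just a Python set of hot line numbers, the "timings" are invented constants, and the secret is a byte we pretend was touched speculatively; nothing here runs on real hardware or exploits anything.

```python
# Toy model of the cache side channel: speculative work touches one of 256
# probe lines chosen by a secret byte; the attacker then "times" every line
# and spots the hot one. Cache = a set, timing = set membership.
import random

SECRET = 42                     # the byte the attacker shouldn't be able to see

def speculative_victim(hot_lines):
    # Architecturally this work is thrown away, but the cache line indexed
    # by the secret stays warm -- that's the whole leak.
    hot_lines.add(SECRET)

def probe(hot_lines):
    timings = {}
    for line in range(256):
        # Pretend a cached line reads in ~3 "cycles" and an uncached one ~200.
        timings[line] = 3 if line in hot_lines else 200 + random.randint(0, 20)
    return min(timings, key=timings.get)   # the fast line reveals the byte

hot = set()
speculative_victim(hot)
print("recovered byte:", probe(hot))       # 42, without ever reading SECRET
```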
So we got a bit derailed there.
Not that we have any rails in any of these things.
But one thing I do want to talk about was going back to that big soup of uh out of order execution um things we had the front end we decode a bunch
of stuff we get micro operations they get thrown into this giant soup they've been rewritten to
use the internal registers and now we're trying to use um you know we've got an adder we've got
a multiplier we've got a divide we've got multiple of these things right they're called execution
ports. There are usually between five and eight of them on different chips,
and they can do different things but the the sort of the the goal of the out of order system is to
try and fill up all these parts of the silicon with useful work to do and as we've said one of
the ways that it can do that is to get way ahead of itself by predicting branches and starting to do loads
of work that maybe is totally speculative. Another thing it can do is rewrite the registers so that
more seams of independent work can be found in a single stream of instructions. But there's one
more clever trick that you can do. For very little silicon cost,
what if, instead of having a whole other CPU to make on the side, what if I just have another
set of registers? Not even the internal rewritten ones, just the external registers. They do have
to, you know, I said at the beginning, there isn't really a box called EAX with a one, two, three. Well, there is, there is one canonical one. And that's in fact,
what that retirement stage is doing is it's writing the real value of EAX back into the
actual real EAX register somewhere so that anyway. But for a small cost of having another program
counter and another set of all of the registers of which there are 16 or 32 of them
i can get another thread and i'm putting air quotes over this so what if i were then to say
to the front end every other cycle read from either the first program counter or the second
program counter and when you read from the first program counter, tag it to say,
these are different registers
from the other program counter.
So I kind of have two sets of front end state.
So now every other cycle,
I'm reading two completely independent streams
of instructions
from two different physical places in memory.
Once they've been through the rewriter,
we know that they won't be able to talk to each other.
So like the stream one's EAX
is a different physical register from stream two's EAX.
But once they've been rewritten to internal register
one, two, three, four, or one, two, three, five,
whatever, it doesn't matter.
At that point now, we've got twice as much work to do
that is guaranteed to be independent
because they're from two completely different streams of execution but i haven't got any more adders or multipliers or dividers or
anything i just throw that into the big soup in the middle of the out of order execution system
and naturally we've got two non-overlapping dags that are being processed right and provided we
just have the tiny one bit that says the ultimate result
of this needs to go back to eax in hyper thread and i'm going to call it that now hyper thread one
or hyper thread two we've got twice as many sources of independent work to do for very little cost
So, like, very sort of naively here, if I had two programs, one of which was a very long series of adds and the other one was a very long series of subtracts, you could mix those two instruction streams together and know that they would never overlap, and basically do them all in parallel, because you know that they're never going to do two adds in the same register at the same time.
Correct.
Now, adds and subtracts happen to fall in the same category, because they're exactly the same operation when it comes down to it. But like, if you said multiply and add? Absolutely. Or, you know, one of them is a linked list reading through
memory really fast whatever you know so you get for free inverted commas uh at least a very small
cost of this little extra architectural state you get twice as much potential for um filling up those seven or eight
real work ports these execution ports with useful work that can allow you to make forward progress
and that's what hyper threading was and so that's another way of getting parallelism you know we've
got instruction level parallelism which is itself the ability for the the CPU to go out of order and say, well,
I can run more than one instruction at a time from a single stream. Then you've got hyperthreading,
which says, well, I can interleave more than one CPU's worth of instructions as long as I
just do a tiny bit of accounting to make sure they don't mix up with each other,
which has led again to more of the kind of scary problems that you have, like Spectre and Meltdown type things.
I forget which ones are specific to hyperthread siblings,
but there are some.
And then even just at the instructions themselves,
we can use single and multiple data instructions
where you can treat a single instruction as being,
well, actually, this is acting on eight independent ints at once.
But that's a whole other topic, I think, for another time.
But that's another level of parallelism.
And that's before we even go to genuinely multi-threading,
where they said, we can't make the chip go any faster,
but what we can do is we can put eight copies of the chip on a single die.
So where are we in the timeline?
What year is it?
Who's the president?
Yes.
Yeah. We've talked about all these technologies up to this point. What year is it?
That's a great question.
so my sort of memory of where i started looking at certainly intel cpus was the mid late 90s
um and in fact i was working for a company who who were doing some work for Intel at the time.
I remember the very first 233 megahertz Klamath got delivered to the, which was the name of the
Pentium 2, got delivered to the office. And even though I was the new guy in the office,
I was tasked with getting it up and running and working, which was painful,
all these strange um
drivers and things that needed to be installed and actually ultimately ended up inheriting it
and taking it home which was a whole other story but um yeah in our lunchtime quake sessions i had
by far the best ping it was brilliant because no one else had a computer was even this close
it was wonderful. It didn't make me much better at Quake, but it was a lot of fun at the time.
But, you know, that era was sort of the last era where you had any hope of really understanding
or at least getting help from Intel
about how their CPUs worked.
I don't really actually exactly remember when,
but it was at least Pentium 1
where they told you this rather explicitly.
There's a U-pipe and a V-pipe.
And if your instructions look like this, they can go into the U-pipe. And if they look like this,
they can go into the V-pipe. And then you could sort of sit down
and hand arrange these pairs of instructions.
So this is very much before
the more complicated stuff we've been talking about.
Around
2000 was
when the NetBurst CPU
came out. And that was one of the
ones that
definitely used microcode by that point,
so micro operations.
And rather interestingly,
that took a sidestep down.
We didn't talk about micro operations themselves,
but the whole decoding thing is quite complicated.
And so there's caches for that as well.
That took an interesting sidestep
down an avenue of something called trace caches,
which is sort of an alternative way of kind of encoding both the branch prediction
and the decode of the stream of instructions that happened when you went to that particular place.
And so there was a cache that said, hey, the last time you got to address 1, 2, 3, 4,
with this prediction of what branches might look like,
here's just the pre-decoded micro operations.
Half the work's already been done for them.
Just start executing those things
and only just abort if you hit the wrong one.
If you discover along the way that you've made a mistake,
you have to undo it and go back.
And the problem with that is it just didn't work
as well as the engineers thought it was going to work.
So, and at this point,
we're still in like 32-bit processors
for what it's worth.
Around this time though,
AMD were working on a 64-bit extension
and there was some kind of argy-bargy
back and forth between AMD and Intel
about like, you know, who owned it
and how they could do it.
I think 2001-ish,
they came to an agreement over a patent.
And I think that was like a sort of mutual patent agreement between the two.
They wouldn't go after each other for the things.
And so in 2004, the first 64-bit processor came out, and it was an AMD, which was quite a coup.
I mean, I imagine lots of heads rolled inside Intel when they were beaten.
I think it was another couple of years
before the core processor came out,
and that was 64-bit.
So yeah, in those early 2000s
is when most of this micro-operation stuff
started to be, to my knowledge anyway, more used.
My real deep dive of this stuff
starts in the 2006- plus era of the core, which was a separate offshoot.
It was actually the original.
It was a team in Israel who were doing the mobile version, the Pentium M.
And when the whole trace cache that was inside the NetBurst seemed to be a dead end, they stepped up to the plate and said, hey, well, how about we take the mobile core, take some ideas from the trace cache, but move on in a different direction?
And that seemed to work out pretty well. Yeah, so 2010 was the sort of Sandy Bridge era, which
had reintroduced, ironically, something called the microoperation cache, which is different from
the trace cache, but morally in the same sort of ballpark. But it didn't try to be as clever. It wasn't storing
entire threads of execution, like, hey, if you get to this address, then this is what you're
going to do until you discover you've made a mistake, which is actually genuinely more like
what a JIT does right nowadays. A JIT will kind of say, hey, this is this bit of the code. We did
this last time. Let's do that. But it was more like a traditional cache where there's still logic that's sort of pumping out
which instruction should be next. Then it goes, well, rather than go to memory and pull down and
decode the instruction, let's just see if that memory address maps to somewhere that's in our
micro operation. Oh, it does. Okay. That saves us going out to memory at all and doing the decode.
So it's like a cache on the other side of the decoder?
Yes. Oh, sorry, I think I made... I've got a note here, actually. 2003 was the AMD Opteron, which was the first 64-bit machine, and 2004 was the Intel Nocona, which was their 64-bit one. And it would be rude not to mention ARM processors here as well.
So the ARM processors started off, as I love to say,
as they were a coprocessor to the BBC Micro,
which was a great, fun thing.
In fact, the ARM 1 was the engineering sample for that,
was a coprocessor, and it printed out,
hello, I am ARM or something like that.
And the cool little thing about that story is that they had no idea what the power consumption was going to be like.
And after they realized it was booted and running, they realized they hadn't actually turned it on yet.
It was running, but it was the I.O. pins.
There was enough parasitic power coming in through the I/O pins from the device that was plugged in that
the thing just was like working so at that point i think they realized they were onto kind of a low
power thing which obviously has been a theme ever since for them they're in every cell phone
ever um but you know all the levels of sophistication that have happened in intel have
happened in arm i haven't followed them as closely but you know the the the M1 amazing achievement and the M2 that's going to be coming soon from Apple who have licensed the ARM ISA, but essentially have completely re-implemented it themselves with all the same kind of tricks and the same depth of pipeline and the same...
Oh, I didn't realize the M1 came from ARM's legacy.
That's interesting.
Right.
Yeah. Well, these are all, I mean, they have to pay ARM for the ISA to license the,
essentially the encoding of the instructions is what they're paying for, I believe,
and like the idea of it.
Otherwise, you know, I'm sure they would have preferred to do everything themselves,
but it's only so far you can go.
But, yeah, it's, oh, and the other thing about the M series is that they've gone for
both a big and a small core.
So they've actually gone for heterogeneous cores. Like, on any Intel CPU, if you've got eight CPUs, they're
all exactly the same. But in an M1, you've got the Firestorm core, which I think is the big one, and the Icestorm core, which is like a small one, and it's like four of each. And so there's some
clever power management stuff that the operating system can do to say well let's just put these lower priority things on the lower power consumption cpu that isn't as fast whereas this graphics editing code
or whatever can run on the other one and i'm super vague on this but i just know that it exists and
it's it's it's coming i think to everywhere you know with so so yeah there's an awful lot of
things that go on inside that little little chip and. And we haven't even talked about how the caches work,
how the branch predictor works.
I often joke that if we were to somehow come up with a way
of getting the previous year's worth of lottery numbers
encoded as a sequence of jumps,
we could probably use the branch predictor to predict the next set,
and then I can retire.
Because they are so good.
They are just so clever.
And the levels of which the engineers
go to and obviously this is into trade secret area so reverse engineering those branch predictors is
is is challenging and even just getting the idea so that the pipelines now are so big and weighty
that the um like the old school idea about what a branch predictor is, is like, is this branch taken
or is this not taken? That's kind of how I've described it. But really, even the, is it a
branch, right? I've just pulled out 16 bytes of memory and certainly for Intel, is there a branch
in there and does it have a destination, like an unconditional branch, a jump, if I have to wait five or six cycles for
it to get down all the various pre-decode steps and all those bits and pieces before it gets to
the actual decoder, it goes, oh, sugar, there's an unconditional jump to 100 here. Then I have
to re-steer all the four or five steps beforehand. So there's a predictor for the destination of a
branch, or even if there exists a branch in just an arbitrary block of code and so before anything happens at all there's this sort of magic eight ball that the
bpu shakes and it goes there will be a branch in the next fetch it goes brilliant okay well the
next one we're not going to get from where we think it's going to come from we're going to get
from like 100 off we go it's just so clever so clever and so much so much um of this stuff that
we take for granted we just write code
and it runs and largely we can ignore it which is a testament to to uh where we are but i have a
cool little example where you can demonstrate the effects of the branch predictor even in python
right you can write a relatively straightforward piece of code that runs i'm not going to say a lot faster but
measurably reliably measurably faster one way than another and they're identical workloads
it's just whether or not the branch is predictable that's in the heart of it and that's like 10
layers deep inside the Python interpreter, for heck's sake. So it does affect us all. It's just interesting to know about, and very rarely does it actually make a difference, but it's cool to know that it's there.
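A sketch along the lines of the example Matt alludes to: identical work over sorted versus shuffled data, so the only difference is how predictable the branch is. This is an assumption about what his example looks like, not his actual code, and the exact numbers will vary by machine and Python version.

```python
# Same workload twice: count values below a threshold. The only difference
# is whether the branch in the loop is predictable (sorted data) or not
# (shuffled data). On many machines the sorted pass is reliably a bit
# faster, even through all the layers of the CPython interpreter.
import random, time

data = [random.randrange(256) for _ in range(2_000_000)]
sorted_data = sorted(data)

def count_small(values, threshold=128):
    total = 0
    for v in values:
        if v < threshold:       # the branch in question
            total += 1
    return total

for name, values in [("shuffled", data), ("sorted", sorted_data)]:
    start = time.perf_counter()
    count_small(values)
    print(name, round(time.perf_counter() - start, 4), "seconds")
```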
Well, I feel like I've learned a lot today. It's been fun. And I mean, if people are interested in hearing more of this stuff, we can
dig into something else along this line another time. This is... there's like two or three
other podcasts that i'm sure are going to come out of this. This is really good stuff.
I'd like to shout out as well.
There's a friend of mine's podcast who he and a friend are doing far, far deeper and far more authoritative podcasts on these kinds of subjects. So if you're interested in this stuff, I would heartily recommend going to TLBHIT, which is TLBH.IT, and subscribing to J.F. Bastien's and I forget his co-host's name.
But their podcast is brilliant.
But ours is also going to cover some of these topics, hopefully.
Maybe just in a different way and to a different kind of audience.
Well, you're definitely getting the layman's
perspective on some of this.
Hopefully I'm doing that Larry King thing
where I ask the questions that people are wondering
in their heads, but we'll see.
Fantastic. Well, I guess
until next time. Until next time.
You've been listening to
Two's Complement, a programming podcast by Ben Rady and Matt Godbolt.
Find the show transcript and notes at twoscomplement.org.
Contact us on Twitter at @twoscp. That's @T-W-O-S-C-P.
Theme music by Inverse Phase.