Two's Complement - CPUs are Clever

Episode Date: August 19, 2021

Matt and Ben discuss the kinds of things modern CPUs do behind the scenes. Branch prediction, caching, speculation, out-of-order processing, hyper-threading, register renaming... Lots of things most people don't need to know. Matt gets overly excited, and Ben channels Larry King to try and keep him on track.

Transcript
Starting point is 00:00:00 I'm Matt Godbolt. And I'm Ben Rady. And this is Two's Complement, a programming podcast. Hey, Ben. Hey, Matt. How you doing? Good. So I've been thinking recently: there's an awful lot of
Starting point is 00:00:27 cool things that processors do that I don't think people are aware of except when maybe they come along and ruin their day because they hear things in the press like Spectre or Meltdown or other side channel attacks. These scary sounding things that probably aren't quite as scary as you think. But there's a lot of technology and a lot of very cool and clever development that's gone into CPUs. And we have a podcast and I love talking about this stuff.
Starting point is 00:00:57 So I figured, let's just write down "Matt talks about CPUs" and then see what comes from that. There's a whole bunch of things that CPUs do that I don't know about, so I feel like I'm going to learn a lot in this podcast. So many years ago, actually, I did a lightning talk, which was something like "five things you didn't realize your CPU does for you." Oh, yeah.
Starting point is 00:01:19 It was a fun one to do because it was for an audience of people that weren't necessarily low-level engineers, as indeed I'm sure many of our listeners are not necessarily interested in all the low-level things. Oh, both listeners. Sorry, there are two. We have at least two. Exactly. I tried to come up with this format that I thought was interesting because they were only five minutes long anyway. But just phrasing it in terms of your CPU, even if you have only the vaguest understanding of like assembly code, machine code, or whatever's actually going on from the point at which you're typing your code to the point at which something useful happens in your computer, you've probably
Starting point is 00:01:58 got this idea of instructions that move registers around like MOV EAX comma 123, that kind of thing. Or if you're an ARM, something similar. Or, you know, you've maybe done some, you know, RISC thing at university when you've been forced to take some kind of assembly course. And you kind of have a rough idea about the operations that a CPU can perform. I think, okay, there's an opcode identified
Starting point is 00:02:23 by probably an integer for each of these instructions. And they take some number of arguments, not totally unlike a function call. It's a great way of thinking of it. And they are things like, you know, move this data from memory into this register. Take this register and this register, add them together, and put the result in this register. And that's sort of my, you know, very basic mental model of how assembly works, and all of the basic operations of the CPU are represented by those commands. So I think that's a really useful way of thinking about the way the computer works. You mentioned,
Starting point is 00:03:01 you know, like the arguments are like the parameters to the instruction, as if it's like a function call. And I think the only thing I'd add to that when I mentally model it is that the registers are global variables, right? At the end of the day, the CPU only has global variables, which is pretty horrible and pretty nasty. But one of those five things I was talking about that you didn't know your CPU does is that your CPU is kind of a JIT compiler. It doesn't really run the machine code you give it; it only maintains the illusion that it does. The instruction set architecture, the ISA, the machine code that you're giving to it, is an API boundary. And provided the CPU appears to do what you would expect it to do when you do mov eax comma one two three four, it is up to the CPU how it achieves that goal. It's completely an abstraction layer: what goes on underneath is entirely up to the CPU. Now, you might reasonably think, like I do, because my mental model is sort of stuck back in the 1970s and 1980s of CPUs, that there is a little box that says EAX somewhere inside the PC, and when I do mov EAX comma one two three, then the number one two three goes into that box. Right, that's what happens, right? But that's not what's happening at all anymore. Unfortunately, things have gotten far more complicated, and the first thing that a modern CPU will do is take that expression, effectively, that you give it, this instruction, and devolve it into smaller instructions called micro-operations. And those micro-operations are what the actual core inside is going to work on. Now, in the case of that mov instruction we were just talking about, it probably doesn't have anything more to do than that:
Starting point is 00:04:42 it's like the operation is about as small as you're going to get. But when you're doing something like a move from memory, there are several things that need to happen. And while we're here, let's just talk about a pet peeve of mine. Why on earth is it called move? It's always move, isn't it? It's mov this, comma, that. There's no moving.
Starting point is 00:04:59 It's not like the old value has gone away. It's copy, right? When I say move the value one two three into EAX, I haven't actually deleted the value one two three from the rest of the world. It's a copy; I can't even think of the right word without saying the word move. But anyway. So if you're, say, acting on memory: the Intel instructions give you an awful lot of flexibility to, say, do an add instruction where one of the operands is a register and the other operand is some complicated expression that says, read from this memory address, offset by this other
Starting point is 00:05:35 register multiplied by four, because you're going to use it to index into, like, integers, and each integer is four long. There's kind of three things going on there. One is the actual addition. Another one is the calculation of which memory address was it that you actually wanted to go and read from. And I'm not talking about virtual memory here. I'm literally talking about the instruction was ESI plus 100 plus EDI times four. And you're like, well, that's a complicated thing. And then third of all, there's the actual going to fetch the memory that is at that address. And so those really are split into three operations. And it's not just to make it simpler on the architecture underneath.
Starting point is 00:06:13 The core thing that's happening is that those micro-operations can be broken apart and then executed separately and individually. Which means that if, for example, the memory unit is free but the adding unit is busy doing something else, then you can go off and do the memory address calculation, go and fetch that data, hold on to that temporary result, and then when the adder becomes free you can just do the add. You don't have to wait for all three units to be ready together to do the operation.
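To make that concrete, here is a rough sketch in Python of how that one add-from-memory instruction might break into micro-operations. The unit names, the temporary register, and the uop format are all invented for illustration; they are not Intel's real internal representation.

```python
# Illustrative sketch only: one complex x86 instruction,
#   add eax, [esi + 100 + edi*4]
# modelled as three simpler micro-operations that can be handled by
# different execution units. All names here are made up.

def decode_to_uops():
    return [
        ("address-gen", "tmp_addr <- esi + 100 + edi*4"),  # work out the address
        ("load",        "tmp_val  <- memory[tmp_addr]"),   # fetch from memory/cache
        ("alu",         "eax      <- eax + tmp_val"),      # do the actual add
    ]

for unit, uop in decode_to_uops():
    print(f"{unit:>11}: {uop}")
```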
Interesting. So is this like an extension of Moore's law running out, where they couldn't make the processor any faster, so they just started giving us more cores, and now they're like, well, what if we take the operations and do those in parallel too? It is exactly that, yeah, except the sequence is the other way around, really.
Starting point is 00:07:01 As Moore's law was sort of reaching the end and we hit the four-gigahertz-ish wall, instead of making CPUs faster and faster in terms of clock speed, the cunningness went along multiple dimensions. And parallelism is something that silicon just does; it's hard to stop it from being parallel. In fact, a lot of chip design is trying to stop it being parallel where you don't want it to be, and handling the crazy race conditions. I mean, I've never done silicon, but in terms of FPGA design, it's always the timing, it's always the synchronization points. And if you think you've got problems when you've got multiple threads accessing stuff, try imagining you've got two million individual gates that are all doing their own thing. So the first axis: back in the day, and we're going to go back to the 6502 era here, it took multiple clock
Starting point is 00:07:59 cycles to execute a single CPU instruction. So, you know, the classic: fetch the instruction, decode the instruction, execute the instruction, and then sometimes there's a write-back stage and a retire stage, or something like that. Those are the pipeline stages that an instruction went through, and it went through them one at a time, which was cool. But the obvious observation is that it's a pipeline. You can be fetching one instruction, and then you can be decoding the next one and executing the one after. Oh, right, sorry, the other way around. The classic pipeline, which is fantastic. And it meant that a lot of things that were taking five cycles to complete, if you imagine a five-cycle pipeline, well,
Starting point is 00:08:42 for every individual instruction it does indeed take five cycles to get from one end of the pipeline out the other, but every single tick of the clock you're doing one more piece of useful work. Just like, you know, a modern car production line: it might take two weeks for any one car to be assembled from a bag of parts that are put in one end and come out the other, so if you wanted that particular car spray-painted red you'd still have to wait the whole time for it, but every day a whole new car rolls off the end of the production line. So you've got this latency versus throughput trade-off going on. And so we were already taking steps in that direction back then. Even, like I said, the 6502 does a bit of this.
Starting point is 00:09:23 Actually, as it happens, the 6502 had a very, very simple pipeline where, on the last tick of an instruction, it was already fetching the next instruction. Which has some interesting side effects if you're ever going to emulate it, but we've done that to death.
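As a toy model of that latency-versus-throughput trade-off, here's the classic five-stage pipeline arithmetic in Python. The five stages and the one-instruction-per-cycle assumption are textbook simplifications, not a description of any real CPU.

```python
# Toy model: a classic 5-stage pipeline (fetch, decode, execute,
# write-back, retire). Purely illustrative numbers.
STAGES = 5

def cycles_unpipelined(n):
    # Each instruction finishes all five stages before the next starts.
    return n * STAGES

def cycles_pipelined(n):
    # The first instruction takes five cycles (latency); after that, one
    # more instruction completes every cycle (throughput) - the car factory.
    return STAGES + (n - 1)

for n in (1, 10, 1_000):
    print(f"{n:>5} instructions: {cycles_unpipelined(n):>5} cycles unpipelined, "
          f"{cycles_pipelined(n):>5} pipelined")
```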
Starting point is 00:09:41 So there you're getting a sort of parallelization out of a single stream of instructions, which is great. What tends to happen, though, is that that only works while you keep going in a straight line. It's like when you finally get to the last assembly step, you look at that part of the instructions and it says, oh no, make this one red. And you're like, oh, but it's already been all the way through and we're nearly at the end. And the analogy I'm trying to make there is that's like a branch that's taken or not taken. You've made a decision, and it takes a few cycles for the decision-making process to come to its conclusion and say, ah, actually, we need to
Starting point is 00:10:28 be doing something different here. But the pipeline is full of the other instructions that would have happened if you had just carried on in a straight line. So at that point, what used to happen was you'd have bubbles in the pipeline: those fetched and decoded instructions that weren't going to be executed were just flushed away, thrown away, and then the whole thing would start again. So it would take several clock cycles before useful work could be done again, and that's what made branches expensive back in the day. And that's cool. So what happened next was the concept of making a guess: hey, if we can make a guess at where the branch is going to go ahead of time, then we can steer the fetcher and the decoder in a particular direction, and then, as long as we check our guess when we get to the execute stage, if we were right, hey,
Starting point is 00:11:18 no harm, no foul. We carry on: the instructions that we've fetched and decoded are ready to go, and we don't have to have this bubble in the pipeline. And if we get it wrong, it's no worse than how we were before: we have to throw away the pipeline and flush it. Yeah.
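One classic textbook way to implement that guess is a two-bit saturating counter per branch. Real predictors are far more sophisticated, so treat the following purely as a sketch of the idea, not of what any shipping CPU does.

```python
# A two-bit saturating counter: states 0-1 predict "not taken",
# states 2-3 predict "taken". Each real outcome nudges the counter,
# so a single surprise doesn't immediately flip the prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start out weakly predicting "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

predictor = TwoBitPredictor()
history = [True] * 20 + [False] + [True] * 20   # a loop branch that exits once
correct = 0
for taken in history:
    correct += predictor.predict() == taken
    predictor.update(taken)
print(f"{correct}/{len(history)} predictions correct")
```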
Now, as it happened, about the time that branch predictors were coming in and becoming more and more important, the pipelines were also getting longer and longer and longer, for boring technical reasons to do with actually speeding up the clock speeds overall. Obviously, if you're making something go faster and faster and faster, it's better to have smaller and smaller pieces of work to do,
Starting point is 00:11:54 which in this pipeline analogy means having simpler and simpler steps in the pipeline, but potentially more of them to achieve the ultimate goal. So now we're looking at latencies on the order of tens of cycles between the fetch that picked up the instruction and the branch instruction actually running and determining: yes, we are going to take the branch, or no, we're not going to take the branch. So the interesting side effect of having a branch predictor,
Starting point is 00:12:26 and we could talk to death about how they work, and they are amazingly, amazingly good these days, is that the core, the sort of engine that's churning out all of the actual work, the stream of micro-operations (remember, we've turned these instructions into micro-operations), is just given a sequence of work to do with no branches in it, effectively. And it can be filling that up; that could be a huge amount of work. It can be hundreds of instructions ahead of itself, potentially, with a long enough pipeline. But also, what that gives it an opportunity to do, and this is where I'm going to go back to the analogy I made of a JIT compiler, just-in-time compilation, is some analysis of the stream of instructions,
Starting point is 00:13:11 and it can say: this particular instruction that's later in the stream doesn't depend on an earlier instruction. And this is all happening in the silicon? This is in the hardware? Completely in hardware, absolutely. Wow. And so, effectively, a DAG is being generated of all the interdependencies between instructions. And instead of running the instructions strictly in the order that you wrote them down, it is running these micro-operations when they are ready to be run. So in the case of, say, a divide: divide is a classic example of a really lengthy instruction. I'm dividing two numbers together; it's going to take a good dozen to two dozen cycles before I
Starting point is 00:13:52 get the integer result of that divide. Meanwhile, the next instruction was just, yeah, mov eax comma 12, because I'm setting up to do some other piece of work. Well, that instruction can run. It can jump the queue, because it has no observable side effects that the divide needs. And it's not like it's moving the result of the divide anywhere, right? It's just moving some other unrelated thing. But even if the next instruction was moving that result, that instruction would be blocked, and the CPU would continue to look down the stream of instructions for something that didn't depend on anything that was currently busy. Right. And it's just amazing to think that that can be done in silicon, quickly.
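Here's a sketch of that queue-jumping in Python: a made-up three-instruction stream where a slow divide doesn't hold up an unrelated mov, but does hold up the add that needs its result. The latencies and the scheduling model are invented purely for illustration.

```python
# Toy dependency tracking: each "instruction" lists the registers it reads
# and writes, plus a made-up latency. An instruction can start as soon as
# all of its inputs are ready, regardless of program order.

program = [
    # name             reads            writes    latency
    ("div ecx, ebx",   {"ecx", "ebx"},  {"ecx"},  24),  # slow divide
    ("mov eax, 12",    set(),           {"eax"},   1),  # independent: jumps the queue
    ("add edx, ecx",   {"edx", "ecx"},  {"edx"},   1),  # needs the divide's result
]

ready_at = {}  # register -> cycle at which its value becomes available
for name, reads, writes, latency in program:
    start = max((ready_at.get(r, 0) for r in reads), default=0)
    finish = start + latency
    for reg in writes:
        ready_at[reg] = finish
    print(f"{name:<14} starts at cycle {start:>2}, finishes at cycle {finish:>2}")
```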
But wait, there's more.
Starting point is 00:14:38 One of the problems that you hit very quickly, if you're looking for more work to do, looking for these dependency chains, these DAG nodes that don't relate to each other so that you can get as much parallel work being done as possible, is that very quickly you get bottlenecked on those registers, those global variables. If you imagine, again, going back to the 6502, you basically only had one register: the A register. The X and Y registers were over there as well, and they were used for indexing, but let's assume you had something akin to only one register. And, you know, we issued that divide, right? So that divide happened and the result's going to go into the accumulator. And then the next instruction was, say, store that result in some memory location,
Starting point is 00:15:18 and then the next instruction was: now load a different value into the A register, because I need to go and do some additions. Okay, right, well, at that point I have to stop, because I can't write into that accumulator register; the divide's using it right now. So we grind to a halt again. What happens as the instructions, the micro-operations, flow through from the decoding unit into this out-of-order system is that, as well as looking at the DAG of dependencies between them, it's also rewriting those registers that you gave it, and it's using tons of internal temporary register names. So in the example I just gave, that first divide is going to write into, like, A-sub-zero, with a little subscript zero. But as soon as I mov a value into A, I've completely
Starting point is 00:16:05 obliterated what used to be there. I've just moved another value into it, and at that point the CPU goes, well, that's a different A from the last A: I last used A-sub-zero, now I'm going to use A-sub-one. This is a trick that compilers will do themselves, right? As compilers are looking at stuff, they'll be like, well, I can reuse this, or I can't reuse that. But here it's happening in silicon. So now we've got A-sub-zero and A-sub-one, and suddenly they're two independent instruction flows again, and now they can be parallelized. And this is, again, parallelism within a single thread of execution on the CPU. And that's so cool, right?
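A little sketch of that renaming trick, with invented names rather than anything a real renamer produces: every write to an architectural register mints a fresh internal name, so reusing A for unrelated work stops looking like a dependency on the divide.

```python
# Toy register renaming. Reads refer to the current internal name of a
# register; every write creates a brand-new one (A0, A1, A2, ...).
from collections import defaultdict

next_id = defaultdict(int)   # how many internal names each register has used
current = {}                 # architectural register -> its current internal name

def internal(reg):
    if reg not in current:                 # first sighting: name the incoming value
        current[reg] = f"{reg}{next_id[reg]}"
        next_id[reg] += 1
    return current[reg]

def rename(dest, srcs):
    renamed_srcs = [internal(s) for s in srcs]    # reads use current names
    current[dest] = f"{dest}{next_id[dest]}"      # a write mints a fresh name
    next_id[dest] += 1
    return current[dest], renamed_srcs

program = [
    ("A", ["A", "B"]),  # A <- A / B : the slow divide, result becomes A1
    ("M", ["A"]),       # store A    : reads A1, so it waits for the divide
    ("A", ["C"]),       # A <- C     : unrelated reuse of A, becomes A2, no dependency
]

for dest, srcs in program:
    new_name, new_srcs = rename(dest, srcs)
    print(f"{new_name} <- {', '.join(new_srcs)}")
```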
Is it fair to say that this is just another classic example of how the solution to so many things in computing is another layer of indirection? It's sort of like we have these backward-compatibility reasons why we can't just change all these instructions, right? But we don't want the hardware to look like this anymore.
Starting point is 00:16:56 Right, right, right. So now what are we going to do? Okay, well, we'll just pretend like we're executing these instructions, and underneath the covers we can do whatever we want. And that is, you know, sort of a classic programmer trick, right? Like, I have some API that the guy before me wrote, and I hate it, but I'm stuck with it, so underneath I swap in something different that's much faster, or much more efficient, or fits the architecture that I have, or is cheaper. So now assembly is just basically an API at this point, and we're just using it to drive all this machinery. And, as you say, it sort of has to be backwards compatible back to, like, 1970s-era code. Yeah. So that's a really interesting observation, and it's one that Intel themselves made, and they decided, well, what would it look like?
Starting point is 00:17:49 And I apologize for those who probably know a lot more about this than me, but they decided to investigate what would it look like if we actually switched to something which was more amenable to easy execution on the CPU, but doesn't require all this trickery. And they came up with the Itanium which was a very long
Starting point is 00:18:10 instruction word chip with thousands of registers that you could actually access or hundreds of registers that you could access. And by very long instruction word that means that each instruction, I'm putting little air quotes here, is actually a bundle of multiple instructions that will run concurrently.
Starting point is 00:18:27 Oh, interesting. So they exposed, they peeled back the abstraction layer and they said, what if we just let you run what is essentially our microcode directly? Kind of, yes, yeah, exactly. And they said, let's invest in smart compilers that can work out how to rearrange and pack the code that you've written into these instructions
Starting point is 00:18:46 Yeah. But it turned out to be a dead end. Either the compilers just weren't up to the task, and given how good compilers are, I suspect that's not it, or else what you don't get with that approach is any kind of dynamism. There's a reason why I keep using a JIT as an example: a JIT has extra knowledge at the time, it has the dynamic environment. Like, if you look at how the JRE works, and I'm switching analogies left, right, and center here, it's very hard to statically decide whether or not something should be done one way or another, but it can be trivial to make that decision when you have dynamic information in front of you. Like, which of these two branches is more likely to be taken?
Starting point is 00:19:30 I don't know. Is it the same way that I went the last 80 times? Yes? Well, then this is now the more optimal route to take. Whereas staring at code ahead of time, you've only got so much information. And especially when you add complexities like, you know, I alluded to the divide taking forever: sometimes those divides are dependent on the data that's being put into them, so you've got a data dependency. Even if you're dividing two 64-bit numbers, it can take
Starting point is 00:19:55 fewer cycles if the actual magnitudes of the numbers are smaller. So there's a huge amount of dynamism there. Another thing that's very hard to predict is memory latency. We've got so many layers of caching now, because RAM has never really kept up with the speed of the CPU, so you don't know if something's going to be in cache or out of cache. And statically trying to predict that and pad your VLIW code accordingly, it's really, I'd say it's almost impossible.
Starting point is 00:20:24 Got it. So is it the case that breaking the abstraction here has then constrained what you can do underneath the abstraction, and therefore made it less efficient? Because now you're trying to make promises that may not hold up in the dynamic environment of actually running the code. Right, right. Like, you're sort of implying that certain things will be fast, or certain things will be slow, or certain things will be more memory efficient, when they're not
Starting point is 00:20:53 necessarily, because it just depends. But if you've pulled back that abstraction, now the compilers, which in this case are the things producing this code, can't really cope; they're just sort of too smart for their own good. Is that a reasonable way to think about it, or am I oversimplifying it? No, I think so. And, you know, whereas a software JIT can also take advantage of this information, like I said, you can observe the pattern of the code and make a decision to say, well, I'm going to re-optimize this particular area of code, I don't think that even a software JIT producing code for, say, Itanium could observe enough information at the micro level to take full advantage of this. You know, nowadays the CPUs have
Starting point is 00:21:48 hundreds of out-of-order buffer entries, so they can be two or three hundred instructions ahead of themselves. So if there's that one slow memory load that you're stuck with, as long as the CPU can find enough useful work to do that can be predicted, and those are a lot of ifs, but again, all these things are getting very clever, then that one slow load isn't going to block you from doing too much work; you're going to be able to tear through. And in fact there's another sort of cunning trick. So, switching back from this Itanium pre-compiled solution to the problem, and thinking about what a single CPU actually looks like: we've got the front end, which is
Starting point is 00:22:35 this thing that's going to do the fetch and the branch prediction. It's going to fetch, it's going to decode, and it's going to generate a bunch of micro-operations. And then there's going to be some kind of magical renaming, which is the thing I alluded to, where it rewrites the registers you're actually using into internal ones you can't even address; I can't even say the names of these, they're totally internal to the CPU. Yeah, they kind of get put in the temp directory of registers. It's exactly that, that's exactly what it is. But those all get thrown into a big pot of, here's all the instructions I would ideally like you to run, and now the out-of-order core is going to just pick off the ones that can be run.
Starting point is 00:23:08 And then finally, they get completed in order again. There's a big master list of, well, this is where we're good to: these things have actually happened. Anything beyond this mark may have completed out of order, but we have to make it look like it happens in order. And that's where, for example, even if the CPU has run ahead of itself, but there was a page fault or a seg fault on an instruction,
Starting point is 00:23:33 which maybe takes a while for it to discover, the seg fault doesn't happen until the instruction that caused it retires, and then it aborts everything after it that hasn't yet retired. So that's kind of how that magical time-traveling aspect happens. Building a software emulation of this seems like years of work to me. It's just mind-blowing that there's physical hardware that is making all these kinds of decisions.
Starting point is 00:23:56 That's just incredible. It's really quite an astounding achievement. And the most interesting thing to me is that this information is hard to come by. Intel deliberately doesn't tell you very much about what's going on underneath this layer, because they want to reserve the right to change it. Quite reasonably, their API and contract with you sort of ends at the ISA, at the machine code level. Right, right. Well, yeah, I do the same thing, exactly: don't look under the covers, because I might want to change my mind later. Exactly. But, you know, Hyrum's Law, the law that any observable side effect will be used by somebody and become relied
Starting point is 00:24:33 on by somebody, comes into play. But yeah, the interesting thing is that all this information is inferred, partly from information that's in the manuals you get from Intel, but mostly from an awful lot of reverse engineering by a community of dedicated engineers who are using the performance counters and seeing when things tick up. You know, how often does the front end re-steer? Intel doesn't really tell me what that particular counter is measuring, other than to try and keep it as low as possible, but I can run my code and see under what conditions it happens, and I can start building a theory. And the people there, there's Travis Downs and Agner Fog and Peter Cordes, there's a few really interesting people. I mean, Agner Fog, for example, he's a Danish professor of anthropology, which is like the strangest hobby to have. But I mean, who am I to throw stones in that respect?
Starting point is 00:25:26 But yeah, it's really interesting that this is not as well understood, even among the people who are really, really interested in it. I mean, are there intellectual property reasons why? I can imagine. There are security considerations.
Starting point is 00:25:41 Yeah, that too. You know, I mentioned Spectre and Meltdown at the very beginning as being the scary reason why people might have heard of this. Well, the mechanism by which they attack is absolutely core to that thing I was just talking about. When a fault happens, it sort of erases everything that happened after the fault. But we know that several hundred instructions might have actually run, based on speculating that the fault wouldn't happen.
Starting point is 00:26:09 And then the fault did happen. Now, as I said, this retirement sequence happens strictly in order. So until an instruction has retired, it's as if it didn't really happen, so we can throw that away. So you're like, well, how can this possibly be a problem, right? Although we may have done some work beyond the fault, it will get thrown away, so it's not visible to me, I can't see past it. Right? That is unfortunately not true, because also in this system are the CPU's data and instruction caches, and those can't be undone.
Starting point is 00:26:45 Any changes that happen in them, and now I'm not talking about data that's changed inside the caches because it won't actually have been committed, but if somebody has read from main memory at a particular address, that will now be in the cache, that value will be in the cache.
Starting point is 00:27:00 And now I can, after that fault has happened, I can go back and say, how quickly can I read this memory address? Oh, it only took three cycles. Therefore it must be in the L1 cache. And the only reason it must be in the L1 cache is because my little targeted attack that I did over here
Starting point is 00:27:15 caused the speculation system to get way ahead of itself and pull some information into the cache. Now, I don't know what that information was, but if I can contrive a situation in which the address of the read corresponds to a secret piece of information that I shouldn't otherwise have access to, I now have a way to sneak the values out just by doing timing on the cache. And this is a very big topic, and I'm really not doing it proper service here. But that's the kind of mechanism: the dirty secret is that the undo buffer we have for the instructions, for, whoops, I didn't mean to do those things, can't undo the side effects of things being read into the cache.
Starting point is 00:27:53 So this is like you have some data structure where the address in memory tells you something important about the data itself, you know, like a hash or something like that, where this value gets written to index zero, this one gets written to index one, index two, and so forth. I guess it's almost like a, what is that called, a Bloom filter? There are schemes like that. I mean, you can actually do a much simpler thing, if you can control the code that's going to run.
Starting point is 00:28:31 And so the canonical example of that is the safety mechanisms around, say, a JavaScript JIT engine, right? You're doing an array access, and you go out of your way to read within the bounds of the array over and over and over again. And now the branch predictor has learned that you never, ever trip outside of that check
Starting point is 00:29:00 that you know must be in the JIT somewhere, the one that says make sure they don't read outside of this array. Right? I've made myself a 256K array and I've sat there reading from it a million times, and so now, as far as the CPU is concerned, the exceptional case is cold: the case where, inside the JS engine, it's going to throw an exception to say out-of-bounds access, or it's going to return undefined, or whatever it's going to do. But it has to do something different. Now, I happen to know that at some giant offset from my buffer, if I were to actually try and read there, I'd be reading some protected piece of memory
Starting point is 00:29:37 that I shouldn't otherwise be able to access, like something outside of my sandbox. And so what I'm going to do is read my million times inside the array, and then I'm going to suddenly read out of bounds. And as soon as I've got that value out, which I won't, which will never actually happen, but the speculation will get that far, I will then take that value and immediately use it to read one of 256 different cache lines. And then I hope that I can do that before the
Starting point is 00:30:07 branch predictor goes, whoops, I didn't mean to go this way, and then it all gets undone, and I get my undefined return and I go, oh well. Cool, so you're using the value that you read out to do another read? Right, exactly. Which now will
Starting point is 00:30:23 show up in your analysis, essentially, as: oh, this one was really fast. So, like, you're going to read some value that you don't know out of memory, and let's say it's three, but you don't know that, right? And then you're going to read at whatever index that value gives you into memory, right? And that turns out to be three. And now three is fast.
Starting point is 00:30:42 Exactly right. So now you know three is fast, and therefore the secret value was three. That's right, exactly that. And there are different ways of achieving this. And that's called a side channel attack. It's a side channel: a way of exfiltrating information that isn't through the architectural state of the computer.
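Here's a heavily simplified simulation of that leak in Python. Nothing below does real speculation or touches protected memory; the "cache" is just a set and the "timing" is just a membership test, purely to show how the identity of the fast line gives away the secret.

```python
# Toy model of the cache side channel described above.
CACHE_LINE = 64
cache = set()     # which lines of our probe array are "in the cache"
secret = 3        # the byte the attacker should never be able to see

# 1. The speculative, later-squashed work: the out-of-bounds read happens,
#    and the secret-dependent probe read drags exactly one line into the cache.
cache.add(secret * CACHE_LINE)

# 2. The architectural state gets rolled back, but the cache doesn't.
#    The attacker now "times" a read of each of the 256 candidate lines.
def read_is_fast(addr):
    return addr in cache   # stand-in for "this read took ~3 cycles, so L1 hit"

leaked = [i for i in range(256) if read_is_fast(i * CACHE_LINE)]
print("fast cache lines:", leaked)   # -> [3], and the secret has escaped
```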
Starting point is 00:30:57 Got it. So we got a bit derailed there. Not that we have any rails in any of these things. But one thing I do want to talk about is going back to that big soup of out-of-order execution: we had the front end, we decode a bunch of stuff, we get micro-operations, they get thrown into this giant soup, they've been rewritten to use the internal registers, and now we're trying to use, you know, we've got an adder, we've got a multiplier, we've got a divider, we've got multiple of these things. They're called execution ports; there's usually between five and eight of them for different things,
Starting point is 00:31:28 and they can do different things. But the goal of the out-of-order system is to try and fill up all these parts of the silicon with useful work to do. And as we've said, one of the ways it can do that is to get way ahead of itself by predicting branches and starting to do loads of work that maybe is totally speculative. Another thing it can do is rewrite the registers so that more seams of independent work can be found in a single stream of instructions. But there's one more clever trick you can do. For very little silicon cost, what if, instead of building a whole other CPU on the side, I just have another set of registers? Not even the internal rewritten ones, just the external registers. They do have
Starting point is 00:32:21 to exist, you know. I said at the beginning there isn't really a box called EAX with a one-two-three in it. Well, there is one canonical one; in fact, what that retirement stage is doing is writing the real value of EAX back into the actual, real EAX register somewhere. But anyway. For the small cost of having another program counter and another set of all of the registers, of which there are 16 or 32, I can get another thread, and I'm putting air quotes around this. So what if I were then to say to the front end: every other cycle, read from either the first program counter or the second program counter, and when you read from the first program counter, tag it to say, these are different registers
Starting point is 00:33:06 from the other program counter. So I kind of have two sets of front end state. So now every other cycle, I'm reading two completely independent streams of instructions from two different physical places in memory. Once they've been through the rewriter, we know that they won't be able to talk to each other.
Starting point is 00:33:24 So, like, stream one's EAX is a different physical register from stream two's EAX. But once they've been rewritten to internal register one-two-three-four or one-two-three-five, whatever, it doesn't matter. At that point we've got twice as much work to do that is guaranteed to be independent, because it's from two completely different streams of execution, but I haven't got any more adders or multipliers or dividers or
Starting point is 00:33:51 anything. I just throw that into the big soup in the middle of the out-of-order execution system, and naturally we've got two non-overlapping DAGs being processed. And provided we just have the tiny bit of state that says the ultimate result of this needs to go back to EAX in hyper-thread, and I'm going to call it that now, hyper-thread one or hyper-thread two, we've got twice as many sources of independent work to do, for very little cost. So, very naively here, if I had two programs, one of which was a very long series of adds and the other a very long series of subtracts, you could mix those two instruction streams together and know that they would never overlap, and basically do them all in parallel, because you know they're never going to do two adds on the same register at the same time. Correct. Now, adds and subtracts happen to fall into the same category, because they're exactly the same operation when it comes down to it. But,
Starting point is 00:34:46 like, if you said a multiply and an add, absolutely. Or, you know, one of them is reading through a linked list in memory really fast, whatever. So you get, for free, in inverted commas, or at least for a very small cost of this little extra architectural state, twice as much potential for filling up those seven or eight execution ports with useful work that allows you to make forward progress. And that's what hyper-threading was. So that's another way of getting parallelism. We've got instruction-level parallelism, which is the ability for the CPU to go out of order and say, well, I can run more than one instruction at a time from a single stream. Then you've got hyper-threading, which says, well, I can interleave more than one CPU's worth of instructions, as long as I just do a tiny bit of accounting to make sure they don't get mixed up with each other.
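As a sketch of why the two streams can safely share one pot of execution units: tag every register with the hyper-thread it belongs to before throwing the work into the soup, and the two dependency graphs can never touch. The tagging scheme below is invented; it just stands in for the real renamer mapping each thread's EAX to different physical registers.

```python
# Toy model of hyper-threading's bookkeeping: two independent instruction
# streams, interleaved one per cycle, with registers tagged per thread so
# thread 1's eax can never alias thread 2's eax.

stream_1 = ["add eax, ebx", "add eax, ecx"]   # one thread: a run of adds
stream_2 = ["sub eax, edx", "sub eax, esi"]   # the other: a run of subtracts

def tag(thread_id, instructions):
    tagged = []
    for instr in instructions:
        op, regs = instr.split(" ", 1)
        regs = ", ".join(f"t{thread_id}.{r.strip()}" for r in regs.split(","))
        tagged.append(f"{op} {regs}")
    return tagged

# Interleave the two tagged streams into a single pool of work ("the soup").
soup = [uop for pair in zip(tag(1, stream_1), tag(2, stream_2)) for uop in pair]
for uop in soup:
    print(uop)
```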
Starting point is 00:35:34 Which has, again, led to more of the kind of scary problems, like the Spectre and Meltdown type things. I forget which ones are specific to hyper-thread siblings, but there are some. And then, even within the instructions themselves, we can use single-instruction, multiple-data instructions, where you can treat a single instruction as, well, actually, this is acting on eight independent ints at once.
Starting point is 00:36:03 But that's a whole other topic, I think, for another time. That's another level of parallelism. And that's before we even get to genuine multi-threading, where they said, we can't make the chip go any faster, but what we can do is put eight copies of the chip on a single die. So where are we in the timeline? What year is it?
Starting point is 00:36:22 What year is it? Who's the president? Yes. Yeah, we've talked about all these technologies up to this point; what year is it? That's a great question. So my memory of where I started looking at, certainly, Intel CPUs was the mid-to-late 90s, and in fact I was working for a company who were doing some work for Intel at the time. I remember the very first 233 megahertz Klamath, which was the code name of the Pentium II, got delivered to the office. And even though I was the new guy in the office,
Starting point is 00:36:59 I was tasked with getting it up and running, which was painful, all these strange drivers and things that needed to be installed. I actually ultimately ended up inheriting it and taking it home, which is a whole other story. But yeah, in our lunchtime Quake sessions I had by far the best ping; it was brilliant, because no one else had a computer that was even close. It was wonderful. It didn't make me much better at Quake, but it was a lot of fun at the time. But, you know, that era was sort of the last era where you had any hope of really understanding, or at least getting help from Intel,
Starting point is 00:37:33 about how their CPUs worked. I don't exactly remember when, but at least on the Pentium 1 you were told this rather explicitly: there's a U-pipe and a V-pipe, and if your instructions look like this they can go into the U-pipe, and if they look like this they can go into the V-pipe. And then you could sit down and hand-arrange these pairs of instructions.
Starting point is 00:37:52 So this is very much before the more complicated stuff we've been talking about. Around 2000 was when the NetBurst CPU came out. And that was one of the ones that definitely used microcode by that point,
Starting point is 00:38:07 so, micro-operations. And rather interestingly, that took a sidestep. We didn't talk about the decoding into micro-operations itself, but the whole decoding thing is quite complicated, and so there are caches for that as well. NetBurst took an interesting sidestep down an avenue of something called trace caches,
Starting point is 00:38:25 which is sort of an alternative way of kind of encoding both the branch prediction and the decode of the stream of instructions that happened when you went to that particular place. And so there was a cache that said, hey, the last time you got to address 1, 2, 3, 4, with this prediction of what branches might look like, here's just the pre-decoded micro operations. Half the work's already been done for them. Just start executing those things and only just abort if you hit the wrong one.
Starting point is 00:38:53 If you discover along the way that you've made a mistake, you have to undo it and go back. And the problem with that is it just didn't work as well as the engineers thought it was going to work. So, and at this point, we're still in like 32-bit processors for what it's worth. Around this time though,
Starting point is 00:39:09 AMD were working on a 64-bit extension and there was some kind of argy-bargy back and forth between AMD and Intel about like, you know, who owned it and how they could do it. I think 2001-ish, they came to an agreement over a patent. And I think that was like a sort of unilateral patent agreement between the two.
Starting point is 00:39:31 They wouldn't go after each other for the things. And so in 2004, the first 64-bit processor came out, and it was an AMD, which was quite a coup. I mean, I imagine lots of heads rolled inside Intel when they were beaten. I think it was another couple of years before the core processor came out, and that was 64-bit. So yeah, in those early 2000s is when most of this micro-operation stuff
Starting point is 00:39:55 started to be, to my knowledge anyway, more used. My real deep dive into this stuff starts in the 2006-plus era of the Core, which was a separate offshoot. It was originally a team in Israel who were doing the mobile version, the Pentium M. And when the whole trace cache inside NetBurst seemed to be a dead end, they stepped up to the plate and said, hey, how about we take the mobile core, take some ideas from the trace cache, but move in a different direction? And that seemed to work out pretty well. Yeah, so 2010 was the sort of Sandy Bridge era, which reintroduced, ironically, something called the micro-operation cache, which is different from
Starting point is 00:40:42 the trace cache, but morally in the same sort of ballpark. But it didn't try to be as clever. It wasn't storing entire traces of execution, like, hey, if you get to this address, then this is what you're going to do until you discover you've made a mistake, which is actually genuinely more like what a JIT does nowadays; a JIT will kind of say, hey, this is this bit of the code, we did this last time, let's do that. It was more like a traditional cache, where there's still logic pumping out which instruction should be next, and then it goes, well, rather than go to memory and pull down and decode the instruction, let's just see if that memory address maps to somewhere that's in our micro-operation cache. Oh, it does. Okay, that saves us going out to memory at all and doing the decode.
Starting point is 00:41:23 So it's like a cache on the other side of the decode? Yes. Oh, sorry, I've got a note here, actually: 2003 was the AMD Opteron, which was the first 64-bit x86 machine, and 2004 was the Intel Nocona, which was their 64-bit one. And it would be rude not to mention ARM processors here as well. So the ARM processors started off, as I love to say, as a coprocessor to the BBC Micro, which was a great, fun thing. In fact, the ARM 1, the engineering sample, was a coprocessor for it, and it printed out,
Starting point is 00:42:03 hello, I am ARM, or something like that. And the cool little thing about that story is that they had no idea what the power consumption was going to be like. And after they realized it was booted and running, they realized they hadn't actually turned it on yet. It was running off the I/O pins: there was enough parasitic power coming through the I/O pins from the device it was plugged into that the thing just worked. So at that point I think they realized they were onto kind of a low-power thing, which obviously has been a theme for them ever since; they're in every cell phone ever. But, you know, all the levels of sophistication that have happened in Intel have
Starting point is 00:42:38 happened in ARM. I haven't followed them as closely, but, you know, the M1 is an amazing achievement, and the M2 that's going to be coming soon from Apple, who have licensed the ARM ISA but have essentially completely re-implemented it themselves, with all the same kinds of tricks and the same depth of pipeline and the same... Oh, I didn't realize the M1 came from ARM's legacy. That's interesting. Right, yeah. Well, they have to pay ARM to license the ISA; essentially the encoding of the instructions is what they're paying for, I believe, and the idea of it. Otherwise, you know, I'm sure they would have preferred to do everything themselves,
Starting point is 00:43:14 but there's only so far you can go. Oh, and the other thing about the M series is that they've gone for both a big and a small core, so they've actually gone for heterogeneous cores. On any Intel CPU, if you've got eight CPUs, they're all exactly the same, but in an M1 you've got the Firestorm core, which I think is the big one, and the Icestorm core, which is the small one, and it's like four of each. And so there's some clever power management stuff that the operating system can do to say, well, let's just put these lower-priority things on the lower-power-consumption CPU that isn't as fast, whereas this graphics editing code, or whatever, can run on the other one. And I'm super vague on this, but I just know that it exists and
Starting point is 00:43:54 it's coming, I think, to everywhere. So, yeah, there's an awful lot of things that go on inside that little chip. And we haven't even talked about how the caches work, or how the branch predictor works. I often joke that if we could somehow come up with a way of getting the previous year's worth of lottery numbers encoded as a sequence of jumps, we could probably use the branch predictor to predict the next set, and then I could retire.
Starting point is 00:44:21 Because they are so good. They are just so clever, and the lengths the engineers go to... Obviously this is into trade-secret territory, so reverse engineering those branch predictors is challenging. And the pipelines now are so big and weighty that, like, the old-school idea of what a branch predictor is, is: is this branch taken or is this not taken? That's kind of how I've described it. But really, there's even: is it a branch at all? Right? I've just pulled out 16 bytes of memory, and certainly for Intel: is there a branch
Starting point is 00:44:59 in there, and does it have a destination? Like an unconditional branch, a jump: if I have to wait five or six cycles for it to get down all the various pre-decode steps and all those bits and pieces before it gets to the actual decoder, which then goes, oh, sugar, there's an unconditional jump to 100 here, then I have to re-steer all the four or five steps beforehand. So there's a predictor for the destination of a branch, or even for whether there exists a branch in just an arbitrary block of code. And so, before anything happens at all, there's this sort of magic eight-ball that the BPU shakes, and it goes, there will be a branch in the next fetch. Brilliant, okay, well, the next one we're not going to get from where we think it's going to come from; we're going to get it from, like, address 100. Off we go. It's just so clever. And so much of this stuff that
Starting point is 00:45:44 we take for granted; we just write code and it runs, and largely we can ignore it, which is a testament to where we are. But I have a cool little example where you can demonstrate the effects of the branch predictor even in Python. You can write a relatively straightforward piece of code that runs, I'm not going to say a lot faster, but measurably, reliably measurably, faster one way than another, and they're identical workloads. It's just whether or not the branch at the heart of it is predictable, and that's like ten layers deep inside the Python interpreter, for heck's sake. So it does affect us all. It's just interesting to know about, and very rarely does it actually make a difference, but it's cool to know that it's there.
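Matt doesn't spell out the code, but a common way to show this kind of effect is to run identical work over the same data sorted and then shuffled, so the data-dependent branch is either very predictable or not at all. Treat this as a sketch of that sort of experiment rather than his actual example; in Python the gap is small and varies by machine and interpreter version.

```python
# Same work, same data: only the predictability of the comparison changes.
import random
import timeit

data = [random.randrange(256) for _ in range(100_000)]
sorted_data = sorted(data)          # the branch outcome flips only once
shuffled_data = list(data)
random.shuffle(shuffled_data)       # the branch outcome is effectively random

def count_big(values):
    total = 0
    for v in values:
        if v >= 128:                # the data-dependent branch
            total += 1
    return total

print("sorted:  ", timeit.timeit(lambda: count_big(sorted_data), number=50))
print("shuffled:", timeit.timeit(lambda: count_big(shuffled_data), number=50))
```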
Well, I feel like I've learned a lot today. It's been fun.
Starting point is 00:46:31 And if people are interested in hearing more of this stuff, we can dig into something else along these lines another time; there's like two or three other podcasts that I'm sure are going to come out of this. This is really good stuff. I'd like to give a shout-out as well: there's a friend of mine's podcast, where he and a friend are doing far, far deeper and far more authoritative episodes on these kinds of subjects. So if you're interested in this stuff, I would heartily recommend going to TLB Hit, which is at TLBH.IT, and subscribing; it's J.F. Bastien and, I forget his co-host's name, but their podcast is brilliant. Ours is also going to cover some of these topics, hopefully, maybe just in a different way and for a different kind of audience.
Starting point is 00:47:22 Well, you're definitely getting the layman's perspective on some of this. Hopefully I'm doing that Larry King thing where I ask the questions that people are wondering in their heads, but we'll see. Fantastic. Well, I guess until next time. Until next time. You've been listening to
Starting point is 00:47:44 Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Find the show transcript and notes at twoscomplement.org. Contact us on Twitter @twoscp. That's @TWOSCP. Theme music by Inverse Phase.
