Dwarkesh Podcast - Reiner Pope – Chip design from the bottom up

Starting point is 00:00:00 I'm back with Reiner Pope, who is the CEO of Maddox, which is a new AI chip company. Last time we were talking about what happens inside a data center, now I'm going to understand what happens inside an AI chip. How does the chip actually work? Full disclosure, by the way, I am an initial investor in Maddox. So hopefully you have designed a good chip. Also, if you're listening to this on an audio platform, it's much preferable to watch this blackboard lecture on a platform where you can see what's happening.

Starting point is 00:00:28 So switch over to YouTube or Spotify. So I'll start with sort of the very smallest fundamental unit of chip design, and we'll sort of build up into what an overall like actual production chip. What are the components of that? At the very bottom level of a chip, the primitives that we work with are logic gates, which are very simple things like and or not. And then these are connected together by wires

Starting point is 00:00:53 that have to be laid out physically as metal traceted on a chip. The main function that AI chips want to compute is multiplication of matrices, and really inside that is the fundamental primitive is multiply accumulative just like of pairs of numbers. So we're going to sort of demonstrate what that calculation looks like by hand, and then sort of infer what a circuit would look like for that. It'll turn out to be sort of easiest if I do multiplication accumulative, something like a 4-bit number, with another 4-bit number. And then we're going to,

Starting point is 00:01:31 the actual clearest primitive is actually multiply accumulate. So there's a multiply these two terms, and then we're going to add in, so product of these two terms, and then we're going to add in an 8-bit number. And can I ask a clarifying question? Why is this the natural primitive or whatever computation happens inside a computer?

Starting point is 00:02:02 Yeah, so there's a few reasons for this. It's a little bit more efficient, but the reason it's natural for AI chips is that if you look what's happening during a matrix multiply, what is matrix multiply in very short? It is there's a four loop over I and over J and over K of, output i k plus equals to input i j times other input jk and so multiply accumulate happens at every single step

Starting point is 00:02:47 of a matrix multiply. And then the other observation is that the precision will almost always be higher in the accumulation step than in the multiplication step. This is maybe specific to AI chips, but you're multiplying low precision numbers, but then when you accumulate, errors accumulate quickly, and so you need more precision here. So this is why we've chosen to do a 4-bit multiplication and an 8-bit addition. Let me make sure I understood that. There's two ways to understand that.

Starting point is 00:03:12 One is that the value will be larger than the inputs, and the other is that if it was a floating-point number, it would be maybe that part is less intuitive to me, but it's maybe the same principle. It is really the same principle. I guess the separate principle is that as you are summing up this number, you are summing up a whole bunch of numbers, and so you've got a lot of rounding errors accumulating. Whereas in this case, there's only one multiplication in that chain, and so there's not a lot of rounding errors accumulating in the multiplication.

Starting point is 00:03:43 Why are you something of a whole bunch of numbers? It's just two numbers, right? I mean, this summation happens. It's repeated J many times. Yeah, any errors accumulate. I see. Yep. Makes sense.

Starting point is 00:03:52 So how would we perform this calculation by hand? I mean, as a human, we would probably separate it into two, but we can sort of do it all in one using long multiplication. So the multiplication term first, we're going to multiply this number, this 4-bit number here, by every single bit position in the other 4-bit number. So we write that out. First 1-01 multiplied by this bit position. That is this number itself, then shift it off. across by 1 we're multiplying by 0. That gives us an old 0's number. Shifted across even one more to multiply by this one, we get 1-0-0-1. And then finally for this last-bit position, we get an old 0's number again. So this sort of gives us a bunch of terms that we're going to have to add for the

Starting point is 00:04:52 multiplication. And then while we're doing that summation of this, we might as well add in the actual accumulated term as well. And so we just copy that directly. across. So this is the sum. It's a five-way sum that we're going to want to compute. So firstly, what logic gates did it take us to even get to this intermediate step? We needed to produce all 16 of these partial products. How do I produce one of these partial products? So let's take this, this number one, for example, here. It is one. So how do we produce this number by multiplying this number by this one over here? we can actually produce that by an and gate.

Starting point is 00:05:43 This number is one if both this bit is one and this bit is one. If either of them is zero, then the multiplication of one times one, of zero times anything is zero. So to produce all of this stuff, we ended up consuming 16 and gates. Or in the general case, if I were doing a like a p-bit multiply times a q bit multiply, then this will be like P times Q, many ants. Exactly. Finally, I sum them.

Starting point is 00:06:22 Actually, most of the work is going to happen in the summing. And so let me describe sort of the other logic gate that we use here. And is almost the simplest logic gate that exists on a chip. It's almost the smallest. At the other extreme, typically the very largest logic gate that you'll use is something called a full adder. And what this does is sort of it does like, it kind of kind of, it kind of, it kind of, From software, you might think that a full adder, it adds 32-bit numbers together. In this case, it just adds three single-bit numbers together.

Starting point is 00:06:55 And so you can think of it as like adding 0, 1, and 1 together. Now when I add these together, the result can be 0, 1, 2, or 3. So I can express that in binary using just 2 bits. So as input it has 3 bits and as output it has 2 bits, which in this case are 1, like the number 2 in binary is 1-0. So this is also known as a three to two compressor, because it takes three bits of input and produces two bits of output. The two inputs are an X and a by value,

Starting point is 00:07:33 and then some carry that came in from like... Sorry, the three inputs are all bits that are in the same bit position, like three bits that are in a column here. Yeah, yeah. And then the two outputs, I have sort of drawn them vertically here and horizontally here, to kind of match this vertical versus horizontal layout here, which is expressing that things that are in the same column are in the same bit position, whereas things that are in adjacent columns,

Starting point is 00:08:00 like this is a carryout where this was the sum. So if the inputs in the full ladder are, let's say like 101, then the output would still be one zero. If it was one-one, it would be 0-1, it would be 0-0.00. It was like 0-1-0. It'd still be 0-1. So yeah, it's just counting essentially the number of the number of, the number of things and expressing that in binary. So this circuit actually can sort of capture what we as humans naturally do

Starting point is 00:08:29 when we're doing summing along a column. So I can show that sort of, I'll show sort of one iteration of using the full adder to sum. The way I sum here is going to be a little bit unnatural for humans. Humans, we would sort of sum along the column and then remember the carry. But instead of remembering the carry, we'll actually just explicitly write it out. So in this, we proceed from the rightmost column towards the left. On the rightmost column, we sum the one and the one. And that produces a zero here and a carry of one.

Starting point is 00:09:01 So we've sort of used this full adder circuit on this pair of bits and produced a pair of bits as output. Now we can do the same thing with this column. We've got a column of one, two, three, four numbers. And so maybe we'll take the first three of them, run a full adder on them and that gives us a zero and a zero is output. So like some of these is zero zero. So that's the full compressor, full adder applied to all of these bits. As I've used up bits, I'm going to sort of just cross them out to indicate that I've handled them. Let's just keep going a little bit more. So we'll go here.

Starting point is 00:09:40 I take these three numbers, I add them. That gives me a one and a zero. I've dealt with these three numbers. And now I take one, two, and I can even take these three numbers, for example, right now, and add them, and that gives me a one and a zero, and I've dealt with these numbers. So, I can sort of, like, the way I should view this is that I have this whole grid of numbers that need to be added. I'm going to just keep applying full adders to all the bits that are here, constantly removing three numbers from a column and then writing out two numbers as output. Keep going with this over and over and over and again until I eventually get like some, just one single number coming out here, something like that.

Starting point is 00:10:23 This is probably the wrong sum. So this approach that I've described here, this is called a data multiplier. And this is sort of like the standard for how you do area efficient multipliers using full adders. try and quantify the circuit size of this, just so we have got a sense of like how big things are that, so we can compare it to them later. How many full adders do I, did I use? I started with how many numbers. I have the 16 partial products, which is the product of all of these terms, with all of these terms, plus the eight terms that I'm adding here. So I started off with 24 bits, and then I produced eight bits on the output.

Starting point is 00:11:19 eventually. And in every step, I was sort of crossing off three numbers and writing two numbers out in result. And so every single use of a full adder eliminates one of the bits here. And so how many full adders, it must be the 24 minus the 8. So there were 16 full adders in this circuit. In general, this is true in the general case as well. There will be p times Q many full adders in this circuit. Let me sure I understand the logic of that. So the input bits 24 is p times Q plus p plus Q. That's right. And the output bits is just p plus Q.

Starting point is 00:12:03 And so p times q plus p plus q minus p plus q equals p times q. That's right. So I think this explains sort of, or at least hints that the second reason why we chose to do a multiply accumulate. First reason being that's actually what shows up in matrix multiplication, but second reason being it gave us this very slick p times Q, very simple algebra. So we've sort of described, like this whole procedure, every single atomic step that I took here becomes a logic gate and then sort of the wires connected together. Like when I had these three inputs that I salvaged to produce these two outputs,

Starting point is 00:12:41 like if I think of mapping this to a physical device, it would be a wire that runs sort of connecting all three of these things together into a logic gate that produced this output. Okay, so this is the main primitive at different bitwits that is inside an AI chip. We're going to build up from here to how would you use that to run all of the other operations you may want. This might be the wrong time to ask this question, but whenever Nvidia reports that this chip can do X many FP4 or half as many FP8. It seems to imply that those circuits are fungible, that there's not as dedicated like FP4 versus FP8.

Starting point is 00:13:26 But the way you're mapping it out here, it seems like you would need, if it has to be mapped out in the logic, you would need a dedicated FP4, multiply, accumulate, and then a dedicated FP8 accumulate. Basically, can you funge them? As drawn, they're actually not particularly. fungible. This is actually one of the main choices you have to make when designing a chip, which is how much of FP4, how much of FP8 do I have? And then sometimes I'll make that

Starting point is 00:13:54 consideration from the point of view of what do I think is the customer requirement? Another way to take an angle on that is to say, what is the power budget for, equalize the power budget between FP4 and FP8? But so then when they report those numbers and they just happen to be the case that like it does 2X as many FP4 as FP8, they just happen to choose, like, give equivalent di-areas to all the floating points, and as a result, it ended up being... Like, why is the ratio exactly to X? Yeah, exactly. Yeah. So part of it is, I mean, surely that won't be exactly equivalent to die area. There's a data movement reason, actually, and we'll maybe come back to this when we sort of look through how it goes into and out of memories.

Starting point is 00:14:37 There's something really nice just from a software level of the fact that I can pack two four-bit numbers into the same storage as an eight-bit number, and so when I store that to a memory or something like that. It's the sizing of the buses that I wire within the chip actually makes that work out really really nicely. I actually come to think of it. It's not just 2x. It would be the amount of area it takes it sounds like is quadratic. It's quadratic. It's quadratic in fact. Yeah. With the bit of length. So that's why a smaller precision is like even more favorable than you're nice. This is a really big reason. So, um, In fact, Invidia made a change historically, up until B-100 or B-200, every time you have

Starting point is 00:15:23 the bit precision, you double the flop count. That ratio is exactly, like for the reason you said, because of this quadratic scaling, that ratio is actually slightly wrong. It should be like an even bigger, you should get an even bigger speed up than you might otherwise think. Nvidia's product specs have sort of started acknowledging that in B-300 and beyond, where the FP4 is three times faster than the FP8. Though it should be 4x.

Starting point is 00:15:49 Yeah. What I've shown here is the simplest case of integer multiply. When you are dealing with floating point as you do in FP4 and FP8, there's this sort of other term which is the exponent that just complicates this calculation. So what can we see already from this? I think the big observation you've made is that there's this quadratic scaling with bit width, which is very effective and is the single reason why. low-precision arithmetic has worked so well for neural nets.

Starting point is 00:16:17 But the other thing we're going to do now is we're going to compare sort of the area spent on the multiplication itself with all of the circuitry that is around it. So we'll walk back in time a little bit and see how did GPUs prior to tensor cores work, which is the same way as the way that CPUs worked in fact. So which is like where do we stick this multiply accumulate unit? So generically, I'll describe like a kuda core or a CPU. You'll have some register file, which stores some number of entries. Maybe it's like eight entries of like, in this case, I guess, four bit numbers, but typically like 32 bit numbers or something like that of, which is numbers. So this is the inside the kuda core, I'll have some register file of some depth,

Starting point is 00:17:17 and then I will have my multiply accumulate circuit, multiply and accumulate circuit. And what it's going to do, it's going to take three arbitrary registers from this register file, perform the multiply accumulate, and then write back to the register file. So it's going to maybe write to this one, but it was able to read from this one, this one and another random one, so it'll take three inputs like this. So this is the core. data path of many processes. Most processes look like this. You've got some set of registers, and then you've got some set of logic units or ALUs. We want to analyze the cost of the data

Starting point is 00:18:01 movement from the register file to the ALU and back. So ultimately, there's going to be some circuit that says, well, I don't always have to select this guy. I might select any of the registers in any point in time. And so sort of a first question is, how can I build a circuit? The circuit that I'm going to look for is a MUX. So in this case, it's going to have eight inputs, one from each entry of the register file, and it's going to have one output, which is actually producing this output. And then, like, what is the cost of this thing? It's like all we have to build it out of is and and or.

Starting point is 00:18:43 And so how do we build it? We do the dumbest thing possible. We like form a mask saying, okay, when we want to read like the third entry, we're going to end every single entry with either one or zero based on whether that's what I want to read, and then we're going to or all of them together. Okay, just to make sure I understand the basics. What the mux is doing is just like selecting. Just selecting.

Starting point is 00:19:07 Just selecting an input. Yeah. So like invisible to software is like you say I want input number three, that means there's a mux. And so like what is the cost of this mux? So an input mux operating on p bits, well, I'm going to, so I have n rows, that's this eight rows, and I've got like each row is p bits wide. Well, I have to and every single bit. So I get n times p many and gates.

Starting point is 00:19:41 Every single input I have to say, am I going to like mask it out or not? And then and then I'm going to or them all together. And so there's going to be like n minus one times p many or gates, which is saying I've got all these different things, almost all of them are zeros, but I need to sort of collapse them down into, into, like, from my eight options down into one option. And so every step I need to or, like, one row into an existing row. Got it, yeah. It's actually kind of funny that you would sort of, you don't think at the level of hardware, you sort of just think, like, oh, I'll just select element three, and something as simple as that is sort of, like, in and of itself a quite

Starting point is 00:20:23 complicated circuit. Yeah. I mean, this is the first step of all of the hidden data movement costs that shop. Yeah. And so like the thing like we're just going to like compare like I have to pay this cost and I've got one marks here and then in fact I have two more copies of that for each of the three inputs to my multiply accumulate operation. And so I have this cost which is like like three times n times p and gates over here compared to this p times q like sort of gates in the actual circuit that I that is doing the thing I care about. And if we plug in actual numbers, like this n being 8, I get like 24 times p gates over just in the data movement compared to like if Q is 4, like 4 times p gates just in the in the adder, multiply adder. And sorry, where is the three coming from?

Starting point is 00:21:16 Three different inputs here. Got it. Okay. So the case like really just what I'm hinting at here is that like all of this work, which scales like, as the size of the register file, and this is a very small register file, all of this work just moving the data from the register file to the logic unit is many, many times more expensive

Starting point is 00:21:38 than the logic unit. In the most recent Cluster Max report, semi-analysis ranked almost 100 different GPU clouds. Crusoe was one of only five that made the gold tier. Semianl Analysis found that gold tier providers that Crusoe had a total cost of ownership that was 5 to 15% lower than silver tier ones, even when they had identical GPU

Starting point is 00:21:56 pricing. This makes sense because total cost of ownership is downstream of a bunch of different things that don't necessarily show up in the sticker price, but that Crusoe has optimized, things like how well you detect faults and how quickly you replace failed nodes. For example, Crusoe was one of the first clouds to adopt Envy Sentinel in Media's own GPU monitoring and self-healing software for enhanced GPU uptime, utilization, and reliability. This sets Crusoe makes use of everything that NMedia has learned about why chips fail across all their different fleece and deployments so that Crusoe can catch faults earlier in the process. And once they identify failure, Crusoe can swap in a healthy node in less than 10 minutes.

Starting point is 00:22:34 Because you're not running bare metal, Crusoe doesn't have to spend time installing an operating system or configuring drivers. They can just spin up a new VM on an already running and pre-qualified host. If you want to learn more about this or the other reasons that Crusoe made Goldier, go to crusoe.a.a.com. It might be helpful to just see what a mux looks like, maybe like a two-bit or a four-bit mux. Yeah, great. So we'll take some inputs. We'll have maybe like, we'll just do a two-way mix.

Starting point is 00:23:04 So we've got two different numbers. We've got these two inputs. And then we have a, so these are the inputs that are being selected between, and then we have a selector, which says, which can either be like, I want this one, or it could be, I want the other one. So this is a one-hot encoding. So this is what we all start with. And then the output we want to produce, like, let's focus on this case.

Starting point is 00:23:43 So this is the actual input we got. We just want to produce this guy as the result. And so, like, sort of very laboriously what we do is we and this bit with all of these. And so that produces like anding this bit with this row. with this row. And likewise, we add this bit with this row that produces all zeros. So this was the, there's four ends here, there's four ends here. And then finally we just oar these two together. This gives a one, we oar these two together, this gives a one, we owe these

Starting point is 00:24:22 two together, gives a zero, we owe these two together and it gives a one. And so this is the four oars. So like this actually ends up looking a little bit like addition in fact. Like we we did exactly the same set of ands here. We've added all of these things together, but then instead of collapsing it by using these full ladder circuits, we just get a very simple collapsing with ore gates. But I guess that doesn't look like n times p? So yeah, so this was with n equals two inputs.

Starting point is 00:24:57 Ah, great. In the general case, we will have N, like, and this is n rows, and then we'll have p bits per row. So that gives us the n times p many and gates. So this circuit I've described here, almost all of the cost, like 7 eighths of the cost is in the reading and writing the register file, and only a tiny fraction of the is in the logic you know of itself. So this is the problem to solve.

Starting point is 00:25:35 This essentially was the state of play prior to the Volta generation of Nvidia GPUs. This kind of thing is what was inside the Kuda cores. And this sort of problem statement is what motivated introduction of tensor cores, which are more generically called systolic arrays. So if we think about how are we going to solve this problem? Like we're spending almost all of our circuit area

Starting point is 00:25:58 on something that we just really don't care about and is hidden to the software programmer, and the thing that we actually care about is not much of the area. Well, make this one bigger somehow while keeping this at the same size. That's the goal. So, sort of the evolution was like we had baked this much into hardware in this stage, that this single line is a multiply accumulate, and this single thing was baked into hardware.

Starting point is 00:26:22 The idea of a systolic array is to go two levels of loop up and bake this entire loop out here, into hardware. And so the idea being that if we have a much bigger granularity fixed function piece of logic, maybe the taxes we pay on the input and output are much smaller. It sounds like you're suggesting that if you have, if you go up one step in the matrix multiply loop, that there's some, you can tilt the balance more towards compute than communication. That's right.

Starting point is 00:26:55 So there's two effects that we're going to take advantage of here. One is just that we can do more stuff before per every trip through a register file. And then the other thing we're going to take advantage of is, in fact, in some of this loop, we can take advantage of, for example, some certain things staying fixed. So let's sort of visually, we're going to look at this matrix multiplication. So this portion of the loop corresponds to a matrix vector multiplication, in fact. So we'll take a matrix and multiply. it by a vector.

Starting point is 00:27:31 So how do we do this? We take every column gets multiplied by the vector and then summed. So we're going to sum sort of along columns. So this 0 and 3 gets multiplied by the 3 and 7 and gets summed, and then the 1 and 2 gets multiplied by the 3 and 7 gets summed. So there is a multiplier accumulated associated with every single one of these entries in the matrix. So we'll just draw out these four multiply accumulates.

Starting point is 00:27:55 And just make sure I understand why there's four multiply accumulates. So if each entry in the column that corresponds to the output vector is a dot product, and in this case it will be like two multiplications and then the addition of those two multiplications are like you're accumulating. Yeah, so the addition, so really there's only one addition per dot product, but like we would like to start with zero. the initialization of zero. Yeah. Yeah.

Starting point is 00:28:35 So what we're going to aim for is to have, so we want to have quadratically more compute. We do, we have, we have, we've got sort of X times Y as much compute as we had before. But we're going to want to somehow aim for having only X times as much communication. And this is sort of the intention so that we get this advantage term going as Y. So we've laid down the multiplications, bringing in, like, we're going to want to bring in a vector of size 2.

Starting point is 00:29:16 And so that sort of already is in line with our comms target. That's fine. However, we need to somehow manage the communication of this matrix, which, which exceeds our budget of X. And so the idea is that in an AI context, this matrix is actually. going to stay fixed for a long period of time. And so instead of like bringing it in from the outside, so we've got some register file sitting over here, we don't want to have like the amount of stuff coming out of this register file. This is the term that we want to go sort of as as X in

Starting point is 00:29:51 some sense. We don't want to bring this full matrix in from the register file every cycle because we don't have enough that would cost too much in terms of wiring from the register file. And so instead, we're going to store, our key trick is that this matrix can be stored locally to the systolic array. And so where we'll store these numbers, 0, 1, 2, and 3 in just like a gate called a register that, like, physically stores these numbers and we're going to reuse these numbers over and over again for a large number of different vectors. And so the optimization here is that like the nature of matrix multiplication is you can store this like square quadratic thing directly where the logic is happening and which is like higher

Starting point is 00:30:49 dimension than the or has an extra dimension compared to the inputs which you keep swapping in and out that's right I mean this is the nature of what a matrix multiplication is that you do a lot of multiplication to get one value out. Like a dot product is the result of a lot of multiplications. And so that optimization means that you're just like, you can stuff a lot of like multiplication in before you get some value out of it.

Starting point is 00:31:13 That's right. That's right. Yeah. So like just to complete the picture here of concretely how that looks, I swapped the three and the two here, three and two. So just like this zero and three is going to multiply by the three and seven. And so we're going to form a dot product sort of along columns here. So somehow we're going to feed a three and a seven in here.

Starting point is 00:31:41 These participate in sort of this feeds into this multiplication and also feeds into this multiplication. Likewise, the three feeds into here and also into here. And then we're going to sum sort of along here with like starting at the top of a column we feed in zero. and then coming out the bottom, we get results coming out. So just to visually see what we've got, there's a dot product that is performed along columns in a matrix, and that sort of maps exactly to what is done spatially in the systolic array here. So this is one dot product summed vertically,

Starting point is 00:32:19 and this is a second dot product also summed vertically. And then what is the data that needs to go into and out of the register file? We have X amount of data that's coming out here on the output. And we also have sort of this data coming from the input, X amount of data from the input. And so with respect to the input and output vectors at least, we've sort of met our goal of having, or only X as much data going in and out of the register file. This leaves open the question of like, I said that the weight matrix is stored locally in the systolic rate.

Starting point is 00:33:02 How did it get there in the first place? Is sort of a, like, at some point you need to boot your chip and populate this data, and so where did that come from? The trick is just, we just do it very slowly. So we very slowly trickle-feeded into the systolic array. The sort of the simplest strategy is that we sort of run this daisy chain that says, like, feed a number into here, and then on the next clock cycle it'll move down to the next entry of the systolic array. And so we can do that in every, in every column in paper. parallel, and that gives us the sort of this is also going to come from here, and this is

Starting point is 00:33:37 going to be another factor of approximately X units of bandwidth coming in. Can you just, would you mean repeating the other sentence? So we're sort of, like, we know that we're going to be bringing in numbers only rarely into the matrix. And so we just want to come up with any construction at all such that the amount of wiring that actually feeds into, sort of crosses this boundary of the systolic array, like, this boundary right here, we just want to keep that bounded to X and not go as X, Y. And so a particularly simple strategy is that we sort of bring in a number into the top row of the

Starting point is 00:34:14 systolic array. That's what we can do in one clock cycle. And then for like, for Y consecutive clock cycles we're going to be bringing in the top row every time and then sort of shift all of the other rows down by one. And that keeps the sort of the wiring that needs to come from this expensive register file only down to a factor of X rather than X. I see. Okay. So there's two questions in terms of communication. There's like communication time and then there's communication bandwidth.

Starting point is 00:34:44 Yes. And you're saying since we're only going to be loading this in once, let's maximize bandwidth. That's right. Because bandwidth equals diarrhea. And let's just like load it in slowly over like smaller lanes because we're just going to keep this value in there for a while. Exactly. Interesting.

Starting point is 00:35:00 So it's interesting to me that when we were talking last time about inference across many chips, the big, high-level thing we're trying to optimize for is increase the amount of compute per memory bandwidth, that is to say, per communication. And here also we're trying to increase the amount of like actual multiplies or actual additions relative to transporting information from registers to the logic. So in both cases, you're trying to maximize compute relative to communication. Yeah, this shows up sort of all the way up and down the stack. This is sort of close to the bottom, sort of like to the gates. There's sort of a version that's maybe even closer to the gates of just like even the precision of number format that you choose to use.

Starting point is 00:35:48 We saw that same effect. There's like a square cube law or a like squared versus linear term going on both in just purely the precision of this ALU, but then also in terms of the size of the matrix. Yeah, very interesting. So this unit is sort of the next bigger unit. We had like the multiplication circuit, and then on top of that we have a pretty large systolic array. I drew it as two by two, but in like, for example,

Starting point is 00:36:16 older TPUs they were described as 128 by 128 of this circuit shown here. And this circuit ends up being, this is the most efficient known mechanism for a circuit for implementing a matrix multiply. I see. So we've talked about sort of, it seems obvious that you should try to maximize compute relative to communication. What are non-obvious trade-offs that actually you are, you know, keep you up at night about should we do X or should we do Y?

Starting point is 00:36:48 And it's not obvious what the answer is. Yeah. So, I mean, I think most of the decisions in chip design are sizing decisions. And so already in what we've drawn so far, like, so AI chips all have this circuit in it. They have a systolic array and then somewhere near at a register file, providing inputs and outputs. The two sort of, like even within this scope, sizing questions that you have are, how big should I make my systolic array and how big should I make the register file? the trade-off for the size of the systolic array, actually these two questions are coupled

Starting point is 00:37:33 is one way to think of it is to say I'm going to have a budget for how much, what percentage of my chip area I want to spend on data movement. So maybe I just say that I want this to be 10% and the systolic array to be 90%. And then, sort of, I can size my register file. Bigger register files are more flexible. They allow me to run sort of more more, I can get more application level performance out. But then they sort of take away from this area spent on the system like Corray. Yeah, makes sense. I recently ran an essay contest where I asked people to write about

Starting point is 00:38:09 what I consider to be some of the biggest open questions about AI. The submission window closed last week, so I used cursor to create a couple of different interfaces to help me review the entries. One interface anonymizes submissions and hides unnecessary information. It lets me group responses by question, add notes, and record my scores. The other interface helps me review entrants who also want to be considered for the researcher role that I'm hiring for. The UI puts the applicant's essay right next to their resume and their personal website

Starting point is 00:38:34 so that I can see everything at once. Cursor's harness is really good at helping these models see and improve their UIs. I watched it render these interfaces in the built-in browser, take screenshots, click through sections, and keep iterating. At this point, cursor is where I do most of my work. Whether I'm reading and visualizing a bunch of research papers or coding up an interface to review applications or making flashed. cards for my Blackboard lectures. Curser just makes it very easy for an AI to look at whatever

Starting point is 00:39:00 I'm looking at and help me understand it and work with me on it. So whatever you're working on, you should do it in cursor. Go to cursor.com slash Dwarcash. Where does the clock cycle of a chip come in? What determines what that is? Yeah. And what is a clock cycle of the ship? So I guess at baseline, it's sort of worth observing that chips are incredibly incredibly parallel, right? You've got 100 billion transistors in a chip. A key thing that you need to do whenever you have massive parallelism is you need to synchronize between the different parallel units. In software typically you have these very expensive synchronization methods like a mutex.

Starting point is 00:39:40 So one thread will finish what it's doing, it will ink, like it will grab a lock somewhere stored in memory and then notify the other thread that it's done. On chips we take a very different approach and say that every nanosecond or so, all circuitry in the chip will kind of pause for a moment and then synchronize every. So it actually synchronizes every single nanosecond or so. And so that is the clock cycle. The entire chip, typically all in sort of one fell swoop, goes in lockstep to the next operation that happens. And so what this looks like in circuitry is that you will have, it's typically drawn, so the clock is sort of mediated by registers, which are these storage devices that we've drawn elsewhere.

Starting point is 00:40:31 And the way to think of it is that I have some storage, which is storing like a bit, which might be zero or one. And then I have some sort of cloud of logic, which maybe is like this systolic array or this multiplier or something like that. And then I've got some, and that's going to produce some output. So my inputs, I've got a bunch of inputs feeding to this cloud of logic. And then eventually later there's going to be some output register that this writes to. There is a global clock signal which drives all of these registers and it says at a certain instance in time when the clock strikes,

Starting point is 00:41:11 whatever value happens to be on this wire at that instant, that's what's going to get stored in there. And so the sort of the challenge here is like I would like to have my clock speed run as fast as possible because if I can run at two gigahertz, I can get twice as many operations done per per second than if I run at one gigahertz. But what that ends up meaning is that I am very sensitive to the delay through this cloud of logic, because any computation that is going to happen in here needs to sort of finish before the next clock cycle hits. So a major point of sort of optimization on any chip then is to make this delay, delay from here as short as possible. Interesting.

Starting point is 00:42:04 And is there ever, because the constraint here seems to be that if you add too much logic, then you might risk missing the clock cycle. But if you don't add enough, then you're leaving potential compute on the table. Is there ever a situation where you're like, you'd take a probabilistic chance that a computation finishes and or is it just like no, either it's going to finish by clock cycle or not. Yeah. In standard chip design, you margin it's such that, I mean, there is a probability, but it's like many, many standard deviations, like weigh standard deviations out such that for all intents and purposes, it is a reliable part. It will, it will always meet the clock.

Starting point is 00:42:51 There are some weird exceptions to that. there are clock domain crossings where you go from one clock to the other clock, and then you actually do have to reason about this probability. Interesting. In the main path, you just, like, you margin that's such that you'll get there, like, 25% of the clock cycle in advance, so that it's very unlikely. In this, in this, the clock, where the clock synchronize, I guess, where the registers are, this is not something you determined as a chip designer.

Starting point is 00:43:22 this is sort of just like an artifact of, hey, I want whatever sequence of logic, and then the software you use to convert your varrolog into the thing you send to TSM, that just determines like, hey, in order to make this work, you've got to put a register here, here, and here to make sure that there's a, there's no one step that is like too long, such as it makes the whole clock cycle of the entire chip longer than it has to be. Yeah, so this is actually a huge part of the work of designing a chip, actually, is inserting them.

Starting point is 00:43:51 So it is done in a combination of manually and automatically. So, I mean, just like to show the very sort of dumb version of like what you can do here, you can take this logic and split it in half. And so like say actually instead of just one cloud of logic, I'm going to have two smaller clouds of logic, which do the same thing, but split them up by a register. Right. Feeding in like this.

Starting point is 00:44:19 And this is like, like if you split it like in the middle, you can hit twice the clock frequency. That's great. You get twice the performance at the cost of this extra register. And so at the cost of some more storage. And so stepping back, why do we need to synchronize the whole chip? Like if you have like, if you imagine playing Factorio or something, there's no like global clock cycle. It's just shit is done when it's done. There's iron on the plate. You can take it if you want.

Starting point is 00:44:43 Yeah. So taking that analogy, the thing that you need to be mindful of is if I've got two different pads through some logic. So I have to do a computation like F here and then computation G here, and then they're going to come and meet for computation H somewhere here. And so there's going to be manufacturing variants here. In some chips, F will take a little longer, maybe in some chips, G will take a little bit longer. And so if I've got some signal that's propagating through here and the result from F and have to sort of meet up at H.

Starting point is 00:45:22 The thing that can go wrong is that F can get there early, and it meets like the previous value of G or the next value of G or something like that. And H needs to know when to start. Exactly. Like when has this next iteration of... And so this explains why different ships made at the same process note, the same like TSMC technology, can have different clock cycles.

Starting point is 00:45:46 Two ships made at three nanometer, right at different clock cycles. based on whether they were able to optimize making sure that there's no one critical path that is so long that it slows down the whole ship's clock cycle. That's right. This optimization that I showed here, this is just the, this is sort of pipeline register insertion, it's called.

Starting point is 00:46:08 We've inserted in the middle of the pipeline a register here. This is a sort of pure trade-off between clock speed and an area. Yeah. This is the easy case. There is a harder case too, which is sort of drawn it as a pipeline of logic here, but in other cases you may have some calculation which actually feeds back in on itself. So it runs some function F and then writes back to itself like this.

Starting point is 00:46:38 So for example, this might be this addition, like you've got some number that you're adding into every clock cycle. And so this could be like a plus, where you're a, this could be like a plus, where adding in some number every clock cycle. So this little circuit, essentially it's just going to sum all of the numbers that get presented on different clock cycles. And the challenge is, if this plus takes too long, what can I do?

Starting point is 00:47:05 If I, like, split it in, if I try and put a pipeline register right in the middle of it, like here in the middle of it, this will end up changing the computation that's done. Instead of forming a running sum of everything that comes here, I will actually have two different running sums. I'll end up having a running sum of the even numbers and running some of the odd numbers. So, sort of this constraint where I have a loop in my logic, which all chips have somewhere, this is actually the thing that is the

Starting point is 00:47:36 hardest thing to address and sets the clock cycle. I don't understand why it'd be a problem to have that, or I'm not sure even what it would mean to have a register there. Like, is it sort of atomic operation, right? Yeah, well, so plus is not really atomic. Like, I think... As you just demonstrated. Yeah, it took a whole lot of work to do a summation. And so, like, you can take the early parts of that work

Starting point is 00:47:56 and then stick a register in the middle and then to take the late parts of that work. Good, okay. Yeah. And I guess it's then up to... So TSMC offers a PDK, which says that I say, here's the primitives of logic that we can grant you in the chip. And it's up to them to determine that no primitive is bigger than like the clock cycle they're hoping a process node targets.

Starting point is 00:48:21 But other than that, is there like what further optimising? Can't you just say like, hey, here's all the primitives from TSM? And keep adding registers in between the primitives as much as it's needed until you get to your desired clock cycle? Yeah, as a logic designer, like the chip architects set the clock cycle. So just for one example, the primitives you get from TSM are on the order of like and gates. or full ladders, they depends a lot on voltage and which library you choose and so on. But generally, you can typically have about like 10 or 20 or 30 of these in a clock cycle

Starting point is 00:48:59 sequentially. So these primitives are very, very fast, like 10 picoseconds or something like that. And so as a logic design, I mean, like in principle, if you literally just had like register and then and gate, kind of in a little. a loop like that, you could get insanely fast-clock speed, like more than four or five, six gighertz, something like that. But if you take this, this like really sort of like simple circuit, and you look at the area you're spending here, like, this is maybe like one, I mean, this is, this is called one gate equivalent in size, so like unit of one in area, and this thing is like

Starting point is 00:49:38 unit of eight in area or something like that. And so, like, this is just, again, almost all of your cost has been this, like, synchronization of. communication cost compared to the actual logic. And so this would be a case where you've gone too far. You've made your clock speed really, really fast at the cost of spending almost all of your area on pipeline registers. Interesting. So what you're hinting at is a dynamic where you can have really fast clock speed, but you're not getting that much work done. Yeah, yeah, yeah.

Starting point is 00:50:07 And so you can have like low latency but low bandwidth or throughput rather. Yeah, it hurts your throughput, in fact. because like the throughput of your chip, you can think of as the product of how much I get done per clock cycle, which is based on this area efficiency thing, times how many clocks I get per second. This is actually so similar to the thing we were discussing last time about batch size, where if you have a low batch size,

Starting point is 00:50:34 then any one user can receive their next token really fast, but the total number of tokens that are processed and say an hour will be kind of lower than it could otherwise be. Yeah, exactly. You get less parallelism out if you drive your clock speed up, really. Language models are starting to compete against the best human forecasters. I sat down with two senior Jane Streeters, Ron Minsky and Dan Ponte Corvo, and asked, at some point, does AI just do what Jane Street does? There's a world that we should take seriously where, you know, we're going to build large language models or some other AI systems that are, like, strictly smarter than all

Starting point is 00:51:09 humans on the planet and more capable at all cognitive tasks. Trading in particular feels to me as like kind of AGI complete, sort of like NP complete, because at the end of the day, trading involves figuring out what things are worth, which means making predictions about the future. Jane Street isn't betting against AI. They just signed a $6,000 compute deal. But Ron's view is that the edge keeps moving. I have never been more desperate to hire more engineers and more traders than I am today.

Starting point is 00:51:38 You know, you have the usual thing of like the other hard parts that we don't yet know how to automate, well, that ends up being where the competitive edge lies. You can find these open positions and watch the full interview at jane street.com slash thorekash. Okay, so I remember talking to an FPGA engineer at Jane Street, Clark, who actually helped me prep for the previous interview we did together. And he was explaining why they used FPGA's. And I imagine that for high-frequency trading, the throughput is less important than latency. And so having very specific control over the clock cycle. in a deterministic way is the most important thing.

Starting point is 00:52:18 Maybe it'd be interesting to talk about how you, why you can't just achieve that within ASIC or why FPGA is the, yeah, why you might use an FPGA to have deterministic clock cycles for like high frequency trading? Yeah. So, I mean, firstly, like, let's consider sort of the business case for an FPGA versus an ASIC.

Starting point is 00:52:39 FPGAs and ASICs use largely the same sort of conceptual model, which is that I have a series of gates built from ANS or as XORs, those very small primitives, connected together with a fixed clock cycle, and connected together with wires that are running in a fixed clock cycle. So anything you can express in an FPGA, you can express in an ASIC 2, and it'll be about an order of magnitude cheaper and better energy. efficiency on an ASIC than an FPGA. The trade-off is that the first FPGA costs you $10,000,

Starting point is 00:53:16 whereas the first ASIC you make costs you $30 million because it requires an entire tape-out. So sort of the business use case for an FPGA would be that I want something that has this very deterministic latency and fast runtime and high parallelism. But, you know, I'm going to change very frequently. It should change what I do every month or something like that. And so then I don't want to pay the tape out cost every time. Now, how does an FPG actually implement, it's sort of like, it emulates the ASIC programming model, but in a fixed piece of hardware. And so how does that work actually? So what it has at the base is it's got a, it's got the two components we just talked about.

Starting point is 00:54:05 It's got these registers as storage devices. And then it's got these, these are called lots lookup tables, which actually provide all of the gates. So, and then we're going to see even the sort of the third component. We then have like sort of a swarm of these registers and lots. And all of these are available and then they're connected by sort of this big set of sort of So in front of every single one of these, we've got something like one of these muxes, which selects one input from everywhere else, sort of selecting from all of these different things.

Starting point is 00:54:55 You know, we've got a whole bunch of different options feeding into all of these things. So what this allows is essentially a, when I program my FPGA, I can say that I'm going to take all of these components and I'm going to sort of superimpose on top of this a particular wiring which like goes through this lot and then feeds into this lot and then goes to this register and then feeds into this lot or something like that. So what I've drawn in orange is like how you like FPGA means field programmable gate array. This is the orange is what has been programmed in the field, whereas the white is all of the wires that must exist in the FPGA in order to actually take, to make the device in the

Starting point is 00:55:40 What does it mean to be programming to a field? Like programmed in the field, so like the device has been deployed in a data center, it's sitting in the field and then you can come and program. That field isn't like electric field, no field as in like out there in the world field. Okay. And so if I see, look at the, how the field programming comes out of the first lookup table and goes in a second one. How is it, how? Yeah, how? Like where are the wires that made that happen, I guess?

Starting point is 00:56:09 Yeah. Yeah, so I got a little bit lazy in drawing all of these. Every single device here has a MUX sitting in front of it, which can select from all of the nearby circuits that are available. Yeah. And so the actual configuration with the FPGA is like, amounts to it, it is the MUX control. So in this MUX here we have the data inputs, and then we have like the the control that selects.

Starting point is 00:56:41 And so there's a little storage device sitting next to every single one of these muxes saying, this is where you're going to source your input from. And so programming it consists of like configuring every single one of these muxes. So that makes sense. What is happening inside of the lookup table? Yeah.

Starting point is 00:57:01 So the purpose of the lookup table, so it's going to also have a little bit of control feeding, telling it what to do as well. The purpose of the lookup table is to function to be able to configureably take the role of an and gate or gate XOR or any of those different things. So there's many ways you could consider doing that. The way it is done in sort of traditional FPGAs is to say it will support, so it will be a lookup table will be,

Starting point is 00:57:35 it will have four bits of input, one bit of output. How many different functions are there from four bits to one bit? there are 16 different functions. And so you can actually just tabulate this as like 16 different numbers. You've got a table of 0111101, 16 entries. And so what it does is this table is stored in this blue configuration bit. And then it views these four bits as binary, looks up the relevant row of the table, and emits that bit.

Starting point is 00:58:18 So this is a truth table view of lookup tables, essentially. Okay, so the lookup table, if you think about an and gate, or gate, nor gate, XOR gate, these are all like take as input. Those are like two input functions. Sometimes we have like more complicated, like a three input function would be a three-way XOR or a four-way XOR. And in this case how many, it just depends how big it is. Typical size for Lutz is four input, which is sort of just a sweet spot between

Starting point is 00:58:47 There's another computer communication trade-off like here. Like if it has too few inputs, then you need to use more lots. Yeah. If it has too many. But basically the lookup table is like a truth table. It's a truth table. And with a truth table, you can program in any gate you want. That's right.

Starting point is 00:59:01 Yeah. And so it's a lookup table just thinks like a programmable gate. That's right. And so, I mean, one of the things you can do here is you can see why, where the rule of thumb, that an FPGA is like an order of magnitude more expensive than an ASIC comes from, is to to count how many gates would be inside this lookup table. Yeah. So we can view this lookup table essentially as one of these muxes.

Starting point is 00:59:24 And so it is a max with has to select between 16 different values. And so it is a max with sort of n equals 16 options, p equals one bits. And so what we saw away earlier is that This circuit costs like n times p many gates, and so it's like, um, so it costs like n times p equals 16 and gates and also 16 ores. This circuit being the mux. Yeah, exactly, the mux. The mucks is the core. The muckeye.

Starting point is 01:00:05 The lookup table itself, you can think of as being actually a big mux that like selects from all 16 rows down to one out. Yeah, okay. up table. But the way you've drawn in here, there's like a mux and then a lookup table. It's a mux is all the way down. So I mean, there's a, there is a second box that is inside here. This mux is this mux. Got it. Okay. And then the other mux is just saying where it came from in this sort of mess of gates. Right. And then the second mux is, okay, now you have one value, but that value is still a four-bit value. Yeah. So I've selected four bits from the soup. Right. And then I use those four bits to select.

Starting point is 01:00:45 which entry in the lookup table I'm going to use. Right, okay. So suppose in the first mux there's like eight nearby, you're pulling from eight nearby registers as input. And so that's like a total of like 32 bits going in. And then out of that four bits come out, those four bits go into the second mux which is inside the lookup table. So actually I would say in, yeah, in this case these registers are single bit registers.

Starting point is 01:01:14 So if there are eight nearby registers and lookup tables, then I have eight bits total coming in in nearby. I select from eight down to four different values. So there's actually like four different muxes, one associated with each of these input, like little mucks associated with each of these input bits. Each of them is selecting one out of eight. And what are those eight coming from? Nearby registers and other lots. And each register is one bit. Yes.

Starting point is 01:01:43 Yeah. And so I guess EMD or whoever makes these FHAs still has to be opinionated about what registers should connect to which registers. And then you can program in the actual gates, but they add a wire in the connection, like the communication topology, right? Yeah. So there's the sort of like you get flexibility in a local grain thing. There's a sort of nearby neighborhood where you can select from. but then more grossly, like more coarsely, longer distant connections, they form an opinion. Right, yeah. And the reason it's 10x lower is why?

Starting point is 01:02:19 So if you look at the cost of, like, building this lookup table, it's like 32 gates. Yeah. And then it can give me the equivalent of like, what's one, an interesting thing I can do here, I can do a four-way and gate. And so that's like, I'm using 32 gates of look-up table to sort of implement like a four-way and means like, What is a four-way and I would do? Like, and, and, and then and of and. So to implement, like, this is a circuit that I could implement in an ASIC directly using these three and gates.

Starting point is 01:02:51 But using a lot, I can also implement it, but it's going to take, like, these 32 gates instead of three. Right. And so the overhead is really coming from the, like, the fact that the lookup table, the mucks in the lookup table is, there's a more concise way to describe a truth table than listing out every single possible combination of inputs which is just to like write out the gate yeah like to like place down the polysilicon and the that's right yeah interesting

Starting point is 01:03:24 one important point he made to me is that the reason they prefer FPGAs to CPUs is because they get deterministic clock cycles they know when a packet will come in and go out why as it not a guarantee in CPC? So you can actually design a CPU that has deterministic latency as well. And in fact, like the processes that are inside a lot of AI chips actually also have deterministic latency too. Grock has advertised this. TPUs have that in the core as well. The challenge is getting sort of deterministic latency and high speed at the same time.

Starting point is 01:04:01 And so where does the non-determinism in latency come from? Non-deterministic latency comes from specific design choices in a CPU. It's actually possible to remove those design choices and make a CPU that has deterministic latency. Those are not very attractive in the market, and so people don't make those CPUs anymore. But actually, in some sense, like, deterministic latency is maybe a sort of a simpler designing starting point. And then like some chip designers have added things in to be non-deterministic. To take a concrete example of that, probably the most important example is on a CPU just like the the CPU cache itself. So in a CPU you have the CPU, this is the CPU dye itself,

Starting point is 01:04:59 and then there is a memory off on the side. This is the DDR memory off on the side. And then you have a cache system here inside it is the cache that sort of remembers recent accesses to DDR and stores them. And so when I'm running through my CPU instructions, every time I have an instruction that accesses memory, it first checks in cache was the data stored in cache, and then if not it goes, it fashes out to DDR. Yep.

Starting point is 01:05:32 This is a huge optimization. The cache is like two orders of magnitude faster than the DDR. If you never use the cache, basically all programs would run 100 times slower. So the presence of a cache is absolutely necessary for a CPU to run at reasonable speed. But whether or not you get a cache hit is dependent on the sort of ambient environment of the CPU.

Starting point is 01:05:56 Like what other programs are running, what has run recently, what is the random number generator inside the cache system doing? And so that is a big source of nondeterminism in the runtime of a CPU. Yeah. So this is sort of the memory system for a CPU. The big thing that you can do differently is instead of having the hardware say, I'm going to read memory and then decide,

Starting point is 01:06:20 the hardware decides whether or not it comes from cache or not. You can actually bake this decision into software. So a different design philosophy is to, and you see this in maybe, for example, TPUs. The TPU instead has, I mean, I'll draw the same diagram, but I'll call it a scratch pad. And so the main difference is, so this would be like a TPU and then like HBM, in this case, rather than DDR, but it's still an off-chip memory. And instead of like the software saying first access like memory and then the hardware decides, you've got some instructions that go here. This is like one kind of instruction and then a totally different kind of instructions

Starting point is 01:07:04 that goes to HPM. Yeah. And so this style is generically known as scratch pad instead of cache. The key distinction being that you have one kind of instruction that says read or write scratch pad and a totally different instruction that says read or write HBM. So is scratch pad being the cache? Yeah, this thing here is the scratch pad.

Starting point is 01:07:26 So stepping way back, people say computers have the quote-unquote John Moyne, von Neumann architecture. where there's this serial processing of information. And maybe just because we've been talking about parallel accelerators, but I just don't, like, the FEGA is super parallel. Yeah. The kinds of AI accelerators, the TPUs are super parallel. Even CPUs are super parallel, if you think about all the cores they have.

Starting point is 01:07:56 And so is it actually, like, in what sense is modern hardware, actually the Von Neumann architecture? Is that actually a fair way to describe modern hardware? I think it's a fair way to describe CPUs. Like just the amount of parallelism on a CPU, the amount of parallelism you get is about 100 cores times maybe 16 way vector unit.

Starting point is 01:08:12 So 106, about 1,000 way parallelism on a CPU. Yeah. One question is what is the, there is a dye that is being used for the CPU and if there's fewer threads, just as a matter of like transistor

Starting point is 01:08:28 voltages or like switching on and off, is it that there's like literally one control flow like a small part of the dye where like voltages are switching on and off or like in what how how do you actually occupy the dye area of a CPU if there's only as opposed to if there's so few cores like what are my saying in there yeah the cores are just much bigger and more complicated so I mean like so I guess we we should compare like a CPU core which takes up one 16 one one hundredth of the die to like I mean to a lot like a lot is just only

Starting point is 01:09:02 these 16 gates. So it's clear why there are so many more lots in an FPGA than cores in a CPU. But then sort of maybe the, like, why are they more CUDA cores, for example, than CPU cores, I think would be like, why, what's the difference between a CPU and a GPU or something like that would be a big difference? Inside the CPU, you have, so one big use of,

Starting point is 01:09:28 so the top unit, uses of area inside of CPU are the CAP cache. Very little is actually the ALUs. Mostly, it's like these register files rather than the logic units. And then both of these things have equivalents in a GPU, and so that's not a big difference. But the thing that does not have an equivalent in a GPU is the, sort of this branch predictor. And so there is a whole big area in the CPU, which is sort of just a whole bunch of predictors that are saying, when will my next branch be and where is the branch target for that? And so stripping a lot of that out as well as sort of making these register files tighter, in a sense, is driving a lot of where the, like, GPU gains.

Starting point is 01:10:19 And what is the purpose of the branch predictor to, like, execute both branches at once? Or what does they do? So the issue is that when I've got a series of instructions, like instructions, instructions, instructions, I, if I have a branch, like here, if this instruction is branch, the actual processing step of processing an instruction takes a really long amount of time. It takes like maybe five nanoseconds or something like that. So like the time to actually notice that I've got a branch and then like evaluate the Boolean whether it's true and then and then update the program counter to the new target and then read from the instruction memory for that. That could take like actually five nanoseconds to finish. And so in reality,

Starting point is 01:11:08 this may finish somewhere down here. I don't want to, like, but like I want to run a clock speed that is much faster than what five nanoseconds allows. Like five nanoseconds is 200 megahertz clock speed. I would like to run at one or two gigahertz or something like that. And so I need to run other instructions while the branch is being evaluated. And so, I really just want to keep running the following instructions that happen after me. But that might have been wrong. If the branch ended up being taken, then I need to know that instead of evaluating these instructions, I actually need to jump to wherever the target is and run these instructions instead.

Starting point is 01:11:48 And so the purpose of the branch predictor is like genuinely to predict based on, like, before you even get to this instruction to be like five cycles earlier to predict there was going to be a branch that's going to happen. So if I think about how the brain works versus what you're describing here, at a high level the differences might be that while you can do structured sparsity in these accelerators

Starting point is 01:12:11 and then save yourself some area that you would have otherwise had to dedicate to these gates, in the brain there's unstructured sparsity. Any neuron can connect to any other neuron and not in ways where it would be column aligned or whatever. Then there's a fact that memory

Starting point is 01:12:29 and computer co-located? I guess you could say in a way the memory and computer co-located on... This is exactly the collocation in some sense of the memory of computer. That's right, that's right, yeah. Yes, maybe that actually isn't a big difference. And the other, maybe a big difference

Starting point is 01:12:43 is that the clock cycle on the brain is much slower than on computers. And partly that's to preserve energy because the faster, the clock cycle, the bigger, the voltage needs to be in order to identify, for the signal to settle and to like identify what state of transistor is that. Yeah, that's right. That's right. I don't know if you have other high level takes about like how any commentary on what

Starting point is 01:13:14 the brain might be doing versus how these chips work. Yeah, I mean, so let's take the clock speed one first actually. The clock speed is quite high on a chip because that, I mean, drives higher throughput. like when we compare a like a GPU running some workload, it's running batch size 1,000 or something like that, whereas like the brain is not running batch size 1,000. There's only one of me. And so you could sort of imagine saying, well, take a GPU and like instead of running at a gigahertz run at a megahuts or something like that. And that would start to look maybe a little bit more like like sort of equivalent things that you're talking about in the brain.

Starting point is 01:13:55 There is, in the way that silicon works, there are, like, that does not give you an thousand X advantage in energy efficiency. So what it ends up looking like is you can, like, you sort of just end up running this circuit once to stabilization, and then it'll sit idle for a long period of time. It doesn't consume a lot of energy while it's sitting idle because most of the energy is consumed in sort of toggling bits from zero to one and back. So actually let's talk about the energy consumption of a circuit like this. The way to think of a bit being stored is you've actually deposited some charge in a

Starting point is 01:14:42 capacitor somewhere, sitting somewhere in the chip implicitly. So it becomes charged when it is, bit becomes a one, and then it becomes discharged when it next goes to a zero. And that cycle of charging the capacitor and then dumping that charge out to ground, that is where the energy is consumed. Yeah. This is called the dynamic or switching power. This is most of the energy consumption of a chip.

Starting point is 01:15:02 There is some other energy consumption just coming from the fact that insulators aren't perfect insulators, but we'll discard that. Most of the energy consumption actually comes from just the charging and discharging of like toggling from zero to one and back to zero. Right. So if you run a chip much slower and you only clock it once every thousand clock cycles or something, you will have a thousand times fewer transitions. it'll be about a thousand times less energy consumption, but not a substantial advantage in energy efficiency. Okay, so you described how a TPU works at a high level. What is the difference at a high level

Starting point is 01:15:40 between how a GPU and a TPU work? Yeah, so, I mean, I think there's sort of a high-level organization principle that is different, and then there's sort of inside the core as what are different. But we'll look sort of outside the, like at the high level, so we'll take a GPU and a TPU and what does like sort of the top level block structure look like. If you think of this as the whole chip in each case, the organization of the GPU is mostly a bunch of almost identical units, which are these, these are the SMs. And then they've got an L2 memory in the middle, and then a bunch more of these SMs on the bottom.

Starting point is 01:16:26 And so there's sort of this fairly regular grid, of course. And then, like, if we look at a TPU in comparison, you end up with much coarser-grained units of logic. And so you end up with something like some... large number of maybe like maybe just a few matrix units. These are the big like systolic arrays. And then in the middle you've got some vector unit. And then you've got your matrix units at the bottom.

Starting point is 01:17:06 So now sort of like matrix units with a vector unit in the middle, sort of this is the whole TPU chip. You can sort of think of scaling this thing down into a really tiny unit with a smaller matrix unit, smaller vector unit. And that is sort of what an SM is. So sort of at a very high level point of view, the GPU has a lot of tiny, tiny TPUs sort of tiled across the whole chip.

Starting point is 01:17:35 Oh, interesting. So like you're suggesting the tensor core within a streaming SM is analogous to an MXU. Yeah, it's very, very similar. I see. And so if you had more like, more lack of structure, having a bunch of tiny TPUs makes a lot of sense. Whereas if you kind of just have like huge matrix multiplications, you're like, why don't we just, why don't we avoid the cost of having the individual SMs

Starting point is 01:18:03 with their own registers and warp schedulers and things like that? Why don't we just like make a huge thing and like amortize those costs across the whole thing? And I mean, I think this shows up in how large you can grow things. We've sort of seen this theme, like especially with this, array where largest systolic array amortizes the register file costs better. Yeah. This sort of design allows you to have larger systolic arrays, whereas the sort of GPU design constrains you to having small units of everything.

Starting point is 01:18:31 There is a trade-off, however. The, there ends up being, because of this sort of coarse-grained separation of things there, you need to move a lot of data from the vector unit to the matrix units. And so, like, you need to move a lot of data from the vector unit to the matrix units. And so, like, you need to move a lot of data through a sort of, like, two lines of of perimeter here. Whereas if you sort of look at the equivalent thing, here you've got vector units everywhere, and you need to move data through this line, through this line, through this line, through this line, through this line, through this line.

Starting point is 01:19:06 So the amount of data you can move between a vector unit or matrix unit is actually much higher in a GPU than in a TPU because because instead of having to move all the data through these just two lines, you're moving all these data through like 16 lines or something of wiring instead in a GPU. Right, but also you might have to move across less area. Which, I mean, is also a saving, like it's an energy setting. So data ends up moving like if you can operate entirely within an SM, the data movement is much smaller, but then the moment you want to operate across SMs,

Starting point is 01:19:44 It becomes sort of more complicated and expensive. So you don't have a comment, but one might expect that a thing Maddox might try to do is to get the GPU-like smaller structure of systolic arrays surrounded by SRAM, but also at the same time make it so that like the things you need in an SM to support the Kuta architecture, but take a bunch of space, you might discard. Yeah. We've talked publicly about something which we call us a splitable systolic array, which is sort of in some sense you can think of as like big systolic arrays that can be small systolic arrays as well. Cool. Okay, I think this is a good note to close on. Reiner, thank you so much. Thanks for Keshe.

Dwarkesh Podcast - Reiner Pope – Chip design from the bottom up

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.