Cheeky Pint - Reiner Pope of MatX on accelerating AI with transformer-optimized chips

Starting point is 00:00:01 Reiner Pope is the co-founder and CEO of MatX. He's a former math whiz and Haskell programmer who became a TPU architect for Google. And now he's teamed up with Google's former chief chip architect to design a better chip for AI. So a year ago, everyone was saying Google is canceled. You know, AI is going to eat their search. No one's going to search for things. And therefore the business, you know, won't do well. Obviously, that sentiment has really shifted.

Starting point is 00:00:28 Yeah. In part helped by, you know, Gemini 3 is really good, and then also it's really fast, you know, in part because of the custom chip hardware Google has. You were inside Google for actually, I think, a lot of the foundational period, laying the groundwork for that stuff. What do people not appreciate about what Google did right to lay all the groundwork for their current AI success?

Starting point is 00:00:52 They started with the research, right? The Transformers came from there. Pretty much anyone who's maybe, I don't know, over 30 and at a large lab has been at Google Brain at some point. So I think there's just like there was and has been a lot of talent there. TPUs are pretty good. I mean, we think there's better you can do, of course, but they at least had the option, like the opportunity to design the TPUs for

Starting point is 00:01:17 neural nets at least, rather than graphics applications like Nvidia. And so the overall architecture, starting with a single core doing what were at the time reasonably large systolic arrays (by today's standards, not nearly as much), but I think those were a lot of, like, really good decisions. When did the TPU project start? TPU v1 was announced in 2016, I think. That was what actually kind of led to the creation of all of those 2016-2017 startups, so Cerebras, Groq, Graphcore, SambaNova, all of those.

Starting point is 00:01:49 TPU v1 actually was, I think, a really impressive project. It was done on a very short timeline, maybe, I don't know the full details, but maybe about a year or so, maybe a year and a half, with a team of 20, 30 people. Really, really minimal viable product. More recent TPUs and more recent AI chips in general can't do that because the market has moved and the stakes, or the table stakes, are much higher. But the first generation product, they just had one big systolic array, stick a memory next to it,

Starting point is 00:02:22 we're done. And it was really simple, a nice elegant product. And obviously that TPU v1 predates the transformer. Is that just a coincidence that they happened at very similar times, or related in some way? Yeah, I mean, there was a period of maybe about four years of like a lot of, I mean, a lot of ML research or neural net research prior to Transformers. So what was popular? LSTMs and ConvNets and ResNet and Inception. The big thinking at the time was to adapt it to be used for LSTMs. It's a reasonable fit there.

Starting point is 00:03:04 But yeah, no, I mean, I think there was just a huge flurry of activity. I think, like, why did it all happen then and not later is probably just because people stopped publishing. 2020 was about the timing when Google completely stopped publishing its research. Yes, yes. And so all the good papers are from before that as a result. Right, right. But is there some hand-wavy story you can tell about parallelization where both transformers and TPUs are about really internalizing the importance of parallelization?

Starting point is 00:03:36 So, I mean, definitely. I put it somewhat on people, actually. So, I mean, it is just true. Hardware is massively parallel. Like, you've got tens of billions, hundreds of billions of transistors on your chip, and it takes like maybe 100 clock cycles to get from one side of the chip to the other,

Starting point is 00:03:54 and so you can't, like, do a sequential computation involving transistors on both sides of the chip. So the hardware is just fundamentally parallel and you have to take advantage of that. TPU v1 and all later TPUs naturally took advantage of that. Just matrix multiply is really nice because it is so parallel. So I think on the hardware side that's generally understood. I think most ML researchers, especially of the time,

Starting point is 00:04:19 were not sort of super deep in what hardware wants, and what is sort of mechanical sympathy, or something, is the term that's used for that. So I mean. So what does the term, I mean, it kind of makes sense. Yeah, it speaks for itself. It's like, I mean, think about the poor machine and what does it want? What do you want? I mean, the term actually, I think, originates in maybe high-frequency trading and areas like that,

Starting point is 00:04:45 which I haven't worked in. I've just liked reading about the software that people have built there. And it's like, for them, what does the machine want? It wants a lot of instruction-level parallelism. This is CPUs, not GPUs. It wants a lot of, don't branch, so unpredictable branches kill your performance. And so think about the things that CPUs do and how to use them best. Can I get to peak performance on a CPU?

Starting point is 00:05:09 It's sort of that idea. I think the whole idea of peak performance on a CPU is kind of crazy. Like, no one even says, what is peak performance, what is my percentage of peak on a CPU? Because performance of software running on CPUs is really bad. But running on GPUs or TPUs or AI chips in general, actually that is the main focus. It's like, what is my percentage of peak? Can I get 70% or 80%? Yeah. Okay. I feel like many people listening to this know that GPUs perform better for AI workloads than CPUs. And kind of a funny history when you think about it, where just one day we woke up with all these very mathematically intensive workloads, first crypto mining and then AI.

Starting point is 00:05:50 And so then, Nvidia is extremely well positioned because they've been making, like, GPUs for gamers that you'd plug into, you know, you'd buy your Dell PC back in the day and maybe upgrade the graphics card by plugging in a better Nvidia graphics card than the one the stock Dell computer came with. And they were incredibly well positioned to capture that. So I think people know that.

Starting point is 00:06:14 What is the intuitive explanation as to why GPUs are better for AI workloads than CPUs? Because, I mean, people say, yeah, they're better for, you know, these mathematical computations, but that's kind of a tautological answer, basically. Is there some way you can have a mental model for why that is the case? Because, I mean, software instruction sets also involve doing math. Yeah. So, I mean, intuitions, I'm not sure. Let me try and just go to some of the big differences, which is really wide vector instructions, that is sort of the hallmark of a GPU, which I think, maybe if you want some intuition, it's like, how much is spent

Starting point is 00:06:57 on controlling the thing? And maybe control means, like, if I'm driving a truck, how much, like, is the driver versus the payload? Yes. A truck has a huge payload in it. That's more like the GPU, whereas maybe a motorcycle is more like the CPU where you've got, like, the instruction, like actually just processing the instructions, reading, what do I have to do next? Okay, how do I do that. That is most of the cost on a CPU, whereas if you just keep the same instructions but make the payload a hundred times bigger, then you can shift most of the cost to be in the actual work that you want to do. Okay. Okay, so CPUs have been optimized for very complex instruction sets, whereas GPU is optimized for... Yeah, complex instruction sets and

Starting point is 00:07:40 sort of fine-grained changing what you want to do. So, like, I mean, steering, like, in this analogy, like, a CPU can steer an obstacle course, no problem. Whereas, like, on a GPU, you're just going to go in a straight line for a really long time. Yes, yes. Okay, so this is getting us into what is MatX? How did you guys start and which part of this space are you attacking? Yeah, so MatX is making the best chips physically possible for LLMs. What led us into MatX: so Mike is the other founder. Mike and I were both working at Google. And we, I mean, I was working on the inference stack for running LLMs. And I was saying, like, how can we make the best software on TPUs for running LLMs? And then what we really wanted out of hardware was

Starting point is 00:08:35 support for much, much larger matrices. The matrices have grown from maybe 128 in dimension into the many thousands. And so, like, the truck goes to, like, many trailers. So much larger matrices and much lower precision arithmetic. And we, you know, we tried to move the TPUs in this direction. TPUs have been moving in this direction, but they're kind of constrained by a lot of other workloads. There was a big ads workload at the time. And so back in '22, before ChatGPT was released, there was this idea that LLMs were going to be a big thing, but not conviction, and it was really hard to make a big bet on that. I think a startup is more of the right place to make a big bet on a workload. If you fail, it's fine. You just, like,

Starting point is 00:09:24 another startup will succeed. Whereas I think at a company like Google or Nvidia, the next chip has to work for sure. And then so... You can take more technical risks as a startup. Yeah. Yeah. Well, actually, I would say we're taking sort of product risks rather than technical risks. But is there actually product risk? Because it seems like LLMs are going to work. I think now we understand it. Two years ago or three years ago, I think it was not the same. Fair, okay.

Starting point is 00:09:51 And when you say the best chips for LLMs, I mean, I can think of multiple ways to measure best. It could be best performance per watt. It could be lowest latency, capable of handling the largest models. What is best? In general, there are two metrics which LLM workloads care about. One is throughput, which is really just an economics thing. I buy a chip for $30,000,

Starting point is 00:10:10 and then can I do 10,000 tokens a second or 100,000 tokens per second of throughput. That determines the dollars per token. So throughput, and then latency, how fast does the thing respond? As I see the market, the economics seems to be most important. Ultimately, the quality of the AI you can train and serve is constrained by, I have only a $10 billion budget and I want to train and serve the best model I can on that budget. And so if I can have more tokens per dollar, then I can get better quality out. So the product we aim to build is far ahead on throughput.
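The link between chip price, throughput, and dollars per token can be sketched with a few lines of arithmetic. This is only an illustration: the $30,000 price and the 10,000 and 100,000 tokens-per-second figures come from the conversation, while the three-year amortization period and the choice to ignore power, hosting, and idle time are assumptions made here, and `dollars_per_million_tokens` is a hypothetical helper name.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def dollars_per_million_tokens(chip_price_usd, tokens_per_second, lifetime_years=3.0):
    """Hardware-only cost per million tokens.

    Assumes the chip runs at full throughput for its whole lifetime and
    ignores power, hosting, and networking (all assumptions, not from the talk).
    """
    total_tokens = tokens_per_second * lifetime_years * SECONDS_PER_YEAR
    return chip_price_usd / total_tokens * 1e6


# Same $30,000 chip, two throughput levels mentioned in the conversation.
slow = dollars_per_million_tokens(30_000, 10_000)
fast = dollars_per_million_tokens(30_000, 100_000)

print(f"10k tok/s:  ${slow:.4f} per million tokens")
print(f"100k tok/s: ${fast:.4f} per million tokens")
```

Under these assumptions, ten times the throughput on the same chip price gives one tenth the hardware cost per token, which is the sense in which throughput is an economics metric.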

Starting point is 00:10:52 But then actually the sort of surprising thing is we're competitive with the best on latency as well. And so I think that is a unique thing, offering both in the same place. And obviously in AI, there's training the models and then running the models, inference. Is this most interesting for inference, or is there any training angle? I mean, incidentally, it is useful for training, but you are trying to win inference? Is that how you think about it? I think that's a reasonable way to look at it. I think the best inference chip today will be a really good training chip as well.

Starting point is 00:11:23 And so our product is both training and inference, but I think the first sales will be in inference. That's mostly just a market effect where it's easier to buy, like it's not as big of a risk to go to buy an inference cluster as a training cluster. I think the product is really compelling for training as well. And so I think it should be the best training product. And you guys just raised a big new round of financing. Yeah, yeah, that's right. So we've raised a Series B round.

Starting point is 00:11:51 It's led by Jane Street and Situational Awareness. Situational Awareness, that is Leopold Aschenbrenner's fund. He wrote the definitive book on AGI and where it's going. And then Jane Street, they're real technical experts. They understand all the details really well. So we're very happy to be having them lead the round. It is a $500 million round. It helps us actually ramp the manufacturing and supply chain for our chip so we can bring our chip to market. That's a lot of money. Yeah, it is. Yeah. Yeah. No, I mean, I think the, like, roughly I would say it costs about a hundred million dollars to produce a chip in small

Starting point is 00:12:31 volumes, but then if you want to, like, you see the orders that are going around, like OpenAI, Anthropic, Google are going around buying multi-gigawatt clusters. They cost, like, tens of billions of dollars of chips, and you want to deploy all of that in, like, a year or so. And so you just need a massive supply chain behind you. And so assuming everything works technically, what rate of production could you start to see? We have some estimates of where we would like to be on this.

Starting point is 00:12:59 This is, I mean, ramping to very large volumes is a huge challenge for anyone. And so obviously for the large players, they've had some practice at it. Getting to a very large volume for a startup is hard. We would like to be at a place where we're shipping multiple gigawatts a year. Multiple gigawatts per year. Speaking of the metrics, you talked about tokens per second. We used to measure chips in flops. And I guess there's some kind of custom flop thing for AI chips.

Starting point is 00:13:26 But is everyone just using tokens per second these days? Is the industry aligning on that as the chip metric? Yeah, so I mean, I guess it's sort of like an application metric versus the chip itself. Flops of the chip is the key chip metric. There's a little bit of, like, if I go and say I've got, like, an X-flop chip to you, then sort of the appropriate suspicion is to say, okay, but can I actually use those flops effectively?

Starting point is 00:13:53 I see. And so then you need to map the application to that. Yeah, yeah, yeah. So this is kind of telling you the usable flops, yeah, for your purposes. Okay. As a consumer of AI, we have known for a long time that

Starting point is 00:14:06 lower latency products succeed. Google talked about their internal testing where the differences were down to, was it, 50 milliseconds? Something like that. Yeah, yeah. In result times where they noticed more Google engagement

Starting point is 00:14:21 the faster the results were, and you'd think that 50 milliseconds is imperceptible to a human. And it almost is, but turns out it's not. And I think Amazon has, I mean, certainly they've optimized the latency of the Amazon experience quite a lot. I don't know if they've talked about this stuff publicly, but you know that their internal metrics similarly show that, like, the faster the product page loads, the more people buy. And yet, in AI, Google has carved out a meaningful advantage via Gemini just being really fast for its level of intelligence. And they're,

Starting point is 00:14:58 as far as I can tell, ahead of most of the other labs on latency at a fixed, you know, high level of intelligence. Yeah, yeah. Why have you guys or Groq or kind of better chips not been adopted faster to give this product latency? Is it just that, like, this will happen, and you guys will be powering all the AI products? But I note that Google has an interesting lead there.

Starting point is 00:15:22 I think there's ultimately, like, at least for existing chips in the market, there's a really uncomfortable tradeoff between latency and throughput. The chips that are best at throughput have historically been the chips that are based on HBM as the memory. So that is Google, Amazon, NVIDIA. In order to have very large throughput, you need a lot of inferences in flight simultaneously. So that needs the large memory, but that hasn't been so good at latency. And then there's the Groq and Cerebras that are much better at latency because they've got this, the SRAM, weights are in SRAM, very low latency. The problem is, and the challenge when you go to a Groq or Cerebras system, is that the throughput you get there, it just is not very good.

Starting point is 00:16:03 And so the fundamental dollars per token is just not competitive with Google or NVIDIA or Amazon. It is actually possible to do both in the same chip. It's kind of an obvious thing. You say you take the HBM, you take the SRAM, put them together on the same chip. You put the weights in SRAM and you put all of the inference data in HBM. That is what we are doing, in fact. And I think that actually hits a really nice sweet spot where you can get a low latency and also be very cheap.

Starting point is 00:16:31 So I think that's a really attractive point to be. It hasn't happened in the market yet just because of product decisions that have been made by the different chips. Got it. But we should expect it. We should expect all the AIs we're using to get significantly faster over the coming three to five years.

Starting point is 00:16:45 Order of magnitude faster, I'd say. Yeah. So, I mean, generally, HBM-based chips tend to be about 10 milliseconds or 20 milliseconds per... So sorry, HBM-based chips are things like TPUs. That's right, that's right. Yeah. There's just some simple math of like, how long does it take you to read through all of

Starting point is 00:17:01 HBM? It takes about 20 milliseconds, and so that's the amount of time per token it runs, whereas the amount of time to read through all of SRAM is much faster, and so you can typically get about one millisecond. Yes. Yes. Software used to be, like old-fashioned deterministic software, the kind that's now out of favor, used to be very easy and quick to scale.
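The "simple math" behind the 20 millisecond and 1 millisecond figures is capacity divided by bandwidth: if generating a token means reading through the whole memory once, the read time sets a floor on time per token. The sketch below shows that arithmetic. The capacities and bandwidths are illustrative assumptions chosen to land near the figures quoted in the conversation; they are not the specifications of any particular chip, and `ms_to_read` is a hypothetical helper name.

```python
def ms_to_read(capacity_bytes, bandwidth_bytes_per_s):
    """Milliseconds needed to stream the entire memory once."""
    return capacity_bytes / bandwidth_bytes_per_s * 1e3


# Assumed, illustrative numbers (not from the transcript):
# an HBM-based chip with 80 GB of memory at 4 TB/s,
# and an SRAM-based design with 1 GB on chip at 1 TB/s aggregate.
hbm_ms = ms_to_read(80e9, 4e12)
sram_ms = ms_to_read(1e9, 1e12)

print(f"HBM:  about {hbm_ms:.0f} ms per full read")
print(f"SRAM: about {sram_ms:.0f} ms per full read")
```

With these assumed numbers the HBM design comes out near 20 ms and the SRAM design near 1 ms, matching the order-of-magnitude gap described above.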

Starting point is 00:17:22 And you know, you would have social networks that have some... West-West moment, and they can scale through, you know, 10, 100, 1,000 orders of magnitude of, you know, adding users because it's just, you know, a few rows in a database. That's right. And it's a very underutilized CPU. What's interesting about the AI world is there are very real bottlenecks, you know, lots of time talking about power.

Starting point is 00:17:47 But it's not just bringing power online, you know, you mentioned HBM, which is reminding me, it seems like there's a view that, you know, there's going to be some HBM supply chain crunch. And so where do you see, are we in for just a crunched world where some limiter is pacing the rate of AI buildout over the coming few years, where the economics of the products work and everything like that?

Starting point is 00:18:14 But ultimately, we just can't bring the components online fast enough because we have to build out the factories, things like that. And what are those crunched components? Yeah, no, I mean, I think so, and I'll just comment, by the way, this is a great time to be a supplier in this space. Yeah. Or just really-

Starting point is 00:18:30 You should have started an HBM company. I know, right? I think it's also just a fun time to be someone who optimizes software. That's always what I like doing. Always the challenge is, why am I optimizing this if no one cares? But finally, there's a place where actually, like, you can, it's actually very meaningful in a very tangible sense. Like, if I can make this 20% more efficient,

Starting point is 00:18:50 then it can save that 20% of the build-out. The supply chain, we're going to have crunches on all of the supply chain, really. So if you look at sort of the big components of what any company, but like us, for example, builds out, there is a dependency on logic dies from typically TSMC, maybe Samsung, and on HBM, where there are the big three HBM vendors, Hynix, Samsung and Micron. And then there's also just the whole rack manufacturing, which includes, I mean, literally just sheet metal and so on that builds the rack, but also cables and connectors, because of all the high-speed interconnect.

Starting point is 00:19:29 That's what we- Racks don't sound hard. Are they sneaky hard? The big challenge is that you want to bring in a huge amount of power, get a huge amount of heat out, and also have phenomenal interconnect, which has very high signal integrity requirements. And so pack a lot of cables in with, cables don't bend too much,

Starting point is 00:19:48 they have to have enough copper in them and so on, so you don't lose data rate on the interconnect. Yes, yes. So, yeah, if you push it to the limit, it's hard. Okay, wafers, racks, HBM, what else? Data centers, which I think is power, primarily, a little bit of build-out, but primarily power and grid infrastructure there. Okay.

Starting point is 00:20:08 How do you then, as a startup that is looking to acquire all these components, elbow your way in amongst the giants of the Googles and the NVIDIAs and all these people who, you know, have long-running relationships and have been buying for much longer? Yeah. I mean, ultimately what all of these suppliers care about, they do somewhat care about a diversity of their own customers. It's not a great position to be in.

Starting point is 00:20:35 They don't want a monopsony. That's right. Yeah. But then, you know, what is their hesitation, or the calculus for one of these large suppliers, is: if I reserve some of my capacity for you, a startup, are you going to be around in a year? Is anyone going to even buy your product? Our approach has been to just actually find buyers for the product,

Starting point is 00:20:59 and then the buyers answer that question, ultimately. Got it. And so if you show up with a bunch of fairly ironclad contracts to a supplier, then that helps. That's the nature of it, yeah. I presume also the round you just raised really helps there, where showing that you are incredibly well capitalized and not going anywhere

Starting point is 00:21:21 also helps from a supplier point of view, a supplier validation point of view. Yeah, absolutely. Yeah. I mean, it helps just to say that we are around. We, in some cases are actually, it depends on which part of the supply chain, but some parts of the supply chain,

Starting point is 00:21:35 some are fungible. Logic dies are typically pretty fungible. But for other parts of manufacturing, you actually need something specifically set up for you. And so we're also able to cover the capital costs for that. Yeah, yeah, that makes sense. And coming back to the MatX architecture. Okay, you want to build the best chip for LLMs.

Starting point is 00:21:53 What is that? Yeah. Sounds great. Yeah. So, I mean, there's a few aspects to that. I think the first one is just pick your memory system right. And so I said, like, we've seen this HBM family. We've got the SRAM family.

Starting point is 00:22:08 Putting them both together is actually, I mean, the most obvious idea, but like you can actually do it. There are a lot of details to make that work well. We've done that work. One of the things that shows up there is you've spent all of this area on SRAM. How do you fit in the matrix multipliers, which are the other big thing you really need? And so somehow create a much more efficient matrix multiply engine. There is a gold standard for that that is called the systolic array.

Starting point is 00:22:33 Make a really large systolic array. You can't beat that in area or power efficiency. Like provably so? Practically. Practically. There is no known better approach there. The main thing is, like, where are the inefficiencies typically? The inefficiencies show up when you leave the systolic array. So if you make your systolic array really big, then you just don't leave it as often. So that's the idea. So make a really big systolic array.
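One rough way to see why a bigger systolic array "leaves the array" less often is to compare the work done inside the array with the data crossing its edges. The sketch below uses a deliberately simplified steady-state model, which is an assumption made here rather than a description of any real design: an N by N array with weights already loaded performs N times N multiply-accumulates per cycle while N values enter on one edge and N results leave on another. The function name `macs_per_edge_value` is hypothetical.

```python
def macs_per_edge_value(n):
    """Multiply-accumulates performed per value crossing the array boundary.

    Simplified model: an n x n array does n*n MACs per cycle while
    n inputs enter and n outputs leave (weights assumed resident).
    """
    macs_per_cycle = n * n
    edge_values_per_cycle = 2 * n  # n values in, n values out
    return macs_per_cycle / edge_values_per_cycle


for n in (128, 256, 1024):
    print(f"{n} x {n} array: {macs_per_edge_value(n):.0f} MACs per edge value")
```

In this model the useful work per boundary crossing grows linearly with the array dimension, so the cost of moving data in and out is amortized over more arithmetic as the array gets larger.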

Starting point is 00:22:57 That is sort of the theme of several of the 2023-era startups, including us. But one of the challenges there is, now, there is this part of the neural network, part of the transformer, which is the attention, that doesn't map well onto a large systolic array. And so that's attention. The mixture-of-experts layer maps really well, but the attention does not. And so what we came up with, which is quite different from some of the other startups in this space, is to say, take a really large systolic array, but have a way to split it up into pieces without losing efficiency. So sort of that is the core of the design for us. And then the third component. So first was HBM and SRAM. Second is the systolic array.

Starting point is 00:23:42 Third component is just an interesting new approach on low precision arithmetic. Low precision arithmetic, in general we've seen number formats get narrower and narrower. They get faster and faster as you make them less precise. Number formats get narrower. What does that mean? Yeah, so float32 was how people used to train neural nets. That's just too much precision, more than is useful? Too much precision, yeah. It's like saying I've got an image with like a billion-color

Starting point is 00:24:12 bit depth. It's like too many colors. Like, you'd rather have more pixels and fewer colors. And so that trend seems to go all the way, like almost all the way down to like one bit even, where you just have very few colors, but a huge number of pixels. And that on net seems to be better, just a more efficient way to train models. And so sorry, literally what precision are you dealing with in these? So we have a range. I mean, we actually have an ML team who we hired specifically to research different forms of numerics and how to make them all work together really well.

Starting point is 00:24:57 We have a range of precisions. It's not just one precision. We think probably the main thing will be similar to where Nvidia is at, which is 4-bit precision. But I think a mix of different precisions is useful, just, when you look at the research, you want some layers in higher precision or lower precision and so on. Yeah, yeah. Okay, so four bits is 16. Yeah, you get 16 choices.

Starting point is 00:25:16 That's it. Yeah, that's it. Yeah, it's pretty imprecise. Yeah, yeah. That's really interesting. I didn't know about that dynamic, but it makes sense. Yeah, and half of them are positive, half of them are negative. So, like, it's even less precise than that.
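To make "16 choices, half positive, half negative" concrete, here is a small sketch that enumerates every value of one 4-bit floating-point layout. The conversation does not say which 4-bit format MatX uses, so the layout here is an assumption chosen for illustration: E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1), the FP4 element format described in the OCP Microscaling specification. `decode_e2m1` is a hypothetical helper name.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into a Python float.

    Bit layout assumed here: [sign][exp1][exp0][mantissa], exponent bias 1.
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal range: only 0 and 0.5 are representable.
        return sign * mantissa * 0.5
    return sign * (1 + mantissa / 2) * 2.0 ** (exponent - 1)


values = [decode_e2m1(code) for code in range(16)]
magnitudes = sorted({abs(v) for v in values})

print("all 16 codes:", values)
print("distinct magnitudes:", magnitudes)
```

Half of the 16 codes carry a negative sign, and because positive and negative zero both use up a code, only 15 distinct numbers are left, which is the sense in which the format is even less precise than "16 choices" suggests.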

Starting point is 00:25:28 Yeah, yeah. How do you design a chip? Like, what's, is that a whiteboard? Like, what software are you working in? I'd just love to know, like, I understand how you design software and what that process looks like. I've actually no sense for what chip design looks like. So the way that you actually type a chip into a computer

Starting point is 00:25:46 is similar to software. So you write Verilog. Verilog is a programming language. It is a very parallel programming language, which makes it different from, like, C or Python or something. But it is a programming language. So the mechanics of how you express the design are the same as software.

Starting point is 00:26:02 And we have continuous integration, Git, all of those things. But like a program executes, like your Verilog program. We don't really run it, right? Yeah, exactly. How does it run? We synthesize it. Yeah.

Starting point is 00:26:13 Okay. So Synopsys and Cadence provide EDA tools. So EDA, if you remember. Electronic design automation? I don't even know what it means, really. I think it's electronic design automation. It takes the Verilog and first turns it into a description of what are the logic gates that are involved, ANDs, ORs, NOTs, and then the wires between them. And then it runs for days,

Starting point is 00:26:40 doing some really difficult algorithms and then eventually produces, I mean, so gates are the first thing, and then even below that, it literally just produces polygons. It says like P-type semiconductor here, N-type semiconductor here, and polysilicon. Okay. So, like, you write Verilog, and then that compiles down into gates and ultimately, like, the Minecraft, you know, 3D, just this is where your elements should go. But like, then what is the iteration loop? Like when we write code at Stripe, we build a first version of something and then we try it out, and then we refine it and we add more functionality over time.

Starting point is 00:27:25 We're going to write some tests at some point. We'll ship that. We'll find product market fit and then we'll refine it in market. Like, do you just sit down and write the completed chip and it works really well? Yeah, every year we tape out a chip and if there's a bug, we just wait till next year. It's not really how we do it. Yeah, well, so what's the iterative view? How do we actually do it?

Starting point is 00:27:41 Yeah. It's much more waterfall than software is. So like waterfall is almost a bad word in software development. Yeah, yeah. But it's just a fact of life in chip design. Yeah, yeah. So the waterfall goes from architects to logic designers who are writing Verilog, and then there's this design verification and then physical design.

Starting point is 00:28:01 So there's this really big architecture phase, which happens before even writing any Verilog, which is, what do I want the organization of my chip to be? There's in some sense, I mean, what I really, like I came to hardware after doing almost 10 years in software. I really like the blank slate you get in hardware. The raw materials you have are much more varied in what you have available.

Starting point is 00:28:26 So what is the organization of your chip? Do I have 100 cores? Do I have one core? Do I have systolic arrays? Do I have vector units? All of those things. And then we spend a long time coming up with that general principle, and then saying, okay, now I've got these applications I want to run. I want to run a transformer of a particular shape.

Starting point is 00:28:44 I want to map that onto this architecture that I've got in my head. And so we do a lot of iteration. Well, I've got this architecture in my head. I write it down to communicate to other people, but that's just like a markdown file. And then still, actually, a lot in my head, but maybe with Python simulation and so on, I'll see, do my applications map well to it. And so can I run LLM? This part was going to go.

Starting point is 00:29:07 Okay, so you have a simulator. where you write your chip, you can then simulate its performance, and you have some battery of tests that you kind of see how this chip design works. Is it like an industry standard, you know, is it the X plane of chip testing? Yeah. So, I mean, you, there's an industry standard thing for the Verilog once you've done the design. They're just Verilog simulators that you can test against. Okay.

Starting point is 00:29:36 That is, but you've already invested a huge amount of work. by the time you've got to that point. And so you sure hope you haven't made a big mistake at that point. Yes. So the thing that everyone does prior to that is we'll write our own performance simulator, which, I mean, it is very specific to your particular architecture, and you can write it quite concisely in just like a normal programming language. And so that is where most of the architecture work is done.
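The conversation does not describe what such an in-house performance simulator looks like, so the following is only a guess at the simplest possible starting point: a roofline-style estimate that takes the time for a matrix multiply to be the larger of its compute time and its memory-traffic time. The `Chip` parameters, the one-byte element size, and the function names are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    peak_flops: float      # peak arithmetic rate, flop/s (hypothetical)
    mem_bandwidth: float   # memory bandwidth, bytes/s (hypothetical)


def matmul_time_s(chip, m, k, n, bytes_per_element=1.0):
    """Roofline estimate for an (m x k) @ (k x n) matrix multiply."""
    flops = 2 * m * k * n                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    compute_time = flops / chip.peak_flops
    memory_time = bytes_moved / chip.mem_bandwidth
    return max(compute_time, memory_time)


def utilization(chip, m, k, n, bytes_per_element=1.0):
    """Fraction of peak flops achieved under the roofline estimate."""
    ideal_time = 2 * m * k * n / chip.peak_flops
    return ideal_time / matmul_time_s(chip, m, k, n, bytes_per_element)


chip = Chip(peak_flops=1e15, mem_bandwidth=1e12)

# One token at a time versus a large batch through an 8192 x 8192 layer.
small_batch = utilization(chip, 1, 8192, 8192)
large_batch = utilization(chip, 4096, 8192, 8192)

print(f"batch 1:    {small_batch:.2%} of peak")
print(f"batch 4096: {large_batch:.2%} of peak")
```

Even a model this crude shows the kind of question an architect can answer before any Verilog exists: for a given layer shape, is the design limited by arithmetic or by memory traffic, and what percentage of peak is plausible.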

Starting point is 00:30:01 And then the simulation on Verilog is more, I know what I'm doing. I just want to make sure I didn't have any bugs when I implemented it. Right. But I presume it's a game of inches where different people are trying different things, and then you do simulate it to see if it runs 1% better across the battery of tests, or is that not how it works? In this space, not so much. So, I mean, just to sort of characterize what performance of an AI chip is, it is how many, really, like, if you're just, like, first thing you care about is flops.

Starting point is 00:30:30 How many flops have I got? That's a product of how many multiplies, like, I've got a grid of a certain size, like a thousand by a thousand, so it can do a million multiplies in a clock cycle, and then I have a certain clock frequency, like a gigahertz. And so I multiply them out. That is the speed of it. I don't even need to write that and test it to see how fast it is. Yeah. So like what I plan in advance is it's going to be this fast. What I can then optimize on maybe a little bit is clock speed. There's not a lot I can do there. And then I can optimize a bit on area as well. So there is some room for optimization, but actually a lot of it gets set. Like, actually, just the speed of the chip gets set very much upfront. Got it.
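The arithmetic described here can be written down directly. This is only a minimal sketch of the back-of-the-envelope calculation, using the round numbers from the conversation (a 1000 by 1000 grid at 1 GHz); the function name is made up for illustration and is not from any MatX tool.

```python
def peak_multiplies_per_second(rows: int, cols: int, clock_hz: float) -> float:
    """Peak rate of a rows x cols multiplier grid doing one multiply per cell per cycle."""
    return rows * cols * clock_hz


# Round numbers from the conversation: a 1000 x 1000 grid clocked at 1 GHz.
peak = peak_multiplies_per_second(1000, 1000, 1e9)
print(f"{peak:.1e} multiplies per second")
```

Nothing here needs simulating: the peak rate is fixed as soon as the grid size and clock frequency are chosen, which is the point being made about speed getting set upfront.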

Starting point is 00:31:10 And then how many chips do you fab? Is it only the ones going into production, or is it just build a few to throw away, or how does it work? Yeah. So the ideal, which companies tend to hit about 50% of the time, is that your first tapeout, tape out costs like $30 million. Your first- Tape out is just production.

Starting point is 00:31:30 That's right. It's the actual manual, like the first chip costs $30 million, the second chip costs $1,000 or something. So tape out is that first chip. Yeah, okay. The ideal is that your first tape out is actually your production thing. So you do a tape out, you make maybe a thousand chips and test them, and then you do production volume. In the unlucky 50% of the time, you need to redo some or all of your tape out.

Starting point is 00:31:57 So in good cases, and in many cases, you can redo just the metal layers, which costs you only like $100,000. As opposed to paying the $30 million again. But in bad cases, like, if you've made a serious mistake and you can't fix it in the metal layers, you have to do the whole thing again. Why can't that be solved? Like, is that definitionally an error in simulation where it turns out these two gates were close together and it just led to some reliability issues? Yeah. So, yeah, like, what you're describing is like physical, like the physical implementation of the chip

Starting point is 00:32:36 is wrong. That's one class. The other class is that the logical specification of the chip is wrong. But shouldn't that be... Shouldn't you catch that before? Yeah. Before you spent $30 million on the actual...? So, I mean, yeah, we do a lot of testing. We try not to ship these things. I hear software companies also ship bugs to production as well. Fair! Sometimes things... That's a very good retort. Shouldn't you not be shipping bugs?

Starting point is 00:33:01 But I mean, like, there is a real trade-off in... You can spend more and more time on design verification. Yeah. Like, there's always this question of when do you stop? Yeah, yeah. And so you stop when your coverage metrics have hit a certain point, but, like, maybe not 100%. And then if, you know, Apple has to discretize the iPhone release cycle

Starting point is 00:33:28 and they've settled on, you know, once per year. And so they'll decide, you know, we've got this better camera, but it's got to wait for the next version. or, you know, we're going to improve the waterproofing, but, you know, that's got to wait for the iPhone 8 or whatever. And so they have taken a continuous process of like, always coming up with ways to make the iPhone better and discretized it into annual iPhone releases.

Starting point is 00:33:48 What will your discrete cadence be? Many chip vendors have this sort of tick-tock model, which is you'll do on one generation, like maybe you're trying to release every year. On even-numbered years, you'll do a physical technology upgrade, so new transistor technology, new memory technology, and new interconnect. And then on odd-numbered years, you might do an architecture overhaul. I think that's a pretty good fit because you have different parts of your company that are skilled at different areas, and it allows you

Starting point is 00:34:18 to keep sort of both of them occupied without, like, instead doing a massive, risky release every two years. Yeah, yeah, yeah. Okay, and so you think that's probably likely for you. Yeah, that's me. You mentioned interconnect. So there's a narrative out there that, for Nvidia, a huge part of the defensibility comes not from the chips, which are good, but from the software layer and the ability for engineers to write these really parallel workloads and the fact that they've been refining CUDA for whatever number. Yeah, a decade or something. Exactly, yeah, a long time. Just how do you think about parallelization and is that narrative true? Yeah, it's true, for sure. It's true in some, in many areas of the market. I think,

Starting point is 00:35:04 And especially where you look at where Nvidia entered the market, they're doing PC devices, lots of gaming, and so on. There are thousands of games, maybe tens of thousands of games released, and they all need to be programmed against CUDA.

Starting point is 00:35:21 And so there's such a huge investment in the software that this is really important to their compatibility. There are not thousands of LLMs. There's one LLM per frontier lab. There's maybe five frontier labs or something like that. And so just the economics of that is different. The calculation for a frontier lab roughly goes as: I just bought a $10 billion compute cluster. I have hired 50 of the best people who can write optimized GPU or TPU or Trainium software.

Starting point is 00:35:57 I pay them less than $10 billion, a lot less. And so let's put them to work optimizing the compute. And so they can, like, good work there can, I mean, depends on what your baseline is, but it can very easily double the performance of the software you write. And so there is a huge amount of custom software written for every generation of chip. When a new chip comes out, software is like substantially rewritten to optimize for that specific chip. And that's just the right trade-off given the relative costs of these things. What that means for us is that that ecosystem already exists

Starting point is 00:36:32 and that way of operating where you say I'm just going to staff a 50-person team to write software for this chip works really well if you're trying to sell to frontier labs. Okay, so you're saying CUDA is way more important for the games environment where there's just a lot of games than this top-heavy AI market that we're in

Starting point is 00:36:56 where if people say you need to then customize your workload for a MatX chip, it's like, well, fine. Yeah, it's a cost of business. Yeah, yeah, yeah. That makes a lot of sense. Where will you fab the chips?

Starting point is 00:37:13 TSM. Okay. Yeah. Why is TSM so durable? Yeah. I mean, it's interesting, they don't charge a lot as well. You'd think that if they're a monopoly provider, they should charge a lot of money.

Starting point is 00:37:28 They don't. I think that is a big aspect of why they're so durable. It's like this cyclical. It's cyclical conservatism crossed with Taiwanese business conservatism means you're at the most conservative part of the matrix. But I mean, it does, I mean, like an American capitalist might say, well, they're just screwing up. They could have extracted more money from the market.

Starting point is 00:37:52 But you could also say that there's actually this long-term sustaining advantage because they will just stay ahead for a really long time. They don't encourage the creation of competitors. Yeah, yeah. But isn't the creation of competitors kind of priced in because of the geopolitical risk? And so, like, it's not like everyone's fat, dumb and happy with their TSM dependence.

Starting point is 00:38:13 They're actually thinking a lot about it. Yeah, I mean, so there is real technical advantage there as well. It's not just, like, the discouragement. But, like, designing chips seems really hard, building airplanes seems really hard. There are so many areas where competitive market forces create multiple options. Yeah.

Starting point is 00:38:31 And yet, that has not occurred here. So, I mean, there are multiple options. You can buy from Intel or Samsung. But at leading edge nodes. Yeah, yeah. So, I mean, what do we even care about in leading edge nodes, I guess? The big advantage is on power. The advantage on area is smaller.

Starting point is 00:38:46 The leading edge nodes, the density doesn't go up as much as it used to. So when you are really, really sensitive to power, it is a good idea to be on leading edge nodes. So that is AI chips and mobile phone chips. But there's a lot of the market where you don't, like, devices in cars and so on. Yeah, car chips, yeah, that's fine. But you're kind of saying, like, if you exclude the two most interesting parts of the market. Yeah, that's true. That's true.

Starting point is 00:39:10 Yeah. Just for this super high growth area of the market, it's interesting to me. Like, again, there's a lot of other really complex business problems out there that competition has solved. Chip design is, like, why has someone not left TSMC and gone and built a new fab? Yeah, I mean, I don't know. The cost of a fab is extremely high. I mean, I recognize that the cost of a lab is extremely high too. I don't really understand the technical details of why it's so hard.

Starting point is 00:39:45 I mean, there is some amount of just a $10 billion fab versus a $100 million like tape out and chip development. There's a huge difference there. But beyond that, I'm not sure. What's TSM like to deal with? So they're very big. So as a startup, we tend to work with, not directly with TSMC, but with an ASIC vendor who, I mean, firstly does a huge amount of the actual backend work for us, that change

Starting point is 00:40:08 device with them, but then also has their existing relationships with them. Got it. TSMC cares a lot about diversity of their customer pool. And so... Again gets back to that conservatism. Yeah. So they're great to work with from that perspective. They want to encourage startups.

Starting point is 00:40:25 That's right. Yeah. That's very cool. Why don't the labs design their own chips? I mean, Google does. Google does. OpenAI is starting. It's really a tradeoff of how much advantage you get from vertical integration

Starting point is 00:40:35 versus how much advantage you get by concentration of R&D work. So you take the five labs, and if they all buy from one player, then you can put like five times as much R&D into that chip. And does that beat the advantage you get from saying, I know exactly what my model is? Because of the like several years delay from designing a chip to being in production, you can't actually say I know exactly what my model is because models change, like much faster than that. So even the labs are forced into this position where they have to make predictions and they have to hedge against what they might do two years from now.

Starting point is 00:41:12 The calculus is sort of like, what is the probability distribution of what my model might look like, and then sort of design a chip that gets like 90% of that probability distribution or something. Yeah, yeah. Elon is excited about data centers in space. Yeah. The two criticisms I've heard are that cooling is very hard, and then just repairing the chips is hard. But I know nothing about chips.

Starting point is 00:41:39 You do. Yeah. So, I mean, the repair, I think, is really interesting. When you look at how Nvidia deploys their racks, and we do something pretty similar to what Nvidia does. I mean, in general, you always need to design for the fact that some of your chips are going to be down. Like, mean time between failures of chips is not that large.

Starting point is 00:41:57 And so in a cluster of 100,000 chips, there's going to be chips that are down all the time. One way you can do that is you can make a rack where one rack has some spare chips in it. NVIDIA has eight spare chips in a rack of 64. That's pretty good. The combinatorics works really well for you there. That's the sort of you can actually,

Starting point is 00:42:20 because you can pick which ones to avoid, you can, like, with very high probability, you tolerate a lot of failures. And then the other, just for, like, the other family of things is to say, my rack has to work, but I have some spare racks as well. So, you can math that out with, like, the tax of reliability here is only, like, 10%. It's pretty good. And you, but that relies on someone coming and servicing the device within a day or something like that.
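The spare-chip argument can be checked with a small binomial calculation. This is a rough sketch under assumptions that are not from the conversation: failures are treated as independent, the "rack of 64" is read as 64 chips in total of which 8 are spares, and the per-chip probability of being down (1%) is a placeholder value.

```python
from math import comb


def rack_available(total_chips: int, spares: int, p_chip_down: float) -> float:
    """Probability that no more than `spares` of `total_chips` are down at once.

    Assumes each chip is down independently with probability p_chip_down.
    """
    return sum(
        comb(total_chips, k) * p_chip_down**k * (1 - p_chip_down) ** (total_chips - k)
        for k in range(spares + 1)
    )


# Placeholder inputs: 64 chips per rack, 1% chance any given chip is down.
without_spares = rack_available(64, 0, 0.01)  # every chip must be up
with_spares = rack_available(64, 8, 0.01)     # up to 8 chips may be down
print(f"no spares: {without_spares:.4f}, 8 spares: {with_spares:.10f}")
```

With no spares, the rack is only usable when all 64 chips happen to be up at once; allowing any 8 to be down makes an unusable rack extremely unlikely, which is the combinatorial effect described above.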

Starting point is 00:42:46 If you say they're going to service it never, then I think you actually can get where you want to be, but maybe with 100% tax on reliability rather than 10%. So for example, if you think the average lifetime of a chip is in the range of three to five years, so that means if I deploy twice as many chips, then three to five years from now, half of them will still work. Yeah. And also the burning is particularly failure. And how about the cooling? So most of the challenge, I mean, I guess there's actually really a data center design aspect. Then at the rack level, the challenge of cooling is just getting the heat out, like as quickly as possible, out of the rack into the cooling network.

Starting point is 00:43:26 How you get it out of the spaceship, other people would know that better than I do. Okay, yeah, yeah. Again, that seems to be the main objection, but I don't know. Yeah, I mean, I think it's sort of, like, if you think the cost of repair is that you need to have deployed twice as many chips, then, like, it's a tradeoff of the capital of the chips versus the power saving. the repair thing it feels like can be solved because also I think part of the best you know probably one's claim is that we will just be so power limited that you know you have no option but to go

Starting point is 00:43:59 to space. And, you know, people can argue about that, but were that to be the case, then yes, it's like, well, you can get power in space and you cannot on Earth, and so you might as well go there. Whereas, like, the cooling is a more fundamental, does the product actually work? Yeah.

Starting point is 00:44:42 What are your AI predictions for 2026? I mean, what I'm really excited about is just being able to, I'm still excited about the coding. It's what we do as a company. It's what many others do as a company as well. The one aspect of this is expanding into more domains. So for example, in where we spend our time, we, as a company, we write Rust, we write Verilog, we write Python. No Haskell?

Starting point is 00:45:26 Yeah, no, there's a story there. I used to love Haskell. Rust is my current favorite. Okay. Mutation is good. The models are extremely good at Rust and Python. They've done a lot of RL on them. They have not done as much RL on Verilog.

Starting point is 00:45:45 They've done almost none on, okay, write me a markdown file that describes a chip architecture. And then how do you even RL on that? You have to say, like, what is a good chip architecture? I have to somehow say whether that's a good result or not. Yes. I think one of the things the labs are doing is trying to broaden what they've done RL on, source it from customers and so on, in order to sort of fill out the knots,

Starting point is 00:46:14 make it less spiky, fill out the gaps between the spikes. I presume the labs would love to work with you on improving the models by doing RL on this specific task. However, it's also... How does it make sense for us? Yeah, it's your special sauce. So do you want to come up with some AI approaches but keep them proprietary? Is that... Yeah, so, I mean, we've looked at a few different aspects here.

Starting point is 00:46:42 There's the... I mean, what we're able to do by ourselves, our business is not training models. We do it in order to do the research on numerics, but actual production models we don't do. So it's like the biggest mileage, I think, is on the RL, and it's not something we can really do ourselves. We'd love it if we could have a custom model just for us, but that doesn't seem to be able to do it. The terms we've been offered by labs so far have not been on those terms. Because you have to share the IP back. The way they prefer to do it is that they put it into their mainstream model

Starting point is 00:47:15 because it's good for them. Yeah, yeah, yeah, which obviously you don't want to do. And, yeah, I mean, what do you think using AI to design a model looks like? Because this is actually, I think, an interesting sightglass into the, you know, a weak version of recursive self-improvement where, you know, we're using the AIs to develop better AIs. And so I'm curious, yeah, what you think that looks like? Is it your own proprietary recursive models?

Starting point is 00:47:40 What else? Like, is there kind of day-to-day AI usage that's load-bearing? Yeah, I mean, so the stuff that is available today, and I think will become even better very quickly, is just the stuff that looks most like software. So writing Verilog, running tests, running continuous integration, and so on. And that is a big fraction of the development time in the chip. It's probably 9, 12, 15 months or so. There's some stuff that's downstream of that, which is physical design, which is you take that Verilog and you generate the gates and the polygons.

Starting point is 00:48:17 We don't have a clear path for, like, it's not, at least the most obvious thing is not clear for how to compress that. Like, the goal, can you tape out a chip in one month? One month would be the goal. In theory, you could compress all of the logic design and design verification down to a short amount of time if the, just by continuing on the same path we're doing now.

Starting point is 00:48:36 But if you wanted to take the physical design down, that has to leave code. You're now doing, like, graphical interfaces and saying, well, I want to place stuff and so on. Actually, there has been work on this even prior to LLMs, which is like specific model trained for that particular problem. Yeah.

Starting point is 00:48:55 And I think the vendors, which is like Synopsys and Cadence, probably should move in that direction. Most of the focus has not been do it faster. It's been do it with higher quality. But that is a big bottleneck on, like, can I have a new chip every month? And then there's just the practical thing of like, a new chip every month doesn't really make sense because then if I'm deploying, like if it takes me a year to populate a data center,

Starting point is 00:49:21 that means I'm going to have different chips in different corners of the data center. Yes, yes. Sorry, when you talk about one month to tape out, so you do all this work to ultimately produce a file, everything TSM then does, it's not entirely in software. Like, is there some type setting that has to happen of moving stuff around? But yeah, what happens when you send your file to TSM? Yeah.

Starting point is 00:49:45 Then what? So they create a mask. So that is where the ASML tools come in. And a mask is, it is really just a stencil. You shoot the lasers through the mask or the x-ray through the mask, and then that produces the different P-type and N-type semiconductors. So they produce the mask. That is the expensive part. And then they're building up these like 15 or so metal layers. So they place down the silicon and then there are different layers of metals, which connect all the transistors together. They do that on a wafer. It happens on a stepping basis. So there's sort of a maximum size of chip you can build, which is constrained by this machinery.

Starting point is 00:50:30 The wafer stepper is part of the ASML special sauce, right? Yeah, I guess there's probably some important alignment requirement. Yeah, I think I remember that being quite like the, you know, it's a classic manufacturing throughput problem. I think we've done a lot of work on optimising. Yeah. Yeah. So they take that, so then you just produce hundreds of copies of your chip. You have to test it because there's defects.

Starting point is 00:50:52 You typically, I think the average rates really depends on process and so on, but small, single digit number of defects per chip. So you test the chip and see whether it has any defects in it. Many chips are designed to be able to tolerate a few defects, and so you need to configure it to tolerate the defects. And now you have a die that by itself works. And then you need to package it. So you put it in a package together with memories. Typically, that's the HBM. And then maybe you escape the wires to connect to other chips.

Starting point is 00:51:25 How long does it take to make a mask? So, I mean, what we see is time from, like, tape out to first chip, to chips back. Again, depends on node, but it's ballpark four or five months. Oh, so tape out is just like sending the file? Yeah, well, I mean, we can say tape out is sending a file, and then there's a whole process. So you make the masks for all the layers and then actually just producing the chips. Got it. So producing the masks and producing the chips happens after tape out.

Starting point is 00:51:51 That's right. I see. Okay. So, like, is the term tape out from, like, you send a magnetic tape with the instructions or something? It could be. I was in software when that term was created. I'm curious what the tape actually means. It feels like, you know, I think about AI predictions, one thing I'm really struck by is how

Starting point is 00:52:09 still, in 2026, every time you open a chat window, it's contextless. It's got no memory. And now, to be fair, it's like, guys, it's been four years. Not even four years, it's been three and a half years. Just calm down, we'll get there. But I also interpret

Starting point is 00:52:29 a lot of the current enthusiasm for OpenClaw and all that stuff as it's like this super hacky backdoor into state management where your little claw will write a markdown file of what it's doing and then look at that markdown file the next time and things like that. But it just feels like state management and memory is going to be a huge deal.

Starting point is 00:52:54 And that will really change the character of AI products. Yeah, it's really interesting. So, I mean, long context is the reason, is one of the biggest bottlenecks on speed of the model. Yes. Every single token you generate, it reads through all of the previous tokens, or maybe it reads through a subset of them, but reads through a lot of the previous tokens you've written. And so memory bandwidth for that is really constraining.
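To make the memory bandwidth point concrete, here is a rough sketch of the time spent just re-reading earlier tokens for each new token. All three inputs are illustrative assumptions (context length, cached bytes per earlier token, and memory bandwidth); none of them are figures from the conversation or from any particular chip.

```python
def context_read_seconds_per_token(
    context_tokens: int,
    cached_bytes_per_token: int,
    memory_bandwidth_bytes_per_s: float,
) -> float:
    """Time to stream the cached state of every earlier token once.

    This is a floor on per-token latency when generation reads all prior tokens.
    """
    return context_tokens * cached_bytes_per_token / memory_bandwidth_bytes_per_s


# Assumed values: 100,000 tokens of context, 100 KB cached per token, 1 TB/s.
seconds = context_read_seconds_per_token(100_000, 100_000, 1e12)
print(f"{seconds * 1e3:.1f} ms per generated token just reading context")
```

Doubling the context doubles this floor, which is why reading only a subset of earlier tokens, or compacting them, helps speed.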

Starting point is 00:53:21 You can think of model-level ways to solve that problem, which is to say maybe I can compress it into just a few bytes or something like that. But it's interesting that the sort of most effective way to solve it has been, I mean, it's really a combination of everything, but the most effective way to solve it has been once you hit your 300,000 token limit, have the model go back through it and compact. Yes, yes. I mean, it's kind of what OpenClaw is doing.

Starting point is 00:53:46 It's like compacting everything you've done. But it's funny that it's so manual. Yeah, I mean, I think... Manual is the wrong word. You don't mean it's so primitive? It's maybe because it's so controllable. You can, like, if you want to iterate on how you compact, you give a different prompt and you say,

Starting point is 00:54:04 compact this way, compact that way. You can iterate on that in seconds or minutes. Whereas if you're trying to do some iteration on the model level where you say, now I've got a different model architecture, it's going to take months to launch something. Yes, yes. Any other AI predictions? I'm generally just interested in what makes models cheaper and faster.

Starting point is 00:54:23 So that's just at the model architecture level. Really tied into this context thing. I think the context size will stay ballpark the same where it is, maybe a few times larger. But the parameter count will go up. Like, parameter count should grow much faster than context length, actually, just because of the underlying physics of what's available. Though, has that been the story, like, would that be a reacceleration of parameter

Starting point is 00:54:47 count? Because it feels like we've leveled off slightly in the last year or two, and instead we've been focusing on more and better RL. Yeah, okay. Parameter count or thinking tokens, I guess. Those are available, but the context length, I think, is sort of struggling to grow. Yeah, yeah. Okay, but you think we... We say context lengths are struggling to grow, but you're saying we keep context the same length. Keep context the same length. But we're better at working with large context, is that what you're saying?

Starting point is 00:55:16 Yeah, I mean, just have application-level interventions to manage large context, like compacting. Yeah, yeah, because I think everyone's had the experience, you know, currently of, like, the chat conversation and the further down in the chat, you ask. It just gets looser and... Yeah, it's just, like, really sloppy by the end, and it's like making mistakes with our... So you're saying we start to do better with large context. Okay, about that. When will I be typing into a chat window and it is a MatX chip underneath it, powering it? Tape out in under a year.

Starting point is 00:55:46 And then that means chips available in sort of... Okay, that's great. Okay, so in 2027, I will be seeing very high-performing chats as a result of... In the 1% experiment of the users or something like that. Yeah, exactly, yeah. I need to find a way to finagle myself into the A/B test. Yeah. MatX is 100 people. That's right. Yeah. How have you gone about building the team, the culture?

Starting point is 00:56:13 Yeah. So I mean, so what we have on the team is hardware, mostly hardware, but a big software team and also a big ML team. I think the ML team is quite unusual in what we ask them to do. When you look at a typical ML team in an AI chip company, it will be what I might say, ML engineering or ML performance, they're writing kernels that actually will just use your hardware well on a given model. There's sort of a missed opportunity there, if you're saying all we do is we take other people's models and we write kernels for them, you're optimizing this,

Starting point is 00:56:51 but you can optimize this at the same time. And so we want to optimize the whole thing at the same time. So like real co-design. Yeah, yeah. So our ML team is actual real ML research. What they do every day is they train small LLMs from scratch, focusing on numerics and attention. And this has really, really helped us make an interesting product.

Starting point is 00:57:18 It's shown up most strongly in our numerics. Often what you see when people design numerics is they say, well, back when float32 was popular, it would be, I'm going to follow the IEEE standard. Now it is like follow the Open Compute standard. And there's lots of little details where you say things like maybe what's the rounding mode I'm going to use, like round to nearest even or something like that, which is like the best known standard for how to round. We want to cut corners anywhere we can. And so like maybe don't do the best rounding. Maybe don't get all the corner cases correct. That's a very scary proposition if you're just making those choices blind.
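As an illustration of what cutting a corner on rounding means, the sketch below narrows a float32 value to a 16-bit bfloat16-style format in two ways: plain truncation (the cheap option) and round to nearest even. The choice of bfloat16 is an assumption made for the example, not a statement about MatX's formats, and the code deliberately ignores NaN and infinity corner cases.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after conversion to IEEE float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """float32 value with the given 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bf16_truncate(x: float) -> float:
    """Cheap rounding: drop the low 16 bits of the float32 pattern."""
    return bits_f32(f32_bits(x) & 0xFFFF0000)


def to_bf16_nearest_even(x: float) -> float:
    """Round to nearest, ties to even, on the low 16 bits (NaN/inf not handled)."""
    bits = f32_bits(x)
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties round toward an even kept mantissa
    return bits_f32((bits + bias) & 0xFFFF0000)


# A value exactly halfway between two neighbouring bfloat16 values.
x = 1.0 + 2**-7 + 2**-8
print(to_bf16_truncate(x), to_bf16_nearest_even(x))
```

Truncation needs no adder at all, while round to nearest even needs an increment and a tie check. Whether a model tolerates the cheaper option is the kind of question that small from-scratch training runs can answer.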

Starting point is 00:57:59 But if you have the benefit of a research team who can sort of back you up as you do that, it's really powerful and it's really interesting that we can make some sloppy choices in these cases. I feel like often technical advances come through better iteration loops. A favorite example of this I found recently was that the Wright brothers actually had a failed season before first flight. So I guess first flight was 1904 and they were down in Kitty Hawk in 1903. and not making that much progress. And they went back to Ohio, and they had a wind tunnel. And they were, like, testing their design in a wind tunnel.

Starting point is 00:58:37 You can imagine not a lot of wind tunnels in 1904. And they did a lot of wind tunnel testing, and their successful flight was after that. Is this something you're focused on where, you know, to get better chips, you allow for a better testing and iteration loop, and what does that look like? Yeah. I think this mostly happens in the architecture.

Starting point is 00:58:59 and product definition stage. Maybe even more generally, I think AI chips seem to live or die by product definition and architecture. What is the most extreme form of fast iteration? It's doing it in your head. And so can you map a model to hardware in your head? Can you estimate the performance of what it is in your head? You're not going to be 100% perfect,

Starting point is 00:59:20 but maybe you can prove some kind of lower bound on performance. And so the simplest possible thing is my model has a trillion parameters. My device can do a billion multiplies per second, so it takes 1,000 seconds to run or something like that. Just do that simple division. But then there are much more complicated things. We tend to look at resource balances.
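The simple division, plus one example of a resource balance, can be sketched as follows. The compute numbers are the ones from the conversation (a trillion multiplies at a billion multiplies per second); the memory side (one byte fetched per parameter, 2 GB/s of bandwidth) is an assumption added purely to show how two resources get compared, and the function names are illustrative.

```python
def compute_seconds(multiplies: float, multiplies_per_second: float) -> float:
    """Time needed if multiplication throughput is the only limit."""
    return multiplies / multiplies_per_second


def memory_seconds(bytes_fetched: float, bytes_per_second: float) -> float:
    """Time needed if memory bandwidth is the only limit."""
    return bytes_fetched / bytes_per_second


def bottleneck_seconds(multiplies, bytes_fetched, multiplies_per_second, bytes_per_second):
    """Runtime can be no shorter than the slower of the two resources."""
    return max(
        compute_seconds(multiplies, multiplies_per_second),
        memory_seconds(bytes_fetched, bytes_per_second),
    )


# One multiply per parameter for a trillion-parameter model at 1e9 multiplies/s.
t_compute = compute_seconds(1e12, 1e9)
# Assumed memory side: one byte fetched per parameter at 2e9 bytes/s.
t_floor = bottleneck_seconds(1e12, 1e12, 1e9, 2e9)
print(t_compute, t_floor)
```

When the two times are far apart, one resource is sitting idle; balancing them is the kind of check that can be done by hand before any simulator exists.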

Starting point is 00:59:42 And so, like, how many memory fetches do I need to do per multiply or something like that? So we do, I mean, at least the way I like to do design and architecture and optimization is to be able to sort of estimate the performance to within about 30, 40 percent before even typing anything in at all. And so we've tried to do that a lot. A lot of our architecture comes from there. Then sort of the next stage of iteration is, oh, that's kind of on the performance side. This also happens on the circuit design side as well. Can you take a circuit and say, what is the gate count on that? So like a 16-bit multiplier has approximately 16 squared many gates, and you can do that for more complicated things by sort of sorting networks

Starting point is 01:00:28 and so on. So we already have a pretty good idea of the costs and speeds of things at that point after doing these calculations. Then what we tend to do as sort of the next step of iteration is on the ML side, we run model experiments. You get iteration speed just by having small models mostly. And then on the hardware side, we use simulators, performance simulators to like do the next level of detail to make sure we're seeing all the things we want to see. Yeah, yeah. This idea that the best iteration is in your head is kind of reminding me of Jeff Dean's, you know, numbers. Yeah, like, do you have your equivalent of those numbers at MatX? Yeah, we have go/gates in our company, which says what is the cost of an XOR gate, an AND gate, a full adder, an SRAM bit cell and so on.
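The gate-count rule of thumb from this exchange (a 16-bit multiplier needing roughly 16 squared gates) is easy to turn into a quick estimator. This is only a sketch of that one rule; the function name is made up, and real gate counts depend on the multiplier design and the cell library.

```python
def multiplier_gate_estimate(bits: int) -> int:
    """Rule-of-thumb gate count for a bits x bits multiplier: about bits squared."""
    return bits * bits


# Compare a few operand widths to see how quickly the cost grows.
for bits in (8, 16, 17):
    print(f"{bits}-bit multiplier: ~{multiplier_gate_estimate(bits)} gates")
```

Because the estimate is quadratic, halving the operand width cuts the multiplier to about a quarter of the gates, which is part of why narrow number formats matter so much for area.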

Starting point is 01:01:15 And you want people to be working with that stuff in their head and have an intuitive sense for it because again, leads to better iteration. What is the pitch to someone joining MatX? I mean, I think if you are someone who likes optimizing, just optimize something. Yeah, yeah, yeah. Software, hardware, Factorio, whatever. If you're trying to, like, fit something into the smallest budget possible, I think it's a pretty exciting place to be. I think hardware companies in general are really exciting because you have such a broad range of skills of people on the team. You have software people, you have hardware people, you've got physical design, you've got

Starting point is 01:01:55 people who are just, like, looking at the insertion force of a card into a rack. And so there's so much discussion and learning you can do. I think at MatX in particular we really care about this, and I think we extend it all the way up into the application and the machine learning as well. And so the problems are really, really interesting, and I think just generally, there are lots of interesting people to talk to. Yes. And presumably, in terms of impact, if you can design a meaningfully higher throughput chip,

Starting point is 01:02:33 a 20% higher throughput chip means 20% more AI is happening. You know, if the bottleneck is elsewhere, like, you know, power or something like that or cost, you actually just are meaningfully increasing the amount of intelligence in the world, which is presumably exciting to people. Yeah, yeah. I mean, I think this shows up both as being able to apply it in more applications and as just how smart the model is. Yes, yes.

Starting point is 01:02:58 Why Rust? So a previous project I worked on at Google, we did a lot of Haskell. I did Haskell growing up, like when I was at school. I loved it, like very principled, very interesting. I like Haskell, but I also like making stuff fast. And then the question is, what is the first thing you want to do? You want to be able to modify your memory. In Haskell, you jump through hoops to do that.

Starting point is 01:03:21 Maybe I just want a language that is like functional programming but lets me modify memory. So I think Rust has a lot of the nice things, which are like type classes, or traits, and a rich type system. One of the interesting ways we use it at MatX is the range of data types that you express in software. Like, what are the integer types? int32, int64, int8, maybe that's all you care about. But it turns out in hardware, you care about every single bit, and so you want to use like 17-, 18-, 19-bit integers. That is quite natural to express, and we build up sort of a whole ecosystem of rich

Starting point is 01:04:08 hardware data types in Rust. Has Rust beaten Go for the position of sort of performant, typed programming language with modern features, or do they actually address different pockets? Yeah, I mean, so there's what the Rust marketing will say, which is safe without garbage collection, which I think is the objective thing that you can say is different, but that sort of buries the lede, which is, it's also just got nice type system features that Go doesn't have. And then, why does garbage collection matter at all? I mean, people often focus on the time it takes to run a garbage collector, but

Starting point is 01:04:50 The other thing is that every time you allocate an object, you've got the object, and then you've got the garbage collector header at the beginning. So it uses a lot more memory as well. And so if you want to design some, I don't know, data structure that uses the right amount of memory rather than a bit more than... I see, I hadn't realized that in Rust you're allocating your memory manually, versus in Go, you have a... Yeah, that's right. Okay. And you prefer that for what you're doing.
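
The point made a little earlier about 17-, 18-, and 19-bit integers being natural to express in Rust can be illustrated with const generics. This is a minimal sketch, not the actual data-type library described in the conversation; the type name `UInt` and its methods are hypothetical.

```rust
/// An unsigned integer with exactly `BITS` bits (0..=64), stored in a u64.
/// Hypothetical illustration of a hardware-style integer type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct UInt<const BITS: u32>(u64);

impl<const BITS: u32> UInt<BITS> {
    /// Mask with the low `BITS` bits set.
    pub const MASK: u64 = if BITS >= 64 { u64::MAX } else { (1u64 << BITS) - 1 };

    /// Returns `None` if `value` does not fit in `BITS` bits.
    pub fn new(value: u64) -> Option<Self> {
        if value <= Self::MASK {
            Some(Self(value))
        } else {
            None
        }
    }

    /// Addition that wraps around at 2^BITS, as a fixed-width adder would.
    pub fn wrapping_add(self, rhs: Self) -> Self {
        Self(self.0.wrapping_add(rhs.0) & Self::MASK)
    }

    /// The underlying value, always within 0..=MASK.
    pub fn get(self) -> u64 {
        self.0
    }
}
```

With this sketch, `UInt::<17>` and `UInt::<18>` are distinct types, so mixing widths by accident becomes a compile-time error rather than a silent truncation.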

Starting point is 01:05:12 Or it just is... I just really like dealing with the details. Like, you give me a puzzle and I'll be like, let me solve every single piece of it. Yeah, yeah, yeah. So that tickles that part of my mind with Rust. It seems like you're a fan of optimization, generally. Is that a fair characterization?

Starting point is 01:05:25 Yeah. Where else have you... So chip optimization is one domain. Where else? Yeah. So I mean, one of the really exciting things I found about working at Google is, like, the whole Google code base is available and you can look at: how does a memory allocator work?

Starting point is 01:05:44 How does a mutex work? How does a hash map work? Any of those things, and you can go and look inside the implementations. And Google has excellent implementations of those, like some of the best you could write. So, like, one of the things I did on my nights and weekends when I was at Google was just go find those implementations,

Starting point is 01:06:06 write a benchmark, how many nanoseconds does it take to allocate eight bytes of memory, and then can I make that faster? Maybe I inline this function, maybe I look at the assembly and say, looks like there's a few memory moves here, or there's some registers being used that I don't need in the fast path, that I only need in the slow path. Can I do something there? So, I don't know, that was always my fun and learning activity.
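
The kind of micro-benchmark described here, timing an eight-byte allocation, might look roughly like this in Rust. It is a sketch only: it measures whatever global allocator the program is linked against, it times allocation and free together, and the function name is hypothetical. A serious benchmark would also need warm-up runs and repeated trials.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Average nanoseconds for one allocate-and-free of an 8-byte heap object.
pub fn ns_per_8_byte_alloc(iterations: u32) -> f64 {
    let start = Instant::now();
    for i in 0..iterations {
        // Box<u64> is an 8-byte heap allocation. black_box keeps the
        // optimizer from removing the allocation entirely.
        let boxed = black_box(Box::new(u64::from(i)));
        drop(boxed);
    }
    // Guard against division by zero when iterations == 0.
    start.elapsed().as_nanos() as f64 / f64::from(iterations.max(1))
}
```

Running it with a large iteration count before and after a change to the allocator fast path is one way to see whether the number actually went down.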

Starting point is 01:06:34 I mean, I probably could have done this inside of Google as well, but outside of Google I felt the sort of luxury to be able to talk about these results as well. One of the things I've looked at recently is just that hash tables are used so much. One prompt for me was, like, if I want to design custom CPU instructions for accelerating hash tables... like, hash tables are one of the most common things. I'm looking things up in them and writing to them all the time. What would the optimal CPU be for that? And so then, following down that chain, it's like, what is the best hash table implementation in the first

Starting point is 01:07:13 place. And so I spent some time looking at different SIMD implementations, and there's this really cool technique called cuckoo hashing, where you hash into two different locations and then you use the bucket which is less full. It's been in the literature for decades and yet the best hash table implementations don't use it because it's somehow, like, not practical. And so... I'm sorry, why is it not practical? Practical hash tables are these days considered to be ones that use SIMD, vector instructions, to scan like eight buckets at a time. And the way cuckoo hashing is normally described is I look up one bucket here and one bucket there.

Starting point is 01:07:59 And so I'm not using the vector instructions. Vector instructions are much faster than scalar instructions. And so there's kind of a missed opportunity. Again, just take the two good ideas and stick them together. Doing vector instructions on cuckoo hashing... you have to be careful to get the details right, but if you get it right, you can actually just win. Sorry, is your claim that one could design a custom CPU that has way better hash table performance, or even on current chips, you could get way better hash table performance?
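
The combination described here, two candidate locations per key with a multi-slot bucket at each, can be sketched as follows. This is a simplified illustration, not the speaker's implementation: it places a key in the less-full of its two buckets but omits the displacement (eviction) step of full cuckoo hashing, and the scan over the eight slots is a plain scalar loop standing in for a SIMD compare. All names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Slots per bucket; eight matches the "scan eight at a time" idea.
const SLOTS: usize = 8;

#[derive(Clone, Copy, Default)]
struct Bucket {
    keys: [u64; SLOTS],
    vals: [u64; SLOTS],
    len: usize,
}

pub struct TwoChoiceTable {
    buckets: Vec<Bucket>,
}

impl TwoChoiceTable {
    pub fn new(num_buckets: usize) -> Self {
        Self { buckets: vec![Bucket::default(); num_buckets.max(1)] }
    }

    /// Two different seeds give the two candidate buckets for a key.
    fn bucket_index(&self, key: u64, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        (seed, key).hash(&mut hasher);
        (hasher.finish() % self.buckets.len() as u64) as usize
    }

    /// Scalar stand-in for a vectorized compare across the bucket's slots.
    fn find_in(&self, b: usize, key: u64) -> Option<usize> {
        let bucket = &self.buckets[b];
        (0..bucket.len).find(|&i| bucket.keys[i] == key)
    }

    pub fn get(&self, key: u64) -> Option<u64> {
        for seed in [0u64, 1] {
            let b = self.bucket_index(key, seed);
            if let Some(i) = self.find_in(b, key) {
                return Some(self.buckets[b].vals[i]);
            }
        }
        None
    }

    /// Inserts or updates. Returns false if both candidate buckets are full
    /// (a full cuckoo table would displace an existing entry instead).
    pub fn insert(&mut self, key: u64, val: u64) -> bool {
        let b0 = self.bucket_index(key, 0);
        let b1 = self.bucket_index(key, 1);
        for b in [b0, b1] {
            if let Some(i) = self.find_in(b, key) {
                self.buckets[b].vals[i] = val;
                return true;
            }
        }
        // Choose the less-full of the two candidate buckets.
        let target = if self.buckets[b0].len <= self.buckets[b1].len { b0 } else { b1 };
        let bucket = &mut self.buckets[target];
        if bucket.len == SLOTS {
            return false;
        }
        bucket.keys[bucket.len] = key;
        bucket.vals[bucket.len] = val;
        bucket.len += 1;
        true
    }
}
```

A lookup touches at most two buckets, and each bucket's keys sit contiguously in memory, which is the layout that would let a vector instruction compare all eight keys at once.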

Starting point is 01:08:27 So both. I mean, I'm interested in what you can design in custom hardware, but MatX doesn't make CPUs. We're not going to make CPUs. You could. New line of business. I mean, we just want to focus on shipping one product well. For the time being. Fair. Good answer.

Starting point is 01:08:44 So, I mean, I think it's an interesting exercise, but I don't get to feel the endorphins of seeing the number go down. So I first did this on just Intel CPUs. And you can get better performance than, like, some of the best hash table implementations available using cuckoo hashing, even on Intel CPUs. And what are examples of workloads that are really hash-table intensive? I mean, I know, kind of everything. I mean, JavaScript, I guess, but, yeah, I mean, it's sort of a tricky exercise because, like, when you really think about it, you're like, did I really need a hash table there? I probably didn't, but you just reach for it all the time.

Starting point is 01:09:25 Okay, but you could go to your Google JavaScript team and probably help them eke out better performance in the Chrome JavaScript engine? Yeah, I mean, potentially. I mean, I'm not going to spend my time on that. Well, if they're listening to this podcast, here's a free idea. Yes. And then explain the dragon. Yeah. This is from a book that, when I was working on the JAX team... So the JAX team is one of the ML infrastructure teams at Google.

Starting point is 01:09:47 That was my most recent team before I left. I'm sorry, what does the JAX team do? Oh, yeah. So the JAX team develops... this is sort of Google's new, more modern version of TensorFlow, or competitor to PyTorch. It's how you write models in Python to run on TPUs. A big part of the JAX team, however, is to say, OK, we have JAX the technical artifact.

Starting point is 01:10:08 Can we help enable users to actually use it really well and get high performance? And so ultimately that became, well, who are the users? It's people writing LLMs. How do you get good performance on LLMs? And so, a really, really strong team, the JAX team at Google, although as with a lot of Brain, people are now elsewhere as well. And so we developed a lot of the different techniques

Starting point is 01:10:30 for how to lay out models efficiently on many chips. And so ultimately, some people at Google, and I contributed after I left Google, wrote this guide called How to Scale Your Model: how to run an LLM as fast as possible. It is sort of the main reference for how to get high performance on TPUs. There is now also a GPU version of this as well. It's a dragon because it's how to train your dragon. I see. Okay. Last question. People might not have thought that there's room for

Starting point is 01:11:01 new chip companies; that might seem unusual or very hard. And you guys, it seems, have a very good approach to that. Where do you think there are other opportunities for companies to be started here in 2026? Where do you think people should be looking for entrepreneurial opportunities or just technical challenges that haven't been properly addressed? More labs, I think, is still interesting. Can we do more on model architecture? That is always interesting. You think we have not fully explored model architecture space? Yeah, I mean, the frontier labs have done a pretty good job of exploring it,

Starting point is 01:11:32 but I think, I mean, as the hardware changes, the shape of the model should change, for sure. Yeah, okay, and presumably you're not thinking, like, yet another frontier lab pursuing the same architecture. You think there's probably an off-the-wall-looking architecture that will actually make a lot of... Yeah, I think a little bit off the wall. Okay, for sure. Yeah, yeah.

Starting point is 01:11:53 Do you have a specific architecture in mind? My mentality is always sticking within the Transformer family, but what are the constraints that are currently imposed that you could lift? Yeah, yeah. So, for example, one of the things is there's this idea, when you're doing Transformer inference, you do prefill, that is sort of processing what the user said to you,

Starting point is 01:12:14 and then there's decode, which is generating the response to that. And those are totally different in pretty much every aspect of how they actually run. One runs a step at a time, the other one runs really in parallel. So there is this somewhat artificial constraint today that those are the same model that's doing both. Maybe lift that constraint. Another example would be there's this idea, I mean, this is a more fundamental constraint, that you have to train the same model as you serve. But again, training is very different from serving.
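
The difference between prefill and decode can be made concrete with a rough roofline-style estimate. This sketch makes strong simplifying assumptions: a dense model where each weight does one multiply-add per token, weight traffic as the only memory traffic (no KV cache or attention cost), and a hypothetical chip whose peak compute and memory bandwidth are supplied by the caller. The function names are illustrative.

```rust
/// FLOPs per byte of weight traffic for one forward pass: each weight of
/// `bytes_per_weight` bytes does one multiply-add (2 FLOPs) for every
/// token processed in the pass.
pub fn arithmetic_intensity(tokens_per_pass: f64, bytes_per_weight: f64) -> f64 {
    2.0 * tokens_per_pass / bytes_per_weight
}

/// Simple roofline test: a workload is compute bound when its intensity
/// is at least the chip's ratio of peak FLOP/s to memory bytes/s.
pub fn is_compute_bound(intensity: f64, peak_flops: f64, mem_bytes_per_s: f64) -> bool {
    intensity >= peak_flops / mem_bytes_per_s
}
```

On a hypothetical chip with 1e15 FLOP/s and 1e12 bytes/s of memory bandwidth, a 4096-token prefill with 2-byte weights has an intensity of 4096 FLOPs per byte and is compute bound, while decoding a single token has an intensity of 1 and is memory-bandwidth bound. Batching many users' decode steps together raises the decode figure, which this sketch does not model.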

Starting point is 01:12:45 At training, it's very compute intensive. At serving, it's more memory bandwidth intensive. And so maybe is there a way you can make a model that when you use it at inference time, it increases the amount of compute it does to use some of the available resources. Yeah, it makes sense. We're running. Thank you. Pleasure.
