Computer Architecture Podcast - Ep 11: Future of AI Computing and How to Build & Nurture Hardware Teams with Jim Keller, Tenstorrent
Episode Date: February 14, 2023
Jim Keller is the CTO of Tenstorrent, and a veteran computer architect. Prior to Tenstorrent, he has held roles of Senior Vice President at Intel, Vice President of Autopilot at Tesla, Vice President and Chief Architect at AMD, and at PA Semi, which was acquired by Apple. Jim has led multiple successful silicon designs over the decades, from the DEC Alpha processors, to AMD K7/K8/K12, HyperTransport and the AMD Zen family, the Apple A4/A5 processors, and Tesla's self-driving car chip.
Transcript
Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to
cutting-edge work in computer architecture and the remarkable people behind it.
We are your hosts.
I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Today we have with us Jim Keller, who is the CTO of Tenstorrent and a veteran computer architect.
Prior to Tenstorrent, he has held roles of Senior Vice President at Intel, Vice President
of Autopilot at Tesla, Vice President and Chief Architect at AMD, and at PA Semi, which was acquired by Apple.
Jim has led several successful silicon designs over the decades, from the DEC Alpha processors to AMD K7, K8, and K12, HyperTransport and AMD Zen family, the Apple A4 and A5 processors, and Tesla's self-driving
car chip. Today, he's here to talk to us about the future of AI computing and how to build and
nurture hardware teams. A quick disclaimer that all views shared on the show are the opinions
of individuals and do not reflect the views of the organizations they work for. Jim, welcome to the podcast. We're so thrilled to have you here with us today.
Thanks. Good to be here. Yeah, we're so thrilled. Long-time listeners will know that our first
question is, what's getting you up in the morning these days? I was going to say, I thought you were
going to say, what's keeping you up at night? Well, I had a literal keep me up at night. I was just in India
for a week. We opened up a design team there and then I met the IT minister for compute.
India has an initiative to promote RISC-V high performance servers and India-based design. So
I've been talking to those guys about it.
Literally was up very early in the morning. So that's one thing. I think the AI and modern tools and a few other things are causing change faster than anybody really thinks about.
The tools are changing, the design point's changing, the code's changing,
and like how do you build and design computers and software so you can go faster and use those tools,
you know, top to bottom. Like we went from, you know, custom design to design using CAD tools
to like SOC designs where you have
multiple IP components and then you put those together.
And now the design complexity keeps going up.
Moore's law gives you more transistors,
but you still want to make progress at a good rate.
And then how do you do all that together?
And so like that's,
and then that opens you up to applications, and then AI applications are really crazy.
And I've been learning a lot about it.
So you think at some point things would slow down, but the opposite's happening.
Things are actually happening faster.
Although I tend to wake up more in the middle of the night thinking about things.
Me too, actually.
I'm a 2 or 3 AM-er myself. So one thing you said I thought was really interesting, which is about tools. And I know
you've talked about this in some of your other interviews and stuff. Because it seems like
everything's changing faster than you can kind of accommodate for. And then in order to build new systems with all the changing technologies, you need the tools to change with it. But then it's kind of like a conundrum because
you need to change the tools on top of technology that's changing faster and faster. So when I was
young, I thought being a computer architect was great because this ground you stand on is always
changing, which means that nothing stays stagnant and you can always kind of innovate and do new things. But now it's almost like the ground is changing and one day it's lava and one day it's like ice cold and you
have to change your shoes and nobody's designed the shoes yet. One day it's lava, one day it's
ice. That's a pretty good metaphor. I like that.
Oh, thanks. So how do you accommodate and deal with all this change, when the tools that you would want to reason about the change and help you make these designs are themselves changing? Particularly as we shift towards these really specialized designs, and the AI algorithms themselves are changing very fast. Everything is changing very fast. So how do you cope with that?
Yeah. So, well, first, one requirement, and this is hard on big companies with lots of legacy, is you need new designs. I've said every five years you need to write stuff from scratch, period. And there's a big reason for that: because no matter how good your old thing is, as you add little improvements and patches, it slowly becomes tangled together. A friend of mine sent me this paper titled A Big Ball of Mud. And it's a really
old school website with a picture of a big ball of mud on the top. And it talks about no matter
how carefully you architect hardware or software, you have nice clean components, well-defined interfaces
or APIs.
Over time, like this piece of software will learn about this piece,
and this will do something because of that.
And somebody will communicate in a meeting,
and they'll figure out a clever way to make something faster,
and pretty soon it's all tied together.
So you need to have something new.
We're building a RISC-V processor, a fairly high-end one.
We spend a lot of time up front architecting it so that each subsection of the computer
is really modular and has really clean interfaces
so that, you know, the mission,
I told him it would be unbelievably great
if we found 90, 95, maybe 99% of the bugs
at component level instead of at the CPU integration level.
And like SOCs went through that transition.
Like if you buy high quality IP from IP vendors
and you make a chip,
you don't really expect to find any bugs in the IP.
So, and, but if you're a company with lots of legacy IP
and some of the IP was created by breaking up a more complicated thing, and you never cleaned it up, you might find a large percentage of your bugs when you put pieces together.
And you have to fix that.
And so new design gives you an opportunity to say, I'm going to go redesign this and make it clean at the modular
level.
And when I worked on Zen, some of the verification team came to me and basically told me, like,
Jim, we really want to test all the units with really extensive test benches.
And the old school way of thinking was, oh, yeah, sure, recreate the whole design twice.
You know, you have a load store unit, and now you have to make
a thing to test the load store unit, which looks a lot like the rest of the computer, right?
But they were right, because actually the code to test the load store unit is actually simpler
than the rest of the computer. And if you put 10 units in a row together, you know, fetch, instruction, iCache, decode, rename, schedule, integer, execute,
load store, your ability from a program to manipulate the load store unit is tough because
you're looking through five layers of really complex state machines, right? Whereas if you
want to make the load store unit do all its things right from the load store pins, you can do that.
So I was sort of getting there, but the verification engineers made the case it would be way easier for them to write more code at the test bench level and have modular stealthy tests.
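To make that trade-off concrete, here is a minimal Python sketch of the idea, with an invented toy load/store unit and test (not any real team's verification code): driving the unit straight from its own interface is much simpler than reaching it through five layers of pipeline state machines.

```python
# Toy illustration: testing a load/store unit directly at its interface
# versus reaching it through the whole pipeline. All names are hypothetical.

class ToyLoadStoreUnit:
    """Stand-in for an LSU model: a tiny memory with store-to-load forwarding."""
    def __init__(self):
        self.mem = {}
        self.store_buffer = []          # pending (addr, data) stores

    def store(self, addr, data):
        self.store_buffer.append((addr, data))

    def load(self, addr):
        # Forward from the youngest matching pending store, else read memory.
        for a, d in reversed(self.store_buffer):
            if a == addr:
                return d
        return self.mem.get(addr, 0)

    def drain(self):
        for a, d in self.store_buffer:
            self.mem[a] = d
        self.store_buffer.clear()


def test_forwarding_at_unit_level():
    # Direct, component-level test: we poke the LSU's own interface.
    # No fetch, decode, rename, or scheduler state machines in the way.
    lsu = ToyLoadStoreUnit()
    lsu.store(0x40, 7)
    assert lsu.load(0x40) == 7      # must forward the pending store
    lsu.drain()
    assert lsu.load(0x40) == 7      # and still see it after drain


if __name__ == "__main__":
    test_forwarding_at_unit_level()
    print("unit-level LSU checks passed")
```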
And then as soon as you think that, then you say, well, why does the interface between these two units have 100 randomly named signals?
Like if you've done detailed computer architecture, you know, there's a signal called stage four fetch valid, except when, you know, something happens, right?
That's not a verifiable signal.
And computers are full of that stuff, for timing reasons, for no good reason, or because they had to fix a bug: oh no, this unit needs to know that this one is in a state, right? Whereas, you know, computers have well-defined interfaces: memory, fetch, fill, exception, kill, stall. So there's a really interesting challenge of, like, how do you build a computer with well-defined interfaces. And so your question, I made this comment in a talk I did: whenever something's hard, it's because there's too much complicated stuff in one place. Like you have to break things down.
And sometimes the problem isn't whether the tools are there or not,
but that you've tried to solve something, you know, in too many ways.
Like you have an RTL problem, a timing problem, a physical design problem,
and it gets to be too much.
And you have to really figure out how to break that down
and make it simple enough to do. And, you know, verifiable design, verifiable interfaces, you know, architecture, thinking a little bit differently, is important. So yeah, I've thought a lot about that. And then you can see it, you know, like some projects you have a lot of success, and it's partly because you really took the time to architect the pieces up front and make them very independent and really clean, and you had discipline not to slowly turn it into a ball of mud, which is a natural human tendency, apparently.
Yeah, so that's super interesting. I want to follow up a little bit on this ball of mud question, because two things come to mind for that. So early on in your answer, you said something about how it's hard for large
companies with a lot of legacy to do this. And yet we do have a lot of large companies who have
stayed alive for a long time. So I sort of wonder, you know, when I was a young engineer, I was like,
how does anything work? I didn't understand how anything could possibly work. And then secondarily,
this whole thing about, you know, I've seen the kind of signals where you've got like a signal that's called one hot selector for this thing in the
front half of the cycle and the other one hot selector for the back half, you know, it's just,
it's just a total mess. And then we've got, we do get these students coming out of schools and,
and maybe some of them have never written RTL in their life. They learn all their computer
architecture from, from reading, you know, boxes
and arrows and stuff. So how, how do you then form a team where then you do have the discipline to
avoid this ball of mud where it's just like, okay, no, we're going to name these things, right?
There's going to be a reason for this signal that the signal is going to be, you know, eight bits
wide, and we're going to enumerate every single one of those eight bits with a proper name and
proper state. Like, how do you push that out from where you are?
Yeah, so there's a couple of things there. So one is, you know, like 100 of the Fortune 100 companies from 100 years ago are gone. Like, GE is still around, but in a completely different form. So companies do go through life cycles, and almost all of them disappear over time.
And some get propelled for monopoly reasons or infrastructure reasons or something.
So success today does not guarantee success, although the time of that is longer than you think.
Most companies don't fail in five years.
They fail in 25 or 50.
So that's a thing.
And then Steve Jobs famously said, so you have some new product
and then you make it a lot better and then you refine it. But to get to the next level, you have
to make another new product. And the problem is, is the new product isn't as good as the refined
old product, but you can't make the old product any better.
That's the best rotary phone that will ever be.
The push button phone is, it doesn't feel as good.
It doesn't look as good.
And you have to have the courage to jump off the refined high spot
to a lower spot that has headroom.
Like this is a quote from one of his random talks.
So I recommend people go like some
of Steve Jobs' keynotes were great. Like I watched a bunch of them and he was very clear about what
it means to design a product and then to believe in where you're going. So you have to, and it's
really hard for marketing and salespeople in a big company. So you go, hey, we've got this great
new product. It's 10% worse than what we have today.
But over the next five years, it's going to be twice as good.
And they're all like, well, we'll wait five years.
But then you're not working on the right thing.
So you have to do that.
So that's one.
The other is I had this funny talk with a famous big company who was providing IP.
And they had this interface,
which was a lot of wires and a lot of random stuff. And I told them, the interface is too
complicated. And they'd go, yeah, but it's really, it's a small number of gates. And I said,
no, you don't get it. The wires are expensive. The gates are free.
So one thing you do is instead of saying,
I got this lean and mean state machine
and I export all the wires,
is you take that and you put it into a box
and you turn it into an interface.
And you trade off,
at the time I thought we traded off gates for wires.
Add a bunch of gates because they're cheap, Moore's law gives you lots of gates, and have fewer wires.
But a better way to say it is we trade off
technology for complexity.
Like if you go look at a DDR memory controller,
for example, on an AXI bus.
So a typical IP you can buy from
three or four or five people.
AXI transactions are very simple. There's like
15 commands, and you mostly use read and write. So at the controller, you say, I want to read 32 bytes of data at this address. That's a really simple interface. Inside the controller,
there's a memory queue. You might have 32 or 64 transactions.
There's a pending write buffer.
There's a little state machine that knows the current state of the DDR channel.
Maybe there's two DDR channels.
There's two DIMMs per channel.
DDR DRAMs are really complicated widgets because they've got a read cycle, a RAS cycle, a CAS cycle, a refresh time. They're in different states. But the read transaction, it's really easy. Read, address. Write, address, data. You don't know anything about the state of them. Now, if you build a high-performance system, you say the CPU is going to be optimized, and we're going to send read commands to the DRAMs, and we know that we're going to have to sequence a read command, so we're going to hit it with this RAS early to get the DRAM.
So you can export the complexity of that. And then the CPU knows exactly on cycle 167 that
there's the first piece of data that's going to come out. We used to build CPUs that would wrap the return read transactions.
So you got the requested word first.
Like we had all kinds of complexity.
But nowadays, the transaction is really simple.
Read or write at the memory controller, period.
The data comes out at a random time.
It always comes out in the same order.
You don't export the complexity of the CPU to the memory controller.
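For illustration only, here is a tiny Python sketch of that contract, with invented names and fake timing: the requester sees just read and write, and the RAS/refresh-style sequencing stays buried inside the controller.

```python
# Hypothetical sketch of the idea above: the requester sees only
# "read(addr)" and "write(addr, data)"; the row-activation and state-machine
# details stay inside the controller. Names and timings are invented.

from collections import deque

class ToyDDRController:
    def __init__(self):
        self.pending = deque()          # simple transaction queue
        self.open_row = None            # state the requester never sees
        self.mem = {}

    def read(self, addr):
        self.pending.append(("RD", addr, None))

    def write(self, addr, data):
        self.pending.append(("WR", addr, data))

    def _activate(self, row):
        # RAS/precharge-style sequencing hidden behind the simple interface.
        if self.open_row != row:
            self.open_row = row

    def tick(self):
        """Service one transaction; returns (addr, data) for completed reads."""
        if not self.pending:
            return None
        op, addr, data = self.pending.popleft()
        self._activate(addr >> 10)      # pretend 1 KiB rows
        if op == "WR":
            self.mem[addr] = data
            return None
        return (addr, self.mem.get(addr, 0))   # data returns "whenever"

# The requester never learns which cycle the data shows up on; it just
# gets completions in order, which is the whole point of the contract.
ctrl = ToyDDRController()
ctrl.write(0x1000, 42)
ctrl.read(0x1000)
print([ctrl.tick(), ctrl.tick()])      # [None, (4096, 42)]
```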
Now, partly it's for a good reason.
The CPUs have really big caches and really good prefetchers.
They're running at three, five gigahertz.
And the memory controller latency is 150 nanoseconds.
Like wrapping that transaction saves 0.2 nanoseconds out of 100. It's dumb complexity, right? So you look at your design and go, well, how do I get the complexity of an interface to be that simple, right? And then there's a funny one, which is people always want to say,
well, why don't we have an industry standard for cache coherency?
Well, cache coherency is a distributed state machine.
So now you're saying, well, the people at Qualcomm
are going to have the same spec as ARM as somebody else.
And that's a hard thing to do.
Whereas specs like AXI,
there's a bunch of specs that are pretty commonly used that are
really simple. And when you make them simple enough, then many people can use them. PCI Express
is simple enough. Ethernet is mostly simple enough. So you see these things that are common,
standards work on them. Like the first version of InfiniBand tried to optimize latency and they had a 1300 page spec
and nobody could build a device to the InfiniBand spec. So they went through some soul searching and
radically simplified it and focused on being a generation ahead on PHYs and having a simple,
but good enough RDMA and some other things. And then that became a product that people could use
somewhat successfully. How do you avoid complexity?
Well, at the top, you have to decide it's a real goal.
And then you're going to spend something on it.
I'm going to put an extra 100,000 gates in each interface so that the interface is simple.
And in a computer with 200,000 gates, that's crazy.
In a computer with 100 million gates, that's genius.
Like there's a different calculation.
Yeah, that was a fascinating set of insights.
I think starting from the top, you talked about the need for new designs
and the need for us to sort of revamp things from scratch like every five years.
And I guess some of the ingredients for how do you enable doing that?
One part of it is, of course, modular, clean interfaces, but also the discipline of ensuring
that you have these interfaces that are simple, not too complex at various layers of the stack.
Maybe I can double click on your experiences of doing this in the AI world, because that's one of the places where there's a lot of need for this clearly because compute demands are growing unabated.
At the same time, there seems to be a desire or a willingness to try out and experiment with new ideas. One could argue that at certain layers of the stack, we have seen some amount of abstractions forming; for example, you know, matrix operations, or tensor operations more generally, have been the bread and
butter for deep neural networks in this coming era. Do you see that philosophy and that perspective
sort of trickling up and down the stack? Because there is the operators themselves, but on the
software side, once again, there's a lot of complexity. Once you push down into the hardware
side, as you said, you know, you're still designing interfaces with 100 wires
for something that is semantically just a read and a write.
Yeah, so this is a really good one.
And I'd say in AI, we haven't figured out
what the hardware-software contract is yet.
And I'll give you an example.
So in the CPU world, and this is not quite true, but this is close to true, software does arbitrarily complicated things.
Like if you go look at a virtual machine, JavaScript engine, like it's amazing, right?
There's really complicated things.
And then I grew up, like when I learned to program, I programmed an assembly.
Like I used to know all the opcodes for a 6502 and an 8086, and
most of them for the VAX.
And then I learned C programming.
C programming is great because it's an assembly language.
It's a high-level assembly language at some level.
As an architect, you write C code, and you can see what the instructions will be generated
and mostly see how it's going to execute.
It's pretty simple. But the actual contract for a modern computer is
operations happen on registers, period.
You put data in registers,
and then you do adds and subtracts and multiplies on them,
and you can branch on them.
And then from a programmer's point of view,
there's a memory model where you load things basically in order.
Like if you load address A and then you load it again, you never get an older value after you get a newer value. So it looks like you have ordered loads, and then you mostly have ordered stores. And there's weak ordering models, but they mostly don't work, because you have to put barriers in to fix it. So basically, data lives in memory, you load it with relatively ordered loads, you do operations on registers, and then you store the data out with ordered stores. Right. And then there's a page model,
paging model, a privilege model, a security model. And then, but those are orthogonal to the execution model.
And so underneath that simple model, you can build an out-of-order computer.
And it took us 20 years to figure out how to do that.
Rule number one is you don't violate the execution model that the software people see.
So VLIW failed because they tried to violate the model.
Weak ordering kind of fails because it violates the model. Like anybody, like people who did radical recompiles to get performance, you know, with a simple engine failed.
Like out of order execution is really wild because the instructions issue wildly out of order, but they don't violate that model.
And to achieve that, we built register renaming and something called a kill.
So you execute a bunch of instructions out of order
and some event happens,
and then you flush all the younger instructions
from the kill point,
and you finish the older ones in order, right?
We have massive branch predictors, data prefetchers,
but no matter what you do,
you don't violate the contract on execution.
And that means that the software programmers
do not have to be microarchitects, right?
As soon as you ask them to be microarchitects, you failed.
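A toy Python sketch of that renaming-and-kill idea, with hypothetical names (not any real microarchitecture): instructions can complete out of order, a kill flushes everything younger, and retirement stays in program order, so the software contract holds.

```python
# A toy reorder-buffer sketch of the "kill" idea described above:
# instructions may finish out of order, but retirement is in order, and a
# kill at some point flushes everything younger. Purely illustrative.

class ToyROB:
    def __init__(self):
        self.entries = []               # dicts in program order

    def issue(self, tag):
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):            # can happen in any order
        for e in self.entries:
            if e["tag"] == tag:
                e["done"] = True

    def kill_after(self, tag):
        """Branch mispredict/exception at `tag`: drop all younger entries."""
        idx = next(i for i, e in enumerate(self.entries) if e["tag"] == tag)
        self.entries = self.entries[: idx + 1]

    def retire(self):
        """Retire from the head only, in order, so software never sees OoO."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.pop(0)["tag"])
        return retired


rob = ToyROB()
for t in ["i1", "i2", "i3", "i4"]:
    rob.issue(t)
rob.complete("i3")                      # finishes early, out of order
rob.complete("i1")
rob.kill_after("i2")                    # i3, i4 are flushed
rob.complete("i2")
print(rob.retire())                     # ['i1', 'i2'] -- in order, no i3/i4
```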
Itanium had like eight barriers.
Nobody knew what they were for.
Like, I was at Digital when we built Alpha. We had a memory barrier, and then we added a... because we had weak memory ordering, we violated the execution contract and we broke all the software. So we had a memory barrier, but they didn't know where to put it in. The operating system had a double MB macro because they didn't know where to put it in. So two of them in some places seemed to fix some random bugs.
Like, I'm not kidding.
Now, we added a write memory barrier, which we thought would make things better. And it made it worse, because they just put the write memory barrier in the memory barrier macro because they didn't know what to do with it.
Right.
So it was like a worst case scenario.
So now look at AI software.
So AI software has been developed mostly by programmers.
And programmers understand the execution model pretty well. Data lives in memory. You declare
variables, which gives you a piece of memory, or you do something like malloc and free, which is
some kind of memory allocator on top of our memory model. But generally speaking, when you're in a program, you don't talk about variables as addresses; they have names. And generally, to do an operation, you know, you say A will be B times C. Implicit in that is the load of A, B, and C. They go in the registers. You do operates on them and send them back.
And GPUs today are sort of executing that model.
Like you have lots of very fast HBM DRAM, and all that data's in memory. For every matrix multiply, the data is in memory.
You load it in the registers.
You do the operations.
You write it back out again.
So you're constantly writing in and out data.
That makes sense. Now,
at Tenstorrent, we believe that when you write that program, and you can see it very clearly in all
the descriptions of AI, that program actually defines a data flow graph. So if you go Google
Transformers or ResNet, you'll probably get a picture, right? And the picture will be a graph.
And the graph says there's an operation box
where data goes into it, something happens,
and then something flows out of it.
And then generally, they call the inputs activations, and the local data for that operation, weights.
And the number of operations they do is actually quite small.
Matrix multiply, convolution,
some version of ReLU, GLU, Softmax,
and then a variety of what they call
tensor modifications,
where you shrink or blow up the matrix,
you pivot it, you transpose it,
you convert 2D to 3D.
There's a bunch of tensor modifications,
but the number of operators in that space is low. And then people are stylized on things like, you know,
how do you exactly program ReLU? Like there's some implementation methods. So it's interesting
that the programmers are writing code in PyTorch to a programming model that looks like standard programming.
They're describing what they're doing in terms of graphs, because that's a nice way to think about it. But the code itself we see is a mix. So the challenge is like, how do we come up with
a programming model that we all believe in and understand that can go fast and not have to do
things like read and write all the data to memory all the time, because some operator expects data to be in memory and that's the only good way to do it.
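As a rough illustration of the graph view he describes, here is a small Python/NumPy sketch with invented names: each node is an operator, and intermediates flow from producer to consumer rather than being written back to a big memory after every operation.

```python
# A hand-wavy sketch of the dataflow-graph view described above: each node
# is an operator, edges carry intermediate tensors, and intermediates are
# handed producer-to-consumer instead of round-tripping through a big memory.
# NumPy stands in for the actual math; names are invented.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# name -> (op, inputs); a tiny static graph: out = relu(x @ w1) @ w2
GRAPH = {
    "mm1": (np.matmul, ["x", "w1"]),
    "act": (relu,      ["mm1"]),
    "mm2": (np.matmul, ["act", "w2"]),
}

def run(graph, feeds):
    # Each intermediate is produced once and consumed by later ops; nothing
    # here forces a round trip through external memory between operators,
    # which is the property a graph compiler can exploit.
    values = dict(feeds)
    for name, (op, ins) in graph.items():     # already in topological order
        values[name] = op(*[values[i] for i in ins])
    return values["mm2"]

x  = np.ones((4, 8))
w1 = np.ones((8, 16))
w2 = np.ones((16, 2))
print(run(GRAPH, {"x": x, "w1": w1, "w2": w2}).shape)   # (4, 2)
```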
And then I've talked to AI programmers who are like,
I'd happily recode that to make it twice as fast.
That's one view.
And the other view is I really don't care because all the upside in this is either size,
bigger weights, more parameters.
And the hardware is going to make it faster.
And in the short run, I'll just buy more processors.
So there's a really interesting dynamic about this.
And it sort of feels like, you know, when we first started building out-of-order processors... I guess I started working on it in 95. It had been around for a while; the IBM 360 did out-of-order execution. The 360-91, I think,
it was amazing. But when I was at Digital, there was a debate about whether you could actually make
an out-of-order computer work. And there were the competing ideas, where, you know, there was superscalar, VLIW, out-of-order, and then little window, big window. There's a bunch of ideas about it. And what won, like, clearly, I think, but you know, there's still some people who are debating this, is out-of-order machines with big windows and really well-architected reorder buffers, renaming, and kill interfaces. Like, that works, right? And a really simple programmer model that works. So the interesting thing is, the GPUs, like some people tell me, well, GPUs just work. But NVIDIA has thousands of
people hand-coding low-level libraries. There was a really good academic paper that said, hey, I decided to write matrix multiply, and I wrote it in CUDA the obvious way, and I got five percent of the performance of NVIDIA's library. And then they did the obvious HPC transformations of transpose one of the matrices, sub-block it for the known register file size, and they got 30%. And from 30 to 60%, they start to hack. They know how big the register file is, how many execution units there are, how many threads, how many, you know. And the NVIDIA library has, you know, what they call CUDA ninjas, who are great programmers who know how to make that work. Now, the charm of CUDA is you write a CUDA program, it'll work. The downside is the performance may be randomly off by a factor of a lot. But when you're writing your code, well, why would you write matrix multiply on the GPU?
There's a big library for that.
So they have a program model that works in libraries that mostly solve your needs.
And now you're arbitraging the last 10, 20%.
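The kind of transformation being described, blocking a matmul for an assumed tile size that fits in fast local storage, can be sketched in a few lines of Python/NumPy. This is only a shape-level illustration, not the actual CUDA kernels from the paper he mentions:

```python
# Toy illustration of the "obvious HPC transformation" mentioned above:
# the same matmul written naively and then blocked for an assumed tile
# size that would fit in fast local storage. Pure Python, for shape only.

import numpy as np

def matmul_naive(A, B):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]   # every step reaches for far-away data
    return C

def matmul_blocked(A, B, tile=32):
    # Work on tile x tile sub-blocks so each block of A, B, C gets reused
    # many times while it is "resident" (registers/SRAM on real hardware).
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, p0:p0+tile] @ B[p0:p0+tile, j0:j0+tile]
                )
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(matmul_naive(A, B), matmul_blocked(A, B))
```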
But that computer doesn't look anything like the way the actual AI programmers describe the program, right?
And so that's a really interesting thing.
So we're building a graph compiler.
So our processors are an array of processors, which have some low-level hardware support for data flow.
And there's some interesting methods about how you take big operations and break them up to
small operations and coordinate that. And the charm of it is it gives you much more efficient
processing and less data movement, less reading and writing of memory. There's interesting things.
Think of it like, so you say, if I have the RAM big enough to hold all the data I ever need, it's a big RAM, and that's lots of power. So if you break the RAM into a smaller thing, it's much more efficient per access, but then you might have to go to other RAMs. So there's a trade-off on the RAM size. And then matrix multiply has this curious phenomenon of, for an N by N matrix,
it's N squared data movements for N cubed operations, right? Which is sort of how AI works. Like AI, as you make the operations bigger, you get more computation relative to data movement.
And then there's ways to optimize that further
by breaking the big operations into the right size.
So they're big enough to have a good ratio
of compute to data movement,
but small enough to be efficient on local memory access.
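Here is the quick arithmetic behind that ratio, as a small Python sketch with assumed fp16-sized elements: compute grows as N cubed while data movement grows as N squared, so bigger blocks give more compute per byte moved, up to whatever fits in local memory.

```python
# Quick arithmetic behind the N^2-data-for-N^3-ops point above: for an
# N x N x N matmul block, compute grows as 2*N^3 FLOPs while the data
# moved is roughly 3*N^2 elements (read A, read B, write C once).

def block_stats(n, bytes_per_elem=2):            # assume fp16-ish elements
    flops = 2 * n ** 3                           # multiply + add per element
    bytes_moved = 3 * n ** 2 * bytes_per_elem
    return flops, bytes_moved, flops / bytes_moved

for n in (32, 128, 512):
    flops, moved, intensity = block_stats(n)
    print(f"N={n:4d}: {flops:>12,} FLOPs, {moved:>10,} bytes, "
          f"{intensity:6.1f} FLOPs/byte")
# N=  32:       65,536 FLOPs,      6,144 bytes,   10.7 FLOPs/byte
# N= 128:    4,194,304 FLOPs,     98,304 bytes,   42.7 FLOPs/byte
# N= 512:  268,435,456 FLOPs,  1,572,864 bytes,  170.7 FLOPs/byte
```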
But it's very much like you can see all the AI startups
are taking different approaches at this.
And it's not because people are, you know,
trying to do something different.
It's because there's a real problem there,
which is, you know, how the programs are written,
how they're described, what they look at
are very different things.
And like, it's technically interesting.
And I think the solution will be much better than,
oh, we'll just keep scaling faster memories forever.
Like that doesn't seem like the right approach.
Yeah, I think that's a fascinating set of points.
I do want to expand a little more on the AI hardware software contract or execution model
that we know of in the hardware software realm typically.
So one of the attributes, at least of the state-of-the-art models today is like they
require a lot of scale.
Like you have chips, they're interconnected together, and you scale them out to really, really large systems. I wanted to get your perspective on...
Well, actually, they're really small systems. So I think you have your metric wrong. So the human brain seems to be intelligent, and people estimate it at 10 to the 18th to 10 to the 22nd operations, depending on who you ask, right? So a GPU is currently, you know, 10 to the 15th operations a second. So it's
off by something like six orders of magnitude. So we have
a computer about this big, which is an
average intelligent operation computer.
And then to build that today with GPUs
would take, you know, 100,000 GPUs or something, which is like the problem is in the GPU side.
So we say that's big, but it's not that big. Well, big compared to what, right? That's the
funny part. Like it used to be, it took a really big computer to run a simple Fortran program. You could say
that it was a big computer, but now that computer fits on a half a millimeter of silicon. The
Fortran computer of the 70s is 0.1 square millimeters. Moore's law fixed it. Size is
a relative thing. Today, yes, to build a big training machine, you put together a thousand GPUs and it feels really big.
And the parameter count is like, you know, 30 billion parameters.
And there's, you know, a petabyte of training data.
And we go around going, those numbers are so big.
Here's a funny number.
So a transistor is a thousand by a thousand by a thousand atoms.
A hundred nanometers.
Just think about like a seven nanometer transistor,
they call it seven,
but it's about a hundred by a hundred by a hundred nanometers
and a hundred nanometers is about a thousand atoms.
So that's a billion atoms.
And we use that to resolve a one and a zero.
Now, we resolve a one and a zero at about a gigahertz, which is pretty cool. So it's a billion ones and zeros per second out of a billion atoms. So is a billion a big number or a small number? I don't know. Like, the machines look big, but the computer in an iPhone would have been, you know, like 101 computers, which were big in their day.
And now we think of them as a $20 part that, you know, fits in a three-watt envelope. So it's a
relative measure. And AI programs today are big compared to traditional computing, but they're
small compared to actual, you know, most average intelligent people. Yeah, I hear you. I think, I think that's a fair point.
The intent behind the question was more to say you have things where you run
things on a single chip, like traditionally, you know, we had this,
when you're building a chip,
we had a clear execution model and contract and how these chips were hooked
together was a separate problem in some sense,
like the distributed computing realm. If you went to databases, they had their own set of protocols, their own set of
execution models for how database transactions would execute and so on. If you went to the HPC
world, they had a different set of execution models and contracts. For ML or AI in particular,
do you still see that we can have this separation? Do you think that there's a need for a more
unified view across the
chip level boundary to the system level boundary as well? Because you have various forms of
parallelism. The fascinating thing about a current, like a thousand chip ML GPU computer,
is first there's an accelerator model. There's a host CPU and an accelerator. Then in the GPU itself, there's a memory-to-memory operator model.
And then that node runs some kind of networking stack between multiple nodes, and then it's coordinated with something like MPI. So you have a memory-to-memory model, an accelerated model,
a networking model, an MPI model.
And so to make it all work, this is before you even run a program.
It's kind of amazing.
And you can look back when processors had FPU accelerators.
The FPU had a driver, right?
So you had to send operands to the FPU and then pull it.
But when the FPU got integrated together, the floating point just became a data type and instruction and a standard programming model.
So the accelerator model occasionally disappears.
As floating point got integrated, there were still vector processors, which were accelerators for vector programs.
And they died essentially because the floating point got fast enough that it was way easier to just have a couple more computers running floating point programs than it was to manage accelerator driver models. So the current software structure is, I would say, somewhat archaic and
complicated, but it's built on well-founded things like GPUs accelerating graphics programs that have
been solid for years. Everybody looks at it and goes, man, there's a lot of memory copies there,
and oh, the programming model for the GPU is too simple. But, you know, that's a 20-year-old model, and networking works, and MPI has been used in HPC for a long time and it's pretty well fleshed out. But the fact that, you know, to run an AI program you need something like four programming models before you even write a PyTorch program, it's kind of amazing. And even the PyTorch doesn't really
comprehend the higher level things. So they're running locally on nodes
underneath some MPI coordinator thing.
It's fairly complicated. Now, if you had a really, really fast computer
that ran AI, those layers would go away.
But we don't have that computer yet.
And, and that's where the, you know, the excitement happens at the, you know,
what's the right way to think about this stuff.
And it feels very much like we're, we're in a transitional place.
And we've been through these before, like the change from in-order computers to superscalar, vector, you know, out-of-order, the VLIW war, all that took like 15 years.
And we're probably in year three of this on AI. Like, it will land. Because it seems like one of the tricky parts
with AI questions is, you know,
there's, like you say, you know,
from the program perspective,
there's a data flow graph of what they're trying to do.
You know, here's this tensor,
then you send these things here.
And then we have this hardware
that we want to build to do it fast.
And then, you know, the NVIDIA solution
is they have this middleware
where they translate that high-level data flow graph
into some really low-level libraries so that they can make sure that it's fast on this particular piece of
hardware. But the question that always seems to come up is like, how big should, you know,
we don't want to have a huge DRAM, as you say, like that can handle all of the memory that's
like in one giant chunk. We don't necessarily want one single matrix multiplier that can handle the
very largest matrix multiplier you could ever imagine. You want that broken up. And so then
how they should get sized,
how they should then communicate with each other,
and then how in the end it all gets, you know, condensed down and sent to maybe some small processor that's actually doing, like, the ReLU or something like that.
Like the question is always coming up about size.
And then that sizing is often really dictated by the current state of the art, which is not going to be the state of the art in like six or eight months.
So you asked a bunch of questions.
So first, AI is like the capabilities are changing really fast.
But the models, there's been a couple, you know, there was obviously AlexNet and then ResNet, which is a huge refinement and an uptick on that.
And then the language models came out with transformers and attention.
And then they had the bitter lesson, like size always beats cleverness.
So there's something interesting about, there's a certain stability of that.
They're obviously learning tweak.
There's a bunch of tweaking going on, like how do you tokenize the data?
How do you map it into a space?
How do you manage your training?
There's a whole bunch of things going on there.
But it's over the last couple of years, that's been somewhat stable.
The transformer paper came out, you know, how many years ago?
Four years ago, right?
And we're building way bigger models that are much refined on top of that. But that stability... so there's a new benchmark every six months, and they're hitting something called benchmark saturation. They say, you know, like, hey, we have this huge set of images, how well does the AI recognize it? And it went from
like 20%, you know, accurate to 50 to 80 to 90 to 97. And all of a sudden, those benchmarks are
saturatable, right? At 100%, you're done. Whereas a lot of CPU benchmarks are how many floating point operations a second can you do, and twice as many is always better. So some of those things, like, there's a bunch of natural language tests and math tests, those are saturatable benchmarks because you can get all the answers right. And so they've been in this kind of churn where these benchmarks are going to be good for five years and then they saturate in one. So that's a funny thing.
But let's talk about size.
So at the high end, our sizes are large compared to our technology, but small compared to the need, I would say.
That's one thing.
And then let's differentiate memory capacity.
So if memory capacity was big because it stored a lot of useful information,
that would be really interesting. But if it's big because it needs a lot of place to store
intermediate operations, that's kind of a drag, right? So architecture models and technology
will move to the point where you don't need memory to store intermediate operations. Like modern server chips, the caches are big enough
that the memory accesses should mostly be
for first time you needed the data.
Not like there's a big working set
and you're reading and writing the DRAMs over and over
to do a matrix multiply.
That would be a drag.
So in that case, the caches should get bigger
and then the matrix multiplier should be structured
so that you can do blocks.
And so that kind of behavior is well understood.
So large memories for holding a trillion useful bits
of information seems like a fine use for a large memory.
Eight terabytes of bandwidth because you need to store
intermediate operations seems kind of crazy. So there's a couple of differentiators you could
make. And then there's the observation that the brain doesn't look anything like a large memory
that you're reading and writing. So you know what a cortical column is? It's, you know, 10,000 or 100,000 neurons organized in a set of layers,
and they're very densely connected together there, and then they talk to each other at relatively low bandwidths. So that looks like an array of processors to me with local storage and distributed
computing and messaging. And it sort of looks like the graphs people say that they're building
when they write AI, which is why architecturally,
an architecture that embodies data flow and knows how to do graphs and knows how to pass
intermediate results instead of having to store them all the time, seems like the natural thing.
Was that clear? Large memories to hold large numbers of things. Yay. Our current memories are small compared to the needs.
Large memories for intermediate results.
That seems like an architectural anomaly.
We've been through this before.
You know, HPC machines, it used to be all about memory bandwidth, right? It used to be memory bandwidth, memory bandwidth, you know, run STREAM, run STREAM, right? So then you got a hundred processors with a hundred megabytes of on-chip cache. And we started to hear less about that, because more and more problems were factored into dense computation and sufficient on-chip storage, and memory starts to be storage for large datasets. Now it's not always true. And there's a bunch of problems
that are very hard to factor that way. There are some interesting things about very large sparse datasets and unpredictable datasets. The HPC guys still
have limits everywhere they look, but it's not as clear cut as it used to be. Show me the DRAM
bandwidth and I'll tell you the performance of the computer. It's more complicated.
So does that mean in some ways, it almost sounded like in terms of the sizing of the structures inside of an ML computational engine, that sounds to me like you feel like that's kind of stabilized and that's relatively solved.
But then we have all these AI startups that are trying to build hardware and the software stacks on top of them.
And you mentioned before, they all have their own different ways of doing it. So there, there is still, you know,
if the structures themselves are largely stable now, because there are some known primitives...
Let me be clear about that. So my point was, it's moving slower than people think. So the results at
the benchmark level, and some of the tweaks and stuff are moving quickly. The current set of
structures have kind of gone through two or three
generations, which are somewhat stable,
but there could be a new structure next year that changes everything.
So I don't think it's reached a plateau of stability, like out-of-order execution has.
Gotcha. Gotcha.
Right. It's still, it's an interrupted equilibrium, right?
I see.
Punctuated equilibrium, let's say.
So like when Pete, Ben, and I,
we worked together on the Tesla chip,
we used to wake up every once in a while and say,
what if the engine we just spent a year building
doesn't work at all for the algorithm
that they come up with tomorrow, right?
Like, that's a real, that's a keep you up,
that's a wake you up at four o'clock in the morning idea.
But it turned out there's always been methods.
They did come up with algorithms that don't work on that engine,
but they found ways to transform the one algorithm
to the execution engine, and they've had success with that.
And they did get a huge power and cost savings by building a really focused engine as opposed to the general one.
So, you know, that was a net win.
So we're in a state of punctuated equilibrium. And how the people write the code,
describe the software
and what the execution engine is,
the fact that those are different
is really curious
and invites, let's say, innovation and thinking.
And the sizes aren't stable
because people are pushing sizes right now
because most things would be better
if they were 10 times bigger.
Like some only asymptotically so, but there are some AI curves that are just still going. Like you
make them 10 times bigger and you're still getting better at a real rate. And that's where I think
there probably will be some really interesting breakthroughs in the next five years
about how information is organized and how to do a better job of
representing essentially meaning and relationships, which is what AI does.
Right. Yeah. Just before we close out this particular theme on the topic of future
breakthroughs and so on, reflecting back on the progress of AI, you talked about a couple of things. One is how graphs, or data flow graphs, seem to be a very good abstraction to sort of express computations and build systems on top of. You mentioned a little bit about architectural anomalies that we should probably fix, like these large intermediate memory
operations and so on. But moving forward, as you look forward to newer breakthroughs in AI,
are there any opinionated bets that you're making at Tenstorrent that you think we should be looking at as a
trend in the future? Well, there's a couple of things. One is, you know, some people observe
this, but when it first came to me, it's like, so you're taught that, you know, AI is, you know,
inference and training. So inference is, you put an input into a trained network, you get a result. And then training is something like, you have some data with an expected
result and you put it in and then you get an error and you back propagate
the error.
Right.
And when somebody explained how they train language models and some image
models, you basically take a sentence and you put a blank in it.
You run it through and guess the blank, which is,
I think is really amazing.
But to do that, you do the forward
calculation, you save everything. And then on the back propagation, you use optimization methods to
look at what you calculated and what you should have calculated and you update the weights,
right? So, so brains clearly do not save all the information that they're doing on the forward pass.
And then there's some cool papers.
There's one called RevNet, which is like a reversible ResNet.
So you don't save the intermediate results.
You recalculate them on the backward pass, which is cool.
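A toy numeric sketch of that reversible idea, in the additive-coupling style RevNet uses (F and G here are arbitrary stand-in functions, not a real network): the backward pass can recover a block's inputs from its outputs, so they don't have to be saved.

```python
# Reversible-block sketch: with additive coupling, the inputs can be
# reconstructed exactly from the outputs, so no activations need saving.

import numpy as np

def F(x): return np.tanh(x * 1.5)
def G(x): return np.tanh(x * 0.5)

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def invert(y1, y2):
    # Recover the inputs from the outputs -- no saved activations needed.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = invert(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
print("inputs recovered from outputs without storing them")
```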
So it seems like there's going to be breakthroughs on how we do training.
And also, when humans think, we don't train all the time.
Like, Ilya at OpenAI said, when you do something really fast, it only goes through six layers of neurons.
You're not thinking it's trained.
That's inference.
Everything you do really fast is inference.
And then the really interesting thing that we humans mostly do is more like generative stuff.
We have some set of inputs.
You go through your inference network.
That generates stuff. And then you look at what it produced and what your inputs are. And then you make a decision
to do it again. And you're not training, you're doing these cycling inference loops with, you
know, that part of your mind is sort of your current stage of understanding, which, you know,
you could say is your input tokens, but it's decorated
with like what you're trying to do, what your goals are, what your history is. And then every
once in a while, as you're thinking about something, you go, that's really good. That's good.
And then you'll train that. So humans have multiple kinds of training. Something exciting happens, and you remember it from your life from one instance, right? So we have a method for training, like doing a flash, remember exactly what happened.
And then we have procedural training, where you do something repetitively and you slowly train yourself to do that in an automatized way. And then we have the thinking part, which is like generative learning, where you're stewing on it, you're trying this, you're trying that, and then you find a pattern that's superior to anything else you've thought of, and then we train that, because you use that as a building block for the next thing. So humans are generative models. And there's a lot of innovation, and they call it prompt engineering and there's all kinds of things. But the structure of it, it's almost like it's not philosophical enough yet to be thinking.
So humans think in terms of we have overall goals, we have moral standards, we have stuff
our parents told us to do.
We have short-term goals, long-term goals.
We have constraints by our friends in society.
And that's also our present when we're doing our daily tasks of whatever we're trying to do, which is mostly not instantaneous inference.
And it's mostly not training.
So I think that's a really interesting phenomenon.
And the fact that these big generative language models are starting to do that, it's really, really curious.
And then thinking about like, how would you build a computer to do that better?
Like that's a really interesting phenomenon.
Yeah. No, speaking of, you know, humans, intelligence goals and building computers,
maybe this is a good segue into the other theme we wanted to talk to you about,
which is, you know, you've been at multiple companies, you've built, led, and sort of nurtured successful teams to deliver multiple projects.
I wanted to get a perspective from you on how do you think about building teams, nurturing them,
growing them, and scaling them, especially from a lens of building, you know, hardware systems
or processors and so on. What have been your key learnings? How do you view this problem?
So you know what the words creative tension mean, right?
Like where you hold opposite ideas in your head and then there's tension
between them.
You know, I want to get ahead, but I want to goof off this afternoon.
It's creative tension, right?
Like everybody does that.
So I partly think, so I'm a computer architect. When I first started managing a big team,
when I went to AMD in, I guess, 2012 or something like that, like I was working at Apple and I had one employee and I wasn't managing them particularly well. And then I was going to
manage 500 people and I grew to 2000 or something. So I realized I could treat organizational
design and people as an architectural problem. I'm a computer architect and people generally
speaking have some function that they're good at and then there's inputs and outputs.
So everybody knows how that works. You write a box with a function, input, output, and one of
your missions as a computer architect is to organize
functional units in a way that you get the results you want. So if you're trying to build a computer,
you need to break down what's the architecture of the organization that solves that problem.
In modern processor designs, there's an architecture group, RTL, physical design, validation. And then people, for probably evolutionary reasons,
operate best in teams of 10 or smaller. There's a manager and there's 10 people.
And a really good team of 10 people will outperform 10 individuals. Humans are designed
to do that. A bad team will underperform. I mean, there's all these jokes about as you add people, productivity goes down.
But if your teams are well-designed and your problem is well-architected, people love to work in teams they like.
You know, five to ten people working together, they're happy to see each other.
Up to like 50 people, like people all know each other pretty well.
You know, at 100, it becomes difficult to know people and you start needing boundaries because humans tend to view strangers as enemies,
no matter what you say. You can be nice about it. That's where a director will manage 100 people.
Directors know each other, but the people in the director's teams
don't need to know each other. There's an organizational dynamic that you need to figure
out. Then there's the people side. Engineers love what they do. That's one of my givens.
Engineering is way too boring and hard to do it every day with, you know, excitement if you didn't really love it.
Like people are willing to do hard, boring things if they like what they're doing.
And people who don't love engineering leave it because it's actually hard and annoying and repetitive.
And there's a bunch of stuff about it.
Like I think about the same problem over and over and over and over.
So people love, you know, so engineers generally like what they're doing.
They have to or they couldn't do it.
And then, but they, you know,
there's this interesting dimension of
they love to do things they own,
but they don't always know what the right thing to do is.
Right?
So you need to have some like hierarchy of goals
and steps to do and processes and methods and the way people interact
and motivate each other, because you're trying to get that creative tension spot between they own it and they're doing the right thing, but they're still following some kind of plan, organizing together. And that's difficult. Does that make sense? So there's creative tension between organizational design, you know, requirements, and then, let's say, the human spirit, which is, you know, people who are excited do 10 times more work than people who think this place sucks and, you know, I'd rather do anything else. So there's a huge swing. And then teams that are working well together create stuff that individuals can't.
And two follow-up questions on that theme, because I think about this a lot.
Yeah, yeah, I'm sure.
Because I think
one thing, you know, as I've transitioned from being a young engineer to a less young engineer, let's call it: that second piece of, like, constructing teams and having a clear sense of what you own
and how you solve your problems and having everybody kind of, you know, autonomous, but
marching towards the same direction is like one of the hardest organizational problems, it seems.
And so I think I saw once you said something like
people don't like to do what they're told to do. They like to do what they're inspired to do. But
one of the things that I've witnessed across, you know, multiple organizations and multiple groups
is just that just getting everybody to feel like, okay, this is what we're doing. We've all agreed
on it. You own this and you own this. I mean, that's like a, one of the hardest parts. So that's
one question. And then the second question was, you said something about groups of 10. How's your feeling about, like, remote work these days? I know they're very different questions, but those are the things that popped up.
Yeah, yeah. So I don't know what to think about remote work, because I'm not a fan. I like to work with people. That said, I've had very successful projects with teams in different places talking to each other. But I've also seen people working remotely on Slack, talking all day long. They got their Zoom chats up and
running. They're talking. It's almost like they're working next to each other. So there's a lot to
figure out about that one. Your first question. So it really helps to be an expert. So I've led
some projects successfully, and I'm a computer architecture expert. I'm not an expert in everything, but I've written CAD tools.
I've architected computers. I've written performance models. I've done transistor level
design. Like I have a lot of capability and then I'm also relatively fearless to ask dumb questions.
So if I'm in a room and people are explaining something... like, young people, please listen to this. If you don't know, ask a question.
If people don't want to tell you the answer, go work somewhere else.
Like, go figure out what's going on.
Somebody filed a complaint on me one time because I was a senior VP.
I asked too many technical questions.
Because they were used to walking in the room with bullshit PowerPoint and bullshitting for an hour about progress.
And on page one, I was like,
well, what the hell is this? What's going on here? Explain, you know, sentence one, word one,
doesn't make any sense to me. Explain it. Nobody could explain it. And so you can imagine word two
wasn't making things better, right? Somebody said, you run fastest when you're running towards
something and away from something. And I am more than happy as a leader to have a vision and lay out what I want and work with people to get there.
But I'm also more than happy to dig into everything and like, does it make sense?
And can you do it?
And you say, well, that doesn't scale.
But apparently it does.
I worked at Apple and Steve Jobs had everybody like on the balls of their feet working on shit.
Because they knew if Steve found out you were screwing around, there'd be hell to pay. Elon does it. I watched him.
You know, he motivated very large numbers of people to be very active, hands-on, technically
ready to answer questions about what they're doing, no bullshit slides.
So you need to have a good goal. You need to factor it into something where
people say, yeah, I get it. I believe it. And I could do that. You need to have competence in the
management structure. On my Zen team, the managers were very competent. They were all
technically good. They were good managers because, you know, people do kind of divide into when you
wake up in the morning, do you think about a technical problem or a people problem? So I'm a technical person. I wake up thinking about a technical
problem, but then I want to solve problems that take lots of people. So I've turned people into
a technical problem. So I read a whole bunch of books on psychology and anthropology and
organizational structure and In Search of Excellence and you name it. And then I came up with a theory about how to do that.
And one of my theories is I like to have managers work for me
who are technically competent, but good people people.
And that helps soften the edges around, let's say me, for example,
or the problem or the company.
Like when an employee does work, they have technical problems in front of them.
They have their organizational problems. Their boss might be a problem. That's a drag.
Like the company might be a problem. Competition might be a problem. You know, it can be tough,
right? So people need somebody to look after them, take care of them, inspire them.
You know, but at the same time, you have to be doing something that's worth doing.
And balancing that out, that's what I mean.
This is a huge space of creative tension.
There are certain leaders that are really hard.
I think they're too hard.
I think life's too hard for a lot of people.
I look for ways to solve organizational and technical challenges
in a way that fits most people.
Ken Olsen at Digital said there's no bad employees, there's only bad employee-job matches. When I was young,
I thought that was stupid. And somewhere around 45, I decided it was a pretty good thought.
If somebody is a good person, there's almost always a job where they can contribute.
Now, if you're in a financial downturn, you have to lay people off.
You lay off people in certain orders.
Like people know that.
But, you know, solving the problem for people
is important because I've seen it turn
into really positive results in the organization.
But there's multiple dimensions.
Like somebody said, well, what's the way to do it?
Well, you know, are your goals clear? Well, a lot of people fail right there. Say the goals are clear. There's this organizational infrastructure called goals, organization, capability, and trust. You have to solve all four of them. Goals clear, they're doable, yay. Does the organization fit? If, you know, the processor is broken into six units, do you have six good leads? And inside each unit is then capability: do you have the technical ability to do it, or can you at least identify the problems, and do you have people who are plausibly able to solve those problems? That capability is a big one. Trust is a complicated one, because it's usually the output of the first three, not the
input. If some manager says, we're going to focus
on trust and execution,
those are the outputs.
In the world of input, function, output, you can't change the output by looking at the output. You can change either the input or the function. The output is the output; it changes when you change one of those two.
So any manager who says, oh, execution, execution, execution: unless they're doing something about
it. Are they doing something about it? Are they training people? Are they hiring for talent? Are
they reviewing people properly?
Do they buy new CAD tools?
Like, what did they do to make execution better?
If all they do is say the word execution, then they're bullshitters.
So you have to solve multiple dimensions.
And you can't just solve one of them.
And there's a bunch of places underneath that where there's multiple dimensions.
And then that's when you start to really see the difference with, you know, great leaders who get projects done. And I've worked with some really great leaders. I'm just amazed, like, the Model 3 got built so fast, with so many people, across so many dimensions. And, you know, Elon was super inspirational and unbelievably good at details, but Doug Field built and staffed a really wide-ranging organization.
And I watched him do it.
I was there, you know,
when we built Autopilot
and drove a car in 18 months.
But compared to, you know,
building Model 3 and shipping it,
that was relatively small potatoes.
So it's really interesting
to look at these things.
And then you have to take them seriously.
And then you have to realize
no matter what, you don't know that much. And then, you know, you have to dig into it. And then
if you're lucky, you find the right people and you get the right place. But yeah, it's a hard
problem. And engineers probably should read way more books than they do. And people always ask me,
well, what three books should I read? And I think, well, I read a thousand. The three books I like
the best, I probably only like because of the other stuff I knew.
So I have a hard time recommending the book to a novice, but reading a lot can help a lot.
Yeah, that's one thing that's always mystified me about some of the higher level leaders that I've worked for.
And they talk about all the books that they read, because in some ways, you know, when you're a junior engineer, you're just like, OK, I'm just going to do my job.
Right. I'm going to write my code. I'm going to do my module. I'm
going to run my unit tests or whatever. And then as you make the transition, it becomes much more
like, okay, there's more than just technical things to getting this stuff done. And then
making that transition to spending your brain power on the other part, and then making sure
that you're spending it appropriately is a little bit of a tough transition because there's a lot of comfort in your boxes and arrows, right?
It should be like this.
But then it's just like, well, how do you get everybody to believe that it should be
like this?
And how do we get everybody to believe it?
And you know, somewhere between 35 and 45, you realize almost all your problems aren't technical.
Yes.
Well, and then unfortunately, it's a little like when you train a language model,
you don't say, what are the 10 sentences I need to train this model? No, more actually helps.
Right. And a lot of times a really great book is only a great book because of the other 100
books you read, because it's the one that brought the ideas together. And if you read it first,
it wouldn't mean anything. So quantity kind of counts. I'm not afraid, like, I frequently read a book and realize: most people who write books have 25 or 50 pages of stuff to tell you, but the editor tells them to write 200 pages because that's what sells. So don't be afraid to read 50 pages and go, I got it, he seems to be repeating himself. And almost all writers, once they start to repeat themselves, don't bury some really good nugget 100 pages later, right? Once they start repeating themselves, they're just repeating themselves.
Because people passionate enough
to write a book about engineering,
management, idea, inspiration, projects,
they pour their heart out until they're done.
And then they fill it out until they get to 200 pages.
So don't be afraid to throw a book out after 50 pages.
Now, I read this book, Against Method.
It's my current favorite book, by Paul Feyerabend.
And the goddamn book was like 300 pages long.
And he just had one idea after another.
I kept waiting for him to start repeating himself so I could put it down because it
was too dense.
But he didn't quit.
He just kept writing all the way through the damn book.
And it was pretty fun.
But yeah, it's a real thing.
So if you start to realize that there's more to work than just engineering technical problems, you've reached the next level, which is good.
And you should solve it because it'll make you happier.
And you'll be more successful if
you do. And you may conclude, that's really cool, and now I'm better at managing the team and I can focus on technical stuff, and now I want to manage bigger teams, or now I want to go into sales, or now I get it, I really should spend more time surfing. Like, it's all fine. But when you reach that point, it's a good thing. It's a tough thing. It's harder than college.
It's harder than college.
Yeah.
Yeah, in college there are answers in the book for the most part.
And then, you know, when you start to work,
you realize that, you know,
if the answer is in the book, you don't get paid that much.
But then when you start to try to solve
this next level problem, there's no answers at all
but there are some solutions, which is good. So maybe this is a good time to wind the clocks back,
and maybe you can tell our audience how you got into computer architecture. How did you achieve the, you know, employee-to-job-match fit? How did you get interested in the field, and how did you eventually get to, you know, your current role? Sort of random, because, in college... like, I basically goofed around in high school. Like, I see kids today studying for SATs, and I still remember, you know, being out with my buddies thinking, I think I have to take the SAT tomorrow? I should have probably not stayed out all night. But I got to college, and I'd done well enough in high school. I liked math and
physics and a few topics. And I went to Penn State, and I was a combination electrical engineering
and philosophy major. But it turns out I can't write at all. And in my sophomore year, the head of the philosophy
department sent me a note, like I think through the electrical engineering department. And he
said, I really wanted to meet you. And I was like, yeah, it's great to meet you too. This is really
wild. I didn't expect, you know, I'd just taken four philosophy classes. He goes, yeah, we noticed
that you're a philosophy major. And then he pulled out like a paper written by a typical
philosophy student for, like, a midterm: 10 pages, you know, nicely written, perfect sentences and everything. And then he had my page, which was like half a paragraph with scratch-outs and words in the margin. And he said, Jim, you're never going to get a philosophy degree at Penn State. He said, we're happy, we like you in class, but we write a lot and you
can't write at all. And I was like, oh my God, really? You're kicking me out of philosophy. I
didn't even know that was a thing. But Penn State was great. We had a two-inch wafer fab. I made wafers. So in college, I thought I was a semiconductor major, you know; my advisor ran that lab. And I learned a lot about
that. And I took a random job to build fiber optic networking controllers because I wanted to live in
Florida near the beach. And while I was there, like, it was a terrible job but a great experience, somebody said I should work at Digital Equipment, and they gave me the 11/70 and VAX 11/780 paper manuals, which I read on the plane to my job interview. And I thought that was really cool, but I went in there and I had a lot of questions. So I met the chief architect of the 11/70 and the 11/780, Bob Stewart. He was a great engineer. I had all these questions,
and he thought I was funny and he hired me as a lark, I think, because he knew I didn't know
anything about computers. I literally told him I just read the book on the plane. I'd taken one
Fortran class in my life and it didn't go well. I spent 15 years at Digital, and that's where I
learned to be a computer architect,
mostly working for Bob Stewart at the start.
But there were some other guys, Doug Clark, Dave Sager.
There were a couple of legendary people there that were really good, and I had the opportunity to work with them, and I was fairly energetic as a kid.
I just jumped into stuff, and I learned a lot.
I slowly learned computer architecture.
Like Pete Bannon and I worked together starting in 1983 or two or something like that.
We worked on the VAX 8800 together, and a couple of follow-on projects, and the second Alpha chip, EV5; I wrote the performance model for that.
And back then, you read papers, you know, sometimes, and then you did hands-on work. It was really good, and I got a chance to go into lots of different things. I wrote a logic simulator, a timing verifier, several performance models. I drew schematics. I wrote a little bit of RTL, weirdly not that much RTL in my life, but some, because that's the main method now. But when I did it, like, I used to know how to do Karnaugh maps and all kinds of screwball stuff; I don't know who does that anymore. Yeah, so Digital Equipment, I was there about 15 years,
worked on some very successful and some unsuccessful products. Our second one, the follow-on to the VAX 8800, was canceled partly for design reasons and partly
political reasons.
And it was super painful to realize.
And then Digital itself went out of business right when I left, and it got sold to Compaq.
So, you know, you get five ex-DECies together with a beer and we'll all start crying in about 30 minutes.
Because it was a great place to work and, you know, a surprising, you know, disaster.
Let's say we were building the world's fastest computers and going out of business at the
same time.
So the combination of, you know, product, market, management, let's say business plan,
like, Digital didn't transition the business plan. And by then it had been captured by the marketing people, to the extent that they thought raising prices on VAXes was a win, just as PCs and workstations came out.
That's one thing I try to sometimes tell young engineers or interns that come in.
That a lot of success in any business does not have to come down to the nuts and bolts of the technical stuff.
And in fact, it often doesn't. And so at the same
time, though, it's really important for the people who are coming up, as you said, to know a lot of
these nuts and bolts, right? So it sounds like in those first 15 years, that's where you collected
a lot of experience and knowledge about the whole process end to end of how to build a good
computer. So we want that. But then at the same time, you know, there's this understanding that, you know, like, Betamax might have been better, but it didn't win. You know, DEC, I think every ex-DEC person I know, like you said, thought it was a wonderful experience, but then was really sad about how it ended. How have you, at this point in your career, been able to sort of bridge where you can have good engineering, build good teams, and then somehow also translate that into a successful product?
Yeah, it's kind of complicated.
So first, like my first job,
I worked with some really smart people out of college,
but everybody hated the company.
And when I went to Digital, everybody loved Digital. Like, a friend of mine's partner said, what do they put in the water? What else do you guys ever do? Like, we'd work at work, and then we'd go out drinking and talk about work, and then go work
on Saturday. When you're young, like there's two really important things. One is, are you working
for somebody you can learn from and make sure you learn, ask questions, try projects, get good feedback,
and also work in an environment where it's interesting. Now, you could be in a really
good group in a failing company and get a great experience. But generally speaking,
companies are happier if you're in a growth market and less happy in a shrinking market.
You can kind of pick and choose, but right now there's some very big growing companies
which have sort of lost the plot on engineers and engineers just get lost. They hire a hundred you can kind of pick and choose, but right now there's some very big growing companies,
which has sort of lost the plot on engineers and engineers just get lost. You know, they hire a hundred interns and they put them on random stuff and they don't learn anything.
So you got to be careful about that. But, you know, a positive environment with somebody you
can really learn from is really important. But then when you get the bigger stuff, it's like,
how do you build a successful product?
Like, I went to AMD knowing that they were a failing company, right? And I thought part of
the fun would be, could we turn it around? And I worked for Rory Read, who was very clear.
He said he didn't realize they were going bankrupt until he got there and he looked at the books.
And then he said, I'll save the company. You guys build the products.
And Raja Koduri and I were the architects of turning around CPUs and graphics.
But we had some really good people.
We had organizational problems.
And, you know, a relatively small number of bad managers and bad people-job fits.
And, you know, we made a bunch of adjustments.
Some were pretty visible and some were subtler.
But I was really working hard on how you get the goals, organization, and capabilities all lined up.
And I believed that trust would come out of that.
And I had a really great consultant working with me, Venkat Rao, who gave me a nonstop stream of books and articles and stuff to read, and brainstormed with me a lot about how to be successful. So I invested in, you know, becoming a manager that could do something
useful. And I went to a company where I knew some people and I knew they had really good people,
and I knew that we could build a good product. But the challenge was all the operational
organizational stuff in the way, and some serious technical problems too. So, yeah, that was a relatively big investment. Whereas Tesla and Apple were, like, weirdly successful companies, and I went there thinking, I don't know how they work, and my job wasn't to change them, I just wanted to learn from them. When I went to Apple, like, I went through three locked doors. And, like, some people say, no, the most important thing is sharing and openness. And Apple's full of silos. And
the most important thing is caring management and leadership. And Steve Jobs was a famously difficult person
and still really successful. And people really love working there. And, you know, even though
it got all the obvious things wrong, you know, like people were hard on each other. You went
through a locked door to your project,
it was siloed.
Steve famously yelled at people
when they didn't do exactly what he wanted
while simultaneously expecting them to be creative
and do something new.
You know, Tesla was very chaotic,
but they're producing cars.
You know, like how does that stuff work?
And then it turns out there's lots of reasons why it works.
And this is where when people are inspired, despite or because of, hard to say, the situation, like wild things happen.
I learned a tremendous amount.
And then I went to work at Intel because I also, well, actually, I joined there partly because I had some ideas about how to build really high-end servers with Intel, some really great technology.
But I spent most of my time working on methodologies, team dynamics, and some basics.
I met a lot of people there.
I had a lot of fun.
I had too much fun working there.
Yeah, it was an interesting set of challenges and stretched me out.
But then, for my next thing, what I wanted: AI is boiling the ocean on computing in a really neat way.
And working with an architecture I believe in, in terms of like, this looks like the right map of what programmers are writing and what they say they're doing in the hardware.
Well, it doesn't mean that that description gives you the right hardware software contract, but there's lots of technical work to do there.
And it's evolving.
And then AI has attracted really smart people.
I meet really smart people all the time.
I met a guy recently doing AI for game engines.
We talked for four hours and it felt like five minutes.
It was really interesting.
So I like that kind of stimulating thing.
And then part of me thinks, well, how's that going to impact
how I do design, how I work with teams, how we work with people, what's going to happen in life? And, you know, it's good.
I feel lucky to be able to meet with people like that and talk to them, but, you know, I've done a lot of work to get there. You know, I work hard, I read a lot of books, I work on projects, I've sweated through difficult times with both technical problems and people problems. So engineers do the work because they love it, not because it's easy or particularly, you know, short-term rewarding. Engineering is not a good place for short-term rewards, but it's relatively satisfying. You know the difference between happiness and satisfaction? This is a funny thing, because there were all these studies they always published,
like, people who have kids are less happy than people who don't.
And 100% of parents, well, not 100%, but, you know, 100% of good parents say the best
thing they ever did in their life was raise children.
Like, my dad told me that.
I was like, really?
You worked all the time.
But happiness is what happens today, and satisfaction is the successful project over time.
Engineering is much more of a satisfaction thing than a happiness thing.
Humans have two reward systems, a slow one and a fast one.
Am I hungry?
I need food today.
Yeah, I got it.
I'm happy.
Did I survive the year?
Did my children survive childhood?
Those are satisfaction dimensions.
Engineering is way more oriented to that,
although it is fun to get your model to compile or a test to run
or you solve a technical problem or file a patent.
There's a bunch of short-term happiness,
but mostly it's a long-term reward.
So maybe on that note, you can provide some words of wisdom
to our listeners interested in computer
architecture, interested in AI, interested in building their careers, perhaps in the likeness of yours.
Well, like I said, work some.
So for people coming out of college or interns or stuff, try to find a place where people
are doing real work, real hands-on work, and they're relatively excited about it.
Like if you're, you know, when you're young, you should be working a lot of hours.
You know, you can't get where you want at 40 hours a week, 50, 60.
You know, some people say they work 80, but mostly they're screwing around part of that time.
But something where you really feel like it's easy to work hard, it's easy to put in time, do real hands-on work, and make sure you have at least a few people that you really respect.
They seem to know a lot.
They teach you stuff.
They take the time.
And then work on a couple different things.
I know people that worked in one group for 10 years, and they really loved the group.
But working in multiple projects
in different groups is really useful over time.
It doesn't mean you leave in the middle of a project,
but periodically you find something new that's challenging,
that takes you back.
Some people are really worried, like, I'm at this level here,
but if I go to that project, I'll be here.
And the answer to that is, great, do it.
Go somewhere where you have to start over. You know, at Tesla at one point I was walking along shelves looking for visors, you know, sun visors for a Model X. Like, it's a ridiculous job for me, but then it made me think a lot about, well, how are all the parts in this factory organized, and then how do the parts flow into the factory, and what does it look like, and, you know, why is it built this way? And I learned
a boatload about how cars go together. Who knew? And then that turned out to be really useful for
thinking about how computers come together and some of the computer skills I had are actually
useful for building cars. Yeah, it was really stimulating, made me think about things way differently than before, and surprising and, you know, kind of unusual. So yeah, don't avoid
those opportunities, jump into them. Jim Keller, thank you so much for joining us today. It's been
a real pleasure talking to you. We've learned so much. And I'm sure our listeners will enjoy a lot
too. Yeah, it was a truly insightful conversation. And thank you so much for being on the podcast.
And to our listeners, thank you for listening to the Computer Architecture Podcast.
Till next time, it's goodbye from us.