Computer Architecture Podcast - Ep 11: Future of AI Computing and How to Build & Nurture Hardware Teams with Jim Keller, Tenstorrent
Episode Date: February 14, 2023
Jim Keller is the CTO of Tenstorrent, and a veteran computer architect. Prior to Tenstorrent, he has held roles of Senior Vice President at Intel, Vice President of Autopilot at Tesla, Vice President and Chief Architect at AMD, and at PA Semi, which was acquired by Apple. Jim has led multiple successful silicon designs over the decades, from the DEC Alpha processors, to AMD K7/K8/K12, HyperTransport and the AMD Zen family, the Apple A4/A5 processors, and Tesla's self-driving car chip.
Transcript
Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to
cutting-edge work in computer architecture and the remarkable people behind it.
We are your hosts.
I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Today we have with us Jim Keller, who is the CTO of Tenstorrent and a veteran computer architect.
Prior to Tenstorrent, he has held roles of Senior Vice President at Intel, Vice President
of Autopilot at Tesla, Vice President and Chief Architect at AMD, and at PA Semi, which was acquired by Apple.
Jim has led several successful silicon designs over the decades, from the DEC Alpha processors to AMD K7, K8, and K12, HyperTransport and AMD Zen family, the Apple A4 and A5 processors, and Tesla's self-driving
car chip. Today, he's here to talk to us about the future of AI computing and how to build and
nurture hardware teams. A quick disclaimer that all views shared on the show are the opinions
of individuals and do not reflect the views of the organizations they work for. Jim, welcome to the podcast. We're so thrilled to have you here with us today.
Thanks. Good to be here. Yeah, we're so thrilled. Long-time listeners will know that our first
question is, what's getting you up in the morning these days? I was going to say, I thought you were
going to say, what's keeping you up at night? Well, I had a literal keep me up at night. I was just in India
for a week. We opened up a design team there and then I met the IT minister for compute.
India has an initiative to promote RISC-V high performance servers and India-based design. So
I've been talking to those guys about it.
Literally was up very early in the morning. So that's one thing. I think the AI and modern tools and a few other things are causing change faster than anybody really thinks about.
The tools are changing, the design point's changing, the code's changing,
and like how do you build and design computers and software so you can go faster and use those tools,
you know, top to bottom. Like we went from, you know, custom design to design using CAD tools
to like SOC designs where you have
multiple IP components and then you put those together.
And now the design complexity keeps going up.
Moore's law gives you more transistors,
but you still want to make progress at a good rate.
And then how do you do all that together?
And so like that's,
and then that opens you up to applications, and then AI applications are really crazy.
And I've been learning a lot about it.
So you think at some point things would slow down, but the opposite's happening.
Things are actually happening faster.
Although I tend to wake up more in the middle of the night thinking about things.
Me too, actually.
I'm a 2 or 3 AM-er myself. So one thing you said I thought was really interesting, which is about tools. And I know
you've talked about this in some of your other interviews and stuff. Because it seems like
everything's changing faster than you can kind of accommodate for. And then in order to build new systems with all the changing technologies, you need the tools to change with it. But then it's kind of like a conundrum because
you need to change the tools on top of technology that's changing faster and faster. So when I was
young, I thought being a computer architect was great because this ground you stand on is always
changing, which means that nothing stays stagnant and you can always kind of innovate and do new things. But now it's almost like the ground is changing and one day it's lava and one day it's like ice cold and you
have to change your shoes and nobody's designed the shoes yet. One day it's lava, one day it's
ice. That's a pretty good metaphor. I like that.
Oh, thanks. So how do you accommodate and deal with all this change, when the tools that you would want to reason about the change and help you make these designs are themselves changing? Particularly as we shift towards these really specialized designs, and the AI algorithms themselves are changing very fast. Everything is changing very fast. So how do you cope with that?
Yeah. So, well, first, one requirement, and this is hard on big companies with lots of legacy, is you need new designs. I've said every five years you need to write stuff from scratch, period. And there's a big reason for that: because no matter how good your old thing is, as you add little improvements and patches, it slowly becomes tangled together. A friend of mine sent me this paper titled A Big Ball of Mud. And it's a really
old school website with a picture of a big ball of mud on the top. And it talks about no matter
how carefully you architect hardware or software, you have nice clean components, well-defined interfaces
or APIs.
Over time, like this piece of software will learn about this piece,
and this will do something because of that.
And somebody will communicate in a meeting,
and they'll figure out a clever way to make something faster,
and pretty soon it's all tied together.
So you need to have something new.
We're building a RISC-V processor, a fairly high-end one.
We spend a lot of time up front architecting it so that each subsection of the computer
is really modular and has really clean interfaces
so that, you know, the mission,
I told him it would be unbelievably great
if we found 90, 95, maybe 99% of the bugs
at component level instead of at the CPU integration level.
And like SOCs went through that transition.
Like if you buy high quality IP from IP vendors
and you make a chip,
you don't really expect to find any bugs in the IP.
So, and, but if you're a company with lots of legacy IP
and some of the IP was created by breaking up a more complicated thing, and you never cleaned it up, you might find a large percentage of your bugs when you put pieces together.
And you have to fix that.
And so new design gives you an opportunity to say, I'm going to go redesign this and make it clean at the modular
level.
And when I worked on Zen, some of the verification team came to me and basically told me, like,
Jim, we really want to test all the units with really extensive test benches.
And the old school way of thinking was, oh, yeah, sure, recreate the whole design twice.
You know, you have a load store unit, and now you have to make
a thing to test the load store unit, which looks a lot like the rest of the computer, right?
But they were right, because actually the code to test the load store unit is actually simpler
than the rest of the computer. And if you put 10 units in a row together, you know, fetch, instruction, iCache, decode, rename, schedule, integer, execute,
load store, your ability from a program to manipulate the load store unit is tough because
you're looking through five layers of really complex state machines, right? Whereas if you
want to make the load store unit do all its things right from the load store pins, you can do that.
So I was sort of getting there, but the verification engineers made the case it would be way easier for them to write more code at the test bench level and have modular stealthy tests.
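To make that trade-off concrete, here is a minimal Python sketch of the idea, with an invented toy load/store unit and test (not any real team's verification code): driving the unit straight from its own interface is much simpler than reaching it through five layers of pipeline state machines.

```python
# Toy illustration: testing a load/store unit directly at its interface
# versus reaching it through the whole pipeline. All names are hypothetical.

class ToyLoadStoreUnit:
    """Stand-in for an LSU model: a tiny memory with store-to-load forwarding."""
    def __init__(self):
        self.mem = {}
        self.store_buffer = []          # pending (addr, data) stores

    def store(self, addr, data):
        self.store_buffer.append((addr, data))

    def load(self, addr):
        # Forward from the youngest matching pending store, else read memory.
        for a, d in reversed(self.store_buffer):
            if a == addr:
                return d
        return self.mem.get(addr, 0)

    def drain(self):
        for a, d in self.store_buffer:
            self.mem[a] = d
        self.store_buffer.clear()


def test_forwarding_at_unit_level():
    # Direct, component-level test: we poke the LSU's own interface.
    # No fetch, decode, rename, or scheduler state machines in the way.
    lsu = ToyLoadStoreUnit()
    lsu.store(0x40, 7)
    assert lsu.load(0x40) == 7      # must forward the pending store
    lsu.drain()
    assert lsu.load(0x40) == 7      # and still see it after drain


if __name__ == "__main__":
    test_forwarding_at_unit_level()
    print("unit-level LSU checks passed")
```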
And then as soon as you think that, then you say, well, why does the interface between these two units have 100 randomly named signals?
Like if you've done detailed computer architecture, you know, there's a signal called stage four fetch valid, except when, you know, something happens, right?
That's not a verifiable signal.
And computers are full of that stuff, for timing reasons, for no good reason, or because they had to fix a bug: oh no, this unit needs to know that this one is in a state, right? Whereas, you know, computers have well-defined interfaces: memory, fetch, fill, exception, kill, stall. So there's a really interesting challenge of, like, how do you build a computer with well-defined interfaces. And so your question, I made this comment in a talk I did: whenever something's hard, it's because there's too much complicated stuff in one place. Like you have to break things down.
And sometimes the problem isn't whether the tools are there or not,
but that you've tried to solve something, you know, in too many ways.
Like you have an RTL problem, a timing problem, a physical design problem,
and it gets to be too much.
And you have to really figure out how to break that down
and make it simple enough to do. And, you know, verifiable design, verifiable interfaces, you know, architecture, thinking a little bit differently, is important. So yeah, I've thought a lot about that. And then you can see it, you know, like some projects you have a lot of success, and it's partly because you really took the time to architect the pieces up front and make them very independent and really clean, and you had discipline not to slowly turn it into a ball of mud, which is a natural human tendency, apparently.
Yeah, so that's super interesting. I want to follow up a little bit on this ball of mud question, because two things come to mind for that. So early on in your answer, you said something about how it's hard for large
companies with a lot of legacy to do this. And yet we do have a lot of large companies who have
stayed alive for a long time. So I sort of wonder, you know, when I was a young engineer, I was like,
how does anything work? I didn't understand how anything could possibly work. And then secondarily,
this whole thing about, you know, I've seen the kind of signals where you've got like a signal that's called one hot selector for this thing in the
front half of the cycle and the other one hot selector for the back half, you know, it's just,
it's just a total mess. And then we've got, we do get these students coming out of schools and,
and maybe some of them have never written RTL in their life. They learn all their computer
architecture from, from reading, you know, boxes
and arrows and stuff. So how, how do you then form a team where then you do have the discipline to
avoid this ball of mud where it's just like, okay, no, we're going to name these things, right?
There's going to be a reason for this signal that the signal is going to be, you know, eight bits
wide, and we're going to enumerate every single one of those eight bits with a proper name and
proper state. Like, how do you push that out from where you are?
Yeah, so there's a couple of things there. So one is, you know, like 100 of the Fortune 100 companies from 100 years ago are gone. Like, GE is still around, but in a completely different form. So companies do go through life cycles, and almost all of them disappear over time.
And some get propelled for monopoly reasons or infrastructure reasons or something.
So success today does not guarantee success, although the time of that is longer than you think.
Most companies don't fail in five years.
They fail in 25 or 50.
So that's a thing.
And then Steve Jobs famously said, so you have some new product
and then you make it a lot better and then you refine it. But to get to the next level, you have
to make another new product. And the problem is, is the new product isn't as good as the refined
old product, but you can't make the old product any better.
That's the best rotary phone that will ever be.
The push button phone is, it doesn't feel as good.
It doesn't look as good.
And you have to have the courage to jump off the refined high spot
to a lower spot that has headroom.
Like this is a quote from one of his random talks.
So I recommend people go like some
of Steve Jobs' keynotes were great. Like I watched a bunch of them and he was very clear about what
it means to design a product and then to believe in where you're going. So you have to, and it's
really hard for marketing and salespeople in a big company. So you go, hey, we've got this great
new product. It's 10% worse than what we have today.
But over the next five years, it's going to be twice as good.
And they're all like, well, we'll wait five years.
But then you're not working on the right thing.
So you have to do that.
So that's one.
The other is I had this funny talk with a famous big company who was providing IP.
And they had this interface,
which was a lot of wires and a lot of random stuff. And I told them, the interface is too
complicated. And they'd go, yeah, but it's really, it's a small number of gates. And I said,
no, you don't get it. The wires are expensive. The gates are free.
So one thing you do is instead of saying,
I got this lean and mean state machine
and I export all the wires,
is you take that and you put it into a box
and you turn it into an interface.
And you trade off,
at the time I thought we traded off gates for wires.
Add a bunch of gates because they're cheap, Moore's law gives you lots of gates, and have fewer wires.
But a better way to say it is we trade off
technology for complexity.
Like if you go look at a DDR memory controller,
for example, on an AXI bus.
So a typical IP you can buy from
three or four or five people.
AXI transactions are very simple. There's like
15 commands, and you mostly use read and write. So at the controller, you say, I want to read 32 bytes of data at this address. That's a really simple interface. Inside the controller,
there's a memory queue. You might have 32 or 64 transactions.
There's a pending write buffer.
There's a little state machine that knows the current state of the DDR channel.
Maybe there's two DDR channels.
There's two DIMMs per channel.
DDR DRAMs are really complicated widgets because they've got a read cycle, a RAS cycle, a CAS cycle, a refresh time. They're in different states. But the read transaction, it's really easy. Read, address. Write, address, data. You don't know anything about the state of them. Now, if you build a high-performance system, you say the CPU is going to be optimized, and we're going to send read commands to the DRAMs, and we know that we're going to have to sequence a read command, so we're going to hit it with this RAS early to get the DRAM.
So you can export the complexity of that. And then the CPU knows exactly on cycle 167 that
there's the first piece of data that's going to come out. We used to build CPUs that would wrap the return read transactions.
So you got the requested word first.
Like we had all kinds of complexity.
But nowadays, the transaction is really simple.
Read or write at the memory controller, period.
The data comes out at a random time.
It always comes out in the same order.
You don't export the complexity of the CPU to the memory controller.
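For illustration only, here is a tiny Python sketch of that contract, with invented names and fake timing: the requester sees just read and write, and the RAS/refresh-style sequencing stays buried inside the controller.

```python
# Hypothetical sketch of the idea above: the requester sees only
# "read(addr)" and "write(addr, data)"; the row-activation and state-machine
# details stay inside the controller. Names and timings are invented.

from collections import deque

class ToyDDRController:
    def __init__(self):
        self.pending = deque()          # simple transaction queue
        self.open_row = None            # state the requester never sees
        self.mem = {}

    def read(self, addr):
        self.pending.append(("RD", addr, None))

    def write(self, addr, data):
        self.pending.append(("WR", addr, data))

    def _activate(self, row):
        # RAS/precharge-style sequencing hidden behind the simple interface.
        if self.open_row != row:
            self.open_row = row

    def tick(self):
        """Service one transaction; returns (addr, data) for completed reads."""
        if not self.pending:
            return None
        op, addr, data = self.pending.popleft()
        self._activate(addr >> 10)      # pretend 1 KiB rows
        if op == "WR":
            self.mem[addr] = data
            return None
        return (addr, self.mem.get(addr, 0))   # data returns "whenever"

# The requester never learns which cycle the data shows up on; it just
# gets completions in order, which is the whole point of the contract.
ctrl = ToyDDRController()
ctrl.write(0x1000, 42)
ctrl.read(0x1000)
print([ctrl.tick(), ctrl.tick()])      # [None, (4096, 42)]
```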
Now, partly it's for a good reason.
The CPUs have really big caches and really good prefetchers.
They're running at three, five gigahertz.
And the memory controller latency is 150 nanoseconds.
Like wrapping that transaction saves 0.2 nanoseconds out of 100. It's dumb complexity, right? So you look at your design and go, well, how do I get the complexity of an interface to be that simple, right? And then there's a funny one, which is people always want to say,
well, why don't we have an industry standard for cache coherency?
Well, cache coherency is a distributed state machine.
So now you're saying, well, the people at Qualcomm
are going to have the same spec as ARM as somebody else.
And that's a hard thing to do.
Whereas specs like AXI,
there's a bunch of specs that are pretty commonly used that are
really simple. And when you make them simple enough, then many people can use them. PCI Express
is simple enough. Ethernet is mostly simple enough. So you see these things that are common,
standards work on them. Like the first version of InfiniBand tried to optimize latency and they had a 1300 page spec
and nobody could build a device to the InfiniBand spec. So they went through some soul searching and
radically simplified it and focused on being a generation ahead on PHYs and having a simple,
but good enough RDMA and some other things. And then that became a product that people could use
somewhat successfully. How do you avoid complexity?
Well, at the top, you have to decide it's a real goal.
And then you're going to spend something on it.
I'm going to put an extra 100,000 gates in each interface so that the interface is simple.
And in a computer with 200,000 gates, that's crazy.
In a computer with 100 million gates, that's genius.
Like there's a different calculation.
Yeah, that was a fascinating set of insights.
I think starting from the top, you talked about the need for new designs
and the need for us to sort of revamp things from scratch like every five years.
And I guess some of the ingredients for how do you enable doing that?
One part of it is, of course, modular, clean interfaces, but also the discipline of ensuring
that you have these interfaces that are simple, not too complex at various layers of the stack.
Maybe I can double click on your experiences of doing this in the AI world, because that's one of the places where there's a lot of need for this clearly because compute demands are growing unabated.
At the same time, there seems to be a desire or a willingness to try out and experiment with new ideas. One could argue that at certain layers of the stack, we have seen some amount of abstractions forming; for example, you know, matrix operations, or tensor operations more generally, have been the bread and
butter for deep neural networks in this coming era. Do you see that philosophy and that perspective
sort of trickling up and down the stack? Because there is the operators themselves, but on the
software side, once again, there's a lot of complexity. Once you push down into the hardware
side, as you said, you know, you're still designing interfaces with 100 wires
for something that is semantically just a read and a write.
Yeah, so this is a really good one.
And I'd say in AI, we haven't figured out
what the hardware-software contract is yet.
And I'll give you an example.
So in the CPU world, and this is not quite true, but this is close to true, software does arbitrarily complicated things.
Like if you go look at a virtual machine, JavaScript engine, like it's amazing, right?
There's really complicated things.
And then I grew up, like when I learned to program, I programmed an assembly.
Like I used to know all the opcodes for a 6502 and an 8086, and
most of them for the VAX.
And then I learned C programming.
C programming is great because it's an assembly language.
It's a high-level assembly language at some level.
As an architect, you write C code, and you can see what the instructions will be generated
and mostly see how it's going to execute.
It's pretty simple. But the actual contract for a modern computer is
operations happen on registers, period.
You put data in registers,
and then you do adds and subtracts and multiplies on them,
and you can branch on them.
And then from a programmer's point of view,
there's a memory model where you load things basically in order.
Like if you load address A and then you load it again, you never get an older value after you get a newer value. So it looks like you have ordered loads, and then you mostly have ordered stores. And there's weak ordering models, but they mostly don't work, because you have to put barriers in to fix it. So basically, data lives in memory, you load it with relatively ordered loads, you do operations on registers, and then you store the data out with ordered stores. Right. And then there's a page model,
paging model, a privilege model, a security model. And then, but those are orthogonal to the execution model.
And so underneath that simple model, you can build an out-of-order computer.
And it took us 20 years to figure out how to do that.
Rule number one is you don't violate the execution model that the software people see.
So VLIW failed because they tried to violate the model.
Weak ordering kind of fails because it violates the model. Like anybody, like people who did radical recompiles to get performance, you know, with a simple engine failed.
Like out of order execution is really wild because the instructions issue wildly out of order, but they don't violate that model.
And to achieve that, we built register renaming and something called a kill.
So you execute a bunch of instructions out of order
and some event happens,
and then you flush all the younger instructions
from the kill point,
and you finish the older ones in order, right?
We have massive branch predictors, data prefetchers,
but no matter what you do,
you don't violate the contract on execution.
And that means that the software programmers
do not have to be microarchitects, right?
As soon as you ask them to be microarchitects, you failed.
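A toy Python sketch of that renaming-and-kill idea, with hypothetical names (not any real microarchitecture): instructions can complete out of order, a kill flushes everything younger, and retirement stays in program order, so the software contract holds.

```python
# A toy reorder-buffer sketch of the "kill" idea described above:
# instructions may finish out of order, but retirement is in order, and a
# kill at some point flushes everything younger. Purely illustrative.

class ToyROB:
    def __init__(self):
        self.entries = []               # dicts in program order

    def issue(self, tag):
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):            # can happen in any order
        for e in self.entries:
            if e["tag"] == tag:
                e["done"] = True

    def kill_after(self, tag):
        """Branch mispredict/exception at `tag`: drop all younger entries."""
        idx = next(i for i, e in enumerate(self.entries) if e["tag"] == tag)
        self.entries = self.entries[: idx + 1]

    def retire(self):
        """Retire from the head only, in order, so software never sees OoO."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.pop(0)["tag"])
        return retired


rob = ToyROB()
for t in ["i1", "i2", "i3", "i4"]:
    rob.issue(t)
rob.complete("i3")                      # finishes early, out of order
rob.complete("i1")
rob.kill_after("i2")                    # i3, i4 are flushed
rob.complete("i2")
print(rob.retire())                     # ['i1', 'i2'] -- in order, no i3/i4
```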
Itanium had like eight barriers.
Nobody knew what they were for.
Like, I was at Digital when we built Alpha. We had a memory barrier, and then we added a... because we had weak memory ordering, we violated the execution contract and we broke all the software. So we had a memory barrier, but they didn't know where to put it in. The operating system had a double MB macro because they didn't know where to put it in. So two of them in some places seemed to fix some random bugs.
Like, I'm not kidding.
Now, we added a write memory barrier, which we thought would make things better. And it made it worse, because they just put the write memory barrier in the memory barrier macro because they didn't know what to do with it.
Right.
So it was like a worst case scenario.
So now look at AI software.
So AI software has been developed mostly by programmers.
And programmers understand the execution model pretty well. Data lives in memory. You declare
variables, which gives you a piece of memory, or you do something like malloc and free, which is
some kind of memory allocator on top of our memory model. But generally speaking, when you're in a program, you don't talk about variables as addresses; they have names. And generally, to do an operation, you know, you say A will be B times C. Implicit in that is the load of A, B, and C. They go in the registers. You do operates on them and send them back.
And GPUs today are sort of executing that model.
Like you have lots of very fast HBM DRAM, and all that data's in memory. For every matrix multiply, the data is in memory.
You load it in the registers.
You do the operations.
You write it back out again.
So you're constantly writing in and out data.
That makes sense. Now,
at Tenstorrent, we believe that when you write that program, and you can see it very clearly in all
the descriptions of AI, that program actually defines a data flow graph. So if you go Google
Transformers or ResNet, you'll probably get a picture, right? And the picture will be a graph.
And the graph says there's an operation box
where data goes into it, something happens,
and then something flows out of it.
And then generally, they call the inputs activations, and the local data for that operation, weights.
And the number of operations they do is actually quite small.
Matrix multiply, convolution,
some version of ReLU, GLU, Softmax,
and then a variety of what they call
tensor modifications,
where you shrink or blow up the matrix,
you pivot it, you transpose it,
you convert 2D to 3D.
There's a bunch of tensor modifications,
but the number of operators in that space is low. And then people are stylized on things like, you know,
how do you exactly program ReLU? Like there's some implementation methods. So it's interesting
that the programmers are writing code in PyTorch to a programming model that looks like standard programming.
They're describing what they're doing in terms of graphs, because that's a nice way to think about it. But the code itself we see is a mix. So the challenge is like, how do we come up with
a programming model that we all believe in and understand that can go fast and not have to do
things like read and write all the data to memory all the time, because some operator expects data to be in memory and that's the only good way to do it.
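As a rough illustration of the graph view he describes, here is a small Python/NumPy sketch with invented names: each node is an operator, and intermediates flow from producer to consumer rather than being written back to a big memory after every operation.

```python
# A hand-wavy sketch of the dataflow-graph view described above: each node
# is an operator, edges carry intermediate tensors, and intermediates are
# handed producer-to-consumer instead of round-tripping through a big memory.
# NumPy stands in for the actual math; names are invented.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# name -> (op, inputs); a tiny static graph: out = relu(x @ w1) @ w2
GRAPH = {
    "mm1": (np.matmul, ["x", "w1"]),
    "act": (relu,      ["mm1"]),
    "mm2": (np.matmul, ["act", "w2"]),
}

def run(graph, feeds):
    # Each intermediate is produced once and consumed by later ops; nothing
    # here forces a round trip through external memory between operators,
    # which is the property a graph compiler can exploit.
    values = dict(feeds)
    for name, (op, ins) in graph.items():     # already in topological order
        values[name] = op(*[values[i] for i in ins])
    return values["mm2"]

x  = np.ones((4, 8))
w1 = np.ones((8, 16))
w2 = np.ones((16, 2))
print(run(GRAPH, {"x": x, "w1": w1, "w2": w2}).shape)   # (4, 2)
```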
And then I've talked to AI programmers who are like,
I'd happily recode that to make it twice as fast.
That's one view.
And the other view is I really don't care because all the upside in this is either size,
bigger weights, more parameters.
And the hardware is going to make it faster.
And in the short run, I'll just buy more processors.
So there's a really interesting dynamic about this.
And it sort of feels like, you know, when we first started building out-of-order processors... I guess I started working on it in 95. It had been around for a while; the IBM 360 did out-of-order execution. The 360-91, I think,
it was amazing. But when I was at Digital, there was a debate about whether you could actually make
an out-of-order computer work. And there were the competing ideas, where, you know, there was superscalar, VLIW, out-of-order, and then little window, big window. There's a bunch of ideas about it. And what won, like, clearly, I think, but you know, there's still some people who are debating this, is out-of-order machines with big windows and really well-architected reorder buffers, renaming, and kill interfaces. Like, that works, right? And a really simple programmer model that works. So the interesting thing is, the GPUs, like some people tell me, well, GPUs just work. But NVIDIA has thousands of
people hand-coding low-level libraries. There was a really good academic paper that said, hey, I decided to write matrix multiply, and I wrote it in CUDA the obvious way, and I got five percent of the performance of NVIDIA's library. And then they did the obvious HPC transformations of transpose one of the matrices, sub-block it for the known register file size, and they got 30%. And from 30 to 60%, they start to hack. They know how big the register file is, how many execution units there are, how many threads, how many, you know. And the NVIDIA library has, you know, what they call CUDA ninjas, who are great programmers who know how to make that work. Now, the charm of CUDA is you write a CUDA program, it'll work. The downside is the performance may be randomly off by a factor of a lot. But when you're writing your code, well, why would you write matrix multiply on the GPU?
There's a big library for that.
So they have a program model that works in libraries that mostly solve your needs.
And now you're arbitraging the last 10, 20%.
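The kind of transformation being described, blocking a matmul for an assumed tile size that fits in fast local storage, can be sketched in a few lines of Python/NumPy. This is only a shape-level illustration, not the actual CUDA kernels from the paper he mentions:

```python
# Toy illustration of the "obvious HPC transformation" mentioned above:
# the same matmul written naively and then blocked for an assumed tile
# size that would fit in fast local storage. Pure Python, for shape only.

import numpy as np

def matmul_naive(A, B):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]   # every step reaches for far-away data
    return C

def matmul_blocked(A, B, tile=32):
    # Work on tile x tile sub-blocks so each block of A, B, C gets reused
    # many times while it is "resident" (registers/SRAM on real hardware).
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, p0:p0+tile] @ B[p0:p0+tile, j0:j0+tile]
                )
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(matmul_naive(A, B), matmul_blocked(A, B))
```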
But that computer doesn't look anything like the way the actual AI programmers describe the program, right?
And so that's a really interesting thing.
So we're building a graph compiler.
So our processors are an array of processors, which have some low-level hardware support for data flow.
And there's some interesting methods about how you take big operations and break them up to
small operations and coordinate that. And the charm of it is it gives you much more efficient
processing and less data movement, less reading and writing of memory. There's interesting things.
Think of it like, so you say, if I have the RAM big enough to hold all the data I ever need, it's a big RAM, and that's lots of power. So if you break the RAM into a smaller thing, it's much more efficient per access, but then you might have to go to other RAMs. So there's a trade-off on the RAM size. And then matrix multiply has this curious phenomenon of, for an N by N matrix,
it's N squared data movements for N cubed operations, right? Which is sort of how AI works. Like AI, as you make the operations bigger, you get more computation relative to data movement.
And then there's ways to optimize that further
by breaking the big operations into the right size.
So they're big enough to have a good ratio
of compute to data movement,
but small enough to be efficient on local memory access.
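Here is the quick arithmetic behind that ratio, as a small Python sketch with assumed fp16-sized elements: compute grows as N cubed while data movement grows as N squared, so bigger blocks give more compute per byte moved, up to whatever fits in local memory.

```python
# Quick arithmetic behind the N^2-data-for-N^3-ops point above: for an
# N x N x N matmul block, compute grows as 2*N^3 FLOPs while the data
# moved is roughly 3*N^2 elements (read A, read B, write C once).

def block_stats(n, bytes_per_elem=2):            # assume fp16-ish elements
    flops = 2 * n ** 3                           # multiply + add per element
    bytes_moved = 3 * n ** 2 * bytes_per_elem
    return flops, bytes_moved, flops / bytes_moved

for n in (32, 128, 512):
    flops, moved, intensity = block_stats(n)
    print(f"N={n:4d}: {flops:>12,} FLOPs, {moved:>10,} bytes, "
          f"{intensity:6.1f} FLOPs/byte")
# N=  32:       65,536 FLOPs,      6,144 bytes,   10.7 FLOPs/byte
# N= 128:    4,194,304 FLOPs,     98,304 bytes,   42.7 FLOPs/byte
# N= 512:  268,435,456 FLOPs,  1,572,864 bytes,  170.7 FLOPs/byte
```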
But it's very much like you can see all the AI startups
are taking different approaches at this.
And it's not because people are, you know,
trying to do something different.
It's because there's a real problem there,
which is, you know, how the programs are written,
how they're described, what they look at
are very different things.
And like, it's technically interesting.
And I think the solution will be much better than,
oh, we'll just keep scaling faster memories forever.
Like that doesn't seem like the right approach.
Yeah, I think that's a fascinating set of points.
I do want to expand a little more on the AI hardware software contract or execution model
that we know of in the hardware software realm typically.
So one of the attributes, at least of the state-of-the-art models today is like they
require a lot of scale.
Like you have chips, they're interconnected together, and you scale them out to really, really large systems. I wanted to get your perspective on...
Well, actually, they're really small systems. So I think you have your metric wrong. So the human brain seems to be intelligent, and people estimate it at 10 to the 18th to 10 to the 22nd operations, depending on who you ask, right? So a GPU is currently, you know, 10 to the 15th operations a second. So it's
off by something like six orders of magnitude. So we have
a computer about this big, which is an
average intelligent operation computer.
And then to build that today with GPUs
would take, you know, 100,000 GPUs or something, which is like the problem is in the GPU side.
So we say that's big, but it's not that big. Well, big compared to what, right? That's the
funny part. Like it used to be, it took a really big computer to run a simple Fortran program. You could say
that it was a big computer, but now that computer fits on a half a millimeter of silicon. The
Fortran computer of the 70s is 0.1 square millimeters. Moore's law fixed it. Size is
a relative thing. Today, yes, to build a big training machine, you put together a thousand GPUs and it feels really big.
And the parameter count is like, you know, 30 billion parameters.
And there's, you know, a petabyte of training data.
And we go around going, those numbers are so big.
Here's a funny number.
So a transistor is a thousand by a thousand by a thousand atoms.
A hundred nanometers.
Just think about like a seven nanometer transistor,
they call it seven,
but it's about a hundred by a hundred by a hundred nanometers
and a hundred nanometers is about a thousand atoms.
So that's a billion atoms.
And we use that to resolve a one and a zero.
Now, we resolve a one and a zero at about a gigahertz, which is pretty cool. So it's a billion ones and zeros per second out of a billion atoms. So is a billion a big number or a small number? I don't know. Like, the machines look big, but the computer in an iPhone would have been, you know, like 101 computers, which were big in their day.
And now we think of them as a $20 part that, you know, fits in a three-watt envelope. So it's a
relative measure. And AI programs today are big compared to traditional computing, but they're
small compared to actual, you know, most average intelligent people. Yeah, I hear you. I think, I think that's a fair point.
The intent behind the question was more to say you have things where you run
things on a single chip, like traditionally, you know, we had this,
when you're building a chip,
we had a clear execution model and contract and how these chips were hooked
together was a separate problem in some sense,
like the distributed computing realm. If you went to databases, they had their own set of protocols, their own set of
execution models for how database transactions would execute and so on. If you went to the HPC
world, they had a different set of execution models and contracts. For ML or AI in particular,
do you still see that we can have this separation? Do you think that there's a need for a more
unified view across the
chip level boundary to the system level boundary as well? Because you have various forms of
parallelism. The fascinating thing about a current, like a thousand chip ML GPU computer,
is first there's an accelerator model. There's a host CPU and an accelerator. Then in the GPU itself, there's a memory-to-memory operator model.
And then that node runs some kind of networking stack between multiple nodes, and then it's coordinated with something like MPI. So you have a memory-to-memory model, an accelerated model,
a networking model, an MPI model.
And so to make it all work, this is before you even run a program.
It's kind of amazing.
And you can look back when processors had FPU accelerators.
The FPU had a driver, right?
So you had to send operands to the FPU and then pull it.
But when the FPU got integrated together, the floating point just became a data type and instruction and a standard programming model.
So the accelerator model occasionally disappears.
As floating point got integrated, there were still vector processors, which were accelerators for vector programs.
And they died essentially because the floating point got fast enough that it was way easier to just have a couple more computers running floating point programs than it was to manage accelerator driver models. So the current software structure is, I would say, somewhat archaic and
complicated, but it's built on well-founded things like GPUs accelerating graphics programs that have
been solid for years. Everybody looks at it and goes, man, there's a lot of memory copies there,
and oh, the programming model for the GPU is too simple. But, you know, that's a 20-year-old model, and networking works, and MPI has been used in HPC for a long time and it's pretty well fleshed out. But the fact that, you know, to run an AI program you need something like four programming models before you even write a PyTorch program, it's kind of amazing. And even the PyTorch doesn't really
comprehend the higher level things. So they're running locally on nodes
underneath some MPI coordinator thing.
It's fairly complicated. Now, if you had a really, really fast computer
that ran AI, those layers would go away.
But we don't have that computer yet.
And, and that's where the, you know, the excitement happens at the, you know,
what's the right way to think about this stuff.
And it feels very much like we're, we're in a transitional place.
And we've been through these before, like the change from in-order computers to superscalar, vector, you know, out-of-order, the VLIW war, all that took like 15 years.
And we're probably in year three of this on AI. Like, it will land. Because it seems like one of the tricky parts
with AI questions is, you know,
there's, like you say, you know,
from the program perspective,
there's a data flow graph of what they're trying to do.
You know, here's this tensor,
then you send these things here.
And then we have this hardware
that we want to build to do it fast.
And then, you know, the NVIDIA solution
is they have this middleware
where they translate that high-level data flow graph
into some really low-level libraries so that they can make sure that it's fast on this particular piece of
hardware. But the question that always seems to come up is like, how big should, you know,
we don't want to have a huge DRAM, as you say, like that can handle all of the memory that's
like in one giant chunk. We don't necessarily want one single matrix multiplier that can handle the
very largest matrix multiplier you could ever imagine. You want that broken up. And so then
how they should get sized,
how they should then communicate with each other,
and then how in the end it all gets, you know, condensed down and sent to maybe some small processor that's actually doing, like, the ReLU or something like that.
Like the question is always coming up about size.
And then that sizing is often really dictated by the current state of the art, which is not going to be the state of the art in like six or eight months.
So you asked a bunch of questions.
So first, AI is like the capabilities are changing really fast.
But the models, there's been a couple, you know, there was obviously AlexNet and then ResNet, which is a huge refinement and an uptick on that.
And then the language models came out with transformers and attention.
And then they had the bitter lesson, like size always beats cleverness.
So there's something interesting about, there's a certain stability of that.
They're obviously learning tweak.
There's a bunch of tweaking going on, like how do you tokenize the data?
How do you map it into a space?
How do you manage your training?
There's a whole bunch of things going on there.
But it's over the last couple of years, that's been somewhat stable.
The transformer paper came out, you know, how many years ago?
Four years ago, right?
And we're building way bigger models that are much refined on top of that. But that stability... so there's a new benchmark every six months, and they're hitting something called benchmark saturation. They say, you know, like, hey, we have this huge set of images, how well does the AI recognize it? And it went from
like 20%, you know, accurate to 50 to 80 to 90 to 97. And all of a sudden, those benchmarks are
saturatable, right? At 100%, you're done. Whereas a lot of CPU benchmarks are how many floating point operations a second can you do, and twice as many is always better. So some of those things, like, there's a bunch of natural language tests and math tests, those are saturatable benchmarks because you can get all the answers right. And so they've been in this kind of churn where these benchmarks are going to be good for five years and then they saturate in one. So that's a funny thing.
But let's talk about size.
So at the high end, our sizes are large compared to our technology, but small compared to the need, I would say.
That's one thing.
And then let's differentiate memory capacity.
So if memory capacity was big because it stored a lot of useful information,
that would be really interesting. But if it's big because it needs a lot of place to store
intermediate operations, that's kind of a drag, right? So architecture models and technology
will move to the point where you don't need memory to store intermediate operations. Like modern server chips, the caches are big enough
that the memory accesses should mostly be
for first time you needed the data.
Not like there's a big working set
and you're reading and writing the DRAMs over and over
to do a matrix multiply.
That would be a drag.
So in that case, the caches should get bigger
and then the matrix multiplier should be structured
so that you can do blocks.
And so that kind of behavior is well understood.
So large memories for holding a trillion useful bits
of information seems like a fine use for a large memory.
Eight terabytes of bandwidth because you need to store
intermediate operations seems kind of crazy. So there's a couple of differentiators you could
make. And then there's the observation that the brain doesn't look anything like a large memory
that you're reading and writing. So you know what a cortical column is? It's, you know, 10,000 or 100,000 neurons organized in a set of layers,
and they're very densely connected together there, and then they talk to each other at relatively low bandwidths. So that looks like an array of processors to me with local storage and distributed
computing and messaging. And it sort of looks like the graphs people say that they're building
when they write AI, which is why architecturally,
an architecture that embodies data flow and knows how to do graphs and knows how to pass
intermediate results instead of having to store them all the time, seems like the natural thing.
Was that clear? Large memories to hold large numbers of things. Yay. Our current memories are small compared to the needs.
Large memories for intermediate results.
That seems like an architectural anomaly.
We've been through this before.
You know, HPC machines, it used to be all about memory bandwidth, right? It used to be memory bandwidth, memory bandwidth, you know, run STREAM, run STREAM, right? So then you got a hundred processors with a hundred megabytes of on-chip cache. And we started to hear less about that, because more and more problems were factored into dense computation and sufficient on-chip storage, and memory starts to be storage for large datasets. Now it's not always true. And there's a bunch of problems
that are very hard to factor that way. There are some interesting things about very large sparse datasets and unpredictable datasets. The HPC guys still
have limits everywhere they look, but it's not as clear cut as it used to be. Show me the DRAM
bandwidth and I'll tell you the performance of the computer. It's more complicated.
So does that mean in some ways, it almost sounded like in terms of the sizing of the structures inside of an ML computational engine, that sounds to me like you feel like that's kind of stabilized and that's relatively solved.
But then we have all these AI startups that are trying to build hardware and the software stacks on top of them.
And you mentioned before, they all have their own different ways of doing it. So there, there is still, you know,
if the structures themselves are largely stable now, because there are some known primitives...
Let me be clear about that. So my point was, it's moving slower than people think. So the results at
the benchmark level, and some of the tweaks and stuff are moving quickly. The current set of
structures have kind of gone through two or three
generations, which are somewhat stable,
but there could be a new structure next year that changes everything.
So I don't think it's reached a plateau of stability, like out-of-order execution has.
Gotcha. Gotcha.
Right. It's still, it's an interrupted equilibrium, right?
I see.
Punctuated equilibrium, let's say.
So like when Pete, Ben, and I,
we worked together on the Tesla chip,
we used to wake up every once in a while and say,
what if the engine we just spent a year building
doesn't work at all for the algorithm
that they come up with tomorrow, right?
Like, that's a real, that's a keep you up,
that's a wake you up at four o'clock in the morning idea.
But it turned out there's always been methods.
They did come up with algorithms that don't work on that engine,
but they found ways to transform the one algorithm
to the execution engine, and they've had success with that.
And they did get a huge power and cost savings by building a really focused engine as opposed to the general one.
So, you know, that was a net win.
So we're in a state of punctuated equilibrium. And how the people write the code,
describe the software
and what the execution engine is,
the fact that those are different
is really curious
and invites, let's say, innovation and thinking.
And the sizes aren't stable
because people are pushing sizes right now
because most things would be better
if they were 10 times bigger.
Like some only asymptotically so, but there are some AI curves that are just still going. Like you
make them 10 times bigger and you're still getting better at a real rate. And that's where I think
there probably will be some really interesting breakthroughs in the next five years
about how information is organized and how to do a better job of
representing essentially meaning and relationships, which is what AI does.
Right. Yeah. Just before we close out this particular theme on the topic of future
breakthroughs and so on, reflecting back on the progress of AI, you talked about a couple of things. One is how graphs, or data flow graphs, seem to be a very good abstraction to sort of express computations and build systems on top of. You mentioned a little bit about architectural anomalies that we should probably fix, like these large intermediate memory
operations and so on. But moving forward, as you look forward to newer breakthroughs in AI,
are there any opinionated bets that you're making at Tenstorrent that you think we should be looking at as a
trend in the future? Well, there's a couple of things. One is, you know, some people observe
this, but when it first came to me, it's like, so you're taught that, you know, AI is, you know,
inference and training. So inference is, you put an input into a trained network, you get a result. And then training is something like, you have some data with an expected
result and you put it in and then you get an error and you back propagate
the error.
Right.
And when somebody explained how they train language models and some image
models, you basically take a sentence and you put a blank in it.
You run it through and guess the blank, which is,
I think is really amazing.
But to do that, you do the forward
calculation, you save everything. And then on the back propagation, you use optimization methods to
look at what you calculated and what you should have calculated and you update the weights,
right? So, so brains clearly do not save all the information that they're doing on the forward pass.
And then there's some cool papers.
There's one called RevNet, which is like a reversible ResNet.
So you don't save the intermediate results.
You recalculate them on the backward pass, which is cool.
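A toy numeric sketch of that reversible idea, in the additive-coupling style RevNet uses (F and G here are arbitrary stand-in functions, not a real network): the backward pass can recover a block's inputs from its outputs, so they don't have to be saved.

```python
# Reversible-block sketch: with additive coupling, the inputs can be
# reconstructed exactly from the outputs, so no activations need saving.

import numpy as np

def F(x): return np.tanh(x * 1.5)
def G(x): return np.tanh(x * 0.5)

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def invert(y1, y2):
    # Recover the inputs from the outputs -- no saved activations needed.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = invert(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
print("inputs recovered from outputs without storing them")
```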
So it seems like there's going to be breakthroughs on how we do training.
And also, when humans think, we don't train all the time.
Like, Ilya at OpenAI said, when you do something really fast, it only goes through six layers of neurons.
You're not thinking it's trained.
That's inference.
Everything you do really fast is inference.
And then the really interesting thing that we humans mostly do is more like generative stuff.
We have some set of inputs.
You go through your inference network.
That generates stuff. And then you look at what it produced and what your inputs are. And then you make a decision
to do it again. And you're not training, you're doing these cycling inference loops with, you
know, that part of your mind is sort of your current stage of understanding, which, you know,
you could say is your input tokens, but it's decorated
with like what you're trying to do, what your goals are, what your history is. And then every
once in a while, as you're thinking about something, you go, that's really good. That's good.
And then you'll train that. So humans have multiple kinds of training. Something exciting happens, and you remember it from your life from one instance, right? So we have a method for training, like doing a flash, remember exactly what happened.
And then we have procedural training, where you do something repetitively and you slowly train yourself to do that in an automatized way. And then we have the thinking part, which is like generative learning, where you're stewing on it, you're trying this, you're trying that, and then you find a pattern that's superior to anything else you've thought of, and then we train that, because you use that as a building block for the next thing. So humans are generative models. And there's a lot of innovation, and they call it prompt engineering and there's all kinds of things. But the structure of it, it's almost like it's not philosophical enough yet to be thinking.
So humans think in terms of we have overall goals, we have moral standards, we have stuff
our parents told us to do.
We have short-term goals, long-term goals.
We have constraints by our friends in society.
And that's also our present when we're doing our daily tasks of whatever we're trying to do, which is mostly not instantaneous inference.
And it's mostly not training.
So I think that's a really interesting phenomenon.
And the fact that these big generative language models are starting to do that, it's really, really curious.
And then thinking about like, how would you build a computer to do that better?
Like that's a really interesting phenomenon.
Yeah. No, speaking of, you know, humans, intelligence goals and building computers,
maybe this is a good segue into the other theme we wanted to talk to you about,
which is, you know, you've been at multiple companies, you've built, led, and sort of nurtured successful teams to deliver multiple projects.
I wanted to get a perspective from you on how do you think about building teams, nurturing them,
growing them, and scaling them, especially from a lens of building, you know, hardware systems
or processors and so on. What have been your key learnings? How do you view this problem?
So you know what the words creative tension mean, right?
Like where you hold opposite ideas in your head and then there's tension
between them.
You know, I want to get ahead, but I want to goof off this afternoon.
It's creative tension, right?
Like everybody does that.
So I partly think, so I'm a computer architect. When I first started managing a big team,
when I went to AMD in, I guess, 2012 or something like that, like I was working at Apple and I had one employee and I wasn't managing them particularly well. And then I was going to
manage 500 people and I grew to 2000 or something. So I realized I could treat organizational
design and people as an architectural problem. I'm a computer architect and people generally
speaking have some function that they're good at and then there's inputs and outputs.
So everybody knows how that works. You write a box with a function, input, output, and one of
your missions as a computer architect is to organize
functional units in a way that you get the results you want. So if you're trying to build a computer,
you need to break down what's the architecture of the organization that solves that problem.
In modern processor designs, there's an architecture group, RTL, physical design, validation. And then people, for probably evolutionary reasons,
operate best in teams of 10 or smaller. There's a manager and there's 10 people.
And a really good team of 10 people will outperform 10 individuals. Humans are designed
to do that. A bad team will underperform. I mean, there's all these jokes about as you add people, productivity goes down.
But if your teams are well-designed and your problem is well-architected, people love to work in teams they like.
You know, five to ten people working together, they're happy to see each other.
Up to like 50 people, like people all know each other pretty well.
You know, at 100, it becomes difficult to know people and you start needing boundaries because humans tend to view strangers as enemies,
no matter what you say. You can be nice about it. That's where a director will manage 100 people.
Directors know each other, but the people in the director's teams
don't need to know each other. There's an organizational dynamic that you need to figure
out. Then there's the people side. Engineers love what they do. That's one of my givens.
Engineering is way too boring and hard to do it every day with, you know, excitement if you didn't really love it.
Like people are willing to do hard, boring things if they like what they're doing.
And people who don't love engineering leave it because it's actually hard and annoying and repetitive.
And there's a bunch of stuff about it.
Like I think about the same problem over and over and over and over.
So people love, you know, so engineers generally like what they're doing.
They have to or they couldn't do it.
And then, but they, you know,
there's this interesting dimension of
they love to do things they own,
but they don't always know what the right thing to do is.
Right?
So you need to have some like hierarchy of goals
and steps to do and processes and methods and the way people interact
and motivate each other, because you're trying to get that creative tension spot between they own it and they're doing the right thing, but they're still following some kind of plan, organizing together. And that's difficult. Does that make sense? So there's creative tension between organizational design, you know, requirements, and then, let's say, the human spirit, which is, you know, people who are excited do 10 times more work than people who think this place sucks and, you know, I'd rather do anything else. So there's a huge swing. And then teams that are working well together create stuff that individuals can't.
And two follow-up questions on that theme, because I think about this a lot.
Yeah, yeah, I'm sure.
Because I think
one thing, you know, as I've transitioned from being a young engineer to a less young engineer, let's call it: that second piece of, like, constructing teams and having a clear sense of what you own
and how you solve your problems and having everybody kind of, you know, autonomous, but
marching towards the same direction is like one of the hardest organizational problems, it seems.
And so I think I saw once you said something like
people don't like to do what they're told to do. They like to do what they're inspired to do. But
one of the things that I've witnessed across, you know, multiple organizations and multiple groups
is just that just getting everybody to feel like, okay, this is what we're doing. We've all agreed
on it. You own this and you own this. I mean, that's like a, one of the hardest parts. So that's
one question. And then the second question was, you said something about groups of 10. How's your feeling about, like, remote work these days? I know they're very different questions, but those are the things that popped up.
Yeah, yeah. So I don't know what to think about remote work, because I'm not a fan. I like to work with people. That said, I've had very successful projects with teams in different places talking to each other. But I've also seen people working remotely on Slack, talking all day long. They got their Zoom chats up and
running. They're talking. It's almost like they're working next to each other. So there's a lot to
figure out about that one. Your first question. So it really helps to be an expert. So I've led
some projects successfully, and I'm a computer architecture expert. I'm not an expert in everything, but I've written CAD tools.
I've architected computers. I've written performance models. I've done transistor level
design. Like I have a lot of capability and then I'm also relatively fearless to ask dumb questions.
So if I'm in a room and people are explaining something... like, young people, please listen to this. If you don't know, ask a question.
If people don't want to tell you the answer, go work somewhere else.
Like, go figure out what's going on.
Somebody filed a complaint on me one time because I was a senior VP.
I asked too many technical questions.
Because they were used to walking in the room with bullshit PowerPoint and bullshitting for an hour about progress.
And on page one, I was like,
well, what the hell is this? What's going on here? Explain, you know, sentence one, word one,
doesn't make any sense to me. Explain it. Nobody could explain it. And so you can imagine word two
wasn't making things better, right? Somebody said, you run fastest when you're running towards
something and away from something. And I am more than happy as a leader to have a vision and lay out what I want and work with people to get there.
But I'm also more than happy to dig into everything and like, does it make sense?
And can you do it?
And you say, well, that doesn't scale.
But apparently it does.
I worked at Apple and Steve Jobs had everybody like on the balls of their feet working on shit.
Because they knew if Steve found out you were screwing around, there'd be hell to pay. Elon does it. I watched him.
You know, he motivated very large numbers of people to be very active, hands-on, technically
ready to answer questions about what they're doing, no bullshit slides.
So you need to have a good goal. You need to factor it into something where
people say, yeah, I get it. I believe it. And I could do that. You need to have competence in the
management structure. On my Zen team, the managers were very competent. They were all
technically good. They were good managers because, you know, people do kind of divide into when you
wake up in the morning, do you think about a technical problem or a people problem? So I'm a technical person. I wake up thinking about a technical
problem, but then I want to solve problems that take lots of people. So I've turned people into
a technical problem. So I read a whole bunch of books on psychology and anthropology and
organizational structure and In Search of Excellence and you name it. And then I came up with a theory about how to do that.
And one of my theories is I like to have managers work for me
who are technically competent, but good people people.
And that helps soften the edges around, let's say me, for example,
or the problem or the company.
Like when an employee does work, they have technical problems in front of them.
They have their organizational problems. Their boss might be a problem. That's a drag.
Like the company might be a problem. Competition might be a problem. You know, it can be tough,
right? So people need somebody to look after them, take care of them, inspire them.
You know, but at the same time, you have to be doing something that's worth doing.
And balancing that out, that's what I mean.
This is a huge space of creative tension.
There are certain leaders that are really hard.
I think they're too hard.
I think life's too hard for a lot of people.
I look for ways to solve organizational and technical challenges
in a way that fits most people.
Ken Olsen at Digital said there's no bad employees, there's only bad employee-job matches. When I was young,
I thought that was stupid. And somewhere around 45, I decided it was a pretty good thought.
If somebody is a good person, there's almost always a job where they can contribute.
Now, if you're in a financial downturn, you have to lay people off.
You lay off people in certain orders.
Like people know that.
But, you know, solving the problem for people
is important because I've seen it turn
into really positive results in the organization.
But there's multiple dimensions.
Like somebody said, well, what's the way to do it?
Well, you know, are your goals clear? Well, a lot of people fail right there. Say the goals are clear. There's this organizational infrastructure called goals, organization, capability, and trust. You have to solve all four of them. Goals clear, they're doable, yay. Does the organization fit? If, you know, the processor is broken into six units, do you have six good leads? And inside each unit is then capability: do you have the technical ability to do it, or can you at least identify the problems, and do you have people who are plausibly able to solve those problems? That capability is a big one. Trust is a complicated one, because it's usually the output of the first three, not the
input. If some manager says, we're going to focus
on trust and execution,
those are the outputs.
In the world of input, function, output, you can't change the output by looking at the output. You can change either the input or the function. The output is the output; it changes when you change one of those two.
So any manager who says, oh, execution, execution, execution: unless they're doing something about
it. Are they doing something about it? Are they training people? Are they hiring for talent? Are
they reviewing people properly?
Do they buy new CAD tools?
Like, what did they do to make execution better?
If all they do is say the word execution, then they're bullshitters.
So you have to solve multiple dimensions.
And you can't just solve one of them.
And there's a bunch of places underneath that where there's multiple dimensions.
And then that's when you start to really see the difference with, you know, great leaders who get projects done. And I've worked with some really great leaders. I'm just amazed, like, the Model 3 got built so fast, with so many people, across so many dimensions. And, you know, Elon was super inspirational and unbelievably good at details, but Doug Field built and staffed a really wide-ranging organization.
And I watched him do it.
I was there, you know,
when we built Autopilot
and drove a car in 18 months.
But compared to, you know,
building Model 3 and shipping it,
that was relatively small potatoes.
So it's really interesting
to look at these things.
And then you have to take them seriously.
And then you have to realize
no matter what, you don't know that much. And then, you know, you have to dig into it. And then
if you're lucky, you find the right people and you get the right place. But yeah, it's a hard
problem. And engineers probably should read way more books than they do. And people always ask me,
well, what three books should I read? And I think, well, I read a thousand. The three books I like
the best, I probably only like because of the other stuff I knew.
So I have a hard time recommending the book to a novice, but reading a lot can help a lot.
Yeah, that's one thing that's always mystified me about some of the higher level leaders that I've worked for.
And they talk about all the books that they read, because in some ways, you know, when you're a junior engineer, you're just like, OK, I'm just going to do my job.
Right. I'm going to write my code. I'm going to do my module. I'm
going to run my unit tests or whatever. And then as you make the transition, it becomes much more
like, okay, there's more than just technical things to getting this stuff done. And then
making that transition to spending your brain power on the other part, and then making sure
that you're spending it appropriately is a little bit of a tough transition because there's a lot of comfort in your boxes and arrows, right?
It should be like this.
But then it's just like, well, how do you get everybody to believe that it should be
like this?
And how do we get everybody to believe it?
And you know, somewhere between 35 and 45, you realize almost all your problems aren't technical.
Yes.
Well, and then unfortunately, it's a little like when you train a language model,
you don't say, what are the 10 sentences I need to train this model? No, more actually helps.
Right. And a lot of times a really great book is only a great book because of the other 100
books you read, because it's the one that brought the ideas together. And if you read it first,
it wouldn't mean anything. So quantity kind of counts. I'm not afraid, like, I frequently read a book and realize: most people who write books have 25 or 50 pages of stuff to tell you, but the editor tells them to write 200 pages because that's what sells. So don't be afraid to read 50 pages and go, I got it, he seems to be repeating himself. And almost all writers, once they start to repeat themselves, don't bury some really good nugget 100 pages later, right? Once they start repeating themselves, they're just repeating themselves.
Because people passionate enough
to write a book about engineering,
management, idea, inspiration, projects,
they pour their heart out until they're done.
And then they fill it out until they get to 200 pages.
So don't be afraid to throw a book out after 50 pages.
Now, I read this book, Against Method.
It's my current favorite book, by Paul Feyerabend.
And the goddamn book was like 300 pages long.
And he just had one idea after another.
I kept waiting for him to start repeating himself so I could put it down because it
was too dense.
But he didn't quit.
He just kept writing all the way through the damn book.
And it was pretty fun.
But yeah, it's a real thing.
So if you start to realize that there's more to work than just engineering technical problems, you've reached the next level, which is good.
And you should solve it because it'll make you happier.
And you'll be more successful if
you do. And you may conclude, that's really cool, and now I'm better at managing the team and I can focus on technical stuff, and now I want to manage bigger teams, or now I want to go into sales, or now I get it, I really should spend more time surfing. Like, it's all fine. But when you reach that point, it's a good thing. It's a tough thing. It's harder than college.
It's harder than college.
Yeah.
Yeah, in college there are answers in the book for the most part.
And then, you know, when you start to work,
you realize that, you know,
if the answer is in the book, you don't get paid that much.
But then when you start to try to solve
this next level problem, there's no answers at all
but there are some solutions, which is good. So maybe this is a good time to wind the clocks back,
and maybe you can tell our audience how you got into computer architecture. How did you achieve the, you know, employee-to-job-match fit? How did you get interested in the field, and how did you eventually get to, you know, your current role? Sort of random, because, in college... like, I basically goofed around in high school. Like, I see kids today studying for SATs, and I still remember, you know, being out with my buddies thinking, I think I have to take the SAT tomorrow? I should have probably not stayed out all night. But I got to college, and I'd done well enough in high school. I liked math and
physics and a few topics. And I went to Penn State, and I was a combination electrical engineering
and philosophy major. But it turns out I can't write at all. And in my sophomore year, the head of the philosophy
department sent me a note, like I think through the electrical engineering department. And he
said, I really wanted to meet you. And I was like, yeah, it's great to meet you too. This is really
wild. I didn't expect, you know, I'd just taken four philosophy classes. He goes, yeah, we noticed
that you're a philosophy major. And then he pulled out like a paper written by a typical
philosophy student for, like, a midterm: 10 pages, you know, nicely written, perfect sentences and everything. And then he had my page, which was like half a paragraph with scratch-outs and words in the margin. And he said, Jim, you're never going to get a philosophy degree at Penn State. He said, we're happy, we like you in class, but we write a lot and you
can't write at all. And I was like, oh my God, really? You're kicking me out of philosophy. I
didn't even know that was a thing. But Penn State was great. We had a two-inch wafer fab. I made wafers. So in college, I thought I was a semiconductor major, you know; my advisor ran that lab. And I learned a lot about
that. And I took a random job to build fiber optic networking controllers because I wanted to live in
Florida near the beach. And while I was there, like, it was a terrible job but a great experience, somebody said I should work at Digital Equipment, and they gave me the 11/70 and VAX 11/780 paper manuals, which I read on the plane to my job interview. And I thought that was really cool, but I went in there and I had a lot of questions. So I met the chief architect of the 11/70 and the 11/780, Bob Stewart. He was a great engineer. I had all these questions,
and he thought I was funny and he hired me as a lark, I think, because he knew I didn't know
anything about computers. I literally told him I just read the book on the plane. I'd taken one
Fortran class in my life and it didn't go well. I spent 15 years at Digital, and that's where I
learned to be a computer architect,
mostly working for Bob Stewart at the start.
But there were some other guys, Doug Clark, Dave Sager.
There were a couple of legendary people there that were really good, and I had the opportunity to work with them, and I was fairly energetic as a kid.
I just jumped into stuff, and I learned a lot.
I slowly learned computer architecture.
Like Pete Bannon and I worked together starting in 1983 or two or something like that.
We worked on the VAX 8800 together, and a couple of follow-on projects, and the second Alpha chip, EV5; I wrote the performance model for that.
And back then, you read papers, you know, sometimes, and then you did hands-on work. It was really good, and I got a chance to go into lots of different things. I wrote a logic simulator, a timing verifier, several performance models. I drew schematics. I wrote a little bit of RTL, weirdly not that much RTL in my life, but some, because that's the main method now. But when I did it, like, I used to know how to do Karnaugh maps and all kinds of screwball stuff; I don't know who does that anymore. Yeah, so Digital Equipment, I was there about 15 years,
worked on some very successful and some unsuccessful products. Our second one, the follow-on to the VAX 8800, was canceled partly for design reasons and partly
political reasons.
And it was super painful to realize.
And then Digital itself went out of business right when I left, and it got sold to Compaq.
So, you know, you get five ex-DECies together with a beer and we'll all start crying in about 30 minutes.
Because it was a great place to work and, you know, a surprising, you know, disaster.
Let's say we were building the world's fastest computers and going out of business at the
same time.
So the combination of, you know, product, market, management, let's say business plan,
like, Digital didn't transition the business plan. And by then it had been captured by the marketing people, to the extent that they thought raising prices on VAXes was a win, just as PCs and workstations came out.
That's one thing I try to sometimes tell young engineers or interns that come in.
That a lot of success in any business does not have to come down to the nuts and bolts of the technical stuff.
And in fact, it often doesn't. And so at the same
time, though, it's really important for the people who are coming up, as you said, to know a lot of
these nuts and bolts, right? So it sounds like in those first 15 years, that's where you collected
a lot of experience and knowledge about the whole process end to end of how to build a good
computer. So we want that. But then at the same time, you know, there's this understanding that, you know, like, Betamax might have been better, but it didn't win. You know, DEC, I think every ex-DEC person I know, like you said, thought it was a wonderful experience, but then was really sad about how it ended. How have you, at this point in your career, been able to sort of bridge where you can have good engineering, build good teams, and then somehow also translate that into a successful product?
Yeah, it's kind of complicated.
So first, like my first job,
I worked with some really smart people out of college,
but everybody hated the company.
And when I went to Digital, everybody loved Digital. Like, a friend of mine's partner said, what do they put in the water? What else do you guys ever do? Like, we'd work at work, and then we'd go out drinking and talk about work, and then go work
on Saturday. When you're young, like there's two really important things. One is, are you working
for somebody you can learn from and make sure you learn, ask questions, try projects, get good feedback,
and also work in an environment where it's interesting. Now, you could be in a really
good group in a failing company and get a great experience. But generally speaking,
companies are happier if you're in a growth market and less happy in a shrinking market.
You can kind of pick and choose, but right now there's some very big growing companies
which have sort of lost the plot on engineers and engineers just get lost. They hire a hundred you can kind of pick and choose, but right now there's some very big growing companies,
which has sort of lost the plot on engineers and engineers just get lost. You know, they hire a hundred interns and they put them on random stuff and they don't learn anything.
So you got to be careful about that. But, you know, a positive environment with somebody you
can really learn from is really important. But then when you get the bigger stuff, it's like,
how do you build a successful product?
Like, I went to AMD knowing that they were a failing company, right? And I thought part of
the fun would be, could we turn it around? And I worked for Rory Read, who was very clear.
He said he didn't realize they were going bankrupt until he got there and he looked at the books.
And then he said, I'll save the company. You guys build the products.
And Raja Koduri and I were the architects of turning around CPUs and graphics.
But we had some really good people.
We had organizational problems.
And, you know, a relatively small number of bad managers and bad people-job fits.
And, you know, we made a bunch of adjustments.
Some were pretty visible and some were subtler.
But I was really working hard on how you get the goals, organization, and capabilities all lined up.
And I believed that trust would come out of that.
And I had a really great consultant working with me, Venkat Rao, who gave me a nonstop stream of books and articles and stuff to read, and brainstormed with me a lot about how to be successful. So I invested in, you know, becoming a manager that could do something
useful. And I went to a company where I knew some people and I knew they had really good people,
and I knew that we could build a good product. But the challenge was all the operational
organizational stuff in the way, and some serious technical problems too. So, yeah, that was a relatively big investment. Whereas Tesla and Apple were, like, weirdly successful companies, and I went there thinking, I don't know how they work, and my job wasn't to change them, I just wanted to learn from them. When I went to Apple, like, I went through three locked doors. And, like, some people say, no, the most important thing is sharing and openness. And Apple's full of silos. And
the most important thing is caring management and leadership. And Steve Jobs was a famously difficult person
and still really successful. And people really love working there. And, you know, even though
it got all the obvious things wrong, you know, like people were hard on each other. You went
through a locked door to your project,
it was siloed.
Steve famously yelled at people
when they didn't do exactly what he wanted
while simultaneously expecting them to be creative
and do something new.
You know, Tesla was very chaotic,
but they're producing cars.
You know, like how does that stuff work?
And then it turns out there's lots of reasons why it works.
And this is where when people are inspired, despite or because of, hard to say, the situation, like wild things happen.
I learned a tremendous amount.
And then I went to work at Intel because I also, well, actually, I joined there partly because I had some ideas about how to build really high-end servers with Intel, some really great technology.
But I spent most of my time working on methodologies, team dynamics, and some basics.
I met a lot of people there.
I had a lot of fun.
I had too much fun working there.
Yeah, it was an interesting set of challenges and stretched me out.
But then, for my next thing, what I wanted: AI is boiling the ocean on computing in a really neat way.
And working with an architecture I believe in, in terms of like, this looks like the right map of what programmers are writing and what they say they're doing in the hardware.
Well, it doesn't mean that that description gives you the right hardware software contract, but there's lots of technical work to do there.
And it's evolving.
And then AI has attracted really smart people.
I meet really smart people all the time.
I met a guy recently doing AI for game engines.
We talked for four hours and it felt like five minutes.
It was really interesting.
So I like that kind of stimulating thing.
And then part of me thinks, well, how's that going to impact
how I do design, how I work with teams, how we work with people, what's going to happen in life? And, you know, it's good.
I feel lucky to be able to meet with people like that and talk to them, but, you know, I've done a lot of work to get there. You know, I work hard, I read a lot of books, I work on projects, I've sweated through difficult times with both technical problems and people problems. So engineers do the work because they love it, not because it's easy or particularly, you know, short-term rewarding. Engineering is not a good place for short-term rewards, but it's relatively satisfying. You know the difference between happiness and satisfaction? This is a funny thing, because there were all these studies they always published,
like, people who have kids are less happy than people who don't.
And 100% of parents, well, not 100%, but, you know, 100% of good parents say the best
thing they ever did in their life was raise children.
Like, my dad told me that.
I was like, really?
You worked all the time.
But happiness is what happens today, and satisfaction is the successful project over time.
Engineering is much more of a satisfaction thing than a happiness thing.
Humans have two reward systems, a slow one and a fast one.
Am I hungry?
I need food today.
Yeah, I got it.
I'm happy.
Did I survive the year?
Did my children survive childhood?
Those are satisfaction dimensions.
Engineering is way more oriented to that,
although it is fun to get your model to compile or a test to run
or you solve a technical problem or file a patent.
There's a bunch of short-term happiness,
but mostly it's a long-term reward.
So maybe on that note, you can provide some words of wisdom
to our listeners interested in computer
architecture, interested in AI, interested in building their careers, perhaps in the likeness of yours.
Well, like I said, work some.
So for people coming out of college or interns or stuff, try to find a place where people
are doing real work, real hands-on work, and they're relatively excited about it.
Like if you're, you know, when you're young, you should be working a lot of hours.
You know, you can't get where you want at 40 hours a week, 50, 60.
You know, some people say they work 80, but mostly they're screwing around part of that time.
But something where you really feel like it's easy to work hard, it's easy to put in time, do real hands-on work, and make sure you have at least a few people that you really respect.
They seem to know a lot.
They teach you stuff.
They take the time.
And then work on a couple different things.
I know people that worked in one group for 10 years, and they really loved the group.
But working in multiple projects
in different groups is really useful over time.
It doesn't mean you leave in the middle of a project,
but periodically you find something new that's challenging,
that takes you back.
Some people are really worried, like, I'm at this level here,
but if I go to that project, I'll be here.
And the answer to that is, great, do it.
Go somewhere where you have to start over. You know, at Tesla at one point I was walking along shelves looking for visors, you know, sun visors for a Model X. Like, it's a ridiculous job for me, but then it made me think a lot about, well, how are all the parts in this factory organized, and then how do the parts flow into the factory, and what does it look like, and, you know, why is it built this way? And I learned
a boatload about how cars go together. Who knew? And then that turned out to be really useful for
thinking about how computers come together and some of the computer skills I had are actually
useful for building cars. Yeah, it was really stimulating, made me think about things way differently than before, and surprising and, you know, kind of unusual. So yeah, don't avoid
those opportunities, jump into them. Jim Keller, thank you so much for joining us today. It's been
a real pleasure talking to you. We've learned so much. And I'm sure our listeners will enjoy a lot
too. Yeah, it was a truly insightful conversation. And thank you so much for being on the podcast.
And to our listeners, thank you for listening to the Computer Architecture Podcast.
Till next time, it's goodbye from us.