Computer Architecture Podcast - Ep 2: Domain-specific Accelerators with Dr. Bill Dally, Nvidia
Episode Date: October 21, 2020
Dr. Bill Dally is the Chief Scientist and Senior Vice President of Research at Nvidia, and a Professor of Computer Science at Stanford University. Dr. Dally has had a storied career with contributions to parallel computer architectures, interconnection networks, GPUs, accelerators and more. He has a history of designing innovative and experimental computing systems such as the MARS accelerator, the MOSSIM simulation engine, the J-Machine and M-Machine, to name a few. He talks to us about computing innovation in the post-Moore era, domain-specific accelerators, and technology transfer in computing.
Transcript
Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to
cutting-edge work in computer architecture and the remarkable people behind it.
We are your hosts.
I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Before we begin, we just want to acknowledge that we are in truly unprecedented times with
this COVID-19 pandemic, and we want to wish all of our listeners health and safety.
We hope you're
safe and keeping well. And hopefully, listening to this podcast will take your mind off things
for at least a brief moment. And we're very, very excited to have with us today Dr. Bill Dally,
who is the Chief Scientist and Senior Vice President of Research at NVIDIA,
as well as a Professor of Computer Science at Stanford University. Dr. Dally has had a very storied career with contributions to parallel computer architectures,
interconnection networks, GPUs, accelerators, and more. He has a history
of designing innovative and experimental computing systems such as the MARS
accelerator, the MOSSIM simulation engine, the J-Machine, and the M-Machine, to name
a few. He is here to talk with us about computing innovation in the post-Moore era,
domain-specific accelerators, and a number of other topics.
A quick disclaimer that all views shared on this show are the opinions of individuals
and do not reflect the views of the organizations they work for. Dr. Bill Dally, welcome to the podcast. We're so excited to have you here today.
I'm really happy to be here. Thank you for inviting me.
As we mentioned in the intro, you've had a long and storied career across a lot of different topics.
But especially these days, what is getting you up
in the morning? I'm just really excited about a lot of things. And the one right now
is probably building accelerators for certain demanding problems. So you're talking about
domain-specific accelerators. So for several decades, Moore's law has been a major driver
of computing performance improvements. But as we head towards the sunset years of Moore's
law, what do you think are the promising paradigms to sort of keep computing innovation going
in the years and decades ahead? Well, I think the big one is domain-specific
accelerators. And alongside that, parallel computing, and they're really very closely related. It boils down to the fact that if serial computers, single-thread processors, aren't getting any faster, and if you want to add value to applications, you need more performance, then you need to have lots of threads in parallel.
And to do that efficiently, you really need to specialize for a certain domain.
And the gains to be had there are often thousands-fold compared to a conventional processor.
Right. So what goes into designing a good domain-specific accelerator?
Well, in my mind, designing a domain-specific accelerator is really a programming exercise.
It's understanding the application and reprogramming it in a way that is hardware-friendly
so that you get rid of serial bottlenecks.
You have to make the application very parallel.
You get rid of memory bottlenecks.
You minimize accesses to large global memory structures,
and you try to transform it so that storage can be local with small memory footprint per element.
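As a rough illustration of that kind of reformulation, here is a minimal Python sketch (with made-up sizes, not an example from the episode): the same reduction written as one serial chain of scattered global accesses, and again as independent tiles with a small local footprint that map naturally onto parallel hardware.

```python
import numpy as np

data = np.random.rand(1 << 16)                 # one "global memory" array (hypothetical size)
indices = np.random.permutation(len(data))

# Hardware-unfriendly: a single serial chain of dependent updates, with every
# access scattered across the whole array (global-memory traffic, no parallelism).
def serial_scattered_sum(data, indices):
    total = 0.0
    for i in indices:          # serial bottleneck: one long dependency chain
        total += data[i]       # every reference goes to "global" storage
    return total

# Hardware-friendly: partition into tiles that can be reduced independently
# (parallel across tiles), each touching only a small local working set.
def tiled_local_sum(data, tile_size=4096):
    partials = []
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]   # small, local footprint per element
        partials.append(tile.sum())            # independent work maps to parallel hardware
    return sum(partials)                        # cheap final combine

print(serial_scattered_sum(data, indices), tiled_local_sum(data))
```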
And it's really that reprogramming that is the challenge of designing a domain-specific
accelerator. Once you've reformulated the problem into a hardware-friendly form,
crunching out the hardware is a largely mechanical activity. I think it could
actually be automated in the future. So it sounds like you're saying that a lot of this is
co-designing the algorithm along with the hardware, along with a good cost model for
what the hardware has to provide. Right, absolutely: you're redesigning the algorithm
in a form that's hardware friendly. That is the real challenge of accelerator design.
It's very rarely the case that the existing algorithm without modification
is very suitable for acceleration. Yeah. So that's a very interesting thing,
because I feel like that's a little bit of
a chicken and an egg issue, particularly in the architecture community, right? Because the typical
MO for architects, if you go to all the major conferences, is someone finds a topic or a domain
area and analyzes the heck out of an algorithm and then looks for little exploits that you could do
in the way the algorithm works, and then builds hardware in order to accelerate that. Whereas in many cases, a reformulation of the algorithm itself, fundamentally, is the more appropriate thing to do. How have you been able to sort of get out of that loop, and how would you encourage people in our community to also get out of it?
Well, I think you just have to let yourself do it.
I mean, so when we started our Darwin project
that I did with my former student, Yatish Turakhia,
we actually went exactly down that loop.
And what we found is, if we took the existing best algorithms, this was for genome assembly, for de novo genome assembly, and threw everything we knew about hardware at accelerating them, we might get 4x. And I guess we could have stopped there,
but we weren't very happy with 4x. So we asked ourselves, why is it limited to that? And we
found that it was because that particular algorithm had been very carefully tuned to get optimum
performance on a CPU where memory accesses were relatively cheap
and doing the dynamic programming, which is the core of the algorithm, was very expensive.
So they had basically set the balance between what's called the filtering stage,
where you take lots of candidates and filter them out, which takes lots of memory accesses,
and the alignment stage, which is all dynamic programming.
And they made it all filtering because that was cheaper on the CPU.
And the first thing we did that kind of transformed the game was to say,
okay, we don't need to stick to that,
those thresholds of how much filtering to do.
We could do less filtering because it's expensive.
That means there's more work left to be done in the alignment stage,
but that's really cheap.
And doing that turned it from being 4X better to being 15,000 times better.
And so I think architects just need to understand the application well enough that they feel comfortable changing it.
So they can change the details of how it's implemented, but understand that they're going to deliver the answers that whoever's using that application, in this case, you know, it's biologists doing genetic research, want, and with equal or higher fidelity.
And so in that case, it was a measure called sensitivity.
We had to basically get equal or higher sensitivity than the existing algorithms.
And then all the users were happy.
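A toy cost model may make the filtering-versus-alignment tradeoff described above concrete. The constants below are invented for illustration; they are not Darwin's actual numbers, only a sketch of why the optimal balance flips once dynamic programming becomes cheap in hardware.

```python
# Toy cost model (illustrative only) for the filter-then-align tradeoff.
def pipeline_cost(filter_strength, platform):
    """Relative cost of filtering plus alignment for one read.

    filter_strength: fraction of candidate locations the filter discards (0..1).
    platform: assumed per-operation costs; only the *ratio* between
    memory-heavy filtering and compute-heavy alignment matters here.
    """
    candidates = 1000                               # hypothetical candidate seed hits
    survivors = candidates * (1.0 - filter_strength)

    # Filtering is dominated by irregular memory accesses; stronger filtering
    # means more lookups. Alignment is dynamic programming over the survivors.
    filter_ops = candidates * (1.0 + 4.0 * filter_strength)
    align_ops = survivors * 100.0                   # DP cells per surviving candidate

    return filter_ops * platform["mem_access"] + align_ops * platform["dp_cell"]

cpu = {"mem_access": 1.0, "dp_cell": 10.0}      # DP relatively expensive on a CPU
accel = {"mem_access": 1.0, "dp_cell": 0.01}    # DP nearly free with dedicated hardware

for strength in (0.99, 0.5, 0.0):
    print(f"filter={strength:4.2f}  CPU cost={pipeline_cost(strength, cpu):9.0f}  "
          f"accel cost={pipeline_cost(strength, accel):9.0f}")
# With these made-up costs, heavy filtering wins on the CPU; on the accelerator
# it is better to filter less and let the cheap alignment hardware absorb the work.
```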
And do you find that doing that sort of co-design work might be more amenable to an academic
environment where you have all sorts of different kinds of domain expertise packed into one
campus?
Or are you able to continue that kind of close alignment from an industrial standpoint?
I think you could probably do it in either framework.
The Darwin project we did at Stanford, we collaborated with Gill Bejerano in the developmental biology department, who's a real expert and one of the people who are ultimately going to be the end users.
And that's relatively easy to do in an academic environment.
We do have a lot of experts around.
I think it's possible to do in industry, but it usually means forming collaborations with people outside the company.
Yeah, makes sense.
So obviously you are at NVIDIA now. And so is that kind of a collaboration, one where you seek people out and
say, here we have these amazing GPUs, what can we do with them for you? Or is there a different
approach to kind of create that synergy? A lot of different things happen. There's no one formula.
So we're building a lot of deep learning accelerators. And there,
I think all the expertise we have in the company, we have some of the world's best deep learning
researchers. And we understand the directions algorithms are going in and that the models are
going in. Because if you tune a deep learning accelerator for running ImageNet on ResNet,
you're five years behind the times.
Almost nobody cares, actually, about the vision networks anymore.
All the action is in natural language processing and recommender networks, which are fundamentally a very different problem in a lot of very serious ways.
And then we have other things where people are applying GPUs to running some problem.
And they'll come to the company saying,
gee, I'd really like to get more out of this, and we'll look at what they're doing.
And we actually have an organization that helps customers tune their programs for GPUs. And sometimes in doing that, we'll see an opportunity where we could add acceleration.
Another example is in graphics. We recently added ray tracing acceleration to our
GPUs. That's a project we started in NVIDIA Research. And that was where, here's an algorithm
that people say could never run real time. And through the combination of hardware that accelerates
things, and here the gain of the hardware is probably like 10x, it's not thousands, because
the software schemes are doing pretty well. Between that hardware and applying deep learning so that we can actually generate relatively noisy images and then denoise them,
we've been able to bring ray tracing to real time and get really photorealistic graphics running on somebody's personal computer at 60 frames per second, which is pretty amazing.
And there again, I think we had all the expertise.
We had some of the world's leading researchers in ray tracing, which at that point in time
was used for offline graphics, for, you know, like rendering a movie at, you know, one frame
per hour.
But we could take their expertise and combine it with the hardware expertise and architecture
expertise we had and use that to craft a good accelerator.
Right. So obviously, domain-specific accelerators have been successful in graphics and more recently
in machine learning and deep learning. And you talked about emerging applications such as
genomics and other places where we could find opportunities to design such domain-specific
accelerators. How do you think about legacy compute as well? Because that is a
huge part of the overall computing ecosystem. Do you see the same paradigm applicable to those
things as well? Are there additional challenges or different challenges in that domain?
Well, I'm not sure what you mean by legacy compute. I mean, all of computing
are applications. And those applications solve some problem, whether it's simulating physics or whether it's analyzing data or whether it's, you know, monitoring video feeds, which I guess is a form of analyzing data.
What legacy means to me is that, you know, somebody has a system and it works for them, and there is inertia which keeps them from making changes, and progress always requires change.
And so whatever they're doing in the legacy compute, whether it's, say they're doing
data analytics, running over large data sets, trying to glean insight about something, you
can look at what algorithms they're using there, find where the bulk of the time is
being spent, do that same co-design where you say, okay, we've got to grovel over all this data, but is there
a way of making intermediate references very local so we can avoid large global memory
bandwidth problems, and design an accelerator for it.
And it's really a question of finding applications where the effort-reward ratio is right. When you look at things like deep learning and genomics,
and to some extent, even the ray tracing, what makes it really amenable to the accelerator is
that there's sort of one core kernel. For deep learning, it's matrix multiply. For genomics,
it's dynamic programming. For ray tracing, it's traversing a bounding volume
hierarchy. And if you accelerate that one core kernel, the whole application speeds up enormously.
And so I think those applications are the ones that are really the easier ones, the low-hanging fruit for acceleration. There are other applications where you look at the code and it's several hundred thousand lines of code, and it's hard to find any one place where most of the time is being spent. That's a mess. That would be very hard to accelerate. And I think in many cases there,
you have to go back and ask, what is the fundamental problem they're solving? And do
you really need a hundred thousand lines of code? Or did that just aggregate over a long period of
time of different programmers going in and adding a few lines here and a few lines there and not taking out anything that wasn't being used that often.
And can you streamline it?
Can you actually refocus and rebuild it from the ground up?
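The "one core kernel" point above is essentially Amdahl's law. A quick sketch with hypothetical numbers:

```python
# Amdahl's-law check on the "one core kernel" observation (numbers are
# illustrative, not measurements from any real workload).

def overall_speedup(kernel_fraction, kernel_speedup):
    """Whole-application speedup when only the kernel is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# A workload that spends 99% of its time in one kernel (matrix multiply,
# dynamic programming, BVH traversal, ...) rewards a big accelerator:
print(overall_speedup(0.99, 1000))   # ~91x for the whole application

# A flat profile, say no single region over 20% of runtime, caps the win
# no matter how fast the accelerated piece becomes:
print(overall_speedup(0.20, 1000))   # ~1.25x
```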
So your career has also had a lot of, or you've spent a lot of time on interconnection networks.
And so for things like domain-specific accelerators,
there's certainly the notion of accelerating the core kernel of compute that is happening over and over again. How do you view
the balance of accelerating compute with the communication patterns in terms of building a
good accelerator? Yeah, so if you look at where the performance comes in accelerators, it largely
comes from parallelism. I mean, specialization can buy you 10 to 50 fold,
depending on what the application is. But the cases where you get thousands is because you're
operating in parallel. And once you're operating in parallel, you need to communicate data between
the processing elements, between various storage and processing elements. And what you find very
quickly is that communications
becomes a dominant problem. It's where most of the energy goes. A lot of people still don't know
the energy is in memory access, but when you look at the memory access, the fundamental operation
of reading or writing a bit cell takes almost no energy. All of the energy in memory access is
moving the bit from that bit cell to the sense amps and then from the sense amps from a subarray to some global memory
interface and ultimately to where it's being consumed. And so because it's so dominant,
you need to do a very good job of that communication. And so it's an area where
I've been working on it since the 1970s, actually. It really is nice because there's a good
theoretical framework to understand
that communication and then to optimize it in a way that you can make sure that every, you know,
femtojoule of energy you spend moving bits is being spent in a very optimal way.
And that can really make the difference between an accelerator that works well and one that
is horribly inefficient. And so do you feel like, you know, it seems the amount of data that we need to
process in sort of a single problem continues to expand? And initially, there's a lot of work on
things like on-chip networks. And then, of course, there's a lot of work in sort of
inter-system networks. Is there, in your mind, something fundamentally the same about both?
Is it sort of reapplying things that we know in one domain to another as the scope of data expands?
Or is there something fundamentally different about intra-system versus inter-system networks?
That's a really good question. So there are things that are the same and things that are different.
And what's the same is all the theory is the same. The problems are the same.
When you're constructing a network, you have topology, you have routing algorithms, you
have flow control.
All of those things are the same on-chip and off-chip.
The answers are often different.
The reason why the answers are different is the cost models are very different.
So on-chip, you're signaling with CMOS logic over on-chip wires, which are RC wires.
They're lossy and they're resistive.
You have to put repeaters in them.
And that gives you a certain cost model, an area and energy for moving bits on-chip.
And it actually is a cost model where the energy is proportional to distance.
So the longer a channel is, the more energy you consume on that channel.
Once you go off-chip, you pay a big upfront cost
for a SerDes, for an I/O driver on the chip to drive a wire off-chip, and perhaps even for an
optical engine to drive a fiber that's connected to that. But once you've paid that cost, in the
electrical case, you can go meters. In the optical case, you can go hundreds of meters or even
kilometers with no additional cost. And so
because the cost model for the channels is different, the solutions change. And also
because the channels are now very expensive, you have to pay this big upfront cost for the IO.
You can now amortize against that channel a lot of buffering, a lot of routing logic.
So you often will get more complex solutions. The fault model is
different as well. On-chip, even though errors do occur and there's fit rates for everything,
particularly storage cells and SRAMs, a lot of on-chip interconnection network design is done assuming
a basically reliable network where then you'll put some error checking on it, CRC or parity,
and have some fallback if a fault is detected, but it's a relatively rare occurrence.
The channels off-chip often are as bad as 10 to the minus 8th
or 10 to the minus 9th bit error rate.
So there, faults happen many times a second.
And so you have to have some fault recovery mechanism
that is transparent and recovers from that.
But the theory is the same, and it's just that the optimal answers differ
because you're applying different cost models to it.
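A back-of-the-envelope version of those two cost models; the constants below are rough assumptions for illustration, not measured numbers from NVIDIA or anyone else.

```python
# Sketch of the on-chip vs. off-chip energy regimes described above.

def on_chip_energy_pj_per_bit(distance_mm, pj_per_bit_per_mm=0.1):
    # Repeated RC wires: energy grows roughly linearly with distance.
    return pj_per_bit_per_mm * distance_mm

def off_chip_energy_pj_per_bit(serdes_pj_per_bit=2.0, optical_pj_per_bit=0.0):
    # The SerDes (and possibly an optical engine) cost is paid once per bit;
    # after that, electrical links go meters and fibers go much farther with
    # essentially no additional per-distance energy.
    return serdes_pj_per_bit + optical_pj_per_bit

for mm in (1, 10, 40):
    print(f"{mm:>3} mm on chip : {on_chip_energy_pj_per_bit(mm):.1f} pJ/bit")
print(f"off chip (any length within reach): {off_chip_energy_pj_per_bit():.1f} pJ/bit")
# Same theory, different cost model: the crossover is why topology and routing
# answers come out differently on-chip and off-chip.
```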
Makes sense.
One thing I didn't hear you say in all those factors was latency.
There's two sides to latency.
One is how much your application can tolerate,
and the other is how much is essential and how much is gratuitous.
And in on-chip networks, it turns out moving a bit on-chip is actually slower in terms of velocity
because those RC lines don't operate anywhere near the speed of light.
And once you go off-chip and you're on a good LC transmission line or on an optical fiber,
you're typically running at about half the speed of light,
depending on what your dielectric is in the transmission line.
And so there's nothing fundamentally higher latency about going off-chip.
You could have an on-chip network that has to cover, you know, we build really big chips that are 800 square millimeters. So you figure over 20 millimeters on an edge, you could be going 40 millimeters on chip, and that can wind up taking you many hundreds of nanoseconds, right? And if you think about it, on a single optical fiber you can go roughly half a foot per nanosecond. So in a hundred nanoseconds you can go 50 feet, and in several hundred nanoseconds you can be almost a football field length away.
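Checking that arithmetic with a small sketch (assumed, rounded numbers; half the speed of light on a fiber or good transmission line):

```python
SPEED_OF_LIGHT_M_PER_NS = 0.2998            # ~30 cm per nanosecond in vacuum
v = 0.5 * SPEED_OF_LIGHT_M_PER_NS           # ~half of c, per the discussion above
FT_PER_M = 1 / 0.3048

print(v * 1 * FT_PER_M)     # ~0.5 ft per nanosecond
print(v * 100 * FT_PER_M)   # ~49 ft in 100 ns
print(v * 600 * FT_PER_M)   # ~295 ft in 600 ns, roughly a football field (300 ft)
```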
Now, very often off-chip networks, because people can amortize more complicated routers, they do.
And they build very complicated routers, and the routers can wind up having latencies of easily 50 clock cycles.
I find that very painful sometimes.
I built many routers back in the 1980s and early 1990s where we were 10 clock cycles from bits arriving on the input pin
to bits going out on the output pin.
And some of the latency is inherent because on these big router chips now,
you have maybe 20 or 30 millimeters of traversal to happen
to actually just get the bit from the input port through some on-chip network, which forms a core of that router to the output port.
And that's essential latency.
But some of it is sort of designers being a little bit lazy with modularity, right?
They say, OK, I'm going to design a block here and put some buffers before it and buffers after it.
So I can think about that separately from other things.
Whereas if they crafted it more carefully, I think much of that delay could go away.
The other thing that eats up a lot of latency is sometimes expensive clock domain crossings.
Synchronization failure is a real problem, and many people still use what I refer to as brute force synchronizers to cross clock domains.
And those in modern technologies can easily cost five or ten clocks just to cross one clock domain.
Whereas there are known techniques, we've published many papers about them, where you can have an average cost of a half a clock to cross a clock domain.
And most people, because they don't feel enough pain about that latency, they don't feel motivated to go and apply the more sophisticated synchronizers. How do you see GPUs in the context of domain-specific accelerators?
How are they positioned, and how do you see them evolving within this paradigm?
So GPUs are a great platform. That's the way I think about them.
Because if you look at the difficulty of building an accelerator,
say you want to do dynamic programming.
Once you figure out what you actually want to do
you can probably code the Verilog for that in a couple of days, right? But to code the Verilog for the on-chip interconnection network and the on-chip memory system, and the interface for various types of off-chip memory, be it HBM or DDR, and the interface to some off-chip network, you know, that's person-years of effort, actually tens, maybe hundreds of person-years of effort.
And so what you really want is you want a platform that has all that done with lots
of bandwidth, lots of parallelism, so you're not bottlenecked by things, into which you
can plug accelerators.
So you can plug in a deep learning accelerator, you can plug in a genomics accelerator, you
can plug in a bounding volume hierarchy accelerator and not have to redesign the hard part.
You just redesign the easy part, the one little algorithm that you're trying to do that you
can do in a few days.
And so they're a great platform for plugging things in.
They're also a great platform for playing with the application before you plug it in,
right?
And so you can go ahead and get a lot of performance out of the GPU.
It's a very parallel machine.
It runs hundreds of thousands of threads simultaneously on thousands of CUDA cores, and you basically get a certain amount of performance, and then see where you have enough pain that you feel like you might want to add an instruction. Like, you know, our deep learning accelerators, which we call Tensor Cores, are actually instructions; they're HMMA and IMMA, matrix multiply-accumulate
instructions. If you have enough time being spent in one area, that then motivates you to put that
instruction in. And that one area then gets a lot faster and then everything else looks a lot worse.
Then you go and you find the next thing and push it down. And they're also a great platform for
playing around with that co-design process because they're a naturally parallel platform and they also have a local memory hierarchy. Optimizing a program for GPU
is exactly the same optimization that you want to do before you turn it into hardware. It's the
same set of constraints. Right. Do you have a vision for how you view domain-specific computing
or computing in general to be in the future, both in terms of how, I guess, computer architects and hardware architects think about designing systems,
as well as how users, programmers, and above think about computing platforms?
Like I said, I personally think of developing a domain-specific accelerator as developing a parallel program,
just a parallel program with this cost model reflecting hardware.
And in the future, I'm hoping we can develop programming systems
that largely automate that process,
where I can write in some high-level language
my description of the computation,
and separate from that, a description of how it maps.
Too many of our existing programming languages
combine the specification of the function
with the mapping of that function in time and space.
And for an accelerator, you really want to fundamentally separate those two.
We've developed a number of programming languages over the years that do that separation to target conventional hardware.
Here you would do that separation and then have some back end that feeds into CAD tools and gives you a measure of how expensive it is, how much area is it going to take, how fast is it going to run.
It's the typical thing you do when you write a computer architecture paper and you need the results section. So you get the results section and you say,
okay, it's not as good as I thought it was.
What do I change either in the algorithm or in the mapping?
You try to find the bottlenecks and do that.
And perhaps even some of the searching of the mapping space can be automated.
But it's a very large space.
Searching that is an NP-complete problem, and that's where human intuition sometimes does better than algorithms.
The mapping is sort of done at a layer above the hardware, right?
You've got the problem at a certain level of abstraction, you've got the hardware at another level, and then there's mapping, which is presumably a software layer.
What determines the hardware? Right, so say you have an algorithm like dynamic programming. You can write the specification, which says how you compute the value for every point in the H matrix. That doesn't say how you're going to map that in time and space; there are many mappings of that in time and space. The algorithm does constrain you by data dependencies, because the mapping in time has to preserve data dependencies. You can't compute something
before you compute the things that it's dependent on. But other than maintaining those data
dependencies, you have a tremendous amount of freedom. So the mapping is really saying,
for each point in that matrix, which piece of hardware computed it and when, right?
And then by doing that,
you're also determining the hardware
because you're saying how many pieces of hardware there are.
One mapping is to say there's one piece of hardware
and I walk it left to right,
or I could have one piece of hardware
and walk it in row major, top to bottom,
and I could do it in either order
or I could decide I'm going to have many pieces of hardware
and do a wavefront,
or I could have fewer pieces of hardware and do wavefronts left to right or wavefronts top to bottom.
You can come up with lots of different mappings for the one function.
So you can specify that function, specify a bunch of different mappings, and then optimize in that mapping space for some figure of merit, which is some combination of cost and performance. For that particular type of mapping, you know, where you have the algorithmic sort of specification.
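A minimal sketch of that separation, assuming a Smith-Waterman-style recurrence for the H matrix (the recurrence, scoring, and inputs here are placeholders, not anything specific from the episode): the specification states only what each cell depends on, and two different mappings, row-major and anti-diagonal wavefront, evaluate it in different orders while producing the same result.

```python
# Specification: what each cell is, stated only in terms of its dependencies.
def cell(H, i, j, match):
    if i == 0 or j == 0:
        return 0
    return max(0,
               H[i - 1][j - 1] + match(i, j),   # depends only on the three neighbors
               H[i - 1][j] - 1,
               H[i][j - 1] - 1)

def compute(order, n, m, match):
    """Mapping: 'order' decides when (and, on real hardware, where) each
    (i, j) is evaluated; any order that respects the dependencies is legal."""
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i, j in order(n, m):
        H[i][j] = cell(H, i, j, match)
    return H

# Mapping 1: one piece of hardware walking the matrix in row-major order (serial).
def row_major(n, m):
    return ((i, j) for i in range(1, n + 1) for j in range(1, m + 1))

# Mapping 2: anti-diagonal wavefronts; every cell on a wavefront is independent,
# so a row of processing elements could evaluate them in parallel.
def wavefront(n, m):
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            yield (i, d - i)

a, b = "GATTACA", "GCATGCU"                      # placeholder inputs
match = lambda i, j: 2 if a[i - 1] == b[j - 1] else -1
assert compute(row_major, len(a), len(b), match) == \
       compute(wavefront, len(a), len(b), match)   # same function, different mappings
```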
So I guess what I was getting at is when you think about like general purpose compute,
you essentially had, like you were saying before, you know, it was all bundled into one.
You had the program, which was both in some ways the algorithm itself.
And then once you got to the hardware,
it was mostly the control plane of the hardware
that was sort of deciding how to do things.
And now there's this other layer
where you have a lot of highly parallel sort of compute elements.
And now you've got this other layer
that's sort of mapping a high-level specification to hardware
and sort of taking some of the responsibility
out from the hardware itself
into this sort of intermediate layer.
Is that the way that you see it going?
When you think of conventional computing, where this really comes down to, I think,
is people thinking too serially.
And they use programming notations where the common way of expressing kind of the inner
kernel of something is a loop nest.
And a loop nest fundamentally specifies not just what the computation is, which is sort of down in the middle of the loop nest, where you actually do some work. But by ordering the dimensions, it's telling you
what order you're doing it in. And, you know, sometimes compilers try to reach in there and
rip it up and tile it and stuff, and that's actually doing the mapping, right? And so if you think about conventional programming notations, you're somehow specifying an ordering and the function,
and then your compiler sometimes tries to come back and undo your ordering. But if you inadvertently
introduce a few dependencies or control constraints or something, it's impossible to do that because
the two are tied together, and it's hard to untie it. And just like people spend a lot of time pounding their heads against walls trying to do parallelizing compilers, the answer is: don't start serial. If you start with just a specification of here's what I want to compute, you know, H(i,j) is a function of H(i-1,j), H(i,j-1), and H(i-1,j-1), boom, that's one line of code, right? It's a specification of function without ordering, right? And I want to do
it as parallel as I can subject to the data dependencies. And then it's a separate problem
to decide how to map that. Let's not combine the two together. By combining the two together,
you then create this huge compilation problem of uncombining them. Just like by expressing things serially, you can create this huge
problem of parallelization. Here, I've expressed ultimate parallelism, right? I'm just saying,
do this computation. I'm not telling you that one thing has to follow another. I do have data
dependency constraints, and I'm not specifying any ordering. You can do whatever ordering you
want. Then I can search over many orderings, many mappings to different places in space.
And by the way, the exact placement in space matters as well because of locality, right?
The cost of accessing something.
Part of this is factoring the compute where the function happens,
and part of it is factoring the storage where the input data and the output data are stored
and how they're staged and moved, because that's where all the energy is with these interconnection networks. And so by searching the space, by putting things in the right place in space, I get enormous amounts of memory bandwidth from tiny local memories sitting right next to things, which is part of what you need to make an accelerator work. If every reference is to, say, a random address in a global memory, you're hosed. You're not going to do any better than a conventional processor could, because they have a memory system with a certain amount of bandwidth, and things like HBM and DDR are all accidents of history; they're nowhere near an optimal interface to get to the memories. You're limited to whatever that bandwidth is. But
if you refactor the algorithm so that you're accessing things only very locally to you,
then you can go very, very fast. Right. As you mentioned, this mapping problem,
it's a very large mapping space, something that seems very complicated.
It sounds like something that's very ripe for research, both in terms of the techniques that we develop, the abstractions and the frameworks that we have yet to develop.
So that might be a good point to sort of talk about, you know, how you view the interaction between research and pathways to products. I'm sure you've seen several iterations of ideas coming in from research, both in an
industrial setting and academic setting, and ultimately making its way into a product.
Give us your thoughts about how you think about pathways from research to product, collaborations,
and so on.
Yeah, so I think it's a really good one.
So as the leader of an industrial research lab, one of my primary functions is to make
sure that things that get done in NVIDIA research result in improvements to NVIDIA products, not just publications at leading conferences.
Because I think I'd have a real hard time justifying my budget to my boss if all that came out of NVIDIA research was lots of fame for the researchers and maybe good PR for the company.
And so I think that to have effective technology transfer, you really
have to make that the goal from day one. And when you start a project in NVIDIA research, I ask the individual researcher to identify two people, and sometimes they're the same person, but one of
them is the champion for the technology, and the other is the receiver of the technology. Very often
they're the same person, right? The champion is also going to receive the technology and productize it.
And on day one, right when they're still thinking about the project, they haven't written a single line of code or a bit of Verilog or designed any circuits, I ask them to set up a meeting with this person and talk to them about the project and have them involved in the project the whole way through.
This solves a bunch of problems.
Probably the most important one is very often the researcher doesn't understand the real
constraints.
They have an academic view of the problem.
And when they talk to the person who has to receive it, they say, well, that's nice.
But the real problem is this.
And if you can't do that, then it doesn't matter if you do this thing that you're going
to do really well because we can't use it.
And so getting those unknown constraints in early is absolutely essential.
Otherwise you go a long way down the road and the project is kind of misguided.
Also, by having regular meetings with this person, they become invested in the project.
They don't have this sort of organ rejection reaction to it if they see it kind of as a finished project at the end and they don't know what's going on and it just looks very foreign to them.
And that's sort of step one for that transfer.
And then the other is that you have to be very sensitive to sort of how the product
development people work.
They have a very low tolerance for risk.
And it's understandable, right?
I mean, many generations ago at NVIDIA, we had a GPU called Fermi.
And actually, this wasn't even anything that should have been risky, but there was a circuit error in it that delayed that product shipping for probably close to six months.
And financially, that was a huge blow.
I mean, we're talking about many hundreds of millions of dollars.
And so in research, we have this wonderful advantage that we can make mistakes.
We can have projects fail. But for the people who have to ship the next GPU, we develop one GPU at a time. We don't have a backup GPU. If that GPU cannot ship at the time that it's planned, really, really bad things happen. So these guys have a very low tolerance
for risk. So you have to understand what sort of level of quality and technical maturity you have to mature an idea to, to make it palatable to them.
And so, you know, some great examples are like the RT cores that shipped in Turing.
We started that project as a way of sort of accelerating ray tracing, but it took a lot of interchange with the product development group to turn what we thought was a really great architecture into something that was palatable to them, that it was low enough risk. And the other thing is estimating cost well, right? We
had come up with various cost estimates of how expensive it will be. And invariably, our estimates
were low, right? When they actually put in all of the built-in self-tests that's needed around the
memory arrays and various other pragmatic things you have to do in the real world, the thing grew quite a bit.
And so the costs went up. And so you have to understand that's part of the technical maturity
is understanding the real cost of a piece of hardware. Probably the most important thing, really: I've seen a bunch of research projects very successfully transfer into product, and I've seen a bunch of them fall short. They could fall short for two reasons.
One fundamental reason is there was some mismatch. It actually wasn't going to work in the product.
If that's the case, then the only thing I regret is that we didn't kill the project sooner. That's
actually one of my roles in research. I go around looking at what people do and ask hard questions
about, well, what gain are you going to get?
Who's going to receive this?
What's the benefit going to be?
And try to kill off projects that it's really clear aren't going to make that jump.
Because if we kill them off sooner, then we can put all those resources on projects that
are going to make it.
But the other reason why it fails sometimes is that the person stops working before it's
really done.
And there's a difference between having something done enough,
you can publish the ISCA paper on it, and having something done enough that somebody's going to
bet the next generation GPU on it. That's actually a really, really big difference, right? I mean,
you're maybe a quarter of the way there, maybe even a tenth of the way there when you publish
the ISCA paper. And this is a question of getting people's goals aligned, right? Because if somebody's
goal is, I want to publish lots of papers because I want to become famous, they're never going to
transfer anything to product, right? Because they'll get it just far enough to get that paper
and then they'll go and move on to the next thing and get another paper and they'll become very
famous, but they won't have done any good for the company. We really need to sort of try to make sure that people's goals are set so that they're motivated to do that work, work that you don't get a lot of professional recognition for in the community, except within the company: the maturation and risk reduction needed to carry it all the way to the point where the product guys say, yeah,
we understand what the real costs are. We're absolutely sure it's going to work. You've
made sure it's testable and all of these other things, and we can drop it in.
Yeah, so that's sort of a philosophy there. If I go back to the academic world, I think it's harder.
We had a bunch of technology transfer successes in the academic world.
We worked with Cray on every generation of interconnection network they developed from the T3D,
which first shipped in '92, and which I started working on in '89, through the network in Cascade, the original Dragonfly network.
But even there, it required a lot of relationship.
And again, it's kind of understanding who the champion, who the receiver of the technology was.
It was a set of Steves over the years, Steve Nelson at first and then Steve Oberlin and then Steve Scott at various stages, and we had a very tight relationship with them.
But even then, we'd develop things in the academic world, and it would be a decade before we'd see a product.
And it was only because we had a very good relationship that it actually made it through those hurdles and eventually shipped in real machines.
And part of that relationship was us understanding kind of the problems as the field evolved. A great example there was making the jump from low-radix to high-radix networks. In some sense, I wrote a bunch of papers in the late 80s and early 90s telling people that they should stop building these high-dimensional networks like binary n-cubes and start building low-dimensional torus networks. And then I wrote a bunch of papers starting in the mid-2000s explaining that they should stop building all these torus and mesh networks and start building high-radix networks.
But it's because the technology changed.
Again, the theory remained the same, but with different cost models, you come up with different answers.
But it was having that great relationship with industry where we were able to sort of understand the real cost model by working with the guys at Cray at that point in time.
So how do you know when it's time to revisit something like that? Because I think one of the
interesting things about our field is that the technology, I think I described it once in grad
school, you know, the sands on which you were standing are constantly shifting. So the answer
is never always the same answer, because the sands shift. But it's a little bit of an art to know when the sand has shifted enough to re-look at something
and then potentially fight the conventional wisdom that has been established about it. So
when do you know that it's time to look again? I always re-evaluate. I don't assume any, I
reject all conventional wisdom. I try to sort of re-derive everything
from first principles every time I do a project.
And that way you're never doing it too slowly.
Right.
This might be a good point
to sort of wind the clock back a little bit.
And you've had a really long career.
Maybe you can tell our audience,
how did you get interested in computer architecture and, you know, how you got to NVIDIA eventually and how you thought about
like the various inflection points over the course of your career. Okay. So I don't know exactly where
you want me to start, but I never finished high school. So back in the 1970s, I dropped out of
high school and worked repairing cars and pumping gas for a while. And sometime during
that period, I discovered microprocessors and actually wound up getting a job as, I guess the title was electronics technician, but it was basically doing microprocessor system development.
And it was kind of fun, but I realized that I probably should go back to school and get a degree.
So I wrote a very persuasive letter.
I don't know if you could do this today.
People are too bureaucratic today.
I got admitted to Virginia Tech as an undergrad in 1977 without a high school diploma.
And in three years, completed a bachelor's degree in EE and then took a job at Bell Labs
as a microprocessor designer designing a product we called the BELLMAC-32. I think it was officially offered as the Western Electric 32100.
And it was a great initial experience.
I didn't realize how lucky I was to be at Bell Labs.
I figured, oh, all jobs must be like this.
But it was a great place.
There were really smart people all around.
And they were always challenging me with thoughts and ideas.
They also paid for me to go to grad school.
I worked there for one summer, and then they sent me to Stanford to get a master's degree in double E. And I actually
came very close to staying in the Bay Area. I probably should have in hindsight, but I felt
some loyalty, and they'd paid all this money to send me to get my master's. I needed to go back
to work for them. But being in Silicon Valley at that point in time was just a real hotbed of
interesting things going on in the industry. But I went back to Bell Labs and worked there for a while. And ultimately,
I decided I should go get a PhD. This was about the time that Carver Mead had just written his
book on VLSI. And so I thought that was really cool. And so I decided I'm going to go to Caltech
and work with Carver Mead. So I went to Caltech, but instead of working with Carver Mead, I actually initially got aligned with Randy Bryant as my first PhD supervisor.
And he had developed a simulator called Mossim.
And so I like to build hardware.
And so what was probably my first accelerator was the MOSSIM Simulation Engine.
Now, because I had gotten my master's degree at Stanford, Caltech didn't recognize it because it did not have a thesis.
And so I wrote a master's thesis equivalent on that project.
And of course, they wouldn't give me another master's degree, because you already have one;
we won't give you a degree that you already have.
But I basically had to do that.
Then about that point in time, Randy came to me one day and he said,
well, I've decided to go take a faculty position at CMU.
You're coming with me, right?
And I go, Pittsburgh?
I don't think so.
And so at that point in time, I started shopping around for another thesis advisor
and another project and was very fortunate to find Chuck Seitz,
who's just, you know, a brilliant person, a real pioneer of parallel computing, and wound up doing my PhD thesis with Chuck. And actually, the topic of my thesis was mostly about programming systems. I developed a language called Concurrent Smalltalk, and showed how you could build what I called concurrent data structures, where the synchronization and parallelism was built into the data structure, most applications being developed around data structures.
You could then easily build parallel applications by just plugging together the sort of standard template library of data structures.
But what everybody noticed about my thesis was this little part at the end where I explained how to actually build the machine to run these.
And in there, I actually had developed much of the theory that's used now by interconnection networks.
So the whole notation for deadlock analysis, virtual channels, wormhole routing,
all of that was actually developed in one chapter of that thesis.
That's the only part anybody remembers.
So anyway, from there, what's interesting is when I got my PhD at Caltech, there was nothing
more than I wanted than to be a professor at Caltech.
And the nicest thing anybody's actually probably done for me in terms of selfless act, um,
in my benefit rather than theirs was the administration at Caltech writing me the nicest
rejection letter I ever got and, uh, telling me to choose between the offers that I'd gotten
from Stanford, Berkeley, MIT, and CMU.
And that was great because I think my professional development would not have
been nearly as good had I stayed at Caltech. I probably would have had more fun, but that's a
different matter. But anyway, I went to MIT, which was interesting because it was a place with a very
different way of thinking about things. And so I popped down in the middle of the MIT AI lab.
One of my great mentors at the time was Patrick Winston.
I wound up having to defend these ideas that everybody at Caltech just sort of took as obvious.
Well, locality is important.
You need to worry about area.
And thinking of things from sort of a VLSI-centric point of view,
where the majority of people there were sort of LISP programmers,
and they thought of everything in terms of, you know, what you could define as a
lambda. And it was just interesting sort of intersecting with that culture. And I think
I developed a lot in a very short period of time. And also I had the opportunity to work with a lot
of really brilliant graduate students because MIT, you know, like Stanford is this place where
really great students just show up on your doorstep. You don't have to do hardly any work.
And by the way, I think one big advantage of being in the academic world
is just the opportunity to work with amazing students.
And I had the opportunity
to build a bunch of interesting parallel machines.
There's the J machine and the M machine
that pioneered a lot of techniques
that are found in all parallel machines today.
And then in 1995, I wanted to do computer graphics.
So I went on a sabbatical to the University of North Carolina to work with
Henry Fuchs and John Poulton on graphics and graphics hardware.
And that sabbatical did two things for me.
One is it made me realize that I could move, right?
I had a family and I felt kind of very settled in, in the Boston area,
but we picked the whole family up.
We went down to Chapel Hill and it was great. I said, okay, I can move.
The other is, while I was down there, you know, Fred Brooks and a bunch of people started wining and dining me and offered me endowed chairs and stuff like that. I said, oh, maybe. A, I can move, and B, maybe people actually want me.
It had not occurred to me. Then I said, okay, well, if I can move and people want me, where would I really like to be? It didn't take me very long after having lived through 11 Boston winters to decide, you know, I need to move back to the West Coast. And so I tell everybody I moved back to Stanford largely for Silicon Valley. In fact, it shifted before I did.
I was a lagging indicator.
And Stanford was actually intellectually a much better place to be than MIT.
MIT is a great place, but Stanford was a step up.
And for me, it was just a great career move where I got an improvement in quality of life.
And I think the intellectual environment around me was much more in tune to the industry.
And so that was in 97 when I made that move.
And so we did some great, I sort of got back into interconnection networks.
We did some great work working with people like Li-Shiuan Peh and Brian Towles.
We wrote the book on interconnection networks.
And we did the Imagine project and the Merrimac project for stream processing, which really were the forerunners of GPU computing.
And then I got corralled into being a department chair. And I'm not quite sure
how I ever agreed to this, but it basically was a very large fraction of my time was spent on
non-technical things, like trying to keep the CS department at Stanford from going bankrupt.
I inherited the department, and I found that we had a million-dollar-a-year deficit and there was $500,000 left. So I had like six months to sort of patch that hole. I'm always still
curious to this day as to what would have happened. Because Stanford has lots of money,
they would have bailed us out, but it probably would have been unpleasant for the department
had I actually run a deficit. You're not supposed to do that. But anyway, I wound up sort of being department chair
while still running stream processing projects
and also a project called Elm for efficient low-power microprocessors.
And around that time, I started consulting for NVIDIA.
I think it was probably around 2003.
My longtime friend, John Nickolls, said,
gee, the stream processing stuff is what we need to get into our GPUs.
And so I worked with John and Eric Lindholm and a bunch of the architects on getting the GPU compute features into what we called NV50 and ultimately shipped as the G80 to sort of take our stream processing stuff and move it across.
And in this way, I developed a really good relationship with David Kirk,
who was the CTO of NVIDIA, and Jensen Huang, the CEO.
So when my term as department chair was coming up at Stanford,
I was trying to figure out what to do next.
I figured I've got to get out of this place a little bit
because everybody comes to me with their problems.
And even if I'm not officially still department chair,
if they can't get what they want from the department chair, they're going to come to me next.
So I'd actually set up a sabbatical.
I was going to go to UC Santa Cruz, and I was going to work with David Haussler on genomics
because that was something I was very interested in even at that time.
And then I was having dinner with David Kirk at some point, and he said, well, why don't
you come to NVIDIA?
And I said, well, why would I want to do that?
And so he and Jensen started working on me about coming to NVIDIA and sort of starting a research lab that inherits some research that was already going on on ray tracing.
And after a while, it actually started sounding like the right thing. So in 2009, I made the
jump from the academic world to industry. And to me, the real motivation for doing that was to
maximize my impact. And I've always sort of viewed my success and measured my output by what is the impact I'm having on the world.
As a professor, a lot of that impact is with people, right?
You produce graduate students who go off, and you realize that they probably would have done really well anyway because they're really smart people.
But you hope that your mentoring over the years there with your students had some positive impact on that.
And I used to really enjoy teaching undergraduate classes, especially to freshmen, because you figure you're having a lot of impact on those people.
But then in terms of research output, it was very hard to sort of get projects to impact industrial practice.
And here was an opportunity perfectly matched with my set of research interests where I could hire large numbers of people. It was also getting harder and harder to do systems projects in the academic world. The money was drying up. DARPA was no
longer funding academics. NSF was funding things in too small a chunk to really do real systems work.
So the combination of being able to leverage enough resources to make real progress,
having immediate impact, it became very compelling.
And so I did it.
And it's been a blast.
I mean, it'll be 12 years in January.
And I feel like I've had a lot of impact.
I've built a great organization.
NVIDIA Research spans from circuits and architecture and VLSI to graphics, robotics and AI.
And, you know, I can look at every generation of GPU that shipped, and key features are things that we kicked off in research, and it's had tremendous impact. So it's been a lot of fun. What I miss are the students. I really love working with students and I love teaching. That's a big gap that I miss. Although it's probably much tougher now, and probably a lot less enjoyable, with everybody doing it over Zoom. Yeah. And I still keep in touch.
I had a Zoom call with a former student of mine who, you know, I used to teach this class at Stanford called Green Electronics, which is the nuts and bolts of sustainable energy systems.
And this guy liked my class enough that he dropped out of Stanford to start a company on his class project.
He actually came back and finished his degree.
I'll give him credit for that.
But he now has another company doing another green electronics thing. And I have regular calls
with him. I just give him advice and see how he's doing. So I still have some of that interaction
with students, but I'd like to be kind of mentoring that next generation of students.
And I'm missing that right now.
I see, I see. I feel like there's been this exodus of professors from academia to industry lately. I don't know if it's more than it used to be, but it feels like there's...
Oh, it's way more than it used to be.
I'm very worried about it.
I guess I'm part of the problem.
But, you know, people react to incentives, and the incentives are all set up, you know, both the selfish incentives and the selfless incentives are all set up to suck people into the industrial world. I mean, the salaries are at least double. You don't have to spend a large fraction of your time kind of begging for money. You don't have to deal with the bureaucracy, which increased monotonically the whole time I was an academic.
And so I think that they've created a whole set of incentives that, except for the fun of working with students, pulls you in the other direction.
And I worry about that a lot because I think a lot of the success of the United States in being a technological leader has hinged on, since the 1960s, having the world's best technological universities and the best people in those
positions. And I worry if the best people go to industry, they'll develop great products,
but they won't be educating that next generation of people and we'll lose that edge. So the
incentives have gotten turned upside down. And I think it's bad for the country and probably bad
for the world that the incentives are not getting the best and brightest people to go
be professors.
Because I guess I'm enough of an academic at heart that I think that's where the best
and brightest people ought to be.
Yeah.
Do you think of that as another system that could be re-architected?
Yeah.
No, we need to describe the function and the mapping, and then we can probably fix that.
Yeah.
Yeah.
I always find that engineering human systems
is so much harder than engineering computer systems
because people do whatever they wanna do,
despite the fact that you sort of tried to engineer it
for them to do something different.
They can't be reliably programmed.
Right.
You mentioned that this is a very exciting time
to be a computer architect.
What's on the horizon for you?
What is exciting for you in the near future or in the long-term future?
I'm really excited about finding out how to generalize domain-specific accelerators
and lower the barrier to entry so that people can do them for a lot more applications.
And part of that is making it easier to do the design.
Another part is actually making it easier to do the tooling and having a good platform like a GPU as part of that,
but it's not the whole story. I'm very excited about a lot of applications of interconnection
networks within those accelerators, within the platform on the GPU chip, between GPUs,
you know, between clusters of GPUs in the data center. And I think there's a real
migration of a certain part of that communication to optics these days that, again, changes the
cost equation. Whenever you change the cost equation, the solutions change. There'll be a
lot of interesting new technologies developed as a result of that. And I'm interested in the whole design process.
I think it's way too hard
to design computer hardware these days,
and there's got to be ways of making that simpler,
and again, that lowers the barrier to people doing it.
I think it's one thing that's required for accelerators,
but even for general-purpose machines,
it shouldn't be a many-thousand-person-year project to do,
maybe many hundred,
and if we can get an order of magnitude
increase in productivity, that would be a great thing.
Well, there you have it, folks.
Thank you so much, Dr. Bill Dally,
for sharing your thoughts and perspectives with us.
It's been an absolute delight speaking with you.
Yes, thank you so much, Bill.
Oh, my pleasure.
And I hope you guys have a great day.
And to our listeners,
thank you for being with us on
the Computer
Architecture Podcast.
Till next time, it's
goodbye from us.