Computer Architecture Podcast - Ep 2: Domain-specific Accelerators with Dr. Bill Dally, Nvidia

Episode Date: October 21, 2020

Dr. Bill Dally is the Chief Scientist and Senior Vice President of Research at NVIDIA, and a Professor of Computer Science at Stanford University. Dr. Dally has had a storied career with contributions to parallel computer architectures, interconnection networks, GPUs, accelerators and more. He has a history of designing innovative and experimental computing systems such as the MARS accelerator, the MOSSIM simulation engine, the J-Machine and M-Machine, to name a few. He talks to us about computing innovation in the post-Moore era, domain-specific accelerators, and technology transfer in computing.

Transcript
Starting point is 00:00:00 Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to cutting-edge work in computer architecture and the remarkable people behind it. We are your hosts. I'm Suvinay Subramanian. And I'm Lisa Hsu. Before we begin, we just want to acknowledge that we are in truly unprecedented times with this COVID-19 pandemic, and we want to wish all of our listeners health and safety. We hope you're
Starting point is 00:00:25 safe and keeping well. And hopefully, listening to this podcast will take your mind off things for at least a brief moment. And we're very, very excited to have with us today Dr. Bill Dally, who is the Chief Scientist and Senior Vice President of Research at NVIDIA, as well as a Professor of Computer Science at Stanford University. Dr. Dally has had a very storied career with contributions to parallel computer architectures, interconnection networks, GPUs, accelerators, and more. He has a history of designing innovative and experimental computing systems such as the MARS accelerator, the MOSSIM simulation engine, the J-Machine, and the M-Machine, to name a few. He is here to talk with us about computing innovation in the post-Moore era,
Starting point is 00:01:07 domain-specific accelerators, and a number of other topics. A quick disclaimer that all views shared on this show are the opinions of individuals and do not reflect the views of the organizations they work for. Dr. Bill Daly, welcome to the podcast. We're so excited to have you here today. I'm really happy to be here. Thank you for inviting me. As we mentioned in the intro, you've had a long and storied career across a lot of different topics. But especially these days, what is getting you up in the morning? I'm just really excited about a lot of things. And probably the one right now is probably building accelerators for certain demanding problems. So you're talking about
Starting point is 00:01:55 domain-specific accelerators. So for several decades, Moore's law has been a major driver of computing performance improvements. But as we head towards the sunset years of Moore's law, what do you think are the promising paradigms to sort of keep computing innovation going in the next years and next decades coming up ahead? Well, I think the big one is domain-specific accelerators. And alongside that parallel computing, and they're really very closely related, it boils down to the fact that if serial computers, single thread processors aren't getting any faster, and if you want to add value to applications, you need more performance, then you need to have lots of threads in parallel.
Starting point is 00:02:34 And to do that efficiently, you really need to specialize for a certain domain. And the gains to be had there are often into the thousands of fold compared to a conventional processor. Right. So what goes into designing a good domain-specific accelerator? Well, in my mind, designing a domain-specific accelerator is really a programming exercise. It's understanding the application and reprogramming it in a way that is hardware-friendly so that you get rid of serial bottlenecks. You have to make the application very parallel.
Starting point is 00:03:08 You get rid of memory bottlenecks. You minimize accesses to large global memory structures, and you try to transform it so that storage can be local with small memory footprint per element. And it's really that reprogramming that is the challenge of designing a domain-specific accelerator. Once you've reformulated the problem into a hardware-friendly form, crunching out the hardware is a largely mechanical activity. I think it could actually be automated in the future. So it sounds like you're saying that a lot of this is co-designing the algorithm along with the hardware, along with a good cost model for
Starting point is 00:03:44 what the hardware has to provide. Right. It's absolutely, you're redesigning the algorithm in a form that's hardware friendly. That is the real challenge of accelerator design. It's very rarely the case that the existing algorithm without modification is very suitable for acceleration. Yeah. So that's a very interesting thing, because I feel like that's a little bit of a chicken and an egg issue, particularly in the architecture community, right? Because the typical MO for architects, if you go to all the major conferences, is someone finds a topic or a domain area and analyzes the heck out of an algorithm and then looks for little exploits that you could do
Starting point is 00:04:22 in the way the algorithm works and then builds hardware in order to accelerate that whereas in many cases a reformulation of the algorithm itself from fundamentally is the more appropriate thing to do how do you how have you been able to sort of get out of that loop and how would you encourage people in our community to also get out of that well i think it's you just have to let yourself do it. I mean, so when we started our Darwin project that I did with my former student, Yatish Tarakia, we actually went exactly down that loop. And what we found is if we took the,
Starting point is 00:04:56 this was for Genome Assembly, for de novo Genome Assembly, if we took the existing best algorithms and through everything we knew about hardware to accelerate them, we might get 4x. And I guess we could have stopped there, but we weren't very happy with 4x. So we asked ourselves, why is it limited to that? And we found that it was because that particular algorithm had been very carefully tuned to get optimum performance on a CPU where memory accesses were relatively cheap and doing the dynamic programming, which is the core of the algorithm, was very expensive.
Starting point is 00:05:29 So they had basically set the balance between what's called the filtering stage, where you take lots of candidates and filter them out, which takes lots of memory accesses, and the alignment stage, which is all dynamic programming. And they made it all filtering because that was cheaper on the CPU. And the first thing we did that kind of transformed the game was to say, okay, we don't need to stick to that, those thresholds of how much filtering to do. We could do less filtering because it's expensive.
Starting point is 00:06:00 That means there's more work left to be done in the alignment stage, but that's really cheap. And doing that turned it from being 4X better to being 15,000 times better. And so I think architects just need to understand the application well enough that they feel comfortable changing it. So they can change the details of the home implemented, but understand that they're going to deliver the answers that whoever's using that application, in this case, you know, it's biologists doing, you know, genetic research, want and with equal or higher fidelity. And so in that case, it was a measure called sensitivity. We had to basically get equal or higher sensitivity than the existing algorithms.
Starting point is 00:06:40 And then all the users were happy. And do you find that doing that sort of co-design work might be more amenable to an academic environment where you have all sorts of different kinds of domain expertise packed into one campus? Or are you able to continue that kind of close alignment from an industrial standpoint? I think you could probably do it in either framework. The Darwin project we did at Stanford, we collaborated with Gil Bejarano in the developmental biology department, who's a real expert on people who are ultimately going to be the end users. And that's relatively easy to do in an academic environment.
Starting point is 00:07:29 We do have a lot of experts around. I think it's possible to do in industry, but it usually means forming collaborations with people outside the company. Yeah, makes sense. So obviously you are at NVIDIA now. And so is that kind of a collaboration, one where you seek people out and say, here we have these amazing GPUs, what can we do with them for you? Or is there a different approach to kind of create that synergy? A lot of different things happen. There's no one formula. So we're building a lot of deep learning accelerators. And there, I think all the expertise we have in the company, we have some of the world's best deep learning
Starting point is 00:08:09 researchers. And we understand the directions algorithms are going in and that the models are going in. Because if you tune a deep learning accelerator for running ImageNet on ResNet, you're five years behind the times. Almost nobody cares, actually, about the vision networks anymore. It's all the actions in the natural language processing and recommender networks, which are fundamentally a very different problem in a lot of very serious ways. And then we have other things where people are applying GPUs to running some problem. And they'll come to the company saying, gee, I'd really like to get more out of this, and we'll look at what they're doing.
Starting point is 00:08:55 And sometimes we actually have an organization that actually helps customers tune their programs for GPUs. And sometimes in doing that, we'll see an opportunity where we could add acceleration. Another example is in graphics. We recently added ray tracing acceleration to our GPUs. That's a project we started in NVIDIA Research. And that was where, here's an algorithm that people say could never run real time. And through the combination of hardware that accelerates things, and here the gain of the hardware is probably like 10x, it's not thousands, because the software schemes are doing pretty well. Between that hardware and applying deep learning so that we can actually generate relatively noisy images and then denoise them, we've been able to bring ray tracing to real time and get really photorealistic graphics running on somebody's personal computer at 60 frames per second, which is pretty amazing. And there again, I think we had all the expertise.
Starting point is 00:09:46 We had some of the world's leading researchers in ray tracing, which at that point in time was used for offline graphics, for, you know, like rendering a movie at, you know, one frame per hour. But we could take their expertise and combine it with the hardware expertise and architecture expertise we had and use that to craft a good accelerator. Right. So obviously, domain-specific accelerators have been successful in graphics and more recently in machine learning and deep learning. And you talked about emerging applications such as genomics and other places where we could find opportunities to design such domain-specific
Starting point is 00:10:20 accelerators. How do you think about legacy compute as well? Because that is a huge part of the overall computing ecosystem. Do you see the same paradigm applicable to those things as well? Are there additional challenges or different challenges in that domain? Well, I'm not sure what you mean by legacy compute. I mean, all of computing are applications. And those applications solve some problem, whether it's simulating physics or whether it's analyzing data or whether it's, you know, monitoring video feeds, which I guess is a form of analyzing data. What legacy means to me is that, you know, somebody has a system and it works for them and there is inertia, which, you know, keeps them from, you know, they don't want to make any changes and progress always requires change. And so whatever they're doing in the legacy compute, whether it's, say they're doing
Starting point is 00:11:12 data analytics, running over large data sets, trying to glean insight about something, you can look at what algorithms they're using there, find where the bulk of the time is being spent, do that same co-design where you say, okay, we've got to grovel over all this data, but is there a way of making intermediate references very local so we can avoid large global memory bandwidth problems, and design an accelerator for it. And it's really a question of finding applications where the effort-reward ratio is right. When you look at things like deep learning and genomics, and to some extent, even the ray tracing, what makes it really amenable to the accelerator is that there's sort of one core kernel. For deep learning, it's matrix multiply. For genomics,
Starting point is 00:12:00 it's dynamic programming. For ray tracing, it's traversing a bounding volume hierarchy. And if you accelerate that one core kernel, the whole application speeds up enormously. And so I think those applications are the ones that are really, they're the easier, they're the low-hanging fruit for acceleration. There are other applications where you look at the code and it's several hundred thousand lines of code and it's hard to tell any one place most of the time is being spent. That's a mess. That would be very hard to accelerate. And I think in many cases there, you have to go back and ask, what is the fundamental problem they're solving? And do
Starting point is 00:12:36 you really need a hundred thousand lines of code? Or did that just aggregate over a long period of time of different programmers going in and adding a few lines here and a few lines there and not taking out anything that wasn't being used that often. And can you streamline it? Can you actually refocus and rebuild it from the ground up? So your career has also had a lot of, or you've spent a lot of time on interconnection networks. And so for things like domain-specific accelerators, there's certainly the notion of accelerating the core kernel of compute that is happening over and over again. How do you view the balance of accelerating compute with the communication patterns in terms of building a
Starting point is 00:13:16 good accelerator? Yeah, so if you look at where the performance comes in accelerators, it largely comes from parallelism. I mean, specialization can buy you 10 to 50 fold, depending on what the application is. But the cases where you get thousands is because you're operating in parallel. And once you're operating in parallel, you need to communicate data between the processing elements, between various storage and processing elements. And what you find very quickly is that communications becomes a dominant problem. It's where most of the energy goes. A lot of people still don't know the energy is in memory access, but when you look at the memory access, the fundamental operation
Starting point is 00:13:53 of reading or writing a bit cell takes almost no energy. All of the energy and memory access is moving the bit from that bit cell to the sense amps and then from the sense amps from a subarray to some global memory interface and ultimately to where it's being consumed. And so because it's so dominant, you need to do a very good job of that communication. And so it's an area where I've been working on it since the 1970s, actually. It really is nice because there's a good theoretical framework to understand that communication and then to optimize it in a way that you can make sure that every, you know, femtajoule of energy you spend moving bits is being spent in a very optimal way.
Starting point is 00:14:35 And that can really make the difference between an accelerator that works well and one that is horribly inefficient. And so do you feel like, you know, it seems the amount of data that we need to process in sort of a single problem continues to expand? And initially, there's a lot of work on things like on-chip networks. And then, of course, there's a lot of work in sort of inter-system networks. Is there, in your mind, something fundamentally the same about both? Is it sort of reapplying things that we know in one domain to another as the scope of data expands? Or is there something fundamentally different about the about like intrasystem versus intrasystem networks? That's a really good question. So there are things that are the same, the things that are different.
Starting point is 00:15:19 And what's the same is all the theory is the same. The problems are the same. When you're constructing a network, you have topology, you have routing algorithms, you have flow control. All of those things are the same on-chip and off-chip. The answers are often different. The reason why the answers are different is the cost models are very different. So on-chip, you're signaling with CMOS logic over on-chip wires, which are RC wires. They're lossy and they're resistive.
Starting point is 00:15:46 You have to put repeaters in them. And that gives you a certain cost model, an area and energy for moving bits on-chip. And it actually is a cost model where the energy is proportional to distance. So the longer a channel is, the more energy you consume on that channel. Once you go off-chip, you pay a big upfront cost for a CERTIs, for an I.O. driver on the chip to drive a wire off-chip and perhaps even for an optical engine to drive a fiber that's connected to that. But once you've paid that cost, in the electrical case, you can go meters. In the optical case, you can go hundreds of meters or even
Starting point is 00:16:22 kilometers with no additional cost. And so because the cost model for the channels is different, the solutions change. And also because the channels are now very expensive, you have to pay this big upfront cost for the IO. You can now amortize against that channel a lot of buffering, a lot of routing logic. So you often will get a more complex solutions. The fault model is different as well. On-chip, even though errors do occur and there's fit rates for everything, particularly storage cells and SRAMs, a lot of on-chip interconnection network done is assuming a basically reliable network where then you'll put some error checking on it, CRC or parity,
Starting point is 00:16:59 and have some fallback if a fault is detected, but it's a relatively rare occurrence. The channels off-chip often are as bad as 10 to the minus 8th or 10 to the minus 9th bit error rate. So there, faults happen on a really many times a second. And so you have to have some fault recovery mechanism that is transparent and recovers from that. But the theory is the same, and it's just that the optimal answers differ because you're applying different cost models to it.
Starting point is 00:17:30 Makes sense. One thing I didn't hear you say in all those factors was latency. There's two sides to latency. One is how much your application can tolerate, and the other is how much is essential and how much is gratuitous. And in on-chip networks, it turns out moving a bit on-chip is actually slower in terms of velocity because those RC lines don't operate anywhere near the speed of light. And once you go off-chip and you're on a good LC transmission line or on an optical fiber,
Starting point is 00:17:58 you're typically running at about half the speed of light, depending on what your dielectric is in the transmission line. And so there's nothing fundamentally higher latency about going off-chip. You could have an on-chip network that, you know, to cover, you know, we build really big chips that are, you know, 800 square millimeters. And so, you know, you figure over 20 millimeters on an edge, you could be going 40 millimeters on chip, and that can wind up taking you, you know, many hundreds of nanoseconds, right? And if you think about it, you can go on a single optical fiber, you know, you could go about roughly a half a foot per nanoseconds, you can be, say, 50 feet, 50 feet per hundred nanoseconds, several hundred nanoseconds. You can be almost a football field length away. Now, very often off-chip networks, because people can amortize more complicated routers, they do.
Starting point is 00:18:55 And they build very complicated routers, and the routers can wind up having latencies of easily 50 clock cycles. I find that very painful sometimes. I built many routers back in the 1980s and early 1990s where we were 10 clock cycles from bits arriving on the input pin to bits going out on the output pin. And some of the latency is inherent because on these big router chips now, you have maybe 20 or 30 millimeters of traversal to happen to actually just get the bit from the input port through some on-chip network, which forms a core of that router to the output port. And that's essential latency.
Starting point is 00:19:31 But some of it is sort of designers being a little bit lazy with modularity, right? They say, OK, I'm going to design a block here and put some buffers before it and buffers after it. So I can think about that separately from other things. Whereas if they crafted it more carefully, I think much of that delay could go away. The other thing that eats up a lot of latency is sometimes expensive clock domain crossings. Synchronization failure is a real problem, and many people still use what I refer to as brute force synchronizers to cross clock domains. And those in modern technologies can easily cost five or ten clocks just to cross one clock domain. Whereas there are known techniques, we've published many papers about them, where you can have an average cost of a half a clock to cross a clock domain.
Starting point is 00:20:14 And most people, because they don't feel enough pain about that latency, they don't feel motivated to go and apply the more sophisticated synchronizers. How do you see GPUs in the context of domain-specific accelerators? How are they positioned, and how do you see them evolving within this paradigm? So GPUs are a great platform. That's the way I think about them. Because if you look at the difficulty of building an accelerator, say you want to do dynamic programming. Once you figure out what you actually want to do you can probably code the verilog for that in a couple days right but to code the verilog for the on-chip interconnection network and the on-chip memory system and the interface
Starting point is 00:20:54 for various types of off-chip memory be it hbm or ddr and the interface to some off-chip network you know that's years person years of effort, actually tens, maybe hundreds of person years of effort. And so what you really want is you want a platform that has all that done with lots of bandwidth, lots of parallelism, so you're not bottlenecked by things, into which you can plug accelerators. So you can plug in a deep learning accelerator, you can plug in a genomics accelerator, you can plug in a bonding volume hierarchy accelerator and not have to redesign the hard part.
Starting point is 00:21:25 You just redesign the easy part, the one little algorithm that you're trying to do that you can do in a few days. And so they're a great platform for plugging things in. They're also a great platform for playing with the application before you plug it in, right? And so you can go ahead and get a lot of performance out of the GPU. It's a very parallel machine. It runs hundreds of thousands of threads simultaneously on thousands of cuda cores um and basically get a certain
Starting point is 00:21:50 amount of performance and then see where you have enough pain that you feel like you might want to add you know an instruction like you know we have our deep learning accelerators we call tensor cores are actually instructions they're they're hmMA and IMMA instructions, matrix multiply, accumulate instructions. If you have enough time being spent in one area, that then motivates you to put that instruction in. And that one area then gets a lot faster and then everything else looks a lot worse. Then you go and you find the next thing and push it down. And they're also a great platform for playing around with that co-design process because they're a naturally parallel platform and they also have a local memory hierarchy. Optimizing a program for GPU is exactly the same optimization that you want to do before you turn it into hardware. It's the
Starting point is 00:22:34 same set of constraints. Right. Do you have a vision for how you view domain-specific computing or computing in general to be in the future, both in terms of how, I guess, computer architects and hardware architects think about designing systems, as well as how users, programmers and above think about computing platforms. Like I said, I personally think of developing a domain-specific accelerator as developing a parallel program, just a parallel program with this cost model reflecting hardware. And in the future, I'm hoping we can develop programming systems that largely automate that process, where I can write in some high-level language
Starting point is 00:23:13 my description of the computation, and separate from that, a description of how it maps. Too many of our existing programming languages combine the specification of the function with the mapping of that function in time and space. And for an accelerator, you really want to fundamentally separate those two. We've developed a number of programming languages over the years that do that separation to target conventional hardware. Here you would do that separation and then have some back end that feeds into CAD tools and gives you a measure of how expensive it is, how much area is it going to take, how fast is it going to run.
Starting point is 00:23:46 The typical thing you do when you write a computer architecture paper and you need the results section. So you get the results section and you say, okay, it's not as good as I thought it was. What do I change either in the algorithm or in the mapping? You try to find the bottlenecks and do that. And perhaps even some of the searching of the mapping space can be automated. But it's a very large space.
Starting point is 00:24:05 Searching that isn't an empty, complete problem, and that's where human intuition sometimes does better than algorithms. The mapping is sort of done at a layer above the hardware, right? You've got the problem at a certain level of abstraction, you've got the hardware at another level, and then there's mapping, which is presumably a software layer. What determines the hardware right so say you have a an algorithm like dynamic programming right you can write the specification which says how you compute the value for every point in in the h matrix right that doesn't say how you're going to map that in time and space right there are many mappings of that in time and space the the algorithm does constrain you by data dependencies right because you the mapping in time has to preserve data dependencies. You can't compute something
Starting point is 00:24:49 before you compute the things that it's dependent on. But other than maintaining those data dependencies, you have a tremendous amount of freedom. So the mapping is really saying, for each point in that matrix, which piece of hardware computed it and when, right? And then by doing that, you're also determining the hardware because you're saying how many pieces of hardware there are. One mapping is to say there's one piece of hardware and I walk it left to right,
Starting point is 00:25:13 or I could have one piece of hardware and walk it in row major, top to bottom, and I could do it in either order or I could decide I'm going to have many pieces of hardware and do a wavefront, or I could have fewer pieces of hardware and do wavefronts left to right or wavefronts top to bottom. You can come up with lots of different mappings for the one function. So you can specify that function, specify a bunch of different mappings, and then optimize in that mapping space for some figure of merit, which is some combination of cost and performance. For that particular type of mapping, you know, where you have the algorithmic sort of specification.
Starting point is 00:25:52 So I guess what I was getting at is when you think about like general purpose compute, you essentially had, like you were saying before, you know, it was all bundled into one. You had the program, which was both in some ways the algorithm itself. And then once you got to the hardware, it was mostly the control plane of the hardware that was sort of deciding how to do things. And now there's this other layer where you have a lot of highly parallel sort of compute elements.
Starting point is 00:26:16 And now you've got this other layer that's sort of mapping a high-level specification to hardware and sort of taking some of the responsibility out from the hardware itself into this sort of intermediate layer. Is that the way that you see it going? When you think of conventional computing, where this really comes down to, I think, is people thinking too serially.
Starting point is 00:26:37 And they use programming notations where the common way of expressing kind of the inner kernel of something is a loop nest. And a loop nest fundamentally specifies not just what the inner kernel of something is a loop nest. And a loop nest fundamentally specifies not just what the computation is, which is sort of down in the middle of the loop nest, you actually do some work. But by ordering the dimensions, it's telling you what order you're doing it in. And, you know, sometimes compilers try to reach in there and rip it up and tile it and stuff. But, and that's actually doing the mapping, right? And so if you think about conventional programming notations, you're somehow specifying and ordering and the function,
Starting point is 00:27:10 and then your compiler sometimes tries to come back and undo your ordering. But if you inadvertently introduce a few dependencies or control constraints or something, it's impossible to do that because the two are tied together and it's hard to untie it and just like people you know spend a lot of time pounding their heads against walls trying to do parallelizing compilers the the answer is don't start serial right if you start with just a specification of here's what i want to compute you know hij is a function of you know hi minus one j h i j minus one and hi minus one j minus one boom that's that's you know, h i minus 1 j, h i j minus 1 and h i minus 1 j minus 1. Boom. That's one line of code, right? It's a specification of function without ordering, right? And I want to do
Starting point is 00:27:53 it as parallel as I can subject to the data dependencies. And then it's a separate problem to decide how to map that. Let's not combine the two together. By combining the two together, you then create this huge compilation problem of uncombining them. Just like by expressing things serially, you can create this huge problem of parallelization. Here, I've expressed ultimate parallelism, right? I'm just saying, do this computation. I'm not telling you that one thing has to follow another. I do have data dependency constraints, and I'm not specifying any ordering. You can do whatever ordering you want. Then I can search over many orderings, many mappings to different places in space. And by the way, the exact placement in space matters as well because of locality, right?
Starting point is 00:28:32 The cost of accessing something. Part of this is factoring the compute where the function happens, and part of it is factoring the storage where the input data and the output data are stored and how they're staged and moved. Because that's where all the energy is with these interconnection networks and so by searching you know the space i'm putting things in the right place in space i get enormous amounts of memory bandwidth the tiny local memories sitting right next to things which is part of what you need to make an accelerator work right if every reference is to a random number of say a random address um in a global memory you're yourosed. You're not going to do any better than a conventional processor because they have a memory system with a certain amount of bandwidth. you could, right? Because things like HBM and DDR are all accidents of history, and they're nowhere
Starting point is 00:29:25 near an optimal interface to get to the memories. You're limited to whatever that bandwidth is. But if you refactor the algorithm so that you're accessing things only very locally to you, then you can go very, very fast. Right. As you mentioned, this mapping problem, it's a very large mapping space, something that seems very complicated. It sounds like something that's very ripe for research, both in terms of the techniques that we develop, the abstractions and the frameworks that we have yet to develop. So that might be a good point to sort of talk about, you know, how you view the interaction between research and pathways to products. I'm sure you've seen several iterations of ideas coming in from research, both in an industrial setting and academic setting, and ultimately making its way into a product. Give us your thoughts about how you think about pathways from research to product, collaborations,
Starting point is 00:30:16 and so on. Yeah, so I think it's a really good one. So as the leader of an industrial research lab, one of my primary functions is to make sure that things that get done in NVIDIA research result in improvements to NVIDIA products, not just publications at leading conferences. Because I think I'd have a real hard time justifying my budget to my boss if that's all that came out of NVIDIA research was lots of fame for the researchers and maybe good PR for the company. And so I think that to have effective technology transfer, you really have to make that the goal from day one. And when you start the project in a video research, I ask the individual research to identify two people, and sometimes they're the same person, but one of
Starting point is 00:30:56 them is the champion for the technology, and the other is the receiver of the technology. Very often they're the same person, right? The champion is also going to receive the technology and productize it. And on day one, right when they're still thinking about the project, they haven't written a single line of code or a bit of Verilog or designed any circuits, I ask them to set up a meeting with this person and talk to them about the project and have them involved in the project the whole way through. This solves a bunch of problems. Probably the most important one is very often the researcher doesn't understand the real constraints. They have an academic view of the problem. And when they talk to the person who has to receive it, they say, well, that's nice.
Starting point is 00:31:37 But the real problem is this. And if you can't do that, then it doesn't matter if you do this thing that you're going to do really well because we can't use it. And so getting those unknown constraints in early is absolutely essential. You go a long way down the road and the project is kind of misguided. Also, by having regular meetings with this person, they become invested in the project. They don't have this sort of organ rejection reaction to it if they see it kind of as a finished project at the end and they don't know what's going on and it just looks very foreign to them. And that's sort of step one for that transfer.
Starting point is 00:32:10 And then the other is that you have to be very sensitive to sort of how the product development people work. They have a very low tolerance for risk. And it's understandable, right? I mean, many generations ago at NVIDIA, we had a GPU called Fermi. And actually, this wasn't even anything that should have been risky, but there was a circuit error in it that delayed that product shipping for probably close to six months. And financially, that was a huge blow. I mean, we're talking about many hundreds of millions of dollars.
Starting point is 00:32:37 And so in research, we have this wonderful advantage that we can make mistakes. We can have projects fail. If the people who have to ship the next GPU, we develop one GPU at a time. We don't have a backup GPU. If that GPU cannot ship at the time that it's planned, really, really bad things happen. So these guys have a very low tolerance for risk. So you have to understand what sort of level of quality and technical maturity you have to mature an idea to, to make it palatable to them. And so, you know, some great examples are like the RT cores that shipped in Turing. We started that project as a way of sort of accelerating ray tracing, but it took a lot of interchange with the product development group to turn what we thought was a really great architecture into something that was palatable to them, that it was low enough risk. And the other thing is estimating cost well, right? We had come up with various cost estimates of how expensive it will be. And invariably, our estimates
Starting point is 00:33:34 were low, right? When they actually put in all of the built-in self-tests that's needed around the memory arrays and very other pragmatic things you had to do in the real world, the thing grew quite a bit. And so the costs went up. And so you have to understand that's part of the technical maturity is understanding the real cost of a piece of hardware. Probably the most important thing to really, I've seen a bunch of research projects very successfully transfer into product. I've seen a bunch of them fall short and They could fall short for two reasons. One fundamental reason is there was some mismatch. It actually wasn't going to work in the product. If that's the case, then the only thing I regret is that we didn't kill the project sooner. That's
Starting point is 00:34:17 actually one of my roles in research. I go around looking at what people do and ask hard questions about, well, what gain are you going to get? Who's going to receive this? What's the benefit going to be? And try to kill off projects that it's really clear aren't going to make that jump. Because if we kill them off sooner, then we can put all those resources on projects that are going to make it. But the other reason why it fails sometimes is that the person stops working before it's
Starting point is 00:34:41 really done. And there's a difference between having something done enough, you can publish the ISCA paper on it, and having something done enough that somebody's going to bet the next generation GPU on it. That's actually a really, really big difference, right? I mean, you're maybe a quarter of the way there, maybe even a tenth of the way there when you publish the ISCA paper. And this is a question of getting people's goals aligned, right? Because if somebody's goal is, I want to publish lots of papers because I want to become famous, they're never going to transfer anything to product, right? Because they'll get it just far enough to get that paper
Starting point is 00:35:12 and then they'll go and move on to the next thing and get another paper and they'll become very famous, but they won't have done any good for the company. We really need to sort of try to make sure that people's goals are set, that they're motivated to do that work, that you don't get a lot of professional recognition in the community from except within the company, to do the maturation of risk reduction needed to carry it all the way to the point where the product guys say, yeah, we understand what the real costs are. We're absolutely sure it's going to work. You've made sure it's testable and all of these other things, and we can drop it in. Yeah, so that's sort of a philosophy there. If I go back to the academic world, I think it's harder.
Starting point is 00:35:47 We had a bunch of technology transfer successes in the academic world. We worked with Cray on every generation of interconnection network they developed from the T3D, which first shipped in 92, I started working on the project in 89 through the network in Cascade, the original Dragonfly network. But even there, it required a lot of relationship. And again, it's kind of understanding who the champion, who the receiver of the technology was. It was a set of Steves over the years, Steve Nelson at first and then Steve Oberlin and then Steve Scott at various stages and have a very tight relationship with them. But even then, we develop things in the academic world. It would be a decade before we'd see a product. And it was only because we had a very good relationship that it actually made it through those hurdles and eventually shipped in real machines.
Starting point is 00:36:42 And the part of that relationship was us understanding kind of the problems as the field evolved. And a great example there was making the jump, you know, from low-radics to high-radics networks. And in some sense, you know, I wrote a bunch of papers in the late 80s and early 90s telling people that they should stop building these high-radics networks like binary N-cubes and start building low-dimensional torus networks. And then I wrote a bunch of papers starting in mid-2000s explaining that they should stop building all these torus and mesh networks and start building high-radix networks. But it's because the technology changed. Again, the theory remained the same, but with different cost models, you come up with different answers.
Starting point is 00:37:20 But it was having that great relationship with industry where we were able to sort of understand the real cost model by working with the guys at Cray at that point in time. So how do you know when it's time to revisit something like that? Because I think one of the interesting things about our field is that the technology, I think I described it once in grad school, you know, the sands on which you were standing are constantly shifting. So the answer is never always the answer because the sands shift, But it's a little bit of an art to know when the sand has shifted enough to re-look at something and then potentially fight the conventional wisdom that has been established about it. So when do you know that it's time to look again? I always re-evaluate. I don't assume any, I reject all conventional wisdom. I try to sort of re-derive everything
Starting point is 00:38:07 from first principles every time I do a project. And that way you're never doing it too slowly. Right. This might be a good point to sort of wind the clock back a little bit. And you've had a really long career. Maybe you can tell our audience, how did you get interested in computer architecture and, you know, how you got to NVIDIA eventually and how you thought about
Starting point is 00:38:30 like the various inflection points over the course of your career. Okay. So I don't know exactly where you want me to start, but I never finished high school. So back in the 1970s, I dropped out of high school when I worked preparing cars and pumping gas for a while. And sometime during that period, I discovered microprocessors and actually wound up getting a job as, I guess the title was electronics technician, but it was basically doing microprocessor system development. And it was kind of fun, but I realized that I probably should go back to school and get a degree. So I wrote a very persuasive letter. I don't know if you could do this today. People are too bureaucratic today.
Starting point is 00:39:02 I got admitted to Virginia Tech as an undergrad in 1977 without a high school diploma. And in three years, completed a bachelor's degree in EE and then took a job at Bell Labs as a microprocessor designer designing a product we called the Bell Mac 32. I think it was officially offered as a Western Electric 32100. And it was a great initial experience. I didn't realize how lucky I was to be at Bell Labs. I figured, oh, all jobs must be like this. But it was a great place. There were really smart people all around.
Starting point is 00:39:37 And they were always challenging me with thoughts and ideas. They also paid for me to go to grad school. I worked there for one summer, and then they sent me to Stanford to get a master's degree in double E. And I actually came very close to staying in the Bay Area. I probably should have in hindsight, but I felt somewhat loyalty and they'd paid all this money to send me to get my master. I needed to go back to work for them. But being in Silicon Valley at that point in time was just a real hotbed of interesting things going on in the industry. But I went back to Bell Labs and worked there for a while. And ultimately, I decided I should go get a PhD. This was about the time that Carver Mead had just written his
Starting point is 00:40:16 book on VLSI. And so I thought that was really cool. And so I decided I'm going to go to Caltech and work with Carver Mead. So I went to Caltech, but instead of working with Carver Mead, I actually initially got aligned with Randy Bryant as my first PhD supervisor. And he had developed a simulator called Mossim. And so I like to build hardware. And so I built what was probably my first accelerator was the Mawson Simulation Engine. Now, because I had gotten my master's degree at Stanford, Caltech didn't recognize it because it did not have a thesis. And so I wrote a master's thesis equivalent on that project. And of course, it wouldn't give me another master's thesis because you already have one.
Starting point is 00:41:01 We won't give you a thesis that you already have. But I basically had to do that. Then about that point in time, Randy came to me one day and he said, well, I've decided to go take a faculty position at CMU. You're coming with me, right? And I go, Pittsburgh? I don't think so. And so at that point in time, I started shopping around for another thesis advisor
Starting point is 00:41:20 and another project and was very fortunate to find Chuck Seitz, who's just you know a a brilliant person a real pioneer of parallel computing and wound up doing my uh phc thesis with chuck and actually the the topic of my thesis was mostly about programming systems and i developed a language called concurrent small talk and how you could build what i called concurrent data structures where the synchronization and parallelism was built into the data structure, most applications being developed around data structures. You could then easily build parallel applications by just plugging together the sort of standard template library of data structures. But what everybody noticed about my thesis was this little part at the end where I explained how to actually build the machine to run these.
Starting point is 00:42:04 And in there, I actually had developed much of the theory that's used now by interconnection networks. So the whole notation for deadlock analysis, virtual channels, wormhole routing, all of that was actually developed in one chapter of that thesis. That's the only part anybody remembers. So anyway, from there, what's interesting is when I got my PhD at Caltech, there was nothing more than I wanted than to be a professor at Caltech. And the nicest thing anybody's actually probably done for me in terms of selfless act, um, in my benefit rather than theirs was the administration at Caltech writing me the nicest
Starting point is 00:42:39 rejection letter I ever got and, uh, telling me to choose between the offers that I gotten from Stanford, Berkeley, MIT, and CMU. And that was great because I think my professional development would not have been nearly as good had I stayed at Caltech. I probably would have had more fun, but that's a different matter. But anyway, I went to MIT, which was interesting because it was a place with a very different way of thinking about things. And so I popped down in the middle of the MIT AI lab. One of my great mentors at the time was Patrick Winston. I wound up having to defend these ideas that everybody at Caltech just sort of took as obvious.
Starting point is 00:43:14 Well, locality is important. You need to worry about area. And thinking of things from sort of a VLSI-centric point of view, where the majority of people there were sort of LISP programmers, and they thought of everything in terms of, you know, what you could define as a lambda. And it was just interesting sort of intersecting with that culture. And I think I developed a lot in a very short period of time. And also I had the opportunity to work with a lot of really brilliant graduate students because MIT, you know, like Stanford is this place where
Starting point is 00:43:39 really great students just show up on your doorstep. You don't have to do hardly any work. And by the way, I think that's's one big advantage of being in the academic role is just the opportunity to work with amazing students. And I had the opportunity to build a bunch of interesting parallel machines. There's the J machine and the M machine that pioneered a lot of techniques that are found in all parallel machines today.
Starting point is 00:44:01 And then in 1995, I wanted to do computer graphics. So I went on a sabbatical to the university of North Carolina to work with Henry Fuchs and John Poulton on graphics and graphics hardware. And that sabbatical did two things for me. One is it made me realize that I could move, right? I had a family and I felt kind of very settled in, in the Boston area, but we picked the whole family up. We went down to Chapel Hill and it was great. It says, okay, I can move.
Starting point is 00:44:22 The other is while I was down there, you know, Fred Brooks and a bunch of people started whining and dining me and offered me endowed chairs and stuff like that. I said, oh, maybe. A, I can move and B, maybe people actually want me. It had not occurred to me. Then I said, okay, well, if I can move and people want me, where would I really like to be? It didn't take me very long after having lived through 11 Boston winters to decide, you know, I need to move back to the West Coast. And so I tell everybody I went, I moved back to Stanford, largely Silicon Valley. In fact, it shifted before I did. I was a lagging indicator. And Stanford was actually intellectually a much better place to be than MIT. MIT is a great place, but Stanford was a step up. And for me, it was just a great career move where I got an improvement in quality of life.
Starting point is 00:45:18 And I think the intellectual environment around me was much more in tune to the industry. And so that was in 97 when I made that move. And so we did some great, I sort of got back into interconnection networks. We did some great work working with people like Lisha and Pei and Brian Tolles. We wrote the book on interconnection networks. And we did the Imagine project and Merrimack project for stream processing that really was the forerunners of GPU computing. And then I got corralled into being a department chair. And I'm not quite sure
Starting point is 00:45:50 how I ever agreed to this, but it basically was a very large fraction of my time was spent on non-technical things, like trying to keep the CS department at Stanford from going bankrupt. I inherited the department, and I found that we had a million dollar year deficit and it was $500,000 left. So I had like six months to sort of patch that hole. I'm always still curious to this day as to what would have happened. Because Stanford has lots of money, they would have bailed us out, but it probably would have been unpleasant for the department had I actually run a deficit. You're not supposed to do that. But anyway, I wound up sort of being department chair while still running stream processing projects
Starting point is 00:46:28 and also a project called Elm for efficient low-power microprocessors. And around that time, I started consulting for NVIDIA. I think it was probably around 2003. My longtime friend, John Nichols, said, gee, the stream processing stuff is what we need to get into our GPUs. And so I worked with John and Eric Lindholm and a bunch of the architects on getting the GPU compute features into what we called NV50 and ultimately shipped as the G80 to sort of take our stream processing stuff and move it across. And in this way, I developed a really good relationship with David Kirk, who's the CTO of NVIDIA, and Jensen Wang, the CEO.
Starting point is 00:47:13 So when my term as department chair was coming up at Stanford, I was trying to figure out what to do next. I figured I've got to get out of this place a little bit because everybody comes to me with their problems. And even if I'm not officially still department chair, if they can't get what they want from the department chair, they're going to come to me next. So I'd actually set up a sabbatical. I was going to go to UC Santa Cruz, and I was going to work with David Hausler on genomics
Starting point is 00:47:34 because that was something I was very interested in even at that time. And then I was having dinner with David Kirk at some point, and he said, well, why don't you come to NVIDIA? And I said, well, why would I want to do that? And so he and Jensen started working on me about coming to NVIDIA and sort of starting a research lab that inherits some research that was already going on on ray tracing. And after a while, it actually started sounding like the right thing. So in 2009, I made the jump from the academic world to industry. And to me, the real motivation for doing that was to maximize my impact. And I've always sort of viewed my success and measured my output by what is the impact I'm having on the world.
Starting point is 00:48:11 As a professor, a lot of that impact is with people, right? You produce graduate students go off and you realize that they probably would have done really well anyway because they're really smart people. But you hope that your mentoring over the years there where your student had some positive impact on that. And I used to really enjoy teaching undergraduate classes, especially to freshmen, because you figure you're having a lot of impact on those people. But then in terms of research output, it was very hard to sort of get projects to impact industrial practice. And here was an opportunity perfectly matched with my set of research interests where I could hire large numbers of people. So the resource was also getting harder and harder to do systems projects in the academic world. The money was drying up. DARPA was no longer funding academics. NSF was funding things in too small a chunk to really do real systems work.
Starting point is 00:48:59 So the combination of being able to leverage enough resources to make real progress, having immediate impact, it became very compelling. And so I did it. And it's been a blast. I mean, it's been it'll be 12 years in January. And I feel like I've had a lot of impact. I've built a great organization. NVIDIA Research spans from circuits and architecture and VLSI to graphics, robotics and AI.
Starting point is 00:49:22 And, you know, I can look at every generation of gpu that shipped and key features are things that we kicked off in research and it's had tremendous impact so it's been it's been a lot of fun what i miss are the students um i really love working with students and i love teaching um that's a big you know gap that um that i miss um and uh although it's probably much tougher now i'm probably a lot less enjoyable that everybody's doing it over Zoom. Yeah. And I still keep in touch. I had a Zoom call with a former student of mine who, you know, I used to teach this class at Stanford called Green Electronics, which is the nuts and bolts of sustainable energy systems. And this guy liked my class enough that he dropped out of Stanford to start a company on his class project. He actually came back and finished his degree.
Starting point is 00:50:03 I'll give him credit for that. But he now has another company doing another green electronics thing. And I have regular calls with him. I just give him advice and see how he's doing. So I still have some of that interaction with students, but I'd like to be kind of mentoring that next generation of students. And I'm missing that right now. I see. I see. That's what I feel like there's been this exodus of professors from academia to industry lately. I don't know if it's more than it used to be, but it feel like there's been this exodus of professors from academia to industry lately. I don't know if it's more than it used to be, but it feels like there's.
Starting point is 00:50:30 Oh, it's way more than it used to be. I'm very worried about it. I guess I'm part of the problem. But, you know, people react to incentives and the incentives are all set up. You know, even even both the selfish incentives and the selfless incentives are all set up to suck people into the industrial world. I mean, the salaries are at least double. You don't have to spend a large fraction of your time kind of begging for money. You don't have to deal with the bureaucracy increased monotonically the whole time I was an academic.
Starting point is 00:51:04 And so I think that they've created a whole set of incentives that, except for the fun of working with students, pulls you in the other direction. And I worry about that a lot because I think a lot of the success of the United States in being a technological leader has hinged on, since the 1960s, having the world's best technological universities and the best people in those positions. And I worry if the best people go to industry, they'll develop great products, but they won't be educating that next generation of people and we'll lose that edge. So the incentives have gotten turned upside down. And I think it's bad for the country and probably bad for the world that the incentives are not getting the best and brightest people to go be professors. Because I guess I'm enough of an academic at heart that I think that's where the best
Starting point is 00:51:49 and brightest people ought to be. Yeah. Do you think of that as another system that could be re-architected? Yeah. No, we need to describe the function and the mapping, and then we can probably fix that. Yeah. Yeah. I always find that engineering human systems
Starting point is 00:52:05 is so much harder than engineering computer systems because people do whatever they wanna do, despite the fact that you sort of tried to engineer it for them to do something different. They can't be reliably programmed. Right. You mentioned that this is a very exciting time to be a computer architect.
Starting point is 00:52:24 What's on the horizon for you? What is exciting for you in the near future or in the long-term future? I'm really excited about finding out how to generalize domain-specific accelerators and lower the barrier to entries so that people can do them for a lot more applications. And part of that is making it easier to do the design. Another part is actually making it easier to do the tooling and having a good platform like a GPU as part of that, but it's not the whole story. I'm very excited about a lot of applications of interconnection networks within those accelerators, within the platform on the GPU chip, between GPUs,
Starting point is 00:53:00 you know, between clusters of GPUs in the data center. And I think there's a real migration of a certain part of that communication to optics these days that, again, changes the cost equation. Whenever you change the cost equation, the solutions change. There'll be a lot of interesting new technologies developed as a result of that. And I'm interested in the whole design process. I think it's way too hard to design computer hardware these days, and there's got to be ways of making that simpler, and again, that lowers the barrier to people doing it.
Starting point is 00:53:34 I think it's one thing that's required for accelerators, but even for general-purpose machines, it shouldn't be a many-thousand-person-year project to do, maybe many hundred, and if we can get an order of magnitude increase in productivity, that would be a great thing. Well, there you have it, folks. Thank you so much, Dr. Bildali,
Starting point is 00:53:53 for sharing your thoughts and perspectives with us. It's been an absolute delight speaking with you. Yes, thank you so much, Bill. Oh, my pleasure. And I hope you guys have a great day. And to our listeners, thank you for being with us on the Computer
Starting point is 00:54:06 Architecture Podcast. Till next time, it's goodbye from us.
