ACM ByteCast - Wen-Mei Hwu - Episode 58
Episode Date: September 25, 2024

In this episode of ACM ByteCast, our special guest host Scott Hanselman (of The Hanselminutes Podcast) welcomes 2024 ACM-IEEE CS Eckert-Mauchly Award recipient Wen-Mei Hwu, Senior Distinguished Research Scientist at NVIDIA and Professor Emeritus at the University of Illinois Urbana-Champaign. He was recognized for pioneering and foundational contributions to the design and adoption of multiple generations of processor architectures, contributions that have had a broad impact on three generations of processors: superscalar, VLIW, and throughput-oriented manycore processors (GPUs). Other honors and recognitions include the 1999 ACM Grace Murray Hopper Award, 2006 ISCA Most Influential Paper Award, 2014 MICRO Test-of-Time Award, and 2018 CGO Test-of-Time Award. He is the co-author, with David Kirk, of the popular textbook Programming Massively Parallel Processors. Wen-Mei discusses the evolution of Moore's Law and the significance of Dennard scaling, which allowed for faster, more efficient processors without increasing chip size or power consumption. He explains how his research group's approach to microarchitecture at the University of California, Berkeley in the 80s led to advancements such as Intel's P6 processor. Wen-Mei and Scott discuss the early days of processors and the rise of specialized processors and new computational units. They also share their predictions about the future of computing, the advancements that will be required to handle vast data sets in real time, and potential devices that would extend human capabilities.
Transcript
This is ACM ByteCast, a podcast series from the Association for Computing Machinery,
the world's largest education and scientific computing society.
We talk to researchers, practitioners, and innovators
who are at the intersection of computing research and practice.
They share their experiences, the lessons they've learned,
and their own visions for the future of computing.
I'm your host today, Scott Hanselman.
Hi, I'm Scott Hanselman.
This is another episode of Hanselminutes, in association with the ACM ByteCast.
Today, I have the distinct pleasure to speak with Dr. Wen-Mei Hwu from NVIDIA.
You got your PhD at the University of California, Berkeley in the 80s.
Do you also teach at the University of Illinois at Urbana-Champaign,
sir? Yes, I taught there for 34 years before I graduated and went to NVIDIA.
That's wonderful. So you have a wonderful paper that we'll put a link to in the show notes
called "Running on Parallelism and a Few Lessons Learned Along the Way." And I love that it starts on the first slide with a list of years and people.
It's so refreshing and wonderful to see the acknowledgement of the history of the space
as the first slide.
Was that a conscious decision to lead with that?
Yes.
This actually was the set of slides that I used after I received the ACM-IEEE CS Eckert-Mauchly Award.
So basically, when the committee chair called me and told me that I would be the recipient, it took me a little bit of time to accept it. These are the people who literally built the historic machines and published extremely influential work in the history of the field.
Many of them are the authors of the textbook that I read when I was a grad student.
So I wanted to put that slide up for context before I talked about any thoughts after receiving that award, because these are the people
whose shoulders I was standing on. Yeah, that's so wonderful, because it's such an amazing group.
And to be acknowledged in the same list in the same breath, going back so many years must feel
really amazing. The discussion is about parallelism and where it fits into the historical
context of things. And if we think about Moore's law, throwing transistors at the problem and maybe
starting to flatten out, we're seeing computing in general start to move towards not just parallelism,
but massive parallelism. Is that because we're starting to hit some physical
limitation, some quantum limitations? We can only throw so much power at these things. We have to
start to go wide. I actually kind of lived through the whole cycle in my professional career. When I
started as a grad student in 1983, it was really, I would call it, the real beginning of Moore's Law. Later on,
we also came to understand Dennard scaling. These two things were incredibly important for the next 20 years.
So Moore's Law basically drives the industry to reduce the transistor size every, you can argue, 18 months or two years. That allows people to pack in twice the number of transistors per chip
without increasing the size of the chip.
That's economically very good.
But for my career, Dennard scaling was even more important.
With Dennard scaling, as you reduce the transistor size,
you also reduce the voltage by one over the square root of two.
As a result, the power per transistor goes down, and the power density of the chip stays roughly constant.
You can also improve the speed of the transistor, reducing the switching time by one over the square root of two.
And you can still have more transistors, right?
So that is incredibly important, because for the next 15 years,
people could literally say that they could write a piece of code,
go to sleep for five years, wake up,
and that piece of code would run much faster
without doing anything to it.
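To put rough numbers on the idealized Dennard scaling being described here, the following sketch assumes the textbook per-generation scaling factor of the square root of two; the variable names and the script itself are only illustrative, not anything discussed in the episode.

```python
import math

# Idealized Dennard scaling: linear dimensions, supply voltage, and gate delay
# all shrink by 1/k per process generation, with k = sqrt(2).
k = math.sqrt(2)

voltage = 1 / k        # supply voltage shrinks by ~0.7x
capacitance = 1 / k    # gate capacitance shrinks with the dimensions
delay = 1 / k          # switching time shrinks by ~0.7x ...
frequency = k          # ... so clock frequency can rise by ~1.4x

# Dynamic power per transistor scales as C * V^2 * f.
power_per_transistor = capacitance * voltage**2 * frequency  # -> 0.5x

# Area per transistor shrinks by 1/k^2, so twice as many fit on the same chip.
transistors_per_area = k**2                                  # -> 2.0x

# Net effect: power density (power per unit of chip area) stays constant.
power_density = power_per_transistor * transistors_per_area  # -> 1.0x

print(f"power per transistor: {power_per_transistor:.2f}x")
print(f"transistors per area: {transistors_per_area:.2f}x")
print(f"power density:        {power_density:.2f}x")
```

This is why, while Dennard scaling held, each generation could run the same code faster without the chip getting any hotter.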
This is great because during my grad school days, I went to the University of California,
Berkeley, and that was the time when we really debated what the future course of microprocessors
would be.
That's when Hennessy and Patterson were advocating RISC.
There were a huge number of discussions.
But the research group around my advisor, Yale Patt, and some of my fellow grad students looked
sort of beyond the immediate next five years, looked to the next 10, 15 years, and said, well,
if we have so many transistors
and these transistors are going to be running so fast,
how about if we use these transistors
to detect the parallelism
that nobody can currently detect in their processors
and use the parallelism to run things much faster?
And we got lucky.
We got lucky that by 1993, when Intel designed the first P6,
Moore's Law and Dennard scaling got to the point
where they could design such a processor
based on some of the intellectual research we did in the 1980s,
and many other people did after us, so that they could build the P6,
which is the Pentium Pro processor. And that processor embodied all the parallel mechanisms
that we envisioned. That was history. One of the interesting lessons we learned is that when we do
research, we really need to understand where the
technology is going to be 10 years from now. Because by the time people put any of your
research ideas into real hardware, it will be at least five years, if not 10 years.
So for folks that are listening who may not be familiar with some of these terms,
I want to call out some things that you said that folks can go and read more about. So Dennard scaling, also known as MOSFET scaling, goes back to 1974, when it was first being thought about.
And that held, as you pointed out, for 25 plus years until it started to kind of falter in the
early 2000s there. And then with modern GPUs, you're looking for ways to get around it, stacking
chips, coming up with new ways to cool. Because even if you're getting higher performance without proportionally higher power consumption, you're still
getting power leakage. You're getting heat dissipation. You're pushing up against the
laws of physics itself. You say, though, you're thinking 10 years out. Is there a limit?
Is there a possible theoretical limit where you will actually say, nope,
that is as far as this is going to go? I think there is, but we don't know exactly
when it's going to hit. This is like a stock that keeps going up. Everyone says it's going
to go down at some point, but nobody knows when. We had been saying that Dennard scaling would be
ending. And it did end. Dennard scaling ended in the early 2000s, as you mentioned.
For Moore's law, we can probably see at least another two generations, but we are already
at the point where we're building computers very differently than scaling on the chip.
Scott, you already mentioned that we try to get stacked memory with chip packaging. But I think if you look at the big innovations in
today's research, we are already into building very big systems with reasonable latency.
So we're going to see big changes in networking, especially going from electrical to optical. And we're going to see very
different system communication primitives that allow us to synchronize a huge number of machines
fast. And these are all happening because we already see that coming, right? We don't know exactly
when it will come, but we need to run as fast as we can so that we're
ready when it does. And chances are, we may not even be totally ready when it comes.
Yeah. This might be an interesting question or a dumb question, but we hear about people being
full-stack engineers, and you're probably not thinking about Node or Erlang or JavaScript at
the highest level. When someone is a software engineer, for them, low level might
be going into Google Chrome and hitting F12. But for you, you're starting to bump up against like
protons and electrons moving around in space. But the software stack overhead has us wondering,
even as I sit here on a highly parallel machine: why does my machine feel slow?
Are you thinking about software paradigms and how people are going to program in five or 10 years
and applying that to how you're going to create
these pieces, these chips in 10 years?
Absolutely.
There are two important forces in these kinds of things, right?
One is reality.
There are a huge number of people programming with Python.
When we deal with frameworks based on Python, such as PyTorch, we see all kinds of interesting, not just overhead, but bottlenecks.
For example, the Python global interpreter lock fundamentally restricts the kind of parallelism we can expect at the source language level. And more subtly, all the GPU activities that are launched through the Python-based frameworks are based on API calls.
And what that means is that these API calls are separated from each other.
They're not adjacent to each other in terms of the compilation process
or interpretation process. We're seeing these individual activities. So one of the important
things that we will begin to see is how the lower level software and hardware combination
will be able to take these API calls and maybe do things like fusing these activities on the fly
and optimizing away some of the inefficiencies that exist because these are isolated events.
But these are the kinds of things that people have been working on for decades,
starting from the Java days.
Microsoft has had runtime compilation for a long,
long time, along with all the runtime linker optimizers. But I think we will be getting to the point where
some of these technologies will begin to really be deployed in the next decade or so, because we
are running out of room for further optimizations
without deploying these kinds of techniques.
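As a small illustration of the kind of on-the-fly fusion being described, here is a sketch that assumes PyTorch 2.x: torch.compile captures the chain of separate Python-level operator calls and hands it to a backend that can fuse the elementwise work into far fewer GPU kernels. The function and tensor sizes are invented for the example.

```python
import torch

def activation(x: torch.Tensor) -> torch.Tensor:
    # Written eagerly, each line below is a separate framework API call,
    # launched as its own kernel with no optimization across the calls.
    y = x * 0.5
    y = y * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x**3)))
    return y

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)

# torch.compile traces the whole chain, so the backend can fuse these
# isolated operations and recover optimizations that the individual
# API calls cannot see on their own.
fused_activation = torch.compile(activation)

print(torch.allclose(activation(x), fused_activation(x), atol=1e-5))
```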
Yeah.
Seems like, if I recall,
when I was getting my first 386,
there was the SX and the DX,
and then we had ones with a coprocessor.
And companies would say,
we need to figure out something
for that coprocessor to do.
Maybe people will buy it if they use Excel. So then we would buy this computer, because I want to use Excel. When I wanted to play Quake or Doom, I would buy a 3dfx card because I wanted that. Yes. When I want to play a great game now, I'll buy an NVIDIA processor. I'm on a machine right now with a 4080 Super. I'm thinking to myself, what is that really good at that my Intel chip is not very good at? And if I look to my right here, I have a whole complement of different kinds of processing units here.
Do you think there'll only be three? Will I come into a world where there's five or six or 10
different fundamentally interesting processor units that do different tasks?
I think that's a billion-dollar question, personally. I think there'll be at least one more.
Scott, you hit the nail on the head. People don't want processors.
People want applications. People buy processors reluctantly because they want to run their
applications well. Today, I would say that people buy GPUs because they want to train models.
These models have all these training needs. People buy these GPUs
reluctantly. I'll turn your question just a little bit and maybe say that I think there will be
another kind of processing unit, because at some point people will want to be able to deal with
massive amounts of data on a first-hand basis. What I mean is that we are going through
the machine learning era where we reduce all that data, the training data, into these models.
Then we convert the query process into a computational inference through this model,
and we get the information.
But we kind of bypass the need to find the data in real time.
Google has been doing search, and Microsoft has been doing search,
and these searches are real-time searches, right?
If you use Google, all these things are pre-indexed data
that we hope to serve to the user in real time.
One thing that is very interesting
is that we have swung the pendulum
so far toward deep learning models
and large language models.
We should begin to think about
what happens next, because
these models have their limitations.
What if people want to be able to go straight to the data,
but use some of the model capabilities to be able to find the data
much better than the previous generation of search?
So I can imagine in order to do this,
we will need to have a unit that is going to be able to tolerate latency and introduce very efficient ways to grab, let's say, one kilobyte of data out of four petabytes of data, serve it to the user in real time, and go through some of the models to give the user something
very digestible. Solving the needle-in-a-haystack problem at scale. Because traditionally, and you
mentioned this in the paper and in the presentation, there is this memory-storage divide.
It feels like the majority of the work that my computer is doing, as a user, is simply moving data
from one format to another. It's just transformation. And that's becoming quite tedious, as I simply want to work
or I want to enjoy my computer. Absolutely. Right now, we have all these file systems,
software stacks, and so on that we built over the years that are standing in the way. And the formats
that you're talking about are imposed partly by this kind of software stack.
What people really want is their data.
People don't want the file system.
People don't want the format, right?
How do we get people what they want?
And I think that's going to be the next unit.
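As a toy sketch of that access pattern, going by similarity straight to a small piece of relevant data inside a much larger pre-embedded store rather than through file systems and formats, the following is purely illustrative: the sizes and names are invented, and a real system would use a disk-resident, latency-tolerant index rather than an in-memory brute-force scan.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a large store of pre-embedded records, where each row is the
# embedding of one small chunk of data (a document, photo, message, ...).
num_records, dim = 200_000, 128
store = rng.standard_normal((num_records, dim)).astype(np.float32)
store /= np.linalg.norm(store, axis=1, keepdims=True)

def lookup(query_embedding: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k records most similar to the query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = store @ q                       # cosine similarity to every record
    return np.argpartition(-scores, k)[:k]   # top-k without a full sort

# In the scenario described, the query embedding would come from a model
# encoding a natural-language request.
query = rng.standard_normal(dim).astype(np.float32)
print(lookup(query))
```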
ACM ByteCast is available on Apple Podcasts,
Google Podcasts, Podbean, Spotify, Stitcher,
and TuneIn. If you're enjoying this episode, please do subscribe and leave us a review
on your favorite platform.
And it seems like we're narrowing the space between our work and the data.
We have that storage throughput, we have that latency that applies.
We're setting up little islands where your parallelism and the data it's running
on are near each other,
so as to avoid that latency, which is now down to, what, picoseconds? It's down to nothing.
It's down to, you know, as little latency as possible.
Absolutely.
Yeah.
I've seen, you mentioned, millisecond-level latency and terabytes-a-second throughput.
It seems like a
dream, but it's going to happen at some point. I really believe that if we do the work right,
when someone does this interview 10 years from now, the person may have a machine on the desktop
even. I can see how we can build these kinds of things in the data center at a very high cost.
But the real test is: can we put this kind of
capability onto every person's desktop, or even into a small device that the person carries, that has an
entire lifetime's memory and information that you can retrieve in real time when you talk to someone?
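Some rough arithmetic on those targets, using the round numbers mentioned in the conversation (terabytes per second of throughput, millisecond-class latency, a kilobyte of answer out of petabytes of data), shows why such a device has to lean on indexing and latency tolerance rather than brute-force scanning; the figures below are illustrative only.

```python
# Back-of-envelope numbers for "1 KB out of 4 PB in real time".
bandwidth_bytes_per_s = 1e12   # 1 TB/s sustained throughput
corpus_bytes = 4e15            # 4 PB of data
payload_bytes = 1e3            # 1 KB answer
access_latency_s = 1e-3        # ~1 ms access latency

# Scanning everything is hopeless for interactive use...
full_scan_s = corpus_bytes / bandwidth_bytes_per_s            # ~4000 s
# ...while an indexed fetch of a tiny payload is dominated by latency.
indexed_fetch_s = access_latency_s + payload_bytes / bandwidth_bytes_per_s

print(f"full scan:     {full_scan_s:,.0f} s (~{full_scan_s / 3600:.1f} hours)")
print(f"indexed fetch: {indexed_fetch_s * 1e3:.3f} ms")
```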
It's interesting that you mentioned that because just literally yesterday, a friend of mine sent me a package.
And for those of you who are listening to this podcast, you won't see this.
I'm holding up a Compaq PC Companion and some of these early devices, the Palm Pilot.
We've always wanted our life, our family, our friends, our photos in a tiny device like this.
My first applications were
on the Palm Pilot. This is a pocket supercomputer with its own GPU. And now we're going to see AI
start to move its way into our pockets. The idea that what once filled a room is carried around
with me with all day battery life. To your point, the data won't be in the cloud. It could just be in my pocket.
Exactly.
I can see how that can happen.
I work with a lot of the SSD storage vendors
on a regular basis.
I work with the PCB builders, the mobile chip builders.
I can see if we do the right things,
within a decade, we can begin to have these devices that truly extend us.
I really feel that the real benefit of these devices is to make each of us an infinite-memory person.
And it can complement us; let's say, just like calculators complemented our ability to do arithmetic,
we should be able to remember the names of all our associates and colleagues and so on in real
time. No problem, right? Right now, I can't. I was looking at some pictures of a former US
president who was being introduced to some dignitaries, and you could tell that someone
was whispering, that's the president of France, and this is his wife, and these are their children's names. That's what I would like
for my coworkers, for my colleagues, for people I haven't seen in 10 years,
is someone to whisper in my ear. But right now, to do that, I have to go to the cloud,
interact with a vendor. That's sparse access to massive data. And who knows what that costs in power consumption.
As I go to Google Photos today and I say,
show me pictures of my child in the snow when they were two.
Yes.
You're saying that the algorithms, the memory capacity,
the theoretical limits will advance such that we could have that in our pocket.
Not only that, you don't need to scroll through 100 results and pick the best one.
That's amazing. That's what I want. But that's going to require silicon changes. It's going to
require algorithmic changes. That's beyond a generational shift. That's going to get the
entire industry to head in a different direction. Yes or no? We have been witnessing these kinds of
things. If there's enough value, these things move fast. For example, I have seen
the whole GPU movement. I still remember in 2011, when we proposed to build a supercomputer called
Blue Waters based on GPUs to NSF, there was a review panel. These review panelists were all
supercomputing center directors who had decades of experience
building supercomputers and serving communities with supercomputers, right?
And after my presentation, one of the panelists said, this proposal is risky to say the least
and probably irresponsible if you really look at it.
I said, why do you say this? He said, compare the number
of math library functions that GPUs have versus the Intel Math Kernel Library, right? And you're
a long way away from that. And I said, yes, but we know that there are only a limited number of science
applications that people truly care about running on supercomputers,
and we know that this is going to start out more special-purpose.
But if you look at the computational speed over wattage, this thing is a small puppy
with big paws.
Let's take that step. And somehow, by some miracle, that panel agreed
and allowed us to do it, right? Thirteen years later, we're sitting here looking at all the math
libraries that people already built. And if you look at the model training process,
people put together multiple generations of these kinds of frameworks over the past five years.
If it provides enough value, if there's enough incentive, it will happen.
That's amazing. There are the people who have the audacity to do it once, to propose a petascale
supercomputer, to get the National Science Foundation to think it's a good idea, to give you $200 million to do it, and now it's being done by others.
That's unbelievable.
Thank you.
And then now, is it the Delta project that is the next step of Blue Waters?
The Delta project is the next step of Blue Waters in the sense that there's a real
academic need for more of these GPUs,
and there's a serious shortage of GPUs available to academic researchers today.
And so Delta is meant to fill that gap.
But Delta is not like Blue Waters, where there were so many known risks,
so many potential failures that could become public scandals.
We had to work with all the science teams. There are four science teams that I'm eternally grateful to
for how much work they put into their applications to move them onto GPUs and prove the benefits.
NVIDIA dedicated hundreds of people to working on this project
to make sure that it did not fail, right?
So Delta is not at that kind of level.
Delta is really about
how we can enable academic researchers
who really need the GPU time
to be able to make some real progress.
I see.
So it's about access to those resources
more than it is about pushing the boundaries
of computational power.
Exactly.
Okay.
What projects are you excited about
where folks can go and learn
about supercomputing projects
and audacious moonshot type things
that are going to take us into the exaflops?
Yeah.
There are several exascale projects coming together. And I would say
most of them have very interesting ideas. But I think one of the biggest challenges that all these
supercomputers are trying to figure out is the whole networking question. Because we do have the option to build optical networks,
but these optical networks are so expensive.
They're more than 10 times more expensive than their copper equivalents.
That's why, whenever the distances are short, optical connections are not used.
If we look at the next generation, we're going to see, I would say,
big movements in terms of the network connection and in terms of the cooling.
We're definitely at the cooling limit for all these machines. I still remember, when we were calculating the amount of water that we needed to pump through one of these data centers, I said, can we really have that much water? So I think some of the exascale machines
would have very interesting cooling mechanisms.
And these things will be published
as soon as these machines are operational.
I'm trying to think about what I'm going to do
as I think about retirement,
as I think about what the next 50 years are going to be.
Are you going to be doing this until you can't walk anymore? Are you
going to be a parallelism person and a computing pioneer until the end? Because you seem like
you're having a lot of fun. Exactly. And Scott, let me tell you when I think I'm going to retire.
I will retire when I have that device in my hand. And I can go around and show it to people, pretend to be the
most intelligent person on earth, and spend my time explaining to people how they can use this device.
That's wonderful. I hope we both get to see that happen because that is the promise of
a pocket supercomputer that really makes your life better,
makes your relationships better.
It is about the things you can do with it.
It's about the people and the connections.
Absolutely.
I really hope to live to see that day.
Fantastic.
Thank you so much, sir, for chatting with me today.
Thank you, Scott, for this opportunity.
I have been chatting with Dr. Wen-Mei Hwu.
He is the Senior Director of Research
and Distinguished Research Scientist at NVIDIA Corporation. And many congratulations as well, sir, on your award, your ACM-IEEE award for your pioneering contributions in the design of these processor architectures. We do appreciate you and all of the work that you and your colleagues have done to move us forward in computing.
Thank you. Thank you so much.
This has been another episode of Hanselminutes, in association with the ACM ByteCast,
and we'll see you again next week.
ACM ByteCast is a production of the Association for Computing Machinery's Practitioner Board.
To learn more about ACM and its activities, visit acm.org. For more information about this and other episodes, please do visit our website at learning.acm.org slash bytecast. That's B-Y-T-E-C-A-S-T. Learning.acm.org slash bytecast.