In The Arena by TechArena - Bill Gropp on the NCSA and the Future of Supercomputing Innovation

Episode Date: November 19, 2024

In this podcast, NCSA Director Bill Gropp explores the latest advanced computing trends, from AI innovations to groundbreaking research on supercomputing climate models, and how this tech is transforming science and society.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Alison Klein. Now, let's step into the arena. Welcome to the arena. My name is Alison Klein. This week, we're coming to you from the supercomputing event in Atlanta, Georgia, and I am so delighted to be joined by Dr. Bill Gropp, Director of the National Center for Supercomputing Applications at the University of Illinois. Welcome to the program, Bill. How's it going? Great. Thank you, Alison.
Starting point is 00:00:43 Bill, why don't we just get started? I mean, the NCSA is an amazing organization, and you guys are in Champaign-Urbana, of course. Tell me a bit about the charter and what it means to serve as the director of the organization. So to start with, NCSA is part of the university. We are, in fact, the oldest of the campus interdisciplinary research institutes. So NCSA is an interdisciplinary research center in the use of advanced computing in all areas of scholarship, which also operates a supercomputing center. We were founded in January of 1986 by Larry Smarr with funding from the National Science Foundation. Larry had realized that he just couldn't get the computing that he needed to do his research and submitted a famous proposal, called the Black Proposal for the color of its cover, an unsolicited proposal to the National Science Foundation, which led to the formation of centers like NCSA. It is an amazing place to lead.
Starting point is 00:01:41 I have a large team of people who are passionate about computing, and my job is really to help them follow that passion, ensure that we're helping to advance the use of computing. So not just show people how to use what's there, but to think about where the community should be going and help it get there. What impressed me the most when I was doing my research for this interview is the broad purview of scientific topics that your team engages on. What realms of research are your team studying today? Wow, almost everything to some extent. We benefit from being part of an amazing university
Starting point is 00:02:20 that has strengths in engineering, natural sciences, arts, and humanities, and we work with them all. We do have a couple of areas of concentration which have emerged, some from our history and the foundation of NCSA and others from, again, people pursuing their passions. So astronomy is one. We have a Center for Astrophysical Surveys, which is looking at essentially 21st century astronomy as digital. Telescopes have become larger and more and more expensive, and they're all digital. And making advances in that area is really based on how can you best take advantage of that. We have a Center for Digital Agriculture that takes advantage of the strength that the campus has in agriculture.
Starting point is 00:03:11 We have our Health Innovation Program Office looking at the application of advanced computing to health. The university has what is actually the world's first engineering-based medical school, and that sort of strength in applying computing and engineering to health is really amazing. We have work going on in earth sciences, particularly climate, and in computational sciences. There's a DOE-funded center looking at scramjet design as a way to make it easier to get access to space. And we have cross-tutting groups looking at the application of AI and data science. I don't even know where to go after all of those topics. I could spend a series of podcasts asking you about all those. But I'm just going to go to my favorite topic, which is astrophysics. Space and discovery just happens to be something that I get really excited about.
Starting point is 00:03:59 You talked about making it easier to get into space. What is the group studying here, and how does supercomputing help with the problem? What they're looking at is a kind of air-breathing jet engine called a scramjet. A scramjet operates at hypersonic velocities like Mach 5 and above. And at those speeds, you can't build a jet engine the way you think of them. When you go and fly, you see those engines that got these big fans. If you open one up, you see all sorts of moving parts. At hypersonic speeds, you can't do that. You basically pull in air, spray fuel into it, ignite it, or at least try to ignite it,
Starting point is 00:04:36 hope that it ignites, and then design the exhaust so that you recover as much of the forward thrust from that burning as you can. The environment is terrible. One of the big problems is that at those speeds and those temperatures, most materials just disintegrate. And so one of the things we're looking at is how can you design a scramjet
Starting point is 00:05:00 so that looking at novel materials, looking at how those materials interact with the flow, how can you design the shape to reduce the stress and improve the performance of the engine? How can you design it to be maintainable? Because it is a terrible environment, you're probably going to have to frequently replace parts of it. So how do you do that in a way that makes sense? And all of this is being done primarily computationally, although as part of the predictive science program at DOE, every year we design experiments. People go off and use petascale and aiming towards exascale computing to model the flows, model the interactions with the materials and so forth, and then compare to
Starting point is 00:05:45 experiments. We even have a hypersonic wind tunnel on campus, which helps us do that. But the key is that in order to be able to do design, we need to be able to do that computationally. That's a really exciting program. And one of the things we're looking at there is how do you write the programs for this in a way that allows you to evolve the codes and the algorithms that you get a better understanding. So we're in fact using a Python-based representation that is then transformed into code that runs natively on clusters of multi-GPU nodes. That's incredible. I think that as you describe that and you think about petascale heading to exascale computing, saying that casually, Bill, it's an interesting
Starting point is 00:06:25 thing. It's amazing to see the advancements and what it means in terms of making space more accessible. And I can't wait to hear more about the research. But I do have to ask you and turn the topic to AI. I think that I can't do a podcast in 2024 without talking about AI. AI clusters have obviously gotten a lot of heritage from the supercomputing community. It was only two years ago that ChatGPT took the world by storm. And I don't know if you saw last weekend, but Sam Altman is predicting AGI now by 2025. Let's bring us back to Earth. No pun intended. Where do you see the leading edge of AI today?
Starting point is 00:07:05 And where do you think we are going? And how is NCSA contributing? Let me start by saying that I think that AI is, in fact, a poor term for what we have today. I much prefer to call it imitation intelligence. When you say artificial intelligence, I think of intelligence by artificial means. And instead, what we've got is an amazingly powerful tool that imitates intelligence. And that's not meant to denigrate it in any way. It's tremendously powerful. And I think it's really exciting where we're going and how we're taking
Starting point is 00:07:37 advantage of that. But fundamentally, there's no comprehension in it. It's not really intelligent. It's just acting it. And humans have a very poor record at understanding the difference. In the late 60s, there was a pair of programs called ELISA and DOCTORS, which are really the first acknowledged sort of chatbots. And people took them very seriously and interacted with them as if they were human. In many ways, for a lot of people, they passed the Turing test. One of the challenges here is no one has a definition for what artificial general intelligence is. It's sort of a, I'll know it when I see it. But we're really bad at understanding what is really intelligence
Starting point is 00:08:18 and what is the appearance of intelligence. There are already people who are talking about these systems as if they're sentient, and that's an AGI. I feel pretty confident that people will claim that some of these systems meet AGI in the next couple of years, but I'm also confident other people will say no. And because we don't have a definition in a real testable way, it's not something that we can answer. A really interesting question is, do we really care? From a philosophical standpoint, yes. Maybe even from a legal standpoint, if it truly is an artificial general intelligence, does it deserve legal rights? But from a standpoint of these tools as an aid or as something that contributes to society, whether or not I say that it's an
Starting point is 00:09:02 artificial general intelligence or not is really beside the point. The question really is, can it do what I need it to do? And that's a place where I think we've seen that there's both tremendous capability and advantages, but also a lot of shortcomings. We are doing a lot in this area because, again, it's a tremendously powerful tool. We've got teams that are looking at using AI, for example, for turbulence modeling. In fact, there's an AI component to the Scramjet work that you see there. We have a Center for AI Innovation that is looking at the translational use of AI in a number of areas, including things like providing teaching assistants and providing ways to examine your documentation. So just an interesting question. If you put up documentation for a supercomputer,
Starting point is 00:09:50 how do you know what's not there? One of the ways you can do that is to ingest all of it into an AI system and then start asking it questions and see how well it does. And then link that back to, we forgot to explain how to do this. Those are things that are very hard to do for humans. Again, with National Science Foundation funding, we're providing significant resources for the nation's researchers in AI, including as part of the National AI Research Resource Pilot. Our Delta system is providing the majority of GPU cycles that are available through the NSF Access program. And as we're recording it just this morning, which was the week before supercomputing,
Starting point is 00:10:34 we just learned that the National Science Foundation has accepted our newest machine, which we call Delta AI, which is about two to three times more powerful than our current machine. And that will also be available for both Nair and Access. So we're deeply involved in it. And I think that it's going to be tremendously valuable when I think we'll find that some of the things that we ascribe to intelligence, which is not just the ability to find facts and even to put them together in new and unusual ways, but to really have that sort of comprehension and insight is something
Starting point is 00:11:02 that I don't expect us to see in the near future. I love the way that you describe what AI is today. And it's something that I've talked about quite a bit, that it's really a reflection of what we know rather than driving incredible new paths on its own. And I think that one thing that I would ask you is, as we look at where this technology is going, are we continuing towards that path? Is there a break where we actually get to true artificial intelligence in your mind? I know that's a prognostication that no one can make casually. And do you see any limitations that we're pushing up against? Yeah, so I'll take an extreme point here. From the point of view of getting to artificial intelligence, I think this is actually a dead
Starting point is 00:11:48 end. My primary work is not as an AI researcher, so I'm definitely treading on ground where I'm not as expert as some of my colleagues that focus on AI. But from all that I've seen, I think that we're building systems that, again, imitate intelligence with amazing power and value, but I don't see the focus on what I would consider necessary. So real understanding and insight. I think the hope for things like AGI is magically emergent behavior from more and more data and more and more connections. And you could argue that brains work that way, but there are lots of things in biological brains which we are not modeling in the current AI systems. And it's not clear that doing more of what we're doing will get us there. And yeah, so I'm actually
Starting point is 00:12:38 a little pessimistic on the more general true artificial intelligence with these techniques. But again, it doesn't mean that the techniques aren't incredibly valuable, and we definitely need to be doing lots of research in them. But again, if I could really define AGI, I would bet against it. We'll see how that goes, Bill, given that apparently it's arriving next year. You've written extensively on a topic that I think is going to be an acute focus on 2025, and we called it out in our recap report of OCP, compute cluster networking. Why have you focused there? Because that's really the only way that you can build bigger and bigger machines. The thing which stands in the way is physics. It's the speed of light. An interesting question
Starting point is 00:13:21 to ask people is, how many clock ticks across is your cluster? In the sense that, how many clock ticks does it take light to cross your cluster? That's a fundamental limit on how coordinated the computation can be. You can't access data that's far away in a single clock cycle. It's not a matter of engineering, it's physics. And so that means you fundamentally need to think in a different way about how you do computing. You have to think about locality, you have to focus on data, and then that means that you have to be thinking about how that data moves from place to place. And so that led me to spend a lot of my career looking at networking and programming models for
Starting point is 00:14:01 networking and algorithms that are based around how do I exploit the fact that I have to use local data more than faraway data? Are there ways to organize that faraway data so that I can get more value out of it? Recently, we've looked more at things we've called noteware algorithms. If you look at current systems, the network topology is, in fact, less important than many of us expected it to be a decade ago. What's really often a limit is the ability or the lack of ability to get data on and off a node. And so we've been focusing a lot more on what does that mean for your algorithms? What does that mean for the networking you do? For example, how many network points do you have on a node?
Starting point is 00:14:43 One is not sufficient. If you look at our two InnoCF-funded systems, the original Delta had one NIC per node. That was okay because that system was expected to serve a capacity demand, which meant that many of the applications would need only one node, and for ones that needed more, it still had a fairly high-speed network. Delta AI, which was just accepted, has one NIC per GPU. So there's four NICs per node to provide the bandwidth that's needed to support applications that need more GPU memory than we have on many of these current nodes.
Starting point is 00:15:14 We need to be looking at where the real choke points are in the hardware that we can afford to build and how we can address some of those issues with algorithms. So that might be various kinds of aggregation or better representations, more compact representations of the data. And it's also looking at different kinds of networking. One of the things which I think is interesting is the promise of ultra-Ethernet as a way to make it easier to keep up with innovation. I think that's another thing which has changed in HPC that used to be quite
Starting point is 00:15:45 reasonable to buy a machine, put it on the floor, run it for five, six, seven years, and then replace it. Rate of innovation has been accelerating. I don't think that model works anymore. And we're going to need networks that make it easier to do rolling upgrades or rolling deployments of newer technology. And we're going to have to deal with the complexities of that heterogeneity. I know some people aren't going to like to hear that, but that's just, I think, the way it's going to be. But in the end, come back, it really is driven by physics that we can't control. We have to deal with locality. We have to deal with moving data from place to place. And that's why the focus on network has been so important for me. Now, I was geeking out over the weekend reading about a paper that you were an author on
Starting point is 00:16:27 introducing CommBench. Tell me about it and how benchmarking will help push network performance further across cluster variations that we're seeing emerge. I'm glad you found that paper. That's really the work of Mert Hedayatoglu. I never get his name quite right, but Mert was a student of Wenmei Hu's, and I was on his committee, and we've been working together on this. And ComBatch is designed to better understand where are the performance thresholds for multi-GPU nodes that are communicating with other multi-GPU nodes. So it's a diagnostic tool that helps you understand where the achieved performance, for example, might fall short of what the hardware looks like it provides.
Starting point is 00:17:09 And those can be due to errors in the way some of the system code is implemented. They can reflect other constraints, which maybe aren't as apparent when you're looking at the hardware. And the thing that's really important with this is I'd combine the comm bench with other work that MERT has been doing on designing building blocks for collective routines. Things like thought products, of course, are very important for lots of applications, but particularly for a lot of applications in machine learning model training, for example. By having a benchmark that helps you understand where the performance points are, where you have to work carefully at getting the most performance, it gives you insights that help you design better algorithms. And some of the work that MERT has done has developed collective routines, which are, for example, they're competitive with Nickel,
Starting point is 00:17:57 the NVIDIA collective library, but they have the advantage that they're far more complete than Nickel. Nickel only implements a handful of collective routines. MERT, by designing the right sort of building blocks, almost instantly has a full collection. And then, of course, if you're not using video GPUs and you don't have nickel, or you do have nickel, but it doesn't work well in your configuration, which we've also run into, this sort of approach, driven by the insights that you get from the benchmarks, gives you a way to implement high-performance codes that you really can't do
Starting point is 00:18:29 in any other way. So I think that one of the things that underlay a lot of the work that I've done has been taking an analytic view of performance and identifying the right level of detail. You can get to doing cycle-accurate analyses, but that takes months if years. And the insights you get from that may not be as relevant to you as the ones that you can get for analysis that tries to hit the top 10% of the things that impact the performance. Benchmarks help you get the numbers that go with that analysis. And then the two things together allow you to design and implement algorithms that give you that analysis. And then the two things together allow you to design and implement algorithms that give you better performance.
Starting point is 00:19:07 Now, obviously, this week is supercomputing. A great week to find out what's going on within the research community. What is NCSEA showing at Supercomputing 24? Of course, we'll be talking about
Starting point is 00:19:20 our Delta and Delta AI systems. As I mentioned, they're providing resources for the National AI Research Resource Pilot. We are very excited about NAR and hope that NAR will be fully funded because right now it's a pilot where the resources are quite limited. We're working with Access and the NAR pilot to provide a lot of AI resources for the community. So that's something we're very excited about. We have a lot of applications in visualization work and collaborations that we'll be talking about.
Starting point is 00:19:50 There's some that I've mentioned, the Center for Astrophysical Surveys, for example. We're part of a new AI Institute in astronomy. It's a partnership with Northwestern and the University of Chicago, as well as Illinois, looking at using AI to gain more insights from the data that we're getting from survey astronomy, which is taking images of, for example, the whole sky and maybe doing it over and over again so that you can understand how things are changing, even transients.
Starting point is 00:20:17 We have a new ARPA-H project looking at tumor removal. So if you're a surgeon and you're removing a tumor from a patient, how do you know that you've gotten all the bad stuff without taking out too much of the good stuff? Surprisingly, there isn't actually a really good answer to this right now, but there's a lot that we believe can be done using AI and imaging and other techniques to give effectively real-time information to surgeons. And this is an ARPH project that was recently funded as a joint project between Illinois and Mayo Clinic. We'll be talking about things like that. Another initiative, which we're part of, has been looking at, could we build a computer that would simulate the climate
Starting point is 00:20:55 well enough for us to use it to guide policy discussions? Right now, even the current exascale machines can't do that. And it's an interesting question about if you were to build such a machine targeting climate modeling, what would that look like? to deal with the parts of the interactions of the climate where maybe we understand the basic physical processes, but we don't understand how they are realized at the scale of the climate. So many, many things like that. Just come by the booth and we'll have people there who be happy to talk with you about these and many other of the applications that we're doing. Bill, thanks so much for being on the show. I learned so much from you in just this short period of time. I'm sure our audience wants to engage further. And while you said come engage at the show, for those who are listening online and don't have the luxury of being in Atlanta with
Starting point is 00:21:56 us, where should they go to engage with NCSA? So the way NCSA is organized, one of my directorates is engagement. So you can contact Chuck Pawlowski or the people in engagement as a starting point. We like to collaborate with people all across the country and come to our webpage, see what we do, follow the links to people working on projects that you find interesting and see if we can develop a new collaboration. Awesome. Thank you so much for being here today. It's been so much fun. It has. Thank you very much. Thanks for joining the Tech Arena.
Starting point is 00:22:36 Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.チャンネル登録をお願いいたします。
