In The Arena by TechArena - Bill Gropp on the NCSA and the Future of Supercomputing Innovation
Episode Date: November 19, 2024. In this podcast, NCSA Director Bill Gropp explores the latest advanced computing trends, from AI innovations to groundbreaking research on supercomputing climate models, and how this technology is transforming science and society.
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Alison Klein.
Now, let's step into the arena.
Welcome to the arena. My name is Alison Klein. This week, we're coming to you
from the supercomputing event in Atlanta, Georgia, and I am so delighted to be joined by Dr. Bill
Gropp, Director of the National Center for Supercomputing Applications at the University
of Illinois. Welcome to the program, Bill. How's it going? Great. Thank you, Alison.
Bill, why don't we just get started?
I mean, the NCSA is an amazing organization, and you guys are in Champaign-Urbana, of course.
Tell me a bit about the charter and what it means to serve as the director of the organization.
So to start with, NCSA is part of the university. We are, in fact, the oldest of the campus interdisciplinary research institutes.
So NCSA is an interdisciplinary research center in the use of advanced computing in all areas of scholarship, which also operates a supercomputing center.
We were founded in January of 1986 by Larry Smarr with funding from the National Science Foundation. Larry had realized that he just couldn't get the computing that he needed to do his research and submitted a famous proposal
called the Black Proposal, for the color of its cover, an unsolicited proposal to the National
Science Foundation, which led to the formation of centers like NCSA. It is an amazing place to lead.
I have a large team of people who are passionate about computing,
and my job is really to help them follow that passion, ensure that we're helping to advance
the use of computing. So not just show people how to use what's there, but to think about where the
community should be going and help it get there. What impressed me the most when I was doing my research for this interview
is the broad purview of scientific topics that your team engages on.
What realms of research are your team studying today?
Wow, almost everything to some extent.
We benefit from being part of an amazing university
that has strengths in engineering, natural sciences, arts, and humanities,
and we work with them all. We do have a couple of areas of concentration which have emerged,
some from our history and the foundation of NCSA and others from, again, people pursuing their
passions. So astronomy is one. We have a Center for Astrophysical Surveys, which is looking at
essentially 21st century astronomy as digital.
Telescopes have become larger and more expensive, and they're all digital.
And making advances in that area really comes down to how best you can take advantage of that.
We have a Center for Digital Agriculture that takes advantage of the strength that the campus has in agriculture.
We have our Health Innovation Program Office looking at the application of advanced computing to health.
The university has what is actually the world's first engineering-based medical school,
and that sort of strength in applying computing and engineering to health is really amazing. We have work going on in earth sciences,
particularly climate, and in computational sciences. There's a DOE-funded center looking at scramjet design as a way to make it easier to get access to space. And we have cross-cutting
groups looking at the application of AI and data science.
I don't even know where to go after all of those topics. I could spend a series of podcasts asking you about all those.
But I'm just going to go to my favorite topic, which is astrophysics.
Space and discovery just happens to be something that I get really excited about.
You talked about making it easier to get into space.
What is the group studying here, and how does supercomputing help with the problem? What they're looking at is a kind of
air-breathing jet engine called a scramjet. A scramjet operates at hypersonic velocities like
Mach 5 and above. And at those speeds, you can't build a jet engine the way you think of them.
When you go and fly, you see those engines that have these big fans. If you open one up, you see all sorts of moving parts.
At hypersonic speeds, you can't do that.
You basically pull in air, spray fuel into it,
ignite it, or at least try to ignite it,
hope that it ignites,
and then design the exhaust
so that you recover as much of the forward thrust
from that burning as you can.
The environment is terrible.
One of the big problems is that at those speeds and those temperatures,
most materials just disintegrate.
And so one of the things we're looking at is how you design a scramjet:
looking at novel materials, looking at how those materials interact with the
flow, how can you design the shape to reduce the stress and improve the performance of the engine?
How can you design it to be maintainable? Because it is a terrible environment, you're probably going
to have to frequently replace parts of it. So how do you do that in a way that makes sense?
And all of this is being done primarily computationally, although as part of the predictive science program at DOE,
every year we design experiments. People go off and use petascale and aiming towards exascale
computing to model the flows, model the interactions with the materials and so forth,
and then compare to
experiments. We even have a hypersonic wind tunnel on campus, which helps us do that. But the key is
that in order to be able to do design, we need to be able to do that computationally. That's a really
exciting program. And one of the things we're looking at there is how do you write the programs
for this in a way that allows you to evolve the codes and the algorithms
as you get a better understanding. So we're in fact using a Python-based representation that
is then transformed into code that runs natively on clusters of multi-GPU nodes.
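To give a flavor of that workflow, here is a minimal, purely illustrative sketch of Python-driven code generation: a kernel is described in Python and native source is emitted from that description. The `axpy_source` name and the axpy kernel are invented for this example and are not from the actual project, which targets multi-GPU clusters rather than plain C.

```python
# Illustrative sketch of Python-driven code generation: describe a kernel
# in Python, then emit native source text from that description.
# The axpy kernel and function names are invented for this example.

def axpy_source(alpha: float, n: int) -> str:
    """Emit C source for y[i] += alpha * x[i], 0 <= i < n."""
    return (
        f"void axpy(double *y, const double *x) {{\n"
        f"    for (int i = 0; i < {n}; ++i)\n"
        f"        y[i] += {alpha} * x[i];\n"
        f"}}\n"
    )

print(axpy_source(2.0, 1024))
```

A real toolchain would feed output like this to a GPU compiler, which is what lets the high-level Python description evolve independently of the generated native code.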
That's incredible. I think that as you describe that and you think about petascale heading to
exascale computing, saying that casually, Bill, it's an interesting
thing. It's amazing to see the advancements and what it means in terms of making space more
accessible. And I can't wait to hear more about the research. But I do have to ask you and turn
the topic to AI. I think that I can't do a podcast in 2024 without talking about AI. AI clusters have obviously inherited a lot from the supercomputing community.
It was only two years ago that ChatGPT took the world by storm.
And I don't know if you saw last weekend, but Sam Altman is predicting AGI now by 2025.
Let's bring us back to Earth.
No pun intended.
Where do you see the leading edge of AI today?
And where do you think we are going?
And how is NCSA contributing?
Let me start by saying that I think that AI is, in fact, a poor term for what we have today.
I much prefer to call it imitation intelligence.
When you say artificial intelligence, I think of intelligence by artificial means.
And instead, what we've got is an amazingly
powerful tool that imitates intelligence. And that's not meant to denigrate it in any way. It's
tremendously powerful. And I think it's really exciting where we're going and how we're taking
advantage of that. But fundamentally, there's no comprehension in it. It's not really intelligent. It's just acting it. And humans have
a very poor record at understanding the difference. In the late 60s, there was a pair of programs
called ELIZA and DOCTOR, which are really the first acknowledged chatbots. And people
took them very seriously and interacted with them as if they were human. In many ways, for a lot of people, they passed the Turing test.
One of the challenges here is no one has a definition
for what artificial general intelligence is.
It's sort of a, I'll know it when I see it.
But we're really bad at understanding what is really intelligence
and what is the appearance of intelligence.
There are already people who are talking about these systems
as if they're sentient and an AGI. I feel pretty confident that people will claim that some
of these systems meet AGI in the next couple of years, but I'm also confident other people will
say no. And because we don't have a definition in a real testable way, it's not something that
we can answer. A really interesting question is, do we really care?
From a philosophical standpoint, yes. Maybe even from a legal standpoint, if it truly is an artificial general intelligence, does it deserve legal rights? But from a standpoint of these tools
as an aid or as something that contributes to society, whether or not I say that it's an
artificial general intelligence or not is really beside the point. The question really is, can it do what I need it to do?
And that's a place where I think we've seen that there's both tremendous capability and advantages,
but also a lot of shortcomings. We are doing a lot in this area because, again, it's a tremendously
powerful tool. We've got teams that are looking at using AI, for example,
for turbulence modeling. In fact, there's an AI component to the scramjet work I mentioned earlier.
We have a Center for AI Innovation that is looking at the translational use of AI in a number of
areas, including things like providing teaching assistants and providing ways to examine your documentation.
So just an interesting question. If you put up documentation for a supercomputer,
how do you know what's not there? One of the ways you can do that is to ingest all of it into an AI
system and then start asking it questions and see how well it does. And then link that back to,
we forgot to explain how to do this. Those are things that
are very hard to do for humans. Again, with National Science Foundation funding, we're providing
significant resources for the nation's researchers in AI, including as part of the National AI
Research Resource Pilot. Our Delta system is providing the majority of GPU cycles that are
available through the NSF ACCESS program.
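The documentation-gap idea Bill described a moment ago, ingest the docs, ask questions, and see which ones can't be answered, can be sketched roughly as below. This toy substitutes a naive keyword match for a real AI system, and all of the docs and questions are invented for illustration.

```python
# Sketch of the "what's missing from the docs?" loop described above.
# The answering step is a crude keyword match standing in for a real
# LLM + retrieval pipeline; docs and questions are made-up examples.

docs = {
    "login": "Connect with ssh to the login nodes and authenticate with Duo.",
    "jobs": "Submit batch jobs with sbatch; interactive jobs use srun.",
}

questions = [
    "How do I log in?",
    "How do I submit a job?",
    "How do I request more storage?",   # nothing in docs covers this
]

def can_answer(question: str) -> bool:
    """Stand-in for asking an AI system: does any doc mention a key term?"""
    stop = {"how", "do", "i", "a", "the", "more"}
    terms = {w.strip("?").lower() for w in question.split()} - stop
    text = " ".join(docs.values()).lower()
    return any(t in text for t in terms)

gaps = [q for q in questions if not can_answer(q)]
print("Questions the docs can't answer:", gaps)
```

The unanswerable questions point straight at the "we forgot to explain this" gaps; a real system would use an LLM over the ingested documentation instead of keyword overlap.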
And as we're recording this, just this morning, the week before the supercomputing conference,
we just learned that the National Science Foundation has accepted our newest machine,
which we call DeltaAI, which is about two to three times more powerful than our current machine.
And that will also be available for both NAIRR and ACCESS.
So we're deeply involved in it.
And I think it's going to be tremendously valuable. But I think we'll find that some of the things we ascribe
to intelligence, not just the ability to find facts and even to put them together in
new and unusual ways, but really having that sort of comprehension and insight, are something
that I don't expect us to see in the near future.
I love the way that you describe what AI is today. And it's something that I've
talked about quite a bit, that it's really a reflection of what we know rather than driving
incredible new paths on its own. And I think that one thing that I would ask you is, as we look at where this technology is going, are we continuing towards that path?
Is there a break where we actually get to true artificial intelligence in your mind?
I know that's a prognostication that no one can make casually.
And do you see any limitations that we're pushing up against?
Yeah, so I'll take an extreme point here.
From the point of view of getting to artificial intelligence, I think this is actually a dead
end. My primary work is not as an AI researcher, so I'm definitely treading on ground where I'm
not as expert as some of my colleagues that focus on AI. But from all that I've seen,
I think that we're building systems that, again, imitate intelligence with amazing power and value,
but I don't see the focus on what I would consider necessary: real understanding and insight.
I think the hope for things like AGI is magically emergent behavior from more and more data and more and more connections.
And you could argue that brains work that way, but there are lots of things in biological brains which we are not modeling in the current AI systems.
And it's not clear that doing more of what we're doing will get us there. And yeah, so I'm actually
a little pessimistic on the more general true artificial intelligence with these techniques.
But again, it doesn't mean that the techniques aren't incredibly valuable, and we definitely need to be doing lots of research
in them. But again, if I could really define AGI, I would bet against it. We'll see how that goes,
Bill, given that apparently it's arriving next year. You've written extensively on a topic that
I think is going to be an acute focus in 2025, and we called it out
in our recap report of OCP: compute cluster networking. Why have you focused there?
Because that's really the only way that you can build bigger and bigger machines.
The thing which stands in the way is physics. It's the speed of light. An interesting question
to ask people is, how many clock ticks across is your cluster?
In the sense that, how many clock ticks does it take light to cross your cluster?
That's a fundamental limit on how coordinated the computation can be.
You can't access data that's far away in a single clock cycle.
It's not a matter of engineering, it's physics.
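The back-of-the-envelope arithmetic behind that question is easy to run; the cluster span and clock rate below are assumed figures for illustration, not numbers from the interview.

```python
# Bill's question: how many clock ticks does it take light to cross a cluster?
# Assumed figures for illustration: a 30 m machine-room span, a 2 GHz clock.
C = 299_792_458          # speed of light in vacuum, m/s
cluster_span_m = 30.0    # assumed physical span of the cluster
clock_hz = 2.0e9         # assumed processor clock rate

ticks = (cluster_span_m / C) * clock_hz
print(f"~{ticks:.0f} clock ticks for light to cross the cluster")
```

Even at the speed of light, data on the far side of the machine is hundreds of cycles away, which is why locality has to shape the algorithms.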
And so that means you fundamentally need to think in a different way about how you do computing. You have to think about locality, you have to focus on data,
and then that means that you have to be thinking about how that data moves from place to place.
And so that led me to spend a lot of my career looking at networking and programming models for
networking and algorithms that are based around how do I exploit the fact that I have to use local data more than faraway data?
Are there ways to organize that faraway data so that I can get more value out of it?
Recently, we've looked more at things we've called node-aware algorithms.
If you look at current systems, the network topology is, in fact, less important than many of us expected it to be a decade ago.
What's really often a limit is the ability or the lack of ability to get data on and off a node.
And so we've been focusing a lot more on what does that mean for your algorithms?
What does that mean for the networking you do?
For example, how many network points do you have on a node?
One is not sufficient. If you look at our two NSF-funded systems, the original Delta had one NIC per node.
That was okay because that system was expected to serve a capacity demand, which meant that
many of the applications would need only one node, and for ones that needed more, it still
had a fairly high-speed network.
DeltaAI, which was just accepted, has one NIC per GPU.
So there's four NICs per node to provide the bandwidth that's needed
to support applications that need more GPU memory
than we have on many of these current nodes.
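The injection-bandwidth arithmetic behind that one-NIC-per-GPU choice can be sketched like this; the 200 Gb/s per-NIC rate is an assumption for illustration, not a quoted spec of either system.

```python
# Per-node injection bandwidth under the two designs discussed above.
# The per-NIC rate is an assumed figure, not a quoted spec.
nic_gbps = 200                      # assumed NIC rate, Gb/s
gpus_per_node = 4

one_nic_total = nic_gbps                  # Delta-style: one NIC per node
per_gpu_total = nic_gbps * gpus_per_node  # DeltaAI-style: one NIC per GPU

print(f"1 NIC/node: {one_nic_total} Gb/s total, "
      f"{one_nic_total / gpus_per_node:.0f} Gb/s per GPU")
print(f"1 NIC/GPU : {per_gpu_total} Gb/s total, {nic_gbps} Gb/s per GPU")
```

With one NIC shared by four GPUs, each GPU sees only a quarter of the node's injection bandwidth; one NIC per GPU removes that on/off-node choke point.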
We need to be looking at where the real choke points are
in the hardware that we can afford to build
and how we can address some of those issues with algorithms.
So that might be
various kinds of aggregation or better representations, more compact representations
of the data. And it's also looking at different kinds of networking. One of the things which I
think is interesting is the promise of Ultra Ethernet as a way to make it easier to keep
up with innovation. I think that's another thing which has changed in HPC. It used to be quite
reasonable to buy a machine, put it on the floor, run it for five, six, seven years, and then replace
it. The rate of innovation has been accelerating. I don't think that model works anymore. And we're
going to need networks that make it easier to do rolling upgrades or rolling deployments of newer
technology. And we're going to have to deal with the complexities of that heterogeneity. I know some people aren't going to like to hear that,
but that's just, I think, the way it's going to be. But in the end, to come back to it, it really is driven
by physics that we can't control. We have to deal with locality. We have to deal with moving data
from place to place. And that's why the focus on network has been so important for me.
Now, I was geeking out over the weekend reading about a paper that you were an author on
introducing CommBench. Tell me about it and how benchmarking will help push network performance
further across cluster variations that we're seeing emerge.
I'm glad you found that paper. That's really the work of Mert Hidayetoglu. I never get his name
quite right, but Mert was a student of
Wen-mei Hwu's, and I was on his committee, and we've been working together on this.
And CommBench is designed to better understand where are the performance thresholds for multi-GPU
nodes that are communicating with other multi-GPU nodes. So it's a diagnostic tool that helps you
understand where the achieved performance, for example, might fall short of what the hardware looks like it provides.
And those can be due to errors in the way some of the system code is implemented.
They can reflect other constraints, which maybe aren't as apparent when you're looking at the hardware. And the thing that's really important with this is how I combine
CommBench with other work that Mert has been doing on designing building blocks for collective
routines. Things like dot products, of course, are very important for lots of applications, but
particularly for a lot of applications in machine learning model training, for example. By having a
benchmark that helps you understand where the performance points are, where you have to work carefully at getting the most performance, it gives you
insights that help you design better algorithms. And some of the work that Mert has done has
developed collective routines which are, for example, competitive with NCCL,
the NVIDIA collective library, but they have the advantage that they're far more complete than
NCCL. NCCL only implements a handful of collective routines.
Mert, by designing the right sort of building blocks,
almost instantly has a full collection.
And then, of course, if you're not using NVIDIA GPUs and you don't have NCCL,
or you do have NCCL, but it doesn't work well in your configuration,
which we've also run into,
this sort of approach, driven by the insights that you get from the benchmarks, gives you a way to implement high-performance codes that you really can't do
in any other way. So I think that one of the things that underlay a lot of the work that I've done
has been taking an analytic view of performance and identifying the right level of detail.
You can get to doing cycle-accurate analyses, but that takes months
if years. And the insights you get from that may not be as relevant to you as the ones that you
can get for analysis that tries to hit the top 10% of the things that impact the performance.
Benchmarks help you get the numbers that go with that analysis. And then the two things together
allow you to design and implement algorithms that give you that analysis. And then the two things together allow you to design and implement algorithms
that give you better performance.
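To make the idea of a collective building block concrete, here is a toy ring all-reduce, the classic pattern such libraries implement, simulated in plain single-process Python. It is not code from CommBench or Mert's library; real implementations overlap the sends and receives across the network.

```python
# Toy ring all-reduce across P simulated ranks: the classic building-block
# collective (NCCL's allreduce uses this pattern). Single-process simulation.

def ring_allreduce(data):
    """Elementwise sum-allreduce. data: one equal-length list per rank.
    Returns the per-rank results (all equal to the elementwise sum)."""
    p, n = len(data), len(data[0])
    assert n % p == 0, "toy version: buffer length must divide by rank count"
    k = n // p
    # Split each rank's buffer into p chunks.
    chunks = [[list(data[r][c * k:(c + 1) * k]) for c in range(p)]
              for r in range(p)]

    # Phase 1: reduce-scatter. At step s, rank r forwards chunk (r - s) % p
    # to its ring neighbor, which accumulates it into its own copy.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, chunks[r][(r - s) % p]) for r in range(p)]
        for r, c, payload in sends:
            dst = (r + 1) % p
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Now rank r holds the fully reduced chunk (r + 1) % p.
    # Phase 2: all-gather. At step s, rank r forwards chunk (r + 1 - s) % p.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, chunks[r][(r + 1 - s) % p])
                 for r in range(p)]
        for r, c, payload in sends:
            chunks[(r + 1) % p][c] = list(payload)

    return [[x for c in range(p) for x in chunks[r][c]] for r in range(p)]

print(ring_allreduce([[1, 2], [10, 20]]))  # → [[11, 22], [11, 22]]
```

Each rank only ever talks to its ring neighbor, so the pattern is bandwidth-optimal for large buffers, and the same chunked send/accumulate building blocks compose into the other collectives (reduce-scatter, all-gather, broadcast).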
Now, obviously,
this week is supercomputing.
A great week to find out
what's going on
within the research community.
What is NCSA showing
at Supercomputing 24?
Of course, we'll be talking about
our Delta and Delta AI systems.
As I mentioned,
they're providing resources for the National AI
Research Resource Pilot. We are very excited about NAIRR and hope that NAIRR will be fully funded
because right now it's a pilot where the resources are quite limited. We're working with ACCESS and
the NAIRR pilot to provide a lot of AI resources for the community. So that's something we're very
excited about. We have a lot of applications in visualization work
and collaborations that we'll be talking about.
There's some that I've mentioned,
the Center for Astrophysical Surveys, for example.
We're part of a new AI Institute in astronomy.
It's a partnership with Northwestern
and the University of Chicago, as well as Illinois,
looking at using AI to gain more insights from the data that we're
getting from survey astronomy, which is taking images of, for example, the whole sky and maybe
doing it over and over again so that you can understand how things are changing, even transients.
We have a new ARPA-H project looking at tumor removal. So if you're a surgeon and you're
removing a tumor from a patient,
how do you know that you've gotten all the bad stuff without taking out too much of the good
stuff? Surprisingly, there isn't actually a really good answer to this right now, but there's a lot
that we believe can be done using AI and imaging and other techniques to give effectively real-time
information to surgeons. And this is an ARPA-H project that was recently funded as a joint
project between Illinois and Mayo Clinic. We'll be talking about things like that. Another initiative,
which we're part of, has been looking at, could we build a computer that would simulate the climate
well enough for us to use it to guide policy discussions? Right now, even the current
exascale machines can't do that. And it's an interesting question: if you were to build such a machine targeting climate modeling, what would that look like? Part of the challenge is dealing with the parts of the climate interactions where maybe we understand the basic
physical processes, but we don't understand how they are realized at the scale of the climate.
So many, many things like that. Just come by the booth and we'll have people there who will
be happy to talk with you about these and many other of the applications that we're doing.
Bill, thanks so much for being on the show. I learned so much from you in just this
short period of time. I'm sure our audience wants to engage further. And while you said come engage
at the show, for those who are listening online and don't have the luxury of being in Atlanta with
us, where should they go to engage with NCSA? So the way NCSA is organized, one of my directorates is engagement.
So you can contact Chuck Pawlowski or the people in engagement as a starting point.
We like to collaborate with people all across the country and come to our webpage, see what we do, follow the links to people working on projects that you find interesting and see if we can develop a new collaboration.
Awesome. Thank you so much for being here today.
It's been so much fun.
It has.
Thank you very much.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.