@HPC Podcast Archives - OrionX.net - @HPCpodcast-89: Rick Stevens and Mike Papka of Argonne National Lab
Episode Date: September 19, 2024
We discuss the Aurora supercomputer, Exascale, AI, reliability at scale, technology adoption agility, datacenter power and cooling, cloud computing, and quantum computing.
Transcript
From a national security standpoint, we need AI innovation, data center innovation, power innovation to happen in the United States.
That coupling, what I call the power-AI technology nexus, a play on the energy-water nexus, has to be driven by the U.S. Certainly when you have so many thousands of nodes and their components, just the sheer
large numbers are going to expose that reliability.
All these big systems suffer from that.
Most of the jobs are still in the 6,000 to 7,000 node space.
And we're seeing very good results.
People seem to be very happy with what they're seeing come out of it.
From OrionX in association with InsideHPC, this is the At HPC podcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them.
Thank you for being with us.
Hi, everyone. Welcome to the HPC podcast.
I'm Doug Black at Inside HPC with Shaheen Khan of
OrionX.net. And today we have two special guests. We have HPC industry luminary Rick Stevens. He is
associate lab director and a distinguished fellow at Argonne National Laboratory,
where he has been since 1982. And Rick's colleague at Argonne, Mike Papka. He is an Argonne senior scientist and also deputy associate lab director and division
director of the Argonne Leadership Computing Facility. Gentlemen, welcome. If we could,
let's start with an update on the Aurora supercomputer, the second American
exascale-class system.
We last talked with you at the ISC conference in May about Aurora and the drama of Aurora
exceeding the exascale barrier.
Can you provide an update on the system performance and to what extent, maybe on a percentage
basis, the full system has been installed?
Yeah, sure.
These systems are extremely complex, as you guys know.
And so we've been making steady progress since we last spoke.
For our HPL and HPL-MXP numbers, only a fraction of the system was used
in those runs, but the entire system is installed.
And on any given day, as we shake out the components,
some fraction of it is up.
And so all the nodes are in place and we're replacing the early broken pieces, but in
general, making very good progress.
We are pushing very hard towards an early 2025 release to users, though we do have science
happening on the machine now.
Most of the jobs are still in the 6,000 to 7,000 node space.
We have not done any really full, all-node runs yet,
but we're making steady progress.
And we're seeing very good results.
People seem to be very happy with what they're seeing come out of it.
We can assume we'll see a higher number next time.
So people always ask that question. Right now I'm focused on the science. I think right now, we want to get the science
users on. We want to get the end users. We may do another HPL run, but it's not my priority.
The priority is users and science. Okay. And how many blades total for the system?
Could you refresh us on that? I forget that number. Yeah, 10,624 nodes.
Okay, great.
Yeah, it's got over 10,000 nodes.
And when you're bringing up a system, you have fallout of hardware.
It's common across any big system. You have to replace nodes, replace GPUs, replace processors.
And that process has been ongoing.
And we're building up spares capacity and so on to be able to
keep the machine running for its lifespan. And so that's some of the work that's been going on.
When the DOE Exascale project started, it was obviously visible that there were different
strategies being taken in different sites. And that variation seemed to indicate that we wanted to learn about different ways of doing
this. Building exascale systems, as I like to say, is something we should be able to do a couple a
week over time. And certainly AI is like pointing in that direction. So taking different approaches
seemed like just the smart thing to do. And of course, when you do take different approaches,
one of them is going to be like better than the other or easier than the other or luckier than
the other. I feel like this whole kind of Aurora trajectory to me has been highly valuable exactly
for that reason, because we ended up learning some new things. Things didn't just work.
What are some of those learnings? Has it, in retrospect, been a good thing that we learned
all of this stuff, or has it just been pain?
Early on, of course, there was a sense of wanting substantial architectural diversity in the exascale class. Depending on how
far back you want to go, we had this notion of swim lanes, and swim lanes were basically
architectural bets in some sense, right?
Just like the vector machines were a different architectural bet than Blue Gene and so forth.
And when Aurora was committed to by the department, the original goal for Aurora was to build out a data flow machine based on Intel's CSA architecture. And that would have been, I think, in the context of your question,
a radically alternative platform than, say, a GPU-based machine. And it was interesting. We
did a ton of development work on that. Intel did a ton of development work on that. And it had some
pretty unique properties. It was much more resistant to things like what's called
control flow divergence, that is, when you have nested if statements and so on. You could maintain
performance in the face of that, whereas traditional processors have a hard time with that.
So it could access performance in different parts of the algorithm space than, say, GPUs or traditional CPUs could.
One of the challenges that Intel faced at the time
was their internal business strategy of how many diverse architectures
could they support or how many architectural families could they support.
They ultimately made a decision not to proceed with CSA
based on a business case.
And that was too bad because it would have really been an
alternative platform. They pivoted to GPUs, as everybody has been pivoting. And in some sense,
that was a positive because you were able to tap into the mainstream direction. So what we're
ending up with though in Exascale is basically GPU machines in the US. And while the micro architectural differences between say the
AMD GPUs or NVIDIA GPUs or Intel GPUs are substantial, if you dig into the details,
they're quite different. But from a programming model standpoint, they're pretty similar. And so
the diversity that we have now is much more of a supply chain diversity up to a point. And of course,
in the US, the three exascale machines, Frontier, Aurora, and El Capitan, are all very similar. You've
got GPUs, they're integrated by HPE, they have similar networks, software stacks are quite
similar. So in some sense, the strategy of having architectural diversity or supply chain diversity
has not worked out as we originally intended by trying to maintain really distinct tracks
in some sense.
But we'll see.
The future could be quite different than where we are now.
The main difference is likely to be systems that have different accelerators, perhaps AI accelerators
that are more power efficient or more performant than GPUs in some future systems. So that becomes
an option. And that's where a lot of the silicon innovation is happening in the marketplace,
right? You've got dozens of companies trying to build products that essentially compete with GPUs.
And that's one of the few places where you've got architectural innovation happening.
So we'll see.
I think the other thing that comes out of that, though, is these systems are still not easy to assemble, right?
Even the vendors who have put together the three current lead systems, they don't build these at scale anywhere but in the labs.
And those challenges are there. And no technology is solving that, right? You're putting together a bunch of
components for the very first time on the floor. And there's just a level of complexity there that
you can't get around. So the idea of putting these out every couple of weeks makes me nervous.
No, that's true.
It's interesting.
Yeah, so what Mike's saying is really correct.
If you look at even the systems that were announced,
the xAI system that was just announced yesterday or whatever,
100,000 GPUs is actually two 50,000 GPU systems integrated by two different companies.
And building systems at this 50,000, 60,000,
Aurora's over 60,000 GPUs,
is super hard. It's relatively easy to assemble that many parts in a room,
but getting it to work as a system is really challenging due to the sheer number of components,
the failure rates, the full stack that has to operate both hardware and software.
And it'll be interesting to see
whether these commercial systems, like we saw last year that Microsoft's Eagle system was on
the top 500, but it wasn't at full scale, whether or not any of these other systems that are not,
say, in the labs will end up being able to be stabilized to run something like LINPACK or
mixed precision benchmarks is unclear to me, because actually
getting the machine solid enough to run those applications, those benchmarks, is much harder
than getting them solid enough to run, say, a distributed training of an LLM where there's
robust failure management and fault tolerance layers and restart capabilities. So we'll see. Nobody has stood up, say, a million GPU class machines
and made them work yet.
We're like an order of magnitude off of the scale
where we need to be at some point.
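The reason a distributed training run can tolerate what a LINPACK-class run cannot is the checkpoint-and-restart machinery wrapped around the training loop. Here is a minimal illustrative sketch of that pattern in Python; the checkpoint path, interval, and the `train_step` callable are hypothetical placeholders, not any particular framework's API.

```python
import os
import pickle

CKPT = "checkpoint.pkl"   # hypothetical path
CKPT_EVERY = 100          # arbitrary interval, in steps

def load_checkpoint():
    # Resume from the last saved state if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "model_state": None}

def save_checkpoint(state):
    # Write to a temp file and rename, so a crash mid-write cannot corrupt it.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def train(total_steps, train_step):
    # On restart the loop resumes from the last checkpoint; at most
    # CKPT_EVERY steps of work are lost per failure.
    state = load_checkpoint()
    for step in range(state["step"], total_steps):
        state["model_state"] = train_step(state["model_state"])
        state["step"] = step + 1
        if state["step"] % CKPT_EVERY == 0:
            save_checkpoint(state)

# e.g. train(1_000, train_step=lambda s: s)  # trivial stand-in step
```

A tightly coupled benchmark or simulation that needs every node healthy for the entire run has no equally cheap recovery point, which is the distinction being drawn here.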
You also just mentioned like the 100,000 GPU
is really two times 50,000.
Yeah.
And there's also that level of, for lack of a better word,
lack of transparency.
It's like not enough information is disclosed for folks to know what it is that they're evaluating from the outside.
But generally, I think we all agree that these big things are just really hard to do in many dimensions.
Yeah. And there's new ideas needed.
So earlier this summer, there was a workshop or a conference on high
availability, right, of big systems. It used to be, if you remember, some of us are old enough,
if you go back 30, 40 years from today, there were companies that specialized in fault-tolerant
computing. Oh, yes. But for those companies, it was something like 10 processors or something,
and they would make a big deal how you could shoot a gun into it and keep running
or pull a node out and keep running and so on.
Trying to build systems with tens of thousands,
hundreds of thousands of processors
and have a mean time to failure that's measured in days
as opposed to hours or minutes is super hard.
And there hasn't been the R&D in novel strategies of making
these fault-tolerant systems scale up. In the early days of the Exascale initiative, there was
a big concern about reliability, right? So what were, if I go back 15, 20 years ago,
what were we worried about? We were worried about power, right? Our original projections showed we
needed a gigawatt of power. Well, that's also starting to come true again, right? With data centers needing gigawatts.
We were worried about reliability, because if you just took the FIT rates of the components at the time and scaled them up by the sheer number of components, you would have projections of failures
with an MTBF measured in minutes, beyond the ability of applications to cope, and so forth.
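To make that back-of-the-envelope concrete, here is a minimal sketch of the series-system failure math, assuming a hypothetical per-component FIT rate (failures per billion device-hours); the numbers are purely illustrative, not Aurora's or any vendor's.

```python
def system_mtbf_hours(component_fit, n_components):
    # Naive series model: any single component failure interrupts the job.
    # FIT = failures per 1e9 device-hours, so the aggregate failure rate is
    # n_components * component_fit / 1e9 per hour; MTBF is its inverse.
    failures_per_hour = n_components * component_fit / 1e9
    return 1.0 / failures_per_hour

# Purely illustrative numbers: 500 FIT per component, 10 million components.
print(system_mtbf_hours(500, 10_000_000))  # 0.2 hours, i.e. an MTBF of minutes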
So we had half a dozen of these kind of challenges.
So power was not completely solved.
We got within a factor of two or something of where we were trying to go.
Reliability fell off the roadmap or something.
It was like either the vendors or maybe the community just decided that we had enough
tools in our toolbox that we didn't need to do really special, higher-order things to make these machines reliable.
And I think we're starting to see that that's actually not the case, that we do need to invest more there.
Scalability has been interesting, right?
So in science, we've been able to devise algorithms that are latency-hiding to get scale.
So strong scaling has been somewhat successful.
Weak scaling has been enormously successful.
And so people don't talk so much about scaling challenges, right?
The challenges that come from scale tend to fall into the reliability bucket.
So as you scale up to use, say, 50,000 GPUs, can you make your application more fault tolerant?
Things like that.
But scale as a goal in and of itself has been achieved pretty well at Exascale.
Now, whether we can get to zettascale-class applications with the same approach is not
clear.
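For reference, the textbook versions of the strong- and weak-scaling statements above are Amdahl's law and Gustafson's law; a small sketch with a made-up serial fraction shows why weak scaling has been "enormously successful" while strong scaling plateaus.

```python
def amdahl_speedup(serial_fraction, n):
    # Strong scaling: fixed problem size, limited by the serial fraction.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def gustafson_speedup(serial_fraction, n):
    # Weak scaling: the problem grows with n, so the parallel part keeps paying.
    return serial_fraction + (1.0 - serial_fraction) * n

for n in (1_000, 10_000, 50_000):
    print(n, round(amdahl_speedup(0.001, n)), round(gustafson_speedup(0.001, n)))
```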
I think when we talk about reliability of components, certainly when you have so many
thousands of nodes and their components, just the sheer large numbers are going to expose
that reliability.
But I just want to emphasize that this is not a situation that is unique to any one site.
Like all these big systems suffer from that, right?
Sure. Every site that has large numbers of components suffers from reliability in the
mathematical sense, but not every site that has big machines is trying to run
single applications across the entire infrastructure. That's the difference.
So if you look at Microsoft's cloud or Amazon or Google, and you ask, how many applications
are trying to run on 100,000 GPUs as a single job? The answer is zero, right? LINPACK would be an
example of that, but real applications in real
clouds don't run like that. You have hundreds or thousands of services or servers. You have
hugely distributed applications. You're not trying to solve the airflow over a wing on 50,000 GPUs
as a single application. So the scientific use cases put a huge, at least the way that we currently implement, put a huge pressure on the kind of crystalline reliability of the machine.
That is where you're trying to have thousands of components all working for many hours with no failure in the entire system.
That is a very specific use case that happens in science.
It doesn't happen in almost every other
part of industry. Right. So we're talking about hero runs. Yeah, but that's the normal kind of
production in scientific computing. So I'm going to run my job at a thousand or five thousand
or ten thousand nodes, and I don't do anything different when I'm running from a thousand to
ten thousand. Whereas in commercial, that's just not how people operate.
Yeah.
Now, Rick, to come back to architectural diversity,
it sounds as though you're pointing,
if I had to guess, toward some of these AI chips,
the Cerebras, SambaNova, Groq kind of idea.
We did speak with Matt Sieger,
who is heading the OLCF-6 Next Generation Leadership System Project. And we asked him if there might
be surprises coming out of the next system in terms of types of technologies being engaged.
And he said, yes. Can you talk a little bit more about the possibility? We just saw news from
Cerebras last week, pretty eye-popping AI inference results. Yeah, 400 tokens per second on Llama 3.1.
It's easier to say there might be
surprises. That's not releasing much information. If you look at the RFP, you would say...
Go ahead and release new information. Well, no, I don't have anything new to release
other than, if you look at any of the RFPs from OLCF Next and ALCF Next, they're written in a way that vendors can be quite
innovative in responding with new technology. But if you look at the job mixes that the current
machines are supporting, say a Frontier or Aurora job mix, and you say, oh, I want to get a 5x or a
10x overall throughput improvement against a similar kind of job mix,
you might not be very surprised because the current job mix is not heavily weighted towards
AI or AI surrogates, where there could be enormous headroom
in acceleration for specialized processors. But if you've got to carry the current job mix forward, your strategy is how
to do that in some way that is cost effective. And AI accelerators, particularly low precision
accelerators, of course, can be helpful in some ways. If a vendor can find a way to use low
resolution or low precision hardware and synthesize high precision computation with that low precision hardware,
then you might see really interesting things happen, right? So rather than having, say,
dedicated 64-bit hardware, you might take your FP4 units and aggregate them in a special way
with some special sauce, right, that allows you to do high precision computation, but with natively
low precision hardware. That kind of idea has been around.
The question is whether you could pull it off in a way that makes sense, right?
I think, so that's like the kind of surprise that you might see, whether that's a surprise
in your book or not, I don't know, but that would be interesting, of course.
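The "synthesize high precision from low-precision hardware" idea has concrete published instances (for example, Ozaki-style splitting of double-precision GEMM onto low-precision matrix engines). Here is a toy numpy sketch of the underlying splitting trick, with float32 standing in for the low-precision units and float64 accumulation standing in for a wide accumulator; real schemes use more splits and far more care, so treat this as an illustration rather than a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
ref = A @ B  # float64 reference

def lp_matmul(x32, y32):
    # Model of a low-precision matrix unit: float32 operands, with products
    # accumulated in a wider register (as tensor-core-style units do).
    return x32.astype(np.float64) @ y32.astype(np.float64)

# Naive use: round the inputs to float32 and multiply once.
naive = lp_matmul(A.astype(np.float32), B.astype(np.float32))

# Split each operand into a float32 "high" part plus a float32 correction,
# then combine three low-precision products (the tiny lo*lo term is dropped).
A_hi = A.astype(np.float32); A_lo = (A - A_hi).astype(np.float32)
B_hi = B.astype(np.float32); B_lo = (B - B_hi).astype(np.float32)
split = lp_matmul(A_hi, B_hi) + lp_matmul(A_hi, B_lo) + lp_matmul(A_lo, B_hi)

print(np.abs(naive - ref).max())  # error from rounding the inputs to float32
print(np.abs(split - ref).max())  # several orders of magnitude smaller
```

The cost is three low-precision products instead of one, which only makes sense when the low-precision units are much faster or more power-efficient than native high-precision hardware, which is exactly the trade-off being described.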
We're hoping for some things, like we've been waiting for a long time for integrated
silicon photonics.
We keep
waiting. So that would be a, again, it'd be a surprise, maybe. It'd be cool. I think there's
other strategies around storage that might be super interesting that could happen. I don't know.
What do you think, Mike? I'm interested in seeing, though it will add to the complexity and challenges,
just more options for the developer in terms of what they can access.
Maybe it's in the chiplet space.
Maybe it's in just over-provisioning of resources on the nodes.
But I think you can start to tune applications more so towards their needs
than trying to hammer them into a square peg.
Yeah, I agree with that.
A couple of ideas that we keep coming
back to over the years, there was maybe 10 years ago, there was this idea that maybe everything
was going to be disaggregated. Systems would be composed of a super fast, low latency fabric
of which you would have memory servers and compute servers and storage servers and specialized
accelerators and so on. And they would be all attached to this big fabric.
And during runtime, it'd be like a software-defined machine.
You would gather up the memory units and the processors and the other units that you might
need and have a virtual overlay that you'd run.
And then when your job was done, it would all tear it down virtually and build it back
up again.
And that idea hasn't panned out.
It exists in some areas, but not in
scientific computing. And instead, what we're seeing is a movement towards ever higher levels
of integration on a node, right? So where the building block is many GPUs, many memory units
aggregated with an on-node fabric that's very fast. And the building block now
becomes maybe a node or a supernode or a pod or something. And that's working in the opposite
direction of disaggregation. We've also seen chiplets, but whether chiplets will actually
affect what we can build is not clear. The idea was that there would be a robust chiplet market where a vendor, not a vendor
like Intel or Nvidia or AMD or something, but think of it as a reseller,
would gather chiplets from the market and do a custom SKU for a bid response by combining a
bunch of chiplets in a certain way and having these unique things. That kind of market hasn't
really developed.
If we're going to keep talking about system design and options, there could be surprises.
Right now, our acquisitions are monolithic decisions made four or five, six years out
that are locked in and very rigid.
And it's not a technology innovation, but it's a mindset innovation in terms of if we
could look at a much more nimble, fluid approach to responding to things that are happening in the space, one that
doesn't lock us into a machine. If you look at Aurora, it's a unique situation. But even
if you look at Frontier and El Capitan, the designs were pretty rigid, maybe with late binding on GPUs. But
overall, you're making decisions in a space that's moving
extremely fast. I mean, there were no AI accelerators that you could acquire when we
were designing those systems. And all of a sudden, midway through the acquisition,
those pieces of technology became available. Question is, could you make changes and adapt
that would have changed the machine?
And right now, they're pretty rigid.
If you made one a week, then it would fix that, wouldn't it?
Yeah, Mike would get a lot more gray hair faster if that were the case.
Lighten up on that one.
Even one a month would be progress.
I'll take one a month.
Let's talk about power.
You said that was on your radar all along.
And of course, it turned out to eventually become the big deal that it is now. We've gone from describing data centers from square feet to megawatts. And that doesn't seem to be stopping when one looks at the roadmaps of all these systems. Let's talk a little bit about the complexity that leads to power, cooling, water, the whole thing, modernization?
Well, I think it's a new way to think about the limits at some levels, or not limits,
but like the opportunities, maybe. One reason you can think about translating power or talking
about a data center in terms of its power is that we have a canonical unit at the moment,
like a GPU that's on the order of a thousand watts, plus fractions of other stuff, network
and memory and so on. So you can think of each GPU as needing something like 1,000 to 2,000 watts.
And so when you talk about a data center with a hundred thousand GPUs, that's 150 megawatts or
200 megawatts. And it's a way to think about the challenge. It also translates, of course,
into huge operating expenses based on
where you are in the country or in the world in terms of cost, much more expensive to run a big
data center in California, say, than in Illinois or Tennessee or someplace like that. So it affects
where you can put things. And the data center markets are very sensitive to all these things.
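As a quick sanity check of the per-GPU arithmetic just described, here is a throwaway sketch; the 1 to 2 kW per GPU (including its share of memory, network, and overhead) and the 100,000-GPU count are the round numbers from the conversation, not measured figures.

```python
def datacenter_megawatts(n_gpus, watts_per_gpu):
    # Back-of-the-envelope facility power from a per-GPU power budget.
    return n_gpus * watts_per_gpu / 1e6

for w in (1_000, 1_500, 2_000):
    print(w, "W/GPU ->", datacenter_megawatts(100_000, w), "MW")
# Prints 100, 150, 200 MW for a 100,000-GPU data center; by the same
# arithmetic a gigawatt maps to roughly half a million to a million GPUs.
```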
And we see, just across the US, right, the build-out
of dozens of data centers. In some places, like in Virginia, I think it's
approaching about 20% of the total power in the state going to data centers. But there's no
sign of it really slowing down. And in fact, it's accelerating with this idea of gigawatt data
centers, which in your head you could translate to, let's say, a million-GPU kind of system,
or on the order of half a million.
And we have AI roadmaps that say,
that's just a stepping stone to where we need to go
for AGI and super intelligence at scales like that or beyond.
So the idea that we're even thinking about that is amazing, right?
If you go back five, 10 years in the past,
if you talked about a gigawatt data center, that was something to avoid. Now it's something you're
trying to build, right? So that's a huge shift in thinking. It also puts enormous pressure on
the energy system precisely at the time that we're trying to produce more green energy, right? We want renewable power or we want
low or no carbon energy sources. And it's really hard to stand up huge amounts of renewable power
or low carbon power quickly, right? The fastest kind of power you can build today is natural gas,
right? Because you can basically order it off Amazon, not literally, but effectively order a gas turbine or dozens of them and pipe it in and off you go.
Whereas building out a new wind farm or 1,000 acres of solar, 10,000 acres of solar takes many years.
Permitting a nuclear power plant takes even longer. So there's a real challenge of how do we sustain the growth of computing, particularly the AI component of it, without trashing our goals of reducing emissions, right?
So that's now a major policy question. There have been workshops around this idea of how we supply enough green energy
so that energy is not the bottleneck in building out AI. It's a major issue.
Cooling is another issue.
All the state-of-the-art big systems are liquid cooled,
but sometimes there are even problems with that in terms of where the power is.
Like there's a lot of solar and wind out in West Texas, for example, or in the deserts,
but often not a lot of water. And one of the things that we need to worry about is much more
efficient cooling schemes that don't consume huge amounts of water. So that's an R&D topic.
Now, the commercial sector is investing and growing much faster than the government sector.
I think there was a proposal in the last couple of months
to maybe build a data center, a public-private partnership around a data center with the goal
of using it as a testbed or a laboratory to investigate novel strategies at scale for
power efficiency in data centers. That's something that I think the Department of Energy Advisory
Board was recommending. So it's a hot topic. It's also interesting because the US is one of the few
global markets where we have reasonably affordable energy and a flexible environment where these
things can be built out relatively quickly. It's much more difficult in Europe, much more difficult.
You're not going to get carbon-neutral power, say, in the Middle East. You can maybe stand up a lot of fossil there. So there's an interesting confluence of where AI is
happening, where energy is affordable, and where innovation can happen. And that, I think, is
something we want to, as we look forward from a national security standpoint, we need AI innovation,
data center innovation, power innovation to happen in
the United States. That coupling, what I call the power-AI technology nexus,
a play on the energy-water nexus, has to be driven by the US. That's the future critical
bottleneck, I think, for everything that we want. Really, a lot of it just comes down to energy.
And of course, in our pre-call, we were pointing out that you have energy in the very name of the department. It is.
So you're ideally placed to study this and point the way. People have talked about small modular
reactors, about geothermal, about, of course, solar, about all these different novel ways.
We talked in our recent episode about the challenges of transmitting power,
not just generating it.
What is your guys' perspective on how all of those alternatives are coming together?
It's interesting.
At Argonne, we're actually talking about small modular reactors.
Of course, Argonne and reactors go way back; the lab was created around, I think, reactor one or two.
There's a strong interest there.
I think we're going to see how fast we can,
I say we here, collectively, humanity, how quickly can we scale
out power? That is, I think, an important question.
It also may be the case that you start to see things being co-located. The easiest thing
to do is build a data center next to where your power is because you don't need that many people to be physically
at the data center. So you could put the data center next to a hydro plant or next to a nuclear
plant in the middle of the West somewhere without really requiring lots of people to be there. It's
much harder to say, stand up a wind farm or a nuclear infrastructure or even a geothermal infrastructure
in the middle of a city. So this could result in some interesting pairings going forward.
It also could result in some interesting relationships, right? There's a discussion
about Canada has a lot of hydro and a lot of land, and maybe we should be looking at strategic
partnerships in some cases where that makes sense.
How much we're going to have to build out over the next decade to keep the AI community satiated in some sense is not clear.
Is it 10 gigawatts?
Is it 100 gigawatts?
And that scale of an infrastructure has never been contemplated before, right?
That's like multiple times more computing than we currently have
deployed in the country.
The problem is it has to start now.
Right now there's a lot of discussion
but we need to be
prototyping and building
test cases along
the way, because you're not going to start off with a
100 gigawatt solution.
Very good point.
Now Mike, you see the diversity of applications that are
coming your way. Are there apps that can actually use these outside of just machine learning model building?
Over my years, I've been constantly impressed, and I am a computer scientist, not a domain expert in
any of the spaces, but at each turn of the crank, the community steps up and
continues to add realism and fidelity to their codes. And Rick's much more an expert in the AI
space, but I think opportunities coming from AI, specifically in the biology and medical space,
are tremendous. And yes, I think we'll continue to see this. We'll see it changing the way
we do everything.
Recall that Aurora got the top benchmark number in the top 500, the HPL-MXP AI benchmark.
Could you refresh us on Aurora's characteristics that enabled it to score well in this area?
Well, Aurora has 60,000 plus GPUs, right?
Six GPUs per node.
And we have 10,600-something nodes.
A lot of nodes, a lot of GPUs.
It has double precision, single precision,
half precision, FP16, BF16.
It also has int8 precision, but does not have FP4.
It's of a generation before FP4 was a thing.
So for mixed precision, the mixed precision benchmark is a curious benchmark because
it's not a pure 16-bit computation or a pure 8-bit computation or whatever. It's a mixture,
obviously. And the benchmark number is a calculated number that says what it would have to be in
double precision if you were solving it using the baseline method. So you're allowed to use as many precisions as you want.
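The mechanism behind a mixed-precision benchmark like HPL-MXP is a low-precision factorization followed by iterative refinement back to double-precision accuracy. Here is a minimal sketch of that idea, assuming scipy is available and using float32 as the stand-in for the low-precision hardware; the actual benchmark uses FP16/tensor-core arithmetic and reports an equivalent-FP64 rate.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

# The expensive O(n^3) factorization is done in low precision (float32 here).
lu32 = lu_factor(A.astype(np.float32))

x = np.zeros(n)
for _ in range(5):                             # iterative refinement
    r = b - A @ x                              # residual in float64 (cheap, O(n^2))
    dx = lu_solve(lu32, r.astype(np.float32))  # correction via low-precision factors
    x = x + dx.astype(np.float64)

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # near float64 machine precision
```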
So the reason that it could get that number, or the highest number, is because, in 16-bit,
it has systolic arrays that are quite efficient and produce a large number of operations per
clock per execution unit at each of the GPUs. And each GPU has got a thousand execution units,
many cycles, many instructions per cycle for each GPU. So in some sense, the ratio of those things
is what determines its performance. It also of course has HBM memory, like all GPUs do these
days. It's got 128 gigabytes of HBM per GPU, and you're spending most of your time
in that memory. So it's relatively efficient and it has eight network interfaces per node.
So it has about twice as many network endpoints as Frontier. So it's more heavily
weighted toward the communication fabric. So that allowed us to get that high number, even with less than all the
nodes running, which is good. What that translates to, I think, is training or for fine-tuning
AI models, it's incredibly performant. So one of the Gordon Bell submissions that we made this year
is on direct preference optimization of our protein language model LLMs and was achieving numbers that
are on the order of an exaflop in FP16 per thousand nodes. That's about half of the peak. I think the
peak is a little over 20 exaflops in FP16. So that's pretty good. So it means for training
large language models, it's an excellent platform.
If we can get those models to train in Int8, it'll be even faster because you get about twice the throughput.
It also means for inference, it's a very good platform as well.
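Putting the quoted numbers together as a quick arithmetic check; the roughly one exaflop of sustained FP16 per thousand nodes, the "little over 20 exaflops" FP16 peak, and the 10,624-node count are all conversational figures, so the result is only a rough efficiency estimate.

```python
nodes_total = 10_624
peak_fp16_exaflops = 20.0        # "a little over 20" for the full machine
sustained_per_1k_nodes = 1.0     # quoted exaflops of FP16 per thousand nodes

peak_per_1k_nodes = peak_fp16_exaflops / (nodes_total / 1_000)
efficiency = sustained_per_1k_nodes / peak_per_1k_nodes

print(round(peak_per_1k_nodes, 2), "EF peak per 1,000 nodes")  # about 1.9
print(f"{efficiency:.0%} of peak")                             # roughly half
```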
That's excellent.
Can I ask you about cloud and how it figures currently, just as spillover capacity or a spectrum of options that are available?
And also how it appears in the future and how you may or may not want to take advantage of it.
We talked about this a little bit earlier.
We did, yes.
In this space, I don't think it's going to replace our current systems.
We're interested in it for its burst capability and potentially sending smaller jobs
out to it. For many reasons, I think it's not a solution for the labs. I believe that there's a
set of capabilities that we've built out over the years, both in terms of knowledge and skills in
our workforce, and the idea of offloading those and losing that
capability within the complex seems like a terrible waste and something that would
take forever to recover from. So we look at it, but it's not something that's on at least
ALCF's roadmap at scale, I should say. I see. So let's continue to monitor it and use it
when appropriate. Yeah. All the labs are using clouds now. It's not an either or thing.
There are lots of application scenarios where clouds are perfect, right? Let's say you have a
bunch of sensors out there that are collecting data and you want to periodically do aggregation
and cleaning that data. That's a perfect application for a cloud. You're streaming to
cloud. You don't have to build out that streaming infrastructure yourself. You can be dynamic in the cloud to scale the workflow. But there are problems. One is cost. Now you can do
reserved instances and so on. You can get the cost down a little bit, but the premium that you're
going to pay for somebody else doing all of this and providing that infrastructure is going to be
substantial. So it's going to cost you multiple factors, maybe three to five X, depending on the
details. So that's immediately a kind of a problem in that you need a lot more money for the same thing.
Mike mentioned this notion of losing human expertise and capabilities.
At the labs, our machines are not just isolated things.
They're integrated in with our infrastructure.
At Argonne, for example, Aurora is integrated with a lot of other infrastructures, but it's also in a position where we can do fast data transfers, say, to the Advanced Photon Source, right, at terabit or multiple terabits
per second. That's a capability that would be very hard to do in a cloud, right? Having many
terabits per second networking and paying for the large data volumes that you'd have to transfer in
and out of the cloud would be prohibitive on top of the computing part of it. And even the commercial sector is not all in on this. If you look at companies that are building
out their large-scale AI systems, GPU clusters at, say, 50 or 100,000 GPUs, they're either in
clouds, so they're funny money with respect to, say, the Microsoft OpenAI deal, or they're
companies that are doing it themselves.
xAI, right, are purposefully building it themselves because it's cheaper and they have more control
over it than if they were just buying it via some contract via a cloud.
It doesn't mean that people aren't going to continue to build out clouds in any sense,
right?
I'm not trying to push back on that.
It's an incredibly powerful, useful capability.
And even for the government, there are many things that the government needs to do
where clouds are perfectly the right solution. But at scale and where capability is the thing
that you're primarily trying to support, it's probably not the best solution. But we're
constantly reevaluating. That's absolutely right. I've made some statements that I don't see that
it's ALCF's future, but every year we
look at it and we've looked at it for the last 10 years and each year the story changes
and capabilities change.
I guess I should never say never.
Yeah, listening to you guys, you just get the sense that you're in these planning and
strategic positions that will impact future DOE supercomputing strategy.
Yet the technology and the situation and the workload needs and everything is changing so fast.
So how do you build flexibility into your overall strategic thinking so that you can shift,
even though these systems take years to stand up?
I think you have to be doing multiple things at once.
So most of the labs don't have a single system
or a single mission. There are multiple systems, multiple missions, test beds, and so on.
So you're doing many things at once. We are constantly trying to figure out, is it possible
to do our large systems differently, rather than these big monolithic contracts that
you have to put out a
couple of years before the system is installed. And then it takes a year or two to get the system
installed and up and running. And this is like turning an aircraft carrier. It's very
slow to change direction, right? The advantage is the system knows how to do it. So procurement
knows how to do it. The project management people know how to do it. The infrastructure people know how to do it.
But by the time you're turning the crank three or four
or five, six times, or, for NERSC, nine times or whatever, the system is very tuned
to how to execute.
But with those big procurements, it's hard to, as Mike said, inject technology at the last minute.
One thing that we're trying to figure out is how can we change, right?
So can we write contracts in a way that have multiple options?
Can you write contracts in a way that allows you to upgrade on the fly or to have resource
sharing agreements?
Maybe you don't, maybe you're not purchasing or even leasing nodes.
Maybe you're renting nodes in a different way, or maybe you have an arrangement where
you are partnered with a cloud provider. And we've had conversations with Microsoft and HPE and others on this concept
of what would it look like if our data center was a hybrid cloud and what are the advantages
and disadvantages of that? So all these ideas are being constantly discussed and it's possible
that we'll have some new way of doing things that results in changes in the future.
Like, for example, one thing that we're interested in is making it easier for startups to participate
in these large procurements and reducing the integration time from a new technology, a
new silicon technology to when it could become available in a large scale system at scale.
Right now, it's many years, right,
for something to become integrated into large-scale systems. And we would like to change that. So I
think that there are many directions this can go. And some of that's even influenced by how
clouds are thinking about it. If you talk to folks at Microsoft or Amazon or Google or Meta, they're all
building their
own hardware in addition to buying from the market. And that's something that we're also
considering, not so much that DOE is going to make its own processors, but how could
we partner in ways with non-traditional players that would give us more flexibility in our
deployment options? I'm sure you're following developments in quantum
computing. Are there any developments in that line that have caught your eye in particular?
Are you encouraged, discouraged by the progress being made overall? I guess I'm encouraged at
some level, but I don't think large-scale quantum computing is going to intersect our
practical large-scale facility roadmap for quite a while. I think there
are players out there that are taking it seriously at scale, like PsiQuantum. I think IBM's taking it
seriously. I think a few others are. I think the quantum computers are not going to be cheap.
Probably the cost of a large-scale quantum computer, large-scale like million qubit
class machine, you know, is going to be measured in the billions of dollars. And they also won't be
low power. Big cryo plants are both expensive from a capital standpoint, but they're also very
expensive from a power standpoint. So the way to think about a quantum computer that would be big
enough to really be interesting as a resource for, say, the science
community: I think about it as something that would have on the order of a million physical
qubits, allowing you to build applications that have maybe a thousand or so, maybe 10,000 logical
qubits, depending on your error correction, and machines that are stable enough to run applications for
days at a time, a single problem for many days, because that's what you need in order to have
enough computational power to be interesting on scientific problems. And at that scale,
you're talking about machines that have the same economics as the current supercomputers, right?
You need facilities that are funded at the level of a couple hundred million dollars a year, and they would serve on the order of 100 or 200 applications a year.
And we're just not there yet.
It's not that we don't want that to happen. It'd be great if it could happen, especially if we can identify enough applications
that could really do novel science, novel breakthrough level science, if they had access
to such machines. So right now it's not an option. There's just nothing at scale, nothing that
would be interesting from a science standpoint that could be built in the next couple of years.
Not for production anyway.
Everything right now is still more or less a physics experiment.
Right, yes.
Maybe worth doing.
But I think for a user facility that is trying to advance science, as opposed to advancing quantum computing technology
(okay, these are two different things, right),
we're probably 10 years away. And if you understand how we justify building large-scale scientific computers,
say in the context of DOE, there's a long roadmapping process where you accumulate
mission requirements or mission needs from the scientific community that could take advantage
of the platform. You build use cases from that. You then design a facility to try to meet those
needs or those requirements. And that first
part of trying to figure out what is it that the scientific community could do in the next, say,
five years or even 10 years that they will not be able to do on a classical machine, factoring into it
the scale and the reliability and the time-sharing or space-sharing aspects of the machine,
that case hasn't been made yet. And that's probably the next step that has to happen
by the community is to really articulate, and not by us computer scientists, but by
chemists or material scientists or other end users that are not the technology people, but the
end user people, to build that case and to really do the
analysis that would convince Congress or taxpayers in some general sense that it's worth a billion
dollars or five billion, whatever it's going to cost to do it. I think that's where we're at.
There's been progress, of course, in scaling up machines and scaling up R&D enterprises here in
Chicago. We're building out; the governor
and regional universities and labs and so on are all partnering to build this giant quantum park.
Yeah, very much, especially in Illinois, you're right.
Yeah, yeah. That's a step, but that's equivalent to like building a fab, right? It's not building,
yes, they'll build some machines, but it's not at the point where say three years from now,
we'll have a million qubit machine. It's not.
Let me ask, I'm just getting the sense that when we talk about the power requirements with GPUs
and the data center and the AI models, all of the above, and now quantum computing,
are we now reaching a time in the evolution of HPC where we're just hitting a bunch of walls again?
Because it seemed like for a few years, we were just bulldozing through a bunch of previous walls. How do you assess where we are as an industry? Are we
tackling really big problems now again? Yes. There's always some problem, whether it's...
The problems that are forcing us to pause, and now we're just going to have a whole long haul again.
I'm not sure that it's boring again. I don't characterize it as a pause.
It has paused in the past.
What would happen is we would project and then we'd say, ah, this projection doesn't
work because 15 years ago, if I tried to build a gigawatt data center, people would laugh
me out of the room.
Yes, yes.
It wasn't that I couldn't imagine it.
It's just that I couldn't get any traction with that idea.
Yeah, I agree.
Pause is a bad word.
But are we now into a boring long slog again, like we were some years ago, and then like
rapid innovation?
I don't think it will be boring because we'll have AI to entertain us.
But I think what we're maybe not seeing here is that these things are all coupled.
So if we can make enough progress in AI, it will help us in making progress in
quantum because it will help us write quantum applications. It'll help us maybe dream up better
quantum algorithms. It'll help us with breakthroughs in materials or breakthroughs in error correction or
whatever, right? So we're at this weird inflection point, I think, where weird in the sense it hasn't
happened that many times in the past, but where we have a technology that could accelerate many other technologies,
and that's AI.
Simulation was the thing that we would argue
played that role, say, for the last 30 years,
that you'd say, oh, if you wanted to design a drug,
use simulation.
You want to design a material, use simulation.
You want to design an airplane, use simulation.
That was the argument.
Now, what's the argument?
Oh, you want to do all those things?
I'm going to use AI, right?
And I'm going to accelerate my simulation.
I'm going to use AI to do that.
So it's become this general tool that affects the productivity of many people, but also
affects the utility of other tools.
And that will have some outcome.
Is it going to be everything exponential in three years?
Probably not.
But is it going to profoundly affect almost everything that we're doing? I think the answer is probably
yes. And we're not fully appreciating that because it's hard to see. It's hard to see what,
if all of us had a thousand times more, I don't know, capability in our daily work life or
whatever, how would we behave differently?
It's as if you went back 40 or 50 years ago and said, what if you had a gigabit network?
Right, back when you had dial-up, it was really hard to imagine a gigabit. What's that?
But now millions of people have gigabit service. And so you can do all kinds of things
that you couldn't do then. So the AI revolution, in some sense, is going to be like that. Most of the things that we will be
doing, say, five, 10 years from now, we're not doing today. So it's hard to say, oh, I'm just
going to do that only it's going to be faster. No, I'm going to be doing something completely
different. That's right. Okay. So that will certainly keep things interesting. Yeah. Pause
is just, we're going to probably have those pause buttons taken off of all of our devices.
There's so many other things we haven't talked about that we should, but we've also taken a lot of your time.
Always grateful for that.
Doug, any questions that we haven't asked?
There were a few, but I think we're good for now.
It was a really interesting hour of conversation.
Much appreciated.
Sure.
Anytime, guys.
Thanks.
Rick, Mike, thanks so much.
Thank you, Rick.
Thank you, Mike.
All right.
All right, guys.
Cheers.
Bye-bye.
That's it for this episode of the At HPC podcast.
Every episode is featured on InsideHPC.com and posted on OrionX.net.
Use the comment section or tweet us with any questions or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen. Thank you for listening.