@HPC Podcast Archives - OrionX.net - @HPCpodcast-81: Matt Sieger of ORNL on “Discovery” after “Frontier”
Episode Date: March 26, 2024
We caught up with Matt Sieger, Project Director for the 6th iteration of the Oak Ridge Leadership Computing Facility (OLCF-6), to get a glimpse of the project, its objectives, status, and timelines. Meet Discovery, the supercomputer planned to succeed Frontier, the current #1 (at 1.19 exaflops in 64 bits), while Summit, the current #7 (at 148.8 64-bit petaflops), continues to work alongside it.
Audio: https://orionx.net/wp-content/uploads/2024/03/081@HPCpodcast_Matt-Sieger_ORNL_Discovery_20240326.mp3
Transcript
This successor to Frontier, we have a name for the system.
We're going to call it Discovery.
And Discovery has the objective to be a significant boost in the computational and data science capabilities over Frontier.
How about the cloud?
Do you see a scenario where you either spill over to some kind of a designated
public, but just for you, cloud?
After we do source selection this summer, we enter into a period of contract negotiations.
We don't announce the contract until it actually is finalized and signed, and that will be
sometime in 2025. I would expect that to be
somewhere in the spring of 2025, of course, subject to the winds of everything that can
happen between now and then. We often heard that Frontier was about a $600 million system.
Is there a price tag, if you will, connected to Discovery? And secondly,
will it incorporate quantum computing in some way?
From OrionX in association with insideHPC, this is the @HPCpodcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications,
markets, and policies that shape them.
Thank you for being with us.
Hi, everyone.
Welcome to the @HPCpodcast. I'm Doug Black. And today, Shaheen and I have the pleasure of welcoming Matt Sieger on the podcast. Matt is director of the OLCF-6 project for the
Oak Ridge Leadership Computing Facility. Matt joined Oak Ridge Lab in 2009 as a Quality Manager,
and in 2018, he moved to OLCF as Deputy Project Director for the Frontier Exascale Supercomputer.
In 2021, he was selected to lead the effort for Frontier's successor. So, Matt, welcome.
Thank you. Thank you. Glad to be here.
Okay. So, we understand that your job has a focus on external relationships with technology
vendors, stakeholders from the lab and DOE, as well as other science users.
Why don't we start off, if you could provide us an overview of the OLCF-6 project, its general
objectives and vision, the workload problems it will address, and why it's needed.
Sure. So in general, OLCF-6 is the successor
to Frontier. Frontier was the first exascale system. It was deployed in 2021. It's now operational on
our floor. Scientists are running every day on that machine, and we're already making plans for
the system that will replace it. This successor to Frontier, we have a name for the system.
We're going to call it Discovery.
And Discovery has the objective to be a significant boost in the computational and data science capabilities over Frontier.
Of course, we continue to see tremendous demand for these systems. We're typically oversubscribed 5x or more for proposals
to use these machines. And so we constantly need more and more capabilities. So the first order of
business for Discovery is going to be giving us a boost in capability. There are a few new missions, though. If anything, the mission is expanding. We have a lot of users in traditional modeling and simulation, but we're also seeing explosive growth in artificial intelligence.
Not just us, but across the entire world, everything is going AI.
And so this system will be targeted at providing enhanced AI capabilities for our users.
And there's another initiative from DOE called the Integrated Research Infrastructure. And so IRI is an effort to couple together the DOE computing facilities with experimental data facilities across the DOE complex. These experimental facilities have a need for real-time data analysis, or for modeling and simulation or AI inference, to
back up and control their experiments. And so DOE is looking at fielding a set of capabilities to
provide seamless interoperability between the experimental facilities and the computational
data facilities. And the integrated research infrastructure is an effort to put that into
the field. And so one of the design goals for Discovery is to interoperate with that IRI.
We also have objectives to continue to improve in energy efficiency. One of the key enablers of the exascale era with Frontier was a roughly 200-fold improvement in energy efficiency from petascale-class systems like Jaguar to Frontier. And we need to continue to push on energy efficiency in computing, not only for the good reason of reducing the energy demand of computing in general,
but also because in our facilities, we have limited space, power, and
cooling available. And we want to deploy a system that gives the most capability for science that we
can possibly deliver to our users. And so we have a fixed budget, a fixed amount of power that we
can deliver, and we want to get the most out of that. And so energy efficiency is a way for us to
continue to push performance per watt and to deliver the most capability to our users.
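For a rough sense of scale on that roughly 200-fold figure, here is a back-of-the-envelope calculation. The flops and power values below are approximate round numbers drawn from public Top500/Green500-style figures, not official OLCF numbers, so treat the result as illustrative only:

```python
# Back-of-the-envelope energy-efficiency comparison (approximate, illustrative figures).
# Jaguar (petascale, ~2009): roughly 1.75 petaflops HPL at about 7 MW.
# Frontier (exascale, 2022): roughly 1.1 exaflops on its first HPL run at about 21 MW.
jaguar_flops   = 1.75e15   # FP64 flops/s, approximate
jaguar_watts   = 7.0e6     # watts, approximate
frontier_flops = 1.1e18    # FP64 flops/s, approximate
frontier_watts = 21.0e6    # watts, approximate

jaguar_gfw   = jaguar_flops / jaguar_watts / 1e9       # ~0.25 gigaflops per watt
frontier_gfw = frontier_flops / frontier_watts / 1e9   # ~52 gigaflops per watt

print(f"Jaguar:   {jaguar_gfw:.2f} GF/W")
print(f"Frontier: {frontier_gfw:.1f} GF/W")
print(f"Improvement: ~{frontier_gfw / jaguar_gfw:.0f}x")
```

With these round numbers the ratio comes out near 210x, consistent with the roughly 200-fold improvement Matt cites.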
Okay. And by the way, do you have projected objectives for the throughput of OLCF-6?
Traditionally, in past projects, we've typically given an X number for the design goal for the
system. For instance, Frontier's goal was a 20x performance improvement on certain applications.
For OLCF-6, we're taking a little bit of a different tack.
We haven't specified a speed-up goal.
Instead, we've specified a budget, a power limit, and a space limit,
and we're saying give us the most capability that we can fit inside of that box.
So while our partner institution NERSC, for their NERSC-10 system, has stated a goal of 5x performance improvement, we explicitly did not do that. But we do expect to get a healthy boost in performance over the Frontier baseline. So a number like 3 to 5x is what we bandy about
internally.

3 to 5x, okay. So my second question would be to discuss where the project currently stands within the overall plan and the timeline.

Sure. So, I mean, right now we just got
our CD-1 approval, and CD-1 is code within the DOE project management community for the first critical decision. It is a gate at which DOE approves our alternative selection and our cost and budget estimates.
That gives us the authority to release an RFP.
So we're expecting to release the request for proposals in late May.
Right now, the proposal package is in the approval cycles within DOE.
And we're also allowing our friends at NERSC to go ahead of us.
They just released their RFP yesterday. And they have a period of time for vendors to respond.
And we don't want to stomp on them and give our vendor friends too much of a headache having to do two jobs at once.
So we'll wait for NERSC to finish up, and we will release our RFP in late May.
We'll be selecting a system sometime in the summer, and then we're looking at system delivery in the mid-2027
timeframe. And what's driving the timeline there is the desire to be operational by the time Frontier reaches end of life. So Frontier will reach the end of its nominal lifespan in late 2028. And so Discovery has to be operational with users on it before that happens.
Matt, one of the problems or challenges in HPC these days is we've looked at the history
of system architecture, and there was always a question: will the next one still be a CPU-GPU, GPU-intensive system, or will there be new technologies that would propel things forward?
When you look at strategies for system architecture
and how it fits within other DOE sites
and what they're pursuing,
where does that land at the moment?
What can be disclosed at this point?
Well, our objective is to deploy the best system
that we can for large-scale science.
And what that means for our future is mod-sim,
but heavily infused with AI and ML.
And we're also seeing a growth in pure AI applications.
So there's a lot of interest within DOE in leveraging AI for science and in understanding how to do that. The CPU-GPU architectures that we relied upon for Titan, Summit, and Frontier are continuing to evolve. The large AI systems that
are being fielded today are typically GPU accelerated. We are seeing the development of
custom AI accelerators, both for training and for inference. And so these are very interesting.
But if you look at our workload and if you look at our user community, it's not pure AI.
We do a lot of modeling and simulation that requires FP64. And so one of the biggest differences between the needs of AI and the needs of mod-sim is in the precision, right? In the old days, we used to do mod-sim with FP32, and then we found we couldn't get good convergence with only 32 bits of precision. Then people moved to FP64. You're seeing AI kind of go the other direction. They started with FP32, and then they're finding that they can get good results with AI training with FP16 or even FP8. And so the architectures
are diverging somewhat. And so the challenge for us is how
do we leverage that reduced precision best? We can use AI models, of course, everywhere in mod
sim to accelerate, to help us do data analysis. But there are some interesting methods out there
for using, say, like iterative refinement, right, to be able to use lower precision arithmetic to arrive at a
higher precision result. As an example, Jack Dongarra's ICL lab at the University of Tennessee released the HPL-AI code, which is now called HPL-MXP. And that is essentially using mixed precision to calculate the HPL result in FP64, using reduced precision for the meat of the calculation
and then iterative refinement at the end to get the 64-bit answer.
We've got this running on Frontier, and we've seen tremendous boosts in performance,
nearly 10 exaOps of performance on Frontier using those methods.
And so there's a lot of interest within DOE of how do we leverage mixed precision to accelerate our science
workloads.
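To make the iterative-refinement idea concrete, here is a minimal NumPy sketch of the general technique Matt describes. This is not the HPL-MXP code; the matrix, sizes, and tolerance are arbitrary toy choices, and a production code would factor the low-precision matrix once and reuse that factorization rather than re-solving each time:

```python
import numpy as np

# Toy mixed-precision iterative refinement: solve in FP32, then refine
# with FP64 residuals until the FP64 answer converges.
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

A32 = A.astype(np.float32)                         # "cheap" low-precision copy
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

for _ in range(10):
    r = b - A @ x                                  # residual computed in FP64
    if np.linalg.norm(r) / np.linalg.norm(b) < 1e-12:
        break
    # Correction solve is also done in low precision (real codes reuse one LU factorization).
    d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    x += d

print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The bulk of the arithmetic happens in FP32, yet the loop typically reaches an FP64-quality residual within a few refinement steps, which is the same principle HPL-MXP exploits at scale.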
But at the same time, we can't do everything with mixed precision.
And so to get increased performance, really the mantra for Discovery is going to be bandwidth
everywhere.
We're looking for increased memory bandwidth, increased network bandwidth, increased IO
bandwidth to help us get the most out of the flops that we have and to balance the systems. Jack Dongarra, during his Turing Award talk in 2022, talked about the need for balance between memory bandwidth and flops, that systems nowadays tend to be very flop-heavy. They tend to be constrained by memory bandwidth. Our applications are no exception to this. And I think that we're hopeful that the OLCF-6 system will help deliver some big boosts in memory bandwidth that we can use to lift all boats and to get better performance out of our science applications.

Yeah, excellent. The other question I had is that some sites are now partitioning their systems into an AI zone, a 64-bit zone, a memory-optimized zone, and storage.
Are you envisioning something like that, or does the project call for a uniform, homogeneous, for lack of a better word, architecture?

Yeah, one of the things that makes us different at the Leadership Computing Facility is that we specialize in applications that need the largest scale.
So we choose applications that need at least 20% or more of the machine to solve their problems.
And so that sort of tips our balance into more of a homogeneous architecture, right? It's very difficult for a user to write their code in such a way that they can make use of a wildly inhomogeneous system.
So we do like to have a very large portion of the system be uniform. But what we've asked for in our RFP is flexibility, for us to be able to add in smaller subsystems that may contain some interesting technologies like AI accelerators or FPGAs or even a quantum computer to accelerate workloads, but not have every node necessarily support those technologies.

And it just occurred to me during your remarks, I'm curious: we all followed the Exascale Computing Project and the progress with Frontier. There was a lot of emphasis on the software side, co-design, and capable exascale. Are there characteristics of the OLCF-6 project that differ from the Exascale Computing Project?
No. I would say that DOE made tremendous investments in ECP
in the software ecosystem to enable exascale applications, right? And those are largely
architected for GPU accelerated architectures. We don't necessarily want to field something
drastically different right now because the code base is there to be able to utilize GPUs,
and we want to make the most of
that. Now, that said, we're not specifying to vendors in our RFP the architecture. We're saying,
give us the most performance, the most bang for the buck that we can get. And if that happens to
be something very unusual, we're all ears for it. But so far in our discussions with vendors,
we're not seeing anything coming down the pipeline that is radically different from what we've seen before.
How about the cloud?
Do you see a scenario where you either spill over to some kind of a designated public,
but just for you, cloud, or is it required that the system be on-prem?
That's a very interesting question.
So something that's very different about OLCF-6 than in the past is that we are taking cloud very seriously.
Well, we've always taken it seriously in the past, but the readiness for high-performance computing in the Frontier and Summit timeframes was just not judged to be there yet.
But of course, in the AI revolution, you've seen the hyperscalers now deploying very large systems with tens of thousands of GPUs.
And they're also utilizing the same architectures that we utilize: high-speed networks connecting these nodes together, very fat nodes. And we certainly don't see any technical reason why you can't run
our workloads in the cloud. And so we wrote our RFP explicitly to enable cloud vendors to respond to it.
We are not restricting them to only on-prem deployments.
We are open to off-prem.
And so we've been having talks with various vendors about this.
It is something new for us.
It's causing us to ask a lot of questions we haven't necessarily had to ask before.
But it's exciting and it's interesting. And we are, of
course, very interested to see what potential bidders may give us back for the OLCF-6 system.
So cloud is definitely a possibility. The cloud vendors have done some amazing things, as everybody saw at SC last year, when Microsoft debuted at number three with its Eagle supercomputer, which is a tremendous accomplishment.
And there certainly are players in this field. And they have very deep pockets. They have a lot of resources. And so I think you'll see in the future that the cloud vendors will continue to
deploy very impressive HPC systems. And so DOE as a whole is working with cloud vendors, and we're
talking to them, and we're looking for ways that we can best make use of these resources for our science mission.
Is there a concern that we may lose the ability to build these things ourselves, and to make sure that we remain competitive on the global stage and avoid what happened with chip making, where it was kind of entrusted to industry and, fast forward a couple of decades, suddenly we're not competitive? Is that a concern, or is that kind of not yet, if at all?
Well, I think this falls under the category of those questions that we haven't really had to
ask ourselves before. That is, what is the broader mission of ASCR and the leadership computing facilities in this space, right? In terms of maintaining a capable workforce that can do supercomputing, right, and knows how to do these types of things.
Certainly one requirement that we have is that we can't afford to allow a cloud deployment to
become a one-way street, right? We have to have the ability to pull this back. If nothing else,
in five years or six years, when we issue the next RFP for the next system,
there's no guarantee that a cloud-based system would win that contract.
And we have to have the ability to come back.
And so I would expect that, at least in the near term, we don't see any changes to how
we operate.
We don't see any changes to our workforce.
Even if we were to select a cloud-based system for OLCF-6, we would not anticipate making any real changes to our workforce mix.

Do you have a general notion of when vendor selections will be announced? Is that an appropriate question?

Yeah. Typically, after we do source selection this summer, we enter into a period of contract negotiations where we're working
through the details. And we don't announce the contract until it actually is finalized and
signed. And that will be sometime in 2025. I would expect that to be somewhere in the spring
of 2025, of course, subject to the winds of everything that can happen between now and then.
And of course, you can't name names, I assume. But are there any surprising names among the vendors in the mix right now?

Well, as I mentioned before, we are talking to cloud vendors. That's different from what we've done before. I think I don't really want to say too much about surprises. I know that
we've put a lot of effort and a lot of work into ensuring that we have a competitive and a vibrant environment for this.
It's in nobody's interest for there to be only one chip vendor or only one system integrator who's capable of building these systems.
And so we've been very open about talking to anybody and everybody.
And of course, we hope to get viable bids from anybody and everybody.
So Matt, when we as individuals move from a laptop to a new laptop, it's such a pain.
Like now you do that at an exascale level, that must be quite an ordeal to move from Frontier to OLCF-6.
What do you do with all the data, with all the storage, with all the everything? What sort of migration or transition path do you adopt?

Well, this is one of the reasons why we actually deliver these systems relatively early, right? Frontier won't reach end of life until the end of 2028, and we're asking vendors to deliver this system in early to mid-2027. And that's to give
us a period of time, number one, to get the system stood up and shaken out, but also to be able to operate the systems in parallel for a period of time to give our users
time to migrate from one to the other and do that in a fashion which is least disruptive to our user
programs. As far as how difficult it is to transition, ECP, of course, did a tremendous
amount of work in this area on code portability, right? And I think what we've discovered is that for GPU-based architectures, it's not a really heavy lift to get the codes
migrated from one architecture to another. Like, for instance, going from Summit to Frontier was
not such a tremendous problem, as opposed to when GPUs were first introduced with Titan. That, of course, was a tremendous lift, rewriting the codes to be able to utilize the GPUs. But now the industry seems to have settled somewhat on an
architecture where we're at kind of a point of stability right now, I think. So we don't expect
to see drastic changes. Of course, I would love to see a vendor propose something which gives me a
good 10x performance boost over Frontier. And even if that is a novel architecture that requires us to do a lot of work rewriting the codes, we would
look at that pretty seriously, right? Because that's a tremendous value to be able to get that
kind of speed up. But I think prognosticating a bit here, I think that we won't see the same
level of difficulty porting codes to future systems that we've seen in the past, that
software frameworks are getting better, code portability in general is getting a little bit
better. So I'm optimistic there.

That's awesome. What do you see in storage and the architecture that you'd like for this new system?

So storage is interesting. And again,
something that we're doing differently with OLCF-6 is that if you look at what we did for Summit, right, it was a 250 petabyte storage system. Frontier's is almost three times bigger than that. So typically, we architect the storage systems to work well with our mod-sim workloads. So we want them to be write-performant, to be able to sustain very, very high write bandwidth. With OLCF-6, of course, with the AI workload becoming so prominent,
we've asked the vendors to propose to us a traditional parallel file system.
But in addition to that, a smaller, what we call an AI-optimized storage system.
And this is going to be a resource that is much more optimized
for the type of random reads that AI requires. And so we're asking for a
capability that would be usable not only for traditional HPC, but also for AI. And so I think
hopefully we'll see some innovative stuff there, either with a separate AI-optimized storage system
or a single file system, which has the performance characteristics to support both.
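To illustrate the two access patterns Matt is contrasting, here is a toy Python sketch. The file names and sizes are hypothetical, not an OLCF configuration: a mod-sim checkpoint looks like a few very large sequential writes, while AI training issues many small reads at random offsets, which is the pattern an AI-optimized tier would be built to serve:

```python
import numpy as np

# Toy contrast of two I/O patterns (hypothetical file names and sizes).

# 1) Mod-sim checkpointing: large, sequential, write-heavy streams.
#    Parallel file systems are typically tuned for this kind of bandwidth.
state = np.zeros((4096, 4096), dtype=np.float64)   # stand-in for solver state (~128 MiB)
with open("checkpoint_0001.bin", "wb") as f:        # hypothetical path
    f.write(state.tobytes())                        # one big sequential write

# 2) AI training input: many small reads at random offsets every step.
#    This pattern is IOPS- and latency-bound rather than bandwidth-bound.
#    We reuse the file above purely as a stand-in dataset.
rng = np.random.default_rng(0)
sample_bytes = 4096
with open("checkpoint_0001.bin", "rb") as f:
    for _ in range(1000):                           # one sampling pass
        offset = int(rng.integers(0, state.nbytes - sample_bytes))
        f.seek(offset)
        sample = f.read(sample_bytes)               # small random read
```

The first pattern rewards a file system with very high streaming write bandwidth; the second rewards one that can sustain huge numbers of small random reads, which is why the RFP asks for either a separate AI-optimized tier or a single system that can do both.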
Matt, a couple of quick questions. One is, we often heard that Frontier was about a $600 million system.
Is there a price tag, if you will, connected to Discovery? And secondly, will it incorporate quantum computing in some way?
Well, on the budget question, I'll take a broad no comment on that. I don't want to put any budget numbers out there. Of course, we are budget constrained here, right?
The budgets for the leadership computing facility, we don't project them to be growing by leaps and
bounds. And so we're not assuming that. I think the overall system price, more than Summit, less than Frontier, is a good assumption. But of course, it all depends on the budget outlook and where that goes. In answer to the quantum question, of course, we are
exploring quantum computing here at Oak Ridge National Lab. We have our quantum computing user
program, which is actively allocating time on various quantum computing resources to users, so that people can propose work to be done on quantum systems.
And we also have a quantum science center, which is led by Travis Humble. So there's a lot going
on there. But I would say that quantum computing is not quite ready yet for large scale production,
but there is, of course, a lot of activity and work going on there. And as we watch quantum
mature, we want to make sure that discovery is going to be in a position to take
advantage of that when it's ready. The RFP has requirements in there for expandability in the
system so that we can add new technologies to it that will benefit our workflows. And that could
be a quantum-based system. It could be some new and interesting AI inferencing hardware, a neuromorphic computing system, dataflow architectures, or even analog computing, as an option for us to add in as a potential accelerator.
The goal here really is to be adaptable. And so we want the OLCF-6 Discovery system to be flexible and adaptable, and to be able to take advantage of new things as they come to the field,
and that includes
quantum computing.

Okay, great. My final question would be about Oak Ridge in general. If you could just catch us up on the whole range of things going on in the big candy store that is Oak Ridge, for us geeky types.

Oh gosh, there's a lot going on. I think if you look at the intersection of high-performance computing and the rest of the lab, I think IRI, the Integrated Research Infrastructure, is going to play a big role there.
Upgrades to the light sources, not at Oak Ridge, but at other DOE facilities, the addition of a second target station at SNS, and various other upgrades to experimental facilities, which are increasing their data rates dramatically.
There's more and more of a demand for large-scale computing resources backing these experiments.
And there's going to be a lot of work in the coming years to tie these things together
in a way which is more seamless and it's more effortless so that researchers can move data
around transparently, they can operate on that data and get results back. Of course, workflows is the term of art nowadays: sort of standardized workflow architectures that allow us to move the data to where the compute needs to happen and vice versa,
to get results back to the experiments very quickly.
So I think there's going to be greater integration
of the experimental facilities at Oak Ridge with the high-performance computing. But beyond that,
I mean, Oak Ridge, of course, is a very interesting and dynamic place. We have groups that are doing
work in isotope production for cancer therapies, plutonium-238 production for radioisotope thermoelectric generators for Mars missions.
There is just a diverse portfolio.
And of course, I love it here at Oak Ridge.
I find this to be an extremely rewarding place to be.
And frankly, this is one of the reasons why, if you look at high-performance computing,
if you look at the ability to attract and retain a very talented and diverse workforce,
Oak Ridge is a magnet. It's a draw
because even though we can't pay the same kind of salaries that you might get in Silicon Valley or
some of the bigger cloud or software vendors, we're doing things here that are really cool.
And they are really in the public interest, in the interest of the country. It's just such a dynamic and fun place that people stay here.
You can try to attract them away with bigger salaries, and some people do, but a lot of people say, no, I like working here.
This is the leadership computing facility.
We have Frontier here, right?
We take a lot of pride in that, and really, we love what we do.
Of course, we look forward to continuing to do this for as
long as we possibly can.

That's fabulous. Thank you. That's great. Well, Matt, it's been a great
conversation. We appreciate your time and your insights. We've been with Matt Sieger of the Discovery project. Thanks so much for your time.

Thank you, Doug.

That's it for this episode of the @HPCpodcast. Every episode is featured on insideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The @HPCpodcast is a production of OrionX in association with insideHPC.
Thank you for listening.