@HPC Podcast Archives - OrionX.net - @HPCpodcast-81: Matt Sieger of ORNL on “Discovery” after “Frontier”

Episode Date: March 26, 2024

We caught up with Matt Sieger, Project Director for the 6th iteration of the Oak Ridge Leadership Computing Facility (OLCF-6), to get a glimpse of the project, its objectives, status, and timelines. Meet Discovery, the supercomputer planned to succeed Frontier, the current #1 (at 1.19 exaflops in 64 bits), while Summit, the current #7 (at 148.8 64-bit petaflops), continues to work alongside it.

Audio: https://orionx.net/wp-content/uploads/2024/03/081@HPCpodcast_Matt-Sieger_ORNL_Discovery_20240326.mp3

Transcript
Starting point is 00:00:00 This successor to Frontier, we have a name for the system. We're going to call it Discovery. And Discovery has the objective to be a significant boost in the computational and data science capabilities over Frontier. How about the cloud? Do you see a scenario where you either spill over to some kind of a designated public, but just for you, cloud? After we do source selection this summer, we enter into a period of contract negotiations. We don't announce the contract until it actually is finalized and signed, and that will be
Starting point is 00:00:43 sometime in 2025. I would expect that to be somewhere in the spring of 2025, of course, subject to the winds of everything that can happen between now and then. We often heard that Frontier was about a $600 million system. Is there a price tag, if you will, connected to Discovery? And secondly, will it incorporate quantum computing in some way? From OrionX in association with InsideHPC, this is the @HPCpodcast. Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them.
Starting point is 00:01:22 Thank you for being with us. Hi, everyone. Welcome to the @HPCpodcast. I'm Doug Black. And today, Shaheen and I have the pleasure of welcoming Matt Sieger on the podcast. Matt is director of the OLCF-6 project for the Oak Ridge Leadership Computing Facility. Matt joined Oak Ridge Lab in 2009 as a Quality Manager, and in 2018, he moved to OLCF as Deputy Project Director for the Frontier exascale supercomputer. In 2021, he was selected to lead the effort for Frontier's successor. So, Matt, welcome. Thank you. Thank you. Glad to be here. Okay. So, we understand that your job has a focus on external relationships with technology
Starting point is 00:02:07 vendors, stakeholders from the lab and DOE, as well as other science users. Why don't we start off, if you could provide us an overview of the OLCF-6 project, its general objectives and vision, the workload problems it will address, and why it's needed. Sure. So in general, OLCF-6 is the successor to Frontier. Frontier was the first exascale system. It was deployed in 2021. It's now operational on our floor. Scientists are running every day on that machine, and we're already making plans for the system that will replace it. This successor to Frontier, we have a name for the system. We're going to call it Discovery.
Starting point is 00:02:48 And Discovery has the objective to be a significant boost in the computational and data science capabilities over Frontier. Of course, we continue to see tremendous demand for these systems. We're typically oversubscribed 5x or more for proposals to use these machines. And so we constantly need more and more capabilities. So the first order of business for Discovery is going to be giving us a boost in capability. There are a few new missions, though. If anything, the mission is expanding. We have a lot of users in traditional modeling and simulation, but we're also seeing explosive growth in artificial intelligence. Not just us, but across the entire world, everything is going AI. And so this system will be targeted at providing enhanced AI capabilities for our users. And there's another initiative from DOE called the Integrated Research Infrastructure. IRI is an effort to couple together the DOE computing facilities with experimental data facilities across the DOE complex. These experimental facilities have a need for real-time data analysis or for modeling and simulation or AI inference to
Starting point is 00:04:08 back up and control their experiments. And so DOE is looking at fielding a set of capabilities to provide seamless interoperability between the experimental facilities and the computational and data facilities. The Integrated Research Infrastructure is an effort to put that into the field. And so one of the design goals for Discovery is to interoperate with that IRI. We also have objectives to continue to improve in energy efficiency. One of the key enablers of the exascale era with Frontier was a 200-fold improvement in energy efficiency from petascale-class systems like Jaguar to Frontier. And we need to continue to push on energy efficiency in computing, not only for the good reason of reducing energy consumption in general,
Starting point is 00:05:01 but also because in our facilities, we have limited space, power, and cooling available. And we want to deploy a system that gives the most capability for science that we can possibly deliver to our users. And so we have a fixed budget, a fixed amount of power that we can deliver, and we want to get the most out of that. And so energy efficiency is a way for us to continue to push performance per watt and to deliver the most capability to our users.
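To make the fixed-envelope point concrete, here is a back-of-the-envelope sketch in Python. Only the Frontier figures are public Top500 values; the next-system power envelope and the 3x and 5x targets (a range that comes up in the next answer) are illustrative assumptions, not OLCF numbers.

```python
# Back-of-the-envelope: with a fixed facility power envelope, the capability you
# can field scales directly with energy efficiency (performance per watt).
# Only the Frontier figures below are public Top500 values; the rest are assumptions.

frontier_eflops = 1.19        # Frontier HPL result, exaflops (FP64)
frontier_power_mw = 22.7      # approximate reported power draw, MW

frontier_gf_per_w = frontier_eflops * 1e9 / (frontier_power_mw * 1e6)
print(f"Frontier: ~{frontier_gf_per_w:.0f} GF/W")

envelope_mw = 30.0            # hypothetical power envelope for the next system
for speedup in (3, 5):
    needed_gf_per_w = speedup * frontier_eflops * 1e9 / (envelope_mw * 1e6)
    print(f"{speedup}x Frontier within {envelope_mw:.0f} MW needs ~{needed_gf_per_w:.0f} GF/W, "
          f"i.e. {needed_gf_per_w / frontier_gf_per_w:.1f}x Frontier's efficiency")
```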
Starting point is 00:05:46 Okay. And by the way, do you have projected objectives for the throughput of OLCF-6? Traditionally, in past projects, we've typically given an X number for the design goal for the system. For instance, Frontier was to be a 20x performance improvement on certain applications. For OLCF-6, we're taking a little bit of a different tack. We haven't specified a speed-up goal. Instead, we've specified a budget, a power limit, and a space limit, and we're saying give us the most capability that we can fit inside of that box. So while our partner institution's NERSC-10 has stated a goal of 5x performance improvement, we explicitly did not do that. But we do expect to get a healthy boost in performance over the Frontier baseline. So a number like 3 to 5x is what we bandy about internally. 3 to 5x, okay. So my second question would be to discuss where the project currently stands within the overall plan and the timeline. Sure. So, I mean, right now we just got
Starting point is 00:06:32 our CD-1 approval. CD-1 is code within the DOE project management community for the first critical decision. It is a gate at which DOE approves our alternative selection and our cost and budget estimates. That gives us the authority to release an RFP. So we're expecting to release the request for proposals in late May. Right now, the proposal package is in the approval cycles within DOE. And we're also allowing our friends at NERSC to go ahead of us. They just released their RFP yesterday. And they have a period of time for vendors to respond. And we don't want to stomp on them and give our vendor friends too much of a headache having to do two jobs at once.
Starting point is 00:07:14 So we'll wait for NERSC to finish up, and we will release our RFP in late May. We'll be selecting a system sometime in the summer, and then we're looking at system delivery in the mid-2027 timeframe. And what's driving the timeline there is the desire to be operational by the time Frontier reaches end of life. Frontier will reach the end of its nominal lifespan in late 2028, and so Discovery has to be operational with users on it before that happens. Matt, one of the problems or challenges in HPC these days is, if we look at the history of system architecture, there was always a question: will the next one still be CPU-GPU, GPU-intensive, or will there be new technologies that would propel things forward?
Starting point is 00:08:02 When you look at strategies for system architecture and how it fits within other DOE sites and what they're pursuing, where does that land at the moment? What can be disclosed at this point? Well, our objective is to deploy the best system that we can for large-scale science. And what that means for our future is mod-sim,
Starting point is 00:08:22 but heavily infused with AI and ML. And we're also seeing a growth in pure AI applications. So there's a lot of interest within DOE in leveraging AI for science and in understanding how to do that. At the same time, the CPU-GPU architectures that we relied upon for Titan, Summit, and Frontier are continuing to evolve. The large AI systems that are being fielded today are typically GPU accelerated. We are seeing the development of custom AI accelerators, both for training and for inference. And so these are very interesting. But if you look at our workload and if you look at our user community, it's not pure AI. We do a lot of modeling and simulation that requires FP64.
Starting point is 00:09:14 And so one of the biggest differences between the needs of AI and the needs of mod-sim is in the precision, right? In the old days, we used to do mod-sim with FP32, and then we found we couldn't get good convergence with only 32 bits of precision. Then people moved to FP64. You're seeing AI kind of go the other direction. They started with FP32, and then they're finding that they can get good results with AI training with FP16 or even FP8. And so the architectures are diverging somewhat. And so the challenge for us is how do we leverage that reduced precision best? We can use AI models, of course, everywhere in mod-sim to accelerate, to help us do data analysis. But there are some interesting methods out there
Starting point is 00:09:59 for using, say, iterative refinement, right, to be able to use lower precision arithmetic to arrive at a higher precision result. As an example, Jack Dongarra's ICL lab at the University of Tennessee released the HPL-AI code, which is now called HPL-MxP. And that is essentially using mixed precision to calculate the HPL result in FP64, using reduced precision for the meat of the calculation and then iterative refinement at the end to get the 64-bit answer. We've got this running on Frontier, and we've seen tremendous boosts in performance, nearly 10 exaops of performance on Frontier using those methods. And so there's a lot of interest within DOE in how we leverage mixed precision to accelerate our science workloads.
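This is not the HPL-MxP code itself, just a minimal NumPy sketch of the iterative-refinement idea described above: do the expensive solve in low precision (float32 here stands in for the FP16/FP8 arithmetic real accelerators would use), then recover a near-FP64 answer with cheap residual corrections. The matrix, sizes, and iteration count are made up for illustration.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b to near-FP64 accuracy while doing the heavy solve in FP32."""
    A32 = A.astype(np.float32)
    # Low-precision solve; a real code would factorize A32 once (LU) and reuse it.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in FP64
        dx = np.linalg.solve(A32, r.astype(np.float32))  # correction in FP32
        x += dx.astype(np.float64)                       # refine the FP64 answer
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)          # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print("FP64 residual norm:", np.linalg.norm(b - A @ x))
```

On real hardware the payoff comes from running the low-precision work on tensor-core-style units; the refinement loop is what pulls the answer back toward 64-bit accuracy.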
Starting point is 00:10:45 But at the same time, we can't do everything with mixed precision. And so to get increased performance, really the mantra for Discovery is going to be bandwidth everywhere. We're looking for increased memory bandwidth, increased network bandwidth, increased I/O bandwidth to help us get the most out of the flops that we have and to balance the systems. Jack Dongarra, during his Turing Award talk in 2022, talked about the need for balance between memory bandwidth and flops: systems nowadays tend to be very flop-heavy, and they tend to be constrained by memory bandwidth. Our applications are no exception to this.
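A simple roofline-style estimate illustrates the balance Dongarra's talk points to: attainable performance is capped either by the compute peak or by memory bandwidth times arithmetic intensity. The per-node peak and bandwidth figures below are hypothetical, not Frontier or Discovery specifications.

```python
# Roofline sketch: performance is limited by either peak compute or by how fast
# memory can feed the arithmetic units. All figures here are hypothetical.

peak_tflops = 50.0    # assumed FP64 peak per node, TFLOP/s
mem_bw_tbs = 3.0      # assumed memory bandwidth per node, TB/s

def attainable_tflops(flops_per_byte):
    """min(compute roof, bandwidth roof x arithmetic intensity)."""
    return min(peak_tflops, flops_per_byte * mem_bw_tbs)

# Stencils, sparse solvers, and many mod-sim kernels sit around 0.1-1 flop/byte,
# so they run far below peak unless memory bandwidth grows along with the flops.
for ai in (0.125, 0.5, 2.0, 16.0, 64.0):
    p = attainable_tflops(ai)
    print(f"{ai:7.3f} flop/byte -> {p:6.1f} TFLOP/s ({100 * p / peak_tflops:3.0f}% of peak)")
```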
Starting point is 00:11:25 And I think that we're hopeful that the OLCF-6 system will help deliver some big boosts in memory bandwidth that we can use to lift all boats and to get better performance out of our science applications. Yeah, excellent. The other question I had is that some sites are now partitioning their systems into an AI zone, a 64-bit zone, and memory-optimized and storage zones. Are you envisioning something like that, or does the project call for a uniform, homogeneous, for lack of a better word, architecture? Yeah, one of the things that makes us different at the Leadership Computing Facility is that we specialize in applications that need the largest scale. So we choose applications that need at least 20% or more of the machine to solve their problems. And so that sort of tips our balance into more of a homogeneous architecture, right? It's very difficult for a user to write their code in such a way that they can make use of a wildly inhomogeneous system. So we do like to have a very large portion of the system be homogeneous. What we've asked for in our RFP is flexibility, for us to be able to add in smaller subsystems that may contain some interesting technologies like AI accelerators or FPGAs or even a quantum computer to accelerate workloads, but not have every node necessarily support those technologies. And it just occurred to me during your remarks, I'm curious, we all followed the Exascale Computing Project and the progress with Frontier. There was a lot of emphasis on the software side, co-design, and capable exascale. Are there
Starting point is 00:13:16 characteristics of the OLCF-6 project that differ from the Exascale Computing Project? No. I would say that DOE made tremendous investments through ECP in the software ecosystem to enable exascale applications, right? And those are largely architected for GPU-accelerated systems. We don't necessarily want to field something drastically different right now, because the code base is there to be able to utilize GPUs, and we want to make the most of that. Now, that said, we're not specifying the architecture to vendors in our RFP. We're saying, give us the most performance, the most bang for the buck that we can get. And if that happens to
Starting point is 00:13:56 be something very unusual, we're all ears for it. But so far in our discussions with vendors, we're not seeing anything coming up the pipeline that is radically different from what we've seen before. How about the cloud? Do you see a scenario where you either spill over to some kind of designated public, but just for you, cloud, or is it required that the system be on-prem? That's a very interesting question. So something that's very different about OLCF-6 than in the past is that we are taking cloud very seriously. Well, we've always taken it seriously in the past, but the readiness for high-performance computing in the Frontier and Summit timeframe was just not judged to be there yet.
Starting point is 00:14:37 But of course, in the AI revolution, you've seen the hyperscalers now deploying very large systems, tens of thousands of GPUs. And they're utilizing the same architectures that we utilize: high-speed networks connecting these nodes together, very fat nodes. And we certainly don't see any technical reason why you can't run our workloads in the cloud. And so we wrote our RFP explicitly to enable cloud vendors to respond to it. We are not restricting them to only on-prem deployments. We are open to off-prem. And so we've been having talks with various vendors about this.
Starting point is 00:15:15 It is something new for us. It's causing us to ask a lot of questions we haven't necessarily had to ask before. But it's exciting and it's interesting. And we are, of course, very interested to see what potential bidders may give us back for the OLCF-6 system. So cloud is definitely a possibility. The cloud vendors have done some amazing things, as everybody saw at SC last year, when Microsoft debuted at number three with their Eagle supercomputer, which is a tremendous accomplishment. And there certainly are players in this field. And they have very deep pockets. They have a lot of resources. And so I think you'll see in the future that the cloud vendors will continue to deploy very impressive HPC systems. And so DOE as a whole is working with cloud vendors, and we're
Starting point is 00:16:03 talking to them, and we're looking for ways that we can best make use of these resources for our science mission. Is there a concern that we may lose the ability to build these things ourselves, and to make sure that we remain competitive on the global stage and avoid what happened with chip making, where it was kind of entrusted to industry and, fast forward a couple of decades, suddenly we're not competitive? Or is that kind of not a concern yet, if at all? Well, I think this falls under the category of those questions that we haven't really had to ask ourselves before: what is the broader mission of ASCR and the leadership computing facilities in this space, right? In terms of maintaining a capable workforce that can do supercomputing and knows how to do these types of things. Certainly one requirement that we have is that we can't afford to allow a cloud deployment to
Starting point is 00:16:55 become a one-way street, right? We have to have the ability to pull this back. If nothing else, in five or six years, when we issue the next RFP for the next system, there's no guarantee that a cloud-based system would win that contract. And we have to have the ability to come back. And so I would expect that, at least in the near term, we don't see any changes to how we operate. We don't see any changes to our workforce. Even if we were to select a cloud-based system for OLCF-6, we would not anticipate making any real changes to our workforce mix. Do you have a general notion of
Starting point is 00:17:31 when vendor selections will be announced? Is that an appropriate question? Yeah, it typically comes later. After we do source selection this summer, we enter into a period of contract negotiations where we're working through the details. And we don't announce the contract until it actually is finalized and signed. And that will be sometime in 2025. I would expect that to be somewhere in the spring of 2025, of course, subject to the winds of everything that can happen between now and then. And of course, you can't name names, I assume. But are there any surprising names among the vendors in the mix right now? Well, as I mentioned before, we are talking to cloud vendors. That's different than what we've done before. I don't really want to say too much about surprises. I know that we've put a lot of effort and a lot of work into ensuring that we have a competitive and a vibrant environment for this.
Starting point is 00:18:27 It's in nobody's interest for there to be only one chip vendor or only one system integrator who's capable of building these systems. And so we've been very open about talking to anybody and everybody. And of course, we hope to get viable bids from anybody and everybody. So Matt, when we as individuals move from one laptop to a new laptop, it's such a pain. Now, doing that at an exascale level must be quite an ordeal, to move from Frontier to OLCF-6. What do you do with all the data, with all the storage, with all the everything? What sort of migration or transition path do you adopt? Well, this is one of the reasons why we actually
Starting point is 00:19:10 deliver these systems relatively early, right? Frontier won't reach end of life until the end of 2028, and we're asking vendors to deliver this system in early to mid 2027. And that's to give us a period of time, number one, to get the system stood up and shaken out, but also to be able to operate the systems in parallel for a period of time, to give our users time to migrate from one to the other and do that in a fashion which is least disruptive to our user programs. As far as how difficult it is to transition, ECP, of course, did a tremendous amount of work in this area on code portability, right? And I think what we've discovered is that for GPU-based architectures, it's not a really heavy lift to get the codes migrated from one architecture to another. Like, for instance, going from Summit to Frontier was not such a tremendous problem, as opposed to when GPUs were first introduced with Titan.
Starting point is 00:20:03 That, of course, was a tremendous lift, to rewrite the codes to be able to utilize the GPUs. But now the industry seems to have settled somewhat on an architecture, and we're at kind of a point of stability right now, I think. So we don't expect to see drastic changes. Of course, I would love to see a vendor propose something which gives me a good 10x performance boost over Frontier. And even if that is a novel architecture that requires us to do a lot of work rewriting the codes, we would look at that pretty seriously, right? Because that's a tremendous value, to be able to get that kind of speedup. But I think, prognosticating a bit here, I think that we won't see the same level of difficulty porting codes to future systems that we've seen in the past, that
Starting point is 00:20:45 software frameworks are getting better, and code portability in general is getting a little bit better. So I'm optimistic there. That's awesome. What do you see in storage and the architecture that you'd like for this new system? So storage is interesting. And again, something that we're doing differently with OLCF-6: if you look at what we did for Summit, right, it was a 250 petabyte storage system. Frontier is almost three times bigger than that. So typically, we architect the storage systems to work well with our mod-sim workloads. So we want them to be write performant, to be able to have very, very high write bandwidth.
Starting point is 00:21:32 With OLCF-6, of course, with the AI workload becoming so prominent, we've asked the vendors to propose to us a traditional parallel file system, but in addition to that, a smaller, what we call AI-optimized storage system. And this is going to be a resource that is much more optimized for the type of random reads that AI requires. And so we're asking for a capability that would be usable not only for traditional HPC, but also for AI. And so I think hopefully we'll see some innovative stuff there, either with a separate AI-optimized storage system or a single file system which has the performance characteristics to support both.
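The contrast between checkpoint-style writes and AI-style random reads can be sketched with plain local file I/O. This toy script only illustrates the two access patterns; the file name, block sizes, and counts are made up, and timings against a local disk or page cache say nothing about how a parallel file system or an AI-optimized store would actually perform.

```python
import os, random, time

path = "scratch_demo.bin"          # throwaway local file, hypothetical name
block = 8 * 1024 * 1024            # 8 MiB chunks: checkpoint-style sequential writes
sample = 128 * 1024                # 128 KiB records: dataloader-style random reads

t0 = time.perf_counter()
with open(path, "wb") as f:        # mod-sim pattern: stream large blocks in order
    for _ in range(64):
        f.write(os.urandom(block))
    f.flush()
    os.fsync(f.fileno())
write_s = time.perf_counter() - t0

size = os.path.getsize(path)
t0 = time.perf_counter()
with open(path, "rb") as f:        # AI pattern: many small reads at random offsets
    for _ in range(512):
        f.seek(random.randrange(0, size - sample))
        f.read(sample)
read_s = time.perf_counter() - t0

print(f"sequential write: {size / 2**20 / write_s:6.0f} MiB/s")
print(f"random reads:     {512 * sample / 2**20 / read_s:6.0f} MiB/s")
os.remove(path)
```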
Starting point is 00:22:21 Matt, a couple of quick questions. One is, we often heard that Frontier was about a $600 million system. Is there a price tag, if you will, connected to Discovery? And secondly, will it incorporate quantum computing in some way? Well, on the budget question, I'll take a broad no comment on that. I don't want to put any budget numbers out there. Of course, we are budget constrained here, right? The budgets for the Leadership Computing Facility, we don't project them to be growing by leaps and bounds, and so we're not assuming that. For the overall system price, I think more than Summit, less than Frontier is a good assumption. But of course, it all depends on the budget outlook and where that goes. In answer to the quantum question, of course, we are exploring quantum computing here at Oak Ridge National Lab. We have our Quantum Computing User Program, which is actively allocating time on various quantum computing resources; people can propose work to be done on quantum systems. And we also have a Quantum Science Center, which is led by Travis Humble. So there's a lot going
Starting point is 00:23:10 on there. But I would say that quantum computing is not quite ready yet for large-scale production, though there is, of course, a lot of activity and work going on there. And as we watch quantum mature, we want to make sure that Discovery is going to be in a position to take advantage of that when it's ready. The RFP has requirements in there for expandability in the system so that we can add new technologies to it that will benefit our workflows. That could be a quantum-based system. It could be some new and interesting AI inferencing hardware, a neuromorphic computing system, dataflow architectures, or even analog computing; all are options for us to add in as potential accelerators. The goal here really is to be adaptable. And so we want the OLCF-6 Discovery system to be
Starting point is 00:23:59 flexible and adaptable and to be able to take advantage of new things as they come to the field, and that includes quantum computing. Okay, great. My final question would be about Oak Ridge in general. If you could just catch us up on the whole range of things going on at the big candy store that is Oak Ridge for us geeky types. Oh gosh, there's a lot going on. I think if you look at the intersection of high performance computing and the rest of the lab, IRI, the Integrated Research Infrastructure, is going to play a big role there. Upgrades to the light sources, not at Oak Ridge but at other DOE facilities, the addition of a second target station at SNS, and various other upgrades to experimental facilities are increasing their data rates dramatically. There's more and more of a demand for large-scale computing resources backing these experiments.
Starting point is 00:24:52 And there's going to be a lot of work in the coming years to tie these things together in a way which is more seamless and more effortless, so that researchers can move data around transparently, operate on that data, and get results back. Workflows is the term of art nowadays: sort of standardized workflow architectures that allow us to move the data to where the compute needs to happen and vice versa, to get results back to the experiments very quickly. So I think there's going to be greater integration of the experimental facilities at Oak Ridge with the high-performance computing. But beyond that,
Starting point is 00:25:30 I mean, Oak Ridge, of course, is a very interesting and dynamic place. We have groups that are doing work in isotope production for cancer therapies, plutonium-238 production for radioisotope thermoelectric generators for Mars missions. There is just a diverse portfolio. And of course, I love it here at Oak Ridge. I find this to be an extremely rewarding place to be. And frankly, this is one of the reasons why, if you look at high-performance computing, if you look at the ability to attract and retain a very talented and diverse workforce, Oak Ridge is a magnet. It's a draw
Starting point is 00:26:06 because even though we can't pay the same kind of salaries that you might get in Silicon Valley or at some of the bigger cloud or software vendors, we're doing things here that are really cool. And they are really in the public interest, in the interest of the country. It's just such a dynamic and fun place that people stay here. You can try to attract them away with bigger salaries, and some people do, but a lot of people say, no, I like working here. This is the Leadership Computing Facility. We have Frontier here, right? We take a lot of pride in that, and really, we love what we do. Of course, we look forward to continuing to do this for as
Starting point is 00:26:45 long as we possibly can. That's fabulous. Thank you. That's great. Well, Matt, it's been a great conversation. We appreciate your time and your insights. We've been with Matt Sieger of the Discovery project. Thanks so much for your time. Thank you, Doug. That's it for this episode of the @HPCpodcast. Every episode is featured on insideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The @HPCpodcast is a production of OrionX in association with insideHPC. Thank you for listening.
