@HPC Podcast Archives - OrionX.net - @HPCpodcast-89: Rick Stevens and Mike Papka of Argonne National Lab

Episode Date: September 19, 2024

We discuss the Aurora supercomputer, exascale, AI, reliability at scale, technology adoption agility, datacenter power and cooling, cloud computing, and quantum computing.

Transcript
Starting point is 00:00:00 From a national security standpoint, we need AI innovation, data center innovation, power innovation to happen in the United States. That coupling, what I call the power-AI technology nexus, a play on the energy-water nexus, has to be driven by the U.S. Certainly when you have so many thousands of nodes and their components, just the sheer large numbers are going to expose that reliability. All these big systems suffer from that. Most of the jobs are still in the 6,000 to 7,000 node space. And we're seeing very good results. People seem to be very happy with what they're seeing come out of it. From OrionX in association with InsideHPC, this is the At HPC podcast.
Starting point is 00:00:51 Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them. Thank you for being with us. Hi, everyone. Welcome to the HPC podcast. I'm Doug Black at Inside HPC with Shaheen Khan of OrionX.net. And today we have two special guests. We have HPC industry luminary Rick Stevens. He is associate lab director and a distinguished fellow at Argonne National Laboratory, where he has been since 1982. And Rick's colleague at Argonne, Mike Papka. He is an Argonne senior scientist and also deputy associate lab director and division director of the Argonne Leadership Computing Facility. Gentlemen, welcome. If we could, let's start with an update on the Aurora supercomputer, the second American exascale-class system. We last talked with you at the ISC conference in May about Aurora and the drama of Aurora exceeding the exascale barrier. Can you provide an update on the system performance and to what extent, maybe on a percentage basis, the full system has been installed? Yeah, sure.
These systems are extremely complex, as you guys know. And so we've been making steady progress since we last spoke. For our HPL numbers and MXP numbers, while only a fraction of the system was used in those runs, the entire system is installed. And on any given day, as we shake out the components, some fraction of it is up. And so all the nodes are in place and we're replacing the early broken pieces, but in general, making very good progress.
Starting point is 00:02:36 We are pushing very hard towards early 2025 release to users, though we do have science happening on the machine now. Most of the jobs are still in the 6,000 to 7,000 node space. We have not done any full, really full node runs yet, but steady progress is progressing. And we're seeing very good results. People seem to be very happy with what they're seeing come out of it. We can assume we'll see a higher number at the next time.
Starting point is 00:03:03 So people always ask that question. Right now I'm focused on the science. I think right now, we want to get the science users on. We want to get the end users. We may do another HPL run, but it's not my priority. The priority is users and science. Okay. And how many blades total for the system? Could you refresh us on that? Forget that number. Yeah, 10,624 nodes. Okay, great. Yeah, it's got over 10,000 nodes. And when you're bringing up a system, you have fallout of hardware. It's common across any big system. You have to replace nodes, replace TPUs, replace processors.
Starting point is 00:03:39 And that process has been ongoing. And we're building up spares,ares capacity and so on to be able to keep the machine running for its lifespan. And so that's some of the work that's been going on. When the DOE Exascale project started, it was obviously visible that there were different strategies being taken in different sites. And that variation seemed to indicate that we wanted to learn about different ways of doing this. Building exascale systems, as I like to say, is something we should be able to do a couple a week over time. And certainly AI is like pointing in that direction. So taking different approaches seemed like just smart thing to do. And of course, when you do take different approaches,
Starting point is 00:04:25 one of them is going to be like better than the other or easier than the other or luckier than the other. I feel like this whole kind of Aurora trajectory to me has been highly valuable exactly for that reason, because we ended up learning some new things. Things didn't just work. What are some of those learnings? Has that, in retrospect, been a good thing that we learned all of this stuff, or it's just been pain? There was, early on, of course, there was a sense of having substantial architectural diversity in exascale class. So if you go, depending on how far back you want to go, we had this notion of swim lanes, and swim lanes were basically architectural bets in some sense, right?
Starting point is 00:05:05 Just like the vector machines were a different architectural bet than Blue Gene and so forth. And when Aurora was committed to by the department, the original goal for Aurora was to build out a data flow based on Intel's CSA architecture. And that would have been, I think, in the context of your question, a radically alternative platform than, say, a GPU-based machine. And it was interesting. We did a ton of development work on that. Intel did a ton of development work on that. And it had some pretty unique properties. Like it, as an, it was much more resistant to things like what's called control flow divergence, that is when you have nested if statements on. You could maintain performance in the face of that, whereas traditional processors have a hard time with that. So it could access performance in different parts of the algorithm space than, say, GPUs or traditional CPUs could.
Starting point is 00:06:05 One of the challenges that Intel faced at the time was their internal business strategy of how many diverse architectures could they support or how many architectural families could they support. They ultimately made a decision not to proceed with CSA based on a business case. And that was too bad because it would have really been an alternative platform. They pivoted to GPUs and as everybody's been pivoting. And in some sense, that was a positive because you were able to tap into the mainstream direction. So what we're
Starting point is 00:06:38 ending up with though in Exascale is basically GPU machines in the US. And while the micro architectural differences between say the AMD GPUs or NVIDIA GPUs or Intel GPUs are substantial, if you dig into the details, they're quite different. But from a programming model standpoint, they're pretty similar. And so the diversity that we have now is much more of a supply chain diversity up to a point. And of course, in the US, the three exascale machines, Frontier, Aurora, and Alkaff, are all very similar. You've got GPUs, they're integrated by HPE, they have similar networks, software stacks are quite similar. So in some sense, the strategy of having architectural diversity or supply chain diversity has not worked out as we originally intended by trying to maintain really distinct tracks
Starting point is 00:07:34 in some sense. But we'll see. The future could be quite different than where we are now. The main difference is likely to be systems that have different accelerators, perhaps AI accelerators that are more power efficient or more performant than GPUs in some future systems. So that becomes an option. And that's where a lot of the silicon innovation is happening in the marketplace, right? You've got dozens of companies trying to build products that essentially compete with GPUs. And that's one of the few places where you've got architectural innovation happening.
Starting point is 00:08:11 So we'll see. I think the other thing that comes out of that, though, is these systems are still not easy to assemble, right? Even the vendors who have put together the three current lead systems, they don't build these at scale anywhere but in the labs. And those challenges are labs. And those challenges are there. And no technology is solving that, right? You're putting together a bunch of components for the very first time on the floor. And there's just a level of complexity there that you can't get around. So the idea of putting these out every couple of weeks makes me nervous. No, that's true.
Starting point is 00:08:46 It's interesting. Yeah, so to what Mike's saying is really correct. If you look at even the systems that were announced, the XAI system that was just announced yesterday or whatever, 100,000 GPUs is actually two 50,000 GPU systems integrated by two different companies. And building systems at this 50,000, 60,000, Aurora's over 60,000 GVUs is super hard. And once you, it's easy, relatively easy to assemble that many parts in a room,
Starting point is 00:09:13 but getting it to work as a system is really challenging due to the sheer number of components, the failure rates, the full stack that has to operate both hardware and software. And it'll be interesting to see whether these commercial systems, like we saw last year that Microsoft's Eagle system was on the top 500, but it wasn't at full scale, whether or not any of these other systems that are not, say, in the labs will end up being able to be stabilized to run something like Linpack or mixed precision benchmarks is unclear to me because actually getting the machine solid enough to run something like LINPACK or mixed precision benchmarks is unclear to me because actually getting the machine solid enough to run those applications, those benchmarks, is much harder
Starting point is 00:09:52 than getting them solid enough to run, say, a distributed training of an LLM where there's robust failure management and fault tolerance layers and restart capabilities. So we'll see. Nobody has stood up, say, a million GPU class machines and made them work yet. We're like an order of magnitude off of the scale where we need to be at some point. You also just mentioned like the 100,000 GPU is really two times 50,000. Yeah.
Starting point is 00:10:20 And there's also that level of, for lack of a better word, lack of transparency. It's like not enough information is disclosed for folks to know what it is that they're evaluating from the outside. But generally, I think we all agree that these big things are just really hard to do in many dimensions. Yeah. And there's new ideas needed. So there was earlier this summer, there was a workshop or a conference on high availability, right, of big systems. So it used to be, if you remember, some of us are old enough, if you go back 30, 40 years from today, there were companies that specialized in fault-tolerant
Starting point is 00:10:56 computing. Oh, yes. But for those companies, it was something like 10 processors or something, and they would make a big deal how you could shoot a gun into it and keep running or pull a node out and keep running and so on. Trying to build systems with tens of thousands, hundreds of thousands of processors and have a mean time to failure that's measured in days as opposed to hours or minutes is super hard. And there hasn't been the R&D in novel strategies of making
Starting point is 00:11:27 these fault-tolerant systems scale up. In the early days of the Exascale initiative, there was a big concern about reliability, right? So what were, if I go back 15, 20 years ago, what were we worried about? We were worried about power, right? Our original projection showed we needed a gigawatt of power. Well, that's also starting to come true again, right? Our original projection showed we needed a gigawatt of power. Well, that's also starting to come true again, right? With data centers needing gigawatts. We were worried about reliability because if you just took the fit rates of the components at the time and scaled them up by the sheer number of components, you would have projections of failures that were minutes of MTBF and the ability of applications and so forth. So we had half a dozen of these kind of challenges.
Starting point is 00:12:07 So power was not completely solved. We got within a factor of two or something of where we were trying to go. Reliability fell off the roadmap or something. It was like either the vendors or maybe the community just decided that we had enough tools in our toolbox that we didn't need to do special, really higher order special things to make these machines reliable. And I think we're starting to see that that's actually not the case, that we do need to invest more there. Scalability has been interesting, right? So in science, we've been able to devise algorithms that are latency hiding to get scale.
Starting point is 00:12:43 So strong scaling has been somewhat successful. So strong scaling has been somewhat successful. Weak scaling has been enormously successful. And so people don't talk so much about scaling challenges, right? The challenges that come from scale tend to fall into the reliability. So as you scale up to use, say, 50,000 GPUs, can you make your application more fault tolerant? Things like that. But scale as a goal in and of itself has been achieved pretty well at Exascale. Now, whether we can get to Zeta scale class applications with the same approach is not
Starting point is 00:13:14 clear. I think when we talk about reliability of components, certainly when you have so many thousands of nodes and then components, just the sheer large numbers is going to expose that reliability. But I just want to emphasize that this is not a situation that is unique to any one site. Like all these big systems suffer from that, right? Sure. Every site that has large numbers of components suffers from reliability in the mathematical sense, but not every site that has big machines are trying to run
Starting point is 00:13:46 single applications across the entire infrastructure. That's the difference. So if you look at Microsoft Cloud or Amazon or Google, or when you say, how many applications are trying to run on 100,000 GPUs as a single job? The answer is zero, right? Linpack would be an example of that, but real applications in real clouds don't run like that. You have hundreds or thousands of services or servers. You have hugely distributed applications. You're not trying to solve the airflow over a wing on 50,000 GPUs as a single application. So the scientific use cases put a huge, at least the way that we currently implement, put a huge pressure on the kind of crystalline reliability of the machine. That is where you're trying to have thousands of components all working for many hours with no failure in the entire system.
Starting point is 00:14:39 That is a very specific use case that happens in science. It doesn't happen in almost every other part of industry. Right. So we're talking about hero runs. Yeah, but that's the normal kind of production's idea in scientific computing. So I'm going to run my job at a thousand or five thousand or ten thousand nodes, and I don't do anything different when I'm running from a thousand to ten thousand. Whereas in commercial, that's just not how people operate. Yeah. Now, Rick, to talk to you about architectural diversity,
Starting point is 00:15:09 you said it sounds as though you're pointing, if I had to guess, toward some of these AI chips, Cerebra, Samba, McGraw kind of idea. We did speak with Matt Seeger, who is heading the OLCF6 Next Generation Leadership System Project. And we asked him if there might be surprises coming out of the next system in terms of types of technologies being engaged. And he said, yes. Can you talk a little bit more about the possibility? We just saw news from Cerebrus last week, pretty eye-popping AI inference results. Yeah, 400 tokens per second on a 3.1.
Starting point is 00:15:44 It's easier to say there might be surprises. It's not releasing much information. If you look at the RFP, you would say... Go ahead and release new information. Well, no, I'm not, I don't have anything new to release other than if you look at any of the RFP from OLCF Next and ALCF Next, they're pretty, they're written in a way that vendors can be quite innovative in responding with new technology. But if you look at the job mixes that the current machines are supporting, say a Frontier or Aurora job mix, and you say, oh, I want to get a 5x or a 10x overall throughput improvement against a similar kind of job mix, you might not be very surprised because the current job mix is not heavily weighted towards
Starting point is 00:16:32 AI or it's not heavily weighted towards AI surrogates where there could be enormous headroom in acceleration for specialized processors. But if you've got to carry the current job mix forward, your strategy is how to do that in some way that is cost effective. And AI accelerators, particularly low precision accelerators, of course, can be helpful in some ways. If a vendor can find a way to use low resolution or low precision hardware and synthesize high precision computation with that low precision hardware, then you might see really interesting things happen, right? So rather than having, say, dedicated 64-bit hardware, you might take your FP4 units and aggregate them in a special way with some special sauce, right, that allows you to do high precision computation, but with natively
Starting point is 00:17:23 low precision hardware. That kind of idea has been around. The question is whether you could pull it off in a way that makes sense, right? I think, so that's like the kind of surprise that you might see, whether that's a surprise in your book or not, I don't know, but that would be interesting, of course. We're hoping for some things, like we've been waiting for a long time for integrated silicon photonics. We keep waiting. So that would be a, again, it'd be a surprise, maybe. It'd be cool. I think there's
Starting point is 00:17:50 other strategies around storage that might be super interesting that could happen. I don't know. What do you think, Mike? I'm interested in seeing, we'll add to the complexity and challenges, just more options to the developer in terms of what they can access. Maybe it's in the chiplet space. Maybe it's in just over-provisioning of resources on the nodes. But I think you can start to tune applications more so towards their needs than trying to hammer them into a square peg. Yeah, I agree with that.
Starting point is 00:18:24 A couple of ideas that we keep coming back to over the years, there was maybe 10 years ago, there was this idea that maybe everything was going to be disaggregated. Systems would be composed of a super fast, low latency fabric of which you would have memory servers and compute servers and storage servers and specialized accelerators and so on. And they would be all attached to this big fabric. And during runtime, it'd be like a software-defined machine. You would gather up the memory units and the processors and the other units that you might need and have a virtual overlay that you'd run.
Starting point is 00:18:57 And then when your job was done, it would all tear it down virtually and build it back up again. And that idea hasn't panned out. It exists in some areas, but not in scientific computing. And instead, what we're seeing is a movement towards ever higher levels of integration on a node, right? So where the building block is many GPUs, many memory units aggregated with a on-node fabric that's very fast. And the building block now becomes maybe a node or a supernode or a pod or something. And that's working in the opposite
Starting point is 00:19:32 direction of disaggregation. We've also seen chiplets, but whether chiplets will actually affect what we can build is not clear. There was a robust chiplet market where a vendor might be not like a vendor like Intel or Nvidia or AMD or something, but might be like just a, think of it as a reseller who would gather chiplets from the market and do a custom SKU for a bid response by combining a bunch of chiplets in a certain way and having these unique things. That kind of market hasn't really developed. If we're going to keep talking system design and options, it could be surprises. Right now, our acquisitions are monolithic decisions made four or five, six years out
Starting point is 00:20:15 that are locked in and very rigid. And it's not a technology innovation, but it's a mindset innovation in terms of if we could look at a much more nimble, fluid approach to responding to things that are happening in the space that isn't locked us into a machine. If you look at Roar, it's a unique situation. But even if you look at Frontier and LCAP, the designs were pretty rigid, maybe late binding on GPUs. But overall, you're making decisions in a space that's moving extremely fast. I mean, there were no AI accelerators that you could acquire when we were designing those systems. And all of a sudden, midway through the acquisition,
Starting point is 00:20:57 those pieces of technology became available. Question is, could you make changes and adapt that would have changed the machine? And right now, they're pretty rigid. If you made one a week, then it would fix that, wouldn't it? Yeah, Mike would get a lot more gray hair faster if that were the case. Lighten up on that one. Even one a month would be progress. I'll take one a month.
Starting point is 00:21:20 Let's talk about power. You said that was on your radar all along. And of course, it turned out to eventually become the big deal that it is now. We've gone from describing data centers from square feet to megawatts. And that doesn't seem to be stopping when one looks at the roadmaps of all these systems. Let's talk a little bit about the complexity that leads to power, cooling, water, the whole thing, modernization? Well, I think it's a new way to think about the limits at some levels, or not limits, but like the opportunities, maybe. One reason you can think about translating power or talking about a data center in terms of its power is that we have a canonical unit at the moment, like a GPU that's order thousand watts plus fractions of stuff, network and memory and so on. So you can think of a per GPU as needing something like one to 2000 watts.
Starting point is 00:22:11 And so when you talk about a data center, like a hundred thousand GPUs, that's a 150 megawatts or 200 megawatts. And it's a way to think about the challenge. It also translates, of course, into huge operating expenses based on where you are in the country or in the world in terms of cost, much more expensive to run a big data center in California, say, than in Illinois or Tennessee or someplace like that. So it affects where you can put things. And the data center markets are very sensitive to all these things. And we see just as across the US, right, the build out of dozens of data centers, in some places, it's approaching, like in Virginia, I think it's
Starting point is 00:22:52 approaching about 20% of the total power in the state is going to data centers. But there's no sign of it really slowing down. And in fact, it's accelerating with this idea of gigawatt data centers, which in your head, you could translate, let's say a million GPU kind of system or order half a million. And we have AI roadmaps that say, that's just a stepping stone to where we need to go for AGI and super intelligence at scales like that or beyond. So the idea that we're even thinking about that is amazing, right?
Starting point is 00:23:22 If you go back five, 10 years from now, in the past, if you talked about a gigawatt data center, that was something to avoid. Now it's something you're trying to build, right? So that's a huge shift in thinking. It also puts enormous pressure on the energy system precisely at the time that we're trying to produce more green energy, right? We want renewable power or we want low or no carbon energy sources. And it's really hard to stand up huge amounts of renewable power or low carbon power quickly, right? The fastest kind of power you can build today is natural gas, right? Because you can basically order it off Amazon, not literally, but actively order a gas turbine or dozens of them and pipe it in and off you go. Whereas building out a new wind farm or 1,000 acres of solar, 10,000 acres of solar takes many years.
Starting point is 00:24:17 Permitting a nuclear power plant takes even longer. So there's a real challenge of how do we sustain the growth of computing, particularly the AI component of it, without trashing our goals of reducing emissions, right? So that's like now a major policy question. workshops around this idea of how do we supply enough green energy to not become the, where energy is not the bottleneck in building out AI is a major issue. Cooling is another issue. All the state-of-the-art big systems are liquid cooled, but sometimes there's even problems of that in terms of where there's power. Like there's a lot of solar and wind out in West Texas, for example, or in the deserts, but often not a lot of water. And one of the things that we need to worry about is much more
Starting point is 00:25:10 efficient cooling schemes that don't consume huge amounts of water. So that's an R&D topic. Now, the commercial sector is investing and growing much faster than the government sector. I think there was a proposal in the last couple of months to maybe build a data center, a public-private partnership around a data center with the goal of using it as a testbed or a laboratory to investigate novel strategies at scale for power efficiency in data centers. That's something that I think the Department of Energy Advisory Board was recommending. So it's a hot topic. It's also interesting because the US is one of the few global markets where we have reasonably affordable energy and a flexible environment where these
Starting point is 00:25:55 things can be built out relatively quickly. It's much more difficult in Europe, much more difficult. You're not going to get a carbon neutral power say, in the Middle East. You can maybe stand up a lot of fossil there. So there's an interesting confluence of where AI is happening, where energy is affordable, and where innovation can happen. And that, I think, is something we want to, as we look forward from a national security standpoint, we need AI innovation, data center innovation, power innovation to happen in the United States. That's one that coupling what I call the power AI technology nexus, a play on the energy water nexus, that has to be driven by the US. That's the future critical bottleneck, I think, for everything that we want. Really, a lot of it just comes down to energy.
Starting point is 00:26:42 And of course, in our pre-call, we were pointing out that you have energy in the very name of the department. It is. So you're ideally placed to study this and point the way. People have talked about small modular reactors, about geothermal, about, of course, solar, about all these different novel ways. We talked in our recent episode about the challenges of transmitting power, not just generating it. What is your guys' perspective on how all of those alternatives are coming together? It's interesting. At Argonne, we're actually talking about small modular reactors.
Starting point is 00:27:19 Of course, Argonne and reactors, Argonne lab was created around the time. I think, or two. Reactors. There's a strong interest there. I think we're going to see how fast we can, I say we here, collectively, humanity, how quickly can we scale out power? That is, I think, an important question. It also may be the case that you start to see things being co-located. The easiest thing to do is build a data center next to where your power is because you don't need that many people to be physically
Starting point is 00:27:49 at the data center. So you could put the data center next to a hydro plant or next to a nuclear plant in the middle of the West somewhere without really requiring lots of people to be there. It's much harder to say, stand up a wind farm or a nuclear infrastructure or even a geothermal infrastructure in the middle of a city. So this could result in some interesting pairings going forward. It also could result in some interesting relationships, right? There's a discussion about Canada has a lot of hydro and a lot of land, and maybe we should be looking at strategic partnerships in some cases where that makes sense. How much we're going to have to build out over the next decade to keep the AI community satiated in some sense is not clear.
Starting point is 00:28:33 Is it 10 gigawatts? Is it 100 gigawatts? And that scale of an infrastructure has never been contemplated more, right? That's like multiple times more computing than we currently have deployed in the country. The problem is it has to start now. Right now there's a lot of discussion but we need to be
Starting point is 00:28:53 prototyping and building test cases and then along the way, because you're not going to start off with a 100 gigawatt solution. Very good point. Now Mike, you see the diversity of applications that are coming your way. Are there apps that can actually use these outside of just model learning building? Over my years, I've been constantly impressed, and I am a computer scientist, not a domain expert in
Starting point is 00:29:20 any of the spaces, but at each turn of the crank, the community steps up and continues to add realism and fidelity to their codes. And Rick's much more an expert in the AI space, but I think opportunities coming from AI, specifically in the biology and medical space, are tremendous. And yes, I think we'll continue to see this. We'll see it changing the way we do everything. Recall that Aurora got the top benchmark number in the top 500, the HPL-MXP AI benchmark. Could you refresh us on Aurora's characteristics that enabled it to score well in this area? Well, Aurora has 60,000 plus GPUs, right?
Starting point is 00:30:04 Six GPUs per node. And we have 10,600 blah, blah nodes. A lot of nodes, a lot of GPUs. It has double precision, single precision, half precision, FP16, BFP16. It also has int8 precision, does not have FP4. It's of a generation before FP4 was a thing. So for mixed precision, the mixed precision benchmark is a curious benchmark because
Starting point is 00:30:25 it's not a pure 16-bit computation or a pure 8-bit computation or whatever. It's a mixture, obviously. And the benchmark number is a calculated number that says what it would have to be in double precision if you were solving it using the baseline method. So you're allowed to use as many precisions as you want. So the reason that it could get that number or the highest number is because it has in 16-bit, it has a systolic arrays that are quite efficient and produce a large number of operations per clock per execution unit at each of the GPUs. And each GPU has got a thousand execution units, many cycles, many instructions per cycle for each GPU. So in some sense, the ratio of those things is what determines its performance. It also of course has HBM memory, like all GPUs do these
Starting point is 00:31:21 days. It's got 128 gigabytes of HBM per GPU, and you're spending most of your time in that memory. So it's relatively efficient and it has eight network interfaces per node. So it has about twice as many kind of network endpoints as Frontier. So it's a more heavily weighted in communication fabric. So that allowed us to get that high number, even with less than all the nodes running, which is good. What that translates to, I think, is training or for fine-tuning AI models, it's incredibly performant. So one of the Gordon Bell submissions that we made this year is on direct preference optimization of our protein language model LLMs and was achieving numbers that are on the order of exaflop in FP16 per thousand nodes. That's about half of the peak. I think the
Starting point is 00:32:15 peak is a little over 20 exaflops in FP16. So that's pretty good. So it means for training large language models, it's an excellent platform. If we can get those models to train in Int8, it'll be even faster because you get about twice the throughput. It also means for inference, it's a very good platform as well. That's excellent. Can I ask you about cloud and how it figures currently just as a spillover capacity or a spectrum of options that are available. And also as it appears in the future and how you may or may not want to take advantage of it. We talked about this a little bit earlier.
Starting point is 00:32:54 We did, yes. And this place, I don't think it's going to replace our current systems. We're interested in it for its first capability and potentially sending smaller jobs out to it. Many of the reasons, I think it's not a solution for the labs. I believe that there's a set of capabilities that we've built out over the years, both in terms of knowledge and skills in our workforce that the idea of offloading those and losing that capability within the complex seems like a terrible waste and something that you would take forever to recover from. So we look at it, but it's not something that's on at least
Starting point is 00:33:35 ALCF's roadmap at scale, I should say. I see. So let's continue to monitor it and use it when appropriate. Yeah. All the labs are using clouds now. It's not an either or thing. If you have lots of application scenarios where clouds are perfect, right? Let's say you have a bunch of sensors out there that are collecting data and you want to periodically do aggregation and cleaning that data. That's a perfect application for a cloud. You're streaming to cloud. You don't have to build out that streaming infrastructure yourself. You can be dynamic in the cloud to scale the workflow a problem. One is cost. Now you can do reserved instances and so on. You can get the cost down a little bit, but the premium that you're going to pay for somebody else doing all of this and providing that infrastructure is going to be
Starting point is 00:34:34 substantial. So it's going to cost you multiple factors, maybe three to five X, depending on the details. So that's immediately a kind of a problem in that you need a lot more money for the same thing. Mike mentioned this notion of losing human expertise and capabilities. Labs, our machines are not just isolated things. They're integrated in with our infrastructure. At Argonne, for example, Aurora is integrated with a lot of other infrastructures, but it's also in a position where we can do fast data transfers, say, to the advanced photon source, right, at terabit or multiple terabits per second. That's a capability that would be very hard to do in a cloud, right? Having many terabits per second networking and paying for the large data volumes that you'd have to transfer in
Starting point is 00:35:21 and out of the cloud would be prohibitive on top of the computing part of it. And even the commercial sector is not all in on this. If you look at companies that are building out their large-scale AI systems, GPU clusters at, say, 50 or 100,000 GPUs, they're either in clouds, so they're funny money with respect to, say, the Microsoft OpenAI deal, or they're companies that are doing it themselves. XAI, right, are purposefully building it themselves because it's cheaper and they have more control over it than if they were just buying it via some contract via a cloud. It doesn't mean that people aren't going to continue to build out clouds in any sense, right?
Starting point is 00:36:00 You're not trying to push back on that. It's an incredibly powerful, useful capability. And even for the government, there are many things that the government needs to do where clouds are perfectly the right solution. But at scale and where capability is the thing that you're primarily trying to support, it's probably not the best solution. But we're constantly reevaluating. That's absolutely right. I've made some statements that I don't see that it's ALCF's future, but every year we look at it and we've looked at it for the last 10 years and each year the story changes
Starting point is 00:36:30 and capabilities change. I guess I should never say never. Yeah, listening to you guys, you just get the sense that you're in these planning and strategic positions that will impact future DOE supercomputing strategy. Yet the technology and the situation and the workload needs and everything is changing so fast. So how do you build in flexibility into your overall strategic thinking so that you can shift, even though these systems take years to stand up? I think you have to be doing multiple things at once.
Starting point is 00:37:02 So most of the labs are not, those don't have a single system or have a single mission. There's multiple systems, multiple missions, test beds, and so on. So you're doing many things at once. We are constantly trying to figure out, is it possible to do our large systems differently? There's rather than these big monolithic contracts that you have to put out a couple of years before the system is installed. And then it takes a year or two to get the system installed and up and running. And this is a very, it's like turning an aircraft carrier. It's very slow to change direction, right? The advantage is the system knows how to do it. So procurement
Starting point is 00:37:40 knows how to do it. The project management people know how to do it. The infrastructure, but it's a very, by the time you're turning the crank three or four or five, six times or nurse again, nine times or whatever you get, the system is very tuned to how to execute. But those big procurements are hard to, as Mike said, inject technology at the last minute. One thing that we're trying to figure out is how can we change, right? So can we write contracts in a way that have multiple options? Can you write contracts in a way that allows you to upgrade on the fly or to have resource
Starting point is 00:38:13 sharing agreements? Maybe you don't, maybe you're not purchasing or even leasing nodes. Maybe you're renting nodes in a different way, or maybe you have an arrangement where you are partnered with a cloud provider. And we've had conversations with Microsoft and HP and others are on this concept of what would it look like if our data center was a hybrid cloud and what are the advantages and disadvantages of that? So all these ideas are being constantly discussed and it's possible that we'll have some new way of doing things that results in changes in the future. Like, for example, one thing that we're interested in is making it easier for startups to participate
Starting point is 00:38:51 in these large procurements and reducing the integration time from a new technology, a new silicon technology to when it could become available in a large scale system at scale. Right now, it's many years, right, for something to become integrated into large-scale systems. And we would like to change that. So I think that there's many directions of this can go. And some of that's even influenced by how clouds are thinking about it. If you talk to folks at Microsoft or Amazon or Google, they're all building their own hardware meta. They're all building their own hardware in addition to buying from the market. And that's something that we're also
Starting point is 00:39:30 considering, like not so much where DOE is going to make its own processors as much, but how could we partner in ways with non-traditional players that would give us more flexibility in our deployment options? I'm sure you're following developments in quantum computing. Are there any developments in that line that have caught your eye in particular? Are you encouraged, discouraged by the progress being made overall? I guess I'm encouraged at some level, but I don't think large-scale quantum computing is going to intersect our practical large-scale facility roadmap for quite a while. I think there are players out there that are taking it seriously at scale, like SciQuantum. I think IBM's taking it
Starting point is 00:40:13 seriously. I think a few others are. I think the quantum computers are not going to be cheap. Probably the cost of a large-scale quantum computer, large-scale like million qubit class machine, you know, is going to be measured in the billions of dollars.-scale quantum computer, large-scale like million qubit class machine, you know, is it going to be measured in the billions of dollars? And they also won't be low power. Big cryo plants are both expensive from a capital standpoint, but they're also very expensive from a power standpoint. So the way to think about a quantum computer that would be big enough to really be interesting as a resource for, say, the science community. I think about it as something that would have kind of order million qubits, physical
Starting point is 00:40:51 qubits, allowing you to build applications that have maybe a thousand or so, maybe 10,000 logical qubits, depending on your error correction, and machines that are stable enough to run applications for days at a time, a single problem for many days, because that's what you need in order to have enough computational power to be interesting on scientific problems. And at that scale, you're talking about machines that have the same economics as the current supermarkets, right? You need facilities that are funded at the level of a couple hundred million dollars a year, and they would serve on the order of 100 or 200 applications a year. And we're just not there yet. It's not that we don't want that to happen. It'd be great if it could happen, especially if we can identify enough applications
Starting point is 00:41:45 that could really do novel science, novel breakthrough level science, if they had access to such machines. So right now it's not an option. There's just nothing at scale, nothing that would be interesting from a science standpoint that could be built in the next couple of years. Not for production anyway. Everything right now is still more or less a physics experiment. Right, yes. Maybe worth doing. But I think for a user facility that is trying to advance science as opposed to advance quantum computing technology.
Starting point is 00:42:17 Okay, so these are two different things, right? We're probably 10 years away. And if you understand how we justify building large-scale scientific computers, say in the context of DOE, there's a long roadmapping process where you accumulate mission requirements or mission needs from the scientific community that could take advantage of the platform. You build use cases from that. You then design a facility to try to meet those needs or those requirements. And that first part of trying to figure out what is it that the scientific community could do in the next, say, five years or even 10 years that they will not be able to do on a classical mean, factor into it
Starting point is 00:42:59 the idea that the scale and the reliability and the time-sharing aspects or space-sharing aspects of the machine, that case hasn't been made yet. And that's probably the next step that has to happen by the community is to really articulate, and not by us computer scientists, but by chemists or material scientists or other end users that are not the technology people, but the end user people, to build that case and to really do the analysis that would convince Congress or taxpayers in some general sense that it's worth a billion dollars or five billion, whatever it's going to cost to do it. I think that's where we're at. There's been progress, of course, in scaling up machines and scaling up R&D enterprises here in
Starting point is 00:43:43 Chicago. We're building out the governor and regional universities and labs and so on are all partnering to build this giant quantum park. Yeah, very much, especially in Illinois, you're right. Yeah, yeah. That's a step, but that's equivalent to like building a fab, right? It's not building, yes, they'll build some machines, but it's not at the point where say three years from now, we'll have a million qubit machine. It's not. Let me ask, I'm just getting the sense that when we talk about the power requirements with GPUs and the data center and the AI models, all of the above, and now quantum computing,
Starting point is 00:44:14 are we now reaching a time in the evolution of HPC where we're just hitting a bunch of walls again? Because it seemed like for a few years, we were just bulldozing through a bunch of previous walls. How do you assess where we are as an industry? Are we tackling really big problems now again? Yes. There's always some problem, whether it's... The problems that are forcing us to pause, and now we're just going to have a whole long haul again. I'm not sure that it's boring again. I don't characterize it as a pause. It's paused in the past. What would happen is we would project and then we'd say, ah, this projection doesn't work because 15 years ago, if I tried to build a gigawatt data center, people would laugh
Starting point is 00:44:56 me out of the room. Yes, yes. It wasn't that I couldn't imagine it. It's just that I couldn't get any traction with that idea. Yeah, I agree. Positive bad word. But are we now into a boring long slug again, like we were some years ago, and then like rapid innovation?
Starting point is 00:45:14 I don't think it will be boring because we'll have AI to entertain us. But I think what we're maybe not seeing here is that these things are all coupled. So if we can make enough progress in AI, it will help us in making progress in quantum because it will help us write quantum applications. It'll help us maybe dream up better quantum algorithms. It'll help us with breakthroughs in materials or breakthroughs in air correction or whatever, right? So we're at this weird inflection point, I think, where weird in the sense it hasn't happened that many times in the past, but where we have a technology that could accelerate many other technologies, and that's AI.
Starting point is 00:45:48 Simulation was the thing that we would argue played that role, say, for the last 30 years, that you'd say, oh, if you wanted to design a drug, use simulation. You want to design a material, use simulation. You want to design an airplane, use simulation. That was the argument. Now, what's the argument?
Starting point is 00:46:03 Oh, you want to do all those things? I'm going to use AI. I'm going to use, and I want to do all those things? I'm going to use AI, right? I'm going to use, and I'm going to accelerate my simulation. I'm going to use AI to do that. So it's become this general tool that affects the productivity of many people, but also affects the utility of other tools. And that will have some outcome. Is it going to be everything exponential in three years?
Starting point is 00:46:23 Probably not. But is it going to profoundly affect almost everything that we're doing? I think the answer is probably yes. And we're not fully appreciating that because it's hard to see. It's hard to see what, if all of us had a thousand times more, I don't know, capability in our daily work life or whatever, how would we behave differently? It's if you went back 40 or 50 years ago and said, what if you had a gigabit network? Right back when you had dial up, it was really hard to imagine like a gigabit. What's that? You'd like, but millions of people have gigabit servers. And so you can do all kinds of things
Starting point is 00:47:03 that you couldn't do then. So the AI revolution in in some sense, is going to be like that. Most of the things that we will be doing, say, five, 10 years from now, we're not doing today. So it's hard to say, oh, I'm just going to do that only it's going to be faster. No, I'm going to be doing something completely different. That's right. Okay. So that will certainly keep things interesting. Yeah. Pause is just, we're going to probably have those pause buttons taken off of all of our devices. There's so many other things we haven't talked about that we should, but we've also taken a lot of your time. Always grateful for that. Doug, any questions that we haven't asked?
Starting point is 00:47:37 There were a few, but I think we're good for now. It was a really interesting hour conversation. Much appreciated. Sure. Anytime, guys. Thanks. Rick, Mike, thanks so much. Thank you, Rick. Thank you, Mike. appreciated. Sure. Anytime, guys. Thanks. Rick, Mike, thanks so much. Thank you, Rick.
Starting point is 00:47:47 Thank you, Mike. All right. All right, guys. Cheers. Bye-bye. That's it for this episode of the At HPC podcast. Every episode is featured on InsideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion.
Starting point is 00:48:16 If you like the show, rate and review it on Apple Podcasts or wherever you listen. Thank you for listening.
