@HPC Podcast Archives - OrionX.net - @HPCpodcast-102: TOP500 at ISC25 Conference

Episode Date: June 10, 2025

The new TOP500 list of the most powerful supercomputers was released today at the ISC conference, with a new addition to the top 10. Tune in as Shahin and Doug go through the list with their commentary and analysis as they go over the details and key takeaways, how continents, companies, and architectures fare, and cover the full suite of benchmarks: HPL, Green500, HPCG, HPL-MxP (AI), IO500, and MLPerf. [audio mp3="https://orionx.net/wp-content/uploads/2025/06/102@HPCpodcast_SP_TOP500_ISC25_20250610.mp3"][/audio]

Transcript
Starting point is 00:00:00 Cool IT Systems is proud to cool the world's fastest and most advanced supercomputers on the TOP500 list. For over 24 years, we've been the leading liquid cooling provider for the world's top HPC and AI systems. Ensure performance and reliability for your next-gen AI systems with the world leader in liquid cooling. Explore more at CoolITSystems.com. There's a serious point here. And it's that Europe continues to expand its presence with half of the top 10 and eight of the top 20.
Starting point is 00:00:36 And that's 1.2%. So not too bad, which is really funny. When you think of 1.2% of something and you think it's good. Based on total aggregate power of the entire top 500 list, the top three have about 20% of that total. As well as China not participating and thus we really don't know where they stand, although they also have been under significant trade restrictions.
Starting point is 00:01:04 And if you go down the list, if you're basically doing over 60 gigaflops per watt, you're doing pretty well. From OrionX, in association with InsideHPC, this is the @HPCpodcast. Join Shahin Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them.
Starting point is 00:01:23 Thank you for being with us. Hi, everyone. I'm Doug Black at InsideHPC, and with me is Shahin Khan at OrionX.net. And this is the @HPCpodcast. Our topic for today: the TOP500 organization has released its new list of the world's most powerful supercomputers. This is done twice annually at the two big HPC conferences of the year. ISC is going on in Hamburg, and obviously the other conference is
Starting point is 00:01:52 SC in November. The top-line takeaway for this one, Shahin, I would say, is that the list has kind of reverted to long-term form. Recently, there have been some rapid changes, in the top 10 especially, that were pretty surprising. But this time around, there's just one new system in the top 10 part of the list. So coming in at number one is El Capitan, the exascale supercomputer that is
Starting point is 00:02:16 at Lawrence Livermore National Lab. The number two system is Frontier, which in 2022 came out as the first certified exascale system. And Aurora, at number three, is at Argonne National Lab in Illinois. Those are the top three systems. Looking at El Capitan, it comes in at 1.7 exaflops. This is an HPE Cray EX255a system. It's powered by AMD EPYC CPUs and AMD Instinct MI300A GPUs. So a very impressive system. The other important takeaway is that the US has a somewhat diminishing presence at the top of the list, with six of the top 10 and 11 of the top 20 being from outside of the United States. And Shahin, that's perhaps a parochial way to look at things.
Starting point is 00:03:08 But there's a serious point here, and it's that Europe continues to expand its presence with half of the top 10 and eight of the top 20 systems. We also see a top 20 system from South Korea, their system coming in at number 18, and Japan with two top 20s. At number 7 is the formerly number one ranked Fugaku supercomputer, and also at number 15. Now, skewing the entire picture is that, as we all know, China long ago, I believe in 2017,
Starting point is 00:03:38 stopped participating in the list. We've reviewed the top 500 together now, Shaheen, you and me for several years, and we've often noted that China has a host of exascale class systems. We just don't know how many or how they are utilized. But getting back to the US with a new administration in place and supercomputing budgets and plans
Starting point is 00:03:59 at the US National Labs in flux, leadership class supercomputing might continue to diminish relatively speaking, at least in sheer numbers. But we should also note that the top three systems are American, and their aggregate compute power represents a significant portion of the aggregate power of the entire top 500 list. Now, Shaheen, you've done some historical research. I believe currently the US has 174 of the top 500. This is up pretty significantly from a low point of 109 in 2019. And in 2006, the US had 309 systems.
Starting point is 00:04:36 So it's fluctuating. The US now has 35% of all the systems on the list, and 28% of the top 50. Europe now has 30% of the entire list. I think that's the size of it. The investments in the US, primarily through DOE, have obviously yielded beautiful fruits. The top systems are all US systems, and a lot of architectural experimentation and implementation is providing excellent
Starting point is 00:05:05 insight. But the Chinese stopped participating, as you mentioned, and the Europeans started investing in it properly in a big way, starting with the European Euro HPC joint undertaking effort that started in 2018. And obviously it is also showing results. Albeit their strategy seems to be different, they are distributing their systems a little bit more evenly across their geography. So both of them are, in my view, challenges to US leadership in their own way, with Europeans actually investing and providing a lot of resources and performance to a very wide range of scientists, as well as China not participating, and thus we really don't know where they stand, although they also have been under significant trade restrictions.
Starting point is 00:05:55 And it's going to be difficult for them to really come up with these big systems, so who knows exactly what's there. And the systems that they did have that were homegrown showed excellent performance on HPL, but architecturally they looked like they would not perform well on much else. So there is that sort of special purpose to them. Yeah. As I was saying, there are serious questions about the utility of those Chinese systems. By the way, the one new system in the top 10 is from Europe. It's the JUPITER Booster, a BullSequana system. This will be part of the JUPITER system being installed in Germany,
Starting point is 00:06:32 which is expected to be Europe's first exascale-class supercomputer. But Europe has not been fixated on achieving that top performance level as much as, I think, what you're talking about is more of a distribution of supercomputing power throughout the continent. We also note that England, or the UK, pardon me, has continued to sort of de-Brexit itself where HPC is concerned. They are part of the EuroHPC effort. Yeah. The BullSequana system behind JUPITER Booster is from Eviden, and as you've heard in our HPC News Bytes, Eviden is being acquired by the French government; more details there. But it's a great system, and JUPITER is a modular, heterogeneous system, and Booster is the big computational capability there, which itself is projected to exceed one exaflops in 64-bit performance.
Starting point is 00:07:27 And of course, much more in lower precision. So, Shahin, obviously the list looks at more than LINPACK performance. And I know you've dug into other aspects of what- Yes. -what Top500 announced. Yes, for sure. Let me just start with countries
Starting point is 00:07:42 because we already talked about it a little bit. As you mentioned, the US leads in terms of supercomputer count, and it has about 35% of the total at 174 units. Europe collectively has about 30%. And then China has 9.2% even now, despite not having participated for a while. Japan has about 8%. And then we go to South Korea at 3% and Canada at 2.6%. And then, notably, Brazil shows up at 1.8%. In terms of performance, however, it's a pretty slam dunk for the US.
Starting point is 00:08:19 It has about 50% of the total performance, 48.4% to be exact. Now, the total aggregate performance across the entire list comes in at 13.84 exaflops. That's up from last time, six months ago, when it was 11.72. So there's a lot of improvement down the list as well, even though we kind of focused on the top 10. The average concurrency, the average number of cores per system, also went up. Last time around
Starting point is 00:08:49 it was 257,000 cores per system, and this time it's about 275,000 cores per system. So it is high time to start using kilocores instead of just cores. That makes it 257K then compared to 275K now. All right, and then moving to vendors. In terms of supercomputer count, Lenovo has 27%, HPE is just behind them at 26.4%, and then Eviden follows with 11%, Dell at 8.2%, and then, notably again, NVIDIA shows up at 5.4%. So these are systems branded as NVIDIA systems. In terms of performance, HPE has a commanding lead at 48% of the total aggregate performance, primarily because the top systems are all HPE systems. And then that is again followed by
Starting point is 00:09:46 Eviden, because they also do big systems, and they have 12.5% of the total aggregate. I expect that Eviden's share might improve as the European systems come online, but HPE is winning deals left and right as well. So it's going to be a nice competition to watch. Shahin, based on total aggregate power of the entire top 500 list, the top three have about 20% of that total. So yeah, yeah. You know, when you have a big one like El Capitan or Frontier or Aurora, it all adds up. So you get the big chunk. It adds up quickly and hugely. That's right. That's right. So now we get into accelerators. Obviously, no surprise, NVIDIA leads
Starting point is 00:10:29 with 39.6% of all the systems, and then AMD with 5.2%. What this also tells you is that there are 265 systems, more than 50%, that do not have any accelerators. So CPU-only continues to be a very big, important vehicle for HPC, especially if you go down the list a little bit. In terms of interconnects, InfiniBand has now exceeded 50% of all systems; it has about 54% of all the interconnects. And Gigabit Ethernet follows at 33%. Omni-Path carries on at 6.6%, and it is up from last time.
Starting point is 00:11:12 And they just announced new switches last week, and that's a pretty interesting development there as well. And then custom and proprietary interconnects are another 6%. If we go to CPUs, also no surprise, Intel has dropped, but it still has a pretty big lead at about 59% of all systems, followed by AMD at 34%, which is obviously up in a good, healthy way. So AMD continues to gain, but Intel continues to lead.
Starting point is 00:11:43 That's the story there. And then finally, we can look at cores per socket. That's kind of a good indication of the kind of CPUs that HPC types tend to pick. And 106 systems on the list have 64 cores per socket. What follows after that is 79 systems with 24 cores per socket.
Starting point is 00:12:05 So we're not seeing 96 or 128 or some of these other numbers; 64 and 24 appear to be the popular ones. Supercomputers are driving the world's most exciting innovations, but all that power generates a ton of heat. That's where Cool IT Systems comes in. For over 24 years, we've been cooling the world's fastest and most advanced supercomputers, including systems on the TOP500 list. We lead the way in liquid cooling for HPC, AI, and next-gen platforms. Heading to ISC 2025?
Starting point is 00:12:39 Come check out our expert presentations to see what it really takes to cool today's most demanding workloads, and explore what's coming next for AI and HPC cooling. Visit coolitsystems.com to learn more. So the benchmark that we've been talking about so far is HPL, High Performance LINPACK, which solves a dense matrix of giant size to get to the performance that is reported. Other benchmarks include HPCG, for conjugate gradient, which tends to establish a lower end to the benchmark.
Starting point is 00:13:15 It's more indicative of everyday garden-variety codes, and it's not as well behaved as HPL. So the performance that you see on HPCG is dramatically lower than on HPL, as we'll discuss in a second. And then you have Green500, which is the same HPL benchmark but divided by the power that it used. So it's really gigaflops per watt. There are a few other benchmarks in the industry. IO500 usually gets announced around the same time, and when it comes, we'll talk about it; I'll reference a little bit of their previous list. And then there's MLPerf, which is a suite of AI benchmarks, and it was the subject of one of our HPC podcasts
Starting point is 00:13:57 when we had David Kanter of MLCommons on as a special guest. That was episode 91. Go look it up. It explains a lot of what they do. So with that introduction, maybe I can talk about Green500 first, because that's just HPL as compared with the wattage that is used. The number one system on that list is JEDI. That's in Germany. JEDI is number 261 on HPL, but it is number one on Green500, and it delivers 72.7 gigaflops per watt. So that's the gold standard. And if you go down the list, if you're basically doing over 60 gigaflops per watt, you're doing pretty well. Number two is Romeo in France, and that's 70.9 gigaflops per watt. Number three is Adastra, also in France,
Starting point is 00:14:47 and that delivers 69 gigaflops per watt. And then number four is Isambard in the UK, and that's 68.8, almost 69 gigaflops per watt. Now, the number one system is from Eviden, and it uses Grace Hopper GH200 superchips and InfiniBand. Number two is also Eviden, also Grace Hopper GH200. Number three is an HPE AMD MI300A Slingshot system. And number four is also HPE, but again it's a Grace Hopper GH200, except now with Slingshot instead of InfiniBand.
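Since the Green500 metric is just HPL throughput divided by the average power drawn during the run, these efficiency figures are easy to recompute. A minimal sketch in Python; the Rmax and power values below are illustrative stand-ins back-derived from the quoted 72.7 gigaflops per watt, not figures taken from the list:

```python
def gflops_per_watt(rmax_petaflops: float, power_kilowatts: float) -> float:
    """Green500 metric: HPL Rmax divided by average power during the run."""
    gflops = rmax_petaflops * 1e6   # 1 petaflop = 1,000,000 gigaflops
    watts = power_kilowatts * 1e3   # 1 kilowatt = 1,000 watts
    return gflops / watts

# Hypothetical example: ~4.5 petaflops at ~61.9 kW works out to
# roughly the 72.7 GF/W quoted for the number one system.
print(f"{gflops_per_watt(4.5, 61.9):.1f} gigaflops per watt")
```

The same arithmetic explains why small GH200-based machines can top the efficiency list while sitting far down the performance list: the metric rewards the ratio, not the absolute flops.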
Starting point is 00:15:20 So basically Grace Hopper, these superchips from NVIDIA, seems to be the sweet spot for energy efficiency. So I expect MI300A might also perform quite well. So Shahin, as you went through that, I noted all four of the top four systems on the Green500 list are in Europe. I'm curious where the US shows up on the list for the first time. Well, the US shows up prominently in terms of the vendor list, but on the list itself, number 10 actually is the first US system that shows up, and that's Henry at the Flatiron Institute, which we have discussed on this show before. It was higher up in the list before.
Starting point is 00:16:02 It has now dropped to number 414 on the TOP500 itself, but it shows up as number 10 on Green500. Okay. Then we go to HPCG. As I said, HPCG is a really difficult benchmark, and the fraction of the performance that you get is really pretty low, as we'll demonstrate in a second. The number one system is El Capitan, which is also number one on the TOP500. And whereas on the TOP500 it does 1.7 exaflops, on HPCG it's only doing 17 petaflops. So that's 1% of the performance that it gets for LINPACK.
Starting point is 00:16:45 The number two system is Supercomputer Fugaku, one of my favorites, as you know, and you will see why in a second. Because while it's doing, in quotes, only 442 petaflops on LINPACK, it's doing 16 HPCG petaflops, right behind El Capitan. And that's 3.61% of the performance that it gets on LINPACK. And that's the high-water mark. So imagine that you're building a supercomputer for one application, and then you can only get 3%, 1%, or less than 1% for some other garden-variety application. That kind of shows you how you might architecturally optimize for some apps
Starting point is 00:17:27 rather than others. And that's also why I was saying that the Chinese systems that did really well on LINPACK didn't look like they could do well on much else. Those systems would have a really difficult time here, likely even worse than the ones that we're seeing. Number three is Frontier, and it does 1.3 exaflops for LINPACK.
Starting point is 00:17:48 It does 14 petaflops for HPCG. That's just over 1% of the performance. Number four is Aurora. Aurora only gets 0.55% of its LINPACK number for HPCG, and HPCG comes in at 5.6 petaflops. Number five, and we stop there, is LUMI. LUMI is in Finland. And it is doing 379 petaflops for LINPACK and 4.6 petaflops
Starting point is 00:18:18 in conjugate gradient. And that's 1.2%. So not too bad, which is really funny when you think of 1.2% of something and you think it's good. Well, Shahin, I know you're a big fan of the Fugaku system, and you can really see why. I mean, you've extolled the beauty, the virtues of its architecture, and it shows in that benchmark. That benchmark really is where you can see the benefits of that sort of architecture. Now, of course, it is also physically beautiful and packaged beautifully. I would like one at home, really.
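The HPCG-to-HPL ratios discussed above are simple to check. A quick sketch using the figures quoted in the episode; Aurora's HPL entry is written here as 1,012 petaflops, which is consistent with the 0.55% figure quoted, and exact list values may differ slightly in the last digit:

```python
# HPL and HPCG results as quoted in the episode, both in petaflops.
systems = {
    "El Capitan": (1700.0, 17.0),
    "Fugaku":     (442.0, 16.0),
    "Frontier":   (1300.0, 14.0),
    "Aurora":     (1012.0, 5.6),
    "LUMI":       (379.0, 4.6),
}

for name, (hpl, hpcg) in systems.items():
    # The ratio shows how little of the dense-matrix peak survives
    # on the less regular conjugate-gradient workload.
    print(f"{name}: {100 * hpcg / hpl:.2f}% of HPL")
```

Fugaku's roughly 3.6% stands out against the roughly 1% of the GPU-heavy machines, which is exactly the architectural point being made.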
Starting point is 00:18:48 The next one is HPL-MxP, which is the mixed-precision benchmark. It's doing HPL, but doing it kind of the, quote, AI way: using the lower-precision but higher-performance arithmetic that's available and trying to iterate your way to the same exact result. So number one is again El Capitan. Now here we go way to the other end. With HPCG we could really not do a lot of flops; in this one, we are doing a lot of flops, because we're using different hardware, different metrics. So number one is El Capitan. It is obviously number one on the TOP500. It does 1.7 exaflops.
Starting point is 00:19:30 For HPL-MxP, it does 16.7 exaflops. So that's 9.6 times more performance compared to the 64-bit hardware. Number two is Aurora. It's doing one exaflop, and for HPL-MxP it does 11.6, so that's 11.5 times. And then you go to Frontier; that's 8.4 times. There's a system in Japan, at AIST, that is an HPE system with NVIDIA H200 GPUs, and that's 16.3 times the speedup. Now,
Starting point is 00:20:03 if we're comparing the benchmarks from one list to the other, a lot has to do with how well the team that ran the benchmarks did on one versus the other. So the speedups are kind of indicative, just a ballpark sort of thing. And if you look at the whole top 10 list, you see anywhere from four and a half times to 25 times.
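Those speedup factors are just the ratio of the HPL-MxP result to the plain 64-bit HPL result. A quick check in Python, using the list's published Rmax figures of 1.742 and 1.012 exaflops, which the rounded "1.7 exaflops" and "one exaflop" above refer to (treat those as assumptions if the list values shift):

```python
def mxp_speedup(hpl_exaflops: float, mxp_exaflops: float) -> float:
    """How many times faster the mixed-precision run is than 64-bit HPL."""
    return mxp_exaflops / hpl_exaflops

# El Capitan: 16.7 EF mixed precision vs. 1.742 EF in 64-bit.
print(round(mxp_speedup(1.742, 16.7), 1))  # ~9.6x
# Aurora: 11.6 EF vs. 1.012 EF.
print(round(mxp_speedup(1.012, 11.6), 1))  # ~11.5x
```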
Starting point is 00:20:27 But the typical range seems to be 6 to 10x. So really the takeaway is that if you use the lower-precision arithmetic that's available, doing something like HPL, you should expect to see 6 to 10 times more performance. Interesting stuff. I mean, so much of the spotlight over the last five months really has been around AI factories, these massive tens-of-billions-of-dollars data centers. But by the same token, there'll always be a place for these incredibly high-powered supercomputers with high-precision computing. And really, I think a lot of the movement right now, we have to say, is coming out of Europe. In fact, Jupiter is scheduled to be stood up, installed, and tested next year,
Starting point is 00:21:13 and could be Europe's first exascale system. Yeah, definitely. The time to redouble efforts on supercomputing in the US is really now. And I know a lot of folks are working on that, as I see reports from various committees. So fingers crossed that leadership will be maintained and spread across all the scientific community. Yeah, it'll be very interesting to see how long it takes before El Capitan is displaced by something more powerful. Now, let me close with IO 500.
Starting point is 00:21:39 As I said, the list isn't out yet, but the last one that they did was all about file systems. So you obviously have Aurora and a few other systems that show up on that list, but the file systems at the top end are generally DAOS, the Distributed Asynchronous Object Storage, an open-source storage software originally done by Intel. It's an object store: it's got key-value, it's got erasure coding, it does scale-out really well, it can use NVMe. It's kind of a modern file system, if you will. And then right behind it is Lustre, well established,
Starting point is 00:22:15 also highly scalable, very robust, and you can see variations of it, including the ones from DDN. Also Weka; they have a really good protocol that they use as well. So that's what IO500 looks like. And then MLPerf: the latest results, announced just a few days ago, show Blackwell leading in MLPerf training by a good margin. On the AI inference side, there is the data center variety of inference and the edge variety. For data center AI inference, the H200 is the current leader from NVIDIA, and we'll see how that evolves. So that kind of concludes my analysis of this, with a little bit
Starting point is 00:22:58 of time that I had. But I'm delighted that this list continues. This is the 65th edition of the TOP500, so divide that by two and that's how many years they've been doing it. It's just over 32; you can't do the math wrong when you say something like that. I think we should fire up El Capitan and figure out how many years they've been doing it. So it's over 32 years that they've been doing this. And it's wonderful because it provides a whole lot of historical data. It has predictive power if you dig deep
Starting point is 00:23:30 and really understand what the benchmarks are really trying to do. It is a good informer of future technologies and future architectures. And of course, it's also a good indication of what's going on in the market. And a good comparative piece as well. Absolutely.
Starting point is 00:23:47 Yeah, huge value to the industry. I think it really binds the industry together quite nicely and it's tremendous effort by the team that does it. Okay. Well, as always, Shaheen, great to be with you on our long form podcast and we look forward to the entire ISC show this week. That sounds good. All right.
Starting point is 00:24:04 Thank you, Doug. Thank you to our listeners. Take care. Until next time. That's it for this episode of the At HPC Podcast. Every episode is featured on insidehpc.com and posted on orionx.net. Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The At HPC Podcast is a production of OrionX in association with Inside HPC. Thank you for listening.
