@HPC Podcast Archives - OrionX.net - @HPCpodcast-95: Adrian Cockcroft on SC24, RISC-V Summit, AWS Reinvent
Episode Date: January 8, 2025
Review of SC24, RISC-V Summit, and AWS reInvent. Topics include: HPC and AI clouds, CXL, liquid cooling, optical interconnects, optical computing, novel CPUs and GPUs, the state of RISC-V in servers and supercomputers, TOP500, chiplets, and AWS CPU and GPU strategies.
Transcript
12 plus years of industry-leading liquid cooling.
Over 40 patents in liquid cooling technology.
100% heat removal.
Lenovo Neptune Liquid Cooling.
Learn more at lenovo.com slash Neptune.
SC is emerging as the main AI show in the market.
And it's, of course, further evidence that AI is a subset of HPC.
When I look at computer architecture now,
I keep having to add a bunch of zeros onto the end of all the numbers
because it all seems ridiculously large
compared to what we were looking at only 10 years ago.
With NVIDIA moving to an annual cycle,
when even a two-year cycle was already really difficult.
And that is causing, Adrian, you called it enterprise indigestion,
because the chips are moving a little bit too fast.
And by the time you install one, there's another one that is coming.
From OrionX in association with Inside HPC,
this is the At HPC Podcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them.
Thank you for being with us.
Hi, everyone. Welcome to the @HPC Podcast. Joining us as our special guest is a colleague of Shaheen's who's been with us before, Adrian Cockcroft.
He's a partner and analyst at OrionX.
And we're looking back at a recent spate of shows, very significant conferences in our world, starting with SC24.
Also looking at the AWS reInvent conference,
along with a recent RISC-V conference. So Adrian, welcome. Great to be with you again.
And let me just start off with some interesting numbers. We've talked about some of these,
but some new factoids have come in, Shaheen, about SC Atlanta. We know that they
broke the 18,000 barrier on numbers of attendees, which was a jump of more than 25%.
They had a record number of exhibitors, a quarter of which had never been at SC before,
just under 500 total. And I think we all felt this in our feet, shoes, and legs: a 30% increase in floor space.
There was a lot of walking done two weeks ago in Atlanta. And I think an interesting thing to talk
about a little bit would be, what does this all mean that this conference is growing so fast?
And AI, of course, is the headline, but HPC is enabling AI.
And I think the world is really clued into that.
And more and more attendees and companies want to be there.
But Shaheen, if you want to jump in, or Adrian, what would be your thoughts on the implications
of how that show is growing?
So in agreement with you, Doug, I think SC is emerging as the main AI show in the market. And it's, of course,
further evidence of my own view that AI is a subset of HPC. It's a killer app within HPC,
and it's arguably potentially bigger than the rest, but still fundamentally HPC.
We've said it many times, the skill set is the same, the infrastructure is the same,
the algorithms are the same. And when you look at new features that get added to AI,
like sensitivity analysis that we were talking about
in our last HPC News Bites,
those are all like partial differential equations and such.
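As a quick aside for anyone unfamiliar with the term: sensitivity analysis at its simplest asks how much a model's output moves when an input parameter moves, which is just estimating partial derivatives, the same mathematics that underlies PDE-based HPC codes. Here is a minimal Python sketch of the idea; the model function is invented for illustration.

    def sensitivity(f, params, key, h=1e-6):
        """Estimate d f / d params[key] by central differences."""
        up = dict(params); up[key] += h
        down = dict(params); down[key] -= h
        return (f(up) - f(down)) / (2 * h)

    def model(p):
        # A stand-in model output, purely illustrative.
        return p["a"] ** 2 * p["b"]

    # d/da (a^2 * b) = 2ab = 12 at a=3, b=2
    print(sensitivity(model, {"a": 3.0, "b": 2.0}, "a"))  # ~12.0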
So I think that's one reason.
And of course, there's the gentle, ongoing growth of HPC itself that would have happened anyway.
It was nice to see such a big jump,
both in attendees, but also in exhibitors. Yeah. Adrian, what grabbed you about SC? I know you attended remotely,
but yeah, what were some of the important themes you saw coming out of the show?
Yeah, I've been to the previous two SC conferences and was sort of trying to figure out some trends, a couple of things that I'd been picking up from last year to this year. Last year, Microsoft Azure put up a LINPACK result with Eagle that was quite competitive. And I suspected that maybe this year some of the other cloud vendors might come in. It's relatively easy for them to run a number. And they didn't do that. I think they're all distracted doing AI, and they don't really have a focus on HPC other than
that it's a place to hire people from, I guess, that have the right kind of experience. So that
was one area where I was expecting to see more and didn't really. And then the other area is
mostly around CXL and the maturation of the CXL protocol. We're seeing implementations
of CXL 1.1 and 2.0 now as memory extensions. You're now building a memory hierarchy where you
can have shared pools, very large shared pools of memory across clusters, to hold your models and your data basically in memory rather than having to go to IO. And that seems to be turning into products that are shipping now this year.
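To make that placement decision concrete, here is a minimal Python sketch of a tiered-memory policy: put the working set in the fastest tier that can hold it, spilling from local DRAM to a CXL-attached pool before falling back to IO. The tier names, capacities, and latencies are illustrative assumptions, not figures from any real system.

    TIERS = [
        # (name, capacity in GB, rough load latency in ns) - illustrative only
        ("local DRAM", 2_000, 100),
        ("CXL shared pool", 30_000, 300),
        ("NVMe (IO)", 500_000, 100_000),
    ]

    def place(working_set_gb):
        """Return the fastest tier whose capacity fits the working set."""
        for name, capacity_gb, latency_ns in TIERS:
            if working_set_gb <= capacity_gb:
                return name, latency_ns
        raise ValueError("working set exceeds all tiers")

    for size_gb in (500, 10_000, 100_000):
        tier, latency_ns = place(size_gb)
        print(f"{size_gb:>7} GB -> {tier} (~{latency_ns} ns loads)")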
Okay. Shaheen, other areas of technology caught your eye, really caught your attention at SC?
Well, I think liquid cooling and energy were front and center this time around.
And not just for ESG purposes, for environmental sensitivity and green credentials, but to actually save money.
And I thought that it's nice that the economic incentive and the climate incentive are aligned
in a big way or a bigger way than before. Now, Adrian, when you were at AWS, you looked at this
quite a bit and you're tracking it. So interested in your comments,
but that was a big deal for me. Yeah, we're starting to see power being one of the big constraints on deploying data centers and the typical places where the hyperscalers have been
putting regions are becoming power limited. And so they're basically going to where they can find
power and putting sort of offshoots of their main
centers.
AWS calls these local zones where the control plane that you use to manage all the resources
stays in one of their central regions, but they can go and say, oh, there's a nuclear
power station over here or a particular big solar array cluster over here, and somewhere where they can put up a very large building
full of the GPUs
and have it be close to power and cooling,
sited for that,
but not have to deploy an entire region-level
control plane infrastructure.
So it operates as an extension of the main regions.
It's like a local zone.
They've had that concept for a while
and they're placing them all around the world.
So that's, I think, how we're seeing the hyperscalers
go after this problem.
But the new data centers are being built much more efficiently.
AWS announced a new data center technology package
at reInvent this week, which is much more efficient in how it handles power and cooling. It's quite a detailed release of the technologies they're doing. And they also
started releasing power usage efficiency data for the first time with some quite good numbers. And
finally, we have PUE numbers for all the regions of AWS around the world, something they haven't
disclosed before. So it's becoming enough of an issue that it's forcing people to disclose, this is our efficiency, this is how we're doing it,
and come up with architectures where they can go find places where they can put the capacity that
they need.
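For reference, the PUE metric Adrian mentions is simply the ratio of total facility power to the power delivered to the IT equipment, with 1.0 as the unreachable ideal. A one-line sketch, with made-up numbers:

    def pue(total_facility_kw, it_equipment_kw):
        # PUE = total facility power / IT equipment power; 1.0 is ideal.
        return total_facility_kw / it_equipment_kw

    # 1.15 here would mean 15% overhead for cooling, power conversion, etc.
    print(pue(1_150, 1_000))  # placeholder figures, not AWS's disclosed numbers

Yeah, and it's interesting along those lines. One of the exhibitors, one of the first-time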
exhibitors, was Valvoline. So they're getting into liquid cooling, which surely indicates that data center energy
consumption is becoming a major market opportunity to address. And we're also seeing new technologies.
An amazing thing about obviously the technology industry, a characteristic of it is its ability
to quickly respond to new market opportunities and new market demands.
So Shaheen, for example, in the area of optical I/O: not a strong presence on the SC conference agenda, but an increasing number of optical I/O vendors, and we also saw an optical computing presence on the show floor. Yeah, definitely. This is a good reminder of the
number of exhibitors that were at SC24, and it was a new high. 494 was the number from 29 different
countries. So what that tells me is really the reinforcement of the notion that HPC is an early
adopter market. So when you go to SC, you see everything.
You don't just see supercomputers.
And you see all these novel advanced architectures and technologies that are coming online.
And Valvoline is an excellent example.
They were a first-time exhibitor, but Shell was on the floor. Castrol was on the floor.
And all of these guys are known as motor oil producers, makers of actual oil products.
So when I talked to the folks at Valvoline, I noticed they had a nice, beautiful sports car in their booth, like a Formula One racing car.
And my little joke with them, by the way, was that's the best giveaway on the show.
I didn't get one, unfortunately.
I'd even take just one lap around the convention floor. In five seconds, right? Yeah, totally. But what I asked them was,
hey, when I think of motor oil, I think lubrication. I don't think cooling. Can you
explain that? And they said, well, actually, motor oil has five uses. It is for lubrication, of course, and that's the star of the show, as they put it, in a car.
But it also cleans.
It also protects against corrosion and such.
And it seals, but it also cools.
And on these high-end, super high-end cars, they actually splash oil on top of the cylinder head to cool it after it has gone through the ignition phase.
And that prevents it from getting stuck. And it's a very important part. But in a computer,
cooling becomes the star of the show. So that was interesting and good insight. There were
companies that were making hoses of different kinds. And the coating on the inside of the hoses actually has an impact based on what sort of fluid runs through it.
So all of these complexities bring chemistry and biology
and fluid dynamics into the electronics world that we've had.
And of course, HPC people simulate all of it.
A lot of the high-powered electric cars have very extreme oil cooling; both the power unit controllers and the motors themselves tend to be oil cooled now. In some of the bigger cars, when you put your foot down, there's basically a megawatt of power being dissipated. That's lots of cars now, in the sort of hundreds of horsepower up to a thousand horsepower level, and keeping those things cool enough so that you can do more than just a three-second or two-second naught to 60, or whatever it is, takes a lot of cooling.
There you go. Yeah. Yeah. And I forgot your fast car history, Adrian. And Adrian was part of the
technical team that made it go fast. So when he wrote a book on performance, I think you had a
picture of your sports car on top of the front cover.
Yeah. People called it the red Porsche book if they couldn't remember the title, because that was on the cover. There you go. So the other thing was really advanced technologies, as you mentioned, Doug. Optical technology in general, optical computing as well as optical communication, was one. CXL was one, and I know that Adrian and I have
been tracking it quite closely,
and we should talk about that in general. And also new types of CPU architecture, whether
they're sort of dynamically self-optimizing or whether they are trying to have a one-to-one
correspondence with GPUs and remove the CPU bottleneck, or other forms of CPUs, not to mention ARM and RISC-V.
So there's just a wide diversity of chips, both in a CPU capacity and in an accelerator GPU capacity.
Well, on CXL, Adrian, I know two years ago coming out of SC, you wrote an article about CXL that got a lot of interest, even a little bit of controversy.
And I'm wondering if your thesis at that time has been borne out or if there have been surprising
changes since then? Yeah, I think, I mean, there's always, you get a chip spec and then it takes a
while for it to turn up into implementations. And with CXL, the implementation needs to be in the CPU chips. So you're waiting for the CPU and GPU vendors to support it,
not just the secondary I.O. space kind of chipsets.
So what we've seen so far is CXL 1.1 and 2.0 coming out
and actually being usable,
and that has tracked fairly well to what we thought was going to happen.
CXL 3.0 was just being talked about two years ago.
And then I think that it was a little too simplistic.
So CXL 3.1 came out
and that was announced about a year ago.
It was, okay, we've fixed it. This is where the fabric management comes in, the ability to dynamically add and remove nodes from a shared memory system, which is obviously a pretty complex thing to do without crashing it. A lot of reserving of memory, and just making it into a dynamic, fabric-managed memory cluster.
And then some people say, okay, 3.1 has fixed some of those things.
There's more security in there.
And then just this week they announced CXL 3.2, which adds a few more details on top of that.
So I think that we've seen a delay in the 3.x rollout, the sort of more complex fabric-based
systems, but they've been debugging their approach a bit more.
So that's going to take a little bit longer, I think.
And then the other thing that's been happening is the alternative to CXL. If you look at an industry standards body, they tend to go relatively slowly, but they carry a lot of the industry with them. And you see individual vendors going out ahead of the pack with something that's more aggressive that they can build themselves; they don't have to negotiate with everyone else, and they can take shortcuts in some areas and get more performance. So we've seen NVIDIA's NVLink really take over that role as the shared memory interconnect that you use to build larger clusters, and AWS has their own similar link, NeuronLink I think they call it. Both are in the 1.8 to 2 terabytes per second kind of range, which is exceedingly high bandwidth.
And when I look at computer architecture now,
I keep having to add a bunch of zeros onto the end of all the numbers
because it all seems ridiculously large
compared to what we were looking at only 10 years ago.
So huge amounts of capacity.
And the thing that's really driving this is the AI models that they want to run are getting bigger and bigger, particularly for training.
And the limiting factor becomes, I want to put all that model in memory.
And the models are now getting to be terabytes of data in memory that you want to process efficiently.
And if you have to chop that into lots of smaller chunks and process it 100 gigabytes at a time, it's much less efficient than if you can put it all in memory. So you get an additional speedup for your training runs from having a very large node size rather than a cluster of smaller nodes.
So that seems to be the architecture trend that we're seeing, particularly with the NVIDIA Grace Hopper sort of clusters and Blackwell clusters, and AWS with their Trainium 2 clusters. So there are a few comparisons there, but that's generally the trend we're seeing: this CPU-centric, memory-centric drive.
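To see why model size drives node memory, here is a back-of-the-envelope Python sketch. It assumes the common rough accounting for mixed-precision Adam training of about 16 bytes of state per parameter; the node capacity uses the 34-terabyte shared-memory rack figure Adrian cites a bit later, and the model sizes are illustrative.

    # Rough bytes of training state per parameter under mixed-precision Adam:
    # FP16 weights (2) + FP16 gradients (2)
    # + FP32 master weights (4) + two FP32 Adam moments (4 + 4) = 16
    BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

    def training_footprint_tb(params_billions):
        return params_billions * 1e9 * BYTES_PER_PARAM / 1e12

    NODE_TB = 34  # shared-memory rack capacity quoted in this conversation
    for p in (70, 405, 3_000):
        tb = training_footprint_tb(p)
        verdict = "fits in one node" if tb <= NODE_TB else "must shard across nodes"
        print(f"{p:>5}B params -> ~{tb:.1f} TB of training state ({verdict})")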
Lenovo's sixth-generation Neptune liquid cooling powers the era of AI.
Lenovo's patented Neptune Liquid Cooling technology dramatically boosts system speed and performance,
helps meet sustainability goals and align with carbon reduction initiatives, and delivers
more computing power in a compact footprint.
Twelve-plus years of industry-leading liquid cooling, over 40 patents in liquid cooling
technology, 100% heat removal.
Neptune enables performance without compromise.
Learn more at Lenovo.com slash Neptune.
And of course, the other thing we're seeing is how aggressive the roadmaps are for all these chip companies.
With NVIDIA moving to an annual cycle when even a two-year cycle was already
really difficult. And that is causing, Adrian, you called it enterprise indigestion because the
chips are moving a little bit too fast. And by the time you can install one, there's another one
that is coming. Also, there were rumors of Blackwell delays. Those seem to be real, but clearly have not impacted NVIDIA's performance.
So there's that complexity going on as well in the GPU world.
Yeah, and I think we've seen that there were announcements at SC of the first shipments of the NVL72 rack.
That's the water-cooled rack with 72 Blackwell GPUs, drawing about 120 kilowatts.
And those racks are starting to be delivered.
Dell delivering to CoreWeave was announced.
I think Supermicro have done some deliveries.
AWS also announced that they will be taking delivery of those
and having them available as product on the AWS cloud in Q1, I think.
So they're all coming together around that architecture.
And it's a very impressive machine. For comparison, if you look at Blackwell, and it's the product that they've
got there, the NVL72 can be configured as a shared memory rack with 34 terabytes of RAM
and several hundred petaflops of sparse FP16 capacity. So that's what you'd use for training.
And then AWS came out with their own Trainium 2 chips. And it's an interesting strategy, because what they did is they targeted a different process from Blackwell. They knew where NVIDIA was going to be supply constrained at the chip foundries. So they targeted a different chip foundry spec for Trainium 2 so that they wouldn't have constraints, and they'd be able to build as much capacity as they wanted to. That's maybe not obvious, but that's one reason why Trainium 2 isn't quite as powerful: they can get it out at a lower price point. And overall, the rack they've built, they have a thing called an UltraCluster, and it's sort of comparable, if you like, to what the Blackwell system is. And they have a deal with Anthropic to train the Anthropic models on the Trainium 2 UltraClusters on AWS. But that system, the UltraCluster, is about half the capacity. It's about half the performance, about half the memory, but probably less than half the cost.
That's sort of the way they're doing it.
So on a price performance basis, AWS is hoping to be able to compete with NVIDIA. They'll have
the NVIDIA systems, but they're trying to build their own path to market, which is without the
same supply constraints, but at a different price point and a different performance point.
So anyway, that's what I saw happening. The Trainium 2, it's a two-chip module similar to Blackwell,
two of the largest chips you can get with four HBM chips on it.
And they bundle 64 of them into a double rack when they're configuring it.
So it's an interesting looking system.
We'll see.
And they have it, I think, shipping now.
So that's going to be interesting to see how that strategy plays out.
This is a little bit reminiscent of the earlier days of supercomputing
when Cray was the fastest, most expensive system out there.
And people were building mini supercomputers
that were half the performance and less than half the cost.
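That trade-off is easy to see in performance-per-dollar terms. A tiny sketch with invented numbers, since real rack pricing isn't public; it just shows how half the performance at less than half the cost can win on price-performance.

    def perf_per_dollar(petaflops, cost_musd):
        # Simple price-performance ratio: more is better.
        return petaflops / cost_musd

    nvidia_rack = perf_per_dollar(petaflops=100, cost_musd=4.0)   # hypothetical
    trainium_rack = perf_per_dollar(petaflops=50, cost_musd=1.6)  # hypothetical
    print(f"hypothetical perf per $M: NVIDIA {nvidia_rack:.1f}, "
          f"Trainium 2 {trainium_rack:.1f}")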
Well, since SC, the biggest news in the tech world, of course,
is Pat Gelsinger stepping down as CEO.
It's funny for me, this is,
I've been to the SC conference off and on for 30 years,
but this is my 10th straight conference as a journalist,
not including the COVID-interfered years. But in 2015, Intel was dominant, of course, and things have just exploded since then.
The big bang in chips, and that in part has led to the news about Gelsinger, is that things have moved away from x86.
But Adrian, in pre-show, you had some very interesting numbers around, for example, what AWS is doing on the CPU side.
Yeah, so historically, AWS was Intel architecture first.
There were some AMD chips in there,
but Intel was the primary supplier to AWS.
And we've seen sort of two things.
One is that on the GPU front, Intel hasn't been competitive.
So the GPU business has primarily gone to NVIDIA, and then AWS building its own internal chips.
But on the normal CPU cores, a few years ago, AWS came out with its ARM-based Graviton product line.
And one of the things they said this week was that more than 50% of the capacity they've installed in the last two years is ARM
based, it's Graviton. And all the internal systems, all the control planes and all of the internal services that AWS runs have shifted over to Graviton. So if that capacity had been
on Intel, there would obviously be a lot more revenue for Intel right now. But I think that that's certainly contributing to Intel's woes that some of their largest
customers have an in-house alternative and don't need Intel for everything they do.
And just the variety of architectures that are being adopted.
We could touch on the RISC-V conference that you both attended.
Yeah.
So the thing that was interesting at RISC-V,
I read a story that basically said this is the architecture you already have that you didn't know about. It's embedded in so many things. And in fact, NVIDIA said that it's embedded in all
their GPUs. So when you have an NVIDIA GPU, it has a number of RISC-V cores in it, which are
running the firmware that manages the rest of the GPU, boots it up, provides the interfaces, keeps track of it, reports how much power is going on,
all of the things that are ancillary to the actual AI workloads that are on it,
but they're being managed by RISC-V cores that are embedded into everything.
And the other thing with RISC-V is that a couple of years ago, it really became mainstream in terms of Linux support.
So the core Linux distributions now support RISC-V as just another production option. You don't have
to go recompiling kernels and searching for modules. It's just another architecture you
can specify and there's broad support for it. Certainly all the other packages, compilers,
languages now target RISC-V. So it's becoming a viable
architecture for people that want to look for a lower cost alternative. And you can see ARM as
being sort of a reaction to Intel charging premium because they had a lock-in on the industry.
So people went to ARM for a lower cost. And then, because ARM itself has license and customization restrictions, we're seeing RISC-V come in as effectively a lower-cost, more open alternative to ARM. Yeah, and on the server
side of RISC-V, that locomotive is also coming. Now, SiFive is probably the most well-known company within the HPC world, but there's also Alibaba, there's Tenstorrent, there's Microchip, there's Lattice Semiconductor. At the RISC-V event, Andes Technology was prominently exhibited and was on various panels. So it's coming. And I think, as Adrian mentioned, ARM has already
unlocked the adoption of new technologies and AI as a new application base makes that a little bit easier.
So it is a lot easier to crack the market than was the case for ARM.
So we're certainly seeing some increase in support for RISC-V,
but I don't think it's really turned up on the top 500 yet.
So that's mostly Intel and a little bit of ARM here and there.
So maybe Shaheen, you can just give us a summary of what the top 500 looked like
and what the news was like in that area at this year's SC.
Yeah, definitely can't talk about SC or ISC without talking about top 500.
And we've covered it in other podcasts, and everybody knows about it by now,
so no need to go into details.
But there was a new number one with El Capitan at 1.74 exaflops, which was comfortably
above 1.5 exaflops.
And when you look at the composition of the top 500, obviously you see the rapid growth
of AMD and NVIDIA for CPUs, GPUs, for both.
And of course, AMD has all the top exascale systems from DOE. So in terms of
performance share, they're probably leading. But you see a lot of GPUs coming up. And of course,
Intel continues to be very prominent and even dominant in terms of the number of systems that
use Intel chips. ARM is coming; the Jupiter system that is being installed in Europe as the first exascale system in Europe will be using SiPearl's ARM processors. But RISC-V, not quite. It might actually show up
as an accelerator before it shows up as a CPU. And overall, Shaheen, my sense from previous
comments is you're impressed with that architecture. I am. I think it's great. I think
they've done a great job architecting it in terms of flexibility and compatibility and extensibility and such,
and the frictionless way in which people can adopt it.
But yeah, I think once it kicks in, it will be a really good addition.
Of course, like Adrian said, it's an instruction set,
so it could show up anywhere on a chip. And as we go from
chips to tiles and chiplets integrated on top of a substrate, there are a lot more opportunities
for all these instruction sets to commingle without anybody noticing. And sticking with
architecture for one minute, we are seeing other companies coming out with new and novel
architecture. There's GPU and CPU integration
under one hood. We've seen the NextSilicon news. We're just seeing this continuing proliferation
of the search for more power, more efficiency, and new approaches.
Very much, very much. What else did you see, Adrian, at AWS reInvent?
Well, it's back to being a big event. Obviously, it went through a dip for
COVID, like everything else. About 60,000 people attended this year, with lots of focus on AI.
And AWS now has its own suite of models called Nova, there's like a small, medium, large size,
they claim they were pretty competitive, but the numbers they were comparing
against were a month or two old in terms of the benchmark league tables. And a few people looked
up the current numbers and said, well, yeah, they were competitive a month or two ago, but they're
not competitive now. It's such a race too. But they have some reasonable models that are low cost
and reasonable performance. And as long as they keep iterating on them, there's a good approach there. They now have much bigger deals with Anthropic. They had one of the Anthropic founders, the CEO or CTO I think, on stage. They've been funding Anthropic, and Anthropic has then been coming back and centering on AWS Trainium as its core architecture for training its models. So you're seeing that kind of relationship: Microsoft has cozied up to OpenAI, and AWS is basically doing the same thing with Anthropic to give them sort of a deeper relationship. There were a couple of other things, some in the data science and data management area. S3 has been around for a long time as an object store.
People have been using it to store data lakes. And one of the things people have been storing
is Parquet format, which gives you something that's more structured. And there's an open source project called Apache Iceberg, which is sort of the way people tend to build data lakes on top of it.
One of the most interesting announcements was that AWS announced
basically AWS S3 Tables, which is direct managed, optimized support for these Parquet-format Iceberg tables on S3. So your interface to online storage is no longer object-based,
but it's a full metadata schema table format. Everything ends up as a database, basically.
Whatever features you start with,
you'll end up with this sort of migration into,
like, this is the functionality people want.
But that's one of the more interesting things
that I think people will be building off of.
And it sort of disrupted the markets a little bit.
It undermined some of the Databricks sort of approaches
where they have their own slightly different approach. So they all have to rearrange around this Apache Iceberg standard. So that was the other one that I found useful as an announcement.
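For anyone who wants to experiment with this, here is a minimal sketch using the open source PyIceberg library to read an Iceberg table whose data files are Parquet. The catalog name, backend, and table identifier are placeholder assumptions; S3 Tables exposes its own Iceberg-compatible catalog endpoint, and its exact configuration may differ.

    from pyiceberg.catalog import load_catalog

    # Load a catalog; "glue" is one supported backend. An S3 Tables bucket
    # would be configured through its Iceberg-compatible catalog instead.
    catalog = load_catalog("demo", **{"type": "glue"})

    table = catalog.load_table("analytics.events")  # hypothetical namespace.table
    arrow_table = table.scan().to_arrow()           # reads the Parquet files underneath
    print(arrow_table.num_rows)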
Excellent. Yeah.
Getting the data sorted out and straight. I mean, 90% of the problem in data science and
supercomputing and AI is getting your data cleaned up and sorted out and knowing where it is.
So all of the things in that space are super valuable.
Yes, yes.
And as I like to say, we used to and still talk about data velocity, volume, variety, and value.
And gravity.
Well, gravity doesn't start with a V, so it doesn't qualify.
It doesn't make the list.
So I would match it with incorrect, irreproducible, irrelevant, and incomplete,
because data is just complicated.
And I think that that's a major part of storage.
And of course, everything that we do in our digital world.
So another show I wanted to bring up is the annual Q2B conference, that's Q, the digit 2, and the letter B, which is put on by QC Ware. And they have three installments of it: in Europe, in Asia, and in Silicon Valley. And the Silicon Valley version is the one that I go to
every year. It's my favorite quantum computing conference, quantum technology conference,
I should say. And I like the way they do it. I like the quality of the event and the quality
of the speakers and such. And we'll track that event as well. All right. Very good. Great
discussion, guys. Adrian, glad you could be with us again. And thanks so much. Yeah, thanks,
everyone. Always good to chat to you too. Thank you all. Take care.
That's it for this episode
of the At HPC podcast.
Every episode is featured
on InsideHPC.com
and posted on OrionX.net.
Use the comment section
or tweet us with any questions
or to propose topics of discussion.
If you like the show,
rate and review it on Apple Podcasts
or wherever you listen.
The At HPC podcast is a production of OrionX in association with Inside HPC.
Thank you for listening.