@HPC Podcast Archives - OrionX.net - @HPCpodcast-95: Adrian Cockcroft on SC24, RISC-V Summit, AWS Reinvent
Episode Date: January 8, 2025
Review of SC24, RISC-V Summit, and AWS reInvent. Topics include: HPC and AI clouds, CXL, liquid cooling, optical interconnects, optical computing, novel CPUs and GPUs, the state of RISC-V in servers and supercomputers, TOP500, chiplets, and AWS CPU and GPU strategies.
Transcript
12 plus years of industry-leading liquid cooling.
Over 40 patents in liquid cooling technology.
100% heat removal.
Lenovo Neptune Liquid Cooling.
Learn more at lenovo.com slash Neptune.
SC is emerging as the main AI show in the market.
And it's, of course, further evidence that AI is a subset of HPC.
When I look at computer architecture now,
I keep having to add a bunch of zeros onto the end of all the numbers
because it all seems ridiculously large
compared to what we were looking at only 10 years ago.
With NVIDIA moving to an annual cycle,
when even a two-year cycle was already really difficult.
And that is causing, Adrian, you called it enterprise indigestion,
because the chips are moving a little bit too fast.
And by the time you install one, there's another one that is coming.
From OrionX in association with Inside HPC,
this is the At HPC Podcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them.
Thank you for being with us.
Hi, everyone. Welcome to the @HPC Podcast. Joining us as our special guest is a colleague of Shaheen's who's been with us before, Adrian Cockcroft.
He's a partner and analyst at OrionX.
And we're looking back at a recent spate of shows, very significant conferences in our world, starting with SC24.
Also looking at the AWS reInvent conference,
along with a recent RISC-V conference. So Adrian, welcome. Great to be with you again.
And let me just start off with some interesting numbers. We've talked about some of these,
but some new factoids have come in, Shaheen, about SC Atlanta. We know that they
broke the 18,000 barrier on numbers of attendees, which was a jump of more than 25%.
They had a record number of exhibitors, a quarter of which had never been at SC before,
just under 500 total. And I think we all felt this in our feet, shoes, and legs: a 30% increase in floor space.
There was a lot of walking done two weeks ago in Atlanta. And I think an interesting thing to talk
about a little bit would be, what does this all mean that this conference is growing so fast?
And AI, of course, is the headline, but HPC is enabling AI.
And I think the world is really clued into that.
And more and more attendees and companies want to be there.
But Shaheen, if you want to jump in, or Adrian, what would be your thoughts on the implications
of how that show is growing?
So in agreement with you, Doug, I think SC is emerging as the main AI show in the market. And it's, of course,
further evidence of my own view that AI is a subset of HPC. It's a killer app within HPC,
and it's arguably potentially bigger than the rest, but still fundamentally HPC.
We've said it many times, the skill set is the same, the infrastructure is the same,
the algorithms are the same. And when you look at new features that get added to AI,
like sensitivity analysis that we were talking about
in our last HPC News Bites,
those are all like partial differential equations and such.
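As a quick aside for anyone unfamiliar with the term: sensitivity analysis at its simplest asks how much a model's output moves when an input parameter moves, which is just estimating partial derivatives, the same mathematics that underlies PDE-based HPC codes. Here is a minimal Python sketch of the idea; the model function is invented for illustration.

    def sensitivity(f, params, key, h=1e-6):
        """Estimate d f / d params[key] by central differences."""
        up = dict(params); up[key] += h
        down = dict(params); down[key] -= h
        return (f(up) - f(down)) / (2 * h)

    def model(p):
        # A stand-in model output, purely illustrative.
        return p["a"] ** 2 * p["b"]

    # d/da (a^2 * b) = 2ab = 12 at a=3, b=2
    print(sensitivity(model, {"a": 3.0, "b": 2.0}, "a"))  # ~12.0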
So I think that's one reason.
And of course, there's the gentle, ongoing growth of HPC itself that would have happened anyway.
It was nice to see such a big jump,
both in attendees, but also in exhibitors. Yeah. Adrian, what grabbed you about SC? I know you attended remotely,
but yeah, what were some of the important themes you saw coming out of the show?
Yeah, I've been to the previous two SC conferences and was sort of trying to figure out some trends, a couple of things that I'd been picking up from last year to this year. Last year, Microsoft Azure put up a LINPACK result with Eagle that was quite competitive. And I suspected that maybe this year some of the other cloud vendors might come in. It's relatively easy for them to run a number. And they didn't do that. I think they're all distracted doing AI, and they don't really have a focus on HPC other than
that it's a place to hire people from, I guess, that have the right kind of experience. So that
was one area where I was expecting to see more and didn't really. And then the other area is
mostly around CXL and the maturation of the CXL protocol. We're seeing implementations
of CXL 1.1 and 2.0 now as memory extensions. You're now building a memory hierarchy where you
can have shared pools, very large shared pools of memory across clusters, to hold your models and your data basically in memory rather than having to go to IO. And that seems to be turning into products that are shipping now this year.
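To make that placement decision concrete, here is a minimal Python sketch of a tiered-memory policy: put the working set in the fastest tier that can hold it, spilling from local DRAM to a CXL-attached pool before falling back to IO. The tier names, capacities, and latencies are illustrative assumptions, not figures from any real system.

    TIERS = [
        # (name, capacity in GB, rough load latency in ns) - illustrative only
        ("local DRAM", 2_000, 100),
        ("CXL shared pool", 30_000, 300),
        ("NVMe (IO)", 500_000, 100_000),
    ]

    def place(working_set_gb):
        """Return the fastest tier whose capacity fits the working set."""
        for name, capacity_gb, latency_ns in TIERS:
            if working_set_gb <= capacity_gb:
                return name, latency_ns
        raise ValueError("working set exceeds all tiers")

    for size_gb in (500, 10_000, 100_000):
        tier, latency_ns = place(size_gb)
        print(f"{size_gb:>7} GB -> {tier} (~{latency_ns} ns loads)")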
Okay. Shaheen, other areas of technology caught your eye, really caught your attention at SC?
Well, I think liquid cooling and energy were front and center this time around.
And not just for ESG purposes, for environmental sensitivity and green credentials, but to actually save money.
And I thought that it's nice that the economic incentive and the climate incentive are aligned
in a big way or a bigger way than before. Now, Adrian, when you were at AWS, you looked at this
quite a bit and you're tracking it. So interested in your comments,
but that was a big deal for me. Yeah, we're starting to see power being one of the big constraints on deploying data centers and the typical places where the hyperscalers have been
putting regions are becoming power limited. And so they're basically going to where they can find
power and putting sort of offshoots of their main
centers.
AWS calls these local zones where the control plane that you use to manage all the resources
stays in one of their central regions, but they can go and say, oh, there's a nuclear
power station over here or a particular big solar array cluster over here, and somewhere where they can put up a very large building
full of the GPUs
and have it be close to power and cooling,
sited for that,
but not have to deploy an entire region-level
control plane infrastructure.
So it operates as an extension of the main regions.
It's like a local zone.
They've had that concept for a while
and they're placing them all around the world.
So that's, I think, how we're seeing the hyperscalers
go after this problem.
But the new data centers are being built much more efficiently.
AWS announced a new data center technology package
at reInvent this week, which is much more efficient in how it handles power and cooling. It's quite a detailed release of the technologies they're doing. And they also
started releasing power usage efficiency data for the first time with some quite good numbers. And
finally, we have PUE numbers for all the regions of AWS around the world, something they haven't
disclosed before. So it's becoming enough of an issue that it's forcing people to disclose, this is our efficiency, this is how we're doing it,
and come up with architectures where they can go find places where they can put the capacity that
they need.
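For reference, the PUE metric Adrian mentions is simply the ratio of total facility power to the power delivered to the IT equipment, with 1.0 as the unreachable ideal. A one-line sketch, with made-up numbers:

    def pue(total_facility_kw, it_equipment_kw):
        # PUE = total facility power / IT equipment power; 1.0 is ideal.
        return total_facility_kw / it_equipment_kw

    # 1.15 here would mean 15% overhead for cooling, power conversion, etc.
    print(pue(1_150, 1_000))  # placeholder figures, not AWS's disclosed numbers

Yeah, and it's interesting along those lines. One of the exhibitors, one of the first-time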
exhibitors, was Valvoline. So they're getting into liquid cooling, which surely indicates that data center energy
consumption is becoming a major market opportunity to address. And we're also seeing new technologies.
An amazing thing about obviously the technology industry, a characteristic of it is its ability
to quickly respond to new market opportunities and new market demands.
So Shaheen, for example, in the area of optical I/O: not a strong presence on the SC conference agenda, but an increasing number of optical I/O vendors, and we also saw an optical computing presence on the show floor. Yeah, definitely. This is a good reminder of the
number of exhibitors that were at SC24, and it was a new high. 494 was the number from 29 different
countries. So what that tells me is really the reinforcement of the notion that HPC is an early
adopter market. So when you go to SC, you see everything.
You don't just see supercomputers.
And you see all these novel advanced architectures and technologies that are coming online.
And Valvoline is an excellent example.
They were a first-time exhibitor, but Shell was on the floor. Castrol was on the floor.
And all of these guys are known as motor oil producers, makers of actual oil products.
So when I talked to the folks at Valvoline, I noticed they had a nice, beautiful sports car in their booth, like a Formula One racing car.
And my little joke with them, by the way, was that's the best giveaway on the show.
I didn't get one, unfortunately.
I'd even take just one lap around the convention floor. In five seconds, right? Yeah, totally. But what I asked them was,
hey, when I think of motor oil, I think lubrication. I don't think cooling. Can you
explain that? And they said, well, actually, motor oil has five uses. It is for lubrication, of course, and that's the star of the show, as they put it, in a car.
But it also cleans.
It also protects against corrosion and such.
And it seals, but it also cools.
And on these high-end, super high-end cars, they actually splash oil on top of the cylinder head to cool it after it has gone through the ignition phase.
And that prevents it from getting stuck. And it's a very important part. But in a computer,
cooling becomes the star of the show. So that was interesting and good insight. There were
companies that were making hoses of different kinds. And the coating on the inside of the hoses actually has an impact based on what sort of fluid runs through it.
So all of these complexities bring chemistry and biology
and fluid dynamics into the electronics world that we've had.
And of course, HPC people simulate all of it.
A lot of the high-powered electric cars have very extreme oil cooling; both the power unit controllers and the motors themselves tend to be oil cooled now. In some of the bigger cars, when you put your foot down, there's basically a megawatt of power being dissipated. That's lots of cars now, in the sort of hundreds of horsepower up to a thousand horsepower level, and keeping those things cool enough so that you can do more than just a three-second or two-second naught to 60, or whatever it is, takes a lot of cooling.
There you go. Yeah. Yeah. And I forgot your fast car history, Adrian. And Adrian was part of the
technical team that made it go fast. So when he wrote a book on performance, I think you had a
picture of your sports car on top of the front cover.
Yeah. People called it the red Porsche book if they couldn't remember the title, because that was on the cover. There you go. So the other thing was really advanced technologies, as you mentioned, Doug. Optical technology in general, optical computing as well as optical communication, was one. CXL was one, and I know that Adrian and I have
been tracking it quite closely,
and we should talk about that in general. And also new types of CPU architecture, whether
they're sort of dynamically self-optimizing or whether they are trying to have a one-to-one
correspondence with GPUs and remove the CPU bottleneck, or other forms of CPUs, not to mention ARM and RISC-V.
So there's just a wide diversity of chips, both in a CPU capacity and in an accelerator GPU capacity.
Well, on CXL, Adrian, I know two years ago coming out of SC, you wrote an article about CXL that got a lot of interest, even a little bit of controversy.
And I'm wondering if your thesis at that time has been borne out or if there have been surprising
changes since then? Yeah, I think, I mean, there's always, you get a chip spec and then it takes a
while for it to turn up into implementations. And with CXL, the implementation needs to be in the CPU chips. So you're waiting for the CPU and GPU vendors to support it,
not just the secondary I.O. space kind of chipsets.
So what we've seen so far is CXL 1.1 and 2.0 coming out
and actually being usable,
and that has tracked fairly well to what we thought was going to happen.
CXL 3.0 was just being talked about two years ago.
And then I think that it was a little too simplistic.
So CXL 3.1 came out
and that was announced about a year ago.
It was, okay, we've fixed it. This is where the fabric management comes in, the ability to dynamically add and remove nodes from a shared memory system, which is obviously a pretty complex thing to do without crashing it. A lot of reserving of memory, and just making it into a dynamic, fabric-managed memory cluster.
And then some people say, okay, 3.1 has fixed some of those things.
There's more security in there.
And then just this week they announced CXL 3.2, which adds a few more details on top of that.
So I think that we've seen a delay in the 3.x rollout, the sort of more complex fabric-based
systems, but they've been debugging their approach a bit more.
So that's going to take a little bit longer, I think.
And then the other thing that's been happening is the alternative to CXL. If you look at an industry standards body, they tend to go relatively slowly, but they carry a lot of the industry with them. And you see individual vendors going out ahead of the pack with something that's more aggressive that they can build themselves; they don't have to negotiate with everyone else, and they can take shortcuts in some areas and get more performance. So we've seen NVIDIA's NVLink really take over that role as the shared memory interconnect that you use to build larger clusters, and AWS has their own similar link, NeuronLink I think they call it. Both are in the 1.8 to 2 terabytes per second kind of range, which is exceedingly high bandwidth.
And when I look at computer architecture now,
I keep having to add a bunch of zeros onto the end of all the numbers
because it all seems ridiculously large
compared to what we were looking at only 10 years ago.
So huge amounts of capacity.
And the thing that's really driving this is the AI models that they want to run are getting bigger and bigger, particularly for training.
And the limiting factor becomes, I want to put all that model in memory.
And the models are now getting to be terabytes of data in memory that you want to process efficiently.
And if you have to chop that into lots of smaller chunks and process it 100 gigabytes at a time, it's much less efficient than if you can put it all in memory. So you get an additional speedup for your training runs from having a very large node size rather than a cluster of smaller nodes.
So that seems to be the architecture trend that we're seeing, particularly with the NVIDIA Grace Hopper sort of clusters and Blackwell clusters, and AWS with their Trainium 2 clusters. So there are a few comparisons there, but that's generally the trend we're seeing: this CPU-centric, memory-centric drive.
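To see why model size drives node memory, here is a back-of-the-envelope Python sketch. It assumes the common rough accounting for mixed-precision Adam training of about 16 bytes of state per parameter; the node capacity uses the 34-terabyte shared-memory rack figure Adrian cites a bit later, and the model sizes are illustrative.

    # Rough bytes of training state per parameter under mixed-precision Adam:
    # FP16 weights (2) + FP16 gradients (2)
    # + FP32 master weights (4) + two FP32 Adam moments (4 + 4) = 16
    BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

    def training_footprint_tb(params_billions):
        return params_billions * 1e9 * BYTES_PER_PARAM / 1e12

    NODE_TB = 34  # shared-memory rack capacity quoted in this conversation
    for p in (70, 405, 3_000):
        tb = training_footprint_tb(p)
        verdict = "fits in one node" if tb <= NODE_TB else "must shard across nodes"
        print(f"{p:>5}B params -> ~{tb:.1f} TB of training state ({verdict})")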
Lenovo's sixth-generation Neptune liquid cooling powers the era of AI.
Lenovo's patented Neptune Liquid Cooling technology dramatically boosts system speed and performance,
helps meet sustainability goals and align with carbon reduction initiatives, and delivers
more computing power in a compact footprint.
Twelve-plus years of industry-leading liquid cooling, over 40 patents in liquid cooling
technology, 100% heat removal.
Neptune enables performance without compromise.
Learn more at Lenovo.com slash Neptune.
And of course, the other thing we're seeing is how aggressive the roadmaps are for all these chip companies.
With NVIDIA moving to an annual cycle when even a two-year cycle was already
really difficult. And that is causing, Adrian, you called it enterprise indigestion because the
chips are moving a little bit too fast. And by the time you can install one, there's another one
that is coming. Also, there were rumors of Blackwell delays. Those seem to be real, but clearly have not impacted NVIDIA's performance.
So there's that complexity going on as well in the GPU world.
Yeah, and I think we've seen that there were announcements at SC of the first shipments of the NVL72 rack.
That's the water-cooled rack with 72 Blackwell GPUs, drawing about 120 kilowatts.
And those racks are starting to be delivered.
Dell delivering to CoreWeave was announced.
I think Supermicro have done some deliveries.
AWS also announced that they will be taking delivery of those
and having them available as product on the AWS cloud in Q1, I think.
So they're all coming together around that architecture.
And it's a very impressive machine. For comparison, if you look at Blackwell, and it's the product that they've
got there, the NVL72 can be configured as a shared memory rack with 34 terabytes of RAM
and several hundred petaflops of sparse FP16 capacity. So that's what you'd use for training.
And then AWS came out with their own Trainium 2 chips. And it's an interesting strategy, because what they did is they targeted a different process from Blackwell. They knew where NVIDIA was going to be supply constrained at the chip foundries. So they targeted a different chip foundry spec for Trainium 2 so that they wouldn't have constraints, and they'd be able to build as much capacity as they wanted to. That's maybe not obvious, but that's one reason why Trainium 2 isn't quite as powerful: they can get it out at a lower price point. And overall, the rack they've built, they have a thing called an UltraCluster, and it's sort of comparable, if you like, to what the Blackwell system is. And they have a deal with Anthropic to train the Anthropic models on the Trainium 2 UltraClusters on AWS. But that system, the UltraCluster, is about half the capacity. It's about half the performance, about half the memory, but probably less than half the cost.
That's sort of the way they're doing it.
So on a price performance basis, AWS is hoping to be able to compete with NVIDIA. They'll have
the NVIDIA systems, but they're trying to build their own path to market, which is without the
same supply constraints, but at a different price point and a different performance point.
So anyway, that's what I saw happening. The Trainium 2, it's a two-chip module similar to Blackwell,
two of the largest chips you can get with four HBM chips on it.
And they bundle 64 of them into a double rack when they're configuring it.
So it's an interesting looking system.
We'll see.
And they have it, I think, shipping now.
So that's going to be interesting to see how that strategy plays out.
This is a little bit reminiscent of the earlier days of supercomputing
when Cray was the fastest, most expensive system out there.
And people were building mini supercomputers
that were half the performance and less than half the cost.
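That trade-off is easy to see in performance-per-dollar terms. A tiny sketch with invented numbers, since real rack pricing isn't public; it just shows how half the performance at less than half the cost can win on price-performance.

    def perf_per_dollar(petaflops, cost_musd):
        # Simple price-performance ratio: more is better.
        return petaflops / cost_musd

    nvidia_rack = perf_per_dollar(petaflops=100, cost_musd=4.0)   # hypothetical
    trainium_rack = perf_per_dollar(petaflops=50, cost_musd=1.6)  # hypothetical
    print(f"hypothetical perf per $M: NVIDIA {nvidia_rack:.1f}, "
          f"Trainium 2 {trainium_rack:.1f}")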
Well, since SC, the biggest news in the tech world, of course,
is Pat Gelsinger stepping down as CEO.
It's funny for me, this is,
I've been to the SC conference off and on for 30 years,
but this is my 10th straight conference as a journalist,
not including the COVID-interfered years. But in 2015, Intel was dominant, of course, and things have just exploded since then.
The big bang in chips, and that in part has led to the news about Gelsinger, is that things have moved away from x86.
But Adrian, in pre-show, you had some very interesting numbers around, for example, what AWS is doing on the CPU side.
Yeah, so historically, AWS was Intel architecture first.
There were some AMD chips in there,
but Intel was the primary supplier to AWS.
And we've seen sort of two things.
One is that on the GPU front, Intel hasn't been competitive.
So the GPU business has primarily gone to NVIDIA, and then AWS building its own internal chips.
But on the normal CPU cores, a few years ago, AWS came out with its ARM-based Graviton product line.
And one of the things they said this week was that more than 50% of the capacity they've installed in the last two years is ARM
based, it's Graviton. And all the internal systems, all the control planes and all of the internal services that AWS runs have shifted over to Graviton. So if that capacity had been
on Intel, there would obviously be a lot more revenue for Intel right now. But I think that that's certainly contributing to Intel's woes that some of their largest
customers have an in-house alternative and don't need Intel for everything they do.
And just the variety of architectures that are being adopted.
We could touch on the RISC-V conference that you both attended.
Yeah.
So the thing that was interesting at RISC-V,
I read a story that basically said this is the architecture you already have that you didn't know about. It's embedded in so many things. And in fact, NVIDIA said that it's embedded in all
their GPUs. So when you have an NVIDIA GPU, it has a number of RISC-V cores in it, which are
running the firmware that manages the rest of the GPU, boots it up, provides the interfaces, keeps track of it, reports how much power is going on,
all of the things that are ancillary to the actual AI workloads that are on it,
but they're being managed by RISC-V cores that are embedded into everything.
And the other thing with RISC-V is that a couple of years ago, it really became mainstream in terms of Linux support.
So the core Linux distributions now support RISC-V as just another production option. You don't have
to go recompiling kernels and searching for modules. It's just another architecture you
can specify and there's broad support for it. Certainly all the other packages, compilers,
languages now target RISC-V. So it's becoming a viable
architecture for people that want to look for a lower cost alternative. And you can see ARM as
being sort of a reaction to Intel charging premium because they had a lock-in on the industry.
So people went to ARM for a lower cost. And then, because ARM itself has license and customization restrictions, we're seeing RISC-V come in as effectively a lower-cost, more open alternative to ARM. Yeah, and on the server
side of RISC-V, that locomotive is also coming. Now, SiFive is probably the most well-known company within the HPC world, but there's also Alibaba, there's Tenstorrent, there's Microchip, there's Lattice Semiconductor. At the RISC-V event, Andes Technology was prominently exhibited and was on various panels. So it's coming. And I think, as Adrian mentioned, ARM has already
unlocked the adoption of new technologies and AI as a new application base makes that a little bit easier.
So it is a lot easier to crack the market than was the case for ARM.
So we're certainly seeing some increase in support for RISC-V,
but I don't think it's really turned up on the top 500 yet.
So that's mostly Intel and a little bit of ARM here and there.
So maybe Shaheen, you can just give us a summary of what the top 500 looked like
and what the news was like in that area at this year's SC.
Yeah, definitely can't talk about SC or ISC without talking about top 500.
And we've covered it in other podcasts, and everybody knows about it by now,
so no need to go into details.
But there was a new number one with El Capitan at 1.74 exaflops, which was comfortably
above 1.5 exaflops.
And when you look at the composition of the top 500, obviously you see the rapid growth
of AMD and NVIDIA for CPUs, GPUs, for both.
And of course, AMD has all the top exascale systems from DOE. So in terms of
performance share, they're probably leading. But you see a lot of GPUs coming up. And of course,
Intel continues to be very prominent and even dominant in terms of the number of systems that
use Intel chips. ARM is coming; the Jupiter system that is being installed in Europe as the first exascale system in Europe will be using SiPearl's ARM processors. But RISC-V, not quite. It might actually show up
as an accelerator before it shows up as a CPU. And overall, Shaheen, my sense from previous
comments is you're impressed with that architecture. I am. I think it's great. I think
they've done a great job architecting it in terms of flexibility and compatibility and extensibility and such,
and the frictionless way in which people can adopt it.
But yeah, I think once it kicks in, it will be a really good addition.
Of course, like Adrian said, it's an instruction set,
so it could show up anywhere on a chip. And as we go from
chips to tiles and chiplets integrated on top of a substrate, there are a lot more opportunities
for all these instruction sets to commingle without anybody noticing. And sticking with
architecture for one minute, we are seeing other companies coming out with new and novel
architecture. There's GPU and CPU integration
under one hood. We've seen the NextSilicon news. We're just seeing this continuing proliferation
of the search for more power, more efficiency, and new approaches.
Very much, very much. What else did you see, Adrian, at AWS reInvent?
Well, it's back to being a big event. Obviously, it went through a dip for
COVID, like everything else. About 60,000 people attended this year, with lots of focus on AI.
And AWS now has its own suite of models called Nova, there's like a small, medium, large size,
they claim they were pretty competitive, but the numbers they were comparing
against were a month or two old in terms of the benchmark league tables. And a few people looked
up the current numbers and said, well, yeah, they were competitive a month or two ago, but they're
not competitive now. It's such a race too. But they have some reasonable models that are low cost
and reasonable performance. And as long as they keep iterating on them, there's a good approach there. They now have much bigger deals with Anthropic. They had one of the Anthropic founders, the CEO or CTO I think, on stage. They've been funding Anthropic, and Anthropic has then been coming back and centering on AWS Trainium as its core architecture for training its models. So you're seeing that kind of relationship: Microsoft has cozied up to OpenAI, and AWS is basically doing the same thing with Anthropic to give them sort of a deeper relationship. There were a couple of other things, some in the data science and data management area. S3 has been around for a long time as an object store.
People have been using it to store data lakes. And one of the things people have been storing
is Parquet format, which gives you something that's more structured. And there's an open source project called Apache Iceberg, which is sort of the way people tend to build data lakes on top of it.
One of the most interesting announcements was that AWS announced
basically AWS S3 Tables, which is direct managed, optimized support for these Parquet-format Iceberg tables on S3. So your interface to online storage is no longer object-based,
but it's a full metadata schema table format. Everything ends up as a database, basically.
Whatever features you start with,
you'll end up with this sort of migration into,
like, this is the functionality people want.
But that's one of the more interesting things
that I think people will be building off of.
And it sort of disrupted the markets a little bit.
It undermined some of the Databricks sort of approaches
where they have their own slightly different approach. So they all have to rearrange around this Apache Iceberg standard. So that was the other one that I found useful as an announcement.
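For anyone who wants to experiment with this, here is a minimal sketch using the open source PyIceberg library to read an Iceberg table whose data files are Parquet. The catalog name, backend, and table identifier are placeholder assumptions; S3 Tables exposes its own Iceberg-compatible catalog endpoint, and its exact configuration may differ.

    from pyiceberg.catalog import load_catalog

    # Load a catalog; "glue" is one supported backend. An S3 Tables bucket
    # would be configured through its Iceberg-compatible catalog instead.
    catalog = load_catalog("demo", **{"type": "glue"})

    table = catalog.load_table("analytics.events")  # hypothetical namespace.table
    arrow_table = table.scan().to_arrow()           # reads the Parquet files underneath
    print(arrow_table.num_rows)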
Excellent. Yeah.
Getting the data sorted out and straight. I mean, 90% of the problem in data science and
supercomputing and AI is getting your data cleaned up and sorted out and knowing where it is.
So all of the things in that space are super valuable.
Yes, yes.
And as I like to say, we used to and still talk about data velocity, volume, variety, and value.
And gravity.
Well, gravity doesn't start with a V, so it doesn't qualify.
It doesn't make the list.
So I would match it with incorrect, irreproducible, irrelevant, and incomplete,
because data is just complicated.
And I think that that's a major part of storage.
And of course, everything that we do in our digital world.
So another show I wanted to bring up is the annual Q2B conference, that's Q, the digit 2, and the letter B, which is put on by QC Ware. And they have three installments of it: in Europe, in Asia, and in Silicon Valley. And the Silicon Valley version is the one that I go to
every year. It's my favorite quantum computing conference, quantum technology conference,
I should say. And I like the way they do it. I like the quality of the event and the quality
of the speakers and such. And we'll track that event as well. All right. Very good. Great
discussion, guys. Adrian, glad you could be with us again. And thanks so much. Yeah, thanks,
everyone. Always good to chat to you too. Thank you all. Take care.
That's it for this episode
of the At HPC podcast.
Every episode is featured
on InsideHPC.com
and posted on OrionX.net.
Use the comment section
or tweet us with any questions
or to propose topics of discussion.
If you like the show,
rate and review it on Apple Podcasts
or wherever you listen.
The At HPC podcast is a production of OrionX in association with Inside HPC.
Thank you for listening.