@HPC Podcast Archives - OrionX.net - HPC News Bytes – 20240916
Episode Date: September 16, 2024 - Oracle Cloud 130K Blackwell plan - AI inference battleground - Storage for AI - Rhea takes Jupiter to Jülich
Transcript
Welcome to HPC News Bytes,
a weekly show about important news in the world of supercomputing,
AI, and other advanced technologies.
Hi, everyone. Welcome to HPC News Bytes.
I'm Doug Black of Inside HPC,
and with me is Shaheen Khan of OrionX.net.
The pace of change in high-performance computing
continues to accelerate at dizzying
speed, and several pieces of news over the past week demonstrate the point. We'll start with
Oracle and its cloud computing platform, currently ranked toward the bottom of the top cloud service
providers, within a market growing at nearly 20% annually and expected to reach $2 trillion by 2032.
Oracle announced plans to build an AI supercluster powered by NVIDIA's upcoming Blackwell GPUs,
which are scheduled for shipment in the first half of next year.
The Oracle Blackwell offering will combine more than 130,000 Blackwells and exceed the
zettascale performance milestone, according to the company.
Certainly in FP8 and lower precision, it does. The stampede to build AI infrastructure continues
as all the cloud providers, national AI initiatives, and well-funded AI companies allocate
tens to hundreds of megawatts and multi-billion-dollar budgets. When they get this big, though,
they often are multiple systems. 50,000 GPUs seems to be the upper limit for a single system, based on
what I can see. Oracle also mentioned increasing openness by customers to other GPUs, presumably
AMD's MI300 series, Intel Gaudi, and others, especially for inference.
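As a quick sanity check on that precision qualifier, here is a back-of-the-envelope calculation. The per-GPU throughput figures are assumptions in the ballpark of publicly discussed Blackwell-class numbers, not confirmed specifications:

```python
# Back-of-the-envelope check of the zettascale claim.
# Per-GPU figures are assumed Blackwell-class ballpark numbers,
# not confirmed specifications.
num_gpus = 130_000

pflops_per_gpu = {
    "FP8 (dense, assumed)": 9,    # petaflops per GPU
    "FP4 (sparse, assumed)": 20,  # petaflops per GPU
}

# 1 zettaflop = 1e6 petaflops
for precision, pf in pflops_per_gpu.items():
    total_zf = num_gpus * pf / 1e6
    print(f"{precision}: {total_zf:.2f} zettaflops aggregate peak")

# FP8 (dense, assumed): 1.17 zettaflops aggregate peak
# FP4 (sparse, assumed): 2.60 zettaflops aggregate peak
```

Under these assumed figures, the cluster clears one zettaflop of aggregate peak only at reduced precision, which is exactly the qualifier above.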
AI is the gold rush, and NVIDIA continues to lead and is able to fulfill demand.
Maybe someday they will have serious competition, but that day has been elusive for many years now.
And as has been the case for some months now, for vendors of AI accelerators, it is more a question of capacity,
how many you can ship and when, than a question
of how much better your GPU is. This is reminiscent of the early days of CPUs, when manufacturing
technology and capacity were a big driver, and back then, Intel was leading on that front.
Yes, and there are reports AMD is de-emphasizing the gaming market for its GPUs to focus more on the AI data center server business.
Also, a JP Morgan analyst projects AMD's revenue from data center GPUs to grow by $5 billion in
the current fiscal year. We've seen reports that shipments of AMD's MI300 series could reach
400,000 units this year, with Microsoft the number one customer and Meta at
number two. Those numbers should only increase in upcoming years as TSMC, Samsung, and Intel
increase their GPU fab capacity. There was an interesting story in the Wall Street Journal
last week about VAST Data, the New York City-headquartered storage company that is well
known to the HPC community.
The Journal article focused on the change that AI is driving in tech infrastructure,
with AI's insatiable appetite for high-speed data access. The big point they made was about
traditional multi-tiered storage, in which less frequently used data is stored on cheaper, less accessible devices, noting that with AI,
you essentially need just one fast tier.
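A minimal sketch of why that is, assuming illustrative device speeds and hit rates (none of these numbers come from the article): once AI reads are effectively random across the dataset, even a small fraction of reads landing on a slow tier drags down the whole pipeline.

```python
# Illustrative two-tier storage model. All speeds and hit rates are
# assumptions for illustration, not measured or vendor figures.

def effective_throughput_gbs(fast_gbs, slow_gbs, fast_hit_rate):
    """Time per byte is the weighted average of the time per byte on
    each tier, so effective throughput is the weighted harmonic mean."""
    time_per_gb = fast_hit_rate / fast_gbs + (1 - fast_hit_rate) / slow_gbs
    return 1 / time_per_gb

FAST_TIER_GBS = 50.0  # all-flash tier (assumed)
SLOW_TIER_GBS = 1.0   # disk/archive tier (assumed)

for hit in (1.0, 0.99, 0.9, 0.5):
    gbs = effective_throughput_gbs(FAST_TIER_GBS, SLOW_TIER_GBS, hit)
    print(f"fast-tier hit rate {hit:.0%}: {gbs:.1f} GB/s effective")

# 100%: 50.0 GB/s; 99%: 33.6 GB/s; 90%: 8.5 GB/s; 50%: 2.0 GB/s
# Missing the fast tier on just 1% of reads already costs a third of
# the bandwidth, which is what makes a single fast tier attractive.
```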
Well, in fact, and better yet, you'd want all that data in memory.
That has been a motivation for high-bandwidth memory and coherent interconnects, a return
to shared memory and shared storage, really.
Data layout and access patterns for AI applications
have not settled yet, so we should expect changes as new algorithms and optimization techniques
emerge. Meanwhile, we will see technologies that take a big enough use case and fully optimize for
it. That's one reason why we have so many AI chips and data models. So you take a few months of AI training to create a model
and then use it for inference every day. That every-day part starts looking like online
transaction processing, something enterprise customers understand very well, going back to
the mainframe, where cost per transaction, response time, number of simultaneous users, compliance, consistency, etc. become major factors.
And then you need federated and continual learning to evolve the model.
All of that puts new strains on the whole infrastructure, including data placement, persistence, and access, which is not a bad way of thinking about storage.
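To illustrate that OLTP framing, here is the shape of the cost-per-transaction and capacity calculation. Every input is a hypothetical value chosen for illustration, not a real price or benchmark:

```python
# Hypothetical cost-per-transaction model for an inference service,
# in the spirit of classic OLTP metrics. All inputs are assumed.
GPU_HOUR_COST = 4.00          # $/GPU-hour (assumed cloud price)
QUERIES_PER_SEC_PER_GPU = 25  # sustained throughput (assumed), taken
                              # to hold within the response-time target

queries_per_hour = QUERIES_PER_SEC_PER_GPU * 3600
cost_per_query = GPU_HOUR_COST / queries_per_hour
print(f"cost per query: ${cost_per_query:.5f}")  # ~$0.00004

# Capacity planning, OLTP-style: GPUs needed for a peak load driven
# by the number of simultaneous users (assumed peak of 10,000 qps).
peak_qps = 10_000
gpus_needed = -(-peak_qps // QUERIES_PER_SEC_PER_GPU)  # ceiling division
print(f"GPUs needed at peak: {gpus_needed}")  # 400
```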
Let's end with a few quick takes. The Jupiter supercomputer,
Europe's first exascale system, is being installed at the Jülich Supercomputing Centre in Germany.
That's a system by Eviden, which typically installs the usual NVIDIA, AMD, and Intel configurations,
but this one has what Jülich calls a universal cluster and a GPU booster. The cluster is based on SiPearl's Rhea-1 Arm CPUs,
and the booster uses NVIDIA's GH200 chips. By the way, Rhea, the goddess of the Earth in Greek
mythology, is the mother of Zeus, whom the Romans called Jupiter.
Ah, there we have it. Staying in Europe, the ISC 2025 supercomputing conference, to be held in Germany next June, is now accepting proposals for presentations, sessions, and birds-of-a-feather discussions. And finally, the Summit supercomputer became the world's number one system after it was installed at Oak Ridge Lab in 2018,
and now goes into retirement after an impressive tour of duty. Summit combined IBM Power CPUs
with NVIDIA GPUs. And while it wasn't the first HPC system to do this, it significantly furthered
that architectural approach. And now it goes out in style, still ranked number nine in the world.
Amazing.
All right, that's it for this episode. Thank you all for being with us.
HPC News Bytes is a production of OrionX in association with Inside HPC. Shaheen Khan and
Doug Black host the show. Every episode is featured on InsideHPC.com and posted on OrionX.net.
Thank you for listening.