@HPC Podcast Archives - OrionX.net - HPC News Bytes – 20240916

Episode Date: September 16, 2024

- Oracle Cloud 130K Blackwell plan - AI inference battleground - Storage for AI - Rhea takes Jupiter to Jülich

Transcript
[00:00:00] Welcome to HPC News Bytes, a weekly show about important news in the world of supercomputing, AI, and other advanced technologies. Hi, everyone. Welcome to HPC News Bytes. I'm Doug Black of Inside HPC, and with me is Shaheen Khan of OrionX.net. The pace of change in high-performance computing continues to accelerate at dizzying
[00:00:26] speed, and several pieces of news over the past week demonstrate the point. We'll start with Oracle and its cloud computing platform, currently ranked toward the bottom of the top cloud service providers, within a market growing at nearly 20% annually and expected to reach $2 trillion by 2032. Oracle announced plans to build an AI supercluster powered by NVIDIA's upcoming Blackwell GPUs, which are scheduled for shipment in the first half of next year. The Oracle Blackwell offering will combine more than 130,000 Blackwells and exceed the zettascale performance milestone, according to the company. Certainly in FP8 and lower precision, it does. The stampede to build AI infrastructure continues
[00:01:13] as all cloud providers, national programs, and well-funded AI companies allocate tens to hundreds of megawatts and multi-billion-dollar budgets. At that scale, though, these deployments are often multiple systems. 50,000 GPUs seems to be the upper limit for a single system, based on what I can see. Oracle also mentioned increasing openness by customers to other GPUs, presumably AMD's MI300 series, Intel's Gaudi, and others, especially for inference. AI is the gold rush, and NVIDIA continues to lead and is able to fulfill demand. Maybe someday they will have serious competition, but that day has been elusive for many years now. And as has been the case for some months now, for vendors of AI accelerators, it is more a question of capacity.
[00:02:03] How many can you ship, and when? That matters more than how much better your GPU is. This is reminiscent of the early days of CPUs, when manufacturing technology and capacity were big drivers, and back then, Intel was leading on that front. Yes, and there are reports AMD is de-emphasizing the gaming market for its GPUs to focus more on the AI data center server business. Also, a JP Morgan analyst projects AMD's revenue from data center GPUs to grow by $5 billion in the current fiscal year. We've seen reports that shipments of AMD's MI300 series could reach 400,000 units this year, with Microsoft the number one customer and Meta at number two. Those numbers should only increase in upcoming years as TSMC, Samsung, and Intel
[00:02:51] increase their GPU fab capacity. There was an interesting story in the Wall Street Journal last week about Vast Data, the New York City-headquartered storage company that is well known to the HPC community. The Journal article focused on the change AI is driving in tech infrastructure, with its insatiable appetite for high-speed data access. The big point was about traditional multi-tiered storage, in which less-used data is moved to cheaper, slower devices; with AI, you essentially need just one fast tier. Well, in fact, and better yet, you'd want all that data in memory.
[00:03:34] That has been a motivation for high-bandwidth memory and coherent interconnects, a return to shared memory and shared storage, really. Data layout and access patterns for AI applications have not settled yet, so we should expect changes as new algorithms and optimization techniques emerge. Meanwhile, we will see technologies that take a big enough use case and fully optimize for it. That's one reason why we have so many AI chips and data models. So you take a few months of AI training to create a model and then use it for inference every day. That every-day part starts looking like online transaction processing, something enterprise customers understand very well, going back to
[00:04:17] the mainframe, where cost per transaction, response time, number of simultaneous users, compliance, consistency, etc., become major factors. And then you need federated and continual learning to evolve the model. All of that puts new strains on the whole infrastructure, including data placement, persistence, and access, which is not a bad way of thinking about storage. Let's end with a few quick takes. The Jupiter supercomputer, Europe's first exascale system, is being installed at the Jülich Supercomputing Centre in Germany. That's a system by Eviden, which typically installs the usual NVIDIA, AMD, and Intel configurations, but this one has what Jülich calls a universal cluster and a GPU booster. The cluster is based on SiPearl's Rhea1 Arm CPUs, and the booster uses NVIDIA's GH200 chips. By the way, Rhea, the goddess of Earth in Greek
[00:05:15] mythology, is the mother of Zeus, whom the Romans called Jupiter. Ah, there we have it. So staying in Europe, the 2025 ISC Supercomputing Conference, to be held in Germany next June, is now accepting proposals for presentations, sessions, and birds-of-a-feather discussions. And the Summit supercomputer, which became the world's number one system after it was installed at Oak Ridge Lab in 2018, now goes into retirement after an impressive tour of duty. Summit combined IBM Power CPUs with NVIDIA GPUs. And while it wasn't the first HPC system to do this, it significantly furthered that architectural approach. And now it goes out in style, still ranked number nine in the world. Amazing. All right, that's it for this episode. Thank you all for being with us. HPC News Bytes is a production of OrionX in association with Inside HPC. Shaheen Khan and
Starting point is 00:06:15 Doug Black host the show. Every episode is featured on InsideHPC.com and posted on OrionX.net. Thank you for listening.
