@HPC Podcast Archives - OrionX.net - HPC News Bytes – 20250616
Episode Date: June 16, 2025 - AMD MI350X and MI355X new GPUs - AMD ROCm 7.0 software - AMD Helios rackscale system - Fujitsu Monaka chip - SIGHPC Travel Grants for SC25 - HPCGuru signs off [audio mp3="https://orionx.net/wp-con...tent/uploads/2025/06/HPCNB_20250616.mp3"][/audio] The post HPC News Bytes – 20250616 appeared first on OrionX.net.
Transcript
Welcome to HPC News Bytes, a weekly show about important news in the world of supercomputing,
AI, and other advanced technologies.
Welcome to HPC News Bytes.
This is Shaheen Khan, Doug is away.
The GPU race continues to tighten as AMD upped the ante with three important moves: it launched new GPUs, new software, and an actual server, all while touting openness for AI.
Chip vendors introduce new chips several times these days, drip-feeding more information as they do. So if you're not tracking it closely, it's hard to tell exactly where a product is in its lifecycle. Is it just a concept, is it shipping, or is it somewhere in the middle?
AMD formally launched its MI350X and MI355X GPUs last week, saying deliveries have begun, which means customers can expect them later this year. It projected big architectural performance gains compared to the previous-generation MI300X: four times for AI compute, meaning AI training, and 35 times for inferencing.
When performance gains are that high,
it usually means low-hanging watermelons, as they say.
It means they finally got to optimizing big, obvious things.
But it is good to see anyway,
and AMD said it can now beat NVIDIA in, quote, like-for-like inference benchmarks by up to 1.3x, and up to 1.13x in select training workloads,
end quote. As chips become more and more of a complete system, we see multiple layers and
various chiplets, like a city block with multi-story buildings on it, where each story could be a different fabrication technology.
So the two GPUs use TSMC's 6-nanometer process for a base layer, on top of which they have a 3-nanometer die for the accelerator itself. They both have 288 gigabytes of high-bandwidth memory, HBM3E, with 8 terabytes per second of memory bandwidth.
The MI355X is a bit faster, providing 78.6 teraflops of FP64 performance, 5 petaflops in FP16, 10.1 petaflops in FP8, and 20.1 petaflops in either FP6 or FP4. This is about 2x better than NVIDIA Blackwell in 6-bit and 64-bit performance, but otherwise pretty similar.
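As a back-of-the-envelope illustration (not part of the episode), the quoted MI355X peak figures can be tabulated to see how peak throughput scales as numeric precision drops; the dictionary below simply restates the numbers mentioned above, and the ratios are derived from them:

```python
# Vendor-quoted peak throughput for AMD's MI355X, as cited in this episode.
# These are peak numbers; sustained application performance will be lower.
peak_tflops = {
    "FP64":    78.6,      # 78.6 teraflops
    "FP16":    5000.0,    # 5 petaflops
    "FP8":     10100.0,   # 10.1 petaflops
    "FP6/FP4": 20100.0,   # 20.1 petaflops
}

# FP16 -> FP8 and FP8 -> FP6/FP4 each roughly double peak throughput,
# while FP64 to FP16 is about a 64x jump.
for fmt, tflops in peak_tflops.items():
    ratio = tflops / peak_tflops["FP64"]
    print(f"{fmt:8s} {tflops:>9.1f} TFLOPS  (~{ratio:.0f}x FP64)")
```

The spread between FP64 and the low-precision formats is why vendors quote AI numbers in 4-, 6-, and 8-bit formats: the same silicon delivers hundreds of times more peak operations per second at reduced precision.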
AMD also rattled off a who's-who list of prominent AI players joining its party, including OpenAI, xAI, Oracle, Microsoft, Meta, and several others.
The second thing they launched was a new rev of their developer software, ROCm 7.0, which delivers a 4x inference and 3x training performance improvement over ROCm 6.0. ROCm has been lagging behind NVIDIA's CUDA but has been closing the gap.
And the third item is another step in emulating NVIDIA as they try to catch up, and that is a preview, not an actual launch, of their Helios AI rack infrastructure. This will be based on a future GPU called MI400, together with their latest CPUs and DPUs, to arrive in 2026 and provide a 72-way scale-up system like NVIDIA's NVL72 DGX system.
Now, as we've covered here, the idea of a chip company building a rack-scale system goes back about 15 years, when Intel started a project to do just that. But that vision wasn't realized until NVIDIA introduced its DGX system in 2016.
AMD also talked about MI350 systems that have eight-way scale-up capability and can be networked
to produce much larger clusters.
So AMD becoming a rack-scale systems company is the qualitatively new announcement, though it was sure to arrive after AMD's acquisition of ZT Systems for its rack-scale engineering capabilities, and was expected even before that.
The market is growing fast for all accelerators, but NVIDIA continues its commanding lead and
ability to ship in large volumes.
It's been NVIDIA versus everybody else,
what I call NVIDIA versus OnVIDIA.
So it makes sense that AMD is using this announcement
as an opportunity to tout openness,
even as it emulated NVIDIA
by using all of its own technologies to build a system.
Staying on chips, another notable development is Fujitsu's next-generation Arm-based chip, a follow-on to the A64FX used in the Fugaku supercomputer. Fujitsu has talked about this for a couple of years now, targeting 2027 for introduction. It is called Monaka, M-O-N-A-K-A, presumably named after the Japanese sweet made of often elaborately designed wafers with sweet filling in between.
This one is a 144-core Armv9 CPU with extensions and uses TSMC's 2-nanometer technology on top of a 5-nanometer SRAM and I/O base layer, packaged with Broadcom's 3.5D system-in-package platform. It includes confidential computing and a hardware root of trust. It uses PCIe 6, CXL 3, and DDR5.
So no high-bandwidth memory, and also no GPUs, while projecting GPU-less LLM performance and performance per watt that could be three times better than contemporary CPUs.
But Fujitsu and AMD formed an agreement in 2024 to cooperate on AI and HPC infrastructure, which includes the Fujitsu Monaka chip, AMD Instinct GPUs, and ROCm software.
But arguably the most important part of the Fujitsu design is their focus on energy efficiency
and a design center that would allow fully air-cooled systems, though it's been mentioned
that liquid cooling might be desirable and available to allow higher density systems.
The ISC conference was by all counts another excellent and well-attended show, so now it's
time to look forward to SC25 in St. Louis.
SIGHPC is offering travel grants to undergraduates, graduate students, and early-career professionals to attend SC25. The deadline to apply is September 5th, and you will hear back by September 26th.
Accepted applicants will receive reimbursement of travel expenses up to $800 for travel from North America, or $1,600 for travel from other continents.
The grant also includes conference registration and the assignment to a mentor throughout
the conference if desired.
Please look for it on sighpc.org.
That's S-I-G-H-P-C dot org. We end with a hat tip to our community's mysterious,
capable, always fair, and beloved personality. Yes, I'm talking about HPC Guru. Last week,
after more than 16 years of keeping the HPC community vibrant and informed, a couple of
sabbaticals, and an occasional will-he-won't-he, HPC Guru, our community's anchor, steward, and leader, announced that he was signing off.
His contributions were huge and an incredible amount of work. That he kept it all interesting while successfully keeping his identity secret was no small feat, especially in this community, where people would go to great lengths to uncover his identity, including analyzing the photos he'd post to figure out what angle they were taken from and who was there when they were taken.
I also remember fondly those "I am HPC Guru" pinback buttons. If you have any, let's wear them at SC25.
Here's his last message, quote, all things with one exception have to end.
After years of sharing HPC news and insights,
it's time to log off.
It was fun while it lasted,
and I remain grateful to all of you
who followed and interacted.
Stay curious, stay kind.
@HPCGuru, signing off.
Hashtag HPC, hashtag farewell.
Hear, hear. Farewell from all of us.
Thank you, HPC guru, Godspeed, oh, and CRM.
All right, that's it for this episode.
Thank you all for being with us.
HPC News Bytes is a production of OrionX
in association with Inside HPC.
Shaheen Khan and Doug Black host the show.
Every episode is featured on InsideHPC.com
and posted on OrionX.net. Thank you for listening.