@HPC Podcast Archives - OrionX.net - HPC News Bytes – 20250630

Episode Date: June 30, 2025

- GPU-ASIC War
- Hyperscalers’ CPUs, "GPUs", DPUs, QPUs
- Google TPU-7 and OpenAI?
- Meta’s AI chip tape-out
- Microsoft’s AI chip delays
- Why do engineering projects get delayed?
- Chip co-designers break into the chip supply chain

Audio: https://orionx.net/wp-content/uploads/2025/06/HPCNB_20250630.mp3

Transcript
[00:00:00] Welcome to HPC News Bytes, a weekly show about important news in the world of supercomputing, AI, and other advanced technologies. Hi everyone, welcome to HPC News Bytes. I'm Doug Black of InsideHPC, and with me is Shaheen Khan of OrionX.net. When ChatGPT ignited the AI rocket ship two-plus years ago, it pulled huge investments into AI chips and led to what today can be described as the GPU-ASIC war. On one side, we have the established GPU giants led by NVIDIA and AMD, with Intel mostly in the background but never to be completely discounted. On the other, we have public cloud and hyperscaler behemoths building their own AI chips, which the industry is referring to as AI ASICs. A bit of a curious nomenclature, since GPU stands for Graphics Processing Unit, so they
[00:01:00] are not too general purpose, and yet the chips that drive the market are data center server chips that do AI and HPC. And ASIC stands for Application-Specific Integrated Circuit, and what the hyperscalers are building looks more like GPUs than not. Regardless, building your own GPU or ASIC is not too much of a diversion for hyperscalers. They have been doing that for several years, and today they all have significant development across CPUs, GPUs, and DPUs, and many of them also invest heavily in QPUs for quantum computing. We discussed the landscape in episode 100 of the full-format @HPCpodcast with Dr. Ian Cutress. A very informative and popular episode, so please check it out. Among the big players, we have AWS with the Graviton CPU, Inferentia and Trainium AI chips,
[00:01:56] and the Nitro DPU; Microsoft Azure with the Cobalt CPU, the Maia AI chip, the Boost DPU, and also an integrated HSM, or hardware security module. Oracle is working on the X10M, a CPU focused on database applications. And IBM is developing AI chips like the Telum processor used with their mainframes and the NorthPole chip focused on efficient inference. And of course, Meta/Facebook is working on what they call the Meta Training and Inference Accelerator, MTIA. And we should also mention Apple's A and M series of chips that have an integrated neural processing unit, NPU, which they also refer
[00:02:43] to as their Neural Engine. Apple has not taken these technologies to the data center. The Information reported this week that Google achieved a chip win with OpenAI. The report quotes an unnamed source to the effect that OpenAI is moving toward Google's tensor processing units, TPUs, to run its AI products. To date, OpenAI has relied on Microsoft, Oracle,
[00:03:07] and other cloud providers to use NVIDIA GPUs to train and run its AI software. The reported motive for this move is OpenAI's desire to lower its operating costs by using Google. But before we conclude that NVIDIA's grip on the AI compute market is weakening, we should note that its stock has risen about 60% over the past two-plus months.
[00:03:30] Google introduced their seventh-generation TPUs two months ago with heavy emphasis on scalability to over 9,000 chips in a single pod and better performance for inference, and they mentioned energy efficiency, but the individual chip's energy requirement is not explicitly disclosed yet, and it cannot be too low with those specs. The 8-bit performance of the chip seems to be about half of Blackwell, but memory size, bandwidth, and interconnect are all in the Blackwell ballpark. They're lower, but not by a whole lot. That makes for good ratios and helps with scalability,
[00:04:05] so you could approach the same overall performance with a different approach.
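To make the "good ratios" point concrete, here is a minimal back-of-envelope sketch in Python of the compute-to-bandwidth comparison being described; the spec numbers in it are illustrative assumptions for discussion, not official figures from Google or NVIDIA.

```python
# Back-of-envelope comparison of compute-to-memory-bandwidth ratios.
# All numbers below are illustrative assumptions, not official vendor specs.

chips = {
    # name: (FP8 TFLOPS, HBM capacity in GB, HBM bandwidth in TB/s)
    "TPU v7 (assumed)":    (4500, 192, 7.4),
    "Blackwell (assumed)": (9000, 192, 8.0),
}

for name, (tflops, hbm_gb, bw_tbs) in chips.items():
    # FLOPs of compute available per byte of memory traffic: the arithmetic
    # intensity a workload must reach before the chip becomes compute-bound.
    flops_per_byte = (tflops * 1e12) / (bw_tbs * 1e12)
    print(f"{name}: {tflops} TFLOPS FP8, {hbm_gb} GB HBM, {bw_tbs} TB/s "
          f"-> ~{flops_per_byte:.0f} FLOPs per byte")

# With roughly half the peak FLOPS but similar memory bandwidth and capacity,
# the lower-FLOPS chip is easier to keep fed on memory-bound inference work,
# and scaling a pod to more chips can make up the difference in peak compute.
```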
Anyway, the story in The Information says OpenAI is already using the Google chip. The TPU is only available as a service on the Google Cloud, so price-performance comparisons become harder. But the story also says Google is attempting to sell its TPUs to cloud computing
[00:04:26] infrastructure providers. It's not clear if that is just access to Google's capacity or an actual sale of TPU-enabled systems to emerging cloud providers, because it's unlikely that it would be just a chip. As you said, Doug, what's happening can be attributed to the growing competition in the space, but also to pricing and availability and, of course, a desire by users to have multiple options, especially if they're not building their own chip, which OpenAI very well might. Keeping with this theme, Meta has been developing and testing an in-house AI chip for model training, according to a report earlier this year, and the company has begun a limited rollout of the chip. As with Google, it's reported that Meta seeks to lower infrastructure costs by reducing its reliance on NVIDIA. Meta is collaborating
[00:05:17] with TSMC on production of the chip. The test phase follows Meta's first tape-out of the chip, a crucial milestone in silicon development, where an initial design is sent to a chip factory. Yeah, high stakes everywhere. Meta/Facebook has projected expenses between $114 billion and $119 billion for 2025, with some $65 billion of that dedicated to AI infrastructure. As these projects become more complex, we are also seeing specialists becoming necessary in a co-design capacity and breaking into an already complex supply chain.
[00:05:55] So maybe that will become standard practice. Case in point is news that the Taiwanese chip design specialist Global Unichip Corp, GUC, has been tapped by Meta and Microsoft for work on their ASIC chips, and especially packaging. GUC has reported success with TSMC's three-nanometer and two-nanometer technologies. And there was news this week in the publication
[00:06:19] The Information that Microsoft's next-generation AI chip, codenamed Braga, is delayed by at least six months. Azure Maia, spelled M-A-I-A, is the name of the family of AI chips from Microsoft. Maia 100 was launched in late 2023, and Braga, expected to launch as Maia 200, was slated for 2025 but now will be a 2026 product.
[00:06:48] These projects are massive and complex, so it's not too hard to fall behind. And you're in good company if it happens. The chip world is especially unforgiving, but the big players all have the wherewithal to play the long game. Now, the first time I had a product engineering group report to me, in order to avoid delays I started out by doing a lot of analysis on why engineering projects get delayed. The top-line short answer was management.
[00:07:13] That was always the ultimate reason. But the next level of reasons usually traces to product strategy and the competitive environment. You'd have to ask and answer questions like: How much should you push the envelope on technical complexities, and on how many fronts? Do you have, or can you build, all the know-how? Do you have all the patents, integration, et cetera? Can you find the right staff with the needed skills? Can you keep the right staff and the team sufficiently intact?
[00:07:39] Or will you suffer turnover? Is your supply chain able and ready to deliver? Does the product specification stay fixed, or do you keep changing it, or worse, expanding it? Are any regulatory requirements anticipated and handled? Is your schedule realistic or overly ambitious? Might your competition make a sudden advance or execute better, making your product not as interesting
[00:08:02] as it would have been? So these are a lot of questions, they're very hard, and they can lead to delays. There's a framework called Porter's Five Forces that helps with some of this analysis. It was introduced in 1979 by Michael Porter, a famous Harvard professor, in his book. So go check it out if you're interested in this topic. All right, that's it for this episode. Thank you all for being with us.
[00:08:26] HPC News Bytes is a production of OrionX in association with InsideHPC. Shaheen Khan and Doug Black host the show. Every episode is featured on insidehpc.com and posted on orionx.net. Thank you for listening.
