@HPC Podcast Archives - OrionX.net - HPC News Bytes – 20250630
Episode Date: June 30, 2025 - GPU-ASIC War - Hyperscalers' CPUs, "GPUs", DPUs, QPUs - Google TPU-7 and OpenAI? - Meta's AI chip tape-out - Microsoft's AI chip delays - Why do engineering projects get delayed? - Chip co-designers break into chip supply chain
Audio: https://orionx.net/wp-content/uploads/2025/06/HPCNB_20250630.mp3
Transcript
Welcome to HPC News Bytes, a weekly show about important news in the world of supercomputing,
AI, and other advanced technologies.
Hi everyone, welcome to HPC News Bytes.
I'm Doug Black of InsideHPC and with me is Shaheen Khan of OrionX.net.
When ChatGPT ignited the AI rocket ship two-plus years ago, it pulled huge investments into AI chips and led to what today can be described as the GPU-ASIC war.
On one side, we have the established GPU giants led by NVIDIA and AMD with Intel mostly in the background, but never to be completely discounted. On the other, we have public cloud and hyperscaler behemoths building their own AI chips, which
the industry is referring to as AI ASICs.
A bit of a curious nomenclature, since GPU stands for Graphics Processing Unit, so by that name they are not general-purpose either, yet the chips that drive the market are data center server chips that do AI and HPC. And ASIC stands for Application-Specific Integrated Circuit, yet what the hyperscalers are building looks more like GPUs than not.
Regardless, building your own GPU or ASIC is not too much of a diversion for hyperscalers. They have been doing that for several years and today they all have significant development across
CPUs, GPUs, DPUs, and many of them also invest heavily in QPUs for quantum computing. We discussed the landscape in episode 100 of the full-format @HPCpodcast with Dr. Ian Cutress.
A very informative and popular episode, so please check it out.
Among the big players, we have AWS with the Graviton CPU, the Inferentia and Trainium AI chips, and the Nitro DPU; Microsoft Azure with the Cobalt CPU, the Maia AI chip, the Boost DPU, and also an integrated HSM, or hardware security module.
Oracle is working on X10M, a CPU focused on database applications.
And IBM is developing AI chips like the Telum processor used with their mainframes, and the NorthPole chip focused on efficient inference. And of course, Meta/Facebook is working on what they call the Meta Training and Inference Accelerator, MTIA. And we should also mention Apple's A and M
series of chips that have an integrated neural processing unit, NPU, which they also refer
to as their neural engine.
Apple has not taken these technologies to the data center.
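For quick reference, here is that lineup collected into a small Python structure (names as stated in the episode; this is just a recap, not an exhaustive or official product list):

```python
# Hyperscaler silicon mentioned in this episode (a recap of the
# discussion above, not an exhaustive or official product list).
HYPERSCALER_SILICON = {
    "AWS":       {"CPU": "Graviton", "AI": ["Inferentia", "Trainium"], "DPU": "Nitro"},
    "Microsoft": {"CPU": "Cobalt", "AI": ["Maia"], "DPU": "Boost", "Security": "integrated HSM"},
    "Oracle":    {"CPU": "X10M (database-focused)"},
    "IBM":       {"AI": ["Telum (mainframes)", "NorthPole (efficient inference)"]},
    "Meta":      {"AI": ["MTIA (Meta Training and Inference Accelerator)"]},
    "Apple":     {"NPU": "Neural Engine in A- and M-series (not in the data center)"},
}

for company, chips in HYPERSCALER_SILICON.items():
    print(f"{company}: {chips}")
```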
The Information reported this week
that Google achieved a chip win with OpenAI.
The report quotes an unnamed source to the effect
that OpenAI is moving toward Google's tensor processing
units, TPUs, to run its AI products.
To date, OpenAI has relied on Microsoft, Oracle,
and other cloud providers to use NVIDIA GPUs
to train and run its AI software.
The reported motive for this move
is OpenAI's desire to lower its operating costs
by using Google.
But before we declare that NVIDIA's grip on the AI compute market is weakening, we should note its stock has risen about 60% over the past two-plus months.
Google introduced their seventh generation TPUs two months ago with heavy emphasis on
scalability to over 9,000 chips in a single pod, better performance for inference, and
they mentioned energy efficiency, but the individual chip's
energy requirement is not explicitly disclosed yet, and it cannot be too low with those specs.
The 8-bit performance of the chip seems to be about half of Blackwell, but memory size
and bandwidth and interconnect are all in the Blackwell ballpark.
They're lower, but not by a whole lot.
That makes for good ratios and helps with scalability, so you could approach the same overall performance by a different route.
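As a rough back-of-the-envelope sketch of why those ratios matter, consider two hypothetical chips. All numbers below are illustrative placeholders, not disclosed specs for the TPU or Blackwell:

```python
# Back-of-the-envelope comparison of two accelerators.
# All numbers are illustrative placeholders, NOT published specs.

def ratios(name, pflops_8bit, mem_gb, bw_tbps):
    """Print the balance ratio (bytes of bandwidth per FLOP) that matters for scaling."""
    bytes_per_flop = (bw_tbps * 1e12) / (pflops_8bit * 1e15)
    print(f"{name}: {pflops_8bit} PFLOPS (8-bit), {mem_gb} GB HBM, "
          f"{bw_tbps} TB/s -> {bytes_per_flop:.4f} bytes/FLOP")
    return bytes_per_flop

# Hypothetical chip A: half the 8-bit compute of chip B,
# but memory size and bandwidth in the same ballpark.
a = ratios("Chip A", pflops_8bit=2.5, mem_gb=192, bw_tbps=7.0)
b = ratios("Chip B", pflops_8bit=5.0, mem_gb=192, bw_tbps=8.0)

# A lower-compute chip with similar bandwidth has a *better*
# bytes-per-FLOP ratio, which helps memory-bound inference, and
# a large enough pod can approach the same aggregate performance.
pod_a = 9216 * 2.5   # chips per pod * PFLOPS per chip (illustrative)
print(f"Pod of 9,216 Chip A: {pod_a / 1000:.0f} EFLOPS (8-bit, illustrative)")
```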
Anyway, the story in The Information
says OpenAI is already using the Google chip.
The TPU is only available as a service on the Google Cloud,
so price performance comparisons become harder.
But the story also says Google is
attempting to sell its TPUs to cloud computing
infrastructure providers. It's not clear if that is just access to Google's capacity or an actual
sale of TPU-enabled systems to emerging cloud providers, because it's unlikely that it would be just a chip. As you said, Doug, what's happening can be attributed to the growing competition in the space, but also to pricing and availability and, of course, a desire by users to have multiple options, especially if they're not building their own chip, which OpenAI very well might.
Keeping with this theme, Meta has been developing and testing an in-house AI chip for model
training, according to a report earlier this year, and the
company has begun a limited rollout of the chip. As with Google, it's reported that
Meta seeks to lower infrastructure costs by reducing its reliance on NVIDIA. Meta is collaborating
with TSMC on production of the chip. The test phase follows Meta's first tape-out of the
chip, a crucial milestone in silicon development,
where an initial design is sent to a chip factory.
Yeah, high stakes everywhere. Meta/Facebook has projected expenses between $114 billion and $119 billion for 2025, with some $65 billion of that dedicated to AI infrastructure. As these projects become more complex,
we are also seeing specialists becoming
necessary in a co-design capacity
and breaking into an already complex supply chain.
So maybe that will become standard practice.
Case in point is news that the Taiwanese chip design specialist Global Unichip Corp, GUC, has been tapped by Meta and Microsoft for work on their ASIC chips, and especially packaging.
GUC has reported success with TSMC's
three nanometer and two nanometer technologies.
And there was news this week in The Information that Microsoft's next-generation AI chip, codenamed Braga, is delayed by at least six months.
Azure Maia, spelled M-A-I-A, is the name of the family of AI chips from Microsoft. Maia 100 was launched in late 2023, and Braga, expected to launch as Maia 200, was slated for 2025 but now will be a 2026 product.
These projects are massive and complex,
so it's not too hard to fall behind.
And you're in good company if it happens.
The chip world is especially unforgiving,
but the big players all have the wherewithal
to play the long game.
Now, the first time I had a product engineering group report to me, I started out, in order to avoid delays, by doing a lot of analysis on why engineering projects get delayed.
The top line short answer was management.
That was always the ultimate reason.
But the next level of reasons usually traces to product strategy and the competitive environment.
You'd have to ask and answer questions like, how much should you push the envelope on technical
complexities and on how many fronts?
Do you have or can you build all the know-how?
Do you have all the patents, integration, et cetera?
Can you find the right staff with the needed skills?
Can you keep the right staff and the team sufficiently intact?
Or will you suffer turnover?
Is your supply chain able and ready to deliver?
Does the product specification stay fixed,
or do you keep changing it, or worse, expanding it?
Are any regulatory requirements anticipated and handled?
Is your schedule realistic or overly ambitious?
Might your competition make a sudden advance or execute
better, making your product not as interesting
as it would have been?
So those are a lot of questions; they're all hard, and any of them can lead to delays.
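As a purely illustrative way to make that checklist concrete (a hypothetical sketch, not a methodology from the show), you could track the questions as a simple risk score:

```python
# A minimal, illustrative delay-risk checklist built from the questions
# above. The scoring scheme is hypothetical, not from the episode.

DELAY_RISK_QUESTIONS = [
    "Are we pushing the technical envelope on too many fronts?",
    "Do we have, or can we build, all the know-how (patents, integration)?",
    "Can we hire and retain staff with the needed skills?",
    "Is the supply chain able and ready to deliver?",
    "Is the product specification fixed, or still changing/expanding?",
    "Are regulatory requirements anticipated and handled?",
    "Is the schedule realistic rather than overly ambitious?",
    "Could a competitor's sudden advance make the product less interesting?",
]

def delay_risk_score(answers):
    """answers: one value per question, 0 (low risk) to 2 (high risk)."""
    assert len(answers) == len(DELAY_RISK_QUESTIONS)
    total = sum(answers)
    _, hotspot = max(zip(answers, DELAY_RISK_QUESTIONS))  # highest-risk item
    return total, hotspot

score, hotspot = delay_risk_score([1, 0, 2, 1, 2, 0, 1, 1])
print(f"Total risk {score}/{2 * len(DELAY_RISK_QUESTIONS)}; biggest hotspot: {hotspot}")
```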
There's a framework called Porter's Five Forces that helps with some of this analysis: rivalry among existing competitors, the bargaining power of suppliers and of buyers, and the threats of substitutes and new entrants. It was introduced in 1979 by Michael Porter, the famous Harvard professor, and later expanded in his book Competitive Strategy. So go check it out if you're interested in this topic.
All right.
That's it for this episode.
Thank you all for being with us.
HPC News Bytes is a production of OrionX
in association with Inside HPC.
Shaheen Khan and Doug Black host the show.
Every episode is featured on insidehpc.com
and posted on orionx.net.
Thank you for listening.