@HPC Podcast Archives - OrionX.net - @HPCpodcast-98: Hyperion Research on HPC, AI, Quantum – In Depth
Episode Date: February 27, 2025
We are delighted to have as special guests today three of the top analysts in the HPC, AI, Cloud, and Quantum fields, representing the industry analyst firm Hyperion Research. Earl Joseph is Hyperion CEO, Mark Nossokoff, Research Director, and Bob Sorensen, Senior VP of Research. Join us for an In Depth discussion of the current state and future trends in HPC, AI, Quantum, Cloud Computing, Exascale, Storage, Interconnects and Optical I/O, and Liquid Cooling. [audio mp3="https://orionx.net/wp-content/uploads/2025/02/098@HPCpodcast_ID_Hyperion-Research-HPC-AI-Quantum-Market_20250227.mp3"][/audio] The post @HPCpodcast-98: Hyperion Research on HPC, AI, Quantum – In Depth appeared first on OrionX.net.
Transcript
If you look at all the spending to do HPC and AI related work, the overall spend rate
was around $52 billion in 2024, and we're expecting by 2028 that to reach around $85
billion.
The line was drawn at number eight.
So the top eight machines are as powerful as the bottom 492 systems.
A greater comfort level and a willingness to embrace this notion of a
continuum of computing, not either on-prem or cloud.
3.5 companies in essence can really skew the amount of money that's being committed towards HPC in its broadest definition, simply because they have the money and the agenda to go out and spend.
From OrionX in association with InsideHPC, this is the At HPC podcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications,
markets and policies that shape them.
Thank you for being with us.
I'm Doug Black at Inside HPC,
along with Shaheen Khan of OrionX.
Guests today are from HPC industry analyst firm,
Hyperion Research.
We have Earl Joseph, the CEO.
He has oversight for Hyperion's overall research
and consulting efforts,
along with the firm's HPC user forum,
which conducts conferences in the
US and around the world throughout the year. We have Mark Nossokoff. He is Research Director, Lead Analyst for Storage and Interconnects and Cloud. And Bob Sorensen, Senior VP of Research. Among other areas, Bob focuses on exascale supercomputing, along with quantum and other future technologies.
So today we're going to talk about future trends in HPC and AI, how the melding of the two
has impacted Hyperion's research strategy, and to look ahead to the trends we can expect
in 2025 and the rest of the decade.
So Earl, let's start with you
from a big picture perspective,
share your views on the relative health
and growth of the industry.
And particularly now that Hyperion has grouped AI
under the HPC umbrella or within it,
we're looking at some very healthy CAGR numbers
for the rest of the decade, isn't that right?
Thanks, Doug.
And we appreciate the invitation to join you and Shaheen today.
Over the last three or four years,
we've been tracking the AI portion of the HPC market
in much more depth,
and there have been a lot of new sellers in the market,
Nvidia, Supermicro, and a lot of AI specialty houses
selling equipment, which are really non-traditional vendors,
and we have included those into our market tracking numbers.
The explosion in AI, which started just a little bit more than two years ago, has resulted in the market size, the way we track it, being 36.7% larger. More than a third larger because of all this growth and increased use of AI.
But in addition to that, there's this whole recognition of high performance computing, computing at scale, and AI capabilities. It's lifting both the AI portion of the market plus the traditional HPC portion, because government agencies and companies are all recognizing that the use of big data, big HPC, and scaling can do a lot of good things.
Now that's also changed the growth trajectory
that you were mentioning.
If we look at the overall market right now,
we're expecting to see around 15% annual growth
over the next five years.
That used to be 7% to 8%.
So the market is more than a third larger
and growing at roughly double the growth rate
that we were seeing before.
Another part of the market that's growing tremendously is cloud computing. And the way we track that is what HPC end users, researchers, engineers, analysts, and others spend in the cloud itself. And that we saw grow by 20% in 2024. So just phenomenal growth there, almost large enough to call it a tipping point year. And then for last year, 2024, for the on-prem AI servers, we saw those grow by 40%. So just some phenomenal growth rates in the market. And to put it in the big picture perspective, if you look at all the spending to do HPC and AI related work, so that would be the on-prem servers, software, storage, maintenance, plus cloud spending, the overall spend rate was around $52 billion in 2024.
And we're expecting by 2028 that to reach around $85 billion.
So just some tremendous growth taking place.
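As a rough sanity check on those figures, here is a minimal Python sketch of the implied compound annual growth rate, using the approximate dollar amounts cited above:

```python
# Implied CAGR from the figures cited above: roughly $52B of total HPC+AI
# spending in 2024 (on-prem servers, software, storage, maintenance, plus
# cloud) growing to roughly $85B by 2028. Values are approximations.
spend_2024 = 52e9
spend_2028 = 85e9
years = 2028 - 2024

cagr = (spend_2028 / spend_2024) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # about 13%, in the ballpark of the ~15%
                                    # annual growth projected in the episode
```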
Earl and Mark and Bob, it's such a treat to have all of you on this call.
And thank you for making that happen.
And thank you, Earl, for that sort of setup for the market. What do you think is causing the growth in cloud? I have several questions,
of course, but just to start, is the cloud adoption, you feel it's related to just the
inability of on-prem data centers to handle these emerging CPUs and GPUs from power and cooling
standpoint, or is it really something else?
Perhaps we can all three share some ideas. I'll start with a couple of them.
Cloud computing has become much more useful for HPC, meaning the cloud providers have created better software, better environments and hardware approaches, and their prices have gotten a bit more competitive. But some of the gigantic drivers are around new technology. Nvidia and AMD have announced that they're going to have new processors every 12 months, so if you want access to the latest technology, the cloud is one really good avenue. In addition, there's the supply chain issue: if you wanted to get Nvidia Blackwells, let's say a couple hundred of them, right now the waiting list is more than 12 months.
So the difference between buying on-prem, which means you buy something now and you
have to keep it four or five years, you're going to be multiple years behind in
technology, really is driving a lot of people to use the clouds for more parts
of the workload.
And I think Mark and Bob can add to that.
Yeah.
Hi, this is Mark. I'm happy to be here, and thanks for including us in these conversations. To add to what Earl said, I think a lot of it is just comfort level on the part of the users and researchers that are utilizing the cloud. They've gone through the various stages, possibly of resistance and acceptance, and now are in evaluation and more understanding of what makes sense to run in the cloud and on premises.
So just a greater comfort level and a willingness to embrace this notion of a continuum of computing, not either on-prem or cloud, but how can I best utilize and embrace the cloud across all my workflows and optimize my resources. Something that's been coming about as well is this notion of what are sometimes referred to as neoclouds, or the really focused attention on AI as a service or GPU as a service, a more focused sort of provisioning that differs from the broader cloud providers like Google, Amazon, or Microsoft Azure that provide a wide range of services across the full spectrum. But these neoclouds are focusing on just providing the AI as a service or GPU as a service kinds of activities.
And maybe one last thing I'll comment on before turning it over to Bob: there is a sustainability element as well, as organizations are still tasked with and have goals for energy efficiency and sustainability. There's a migration to the cloud that can be more efficient with resources and remove the carbon footprint and power utilization from on premises, so what they would be on the hook for measuring and reporting then gets deferred to how the cloud providers are tackling the issue.
I guess I'll go now. For those who are listening at home, trying to decide which voice is which, I'm Bob Sorensen, and I'll be the one who'll be speaking too fast. Hopefully that'll help you out. I have to kick the discussion along because I think we need
to move away from on-prem versus cloud to on-prem and cloud. I think it's really time for organizations
to understand the fact that if you could sum up the opportunities here, if you have capacity,
if you have the need for a specific workload
that you know you're going to need
for the next three to four years,
you have your architectural requirements
and your budgetary requirements in sync,
then maybe on-prem is the way to go.
If you're looking for resource access,
some of the things we all talked about,
emerging technologies, Nvidia coming out,
or trying to come out with at least a new product
at a higher cadence.
New technologies, what's going on in some AI space.
A lot of the cloud service providers are looking
at offering their own AI accelerators.
The easiest way to explore that opportunity
is through the cloud.
And so I think that really thinking about how one can address their computational needs with a much more sophisticated mix of on-prem and cloud is something that I hope we see in the near future.
As organizations say, we have a diverse set of needs.
Some of it's predictable, some of it's reliable,
and some of it we'd like to have in-house
so we can kick the tires on the machine we own.
But at the same time, we want the flexibility and perhaps the low barrier to entry to explore new technology, and the flexibility from a budgetary perspective to basically bring new technology, new workloads to bear, or maybe even doing some workload balancing. So what I'd like to see in the near future is a lessening of this us versus them, versus everybody kind of holding hands and singing Kumbaya going forward. Bob, it sounds like it's a much more complex managerial task
to oversee how you put together this hybrid mix
on-prem, off-prem.
But let me ask a very concrete question.
Earl, you mentioned a 12 month waiting period
to get your hands on Blackwell chips.
If you went to the cloud,
is that immediate, are Blackwells immediately available? Immediately, as long as you have the pocketbook for it, but yes, you
can start your job almost immediately if you go with the cloud. And when I mentioned the 12 months, if you actually wanted to buy quite a few of them and put a larger on-prem server in place, the 12 months is how long it takes to get the chips from Nvidia right now. So whoever builds the system would probably need another four months. So you might be looking at 16 months. And a good example of why that's such a big problem is if your boss just told you, hey, figure out a way to use AI and do this great stuff on Blackwell, you can't go back to your boss and say, give me 16 months to get a machine on board, then I'll start working the answer. Whereas in the cloud, it's give me 16 minutes and I'm going to be up and running. Yeah, non-starter. The differential
there is massive. Yeah. Now, at the risk of complicating this, Bob, you mentioned that if
your workload is predictable, then maybe you want to be on-prem. But it's also true that if your
workload is predictable, you can get better prices on the cloud. Maybe part of what we're looking at is that
this stuff is just going to cost one way or another,
and it's not for the faint of heart
or somebody who has a limitation with budgets.
That's also true, isn't it?
I would argue that the idea here is that when you procure an HPC system, you're really thinking about a three, four, five, or six year budgetary commitment. And that is actually getting longer.
I'm not terribly aware of cloud service provider agreements
that really extend out to those kinds of things.
So there is some variability going on here
in terms of what can I commit to on an on-prem procurement,
which for certainly a lot of government organizations
and a large subset of commercial organizations
are pretty much in sync.
So the bottom line here is that some organizations
are already geared towards
these long-term budgetary commitments.
And the cloud, A, doesn't offer that kind of
long-term commitments,
certainly for even their largest organizations.
And in my mind, Doug talked a little bit about
the complexities of this on-prem cloud happy wedding here.
And the concern here is how one deals with a budget
where a significant percentage of your budget
may be in this kind of three, four,
five year procurement cycle.
And the other one may be a month by month,
annual by annual, maybe two or three year kind of commitment
on the cloud and how you do resource management
and provisioning in those two very different environments.
The bottom line here is there are some compelling issues to be attacked and addressed if you're thinking about having a much more sophisticated on-prem plus cloud environment.
And that of course doesn't even address the use of software.
But here we're talking about budget and those things are going to have to be worked out
because it's one thing for long-term commitment, it's another thing for a short-term pay-as-you-go model and how you reconcile those two things,
especially if you're working in a government organization or a large commercial organization
that wants predictability and budgetary commitments, those things are going to have
to be ironed out going forward. Yeah. Could we, for a moment, talk about leadership class supercomputing? Right now, the state of the art, of course, is exascale. But it's very interesting, there's more talk about maybe the most powerful computing could come out of the cloud providers. Certainly the hyperscalers have money and budgets for technology R&D that dwarf, say, the Department of Energy's. What are your thoughts about this mix? We have three exascale US systems, how many in China we don't know, but how will that work over against or with the HPC capabilities coming out of big tech companies?
Yeah, first off, Doug, my issue here is the concentration on the highest end of computing
nowadays, the so-called exascale systems.
And the Top500 list, when it comes out every six months or so, one of the things they release is this wonderful metric that talks about drawing an imaginary line. Above the line is 50% of the sum total of the computing power, and below the line is the other 50%. So it's where you draw the line in terms of the average system, if you will. In the old days, the line would be drawn at about 100 machines. So the 100 machines on top were equal to the 400 machines on the bottom.
That number has gone down significantly
over the last few decades.
And the last one, I think the line was drawn at number eight.
So the top eight machines are as powerful
as the bottom 492 systems.
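A minimal sketch of how that Top500 "line" metric could be computed, with made-up performance numbers rather than actual list data:

```python
# Find the smallest k such that the top k systems account for at least half
# of the summed performance of the whole list (the "line" Bob describes).
def find_line(perf_list):
    """Return k where the top k systems hold >= 50% of total performance."""
    ranked = sorted(perf_list, reverse=True)
    total = sum(ranked)
    running = 0.0
    for k, perf in enumerate(ranked, start=1):
        running += perf
        if running >= total / 2:
            return k
    return len(ranked)

# Hypothetical, highly skewed distribution: a handful of huge systems and
# many small ones. Numbers are illustrative, not actual TOP500 data.
systems = [1000, 800, 600, 500, 400, 300, 250, 200] + [5] * 492
print(find_line(systems))  # the line lands near the very top of the list
```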
And so what we're seeing in some sense
is an architectural, a power consumption,
a cost capability, and a complexity in programming
at those exascale class systems that in some way,
in my mind, aren't terribly representative
of where HPC is going writ large.
Certainly not a lot of organizations are capable of spending $600 million plus on a system.
They're not capable of funding 50 megawatt systems that may ultimately cost you in some
sense equal to what the machine costs.
You may be paying power and cooling costs that are equal to the acquisition cost of
your HPC during its lifetime.
So to me, I think that the exascale system mentality has to start to move back into sync with where the rest of the sector is going.
Lower power, lower cost, and really what that boils down to from a practical perspective,
and we're starting to see this coming out of the Department of Energy, is the idea of
partitioned heterogeneous architectures, different subclasses of partitions that address the
workloads that matter to the particular organization.
And this way you're not buying one machine every seven years that costs 500 million dollars plus. You're buying a series, in some sense, of smaller partitions that can roll out in time to address your workloads much more responsively, because you're not committing to a single architecture for five, six, seven, eight years. You're committing to an architecture that can evolve over time to address new workloads,
address new technology options, and unfortunately address the vagaries of the budgetary process.
And that's where I think we're going to go here in terms of Exascale's class system.
And the one thing I'm going to mention, and I'm so glad you brought this up Doug,
we just started putting out these things we're calling top of mind surveys.
It's five quick questions that we're blasting out to the entire HPC world.
And I think question number three on that was, when do you expect to see a hyperscaler
HPC occupy the top one slot on the top 500 list?
And I think we had next year, two years, three years,
four years, five years and never. But we're, as I said, we just launched that. And I really can't
wait to see the results of that in terms of what the overall community thinks. The answer to that
exact question. If so, when is that going to happen? Because I think that the stars will align and certainly some organization is going to be willing to commit the cycles to run that Top500 list Linpack benchmark and at least for the sake of publicity have a number one slot going forward. It may not be there for long, but it may be just long enough to run Linpack, but I think we're going to see it sooner rather than later.
And Bob, I think those are great points and it's such a great question. My view of the world right now is there's really three categories of, if I call them supercomputers, large scale machines. One of them is the hyperscalers. I'll talk about why they're different. The other one is the exascale systems, the real giant machines. Bob mentioned the Top500 list shows those, too. Maybe there's eight or nine of those in the world right now. And then the third group is everyone else that's doing HPC and AI. And they're all going off into three different directions right now.
But Doug, your question about the CSPs, the hyperscalers,
and the social media companies that are building these giant systems,
they're not being built to do broad-scale R&D or leadership R&D.
So I think there's still tremendous room for national labs around the world
to really do leadership class R&D. They're much more focused on what I would call a narrower set of problems,
whether that's large language models only or three or four types of AI, but
then the bulk of their cycles are sold out to clients and they're servicing
clients on a lower scale. So with an exception of a few of them, they're not
fully being used for leadership class workloads.
And Earl, let me just add quickly that we could say 3.5 systems, because to me, we can't ignore the fact of these basically self-built dark systems, the idea of what's going on at, say, xAI with Colossus and some of these other organizations that are assembling their own systems for their own uses,
which are very end-use specific, primarily right now targeted towards things like AI, large language model, or generative AI training. And those organizations
have significantly deep pockets. So in some sense, one or two very large installations
by some of these 3.5 companies, in essence, can really skew the amount of money that's being
committed towards HPC in its broadest definition,
simply because they have the money and the agenda to go out and spend $3 billion to buy an awful
lot of Nvidia GPUs to do in-house training for in-house end uses. So there again, the sector is
fractionating once again, creating more opportunity, creating more confusion, and just opening the
potential applicability of whatever
particular flavor of HPC you're interested in.
I have a question, maybe this is for you, Mark, and that's data gravity.
We talked about the hybrid model and is the hybrid model stable or is it a slippery slope
towards one end or the other or how do you keep that hybrid model to meet the ideals
that it was designed to optimize?
I think there's a couple of things there with data gravity. Maybe the first and foremost thing is trying to minimize any data movement that has to occur as you're leveraging both on-premises and the cloud in hybrid architectures; the ability to maybe move the compute closer to where the data is, rather than moving the data around, is going to be critical to doing some of this. Another consideration is the notion of sovereignty: where the data is created, where the data can actually go and needs to reside, and who can access the data, given some of the growing attention being given to sovereign efforts, whether it be a sovereign cloud or other sovereignty-related items like sovereign compute and sovereign LLMs, as well as how sovereignty is affecting everything.
I also want to jump back, if I could, to the investment in AI and how it's impacting the industry. With all of the investment being driven by and chasing the large hyperscalers with AI, consider the software impact, and especially the precision aspects of it. The traditional HPC codes and users have the FP64-based codes, but there's uncertainty about the commitment of the vendors to continue to optimize and extend the FP64 roadmap. So in some sense there's kind of a notion, or potentially even an acceptance, that the codes and software folks may need to be skating to where the puck is going to be, leveraging and optimizing their codes where it makes sense and where it can and should be done, to take advantage of all the investment and advancements in the mixed and lower precision areas.
Mark, am I right that a big trend in data storage is around this grab bag, if you will, where all sorts of different storage media, different types of storage, are all under one roof so that it's all available to these huge AI workloads no matter what format it's in or where it is?
I think that's a goal, and not necessarily Nirvana,
but where it's going.
I know that past investments in storage, and the storage winners, if you will, really focused maybe more on the speeds and feeds, the 'ilities' as I'd call them: the reliability, availability, durability.
But those storage systems, not just the features, need to be upleveled more towards data management, especially related to AI, this notion of an AI data pipeline, where data moves through different stages, from ingesting to pre-processing to training to any checkpointing and then eventually getting down to the inferencing aspects of it. Each of those stages in the data pipeline has different I/O profiles, whether it be block sizes or sequential or random and all the other aspects relative to it. So it's a strain and a stress and a set of requirements on the storage system to be able to really address and satisfy all of those. And so the investments that are happening uplevel the storage software stack into data management and orchestration, to be able to provide the data from a common, single storage system that meets the requirements of all the aspects in the data pipeline. That is becoming more critical for the emergence of these, well, I'll call them data platform service providers that are emerging and finding success and traction.
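An illustrative sketch of the idea Mark describes: each stage of an AI data pipeline stresses storage differently, so a data management layer ends up mapping stages to I/O profiles. The stage names follow the discussion above; the profile values are placeholders, not measured numbers:

```python
# Rough mapping of AI data pipeline stages to representative I/O behavior.
# All values are illustrative placeholders, not measurements.
ai_pipeline_io_profiles = {
    "ingest":        {"pattern": "sequential write", "io_size": "large"},
    "preprocessing": {"pattern": "mixed read/write", "io_size": "medium"},
    "training":      {"pattern": "random read",      "io_size": "small"},
    "checkpointing": {"pattern": "sequential write", "io_size": "very large"},
    "inference":     {"pattern": "random read",      "io_size": "small"},
}

for stage, profile in ai_pipeline_io_profiles.items():
    print(f"{stage:14s} -> {profile['pattern']:17s} ({profile['io_size']} I/O)")
```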
So that's still a dream and a goal not achieved as of yet.
Not fully, but it's certainly,
the steps are being taken there.
I think it's closer than just a wild dream; there are some aspects and some workloads and some verticals where this notion of orchestration and being able to manage across the different data types is being realized.
And Earl, I wanted to ask you, now we last saw you at SC in November,
and lots has happened since then.
I'm curious what your thoughts are on the biggest
developments that have happened over the last three or four months that are impacting the industry
overall, the most prominent impacts. Yeah, just for fun, one of the major changes, or areas of confusion, is what the U.S. government is doing, and the number of changes that are taking place almost on a daily basis is leading to a tremendous amount of uncertainty agency by agency.
And along with that, announcements of, starting with a hundred billion, going to 500 billion, to build out 20 new AI data centers across the US. And then that being matched by Europe; the French last week announced a major new AI initiative. The Middle East is planning to make major investments, not just on the AI side, but they'd like to even have their own foundries and build their own chips.
So the amount of growth is just phenomenal, but the amount of uncertainty too is very strong.
I think it's going to take a while before we understand what it means in the US. The national security topics and areas clearly continue with investment, but it's very uncertain for places like NIH, the National Science Foundation, and all these places; their funding and operations are being redirected. So dealing with that is really difficult at the moment, but that's something we hope will settle down in the near future. The other part that we're
finding, at least I do personally, is a tremendous
amount of AI applications, different success stories, ideas that people
have on how they can apply AI.
And like I was saying earlier, just large compute with large data, with
smart people doing smart software to solve a lot of problems that people
previously thought would take three to five years to solve.
And now they're thinking they could do it perhaps even this year. Just use cases
everywhere from the traditional ModSim, legal organizations, and automotive design and a
lot of different areas.
And then there's DeepSeek, which was a pretty big piece of news a few weeks ago. All three of you, what are your takes or insights on the DeepSeek phenomenon?
Sure.
You'll probably get three answers here, which is half the fun.
We're just finishing a short write-up on DeepSeek, and a couple of things are apparent.
They did use a lot of US software.
The number of GPUs used were a bit more than they originally announced.
And perhaps the cost is higher.
But again, this is to me like adding fuel to the excitement fire in the sense
that it's started a price war to make AI a lot less costly to the end user.
And so I think that part's positive.
We're trying to figure out really how real their lower costs are because there's
conflicting articles and data points out there.
And with that, I'll hand it to Bob and Mark.
I think caution should be taken. Yes, what they've done is impressive, what they've been able to achieve and the results they've provided. But, just from our perspective, we do a lot of work relative to total cost of ownership. Were all the total costs associated with actually getting to what they produced, all those total costs that others include, incorporated into what was presented by the DeepSeek folks as what they actually spent to achieve what they've done? The thing I'd add to the conversation,
of course, is that DeepSeek achieved such notoriety by being the number one app in the Apple App Store. So right up front, that kind of scopes the issue. We're an HPC shop and much more interested in what AI means to the science and engineering community, not to the people who want to go out and buy an AI-enabled smartphone from Samsung.
And to me, there's some significant issues that need to be resolved if AI is to become a trusted component
that contributes to the science and engineering community.
It's not just getting the right answer
about 85% of the time,
which is what we see on some of the large language models
and such. There are significant uphill battles
that are gonna have to be fought
before science and engineering can sit down
and say we can validate and verify and trust a reproducible, explainable output from some of this generative AI hoopla that's going on out there.
And so we're really trying to think if this is a pun
on the idea of the tipping point, is this a tripping point?
Has there been too much investment, too much enthusiasm within the enterprise space that's dragging along the science and engineering community, that ultimately may not have legs?
I come from a certain amount of skepticism within the quantum computing community where
people talk about too much investment, not enough realized performance results.
And I'm wondering if we're seeing, I know we're seeing, huge amounts of money going in, in terms of every company out there trying to become an AI-centric organization.
And if you look at the projections of the amount of revenue that these things are supposed
to be generating in the near term, I'm wondering, and I'm not going to say it's a gen AI winter
that's coming, but I think there's going to be a certain amount of reckoning going forward
in the next few years as organizations sit back and say, what did all of this spending and expectations deliver us in terms of return
on investment from a financial perspective or return on science from a science and engineering
perspective?
And quite frankly, I think the jury is still out on all this.
And the one data point I can point to that scares me is that when that Chinese company came out, Nvidia stock tanked.
Now, it was only for a day or two, but that scares me in terms of the volatility and fragility
of what that sector really engenders at this point in time.
So to me, the jury is still out.
I don't see it as the next turn of the crank in HPC writ large.
I see it perhaps as an interesting activity,
an interesting opportunity that ultimately will be
winnowed down to things where it works best
and other things where it's just simply not worth
the cost or the labor to bring it into your existing
computational workload environment.
Yeah, I think that's excellent.
Bob, I was gonna say, I liked your, maybe it was a potential Freudian slip: is it a tipping point, where it's an upward trend? Or is it a tripping point, where we're going to stumble, where we've caught some stumbling? And like you said, the challenge is extreme caution.
Bob, I completely agree. But the DeepSeek phenomenon also pointed to perhaps a situation where maybe we don't need all these resources. Maybe we don't need the latest, greatest chips and interconnects. Maybe there's a lower cost way of getting to the kind of answers that we want. And if so, that
opens the whole AI scene to on-chip AVX vector extensions, maybe even CPUs. And maybe we
don't need the GPUs.
Maybe it can all be on-prem.
There's a lot of unknowns.
And of course, that's what makes it volatile.
But how do you square all of that?
I agree, because to me, what's happened
is because of the intense competition.
Remember, all of this AI progress is being driven
by commercial organizations who are vying for mindshare
and vying for venture capital money as well.
And so they're racing ahead
with some very aggressive developments. No one has had time yet to take a deep breath and say, okay, let's start programming for efficiency. What are the heuristics? How can we distribute this? How can we make it less power consumptive and get better results? Are we over
training some of these large language models?
Do we really need to go to 10 to the 25th operations?
What kind of pruning can we do?
Where are the optimization stages?
And I think that as the sector starts to progress
and the programmers and the engineers
start looking under the hood,
they're gonna go, hey, I've got a better idea.
I know how to do this a little better.
As soon as the technology comes out,
the users find a million ways to start to optimize.
And I think we're going to start to see that.
Look for AI heuristics in the next years,
I think to become a much more integral part
of the overall AI ecosystem.
Because right now it's running full bore,
pedal to the metal kind of acceleration.
And it's time for someone to say,
do we really need to be running
at 8,000 RPM? Can we be a little more efficient in how we do some of this stuff? And I think those
opportunities are just overflowing in AI. No one's just had the time or actually the inclination as
yet to address them. But I think as end users start to bring it in house, they're going to start to
come up with interesting ways to find efficiencies that we haven't even dreamed of yet.
Okay.
So I have another question here.
When I try to read the DeepSeek papers, some of the optimization methods that they used came across to me as standard in the HPC world, like overlapping
IO with computation, duh, we did that 10 years ago, and a few other things.
Where do you land on HPC versus AI?
Do you consider AI just a manifestation of HPC, an HPC app? Or do you think that this is so fundamentally different, mixed precision, as Mark was saying, that it is just different? Or even if it is, even if it came out of HPC, it now is so big and has its own life that it needs to be viewed separately?
I mean, from an algorithm perspective, it's dramatically different.
You have these deep neural networks and such that you're not doing a calculation.
You're getting a result out that you can't tear the hood open.
When you get a result back, you can't go, okay, find me the variable that was at 49%, but if it was a 51%, we could do this.
So from an architectural perspective,
there's a lot of interesting issues here.
But one of the things I like to remind people
is that invariably, all code gets translated into assembly
and then machine language.
So at some level, everything we're doing in the AI world
and the HPC world still comes back to ones and zeros
that have to do things like compute, store, memory, and
interconnect. And so in some sense it's an interesting confluence of
dramatically different algorithms that ultimately are running on the von
Neumann architecture that has been around for the last 60, 70 years. And so
both worlds I think could benefit from a greater understanding of what is brought
to the table by each.
And that's my point: the HPC world knows how to do those things well, the optimization process to wrest the greatest performance out of an architecture, but ultimately when you get to a deep learning algorithm, it's still just a piece of code, machine language, that can be optimized.
So I think there's a lot to be learned from both sides.
They need to cooperate more and there needs to be a little more coordination.
Certainly architecturally there's some ways to fine tune a system to be more amenable
to an AI piece of code versus a mod-sim code.
But again, that's a knowable thing.
It only will start to happen though as the sector starts to stabilize and starts to slow down a little bit in terms of new models, new architectures, new algorithms, and just new vitality of what's going on.
And Bob, what I'd like to add to that is one view, and Shaheen, this is flipping your example a little bit, is that HPC technologies underlie almost all of AI. And as Bob was mentioning, whether it's the hardware, the GPUs,
which were really heavily used and optimized in HPC, the software, the algorithms, the file systems,
the scaling aspects, the power and cooling, all these technologies that have made AI successful
are really HPC technologies underneath. And so just wanted to add that. Yeah, really are.
And Mark, to what extent are you seeing AI-specific storage requirements? Or is it just AI washing?
No, I think it's a little more than AI washing.
I think that there's storage and some data requirements too.
The notion of how big the data sets need to be
and how much data should be included in the training.
Does the whole universe of data need to be included in some really precise model development and training for a really targeted environment?
But in terms of AI washing versus unique requirements relevant to AI, there's some notion of, are files needed and some of those aspects needed, or are parallel file systems required in some aspect of it, or can you move more to object rather than file-based or even block-based across some of these elements within the data pipeline?
There's still an awful lot of evaluation and learning
occurring to really decide what's being done.
Now that is a question I ask a lot of the vendors that I talk to: yes, they're trying to get on the AI bandwagon, but what are you doing explicitly, specifically, that's AI-oriented? And there's only a few that are really addressing specific AI versus really trying to fit a little bit more of what has always been done, but focus it towards AI.
So there's probably a little bit, maybe more than some would care to admit, of AI washing, but I think as we're evolving and learning is occurring, there are more AI-specific features and elements being integrated into the data platforms. And I'm intentionally upleveling the vocabulary from storage systems to a data platform to kind of account for that. Well, you mentioned that whole issue of the US, now France, Europe; we saw news out of South Korea, building out these enormous
AI data center infrastructures. Are there any interesting trends going on in the
area of power and cooling? Because the compute demand is so far outrunning the power capabilities that are available. Are there any interesting trends there?
Yes, Doug, there's been a lot of different changes, as I was mentioning.
The growth rate of these centers is higher than what we saw previously, and beyond what we were expecting in those countries.
And one thing that's always been talked about with AI is once AI more globally
is successful and takes place, it's going to launch a tremendous competitive
kind of battleground between different organizations. Of course, on the military side, there's been a lot of science fiction stories,
but whether you're an academic research institution or any manufacturer, once everyone else or a few
people adopt AI, once it's successful, then other folks will adopt it and there's going to be this
tremendous competitive battle. So the requirements of the hardware, we've all seen what that's done for Nvidia.
On the power side, there's a lot of issues, and we're hearing about a lot of nuclear power plants being put back online and new ones being built right now. We are aware that China is doing a nuclear power plant build-out for their data centers that are extreme scale.
So all these different countries
are trying to address that. And the sustainability and the ability to do your work using less power
is crucial to try to power optimize things. We had an interesting presentation at one of our recent user forums where the presenter was showing how they profiled their workload, and that it cost them a million dollars per megawatt just to install the power into their data center. And even though their system could, say, do a hundred megawatts, it never actually ran at a hundred megawatts. So what the presenter was suggesting is to profile your workload, and maybe you only have to install two-thirds of the peak power the system can draw, because you're never exercising all the GPUs, all the CPUs, and all the memory at the same time.
So that's an interesting one-off solution there. But I think to address this whole thing,
it's going to take a whole host of solutions and at the same time an acceptance that the world's
power, a larger portion of it's going to be going into these large data centers.
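A back-of-the-envelope sketch of the provisioning argument Earl relays, using assumed numbers (the million-dollars-per-megawatt installation cost is the figure quoted above; the rest are placeholders):

```python
# If workload profiling shows the system never draws its full nameplate peak,
# you can provision less installed power and avoid part of the installation cost.
peak_power_mw = 100              # nameplate peak draw of the system (hypothetical)
observed_peak_fraction = 0.66    # highest fraction of peak seen in profiling
install_cost_per_mw = 1_000_000  # approximate figure quoted in the episode

provisioned_mw = peak_power_mw * observed_peak_fraction
savings = (peak_power_mw - provisioned_mw) * install_cost_per_mw
print(f"Provision ~{provisioned_mw:.0f} MW instead of {peak_power_mw} MW, "
      f"saving roughly ${savings / 1e6:.0f}M in power installation cost")
```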
If I could also add a little bit too, there was a comment I heard at SC24 that everyone's either doing AI or cooling AI. And the acceptance and recognition of liquid cooling is there, and it's being adopted aggressively, especially with the new data center build-out. But when you're looking at existing data centers, and folks that are maybe trying to retrofit liquid cooling to existing data centers, they're hitting speed bumps and challenges, and that's the power within their data centers: as they were initially developed and designed, they're still kind of only supporting 10 to 14 kilowatts per rack. And you know, they're just not designed to support the 150, 160 kilowatts per rack, or even upwards of 400 kilowatts or greater per rack, that's on the horizon.
There could be some muting of adoption of the liquid cooling
and addressing of the high power demand,
especially when considering leveraging
the existing data centers.
The one thing I'd like to add to that is,
I think there is a place for a power revolution in AI.
The idea that maybe you don't need a 1.2 kilowatt
chip to do training or to do inferencing and that if you strip the computational requirements
down to their barest elements, you may be able to offer an AI-centric accelerator that
does far less outside the realm of what you need to do for those particular mathematical
calculations.
And if you can offer that at a lower power level and perhaps at a lower price than some
of the general purpose GPUs that we see on the market today, there may be some opportunity,
there may be some place for an organization to say, we can go backwards here.
We don't need to extend this.
And I liken it back to when we always judge microprocessor speeds by frequency.
And then we got at some level and everyone said, this is crazy. Let's just change the
paradigm. And I'm wondering if there is going to be some place for lower powered AI centric
hardware out there that adds a bit of diversity, if you will, to the ultimate selection process.
And unfortunately, we don't have my colleague Tom on here, where he would pipe in, because he lives in Ashburn, Virginia, where 80% of the internet goes through their 113 different data centers.
There has been a data center under construction in Ashburn
for the last decade and a half.
There is a not-in-my-backyard mentality growing, certainly in Ashburn, certainly in Loudoun County writ large, because there's talk about reopening three coal-fired power plants in West Virginia to supply energy to those data centers. And I think that there's going to be some significant consternation by a lot of different neighborhoods, organizations, and such, who basically are saying, enough, we don't want to be the place that has nuclear power plants to supply people so they can ask where the best place to have lunch is. And I think that that issue is going to again drive a reversal of fortune.
I don't see two kilowatt chips. I don't see three kilowatt chips. Just the same way I
didn't see six gigahertz microprocessors at some point. So I think there's some nuance here.
That's going to be exciting to watch in the next few years.
Maybe that's a segue into an area that we're going to have to bring you back,
Bob, to discuss, and that's quantum computing, which I know you track intimately.
And as we were discussing at SC, of the benefits that it provides, one of them is energy efficiency. How do you see that whole area coming along?
You alluded to that, that there's a lot of progress,
but also some disappointment
with the level of progress so far.
How do you see that environment?
The quick takeaway is I think,
and I'm pretty confident about the idea
that within the next three years or so,
we're going to start to see what the sector's calling
utility class quantum
systems, which basically means for a particular science and engineering job, most likely either
optimization or computational chemistry, it's going to be more effective to do that on a
quantum system than it is on a classical counterpart.
And so I think we've seen enough roadmaps that have been in existence for a number of
years that certain milestones have been hit at a regular cadence,
predictable one, that says that is going to be doable within the next three years or so.
And so there's a lot of opportunity here, and you mentioned the power requirements. Whenever I tour a quantum computing facility, usually a manufacturing site, the first thing I do is run around the back of a machine and look at how thick the electrical cables going in are. And quantum systems are still running at the kilowatt range. This is three or four orders of magnitude less than a traditional classical computational counterpart.
And that scaling is absolutely wonderful.
It's not as if you double the qubit count, you double the power requirements.
I actually saw a chip design a couple days ago.
There'll be an interesting announcement the next few days.
Unfortunately, I'm under some NDA stuff.
But the bottom line is I saw a chip package
that holds a very small number of qubits,
but it's the same qubit package that will scale
to 10 to the sixth kind of qubit capability
all within the same package going forward.
So the power requirement, and I've talked to folks about doing the math.
If you're running a kilowatt range versus a tens of megawatt power range, if you can
offload some percentage of your workload to a quantum system, even 5% or 10%, just from
an energy perspective, that machine pays for itself in a matter of months, even if it had
an original $20 million price
tag. So quantum is coming. There's a lot of interesting progress going forward. And I
think one of the untapped benefits of quantum is the fact that it gets you out of this pernicious
cycle of more and more power to deliver more and more compute.
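A small sketch of the scale difference Bob points to, with assumed draws (a classical HPC system in the tens of megawatts versus a quantum system in the kilowatts); the numbers are illustrative placeholders:

```python
import math

classical_power_w = 20e6  # assumed 20 MW classical HPC system
quantum_power_w = 25e3    # assumed 25 kW quantum system

ratio = classical_power_w / quantum_power_w
print(f"Power ratio: {ratio:.0f}x (~{math.log10(ratio):.1f} orders of magnitude)")

# Annual energy at those continuous draws (8,760 hours per year).
hours = 8760
print(f"Classical: {classical_power_w * hours / 1e9:,.0f} GWh/yr; "
      f"Quantum: {quantum_power_w * hours / 1e6:,.0f} MWh/yr")
```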
Yeah, definitely.
It's almost a reset from a computational perspective. Yeah. I don't know
if the industry itself is really pushing on that yet, maybe because they're not tracking GPUs.
There's a few companies that get it and I'm not going to say I've been haranguing them all,
but there are, I'm starting to see roadmaps that actually list not only qubit counts and some of
the other particulars that are unique
to the quantum computing space, but also the idea
this is what our power consumption budgets are going to be
in the next three or four years.
So some organizations are really starting
to wake up to that fact.
Excellent.
Now, we talked about liquid cooling. Do you see that happening across the board? And maybe the real question is really immersion: is that going to happen? Because I think liquid is a pretty standard thing now, liquid plus air, that is.
Yeah, I think they're both being adopted, liquid cooling more so than immersion cooling. I did have a recent conversation where there was originally thought of a higher transition rate or adoption rate to get to immersion cooling, but folks, even the vendors and the people really tracking that market, have muted their expectation of that. And there's a number of reasons for it, but some of it is that it's just kind of new. It's been around for a while, but in the mainstream enterprise adoption type of space, and even some of the HPC space, it's still relatively new. And there's some notion and a hindrance that some of the highest performing parts are warranted differently, or not at all, for use in certain immersion fluids. So there's a usage consideration there that may mute adoption of immersion cooling. We do see an uptick in adoption, but it might not be as strong as it may have initially been thought.
I heard there were considerations with decay
and that sort of a thing with various fluids.
There was one day I toured an all immersion cooling data
center last year and what was remarkable
was a deafening silence in the data center
because there were no fans.
That's right.
You walked in and it was eerie, compared to what you're used to experiencing walking into a regular data center.
And probably a lot of headroom.
Absolutely.
I had that same feeling, Mark, when I stood around the Frontier exascale system.
I think, where's the noise?
It's amazing.
So one thing I'd like to add to that too, is we are studying lots of different types of liquids for immersion cooling. The different issues, Shaheen, you mentioned some of the deterioration issues in that too, can vary whether it's oil-based or Fluorinert-based, and single-phase or multi-phase change cooling. So there are a lot of different options still out there that people are exploring.
Yeah, exactly.
I mean, they're all synthetic at this point.
And I imagine you could formulate just the right mix
so you can have the cake and eat it too.
So I'm like expecting that to be done any day now, right?
We keep watching it.
We're watching to see what happens.
I know part of our discussion would focus on some of your more interesting, or what you think are intriguing, predictions for either the rest of the year or the decade or both.
Sure, I'd like to ask Bob to talk about quantum computing. There's been so much, shall we say, interest in the press, and quotes by famous people about how it could be 30 years away and things like that. And Bob, maybe you can address what our view is on that.
Yeah, sure. First of all, anything past five year predictions to me is science fiction.
So I don't know how you can say 30 versus 29. And I do like the fact that right after
Jensen came out with his pontification, Bill Gates came out with his. So famous people
who are peripherally connected to the quantum space, who have more mindshare
than others, are to be taken with a grain of salt.
I like to look at what the experts are saying, not the marketing department, but the actual
folks that are involved.
As I said earlier, the roadmaps are pretty reliable in terms of progress in quantum.
The big thing here is application, or more accurately algorithm development.
There's still a limited span of use cases for quantum.
And to me, the big, not surprise,
but a development that I'd hoped for,
but didn't expect to see this soon,
is a growing enthusiasm towards
on-prem procurements of quantum.
Moving away from cloud access models
and bringing quantum computing into your existing,
generally large, classical HPC ecosystem
and starting to think about how I can integrate
that quantum system into my workloads.
And that's where I think we're going to start to see
some amazing innovations going forward.
The idea where I now have an application
that's difficult to deal with,
but I let the quantum system do what quantum does best,
and I let the classic side of the house do what it does best,
and I have these wonderfully interesting,
innovative, hybrid applications.
For the longest time,
the quantum application algorithm realm
has been dominated primarily by quantum people.
And I think that unleashing the vast hordes
of really smart classical programmers, mathematicians,
organizations that develop algorithms and applications
is going to offer up kind of a new frontier
of hybrid applications.
And I think that's really what's going to drive
or propel forward quantum adoption, quantum acceptance,
and ultimately the effectiveness and usefulness of quantum across the entire sector.
I'm not saying it's going to be used for everything.
No, you're never going to check your email on your quantum processor.
But I think that once it starts to get into the hands of end users who say, I need to
do this and what can the quantum system down the hall help me do, we're going to see a new birth, if you will, of insight and innovation.
And that is, I think, part of what's going to drive the on-prem emphasis,
bringing the system in-house and tying it directly to your existing computational workload
and letting your programmers and your scientists and your software developers
have 24/7 access to that system, to wrest out the computational performance capabilities. That's a little harder to get with indirect access through the cloud access model, which is what dominates today. So pretty
positive about what's going forward, but I guess the surprise there was the growing interest in on-prem capabilities versus just maintaining cloud access, which is predominantly a great model for pay-as-you-go, low barriers to entry for exploration, and ease of switching vendors to kick around and try out different modalities, versus committing to a particular modality, a particular vendor, to optimize for your particular workloads.
Brilliant.
Right on. Doug, I would like to add another prediction that we have that we're
expecting to see is some tremendous growth in budgets for the different
machines, supercomputers and exascale and so on around the world.
From our recent surveys on average, people are expecting to increase
their budget quite a bit, but there's really been a shift.
We're talking now about $500 million systems as being, quote unquote, common; at least there's three or four of them. But at the more mainstream level, folks are planning to go from a 500,000-ish spend rate to something closer to 750,000, which, when you go across the market, is just tremendous growth there.
But then the amazing part is there are a number of people
talking about $100 billion expenditures, Microsoft, OpenAI
and the US government. So these numbers are just staggering as far as the growth
rate compared to just three years ago for a large system. And we think that's
going to continue as a trend. I'll also add another area that we're looking at, and Bob hinted, maybe even more than hinted, at it when he talked about trying to find the right application of AI in science and engineering, where, you know, 85% accuracy doesn't cut it, with, you know, the most recent hallucination there where AI suggested that, yeah, water only freezes at 32 degrees and it won't freeze at 27 degrees.
But it also reminds me of, I go back to WarGames and the WOPR computer and the humans in the missile silos.
That's right.
You couldn't unplug it either.
Yeah, but we're thinking that will come back around, inserting human oversight into the AI process to really better tune and optimize and manage and minimize risk. It can make it more trustworthy as we bring the human element back into this process and don't put full blind trust in what AI is doing, especially in the critical areas of traditional HPC design; we want more than 85% accuracy with the design and development of a new airplane wing.
Could I ask Mark, I have sort of a pet interest in interconnects and optical I/O. I've talked with other folks who think this could be the year for optical I/O becoming more commercially viable.
What's your sense on that?
There's certainly a lot of interest and investment going on in optics. I don't think we're hitting mainstream adoption for that yet. There's certainly some interesting things being done in both the on-chip optics and on-board optics areas, some standardization with the interconnect elements down within the chiplet interface kind of areas. But there's still, I think, some challenges, especially in manufacturing some of the elements, whether it's the lasers and all that, manufacturing the various optical elements at scale to be able to support broad mainstream adoption. So while the technology is there and some proofs of concept are really promising, broader widespread production and manufacturing, I think, is going to mute some of the growth and adoption of that. Okay. Shaheen and I could go on all day with you guys.
We certainly could. It's such a treat. Thank you. Oh, thank you so much. Thank you.
Thank you. Good to catch up with you guys.
Any comments or thoughts that we haven't covered that you'd like to do?
I just want to thank you for the opportunity to have the conversation.
Thank you very much.
We look forward to staying in touch.
A lot of activities, as you mentioned, GTC around the corner, Quantum Days left and right,
storage continues to be the complicated thing that it has been.
So it's a delight to be able to catch up and I look forward to us doing that again soon.
Thanks so much guys.
All right.
Thank you.
Bye-bye.
Thank you.
That's it for this episode of the At HPC podcast.
Every episode is featured on insidehpc.com
and posted on orionx.net.
Use the comment section or tweet us with any questions
or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen.
The At HPC Podcast is a production of OrionX in association with Inside HPC.
Thank you for listening.