@HPC Podcast Archives - OrionX.net - @HPCpodcast-98: Hyperion Research on HPC, AI, Quantum – In Depth

Episode Date: February 27, 2025

We are delighted to have as special guests today three of the top analysts in the HPC, AI, Cloud, and Quantum fields, representing the industry analyst firm Hyperion Research. Earl Joseph is Hyperion's CEO, Mark Nossokoff is Research Director, and Bob Sorensen is Senior VP of Research. Join us for an In Depth discussion of the current state and future trends in HPC, AI, Quantum, Cloud Computing, Exascale, Storage, Interconnects and Optical I/O, and Liquid Cooling.

Audio: https://orionx.net/wp-content/uploads/2025/02/098@HPCpodcast_ID_Hyperion-Research-HPC-AI-Quantum-Market_20250227.mp3

Transcript
Starting point is 00:00:00 If you look at all the spending to do HPC and AI related work, the overall spend rate was around $52 billion in 2024, and we're expecting by 2028 that to reach around $85 billion. The line was drawn at number eight. So the top eight machines are as powerful as the bottom 492 systems. A greater comfort level and a willingness to embrace this notion of a continuum of computing, not either on-prem or cloud. 3.5 companies in essence can really skew the amount of money that's being committed
Starting point is 00:00:39 towards HPC in its broadest definition simply because they have the money and the agenda to go out and spend. From OrionX in association with InsideHPC, this is the @HPCpodcast. Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them. Thank you for being with us. I'm Doug Black at InsideHPC,
Starting point is 00:01:05 along with Shaheen Khan of OrionX. Guests today are from HPC industry analyst firm, Hyperion Research. We have Earl Joseph, the CEO. He has oversight for Hyperion's overall research and consulting efforts, along with the firm's HPC user forum, which conducts conferences in the
Starting point is 00:01:26 US and around the world throughout the year. We have Mark Nossokoff. He is Research Director and Lead Analyst for Storage, Interconnects, and Cloud. And Bob Sorensen, Senior VP of Research. Among other areas, Bob focuses on exascale supercomputing, along with quantum and other future technologies. So today we're going to talk about future trends in HPC and AI, how the melding of the two has impacted Hyperion's research strategy, and look ahead to the trends we can expect in 2025 and the rest of the decade.
Starting point is 00:02:03 So Earl, let's start with you from a big picture perspective, share your views on the relative health and growth of the industry. And particularly now that Hyperion has grouped AI under the HPC umbrella or within it, we're looking at some very healthy CAGR numbers for the rest of the decade, isn't that right?
Starting point is 00:02:22 Thanks, Doug. And we appreciate the invitation to join you and Shaheen today. Over the last three or four years, we've been tracking the AI portion of the HPC market in much more depth, and there have been a lot of new sellers in the market, Nvidia, Supermicro, and a lot of AI specialty houses selling equipment, which are really non-traditional vendors,
Starting point is 00:02:43 and we have included those into our market tracking numbers. The explosion in AI, which started just a little bit more than two years ago, has resulted in the market size, as the way we track it, to be 36.7% larger. More than a third larger because of all this growth and increase in use of AI. But in addition to that, this whole recognition that high performance computing, computing at scale, and AI capabilities, it's lifting both the AI portion of market, plus the traditional HPC portion,
Starting point is 00:03:15 because government agencies, companies are all recognizing that the use of big data, big HPC, and scaling can do a lot of good things. Now that's also changed the growth trajectory that you were mentioning. If we look at the overall market right now, we're expecting to see around 15% annual growth over the next five years.
Starting point is 00:03:34 That used to be seven to 8%. So the market is more than a third larger and growing at roughly double the growth rate that we were seeing before. The other parts of the market that are growing tremendously is cloud computing. And the way we track that is what HPC end users, researchers, engineers, analysts, and others, spend in the cloud itself. And that we saw in 2024 grow by 20%.
Starting point is 00:03:59 So just a phenomenal growth for here, almost large enough to call it a tipping point year. And then for last year 2024, for the on-prem AI servers, we saw those grow by 40%. So just some phenomenal growth rates in the market. And to put it in the big picture perspective, if you look at all the spending to do HPC and AI related work, so it'd be the on-prem servers, software storage, maintenance, plus cloud spending. The overall spend rate was around $52 billion in 2024. And we're expecting by 2028 that to reach around $85 billion. So just some tremendous growth taking place. Erl and Mark and Bob, it's such a treat to have all of you on this call. And thank you for making that happen.
Starting point is 00:04:43 And thank you, Erl, for that sort of setup for the market. What do you think is causing the growth in cloud? I have several questions, of course, but just to start, is the cloud adoption, you feel it's related to just the inability of on-prem data centers to handle these emerging CPUs and GPUs from power and cooling standpoint, or is it really something else? Perhaps we can all three share some ideas. I'll start with a couple of them. Cloud computing has become much more useful for HPC, meaning the cloud providers have created better software, better environments and hardware approaches, and their prices have gotten a bit more competitive. But some of the
Starting point is 00:05:22 gigantic drivers is the relation of new technology. Nvidia and AMD have announced that they're going to have new processors every 12 months so if you want access to the latest technology out is one really good avenue. In addition the supply chain issue if you wanted to get an Nvidia Blackwell let's say a couple hundred of them right now the waiting list is more than 12 months. So the difference between buying on-prem, which means you buy something now and you have to keep it four or five years, you're going to be multiple years behind in technology, really is driving a lot of people to use the clouds for more parts
Starting point is 00:05:58 of the workload. And I think Mark and Bob can add to that. Yeah. Hi, this is Mark. Thanks for, I'm happy to be here and thanks for including us in these conversations. To add what Earl said, there's for, I'm happy to be here and thanks for including us in these conversations. To add what Earl said, I think it was just on all sorts of comfort level from the users and research that are utilizing.
Starting point is 00:06:13 They've gone through the various stages, possibly of resistance and acceptance and now are in evaluation and more understanding of what makes sense to run in the cloud and on premises. So just a greater comfort level and a willingness to embrace this notion of a continuum computing, not either on-prem or cloud, but how can I best utilize and embrace the cloud across all my workflows and optimize my resources. Something that's been coming about as well is this notion of sometimes referred to as Neo clouds or the really focused attention of AI as a service or GPU as a service in a more just focused source provisioning that differs from the broader
Starting point is 00:06:53 cloud providers like Google, Amazon or Microsoft Azure that provide a wide range of services across the full spectrum. But these Neo clouds are focusing on just providing the AI or G assertive kind of activities. And maybe one last time, I'll comment on before turning over to Bob, was that there is a sustainability element as well as organizations are still tasked and have goals for energy efficiency and sustainability.
Starting point is 00:07:20 There's a migration to the cloud that can be more efficient of resources and remove the carbon footprint and power utilization from on premises for what they would be on the hook for measuring and reporting that then gets deferred to how the cloud providers are tackling the issue. I guess I'll go now for those who are listening at home, trying to decide which voice is which I'm Bob Sarts and I'll be the one who'll be speaking too fast. Hopefully that'll help you out. I have to kick the discussion along because I think we need to move away from on-prem versus cloud to on-prem and cloud. I think it's really time for organizations to understand the fact that if you could sum up the opportunities here, if you have capacity, if you have the need for a specific workload that you know you're going to need for the next three to four years, you have your architectural requirements
Starting point is 00:08:12 and your budgetary requirements in sync, then maybe on-prem is the way to go. If you're looking for resource access, some of the things are all talked about, emerging technologies, Nvidia coming out, or trying to come out with at least a new product at a higher cadence. New technologies, what's going on in some AI space.
Starting point is 00:08:29 A lot of the cloud service providers are looking at offering their own AI accelerators. The easiest way to explore that opportunity is through the cloud. And so I think that really thinking about how one can solve their computational capabilities by a much more sophisticated mix of on-prem versus cloud is something that I hope we see in the near future.
Starting point is 00:08:49 As organizations say, we have a diverse set of needs. Some of it's predictable, some of it's reliable, and some of it we'd like to have in-house so we can kick the tires on the machine we own. But at the same time, we want the flexibility and perhaps the low barrier to entry to explore new technology and the flexibility and perhaps the low barrier to entry to explore new technology and the flexibility from a budgetary perspective to basically bring new technology, new workloads
Starting point is 00:09:11 to bear, or maybe even doing some workload balancing. So what I'd like to see in the near future is a lessening of this us versus them versus everybody kind of holding hands and singing Kuba-ya going forward. Bob, it sounds like it's a much more complex managerial task to oversee how you put together this hybrid mix on-prem, off-prem. But let me ask a very concrete question. Earl, you mentioned a 12 month waiting period to get your hands on Blackwell chips. If you went to the cloud,
Starting point is 00:09:42 is that immediate, are Blackwells immediately available? Immediately Immediately as long as you have the pocketbook for it but yes you can start your job almost immediately if you go with the cloud and when I mentioned the 12 months if you actually wanted to buy quite a few of them and put a larger on-prem server in place the 12 months is how long it takes to get the chips from Nvidia right now. So whoever builds the system would probably need another four months. So you might be looking at 16 months. And a good example of why that's such a big problem is if your boss just told you to, hey, figure a way to use AI and do this great stuff on Blackwall, you can't go back to your boss and say, give me 16 months to get a
Starting point is 00:10:21 machine on board, then I'll start working the answer. Whereas in the cloud, it's give me 16 months to get a machine on board, then I'll start working the answer. Whereas in the cloud, it's give me 16 minutes and I'm going to be up and running. Yeah, non-starter. The differential there is massive. Yeah. Now, at the risk of complicating this, Bob, you mentioned that if your workload is predictable, then maybe you want to be on-prem. But it's also true that if your workload is predictable, you can get better prices on the cloud. Maybe part of what we're looking at is that this stuff is just going to cost one way or another, and it's not for the faint of heart or somebody who has a limitation with budgets. That's also true, isn't it?
Starting point is 00:10:55 I would argue that the idea here is that when you procure an HPC, you're really thinking about a three, four, five, or six year budgetary commitment. And that actually never is actually getting longer. I'm not terribly aware of cloud service provider agreements that really extend out to those kinds of things. So there is some variability going on here
Starting point is 00:11:16 in terms of what can I commit to on an on-prem procurement, which for certainly a lot of government organizations and a large subset of commercial organizations are pretty much in sync. So the bottom line here is that some organizations are already geared towards these long-term budgetary commitments. And the cloud, A, doesn't offer that kind of
Starting point is 00:11:37 long-term commitments, certainly for even their largest organizations. And in my mind, Doug talked a little bit about the complexities of this on-prem cloud happy wedding here. And the concern here is how one deals with a budget where a significant percentage of your budget may be in this kind of three, four, five year procurement cycle.
Starting point is 00:11:56 And the other one may be a month by month, annual by annual, maybe two or three year kind of commitment on the cloud and how you do resource management and provisioning in those two very different environments. The bottom line here is there are some compelling issues to be attacked and addressed if you're thinking about having a much more sophisticated on-prem environment. And that of course doesn't even address the use of software. But here we're talking about budget and those things are going to have to be worked out
Starting point is 00:12:21 because it's one thing for long-term commitment, it's another thing for a short-term pay-as-you-go model and how you reconcile those two things, especially if you're working in a government organization or a large commercial organization that wants predictability and budgetary commitments, those things are going to have to be ironed out going forward. Yeah. Could we, for a moment, let's talk about leadership class supercomputing. Right now, the state of the art, of course, is of course is access scale and but it's very interesting there's more talk about maybe the most powerful computing could come out of the cloud providers certainly the hyper scalers have money and budget. For technology r&d that dwarfs say the department of energy what are your thoughts about this mix of we have three exascale US systems, how many in China we don't know, but how will that work over against or with the HPC capabilities coming out of big tech companies?
Starting point is 00:13:15 Yeah, first off, Doug, my issue here is the concentration on the highest end of computing nowadays, the so-called exascale systems. And the top 500 list when it comes out every six months or so, one of the things they release is this wonderful metric that talks about drawing an imaginary line. And if you, above the line is 50% of the sum total of the computing power above that line, and below the line is the other 50%. So it's where you draw the line in terms of the average system, if you will. In the old days, the line would be drawn at about a hundred machines. So the hundred days, the line would be drawn at about 100 machines.
Starting point is 00:13:45 So the 100 machines on top were equal to the bottom, 400 machines on the bottom. That number has gone down significantly over the last few decades. And the last one, I think the line was drawn at number eight. So the top eight machines are as powerful as the bottom 492 systems. And so what we're seeing in some sense
Starting point is 00:14:04 is an architectural, a power consumption, a cost capability, and a complexity in programming at those exascale class systems that in some way, in my mind, aren't terribly representative of where HPC is going writ large. Certainly not a lot of organizations are capable of spending $6 million plus on a system. They're not capable of funding 50 megawatt systems that may ultimately cost you in some
Starting point is 00:14:30 sense equal to what the machine costs. You may be paying power and cooling costs that are equal to the acquisition cost of your HPC during its lifetime. So to me, I think that the exasale system mentality has to start to move back into sync with where the rest of the sector is going. Lower power, lower cost, and really what that boils down to from a practical perspective, and we're starting to see this coming out of the Department of Energy, is the idea of partitioned heterogeneous architectures, different subclasses of partitions that address the
Starting point is 00:15:03 workloads that matter to the particular organization. And this way you're not buying one machine every seven years that cost 500 million dollars plus. You're buying a series of SunSense of smaller partitions that can roll out in time to address your workloads much more responsive because you're not committing to a single architecture for five, six, seven, eight years. You're committing to an architecture that can evolve over time to address new workloads, address new technology options, and unfortunately address the vagaries of the budgetary process. And that's where I think we're going to go here in terms of Exascale's class system. And the one thing I'm going to mention, and I'm so glad you brought this up Doug,
Starting point is 00:15:43 we just started putting out these things we're calling them top of mind surveys. It's five quick questions that we're blasting out to the entire HPC world. And I think question number three on that was, when do you expect to see a hyperscaler HPC occupy the top one slot on the top 500 list? And I think we had next year, two years, three years, four years, five years and never. But we're, as I said, we just launched that. And I really can't wait to see the results of that in terms of what the overall community thinks. The answer to that exact question. If so, when is that going to happen? Because I think that the stars will align
Starting point is 00:16:20 and certainly some organization is going to be willing to commit the cycles to run that top 500 list Linpak benchmark and at least for the sake of publicity have a number one slot going forward. It may not be there for long, but it may be just long enough to run Linpak, but I think we're going to see it sooner rather than later. And Bob, I think those are great points and it's such a great question. My view of the world right now is there's really three categories of, if I call them supercomputers, large scale machines. One of them is the hyperscalers. Talk about why they're different. The other one is the exascaler, the real giant machines. Bob mentioned the top 500 list shows us too. Maybe there's eight or nine of those in the world right now. And then third group is everyone else that's doing HPC and AI. And they're all going off into three different directions right now.
Starting point is 00:17:07 But Doug, your question about the CSPs, the hyperscalers, and the social media companies that are building these giant systems, they're not being built to do broad-scale R&D or leadership R&D. So I think there's still tremendous room for national labs around the world to really do leadership class R&D. They're much more focused on what I would call a narrower set of problems, whether that's large language models only or three or four types of AI, but then the bulk of their cycles are sold out to clients and they're servicing clients on a lower scale. So with an exception of a few of them, they're not
Starting point is 00:17:42 fully being used for leadership class workloads. And Earl, let me just add quickly that we could say 3.5 systems, because to me, we can't ignore the fact that these basically self-built dark systems, the idea of what's going on at, say, Tesla with Colossus and some of these other organizations that are assembling their own systems for their own uses, the very end use specific, primarily right now targeted towards things like AI, large language model or generative AI training. And those organizations have significantly deep pockets. So in some sense, one or two very large installations by some of these 3.5 companies, in essence, can really skew the amount of money that's being committed towards HPC in its most broadest definition, simply because they have the money and the agenda to go out and spend $3 billion to buy an awful
Starting point is 00:18:31 lot of Nvidia GPUs to do in-house training for in-house end uses. So there again, the sector is fractionating once again, creating more opportunity, creating more confusion, and just opening the potential applicability of whatever particular flavor of HPC you're interested in. I have a question, maybe this is for you, Mark, and that's data gravity. We talked about the hybrid model and is the hybrid model stable or is it a slippery slope towards one end or the other or how do you keep that hybrid model to meet the ideals that it was designed to optimize?
Starting point is 00:19:05 I think there's a couple of things there where the data graven, and maybe the first and foremost thing is trying to minimize any data movement that has to occur between as you're leveraging both on-premises and in the cloud and the hybrid architectures, that the ability to maybe move the compute closer to where data is rather than moving the data around is going to be critical to doing some of this. Another consideration is the notion of sovereignty and where the
Starting point is 00:19:28 data is created and where can the data actually go and need to reside, what relative to client and who can access the data based on some of the growing attention being given to sovereign efforts whether it be a sovereign cloud or even other sovereignty related items like the sovereign compute sovereign, sovereign LLMs too, as well as how sovereignty is affecting everything. I also want to jump back if I could on the investment in AI and how it's impacting the industry that with all of the investment being driven and chasing the large hyperscalers with the AI, when you consider from the software impact and especially looking at the precision aspects of it. The traditional HPC codes and users have the FP64 based codes,
Starting point is 00:20:13 but there's an uncertainty on the commitment of the vendors being able to continue to optimize and extend the FP64 roadmap. So in some sense, kind of a notion or potentially even an acceptance that the codes and software folks may need to be skating to where the puck is going to be and leveraging and optimizing where their codes are, where it makes sense and where it can and should be done to take advantage of all the investment and advancements in the mixed and lower precision areas. Mark, am I right that big trend in data storage is around this grab bag,
Starting point is 00:20:45 if you will, where all sorts of different storage media, different types of storage are all under one roof so that it's all available to these huge AI workloads no matter in what format it's in or where. I think that's a goal and not necessarily in Nirvana, but where it's going. I know that past investments in storage and the storage winners, if you will, really focused on maybe more on the speeds and feeds, the abilities, as I'd call them, the reliability,
Starting point is 00:21:14 availability, durability. But those storage systems, not just the features, need to be upleveled more towards the data management, especially related to AI, this notion of an AI data pipeline, where the different stages within the data pipeline that as data moves from ingesting to training to the pre-processing to any checkpointing and then eventually getting down to the inferencing aspects of it. Each of those stages in the data pipeline have different IO profiles, whether it be block sizes or sequential or random and all the other aspects relative to it. So it's a strain and a stress and requirements on the storage system to be able to really
Starting point is 00:21:49 address and satisfy all of those. And so, the investments that are happening up level or up the data, the storage software stack into the data management and orchestration to be able to provide the data from a common storage system, single storage system to meet the requirements of all the aspects in the data from a common storage system, single storage system to meet the requirements of all the aspects in the data pipeline is becoming more critical for the emergence of these, the data, well, I'll call data platform service providers that are emerging in finding success and traction.
Starting point is 00:22:17 So that's still a dream and a goal not achieved as of yet. Not fully, but it's certainly, the steps are being taken there. I think it's, there are, it's closer than just a wild dream, but there are in some aspects and some workloads and some verticals that this notion of the orchestration and being able to manage across the different data types are being realized. And Earl, I wanted to ask you, now we last saw you at SC in November, and lots has happened since then.
Starting point is 00:22:43 I'm curious what your thoughts are on the biggest developments that have happened over the last three or four months that are impacting the industry overall, the most prominent impacts. Yeah, just for fun, one of the major changes or line of confusion is what the U.S. government is doing and the number of changes that are taking place almost on a daily basis is leading to a tremendous amount of uncertainty agency by agency. And along with that announcements of starting with the hundred billion going to 500 billion to build out 20 new AI data centers across the US.
Starting point is 00:23:17 And then that being matched by Europe, the French last week announced a major new AI initiative. The Middle East is planning to make major investments, not just on the AI side, but they'd like to even have their own foundries and build their own ships. So the amount of growth is just phenomenal, but the amount of uncertainty too is very strong. I think it's going to take a while before we understand what it means in the US. The national security topics and areas clearly continue with investment, but it's very uncertain for us. We're NIH, the National Science Foundation, and all these places,
Starting point is 00:23:51 their funding and operations are being redirected. So learning with that is real difficult at the moment, but that's something we hope will settle down in the near future. The other part that we're finding, at least I do personally, is tremendous amount of AI applications, different success stories, ideas that people have on how they can apply AI. And like I was saying earlier, just large compute with large data, with smart people doing smart software to solve a lot of problems that people previously thought would take three to five years to solve.
Starting point is 00:24:24 And now they're thinking they could do it perhaps even this year. Just use cases everywhere from the traditional ModSim, legal organizations, and automotive design and a lot of different areas. And then there's DeepSeek, which was a pretty big piece of news a few weeks ago. All three of you, what are your takes or insights on the DeepSeq phenomenon? Sure. You'll probably get three answers here, which is half the fun. We're just finishing a short write up on DeepSeq and a couple of things are apparent.
Starting point is 00:24:54 They did use a lot of US software. The number of GPUs used were a bit more than they originally announced. And perhaps the cost is higher. But again, this is to me like adding fuel to the excitement fire in the sense that it's started a price war to make AI a lot less costly to the end user. And so I think that part's positive. We're trying to figure out really how real their lower costs are because there's conflicting article and data points out there.
Starting point is 00:25:22 And with that, I'll hand it to Bob and Mark. I think caution should be taken. Yes, what they've done is impressive and what they've been able to achieve and provide the results. But have they, just from a share, we do a lot of relative to total cost of owners. Were all the total costs associated
Starting point is 00:25:37 with actually getting to what they produced, were all those total costs that others include, were they all incorporated into what was was presented by the deep sea folk What they actually spent to achieve what they've done the thing I'd add to the conversation Of course is that deep seek achieve such notoriety by being the number one Apple in the Apple Store So right up front that kind of scopes the issue and so we're an HPC shop and much more interested in what AI means to the Science and engineering community not to the people AI means to the science and engineering community, not to the people who want to go out and buy an AI enabled smartphone from Samsung.
Starting point is 00:26:11 And to me, there's some significant issues that need to be resolved if AI is to become a trusted component that contributes to the science and engineering community. It's not just getting the right answer about 85% of the time, which is what we see on some of the large language models and such, there are significant uphill battles that are gonna have to be fought before science and engineering can sit down
Starting point is 00:26:37 and say we can validate and verify and trust a reproducible explainable output from some of this generative AI hoopla that's going on out there. and verify and trust a reproducible, explainable output from some of this generative AI hoopla that's going on out there. And so we're really trying to think if this is a pun on the idea of the tipping point, is this a tripping point? Has there been too much investment, too much enthusiasm
Starting point is 00:26:59 within the enterprise space that's dragging along the science and engineering community that ultimately may not have legs. I come from a certain amount of skepticism within the quantum computing community where people talk about too much investment, not enough realized performance results. And I'm wondering if we're seeing, I know we're seeing huge investments going into the amount of money in terms of every company out there trying to become an AI-centric organization. And if you look at the projections of the amount of revenue that these things are supposed
Starting point is 00:27:32 to be generating in the near term, I'm wondering, and I'm not going to say it's a gen AI winter that's coming, but I think there's going to be a certain amount of reckoning going forward in the next few years as organizations sit back and say, what did all of this spending and expectations deliver us in terms of return on investment from a financial perspective or return on science from a science and engineering perspective? And quite frankly, I think the jury is still out on all this. And the reason that the one data point I can point to that scares me is when that Chinese company came out,
Starting point is 00:28:05 Nvidia stock tanked. Now, it was only for a day or two, but that scares me in terms of the volatility and fragility of what that sector really engenders at this point in time. So to me, the jury is still out. I don't see it as the next turn of the crank in HPC writ large. I see it perhaps as an interesting activity, an interesting opportunity that ultimately will be winnowed down to things where it works best
Starting point is 00:28:32 and other things where it's just simply not worth the cost or the labor to bring it into your existing computational workload environment. Yeah, I think that's excellent. Bob, I was gonna say, I liked your, maybe it was a potential Freudian slip, is it a tipping point where it's an upward trend? Or is it a tripping point?
Starting point is 00:28:50 That we're going to stumble that we caught some stumbling. And like you said, the challenge is extreme caution. Bob, I completely agree. But the deep seek phenomenon also pointed to perhaps a situation where maybe we don't need all these resources. Maybe we don't need the latest, greatest chips and interconnects. Maybe there's a lower cost way of getting to the kind of answers that we want it to. And if so, that opens the whole AI scene to on-chip AVX vector extensions, maybe even CPUs. And maybe we don't need the GPUs.
Starting point is 00:29:25 Maybe it can all be on-prem. There's a lot of unknowns. And of course, that's what makes it volatile. But how do you square all of that? I agree, because to me, what's happened is because of the intense competition. Remember, all of this AI progress is being driven by commercial organizations who are vying for mindshare
Starting point is 00:29:41 and vying for venture capital money as well. And so they're racing ahead with some very aggressive developments. No one has had time yet to take a deep breath and say, okay, let's start programming for efficiency for what are the heuristics? How can we distribute this? How can we make it more less power consumptive and get better results? Are we over training some of these large language models? Do we really need to go to 10 to the 25th operations? What kind of pruning can we do?
Starting point is 00:30:09 Where are the optimization stages? And I think that as the sector starts to progress and the programmers and the engineers start looking under the hood, they're gonna go, hey, I've got a better idea. I know how to do this a little better. As soon as the technology comes out, the users find a million ways to start to optimize.
Starting point is 00:30:27 And I think we're going to start to see that. Look for AI heuristics in the next years, I think to become a much more integral part of the overall AI ecosystem. Because right now it's running full bore, pedal to the metal kind of acceleration. And it's time for someone to say, do we really need to be running
Starting point is 00:30:45 at 8,000 RPM? Can we be a little more efficient in how we do some of this stuff? And I think those opportunities are just overflowing in AI. No one's just had the time or actually the inclination as yet to address them. But I think as end users start to bring it in house, they're going to start to come up with interesting ways to find efficiencies that we haven't even dreamed of yet. Okay. So I have another question here. When I try to read the DeepSeq papers, some of the optimization methods that they used came across to me as like standard in the HPC world, like overlapping
Starting point is 00:31:18 IO with computation, duh, we did that 10 years ago, and a few other things. Where do you land on HPC versus AI? Do you consider AI just a manifestation of it's an HPC app? Or do you think that this is so fundamentally different, mixed precision, as Mark, you were saying that is just different. Or even if it is, even if it came out of HPC, it now is so big and has its own life that it needs to be viewed separately. I mean, from an algorithm perspective, it's dramatically different.
Starting point is 00:31:50 You have these deep neural networks and such that you're not doing a calculation. You're getting a result out that you can't tear the hood open. When you get a result back, you can't go, okay, find me the variable that was at 49%, but if it was a 51%, we could do this. So from an architectural perspective, there's a lot of interesting issues here. But one of the things I like to remind people is that invariably, all code gets translated into assembly and then machine language.
Starting point is 00:32:14 So at some level, everything we're doing in the AI world and the HPC world still comes back to ones and zeros that have to do things like compute, store, memory, and interconnect. And so in some sense it's an interesting confluence of dramatically different algorithms that ultimately are running on the von Neumann architecture that has been around for the last 60, 70 years. And so both worlds I think could benefit from a greater understanding of what is brought to the table by each.
Starting point is 00:32:46 And that's my point about the HPC world knows how to do those things well, the optimization process to rest out the greatest performance out of an architecture, but ultimately when you get to a deep learning algorithm, it's still just a piece of code that's machine language that can be optimized. So I think there's a lot to be learned from both sides. They need to cooperate more and there needs to be a little more coordination. Certainly architecturally there's some ways to fine tune a system to be more amenable to an AI piece of code versus a mod-sim code.
Starting point is 00:33:18 But again, that's a knowable thing. It only will start to happen though as the sector starts to stabilize and starts to slow down a little bit in terms of new models, new architectures, new algorithms, and just new vitality of what's going on. And Bob, what I'd like to add to that is one view, and Shaheen, this is flipping your example a little bit, is that HPC technologies underlie almost all of AI. And as Bob was mentioning, whether it's the hardware, the GPUs, which were really heavily used and optimized in HPC, the software, the algorithms, the file systems, the scaling aspects, the power and cooling, all these technologies that have made AI successful are really HPC technologies underneath. And so just wanted to add that. Yeah, really are. And Mark, to what extent are you seeing special to AI storage requirements?
Starting point is 00:34:08 Or is it just AI washing? No, I think it's a little more than AI washing. I think that there's storage and some data requirements too. The notion of how big the data sets need to be and how much data should be included in the training. Does the whole universe of data need to be included in some really precise model development and training for a really targeted environment.
Starting point is 00:34:27 But in terms of AI washing versus unique requirements relevant to AI, there's some notion of our files needed and some of the aspects needed, or our parallel file systems required in some aspect of it, or can you move more to object than file-based or even block-based across some of these elements within the data pipeline? There's still an awful lot of evaluation and learning
Starting point is 00:34:50 occurring to really decide what's being done. Now that is a question I ask a lot of the vendors that I talk to, yes, they're using all of the, and addressing the, trying to get on the AI bandwagon, but what are you doing explicitly, specifically, that's AI-oriented? And there's only a few that are really addressing the bandwagon, but you say, what are you doing explicitly specifically that AI oriented? And there's only a few that are really addressing specific AI versus really trying to fit a little bit more of the what has always been done, but focus it towards AI.
Starting point is 00:35:16 So just, there's probably a little bit, maybe more than some would care to admit AI washing, but I think as we're evolving and learning is occurring, there are more AI specific features and elements being integrated into the data platforms. And I'm intentionally up leveling the vocabulary of storage systems to a data platform to kind of account for that. Well, you'd mentioned that whole issue of the US, now France, Europe, We saw news out of South Korea building out these enormous AI data center infrastructures. Are there any interesting trends going on in the area of power and cooling? Because the compute demand is so far out running, the power capabilities that are available. Are there any interesting trends there?
Starting point is 00:36:01 Yes, Doug, there's been a lot of different changes, as I was mentioning. The growth rate of these centers are at a higher rate than what we saw previously and meaning expecting with those countries. And one thing that's always been talked about with AI is once AI more globally is successful and takes place, it's going to launch a tremendous competitive kind of battleground between different organizations. Of course, on the military side, there's been a lot of science fiction stories, but whether you're an academic research institution or any manufacturer, once everyone else or a few people adopt AI, once it's successful, then other folks will adopt it and there's going to be this
Starting point is 00:36:41 tremendous competitive battle. So the requirements of the hardware, we've all seen what that's done for Nvidia. On the power side, there's a lot of issues and we're hearing about a lot of nuclear power plants being re put online and then ones being built right now. We are aware that China is doing a nuclear power plant build out for their data centers that are extreme scale. So all these different countries are trying to address that. And the sustainability and the ability to do your work using less power is crucial to try to power optimize things. We had an interesting presentation at one of our recent
Starting point is 00:37:17 user forums where the presenter was showing how they profiled their workload and that it cost them a million dollars per megawatt just to install the power into their data center and even though their system could say do a hundred megawatts it never actually ran at a hundred megawatts and so what the presenter was suggesting is to profile your workload and maybe you only have to install two-thirds of the power that the peak power the system can draw because you're never exercising all the GPUs, all the CPUs, and all the memory at the same time. So that's an interesting one-off solution there. But I think to address this whole thing, it's going to take a whole host of solutions and at the same time an acceptance that the world's
Starting point is 00:38:00 power, a larger portion of it's going to be going into these large data centers. If I could also add a little bit too, I think it was a comment that I heard at SC24 was that everyone's either doing AI or cooling AI. And the notion of the acceptance and recognition of liquid cooling is certified and being adopted aggressively with especially with the new data center build out. But when you're looking at existing data centers and folks that are maybe trying to retrofit liquid cooling to existing data centers, they're hitting speed bumps and challenges and that's powered in within their data centers where they're initially developed and designed that they're still kind of only supporting the 10 to 14 kilowatt per rack.
Starting point is 00:38:38 And you know, they're just not designed to support the 150, 160 kilowatt per rack or even upwards of the 400 kilowatt or greater per rack that's on the horizon. There could be some muting of adoption of the liquid cooling and addressing of the high power demand, especially when considering leveraging the existing data centers. The one thing I'd like to add to that is,
Starting point is 00:38:58 I think there is a place for a power revolution in AI. The idea that maybe you don't need a 1.2 kilowatt chip to do training or to do inferencing and that if you strip the computational requirements down to their barest elements, you may be able to offer an AI-centric accelerator that does far less outside the realm of what you need to do for those particular mathematical calculations. And if you can offer that at a lower power level and perhaps at a lower price than some of the general purpose GPUs that we see on the market today, there may be some opportunity,
Starting point is 00:39:34 there may be some place for an organization to say, we can go backwards here. We don't need to extend this. And I liken it back to when we always judge microprocessor speeds by frequency. And then we got at some level and everyone said, this is crazy. Let's just change the paradigm. And I'm wondering if there is going to be some place for lower powered AI centric hardware out there that adds a bit of diversity, if you will, to the ultimate selection process. And unfortunately, we didn't, we don't have my colleague Tom on here, where he would pipe in because he lives in Ashburn, Virginia,
Starting point is 00:40:08 where 80% of the internet goes through their 113 different data centers. There has been a data center under construction in Ashburn for the last decade and a half. There is a not in my backyard mentality growing. Certainly in Ashburn, certainly in Loudoun County writ large, because there's talk about reopening three coal-fired power plants in West Virginia to supply energy to those data centers. And I think that there's going to be some significant consternation by a
Starting point is 00:40:37 lot of different neighborhoods, organizations and such, who basically are saying, enough, we're not getting, we don't want to be the place that has nuclear power plants to supply people so they can ask them at the best place to have lunch. And I think that that issue is going to again drive a reversal of fortune. I don't see two kilowatt chips. I don't see three kilowatt chips. Just the same way I didn't see six gigahertz microprocessors at some point. So I think there's some nuance here. That's going to be exciting to watch in the next few years. Maybe that's a segue into an area that we're going to have to bring you back, Bob, to discuss, and that's quantum computing, which I know you track intimately.
Starting point is 00:41:16 And as we were discussing at SC of the benefits that it provides, one of them is energy efficiency. How do you see that whole area coming? You alluded to that, that there's a lot of progress, but also some disappointment with the level of progress so far. How do you see that environment? The quick takeaway is I think,
Starting point is 00:41:35 and I'm pretty confident about the idea that within the next three years or so, we're going to start to see what the sector's calling utility class quantum systems, which basically means for a particular science and engineering job, most likely either optimization or computational chemistry, it's going to be more effective to do that on a quantum system than it is on a classical counterpart. And so I think we've seen enough roadmaps that have been in existence for a number of
Starting point is 00:42:03 years that certain milestones have been hit at a regular cadence, predictable one, that says that is going to be doable within the next three years or so. And so there's a lot of opportunity here, and you mentioned the power requirements, and whenever I tour a quantum computing facility, usually a manufacturing site, the first thing I do is run around the back of a machine machine and I look how thick the electrical cables going in are. And quantum systems are still running at the kilowatt range. This is three or four orders of magnets less than a computational traditional classical counterpart. And that scaling is absolutely wonderful.
Starting point is 00:42:38 It's not as if you double the qubit count, you double the power requirements. I actually saw a chip design a couple days ago. There'll be an interesting announcement the next few days. Unfortunately, I'm under some NDA stuff. But the bottom line is I saw a chip package that holds a very small number of qubits, but it's the same qubit package that will scale to 10 to the sixth kind of qubit capability
Starting point is 00:43:01 all within the same package going forward. So the power requirement, and I've talked to folks about doing the math. If you're running a kilowatt range versus a tens of megawatt power range, if you can offload some percentage of your workload to a quantum system, even 5% or 10%, just from an energy perspective, that machine pays for itself in a matter of months, even if it had an original $20 million price tag. So quantum is coming. There's a lot of interesting progress going forward. And I think one of the untapped benefits of quantum is the fact that it gets you out of this pernicious
Starting point is 00:43:37 cycle of more and more power to deliver more and more compute. Yeah, definitely. It's almost a reset From a computational perspective. Yeah. I don't know if the industry itself is really pushing on that yet, maybe because they're not tracking GPUs. There's a few companies that get it and I'm not going to say I've been haranguing them all, but there are, I'm starting to see roadmaps that actually list not only qubit counts and some of the other particulars that are unique to the quantum computing space, but also the idea
Starting point is 00:44:09 this is what our power consumption budgets are going to be in the next three or four years. So some organizations are really starting to wake up to that fact. Excellent. Now we talked about liquid. Do you see that happening across the board becoming, and maybe the real question is really immersion.
Starting point is 00:44:26 Is that going to happen? Because I think liquid is a pretty standard thing now. Liquid plus air that is. Yeah, I think that I mean, they're both being adopted a liquid cooling and more and more so than immersion cooling. It did have a recent conversation where there was originally thought of a higher transition rate or adoption rate to get to immersion cooling, that folks, even the vendors and the people really tracking that market have muted their expectation of that. And there's a number of reasons for it, but there's some, it is just kind of new.
Starting point is 00:44:59 It's been around for a while, but in the mainstream adoption enterprise type of space, and even some of the HPC space it's still relatively new but there's some notion and a hindrance of some of the highest performing parts are warranted differently or if at all for use in certain immersion fluid. So there's usage consideration there that may use adoption of immersion cooling but there are elements and we do see an uptick in adoption, but it might not be as good as it may have been or it would initially thought.
Starting point is 00:45:30 I heard there were considerations with decay and that sort of a thing with various fluids. There was one day I toured an all immersion cooling data center last year and what was remarkable was a deafening silence in the data center because there were no fans. That's right. You walked in and it was eerie, just in what you're used to experiencing
Starting point is 00:45:52 walking into an Oracle data center. And probably a lot of headroom. Absolutely. I had that same feeling, Mark, when I stood around Frontier XS scale system. I think, where's the noise? It's amazing. So one that I'd like to add to that too, is we are studying lots of different types of liquids too, for immersion cooling.
Starting point is 00:46:11 So the different issues, Shaheen, you mentioned some of the deterioration issues in that too, can vary whether it's oil-based, fluorine-art-based, and single-phase, multiple-phase changed cooling and that. So there are a lot of different options still out there that people are exploring. Yeah, exactly. I mean, they're all synthetic at this point. And I imagine you could formulate just the right mix so you can have the cake and eat it too.
Starting point is 00:46:34 So I'm like expecting that to be done any day now, right? We keep watching it. We're watching to see what happens. I know part of our discussion would focus on some of your more interesting or what you think are intriguing interesting predictions for either the rest of the year or the decade or both. Sure, I'd like to ask Bob to talk about quantum computing. There's been so much shall we say interest in the press and quotes by famous people about it could be 30 years away and
Starting point is 00:47:05 things like that. And Bob, maybe you can address what our view is on that. Yeah, sure. First of all, anything past five year predictions to me is science fiction. So I don't know how you can say 30 versus 29. And I do like the fact that right after Jensen came out with his pontification, Bill Gates came out with his. So famous people who are peripherally connected to the quantum space, who have more mindshare than others, are to be taken with a grain of salt. I like to look at what the experts are saying, not the marketing department, but the actual folks that are involved.
Starting point is 00:47:35 As I said earlier, the roadmaps are pretty reliable in terms of progress in quantum. The big thing here is application, or more accurately algorithm development. There's still a limited span of use cases for quantum. And to me, the big, not surprise, but a development that I'd hoped for, but didn't expect to see this soon, is a growing enthusiasm towards on-prem procurements of quantum.
Starting point is 00:48:02 Moving away from cloud access models and bringing quantum computing into your existing, generally large, classical HPC ecosystem and starting to think about how I can integrate that quantum system into my workloads. And that's where I think we're going to start to see some amazing innovations going forward. The idea where I now have an application
Starting point is 00:48:24 that's difficult to deal with, but I let the quantum system do what quantum does best, and I let the classic side of the house do what it does best, and I have these wonderfully interesting, innovative, hybrid applications. For the longest time, the quantum application algorithm realm has been dominated primarily by quantum people.
Starting point is 00:48:44 And I think that unleashing the vast hordes of really smart classical programmers, mathematicians, organizations that develop algorithms and applications is going to offer up kind of a new frontier of hybrid applications. And I think that's really what's going to drive or propel forward quantum adoption, quantum acceptance, and ultimately the effectiveness and usefulness of quantum across the entire.
Starting point is 00:49:10 I'm not saying it's going to be used for everything. No, you're never going to check your email on your quantum processor. But I think that once it starts to get into the hands of end users who say, I need to do this and what can the quantum system down the hall help me do, we're going to see a new birth, if you will, of insight and innovation. And that is, I think, part of what's going to drive the on-prem emphasis, bringing the system in-house and tying it directly to your existing computational workload and letting your programmers and your scientists and your software developers have 24
Starting point is 00:49:45 7 access to that system to rest out the computational performance capabilities that's a little harder to get with indirect access through the cloud access model which is what dominates today. So pretty positive about what's going forward but I guess the surprise there was the interest the growing interest in on-prem capabilities versus just maintaining cloud access, which is predominantly a great model for pay as you go, exploration, low barriers to entry for exploration, ease of switching vendors to kick around, try out different modalities and such to committing to a particular modality, a particular vendor to optimize for your particular workloads. Brilliant.
Starting point is 00:50:24 Right on. Doug, I would like to add another prediction that we have that we're expecting to see is some tremendous growth in budgets for the different machines, supercomputers and exascale and so on around the world. From our recent surveys on average, people are expecting to increase their budget quite a bit, but there's really been a shift. We're talking now about $500 million systems as being quote unquote common, at least there's three or four of them.
Starting point is 00:50:50 But it is more mainstream level. Folks are planning to go from a 500,000 ish spin rate to something closer to 750,000, which when you go across the market is just tremendous growth there. But then the amazing part is there are a number of people talking about $100 billion expenditures, Microsoft, OpenAI and the US government. So these numbers are just staggering as far as the growth
Starting point is 00:51:12 rate compared to just three years ago for a large system. And we think that's going to continue as a trend. I'll also add another area that we're looking at and Bob hinted maybe even more than hinted at it when he indicated that trying to find the right application of AI in science and engineering where, you know, 85% accuracy doesn't cut it with, you know, the most recent hallucination there where AI suggested that, yeah, water only freezes at 32 degrees
Starting point is 00:51:37 and it won't freeze at 27 degrees. But it also reminds me of, I go back to War Games and the Whopper computer and the human missile silos. That's right. You couldn't unplug it either. Yeah, but we're thinking that entity will come back and inserting human oversight and into the AI process to really better tune and optimize and manage and minimize.
Starting point is 00:51:59 It can make it more trustworthy that as we bring the human element back into this process and not the full blind trust into what AI is doing, especially in the critical areas of how some of this traditional HPC design, we want more than 85% accuracy with the design and development of a new airplane link. Could I ask Mark, I have a sort of a pet interest in interconnects, optical I.O. Talked with other folks who think this could be the year for optical I.O. becoming more commercially viable. What's your sense on that? There's certainly a lot of interest
Starting point is 00:52:33 in investment going on in optics. I don't think we're hitting a mainstream adoption for that yet. There's certainly some interesting things being done down at the both on chip optics and on board optics areas, some standardization with the interconnect elements down within the chiplet interface kind of areas. But there's still, I think, some challenges, especially in manufacturing some of the elements
Starting point is 00:52:56 with whether it's in the lasers and all that manufacturing the various optical elements at scale to be able to support a broad mainstream adoption. So while the technology is there and some proof of concepts are really promising, broader widespread production and manufacturing, I think, is going to mute some of the growth and adoption of that. Okay. Gene and I could go on all day with you guys. We certainly could. It's such a treat. Thank you. Oh, thank you so much. Thank you. Thank you. Catch it up with you guys. Any comments or thoughts that we haven't covered that you'd like to do?
Starting point is 00:53:27 I just want to thank you for the opportunity to have the conversation. Thank you very much. We look forward to staying in touch. A lot of activities, as you mentioned, GTC around the corner, Quantum Days left and right, storage continues to be the complicated thing that it has been. So it's a delight to be able to catch up and I look forward to us doing that again soon. Thanks so much guys. All right.
Starting point is 00:53:48 Thank you. Bye-bye. Thank you. That's it for this episode of the At HPC podcast. Every episode is featured on insidehpc.com and posted on orionx.net. Use the comment section or tweet us with any questions or to propose topics of discussion.
Starting point is 00:54:04 If you like the show, rate and review it on Apple Podcasts or wherever you listen. The At HPC Podcast is a production of OrionX in association with Inside HPC. Thank you for listening.
