@HPC Podcast Archives - OrionX.net - @HPCpodcast-107: Paul Bloch of DDN on AI Storage – Industry View
Episode Date: January 23, 2026
Our special guest today is Paul Bloch, President & Co-founder of DDN, the high performance storage and intelligent data platform company. AI runs on massive amounts of fast and reliable data, which makes topics related to storage systems especially important. We discuss a broad range: technical optimizations for AI storage, DPUs and future directions, alignment with streaming, HPC, and accelerated computing, pilot-to-production and training-to-inference technical and operational challenges, sovereign AI and data sovereignty, and more. Join us!
Audio: https://orionx.net/wp-content/uploads/2026/01/107@HPCpodcast_IV_DDN_Paul-Bloch_AI-Storage.mp3
Transcript
We're getting into real-time video, we're getting into changes, we're getting into multi-modal.
So training is still evolving.
However, now inferencing is becoming more and more important because those models are efficient.
And what's super important is to build redundancies across those, right?
So when they hit, either they do not impact the full production,
and second, that you basically get back to full production as fast as possible.
But at this point, you need the AI system to perform 100% of the time.
From OrionX in association with InsideHPC, this is the AtHPC podcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies
and the applications, markets, and policies that shape them.
Thank you for being with us.
Hi, everyone.
Welcome to the AtHPC podcast.
I'm Doug Black at InsideHPC.
and with me is our co-host, Shaheen Khan of OrionX.net.
Today with us is our special guest Paul Bloch.
He is president of DDN,
the High Performance Storage and Intelligent Data Platform Company.
And we're here to discuss issues around storage,
data management, and data delivery in the AI at scale era.
So, Paul, welcome.
Thank you very much.
Thank you, Doug.
Thank you, Shane.
Nice to meet you guys.
Really looking forward to this. Thanks for making the time. Okay, great. So, as I said, this episode explores how
HPC has evolved into the operational foundation for AI at scale, and also why data as opposed to
compute is the primary AI constraint and how AI is shifting into an operational level. Paul,
let me state it this way. We've talked for years about faster compute driving AI.
From your vantage point, what's actually limiting AI systems today?
Well, I mean, it's really not a question of limitation, right? It's what you need to operate
AI at scale, right? So clearly the GPUs are improving at great pace, right? I mean, you saw
Nvidia's announcement from Blackwell to Vera Rubin. I mean, actually, so GPUs are becoming
more and more powerful. The networks are becoming faster and faster and getting low latencies.
And so what you really need is you need a data platform,
a data intelligence and data storage platform
that can really not just keep up,
but really enable the GPUs and the network
to perform at full scale.
And you really need an end-to-end system
to be fully tuned, optimized,
to be able to get the reward and your TCO.
So it's not so much a limitation
as that it's really not trivial to get those very large systems
to perform at massive scale.
And we're seeing it actually with the actual deployments:
depending on the customer,
some of them are extremely successful
because they're basically using the right ingredients,
and some of them are really not efficient.
And so at the end of the day, the GPUs
are not performing the way they should.
So it's really a notion of that's where the HPC,
if you want, background comes in.
I mean, we're seeing more and more that there's
kind of a disconnect between HPC,
what we call HPC people and AI people, because really AI people don't always understand
potentially the infrastructure requirements and that it's really a must and enabling, and whereby
the HPC guys have had the experience for many, many, many years to deploy and get those
systems at full scale. So right now we're seeing actually a lot of the HPC experts are being
hired by the AI natives or by the AI companies to actually help them operate their
systems. We want to really get into that HPC-AI theme in just a moment with Shaheen. But if we could,
talk about the whole notion of data starvation, how that leaves these very expensive GPUs underutilized.
Sure. Obviously, the first level is you're going to invest and use either in the cloud, on-prem,
or on the edge, hundreds, thousands, tens of thousands, hundreds of thousands of GPUs. And what's absolutely
important is to get them in parallel fully operating at 99%. And so that's one of the challenges,
right? So the first challenge is really being able to have any and all of these GPUs in full speed
real time access the data intelligence platform, the data storage, and having that data storage
being able to respond full speed, real time at very low latency. And that's what DDN does extremely well.
and that's really kind of our historic past with HPC,
as well as our 10-year, 12-year relationship with Nvidia
of optimizing end-to-end systems.
And I think also, quite frankly,
Nvidia deploys a lot of our DDN systems
all over the data centers and their cloud.
So that's the first system,
is really the fact of having a parallel access,
not like, you know, what you would say,
like NFS limitations, right?
A lot of the other players claim they can do AI systems
at massive scale,
but really they just can't,
with NFS limitations.
And even if they've improved, they just can't.
The second issue then becomes data intelligence.
Access to data, access to massive amount of data,
access to metadata, access to orchestration,
access to data validation.
And this is all basically what has been added to our platform
to be able to basically build an end-to-end system alongside Nvidia.
Very quick question.
Do you have any sort of numbers on GPU underutilization
as a percentage of total capability?
Sure.
It's actually quite interesting.
We come in and we have tools, actually.
Some of the customers call us, you know, we get into discussions, and they might be using
either historic enterprise-class storage because that's what they had available, or maybe
some of the newcomers.
And we come in and really look at what the efficiency at the level of the system is.
And you'd be surprised.
I mean, we've seen some of the AI natives or the large cloud platforms, and we're seeing, you know,
10, 20, 30% efficiency at the level of the GPUs.
And very often, these companies don't even know
that they're basically running at sub-efficiency.
So we have tools, basically, and have been working very closely
with customers and future customers and prospects
to show them really how to optimize, how to build,
and what are the best practices
to basically deliver with DDN full efficiency on the GPU.
It's extremely important for the GPUs to operate fully
because that's actually where the core of the cost of the infrastructure comes in.
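As a rough illustration of the kind of utilization check Paul describes, here is a minimal sketch using NVIDIA's NVML Python bindings (pynvml). It is illustrative only, not DDN's tooling; the sampling window and the 50% threshold are made-up assumptions.

```python
# Hypothetical sketch: sample GPU utilization to spot GPUs that may be
# data-starved. Uses NVIDIA's NVML bindings (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

samples = {i: [] for i in range(len(handles))}
for _ in range(60):                       # sample once per second for a minute
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        samples[i].append(util.gpu)       # percent of time the GPU was busy
    time.sleep(1.0)

for i, vals in samples.items():
    avg = sum(vals) / len(vals)
    flag = "  <-- possibly data-starved" if avg < 50 else ""
    print(f"GPU {i}: average utilization {avg:.0f}%{flag}")

pynvml.nvmlShutdown()
```

A sustained low average here does not by itself prove the storage is the bottleneck, but it is the kind of first signal that prompts the deeper end-to-end analysis Paul mentions.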
I love hearing what you're saying, Paul, because as many people know,
my personal view is that AI is a subset of HPC,
and it's the big killer app that has grown to be huge.
I also remember people would say that a supercomputer is a system
that turns a CPU-bound problem into an I/O-bound problem,
and here you are to really address the I/O-bound aspect of it.
It's interesting because we're really part
of both worlds, right? We've lived the HPC world for 20-plus years. We've lived the AI world,
I would say, for 10, 12 years, but really with a massive acceleration over the past three years,
right, with the advent of ChatGPT. And it's kind of funny because we've seen different reactions
from the HPC community and from the AI community. So the HPC community was kind of, you know,
not laughing, but was kind of smirking about the AI guys coming in, right, at first. And I think
they were saying, well, we know all this and we know how to do it,
everything. And the AI guys were basically coming in, well, you see, with new models and new training and
new technologies. And I think at this point, there's kind of a convergence, right? The HPC community
is realizing that actually there is value in AI. They need to evolve their systems into more
closer to AI systems, right, delivering HPC as well. And then the AI guys, in the meantime,
are figuring out that, yeah, you do need the right data infrastructure guys and the right
HPC knowledge to basically get those AI systems to perform at 99, 100%.
So it's kind of been interesting to see the reactions, right, of the communities.
And I think that what's great is that actually at this point, they're converging,
and people are actually getting on the same line.
Yeah, exactly.
I think there's definitely algorithmic and mixed precision convergence.
But in terms of infrastructure and skill set and the best practices to build it,
I think a lot of those are very, very consistent with what HPC has been doing.
And of course, like you said, you have a very long history in HPC.
The other thing I wanted to raise was streaming because I also remember you were in satellite data processing,
Hollywood video processing, data processing, low latency, let's not drop a packet, high reliability.
All of that now ends up being pretty central to what AI needs to do.
Is that true?
Absolutely.
So if you look at the history of DDN, right?
I mean, 25 years ago, we literally developed a groundbreaking system
that would be a parallel access with our own ASIC, our own memory.
We delivered a system where no one in the world was really looking at anything like streaming,
satellite download, and so forth, right?
So that system became instantly a success in HPC.
It enabled us to deliver probably 90 out of the 100 largest HPC systems successfully
across many, many, many, many years, right, at massive scale.
And so what that gave us is basically the experience
and the knowledge of how to run all these massive systems at 100%, pre-AI, right?
So if you look at it, call it luck, call it vision, call it whatever you want,
but we really got the experience of, you know, kind of a 15-year head start
over any other company basically delivering AI systems.
And this is why DDN nowadays is very valued by our customers
because we come in and we deploy massive systems at, you know, groundbreaking performance
and we don't blink an eye.
I mean, it's basically just installed and delivered,
and basically people can go about the business
and basically getting results from the GPUs and the network
and from the application and the models and so forth.
very quickly. So clearly this helped us, and the value is still there today, right? Like you mentioned,
streaming, any-to-any parallel access, low latency. We do not drop a frame. We actually have
the ability to really maintain the quality of service, online upgrades. I mean, obviously all the things
that you would expect, manageability, planning of workloads. So all this is implemented in the
platforms today. Paul, let's talk about customer issues, the prevalent customer issues you run into.
Talk about the transition from pilot AI projects to operational. And what are the key problems,
whether it's compute, networking, or data? I mean, the problems, obviously, you know,
first you have to build the data center, right? Then you need clean power. You need power generation,
you need generators. So the power needs to be clean as well, right? We're seeing a lot of issues,
potentially with data centers, losing power.
And when they lose power and the generators don't get in place,
basically you have those massive systems, basically, coming down.
And it's very difficult to put them back up, right?
So clearly, cleanliness of power, having the right generators, is number one.
Then, you know, clearly you have the GPUs, the network, and the data storage portion.
On the GPU side, GPUs tend to fail, right?
GPUs will go out.
And that's where checkpointing and the fact that DDN enables very fast, either synchronous
or asynchronous checkpointing helps a lot because you literally, you know, if you lose a few
GPUs, you can really just go back to your latest checkpoint, right?
Think about your save in Microsoft Word or Excel, right?
You can really clearly just instantly checkpoint, right?
That's one of the massive advantages.
I mean, one of the advantages DDN has is basically that checkpointing
that enables you to go right back.
You have 10,000, 100,000 GPUs
on a simulation that lasts two or three weeks
or a month for a new model,
and you certainly don't want to lose any work
when you lose a GPU.
So that DDN enables extremely well.
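For readers who want a concrete picture of the checkpointing pattern Paul describes, here is a minimal, hypothetical PyTorch-style sketch of periodic checkpoints written to a fast shared filesystem. The mount path and naming scheme are assumptions for illustration; this is not DDN's implementation.

```python
# Hypothetical sketch of periodic training checkpoints to a fast shared
# filesystem, so a GPU failure only costs the work since the last save.
import os
import torch

CKPT_DIR = "/mnt/fast_parallel_fs/checkpoints"   # hypothetical shared mount
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, step):
    # Write to a temp file first, then rename atomically, so a crash
    # mid-write never corrupts the latest usable checkpoint.
    tmp = os.path.join(CKPT_DIR, f"step_{step:09d}.pt.tmp")
    final = os.path.join(CKPT_DIR, f"step_{step:09d}.pt")
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, final)

def restore_latest(model, optimizer):
    # Zero-padded step numbers make lexicographic sort give the newest file.
    ckpts = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pt"))
    if not ckpts:
        return 0
    state = torch.load(os.path.join(CKPT_DIR, ckpts[-1]))
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]

# In the training loop: call save_checkpoint every N steps; after a GPU
# failure, resume from restore_latest() instead of restarting the run.
```

The faster the storage can absorb these writes (synchronously or asynchronously), the more often you can checkpoint and the less work a failure costs, which is the point Paul is making.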
The other issues become network issues, right?
You have network latencies,
you have network disappearing, you have ports,
and the same way we've been working
very closely with Nvidia
to be able to diagnose this very
quickly. Every issue gets pointed at the data storage or the data intelligence platform, and clearly what
we're finding is that about 70, 75% of the time, it is a network potential issue. So it's extremely
important to have the tools, which DDN has, to troubleshoot, diagnose, and then automatically
basically resolve these and get the system back to optimal shape. Then on the data storage, quite
frankly, it's about efficiency, it's about reliability. It's about,
basically predictability.
And on those, I think to achieve all this,
it's really not trivial at all.
A lot of people might be using NFS-based systems.
They think they're doing well.
But your analogy is like using a bicycle
instead of a Harley-Davidson, right?
You're happy with using a bicycle
if you've never done anything different.
However, if you go on a motorbike,
you'll enjoy it much more, and it's much faster.
So again, the same way.
I think that when people get on the DDN platform
running AI and HPC, they really stay with us for many, many, many years.
And this is why DDN has been basically a very successful enterprise for the past 25 years.
And we anticipate remaining the AI leader in data storage and intelligence.
Let's take a look at the AI landscape, since that's really where the party is.
There are reports of tens and tens of chip vendors that are addressing different parts of AI.
Many of them build their own systems now.
Every one of them really needs to bring compute, data, network, memory, the whole thing together.
What is your perspective on that?
I mean, Nvidia continues to be the 800-pound gorilla.
How do you navigate all of that?
How do you make sure you don't get blindsided by something?
I mean, you're going to get blindsided by something anyway, right?
Even if you prepare for it.
But, I mean, that's what makes life exciting.
That's where we're still working at it, right?
That's where my partner and I and our team are still excited every day of the week.
It's not just by the fact that we're, you know, obviously successful.
It's really what we're driving, the passion, and really understanding and enabling the industry that drives us.
So when you look at that, I mean, clearly, Nvidia, like you said, 800-pound gorilla, no doubt.
But you have other players.
You have new players for inferencing.
Look at what Nvidia just did with Grok, right?
Basically enabling and basically going to integrate the Grok inferencing model,
because clearly Jensen recognized that they had some value in the space.
So I think this is an evolving space.
Nobody has the right full answer.
And I think as we move forward, because this is still early days in the AI era,
you will have newcomers, you will have new chips, you will have new inferencing models,
you'll have new algorithms and so forth,
and it will basically get implemented into the market.
So it's exciting.
It's a very rich ecosystem.
We are working with a lot of partners.
Our partner ecosystem is becoming wider and wider.
So it's difficult on us and resources,
but we are doing it because you don't have a choice, right?
The same way you need to be working with the various orchestration platforms,
the same way you need to be optimizing for models,
the same way you need to be optimizing for different networks.
So, I mean, it's part of our heritage, right?
the fact that we've been an open system for 20, 25 years.
I mean, if you look at it when we started, it was a heck of a lot more difficult.
All the systems were closed, and we had to literally force DDN in on an IBM system
or a VAX system, whatever, that was completely closed, and we were able to deliver value there.
So today, the world is much more open, so we are thriving.
And the experience that we've had of delivering systems, no matter what operating system you use,
matter what chips you use, no matter what network you use, is extremely useful as well. We're very
flexible in that space. Right on. Now, there is definitely the whole training to inference part of
AI, and the industry seems to be shifting more and more towards inference as it gets used and it becomes
part of business processes. But there's also kind of pilot through production. Would you speak to
what you're observing in the market as customers transition from training, or maybe they don't even
do training, they're into inference, and how they go from pilot to production?
Sure. If you look at the market, right, the market is still driven by probably about 20 to 30
companies, right? The companies that are driving most of the volumes are the AI natives, the cloud
providers, the model makers, those are still training. The training has been going on and it's still
growing. I mean, if you look at it or even at Jensen's presentation today, right, the models are
basically growing 10x in parameters every year. And, you know, people are moving to millions of
GPUs where we thought that two years ago, 100,000 GPUs would be the goal, right?
At this point, you know, people are talking about 1 million, 2 million, 3 million GPUs. So,
clearly the training remains and will remain probably for, you know, the next two, three years.
It will still require more and more GPUs because we're still at the infancy of training.
If you look at the data, we're getting into real-time video, we're getting into changes,
we're getting into multimodal.
So training is still evolving.
However, now inferencing is becoming more and more important because those models are efficient, right?
And so inferencing is coming in and permeating in all the various industries, fine-tuning, applications.
And so inferencing is going to become much wider because it applies,
and it then permeates into the enterprise.
And the enterprise has barely started.
Right now, it's in an infancy.
People are using Gen AI.
They're using ChatGPT.
But they haven't really yet, you know... I mean, some of them are moving, right?
We're seeing pharma.
We're seeing some of the medical guys that are moving into production from the pilots,
absolutely.
But it's still in the infancy.
Edge, sensors, everything still remains to be created over the next three,
five, ten years.
It's going to be incredible.
Well, you all use the term static storage architectures in the face of continuous AI pipelines.
How would you describe static versus dynamic storage?
Well, I mean, you have to be dynamic at this point, right?
So when you look at dynamic, you really need a data platform that addresses any and all the problems and requirements, right?
It's no longer good enough to solve maybe just metadata, or just enterprise data,
without solving mirroring or data movement; you need to be able to do it all.
You need to be able to have clearly a cost-efficient way of storing petabytes, tens of petabytes,
hundreds of petabytes, or even exabytes.
We're seeing nowadays exabytes is becoming the new petabyte for companies.
So an exabyte is no longer that massive.
People are deploying five, ten, twenty exabytes.
So there's an explosion of data.
So you need a cost-efficient way, number one.
and power-efficient, with the density to be able to deploy those systems.
Then the system needs to be able to match and enable the GPUs.
So you have a notion of performance, bandwidth, I/O
to those systems.
Then you have a notion of intelligence.
You need to be able to retrieve, right?
When you want to do a simulation, for example,
we are about 500 times faster than anybody else,
thanks to our metadata search engines,
at figuring out which of the objects you want
to look for for a simulation. Because this is daunting. You have billions and billions and tens of
billions of files. What are you going to search through? So you need to be able to do this.
Then you have intelligence, KV cache coming in, you have the orchestration, you have the data
movement. You potentially also now, with the latest commodity scarcity and pricing issues,
need to look at how you manage your data across different types of devices.
Everybody was rushing to get SSDs or memory, and at this point, you know, the costs are escalating tremendously.
So potentially hybrid infrastructures that mix SSDs and hard drives are becoming super important for those customers,
because otherwise they're going to break their budgets.
You mentioned key-value cache and some of the other technical challenges that you've addressed.
What are some of the more recent challenges?
Like you mentioned, you know, simplicity, reliability, serviceability, throughput,
but also scalability, hybrid NVMe SSD, like special optimizations for KV cache or for prefill or decode or
transformers.
Sure.
As you pile on and you have a kind of a laundry list of all these things, then some serious
optimization emerges because there's a cumulative impact of all of these.
What are some of the highlights you can point to that have been challenging in recent times?
So clearly, yes.
So on the training side, I think we've pretty much resolved most of the issues of running really, you know, thousands, tens of thousands, hundreds of thousands of systems against our system when you do training and those simulations.
I think this has been worked out and those are really up and running and really stable and so forth.
When you're talking about inferencing, right, we're still in the early days.
And KV cache becomes super important because think about millions of people doing prompts, right, doing prompt engineering.
And the idea of KV cache is that so far, when you do a search, you're literally holding the GPU.
That means you have your first search.
You know, the system knows very well that you're going to improve your prompt.
You have to go back and make it better or add on some details.
So it basically holds right now in the GPU and GPU memory that data.
And this is extremely costly.
So what you want to do, where KV cache comes in, is that you're really using DDN as an intermediary
storage layer that basically is extremely fast and low latency.
So really you have a connection between the GPU and us, whereby your prompt gets stored and moved
from GPU to DDN at a much lower cost and TCO, so you release the GPUs,
but you still have your data and you still have your prompt getting ready to basically
be prefilled and re-decoded after the fact.
And so when you do this by millions or tens of millions of requests, you know, KV
cache becomes extremely important because otherwise you won't be able to do it with all the GPUs
you have. You're basically keeping them busy for no reason. Does that make sense?
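As a toy illustration of the KV-cache offload pattern Paul outlines, the sketch below moves a session's attention key/value tensors off the GPU to an external storage tier between prompt turns and reloads them when the user continues. The storage path and function names are hypothetical, and this is not DDN's Infinia API.

```python
# Toy sketch of KV-cache offload between prompt turns: move the key/value
# tensors off the GPU to a fast external tier, free GPU memory, then reload
# when the user continues. Path and names are made up for illustration.
import os
import torch

KV_STORE = "/mnt/fast_kv_tier"           # hypothetical low-latency storage mount

def offload_kv(session_id, past_key_values):
    """Persist a session's KV cache and release GPU memory."""
    cpu_kv = [(k.to("cpu"), v.to("cpu")) for k, v in past_key_values]
    torch.save(cpu_kv, os.path.join(KV_STORE, f"{session_id}.pt"))
    del past_key_values                   # drop GPU references
    torch.cuda.empty_cache()              # the GPU is now free for other requests

def reload_kv(session_id, device="cuda"):
    """Bring the KV cache back for the next prompt turn, instead of
    re-running a full prefill over the earlier context."""
    cpu_kv = torch.load(os.path.join(KV_STORE, f"{session_id}.pt"))
    return [(k.to(device), v.to(device)) for k, v in cpu_kv]
```

Whether this trade is worth it depends on how fast the external tier is relative to recomputing the prefill, which is why Paul stresses low latency for that intermediary layer.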
Yes, absolutely. So the other thing that complicates everything that we all do is the emergence
of new technologies, new miniaturization, when you look at networking and storage and the emergence
of DPUs as a major pivot point there. What does that mean in terms of the system architecture,
the hardware architecture that you have to pursue on a roadmap?
What does that do with the roadmap?
So actually in our case, right?
I mean, DDN is really software, right?
Historically, we've delivered complete systems
because customers like one throat to choke, right?
You basically have a one point of support
and you have someone responsible for the quality
and the efficiency of the full system
and the performance of the full system.
But really, you know, 97% of what we do in IP and R&D
is software.
And so our software actually, our latest software,
called Infinia, is really a groundbreaking platform that is built for the AI time.
So you can load it, you know, you can actually load it on your PC or your Mac.
You can load it on a GPU.
So you can literally kind of run that software, call it inferencing software,
that basically can run on a GPU on the sensor and at that point validate data and move it
to your core to be able to basically extract the value of what data you really need.
So think about DDN as really kind of an evolution from complete systems, which we will continue
delivering, but really to a software play that can load on DPUs, on memory, close to a GPU,
no matter what the application might be.
And so think about robotics, humanoids.
We think that there's a great future for DDN with Nvidia and other robotics companies
to basically deliver software value to manage and be able
to basically drive those humanoids.
Related to that really is what is a system anymore?
Because if it becomes more and more modularized and there's software glue that brings it all together,
doesn't that impact the definition of what a system is?
Well, I mean, you know, the system is whatever you define it, right?
I mean, historically you think of a server, you think of network, you think of all that,
and that's going to change, right?
You're going to have basically, it's going to be all about some type of a GPU or intelligent or DPU unit,
but at that point the software can run very close to that GPU and DPU
that's going to be super important, especially on the edge.
So the software and the intelligence software is going to become more and more crucial
because not only do you need access to data,
you need to be able to ingest it at the right performance,
you need to be able to validate it,
you need to be able to have some intelligence there
to keep whatever data is going to make sense or not.
Paul, explain for our listeners the issue around sovereign AI,
and the notion that sovereign AI is emerging as the default operating assumption.
Right now there's a battle for intelligence.
You see it, you hear it in the press, you hear of the battle between the US and China and other countries.
And really it's a notion of creating intelligence, right?
So the AI factories, the sovereign AI factories, it's a notion that each of those countries is going to need to build their own intelligence and sovereign AI factories for their services,
for their R&D, for their universities, for their people, for their IP,
for delivering easier systems to their citizens, right?
Like getting a real-time passport or just a passport on your phone
or getting, you know, just walking through the airports at this point
and not having to speak or see anyone and then just being cleared directly.
I mean, there are so many support functions and services that can be improved.
But sovereign AI becomes something where you control your destiny,
you control your IP, your data, your intelligence,
and at the end of the day, it's all about data and your creation.
So it's going to be super important for each and every country,
which we are seeing, to build up their own AI value.
And it doesn't mean, you know, really...
And so people say, yeah, but you're still buying Nvidia,
which is an American company.
But that's irrelevant.
I mean, it's really because those are the tools;
what you're creating on top of it,
you own that, right?
So it's super important that, you know,
not every country is going to have the technology.
They need to source the technology,
install it in country,
and control basically what's coming in
and keep it within country.
So I wanted to turn the conversation to the future.
We've seen really the emergence of AI,
not just as a user of storage,
but as an enabler of future storage.
We've talked about semantic storage,
computational storage.
What does that do to the future of the data management aspect of AI, whether it's hardware or software?
It's what we said: to be efficient in AI over time, you're going to need to have something close to an operating system for data.
And this is what DDN has been building, which is really a notion of addressing on-prem, cloud, edge, as well as different types of computation, I mean, either in-chip computation or close to the network.
I mean, I think at this point, what you will need is, ideally, a solution
that encompasses all the various use cases in a very simple way, and that can scale easily as well.
So this is software-based, it loads, and basically it's very flexible depending on
chipsets or networks or DPUs or whatever is going to come up, and this is what we've been working
on and deploying.
So, Paul, talking at scale in AI factories, the part of the market that you're really addressing,
typically where do things break down first, you know, compute, network, or data? Where do you see
the most common problems that your customers are up against?
The problems are going to come from pretty much every component of the system, right,
from power, from GPUs, from network, or from data storage.
And what's super important is to build redundancies across those, right?
So when they hit, either they do not impact the full production,
and then second, that you basically get back to full production as fast as possible.
So there's going to be issues, right?
You're talking about... we spoke about potential power issues, you know, generators,
having to reboot systems.
You want to avoid all this.
GPUs will fail, but then we have mechanisms like checkpointing and ways around this
so that you do not lose your simulation that you've been working on for weeks or months.
And then network, you know, network is probably where the core of the issues comes from, right?
There's an inherent instability from time to time when you push the network, you get some issues.
And so people point to the storage, but then it's not the storage.
So it's really crucial for whoever runs these to have the data storage, and the data storage people,
be able to identify when the issues are storage-based, data-storage or data-intelligence based,
or when they're network-based, right?
And when we look at most of the systems,
most of the issues come from the network,
but you still need to resolve them.
And that relies upon then basically the data storage guys
to say, yeah, this is the issue, this is the trace,
this is where it's at, and this is how to resolve it.
So we're working very closely.
We've got automated tools to be able to do this,
because at the end of the day,
the big difference between AI systems and HPC
is that AI systems cannot go down, right?
They need to operate 100% of the time, no matter what.
And these are basically things that we've had to work on in extreme detail over the past three, five years.
Because, you know, when you talk about the old times of HPC, you know, those people would understand why a system would potentially go down or what the issues were.
In the age of AI, you know, nobody cares.
People want online system 100% of the time.
If you have issues with SSDs or servers or controllers or whatever, you have to resolve it in the background while the system is performing.
So this is obviously a pretty big challenge and this is the one that we're tackling pretty efficiently.
That's excellent.
I've heard you use the word operationalization of AI.
And that sounds like what causes these requirements to appear in a big way and AI becoming more of a mission critical thing rather than an app that you're running in the background.
No, absolutely. Think about it. I mean, it's basically quality of service. It's like the old days of your phone,
when you'd call and it would disconnect. Whereas at this point, you need the AI system to perform 100% of the time.
You know, it's becoming as crucial as a banking system or more crucial.
When the access to AI, especially when it's going to permeate digital transformation or industry or omniverse or your plant or your, you know, medical systems and everything, it needs to be 100%
operational. And so that's really what we've been aiming at. If you look at the efforts this year,
we are gearing towards that 100% uptime. That's really what we hear from customers, right? They
want uptime full on. They don't want to sustain any type of downtime, for upgrades or whatever.
It needs to be uptime full on. Excellent. Doug, what have you not asked? Or Paul, what have we not
asked? No, I mean, again, it's an exciting time, right? I think that it is
challenging. I think people are super stressed because there's so much happening. Think about an industry
which is developing in, you know, 20, 50, 100,000 different ways, and you have to be basically up on all
these issues, all these challenges, all these opportunities. And I think it is an interesting time
because everybody's working probably two times or four times more than they were before
and still not believing they're achieving a full result because there's so much to be done.
Right, so I think there's kind of a new wave of, kind of, a feeling. I've been talking to
people, right, and they feel that the more they work, they're still not fully satisfied,
because they cannot complete all the tasks they have. And I think you need to be able to kind of
evolve and adapt and do the max you can. But at the end of the day, you're not going to be able to do
everything, so you need to focus on what's super important and what you're going to be able to do
and add the most value.
But the AI era is going to profoundly change the world,
I mean, the world, the way we work,
the way people interact, everything.
And, you know, you have to, I think at this point,
you have to embrace and basically learn the tools.
The same way, you know, when people ask me,
well, is AI going to replace my job?
I said, no, if you basically embrace AI
and actually enhance what you're doing
and provide more efficiency,
you basically have job security.
So it's a notion of really,
kind of a pretty, pretty, I think, exciting time, but at the same time challenging. And so people
are going to have to evolve and accept that AI is here to stay. Paul, I had one just as an aside.
I was talking with Alex at SC in November, talking about this whole notion that you all have been
in HPC storage for all these decades. And then as AI really emerged, it was apparent that this would
be the next wave, even if it is, as Shaheen likes to say, AI is an HPC workload, but that you are well
positioned to take full advantage or to leverage or to ride this next wave. And he said the key
moment that he was convinced that you all were in the right spot was when ChatGPT came out three
years ago in November. But I'm curious if that news, that explosion of LLMs and generative AI,
if that was a difficult transition for DDN.
Yeah, not at all.
I mean, if you look at it, I mean, I will tell you,
the one thing we saw pretty early is the alignment with Nvidia and Jensen, right?
We did this about 10, 12 years ago, and I think that was kind of the paradigm shift, right?
We basically aligned our engineering, we aligned our understanding our vision
into understanding what Nvidia was doing and really partnering with them,
Nvidia started using DDN systems across all of their internal systems and cloud, right?
And so that created a lot of engineering value, a lot of learning.
It was not about marketing or selling.
It was really about moving the industry and delivering end-to-end system that would deliver AI.
So clearly, we had been working for probably seven, eight years when ChatGPT came up, right?
So we have built a lot of large systems alongside with Nvidia or Nvidia customers.
We were the first ones delivering 100, 200 SuperPODs with Nvidia, as the only data
storage platform there.
And so really we were ready for the ChatGPT moment.
What ChatGPT did for us is really not so much challenge us,
but just basically delivered many, many, many, many opportunities across the world, right?
And delivered more business outcomes.
But, I mean, no, we're ready.
I mean, DDN has been ready for the past, you know, whatever, 10 years.
You know, we learned for the first 15 years,
how to deliver very large systems at full performance, at full efficiency,
at the best price points.
And basically with AI, we're now clearly optimizing for AI,
but we're built for that.
And so we know about it.
That'd be my take, too, Doug, that the market came to DDN
rather than the other way around.
It's sort of like suddenly everybody wants what we've been doing for years,
and isn't that great?
Only more so, right.
Yeah.
No, that's exactly right.
I mean, what we saw is that instead of having basically a population of potentially
500 customers in the world, right, when we were in HPC, with all the various large data centers,
we now have much larger potential single customers and obviously many more of them.
So, you know, you have to remember that, you know, for a very large HPC system, the DOE
was spending basically 500 million or 600 million on a very large system, right,
three, four, five years ago.
And nowadays, people are spending $1 billion, $2 billion, $5 billion, $10
billion, $50 billion on systems, like it's no problem.
Right on.
There's an explosion out there, which is quite interesting.
So Paul, in the context of what you see looking forward, you've mentioned that a little
bit, but how do you see the strategic decisions that you need to make?
Is DPU a threat or a leverage point?
Does it accelerate you becoming more of a software company and therefore, no worries,
we're just going to layer on top of what they do, or is that a threat?
Have you considered doing your own systems?
Yeah, no, absolutely.
No, I think it's enabling because you're going to need the edge to core to cloud,
on-prem.
You're going to need all these systems to basically perform efficiently and in unison.
And for that, you need potentially fully optimized, purpose-built specialized hardware and software
systems, right, for the core training system.
And then when you come to DPUs or sensors or humanoids and so forth,
you only need software that can be close to the unit of intelligence in this device.
So really, it's not a threat. It's convergence.
We align completely with Nvidia.
We align with other partners as well.
And this is really what makes it very exciting for us,
is that we have the right technologies to be able to deliver the solutions
no matter what you're trying to gain.
Right.
So you could be at small scale,
you could be at medium scale, you could be at massive scale,
you could be training or just inferencing.
I mean, the data and the IO profiles from training or inference are very different, right?
The challenges are extremely different.
And interestingly enough, right, in some of the early robotics testing
that we've done, we're finding that actually the performance,
the bandwidth, all the metrics remain: to enable, you know, value and TCO on your investment,
you do need the data intelligence platform to deliver performance.
And that's where DDN is king, the performance.
When people realize that they do need the performance,
we're the only ones delivering that extreme performance at any scale,
and that makes a huge difference at the level of the investment.
Yeah, it may just be me, but I think what you do at the edge,
I'm less familiar with.
I obviously know that you've been in the data center and, you know, et cetera.
But I think that edge part is really exciting.
This is really new, right?
I mean, if you're looking, this is really evolving, right?
This is going to happen over the next two, three years.
It is loading on DPU.
You saw the announcements with BlueField-3, now BlueField-4 with Nvidia.
And so we're able to really not just use them as cards or extensions.
We're actually able to run our software on a DPU, right?
So you do not need CPU or GPU.
All you need is a DPU and our software to run,
basically, inferencing software, validation, and so forth.
So this is something that we are working on and that is, in part, there.
But as the market evolves, we are ready for it.
Yeah, yeah, yeah.
That's also a good answer.
Thank you.
Okay, Paul.
Well, a wonderful discussion.
We've been with Paul Bloch, president of DDN, and thanks so much for your time.
We could go forever, and I really appreciate your time, Paul.
Thank you.
Thank you, Doug and Shaheen.
Really appreciate your time.
and that was a great discussion.
Appreciate it. Thank you.
Very good.
Take care.
That's it for this episode of the At-HPC podcast.
Every episode is featured on InsideHPC.com and posted on OrionX.net.
Use the comment section or tweet us with any questions or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen.
The At-HPC podcast is a production of OrionX in association with InsideHPC.
Thank you for listening.
Thank you.
