StorageReview.com Podcast #136: AI Is Forcing Data Centers to Go Liquid
Episode Date: April 25, 2025
Transcript
Hey everyone, welcome to the podcast. Brian Beeler here and I'm with Luca in the CoolIT booth at Data Center World in DC.
And this is the first time I think in a while we've tried to do one of these pods at an event.
It's pretty quiet here, but I'll apologize in advance if you hear vacuum cleaners or other pre-expo opening noises. We're standing in front of your brand new CDU.
You guys have launched this this week. What's going on with this system?
Yeah, so this right here is our brand new CHx2000 liquid-to-liquid CDU.
It is capable of 2 megawatts of heat rejection, so it is essentially our most
powerful CDU that we offer in our broad portfolio today. And it's in one
of the most compact form factors and footprints in the
marketplace. So this is all stainless steel, four-inch piping, Victaulic couplings, 750 millimeters by 1200. So it's a standard rack, lots of accessibility on both the front and the back. Right. And we're quite proud of it. It's definitely stuffing more power into the same footprint and gives us, I think, an advantage over the competition.
And for anyone that's just listening, in the description in the show notes we'll
have a link to an article that Harold's put together on this new CDU so you can
see everything about it, get all the specs and we'll have some nice photos of it as well.
Visually, I mean I know you don't ship these with plexiglass on one side, but visually
it's quite stunning.
The amount of metal in here is incredible and all the pumps and everything.
What are the challenges with a CDU?
You guys have been in the game for what, a decade plus at this point?
Probably longer.
I'm sure I'm shaving some time off. Yeah, so I mean we started with rack-based CDUs initially, obviously in the smaller form factor, the ones that would sit integrated into the factory racks. But with the row-based units here, essentially the problems are still the same. The approach temperature, I think, is a critical thing for us. It's really about understanding and having a system that has a high-efficiency heat exchanger and really powerful pumps to be able to keep that approach temperature really low and ensure that you can cool effectively all of the units that are under its power and under its control.
So it's kind of interesting as we think about
direct to chip liquid, right?
Nvidia is a big proponent of this now
and has been driving this with their massive TDPs.
But Intel, AMD, everyone else is right in there as well
as they continue to push these chips.
And it seems like liquid was a little fringy,
maybe what, even just two and a half, three years ago.
I mean, you were certainly doing it.
It was big in HPC, but the enterprise wasn't there.
With all these new NVL systems though, there is no choice.
And so organizations are going to have to get comfortable
if they want these big GPU systems
with liquid cooling of some form or another.
And of course you guys are leading the way
with this direct-to-chip.
But talk a little bit, we're here at Data Center World,
what's the sense of the people here
in terms of what their comfort level is with liquid cooling?
I mean, I would say generally it's still relatively new.
I mean, in our space, obviously we've been doing it
for some time, we were at the NVIDIA GTC show
and we were actually showcasing a wall
of some of our liquid cooling innovations
going back to 2009, just to kind of show the progression over time. But in this particular environment, like with
Data Center World, I think a lot of the questions that we're getting are more
from people who are, as you've mentioned, kind of newer to the liquid cooling
space. We definitely are getting a broad spectrum of questions. We had a session
yesterday which talked a little bit about selecting the CDU.
And so that was a session that was really designed to kind of give people an understanding
of the things to look for and how to start entering and participating in this market
that, you know, as you've mentioned, is going to become more and more prevalent, right?
So as AI becomes one of the dominant sort of compute use cases, it's all liquid-cooled.
And even on the enterprise side of things,
we're starting to see it as well.
Certainly all the server OEMs are starting to create
like liquid-cooled CPU-based units as well.
So that is not exclusively AI.
And so it's just kind of getting them up to speed with that
and used to the differences between air and liquid in terms of the maintenance. We retrofitted
an R760 last year with some of your cold plates, and you guys sent us the little baby in-rack CDU, which is not how most people will deploy liquid cooling, but it's a fun starter actually. And when we did that, we did nothing but liquid cool the CPUs,
and I think we saved 200 watts of power
while increasing the performance a couple points
on those CPUs.
That was one server.
If you do that at a rack scale,
even just compute forgetting about GPUs entirely,
that's an amazing savings.
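Extrapolating that single-server anecdote to a rack is simple arithmetic; only the 200-watt figure comes from the conversation, and the server count below is an assumed example:

# Rack-scale extrapolation of the ~200 W per-server saving mentioned above.
# Server count per rack is an assumption for illustration.

watts_saved_per_server = 200      # the figure from the single R760 retrofit above
servers_per_rack = 40             # assumed dense compute rack, for illustration
hours_per_year = 24 * 365

rack_saving_kw = watts_saved_per_server * servers_per_rack / 1000
kwh_per_year = rack_saving_kw * hours_per_year

print(f"Per-rack saving: {rack_saving_kw:.1f} kW")
print(f"Energy saved per year: {kwh_per_year:,.0f} kWh")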
Yeah, absolutely.
Yesterday, actually, I was fortunate enough
to take in a session here at the show,
which was, I think it was someone
from the University of Chicago, if I remember right,
and they were talking about their experience with,
it was a two-phase solution,
but it was the same sort of thing.
I mean, it was using liquid effectively
to cool their servers,
and doing a power savings study against that.
And I think the number that he came up with
in his particular use case was 37% savings for the lab that they were testing in. So pretty
substantial. So obviously there's a lot of, you know, the environmental aspects at play here, and all of those things that you've talked about in terms of, you know, the smaller performance improvements as well. Just keeping the system happy. You don't have to run as many fans and all that great stuff. The fans are the big thing, right? Anytime we're talking about liquid cooling,
it's gotta be about reducing energy consumption
in that server.
Most of that energy consumption,
we think about it as being driven by the silicon,
but the fans, like the 25, 30% in some data
I've seen out of Lenovo and others,
is pretty wild in terms of what you can save
when we pin those fans back,
or in some cases remove them entirely.
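A quick illustration of why that fan fraction matters; the 1 kW server draw and the 80% recovery factor are assumptions for the sake of the arithmetic, while the 25-30% range is the one quoted above:

# Illustrative fan-power arithmetic. The server draw and recovery factor
# are assumed round numbers, not data from Lenovo or this episode.

server_draw_w = 1000                 # assumed total air-cooled server draw, W
for fan_fraction in (0.25, 0.30):    # fan share quoted in the conversation
    fan_w = server_draw_w * fan_fraction
    recovered_w = fan_w * 0.8        # assume ~80% of fan power can be recovered
    print(f"fans at {fan_fraction:.0%}: ~{fan_w:.0f} W on fans, "
          f"~{recovered_w:.0f} W potentially saved per server")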
Right, absolutely.
And yeah, that's sort of the bread and butter.
I mean, we've taken from the super compute space
where they have 100% fanless solutions,
obviously in those blades.
And that's where we leveraged a lot of our expertise
and history from inside the IT,
now moving into the infrastructure
and everything that kind of supports that. It's always wild going into these labs
that are heavily liquid cooled how quiet they are and how cool they are because
you're dealing with the heat so much more efficiently. It feels kind of like
cheating right? I'm used to going into the lab and having the
servers scream, especially all these 1U systems that are really popular in the HPC and AI spaces. Right, and the challenge, obviously, is with all of the flux. So, you know, as the load comes on, the compute spools up, all the...
Yeah, it's...
I don't know, it takes a strong person to get used to that day in and day out, for sure.
Sort of a nerdy white noise, I think, for some of us.
It's funny because I just posted a social video last week
where I was looking at something,
I think one of the big SSDs from Solidigm, they've got these 122TB drives now, which are insane,
but I shot this little video about it
and some of the comments on Instagram were, what's that noise?
And I'm like, wait, do people not know
what a data center sounds like?
And they really don't.
We're talking to so many young software designers,
AI guys, that have never even seen the systems
that they're interacting with,
which to me is pretty nuts to consider.
Yeah, you're right.
And again, as all these racks move up
in thermal design power and density,
you're going to just get more and more need
to reject that heat somewhere, right?
And so if you were to power it with fans, I can't imagine, you know, what the
decibel rating would be. Oh my gosh, it'd be wild. It'd be wild. And with liquid
you're gonna get better density out of your racks and now the challenge is, as
you introduce systems like this, we still have to get power to the racks. I mean, when we look at older data centers and as we look at enterprises that are bringing these big systems in and trying to figure out how to manage that, what do you see as your responsibility
in terms of helping your customers kind of figure out
how to optimize their racks?
I mean, that might be a little outside of your core scope,
but you guys are kind of leading the vanguard there.
Yeah, definitely.
I mean, fundamentally, I think one of the things
that liquid obviously does is reduce power consumption.
And you'll always have that kind of trade-off, right, in terms of as densities and design power goes up and up and up,
you know, you want to find ways to minimize the use of that energy, right? So heat rejection by air is obviously not as effective or as efficient as liquid.
And you know, so you want to reduce the amount of energy you're actually using to offset
and reject that heat.
And so as those go up, we'll have that juxtaposition between the two.
Where liquid is obviously one of the ways that allows you to do that.
And it's probably the most effective means that we have right now.
So you know, we continue to expect that to grow in dominance
over the data center space as these power-hungry
GPUs and CPUs continue to consume that energy.
And anything that we can do to be a part of reducing
the reliance on energy, because we certainly are going
to have some power challenges over the next little while
as all these data centers come online.
No doubt.
And we're trying to do our little part to help that out.
Half the guys in here are dealing with power delivery
and all these challenges.
And I think there's a 42-foot truck over there that's
a giant generator and all sorts of things trying
to bring that power generation to these data centers.
That's a massive challenge.
And yeah, to your point, you're helping drive down some of that wattage by being more efficient in the way you're cooling it. When you think about fluids, though, you've mentioned that a couple times, I want to be clear about that.
You guys are just using water in these systems for the most part.
Yeah, I mean, our systems don't use a refrigerant, so we use single-phase direct-to-chip cooling. It is a PG25, so a propylene glycol 25% solution, primarily water with propylene glycol, plus biocides and fungicides, et cetera.
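As a side note on what the glycol costs thermally, here is a small illustrative comparison; the specific-heat and density values are approximate textbook numbers, not CoolIT coolant specifications:

# Rough comparison of plain water vs. an approximate PG25 mixture.
# Property values are approximate textbook figures, not vendor data;
# they only show the direction and size of the trade-off.

water_cp = 4186.0    # J/(kg*K), approx. specific heat of water
pg25_cp = 3900.0     # J/(kg*K), approx. specific heat of a 25% propylene glycol mix
water_rho = 998.0    # kg/m^3
pg25_rho = 1020.0    # kg/m^3

# For the same heat load and loop temperature rise, volumetric flow scales
# with 1 / (cp * rho), so the glycol mix needs a bit more flow:
flow_penalty = (water_cp * water_rho) / (pg25_cp * pg25_rho) - 1.0
print(f"PG25 needs roughly {flow_penalty:.0%} more flow than plain water")

In other words, the additive package buys corrosion and biological protection at the cost of a small flow penalty, which the CDU pumps are sized to absorb.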
Which is one thing that people don't consider.
They just look at these loops.
And if you think about a PC that might have a liquid loop on it, most of us don't really
worry about growing little creatures or fungus in there, but the water is not perfect
And of course you've got to account for that, right? And, you know, as the water heats up, I mean, it makes for an ideal environment for a host of these things, right? If you've ever looked inside the radiator in your car, for example, you'll see that, over time, you can see bits of corrosion, and it'll reduce sort of like the ability for the water to transfer cleanly through there. And that is part of the reason as well
that you would use your antifreeze in your vehicle. So we're leveraging the same types
of solutions to do that and obviously we're trying to ensure that the water has the proper
chemistry and is clean to continue flowing through. The micro channels that sit in the
cold plates are also very, very fine.
That's what I was gonna say, right?
Is we look at these four inch pipes
and you think about, well, a little bit of green goo
won't make any difference,
but it's when you get to the cold plate
where you lose the efficiency
if those things start to get clogged up, right?
Absolutely, right?
And so that's exactly what we're trying to prevent
and that's why all of these systems
come with additional filtration.
We're down to 25 micron,
so we're really trying to make sure that we're maximizing the performance of the entire solution,
the entire system, not just the CDUs, not just the cold plates,
but really just keeping it optimized in a way that keeps the system happy and powering that compute.
You talked about single phase, and that's just liquid.
It's not boiling, it's not doing anything.
And just to be clear to the audience, there's also a two phase, which is another way to
do this. You guys aren't in that business or maybe you're evaluating it. What's the
deal with two phase?
Yeah, so I mean, I think in our space, we are definitely the leaders in single phase
direct to chip cooling. That doesn't mean, however, that we haven't evaluated other technologies.
You've been to the Liquid Lab, and so in our Liquid Lab we've actually been doing a lot of testing and validation on two-phase for some time now.
So we continue to keep abreast of all the sort of trends.
We've actually published a paper, though, and when you look at it from a holistic system standpoint, we've found that there's still a lot of runway for single-phase direct-to-chip cooling. I know
there's some folks here at the show that have said some other things that might
disagree with me but from our engineering standpoint we believe that
there's still a lot of runway in that, and the two-phase solution, in the testing that we've done, hasn't inherently been sufficiently better that it makes sense to move to a refrigerant-based solution today. So not
to say that we're not looking at it, just not today. And immersion I guess is
the other popular technique, right, that works for different people at different
scales. What do you think about that? I mean again, obviously I'm biased, but I
would say the key thing for us
is that immersion comes with obviously a benefit,
but a lot of downside and challenge.
The market today has started making a bit more of a bet
towards direct-to-chip cooling.
I think Nvidia has said that their roadmap is really
geared around direct-to-chip for the foreseeable future.
And so long as that's what the market is driving towards, that's where our focus will continue
to be.
You talked about the headroom a little bit there in terms of what can be done.
You guys announced a couple weeks ago, I think, too, a 4,000 watt plate.
I mean, that's wild.
Right.
Yeah.
How does, I mean, just to ask a naive question, how does
that even work to be able to offload that much heat in one plate? Well, with a lot of really good
engineering, right? You know, the reality is, again, like we've been in this space for some time,
and so we've developed a lot of expertise around all the sort of tips and tricks to do this. And,
you know, effectively, ultimately, it's still the thermal properties of copper that are kind of like what we work with.
But we have designed this as a prototype,
really just to kind of again show
that there is still capability within this
in the right use cases.
We created a thermal test vehicle,
so basically a series of heaters to replicate
the size of basically a V200, an NVIDIA GPU chip.
And so from that we were able to kind of prove out that we still had the ability to reject
heat to 4,000 watts or 4 kilowatts.
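To give a naive sense of what rejecting 4,000 watts through a single plate implies, here is a one-line thermal model; the thermal resistance and coolant supply temperature are assumed, illustrative values rather than anything from CoolIT's test vehicle:

# Naive single-node thermal model of a cold plate on a high-power device.
# Both parameters below are assumptions for illustration, not measured values.

q_watts = 4000            # heat load the prototype plate was shown rejecting
r_thermal = 0.008         # assumed case-to-coolant thermal resistance, degC/W
coolant_inlet = 32.0      # assumed coolant supply temperature, degC

case_temp = coolant_inlet + q_watts * r_thermal
print(f"Estimated case temperature: {case_temp:.1f} degC")
# 64.0 degC with these assumptions; a lower thermal resistance or a cooler
# supply temperature buys the same headroom back.

At these power levels, every fraction of a degree-per-watt in the plate design matters, which is why the microchannel engineering mentioned earlier is the hard part.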
So it's an interesting challenge though that you face because obviously you partner with
all the big silicon guys, but at the same time, they announce things at a cadence that's
pretty rapid and maybe before things are really shipping. So you guys have to be pretty nimble,
I suppose, because you're designing all along the way, but things still change all the way up to the
end. Yeah, I mean, no different than any other business. We have our challenges, right? So as
we progress further upstream, working with the silicon manufacturers, we do see that you know their designs change, right? So it might
be mid-flight and version 3 turns to version 4 turns to version 6 pretty
quickly, right? So you know we, I wouldn't say we struggle, we work alongside them
to continue to do the same thing. It's the same challenges that everybody else has, so we're all on
equal footing and I think we're just trying to adapt to that insatiable sort of demand and speed that
the AI market is really driving today.
So talk about a little bit about the buying process for this gear.
I've seen your stuff come through, like Dell, for instance.
I know they've been a partner of yours on the cold plates, manifolds, all that kind
of stuff in the CDUs.
But for somebody that wants to consume
a rack level CDU like this, are they going to you guys?
Are they going to your partners?
Kind of what's the motion there?
And kind of also, can you talk a little bit about
what the decision tree is?
How do I buy a CDU?
It's outside the typical IT motion.
Right. Yeah, so one of the things that we do, I think, really well is we are an end-to-end solution.
So we understand how cooling and heat rejection works inside with the cold plate, so inside the IT,
all the way through to the infrastructure required to size and plan and properly scope out
your needs from a CDU perspective. We
typically sell most of our products direct to market. We do do direct
factory integration at the cold plate level with a lot of the OEMs. So we
would sell direct to the OEMs, they would do that installation direct in one of
their factories and so the cold plates get integrated directly at the factory
under an OEM warranty.
But these types of units are typically bought directly from CoolIT.
We have a professional service group that helps support kind of like the engineering,
design, sizing, install, commissioning, preventative maintenance, etc.
So we run the whole spectrum of everything that's required to support direct liquid
cooling.
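A toy version of that sizing exercise might look like the following; every input here (rack count, per-rack load, derate) is an assumed example, and real scoping also covers flow rates, approach temperature, redundancy, and facility water conditions:

# Toy CDU-count estimate for a deployment. All inputs are illustrative
# assumptions, not CoolIT's sizing methodology.

import math

racks = 32                    # assumed number of liquid-cooled racks
kw_per_rack = 120             # assumed per-rack heat load captured by liquid
cdu_capacity_kw = 2000        # the 2 MW class CDU discussed above
derate = 0.8                  # assumed design margin on CDU capacity

total_heat_kw = racks * kw_per_rack
usable_per_cdu = cdu_capacity_kw * derate
cdus_needed = math.ceil(total_heat_kw / usable_per_cdu)

print(f"Total liquid heat load: {total_heat_kw} kW")
print(f"CDUs needed with margin (before any N+1): {cdus_needed}")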
And when you bring in liquid for the first time into an organization,
IT obviously has concerns over that.
The facilities management has some concerns.
What do data center administrators have to do to embrace this?
Do they have a headcount that's dedicated to managing this? Does it fall under
someone else's umbrella? Who manages this thing? Typically the CDUs, for example, would be more of the infrastructure people. So the folks at the
data center are the ones that are typically making design decisions. Sometimes it comes in
conjunction with, you know, if you're a colo, depending on the clients and the customers,
they may also have their sort of like required vendors. So each customer and each data center
might have
its own slightly different nuance to the way
that they actually like manage and buy the equipment.
But typically the CDUs and everything done inside
the facility and that infrastructure is done
by the facilities people.
So the data center folks themselves directly
are the ones doing a lot of the purchasing
and maintenance, et cetera.
So when you consider the fact that you could pair systems with your cold plates with CDUs
that are yours or competitors, conceivably, right?
What's the industry done in terms of standardization?
Because I've been in a lot of labs and seen manifolds that connect to servers and AI systems that aren't the same
as manifolds in other places.
And are we at the point where you can deliver
out of any CDU to any manifold?
Or kind of what's going on there?
Yeah, I mean, I would say there's a pretty substantial push
to try to have interoperability.
Yeah, OCP has got a working group for liquid storage or liquid cooling, right?
Yeah, I would say OCP has actually been the leader in terms of trying to bring together
a common standard and a common framework.
And you know, I think as a participant, as a vendor in that space, like obviously we
have our own preferred ways.
There are some challenges that I think we haven't quite solved when it comes to interoperability, in terms of warranty, for example, right?
So how do you determine what component or piece is at fault
if there's some kind of warranty claim?
Or how can you ensure that all the wetted materials
are compatible with each other?
Because we all use a slightly different sort of secret recipe
for the way that we design our products.
And so I don't think that has quite landed yet, but I think definitely we're starting to see more
involvement from groups like OCP, from the large OEMs and the manufacturers of the silicon themselves,
also getting more involved in determining what's required to support them.
Well, it's got to get universal, right? Because I can't be dropping in servers from one vendor and then servers from another vendor,
go to the manifold and be like, yo, these don't plug in.
Like, that's got to get better.
Yeah, I would think so.
I mean, we're all looking forward towards a day
where that's the case.
Well, it's in your best interest, right?
I could see why if you're an infrastructure company in IT
where you may have wanted to put up a little wall
around your investments in liquid cooling,
but at this point, I mean, being open,
I think is the right way to go.
Yeah, again, we wanna be a participant and a leader
and just have a voice in some of those decisions
that are being made, because we do think we have
an experience set that's able to contribute
to that conversation.
And so, so long as the decisions that are being made are, we think, beneficial for the industry as a whole, we're all on board with that.
And so we'll see a drive towards more standardization, I would expect in the near future or, you
know, sort of the immediate term here.
So what are the challenges going forward?
You've got this massive new CDU, but what are you guys looking at a year, two, three
years out at a high level? What are the challenges going to be? Yeah, I think the
same challenges we've been trying to solve over the last little while, right? Just hotter and hotter?
Yeah, just hotter and hotter. I mean, I think, you know, in the cooling space, I
think the challenges remain the same, like thermal management of these incredibly dense, hot 1,400, 1,600, 2,000 watt systems. And so that is, I don't think that's going to change.
I think that's actually the same problems we're gonna have to solve.
I don't think we've run up against the constraints of material science or
engineering yet. There's a lot of people that have a lot of great ideas. I think
we'll start to see more of maybe a mix
of different types of technologies being used
to try to achieve those kinds of results.
I think the bigger issue is still on the power side.
So I'm glad that we operate on the cooling side
because I think the first thing that will stare us
in the face is the power aspects.
But yeah, we're working really hard to continue
to support that ecosystem and
do the things that we've done for the last 24 years.
What can we do operationally to advance liquid cooling,
whether it's smart leak detection or reduce flow to a system that could be compromised
and be able to communicate?
Because that's one of the things, right, is if we look at a server, for instance,
that has out-of-band management, I can shut it down if I want to,
if I could detect some sort of leakage or something
and have some actions take place.
Is that kind of stuff progressing?
Definitely.
So in our advanced technologies group,
we are evaluating a series of technologies
to try to have smart sensors and smart leak detection.
We've had some customers ask us for different things
about how could we isolate specific problem areas, et cetera.
And I think you'll see that across the entire system
as we start seeing that.
We're starting to see more demand for redundant sensors
and maybe even AI can actually come in
and start supporting
some of that calculation and evaluation.
But we're definitely looking at getting more targeted in terms of, like, containing the impact, especially when these systems run on really large loops across clusters of technology, trying to isolate those problems into smaller areas that can be problem-solved and repaired quickly if there is an issue.
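As a concrete, deliberately simplified sketch of the kind of policy being described: the sensor fields and actions below are hypothetical stand-ins, not a real CoolIT or server-vendor API, and real systems would act through their own management planes:

# Hypothetical leak-response policy sketch. The readings and actions are
# illustrative stand-ins, not a real CDU or BMC interface.

from dataclasses import dataclass

@dataclass
class LoopStatus:
    branch_id: str
    leak_detected: bool       # spot/rope leak sensor on this manifold branch
    redundant_confirms: bool  # a second sensor agrees, reducing false positives
    flow_lpm: float           # measured flow on the branch
    expected_flow_lpm: float  # commissioning baseline for the branch

def decide_action(s: LoopStatus) -> str:
    """Pick a conservative action for one manifold branch."""
    if s.leak_detected and s.redundant_confirms:
        # Confirmed leak: isolate just this branch rather than the whole loop.
        return f"close valve on {s.branch_id}, alert facilities, flag servers for shutdown"
    if s.leak_detected:
        return f"throttle flow on {s.branch_id} and request a manual inspection"
    if s.flow_lpm < 0.8 * s.expected_flow_lpm:
        # Degraded flow can mean fouling in the fine cold-plate microchannels.
        return f"open a maintenance ticket for {s.branch_id} (possible fouling)"
    return "no action"

print(decide_action(LoopStatus("rack-12/branch-3", True, True, 38.0, 60.0)))

The design idea it illustrates is the one from the conversation: redundant sensing to avoid false trips, and isolating the smallest possible section of the loop rather than shutting down an entire cluster.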
All right, well clearly, I mean,
looking around the show floor here,
this is a big concern, the liquid loops,
the CDUs, the cooling, the power,
all this is really important.
And as the show floor sort of starts to open up
and gets louder, I just heard my first vacuum.
I think we're gonna stop now
before we risk losing all of the audio quality.
But like I said, I've got links in the description here and we'll have a little bit up on storagereview.com
if you want to learn more about the CDU, the 4000 watt cold plate and all the innovation
that CoolIT is driving.
Thanks for doing this.
Appreciate it.