Podcast Archive - StorageReview.com - Podcast #134: Leak-Proof Liquid Cooling
Episode Date: September 25, 2024
Chilldyne Podcast – liquid cooling solutions for data centers focusing on leak-proof design and…
Transcript
Hey everyone, welcome to the podcast. Brian Beeler here with Storage Review. And today
we've got a conversation about liquid cooling. Everyone knows that that's one of the hottest,
if you'll let me use that term, technologies in the enterprise right now. And a couple
weeks ago, our team was out in Carlsbad, California with Chilldyne, spent the day there
and learned everything there is to learn about CDUs, negative pressure, cold plates,
turbulators, and everything else that goes into a really good CDU design and cold plate
design for GPU servers, for high-end compute servers, and I guess even rocket engines if
you're into that sort of thing.
With me today is Steve
Harrington with Chilldyne. Steve, what do I even call you? CEO? Is that what you are? I don't think
you've got a name. Yeah, I'm the CEO. Okay, well, we didn't get into the formalities when I was out
there. So we've got Steve, the CEO, with us today. Steve, how are you doing? Fantastic.
So I mentioned rockets briefly, but you're in the cold plate business now. And I
guess maybe you've always been in that or for a while anyway. How do you get from rockets to
data center cooling? So actually, I started in cooling for supercomputers when in like 1987,
when I was grad student. And then it got real easy with CMOS. So I got into all
kinds of other things, aerospace, medical devices, and so on and so forth. And then,
you know, more recently, suddenly the power from the chips is so much you needed liquid cooling
again. So I've taken all the know-how that I've developed over the years and all these other
fields and applied it to the
liquid cooling problem. Well, go back to the old days, if you will, because data centers now
tend to be apprehensive about the notion of bringing liquid into the data center for cooling,
and we can talk about that. But if you go back, it was there in mainframes and other solutions.
Why did liquid leave?
Well, yeah, I think IBM had liquid in 1964.
And in the 80s, Unisys, Burroughs had liquid cooling as well.
But what happened was we went from a bipolar transistor in the olden days to the CMOS transistor that we use today. And at that time,
people said, wow, the speed's never going to be over 100 megahertz and the power is never going to be a problem. So, you know, all you thermal design engineers can just kick back. And so I
moved on, right, and started working in aerospace, on the cooling of, you know, jet engines. And then I got back into it, like I said, recently. But at first, when CMOS came out, there was a dramatic reduction in chip power.
To be fair, though, they were right for four decades or so.
Yeah, they were.
Yeah, exactly. Forty years is long enough. Yeah, it wasn't exactly wrong. I mean,
that's a decent horizon in tech speak, right? Exactly. And so when you went to aerospace,
what was the challenge there? And was the technology similar in terms of cold plates,
or was it something else entirely? So we did liquid cooling for an airplane, the Northrop Grumman UAV,
the Hunter, I don't know, about 10 years ago. And then we also did cooling for ASML's laser
that they used, kind of the previous generation of laser for making chips. And we also did liquid cooling for rocket engines as
part of a DARPA project to develop a new kind of rocket fuel pump. So when I get to the liquid
cooling for chips and data centers, the heat intensity isn't as challenging as aerospace,
but the need for longevity, it has to run for four or five years
straight. And the need for uptime, that is the challenge in the data center business.
Yeah, it's an interesting parallel. And you did that work for DARPA, but now you've got
ARPA-E funding as part of your initiative. So maybe talk about that bridge too, and then we'll get into the
technology a little bit more. Yeah. So a few years ago, Peter de Bock from ARPA-E called me up and
wanted to talk about data center cooling. And then that led to us applying for a grant and getting it.
Under that grant, we developed a cold plate for a two-kilowatt chip, and now we're in the position of bidding on projects that have
two-kilowatt chips inside of them. So that was good foresight on Dr. de Bock's part, to figure
out that two kilowatts is where we're heading. And right now we're continuing to work for
ARPA-E on automatic coolant quality, so you don't have to worry about sampling your
coolant and putting in new chemicals all the time. And then we're planning on doing a comparative
deployment of our liquid cooled hardware with the Texas Advanced Computing Center (TACC) on a rack of GPU
servers there. We'll be looking at performance and efficiency of air versus liquid cooled racks
of high density GPU servers. And TACC does some really cool stuff and they're willing to try
just about anything. So is this the first large study of air versus liquid systems that you've
done in the way they're approaching it? Yeah. I mean, what I've seen, particularly with
hyperscalers, is that, you know, it's very hard to do large scale testing of liquid cooling,
because if you have a high density liquid cooled rack, that might be $2 million or even more.
So to do a test of liquid cooling is something you have a hard time getting your CFO to sign
off on, right? Because it's
production, but it's also experimentation. And there's not a good column on that spreadsheet
for them to put it in. So working with TACC, Sandia, Los Alamos, these are places that are
willing to take a risk with new technology, try it out in production, and then share the results of what happens with
everyone. Well, the last part's the key, right? Because if it's Meta or X or any of these other
huge consumers of GPUs, whatever they learn, they're less likely, and not to disparage Meta,
they're part of OCP and they do a great job there, but they're not going to give away more secret sauce than they need to on that front.
And I think TACC tends to publish this data, as you were saying.
Right. They're not going to tell anybody their best technologies.
They tend to keep it close to their chest.
So, yeah, on the commercial side.
Exactly.
So that'll be interesting data.
And for those that want to learn more about Chilldyne, we'll talk about some of the elements on this podcast.
But if you want the deep dive, we do have a paper and we've got two videos on YouTube.
And Steve, we were with your team all day and we ended up with so much content.
We did two different videos.
And the response that we're seeing has been really favorable. People love your design that's essentially
leakless. And so we don't have to do a whole deep dive into the technology, but we should address
the negative pressure issue here. Could you talk a little bit about that, where that technology comes from, and why it's so special in the design that you guys use?
So what we do, which is pretty unique, is we have negative pressure on both supply and return so that if there's a leak anywhere in the system, air bubbles go in.
You don't spill coolant on the chips.
It just reduces the risk of deploying liquid cooling
by an enormous amount. And then the other thing that's kind of special about our system is when
there is a leak, the system keeps working. Because what we found is that if you make a system that's
leak-proof, if there is a leak, the customers don't fix it. The O-rings get old, maybe a fitting cracks,
and they just don't bother fixing it if it keeps working.
So that's been our mission is to make the system that works even if it has leaks
because I know there's a gamer system out there that runs your all-in-one loop
under negative pressure so it won't leak.
But once it leaks, you're done.
You've got to fix it.
And that isn't really suitable for a data center because you're going to have thousands upon thousands
of connections and any one of those could leak. And when you run it full blast for five years,
the chances that something's going to leak are 100%.
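To make the negative-pressure point concrete, here is a minimal sketch of the detection logic such a CDU might use. The sensor names, baseline rate, and thresholds are assumptions for illustration, not Chilldyne's actual controller:

```python
# Illustrative sketch only, not vendor firmware. Under negative pressure a leak
# pulls air IN rather than pushing coolant out, so it shows up as extra gas the
# air-removal stage has to pull from the loop, and the CDU can flag maintenance
# while the servers keep running.

from dataclasses import dataclass

@dataclass
class LoopStatus:
    supply_pressure_kpa: float    # gauge pressure; negative means below atmospheric
    return_pressure_kpa: float
    air_removed_ml_per_hr: float  # gas volume the vacuum/air-separation stage is removing

BASELINE_AIR_ML_PER_HR = 50.0     # assumed normal degassing rate (hypothetical number)
LEAK_FACTOR = 3.0                 # flag a probable leak above 3x baseline (assumption)

def assess(loop: LoopStatus) -> str:
    """Return an operator-facing status; cooling continues in every case."""
    if loop.supply_pressure_kpa >= 0 or loop.return_pressure_kpa >= 0:
        return "WARNING: loop not under vacuum; a leak here could drip coolant"
    if loop.air_removed_ml_per_hr > LEAK_FACTOR * BASELINE_AIR_ML_PER_HR:
        return "MAINTENANCE: probable leak (excess air ingress); servers keep running"
    return "OK"

print(assess(LoopStatus(-20.0, -15.0, 40.0)))    # -> OK
print(assess(LoopStatus(-20.0, -15.0, 400.0)))   # -> probable leak, but no downtime
```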
Well, so talk about that in terms of the negative pressure bit. Is there data out there, or have you seen data, or are you guys working on it maybe with TACC, around the impact on component wear and tear in negative pressure versus the more common pressurized systems?
We're working to get that data. So actually, ARPA-E is working with the University of Maryland to try to collect
all that data in terms of what's the reliability effects of liquid cooling on data centers in
general. But in a lot of cases, people, you know, don't want to talk about it because sometimes,
you know, they might have to admit that they made a mistake. So one of the common mistakes that we
see is that liquid cooling customers don't take care of the coolant properly. And then either it
gets bacteria in it or it starts attacking the seals. And then at that point, you know, it's
kind of the liquid cooling vendor's fault because they didn't, you know, make the system kind of
idiot resistant. And it's also the data center
operator's fault. So nobody wants to talk about the mistakes they made, which tends to keep the
mistakes discussed over a beer at the conference. So everybody knows about it, but nobody's talking
about it, which I think just makes the fears among customers worse.
Absolutely.
And anything that makes it secretive is certainly not good for overall industry adoption. And you were talking about the fluid quality.
You have a chemist on staff.
We met him, a very chemist-looking fellow who was looking at all the water samples. And something that I
didn't appreciate is that the biology, the biome here in Cincinnati, where I am, is very different from
where you are in California, versus Europe, versus anywhere else. And the stuff in the water that contaminates it, the bio bits, is different. And to think about
that impacting your liquid loop was something that I hadn't considered. Talk about that a
little bit in terms of the biodiversity and how you have to engineer for that.
Yeah. So for example, IBM has published a spec on their liquid cooling water system that
they recommend. And, you know, they try to clean the fluid loop as best they can, but pretty
much we're all covered with bacteria. As soon as we touch something, it's not sterile anymore,
you know? And nobody in the data center business knows anything about sterile procedures. So you're going to get bacteria in the system.
If you put in PG-25, the 25% propylene glycol mix, you're going to slow it down considerably.
But still, what we see is over a period of time, the water gets cloudy and then various slightly soluble chemicals or bacteria build up.
And then that can slow down the system.
It can make your chips run hot or it can clog it up entirely.
So that's why we have an automatic coolant quality control system.
So we can make sure that even if the customer doesn't do the right thing with the water
or the additives, that the system will issue an alarm and they won't be able to ignore it.
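As a rough illustration of what an automatic coolant-quality alarm can look like, here is a short sketch with assumed sensor readings and threshold values; only the additive concentrations loosely track the figures Steve gives later, everything else is hypothetical:

```python
# Minimal coolant-quality check, assuming hypothetical thresholds. Not the
# actual Chilldyne / ARPA-E system, just the shape of the idea: measure, compare
# to limits, and raise alarms the operator cannot quietly ignore.

def coolant_alarms(turbidity_ntu: float, biocide_ppm: float,
                   inhibitor_ppm: float, ph: float) -> list[str]:
    alarms = []
    if turbidity_ntu > 5.0:            # cloudy coolant suggests bio growth or precipitates
        alarms.append("Coolant turning cloudy: check for bacterial growth or fouling")
    if biocide_ppm < 2.0:              # roughly 3-4 ppm is the working dose mentioned later
        alarms.append("Biocide depleted: dose antibacterial additive")
    if inhibitor_ppm < 30.0:           # roughly 40-50 ppm is the stated inhibitor level
        alarms.append("Corrosion inhibitor low: dose inhibitor")
    if not 7.0 <= ph <= 9.5:           # assumed acceptable band for a copper loop
        alarms.append("pH out of range: investigate chemistry")
    return alarms

for msg in coolant_alarms(turbidity_ntu=8.2, biocide_ppm=1.1, inhibitor_ppm=45.0, ph=8.1):
    print("ALARM:", msg)
```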
So in your CDU, when you open that front door, there are two or three little containers there that have the chemicals, and the system self-balances. And the one thing, you know, as I was saying, that we didn't appreciate
when thinking about the water itself is that when you get that contamination, those little bits can
gum up the fins on your heat sinks, or the turbulator that agitates the water and helps with the cooling
or the heat dissipation. And it's really a big deal.
And I guess once you get to that point
and you start to gum up the system,
especially in the cold plate itself,
there's not really a mechanism to clean it, right?
If it gets that bad,
you have to kind of replace it.
Yeah, so you really want to keep a good eye on it.
And if the temperature starts to go up,
you need to aggressively look for the problem and get it under control. So one of our customers, we told them to put in the coolant
additive. They said, that's too expensive. We're not going to bother. And then a few months later,
they said, hey, all of our GPUs are starting to get hot. And so for them, we came up with a
regimen of chemicals to put through the system to bring it back to life.
And then that was OK.
But in some cases, once that cold plate is fully clogged, then you can't get any chemicals through it.
So you really have to monitor things.
And the minute the average temperature of your chip starts to go up, you need to investigate it quickly.
Just stay on top of it.
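One simple way to act on that advice is to watch the fleet-average GPU temperature for drift against its own history. The sketch below is a generic illustration with assumed window sizes and thresholds, not a tool from either company:

```python
# Watch for upward drift in average GPU temperature: compare a short recent
# window against a longer baseline. Window lengths and the 3 C threshold are
# illustrative assumptions.

from collections import deque

class TempTrendWatch:
    def __init__(self, baseline_len: int = 1440, recent_len: int = 60,
                 drift_threshold_c: float = 3.0):
        self.baseline = deque(maxlen=baseline_len)   # e.g. last 24 h of 1-minute samples
        self.recent = deque(maxlen=recent_len)       # e.g. the last hour
        self.drift_threshold_c = drift_threshold_c

    def add_sample(self, avg_gpu_temp_c: float) -> bool:
        """Record a fleet-average GPU temperature; True means the drift warrants a look."""
        self.baseline.append(avg_gpu_temp_c)
        self.recent.append(avg_gpu_temp_c)
        if len(self.baseline) < self.baseline.maxlen // 2:
            return False                             # not enough history yet
        drift = (sum(self.recent) / len(self.recent)
                 - sum(self.baseline) / len(self.baseline))
        return drift > self.drift_threshold_c

watch = TempTrendWatch()
for temp in [62.0] * 800 + [66.5] * 60:              # temperatures creep up ~4.5 C
    if watch.add_sample(temp):
        print("Investigate: average GPU temperature is trending up")
        break
```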
Yeah. It makes you think a little more strategically when you bring these
solutions into your data center. It's not just a point solution of: the thing is hot, put this in,
make this investment. Now it's cold and everything's well again. It is a little more than
that in terms of maintenance and managing it and making sure that it's
operating properly. But you said PG-25. I want to come back to this. A lot of people, when we
think about liquid cooling systems, get concerned about the liquid being used. And in your case,
it's water, like regular water with a little bit of additive that can still go down the drain without any
EPA guys showing up at your door in case you dump your CDU for whatever reason.
Talk about that versus PG-25 for those that don't know. And then let's pause there and then talk
about two phase for a minute. Okay. So what we do is we put in a little bit of antibacterial, about,
you know, three or four parts per million, and then a little bit of anti-corrosion, maybe at 40,
50 parts per million. And both of these are chemicals that you might find as a fertilizer
for crops or as a chemical that would be in your swimming pool that you'd swim in.
So they're not particularly toxic, and you can dump a lot down the drain without anybody even noticing. Now with PG-25, we actually worked with a local brewery, and they would occasionally have their
system burst and pour a whole bunch of PG-25 into a local creek, which caused everybody to get mad at them.
The nice thing about PG-25 is it doesn't freeze, so it's really good for the all-in-one systems that
the gamers use, because you can ship it and you don't have to worry about it freezing and expanding.
While it's in shipment, right.
Right, exactly. And the nice thing about it is
it slows down or
stops the growth of bacteria. But we have seen some issues where the PG, the propylene glycol,
attacks the seals and then the seals start leaking, which isn't good.
So that's why we're using, you know, the water with just a light touch of additives. And that's what we came up with talking to chemists, you know, who have worked on things like airplane freshwater systems and cooling systems for powerful lasers and stuff like that.
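For a sense of scale, the dosing math behind "a few parts per million" is simple. A quick sketch, with an assumed loop volume that is purely illustrative:

```python
# Back-of-the-envelope dosing for the additive levels mentioned above
# (a few ppm of biocide, a few tens of ppm of corrosion inhibitor).

def grams_needed(loop_volume_liters: float, target_ppm: float) -> float:
    """ppm by mass in water: 1 ppm is about 1 mg of additive per liter of coolant."""
    return loop_volume_liters * target_ppm / 1000.0   # mg -> g

loop_volume = 600.0   # liters, a hypothetical loop volume for a row of racks
print(f"Biocide at 4 ppm:              {grams_needed(loop_volume, 4):.1f} g")
print(f"Corrosion inhibitor at 50 ppm: {grams_needed(loop_volume, 50):.1f} g")
# -> roughly 2.4 g and 30 g for a 600 L loop: gram-scale doses, not drums of glycol
```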
And in fact, so when you bring a CDU into your environment the first time, you're just filling it up with facility
water, right? You'll add your chemicals to it, but there's nothing in the run-up that's particularly...
Yeah. What we ask is that the customer give us some distilled water or reverse osmosis water
that's filtered tap water, because we don't want to have to worry about what the tap water is in
every different location. So we're starting with water that has very low concentration of minerals in it. And then we add our chemicals and we're good
to go. And then periodically what we'll do is we'll drain a little bit down the drain and fill
a little bit more and basically refresh the coolant, put in new chemicals, and you're kind
of starting all over brand new. Because what we've seen is that, you know, with water-based chemistry, maybe after a year or so, it starts to get,
you know, buildup of chemicals in it and gets cloudy. And maybe after three years or so,
the PG-25, that starts to get cloudy as well. So one of the issues with the liquid cooling
business in general
is you have companies that are used to procuring electronics, and with electronics,
if it works for two weeks, it's probably going to work for five years. Whereas with liquid cooling
systems, that's not the case at all. If it works for two weeks, it may last five years, but it may last
six months. You just don't know. It has to do with the chemistry, your heating and cooling
up the water, the contact with the materials. In some cases, you don't know, is that pure copper?
Does it have other chemicals in it? Is it pure brass? Does it have other, you know, what is the
mix of brass? So, you know, you don't necessarily know, which is why, you know, in reality, what you'd like to do
is you'd like to, before you deploy 100 megawatts of liquid cooling, you'd like to run one megawatt
for two years. But we're looking at a world where we don't have that time. We have to
deploy now. Yeah, certainly not if you're going with any of these NVL racks next year from NVIDIA,
they're all liquid cooled. There's not an air cooled option, at least not that I've seen. And you better be ready if you want those systems. So what about other fluids? We talked about yours
and then the other big one is phase change fluids. I know that one's been a little trickier
in the industry. That was a pretty popular way to
do liquid cooling early on, but it seems maybe a little less in favor based on what you're saying.
Yeah, I mean, so we've done two-phase heat transfer before. We did
another project for DARPA, a solar-powered rocket engine that ran on ammonia. So we're familiar with the technology.
And, you know, years ago we worked with Fluorinert for another project.
But what we see is that the two-phase chemicals, you know,
may or may not be banned soon.
And also we haven't seen people reporting a cold plate performance that's better than a good single phase cold plate.
And if you can buy Fluorinert for 700 bucks a gallon or water for practically nothing, I think you'd want to buy water.
So that's where we see it going.
And the other thing is, is I was talking to government regulators, and one of the first
things they say is, well, this chemical might be dangerous.
Is there a substitute?
And if you say, yes, the substitute is water, then regulating it out of existence is
kind of a no-brainer.
Okay.
And then I guess there's another category when we think about immersion cooling that
seems to have some
popularity in some spots in Europe. I know others are experimenting with it, with
these engineered oils. What's your take on immersion and where that spot could be successful,
or maybe you don't think it can be? Yeah, so actually that's another thing that we have some
experience with. We worked on immersion-cooled inductors for a high-power laser,
and this is a 1,000-volt inductor that needed to be cooled by immersion
because that suppresses the arcs as well as cooling down the electronics.
So we're a bit familiar with that.
And basically immersion with natural convection,
where the heated oil moves upwards, just like a chimney, that works about 40% better than
blowing air. Whereas direct to the chip with a cold plate, that works seven or eight times
better than air. So we see immersion, you know, having a good
opportunity to be used perhaps in edge zones with moderate power chips and servers, that all works
great. But for these high power parts that are coming out, like the two kilowatt part we just
worked on, I don't think it has enough heat capacity to cool those things at a reasonable efficiency level.
Right? Because what you don't want to do is say, well, my single phase immersion system works good,
but we need to start with chilled water to cool chips just below their maximum operating point.
That's not right.
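The multipliers Steve quotes give a quick way to see why the 2 kW parts push you to cold plates. Treating each factor as a ratio of effective thermal conductance relative to forced air, a rough comparison (with an assumed air-cooling limit that is illustrative only) looks like this:

```python
# Rough comparison using the multipliers quoted above: natural-convection
# immersion ~1.4x forced air, direct-to-chip cold plates ~7-8x. For a fixed
# temperature rise, the supportable chip power scales with the same factor.
# The 400 W air baseline is an assumption for illustration.

AIR_LIMIT_W = 400.0
multipliers = {
    "forced air (baseline)": 1.0,
    "single-phase immersion, natural convection": 1.4,   # "about 40% better than blowing air"
    "direct-to-chip cold plate": 7.5,                    # midpoint of the "seven or eight times" figure
}

for method, factor in multipliers.items():
    print(f"{method:<45s} ~{AIR_LIMIT_W * factor:>6.0f} W per chip")
# A 2 kW part lands squarely in cold-plate territory; immersion would need much
# colder facility water to close the gap, which is the efficiency penalty flagged above.
```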
Right. Now that sounds like a difficult battle to maintain because clearly that's where we are today, but the chips are going to get more powerful. There's little doubt in anyone's mind, I think. So do you think, I know you're partial to water with a little bit of additives now, but do you think there's room for engineered fluids or investment in fluids that could be better than water and still
safe or relatively safe? Yeah, I don't know. I mean, you know,
Fluorinert has been around for 50 years at $700 a gallon and it evaporates fast. And if anybody
could come up with a better fluid, you'd think they'd have done so by now, right? So because there's, you know, there's
plenty of opportunities to use that to cool things, particularly in positive pressure systems,
because if it leaks, it just evaporates and doesn't hurt anything. But the cost has always
been a barrier. Right. So maybe not innovation there. Then let's talk about the cold plates,
because that does seem like a spot where innovation matters. We spent a lot of time
with you and your team on cold plate design, the little turbulators inside and how all of that
works. And you were talking a moment ago about quality of materials. And something that makes me nervous is that anytime there is a rush toward a technology,
whether it's AI or in this case, the hardware that lets AI go with these cold plates,
I'm seeing more outreach in the last six months to us about cold plates than, you know, obviously
than ever before. A lot of companies that I've not heard of before, a lot of companies
in Asia that I don't know. And it just makes me a little suspect that just because it's rectangular
and copper and has two hose connections on either side, that the quality is there, and, something I
hadn't considered, that if you're saying it's copper, it really is copper, or whatever the alloy is.
What's your take there in terms of what it takes to make a good cold plate and what
your customers or what the enterprise should be thinking about? Or should they even be thinking
about this? I mean, you're asking a lot for some GPU server admin to be worried about the cold plates that are on his
GPUs that are, as you said, several.
Yeah.
I think I probably mentioned this to you: one of our customers asked us to have the copper analyzed to make sure that
the cold plate copper was indeed pure copper
and that we didn't have a vendor trying to substitute a lower-quality material.
But like I said, the thing about a cold plate, or any part of a liquid cooling system, is the
problems don't show up until it's been in service for a year or two or three. And the worst
possible scenario is, you know, you ship a lot of hardware and then it starts leaking, and now you have to stop shipping, right?
And all those computer chips,
they're like lettuce from the supermarket.
You leave them on the shelf for a while,
the value goes down quickly.
So you really want to make sure that the vendor
that you're working with has experience
so that they know it's going to
last and there isn't some, you know, chemical reaction or, you know, material incompatibility
or something that's going to cause problems down the road. Because we've seen that in a number of
positive pressure systems where it just starts leaking after a while.
And then, you know, you're in the bad position of having to replace some chips,
replace a whole bunch of cold plates. It just slows down the whole process. So, I mean, that's
why we're in the negative pressure business, because we want to make sure that if there are
problems with the system, the customer doesn't have downtime because of it.
Well, talk about the leaks too, because that seems to me is one thing that the industry needs
to get better at, honestly. We've looked at dozens of these systems and the sophistication of
the CDU, the server, other parts of the loop to be able to say, you know, if I've got storage in a
server, I can see this drive is starting to throw errors. I think it's going to fail. I'm going to
tell the customer, I think this drive is going to fail. Let's get it now before it's a problem. Or,
you know, other components can do that, but the liquid loop is not quite there yet in
aggregate. Where is the responsibility or what's your view on where that should take place? Is it
more detection in the server itself with the little strings or whatever that go around the
CPUs to catch the fluid and alert you? Or should the CDUs have better intelligence?
What do you think about the alerting side? Well, as far as the alerting is concerned,
I mean, so far we don't have leaks on servers. So we're not really worried about that too much.
You know, right now what happens is you have a positive pressure system, you have the leak tape
you mentioned, and it detects a leak and then you have to go shut down the server or just take down a rack, whatever,
so that results in some downtime.
What we see is that liquid cooling is a system, and you as the customer, if something goes wrong,
you want to be able to call somebody and say, come out here and fix this thing. You don't want to be in the position of trying to determine if it's a cold plate vendor problem or a CDU vendor problem or a chemical additive vendor problem.
Right. So all these specifications and regulations and standards are still being figured out. So we think the best thing to do is to buy
the system from one vendor so that you make sure that everything works together properly.
I'm sure at some point we'll have everything figured out. But at this point, we still have
customers deploying things that run into problems. So I think you need to think about it as a system that interacts with itself.
So, for example, you may have a material in your CDU,
which is not compatible with the cold plate for some reason.
Or you may have a situation, I think, that we had recently
with one of the other organizations where they didn't ground their server properly and they had, you know, created
a battery in their liquid cooling system by accident and then something corroded and broke.
So there are a lot of things that can go wrong with a liquid cooling system. And most of our
data center operators are not experts in plumbing and chemistry and biology.
So you really need... Why would they be, right? I mean, they haven't had to worry about it until
relatively recently. Right, right. I mean, it's hard enough just trying to keep up with power and
networking and cybersecurity and all the other issues that take up data center people's time. So one other thing that I think was interesting
about your solution that IT people will worry about
is single point of failure.
So one of the things that you showed me
was a configuration with two CDUs and a bunch of,
you tell me what it's called.
We call it a switchover valve. It's just like a switchover relay.
It's a Y valve or something?
Yeah, yeah.
It's just like in a power system,
you'll have AB power and you'll have a relay
that switches over so fast
that the power supply doesn't even notice it.
So, and you know, we don't have to switch over
in 16 milliseconds, but we can switch over pretty fast
so that the servers don't overheat or
even get all that warm when one thing breaks down. So if you think about it, it's like a jet airliner:
you know, if anything goes wrong on that jet airliner, you're just fine. If you happen to have
both engines quit at the same moment, they make a movie about you and you're a hero. So it's not very likely.
So same thing when we look at the data center, we want to have a system where no reasonable
single point of failure is going to cause downtime. So in our systems that are deployed
at Sandia, we have fins on the cold plate. So even if you forget to plug in the liquid cooling,
the server still works. And in other locations, like I said,
we have leak tolerance, we have redundant CDUs
with automatic switchover valves,
and we have the automatic coolant quality control system
to kind of deal with the typical failures
in a liquid cooling system that you're gonna see,
which are leaks, contamination, and corrosion.
Now, a lot of CDU vendors today are selling CDUs
with multiple pumps,
and the customers are happy about that
because it kind of mirrors the servers
with multiple redundant fans inside of them.
But centrifugal pump technology is pretty mature,
and those pumps don't fail very often.
So they've got redundancy built in, but the redundancy
doesn't handle the typical failures that are most likely.
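A minimal sketch of the A/B switchover idea, with hypothetical telemetry fields and thresholds; the real valve acts far faster than a polling loop, so this only illustrates the decision logic, not any vendor's implementation:

```python
# Monitor the active CDU and flip the changeover valve to the standby unit
# before the servers warm up. Field names and limits are assumptions.

class SwitchoverValve:
    def __init__(self):
        self.active = "CDU-A"

    def switch_to(self, cdu: str):
        self.active = cdu
        print(f"Switchover valve now feeding the racks from {cdu}")

def cdu_healthy(telemetry: dict) -> bool:
    # Treat a CDU as unhealthy if flow collapses or the supply coolant runs hot.
    return telemetry["flow_lpm"] > 50.0 and telemetry["supply_temp_c"] < 40.0

def supervise(valve: SwitchoverValve, get_telemetry):
    telemetry = get_telemetry(valve.active)
    if not cdu_healthy(telemetry):
        standby = "CDU-B" if valve.active == "CDU-A" else "CDU-A"
        valve.switch_to(standby)   # fast enough that the chips barely see a temperature bump

valve = SwitchoverValve()
supervise(valve, lambda cdu: {"flow_lpm": 12.0, "supply_temp_c": 33.0})  # lost flow -> flips to CDU-B
```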
Yeah, it's interesting. The dual-CDU concept should get at a lot of those redundancy
concerns that the customers may have. And you talk about some of the national labs and the big data centers that are using liquid cooling,
but it's not just a big data center problem.
How do you think about scale here?
So for enterprises that are adopting GPUs
or have them now,
but know the next generation
is going to require liquid cooling,
what scale does it make sense
for someone to talk to Chilldyne?
So at this point, over a hundred kilowatts. We have people call us all the time and say, I have a server closet with, you know, five kilowatts of heat.
We're like, it's not worth it, right? If you have 10 connections, the chances that something's going to go wrong are pretty small.
But if you have 10,000 connections, you're going to spring a leak at some point.
So that's why we say, you know, you have to be at a certain level.
And it's probably a few racks of high performance compute or AI compute is when you should talk to us.
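The 10-versus-10,000-connections point can be made quantitative with a simple independence assumption; the per-fitting leak probability below is purely illustrative, not measured data:

```python
# If each fitting independently has a small chance of leaking over a 5-year run,
# the chance that at least one leaks grows quickly with the number of fittings.

p_leak_per_fitting = 0.001   # assumed 0.1% chance a given fitting leaks over 5 years

for n_fittings in (10, 1_000, 10_000):
    p_at_least_one = 1 - (1 - p_leak_per_fitting) ** n_fittings
    print(f"{n_fittings:>6d} fittings -> P(at least one leak) ~ {p_at_least_one:.1%}")

# ->     10 fittings: ~1.0%   (a small closet can usually live with that risk)
# ->  1,000 fittings: ~63%
# -> 10,000 fittings: ~100%   (at data-center scale, plan on leaks happening)
```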
It doesn't seem unreasonable, though. I mean, your Fortune 500, obviously, but even further down from that. You mentioned
you saw our video of us heaving one of the eight-way air-cooled GPU servers into our lab this week. And, I mean, the power consumption on eight of those plus
everything else, the fans and switches and storage and CPUs, is robust. But a couple racks of those,
that's not really too far out of the realm for, I don't know, maybe a couple thousand businesses.
So you start to get there pretty quickly. Yeah, I see that happening. We have a design
for a rack scale CDU, but we never quite got our customers to say, well, if you build a 50 or 100 kilowatt one, I'll buy, you know, 30, 40, 100 of them.
We never got to that point.
Everybody just said, well, it would be nice if you did a rack scale CDU, but we can't really tell you what the power level has got to be.
So that's why we have, you know, a 300 kilowatt CDU and we have a megawatt scale
one in development now.
Okay, so you've got a larger one that you said is in development. That was the next thing: how small can you start? You've addressed that. How big can
this go before you want to start chunking this out into multiple CDU loops? Or how do you
think about that at much larger scale?
Yeah, so it's kind of interesting, because what
ends up being the limitation, at least in my view, is the pipes in the data center.
So if you have a megawatt of liquid cooling, that's about a four-inch pipe to
conduct, you know, that 400 gallons per minute of water. You might be able to
squeeze it into a smaller
pipe, but now you're spending a lot of energy pushing fluid through a pipe. But once you try
to go above that, now the pipes just get ridiculous. And you really don't want to have to
have a crane or a forklift moving around six-inch, you know, pipes around sensitive computers,
you know, it's just, it's just nuts. So what we see, you know, one to two megawatts is pretty much as big
as a CDU is likely to get.
You know, and that way, if someone has a megawatt rack
or half a megawatt rack, we can still cool that.
But there's going to become a point where it's just, you know,
it's getting ridiculous in terms of the power too.
So if you think about chips today,
a two-kilowatt chip is running roughly 2,000 amps at one volt at the chip level. And that's a ridiculous amount of current. So I think we're starting to run into
some electrical limits that are going to hit before we run into cooling limits.
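The pipe and flow figures line up with basic heat-transport math. A quick worked calculation, assuming a 10 °C supply-to-return rise and a typical pipe velocity (both assumptions for illustration, not Chilldyne specifications):

```python
# How much water does 1 MW of heat need, and what pipe does that imply?

import math

heat_w = 1_000_000.0     # 1 MW of heat to carry away
delta_t_c = 10.0         # assumed supply-to-return temperature rise
cp = 4186.0              # J/(kg*K), specific heat of water
rho = 1000.0             # kg/m^3, density of water

mass_flow = heat_w / (cp * delta_t_c)        # kg/s
vol_flow = mass_flow / rho                   # m^3/s
gpm = vol_flow * 15850.3                     # m^3/s -> US gallons per minute

velocity = 3.0                               # m/s, an assumed design velocity limit
area = vol_flow / velocity                   # required pipe cross-section, m^2
diameter_in = 2 * math.sqrt(area / math.pi) / 0.0254

print(f"Flow: ~{gpm:.0f} GPM, pipe bore: ~{diameter_in:.1f} inches")
# -> roughly 380 GPM and about a 4-inch pipe, in line with the figures above
```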
How about limits on dealing with the heat itself then?
I mean, you talked about the cold plates and cooling the gear,
but we still have to do something with the heat, right?
Yeah.
So, I mean, we are aligned with Supermicro on this.
Use a cooling tower if you possibly can.
I mean, there's some areas where water is scarce
and you don't want to use a cooling tower.
You have to use a chiller.
But we think that using water is better than, you know, burning carbon to make electricity to run a chiller.
You know, in many places you can use a cooling tower part of the year and a chiller, you know,
just a few weeks of the year when it's really hot and/or humid. And that's kind of the best way
to operate. And the other
thing to think about is if you're building a liquid cool data center is we don't know what
the next chips are going to require in terms of temperature. It may be that you can run your GPUs
20% faster if you run them colder. So I would always, you know, run with a cooling tower if
you can, but leave room and power there for a chiller because you might need it someday.
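The cooling-tower-first rule reduces to a simple check against the outdoor wet-bulb temperature, since a tower can only cool water to roughly wet bulb plus an approach margin. The setpoint and approach values below are assumptions for illustration:

```python
# Choose free cooling when the tower can hit the required facility-water
# temperature, otherwise fall back to the chiller.

def facility_water_source(wet_bulb_c: float,
                          required_supply_c: float = 32.0,    # assumed facility-water setpoint
                          tower_approach_c: float = 4.0) -> str:  # assumed tower approach
    """Pick the economizer mode for the current weather."""
    achievable_c = wet_bulb_c + tower_approach_c
    if achievable_c <= required_supply_c:
        return "cooling tower (free cooling)"
    return "chiller (mechanical cooling for the hot, humid hours)"

for wet_bulb in (10.0, 22.0, 29.0):   # cool day, warm day, hot and humid day
    print(f"wet bulb {wet_bulb:>4.1f} C -> {facility_water_source(wet_bulb)}")
```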
Yeah, well, I mean, you've got the ambient air.
You've got the liquid that you're extracting that has the heat in it.
I mean, it's a delicate balance.
I mean, there's probably quite a bit of math more than ever that goes into data center design that used to be just power and
square footage. I mean, the data centers now that were designed that way, even just four or five
years ago, the densities are kind of funny when you look in there and see these racks that are
half empty because there's just no more power at the rack for them.
Right. And we're quoting projects that are 1500 watts per square foot, which is just completely nuts.
Right. And the other thing that you think about is, in order to really, you know, optimize the efficiency of these data centers,
you want to be able to understand, you know, the cooling tower performance versus the GPU performance versus the fan speed control algorithm
inside the CDU. There's just a lot of stuff that goes into this. And, you know, we have some math
models that we use. And, you know, you may want to put in more liquid cooling, run the chip colder,
because it uses less power when it's running colder, particularly if you can get, you know, cooling tower water in the
wintertime, which is nice and cold, right? So, yeah, there's a lot of things in the liquid
cooling business that break some of the standards in the data center world. Like in the old days,
HVAC guys would come in, do their business, and then leave, and they would never even
talk to the IT installation folk. And now
they have to work together to get all this stuff up and running. Well, that complexity changes the
consumption model a little bit of IT gear as well, because now if I'm a customer and I'm buying
GPU servers from Supermicro or Dell or HPE or Gigabyte or whomever it is,
as the customer, where do I go? Who do I rely on to help coach me on this stuff? Because if I know
you and I come and we talk math and that stuff, that's great. But I want ultimately my hardware provider to have a good handle on this too, without throwing your partners
under the bus. How are the traditional infrastructure guys doing in terms of
understanding these challenges and then helping their customers make intelligent decisions with
their money? Yeah. I mean, I know there's a lot of people working on it, and there's very few people...
It doesn't seem real favorable.
Well, no. The very few people that have all the scars and have figured it out are not telling everybody how they figured it out, because they want to keep that as an edge over their competitors.
And the other thing is the national lab guys can help because they've deployed this stuff.
But, you know, you have to actually spend some time and track them down and talk to them. And a lot of them
are happy to tell you how things went when they deployed liquid cooling, but they've got a lot
of systems to keep running, so they don't have an infinite amount of time. So I think that a big
problem is that you have not too many people who know what's going on. You have a lot of little
tiny companies that are trying to get into the liquid cooling business because they've read the Wall Street Journal
articles as well. So they're saying anything they can to close the deal because they don't
want to run out of money. So there's just an awful lot of noise out there.
Well, that's what I'm worried about, quite honestly. And like I said, we're getting more
outreach from companies I've never heard of.
Not to say that they don't have the pedigree.
Everyone starts somewhere.
And a lot of these guys are coming in from aerospace and other industries where it's
made sense to cool these components.
Totally get that.
I'm just thinking if I'm sitting here as a CIO or CTO for an organization, I know this is coming,
but maybe my VAR is not real up on it or the channel guy I'm working with, or maybe the
supplier of my servers, I don't have the greatest confidence in them. As we close out here, what's
your advice to that person, or to the IT admin that's being tasked with providing recommendations?
What do these guys do to try to unlock those secrets and make it a good buying decision?
So my recommendation is if you're not doing liquid cooling now, start doing it as fast as
possible with whoever's stuff. Get some experience under your belt, because what I've seen so far is we have hyperscalers that are planning, you know,
20, 30, 100 megawatt liquid-cooled data centers that have never done a 1
megawatt liquid-cooled installation, or even a 50 kilowatt, you know, 100 kilowatt
installation. And, you know, I like to say that that's like you show up at the
airport and say, I want to learn to fly, but I'm not going to bother with that Cessna stuff. I want to go right to fighter jets.
Yeah.
What could go wrong?
That's what Flight Simulator taught everyone. You can do that. It's all right.
So get a few servers, you know, stick on some cold plates, get a CDU, start running it, get some experience under
your belt. Cause what you don't want to do is go from zero to, you know, three or $4 billion for
the servers and then have them go down because something went wrong that you didn't anticipate.
'Cause, you know, right now what we see is the NVL racks from
NVIDIA are coming out soon, but what we haven't seen is a set of really tight, solid specs on
what does the water temperature need to be, what does the water chemistry and quality need to be.
And until we get those numbers, it's hard to select a liquid cooling system for those
racks. So what we expect to happen is they're going to start shipping a lot of these racks,
and everybody's going to call up people like us and CoolIT and others and say,
hey, we want to buy some stuff yesterday. And we're going to say, well, you know, our lead time is 16 weeks right now. But, you know, for some other companies,
it's 52 weeks. So, you know, you really want to get some experience with this stuff as soon as
possible. I think it's a brilliant piece of advice. And I mean, it goes with other buying modalities. Nobody switches their storage
or their server vendor willy-nilly. They get a couple in, they mess with them for weeks or
months or whatever, and then at some point make some sort of pivot. So taking the home lab style
of approach of just getting something in, getting some reps in and learning what you don't
know, that's a strong piece of advice, and I love it. So we've got a full report on
Chilldyne on the website, but the two videos are really, really good. We'll link to those in the
description. We'll link to Chilldyne's website. Steve, this is a great talk. Thanks for doing
this. Good to see you again. And I'm excited to watch where you guys go with this. This will be fun.
Yeah. And thanks for all the great questions. I really appreciate it.