StorageReview.com Podcast #134: Leak-Proof Liquid Cooling

Episode Date: September 25, 2024

Chilldyne Podcast – liquid cooling solutions for data centers focusing on leak-proof design and…

Transcript
Starting point is 00:00:00 Hey everyone, welcome to the podcast. Brian Beeler here with Storage Review. And today we've got a conversation about liquid cooling. Everyone knows that that's one of the hottest, if you'll let me use that term, technologies in the enterprise right now. And a couple weeks ago, our team was out in Carlsbad, California with Chilldyne, spent the day there and learned everything there is to learn about CDUs, negative pressure, cold plates, turbulators, and everything else that goes into a really good CDU design and cold plate design for GPU servers, for high-end compute servers, and I guess even rocket engines if you're into that sort of thing.
Starting point is 00:00:43 With me today is Steve Harrington with Chilldyne. Steve, what do I even call you? CEO? Is that what you are? I don't think you've got a title. Yeah, I'm the CEO. Okay, well, we didn't get into the formalities when I was out there. So we've got Steve, the CEO, with us today. Steve, how are you doing? Fantastic. So I mentioned rockets briefly, but you're in the cold plate business now. And I guess maybe you've always been in that or for a while anyway. How do you get from rockets to data center cooling? So actually, I started in cooling for supercomputers back in like 1987, when I was a grad student. And then it got real easy with CMOS. So I got into all
Starting point is 00:01:26 kinds of other things, aerospace, medical devices, and so on and so forth. And then, you know, more recently, suddenly the power from the chips is so much that you needed liquid cooling again. So I've taken all the know-how that I've developed over the years in all these other fields and applied it to the liquid cooling problem. Well, go back to the old days, if you will, because data centers now tend to be apprehensive about the notion of bringing liquid into the data center for cooling, and we can talk about that. But if you go back, it was there in mainframes and other solutions. Why did liquid leave?
Starting point is 00:02:11 Well, yeah, I think IBM had liquid in 1964. And in the 80s, Unisys, Burroughs had liquid cooling as well. But what happened was we went from a bipolar transistor in the olden days to the CMOS transistor that we use today. And at that time, people said, wow, the speed's never going to be over 100 megahertz and the power is never going to be a problem. So, you know, all you thermal design engineers can just kick back. And so I moved on, right, and started working in aerospace, on the cooling of, you know, jet engines, and then I got back into it, like I said, recently. But at first, when CMOS came out, there was a dramatic reduction in chip power. To be fair, though, they were right for four decades or so. Yeah, they were.
Starting point is 00:03:01 Yeah, exactly, 40 years is long enough. Yeah, it wasn't exactly wrong. I mean, that's a decent horizon in tech speak, right? Exactly. And so when you went to aerospace, what was the challenge there? And was the technology similar in terms of cold plates, or was it something else entirely? So we did liquid cooling for an airplane, the Northrop Grumman UAV, the Hunter, I don't know, about 10 years ago. And then we also did cooling for ASML's laser, kind of the previous generation of laser they used for making chips. And we also did liquid cooling for rocket engines as part of a DARPA project to develop a new kind of rocket fuel pump. So when I get to the liquid cooling for chips and data centers, the heat intensity isn't as challenging as aerospace,
Starting point is 00:04:02 but the need for longevity, it has to run for four or five years straight, and the need for uptime, that is the challenge in the data center business. Yeah, it's an interesting parallel. And you did that work for DARPA, but now you've got ARPA-E funding as part of your initiative. So maybe talk about that bridge too, and then we'll get into the technology a little bit more. Yeah. So a few years ago, Peter de Bock from ARPA-E called me up and wanted to talk about data center cooling. And then that led to us applying for a grant and getting it. Under that grant, we developed a cold plate for a two kilowatt chip, and now we're in the position of bidding on projects that have two kilowatt chips inside of them. So that was good foresight on Dr. de Bock's part to figure
Starting point is 00:04:54 out that two kilowatts is where we're heading. And then right now we're continuing to work for ARPA-E on automatic coolant quality, so you don't have to worry about sampling your coolant and putting in new chemicals all the time. And then we're planning on doing a comparative deployment of our liquid cooled hardware with the Texas Advanced Computing Center on a rack of GPU servers there. We'll be looking at performance and efficiency of air versus liquid cooled racks of high density GPU servers. And TACC does some really cool stuff and they're willing to try just about anything. So is this the first large study of air versus liquid systems that you've done in the way they're approaching it? Yeah. I mean, what I've seen, particularly with
Starting point is 00:05:45 hyperscalers, is that, you know, it's very hard to do large scale testing of liquid cooling, because if you have a high density liquid cooled rack, that might be $2 million or even more. So a test of liquid cooling is something you have a hard time getting your CFO to sign off on, right? Because it's production, but it's also experimentation. And there's not a good column on that spreadsheet for them to put it in. So working with TACC, Sandia, Los Alamos, these are places that are willing to take a risk with new technology, try it out in production, and then share the results of what happens with everyone. Well, the last part's the key, right? Because if it's Meta or X or any of these other
Starting point is 00:06:32 huge consumers of GPUs, whatever they learn, they're less likely, and not to disparage Meta, they're part of OCP and they do a great job there, but they're not going to give away more secret sauce than they need to on that front. And I think TACC tends to publish this data, as you were saying. Right, they're not going to tell anybody their best technologies. They tend to keep it close to their chest. So, yeah, on the commercial side. Exactly. So that'll be interesting data.
Starting point is 00:07:07 And for those that want to learn more about Chilldyne, we'll talk about some of the elements on this podcast. But if you want the deep dive, we do have a paper and we've got two videos on YouTube. And Steve, we were with your team all day and we ended up with so much content. We did two different videos. And the response that we're seeing has been really favorable. People love your design that's essentially leakless. And so we don't have to do a whole deep dive into the technology, but we should address the negative pressure issue here. Could you talk a little bit about that and where that technology comes from and why that's so special in the design that you guys use? So what we do, which is pretty unique, is we have negative pressure on both supply and return so that if there's a leak anywhere in the system, air bubbles go in.
Starting point is 00:08:00 You don't spill coolant on the chips. It just reduces the risk of deploying liquid cooling by an enormous amount. And then the other thing that's kind of special about our system is when there is a leak, the system keeps working. Because what we found is that if you make a system that tolerates leaks, when there is a leak, the customers don't fix it. The O-rings get old, maybe a fitting cracks, and they just don't bother fixing it if it keeps working. So that's been our mission: to make a system that works even if it has leaks. Because I know there's a gamer system out there that runs your all-in-one loop
Starting point is 00:08:39 under negative pressure so it won't leak. But once it leaks, you're done. You've got to fix it. And that isn't really suitable for a data center, because you're going to have thousands upon thousands of connections and any one of those could leak. And when you run it full blast for five years, the chances that something's going to leak are 100%. Well, so talk about that in terms of the negative pressure bit. Is there data out there, or have you seen data, or are you guys working on it maybe with TACC, around the impact on component wear and tear in negative pressure versus the more common pressurized systems? Yeah, we're trying to get that data. So actually, ARPA-E is working with the University of Maryland to try to collect all that data in terms of what the reliability effects of liquid cooling are on data centers in
Starting point is 00:09:33 general. But in a lot of cases, people, you know, don't want to talk about it because sometimes, you know, they might have to admit that they made a mistake. So one of the common mistakes that we see is that liquid cooling customers don't take care of the coolant properly. And then either it gets bacteria in it or it starts attacking the seals. And then at that point, you know, it's kind of the liquid cooling vendor's fault because they didn't, you know, make the system kind of idiot-resistant. And it's also the data center operator's fault. So nobody wants to talk about the mistakes they made, which tends to keep those mistakes confined to a conversation over a beer at the conference. So everybody knows about it, but nobody's talking
Starting point is 00:10:18 about it, which I think makes the fears among customers worse. Absolutely. And anything that makes it secretive is certainly not good for overall industry adoption. And you were talking about the fluid quality. You have a chemist on staff. We met him, a very chemist-y-looking fellow who was looking at all the water samples. And something that I didn't appreciate is the biology: the biome here in Cincinnati, where I am, is very different from where you are in California, versus Europe, versus anywhere else. And the stuff in the water that contaminates the bio bits is different. And to think about that impacting your liquid loop was something that I hadn't considered. Talk about that a
Starting point is 00:11:16 little bit in terms of the biodiversity and how you have to engineer for that. Yeah. So for example, IBM has published a spec on their liquid cooling water system that they recommend. Um, and you know, they try to clean the fluid loop as best they can, but pretty much we're all covered with bacteria. As soon as we touch something, um, it's not sterile anymore, you know? Um, and nobody in the data center business knows anything about sterile procedures. So you're going to get bacteria in the system. If you put PG-25 in, you're going to slow it down considerably. But still, what we see is over a period of time, the water gets cloudy and then various slightly soluble chemicals or bacteria build up. And then that can slow down the system.
Starting point is 00:12:08 It can make your chips run hot or it can clog it up entirely. So that's why we have an automatic coolant quality control system, so we can make sure that even if the customer doesn't do the right thing with the water or the additives, the system will issue an alarm and they won't be able to ignore it. So in your CDU, when you look at that front door, there are a bunch of little containers there, two or three or something, that have the chemicals, and the system self-balances. And the one thing, as I was saying, that we didn't appreciate when thinking about the water itself is that when you get that contamination, those little bits can gum up the fins on your heat sinks, the turbulator that agitates the water and helps with the cooling or the heat dissipation. And it's really a big deal.
Starting point is 00:13:05 And I guess once you get to that point and you start to gum up the system, especially in the cold plate itself, there's not really a mechanism to clean it, right? If it gets that bad, you have to kind of replace it. Yeah, so you really want to keep a good eye on it. And if the temperature starts to go up,
Starting point is 00:13:22 you need to aggressively look for the problem and get it under control. So one of our customers, we told them to put in the coolant additive. They said, that's too expensive. We're not going to bother. And then a few months later, they said, hey, all of our GPUs are starting to get hot. And so for them, we came up with a regimen of chemicals to put through the system to bring it back to life. And then that was OK. But in some cases, once that cold plate is fully clogged, then you can't get any chemicals through it. So you really have to monitor things. And the minute the average temperature of your chip starts to go up, you need to investigate it quickly.
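To make the advice above concrete, here's a minimal monitoring sketch in the spirit of "watch the average chip temperature and investigate any upward drift." The telemetry function and thresholds are hypothetical placeholders, not Chilldyne's actual controller logic.

```python
# Minimal sketch of "watch your average chip temperature" as an early-warning signal.
# read_chip_temps() and the thresholds are hypothetical placeholders, not a product API.
from collections import deque
from statistics import mean

WINDOW = 24 * 60        # rolling baseline, e.g. one sample per minute for a day
DRIFT_ALARM_C = 3.0     # alarm if the current average creeps this far above the baseline

history = deque(maxlen=WINDOW)

def check_coolant_health(read_chip_temps):
    """Return an alarm message if average chip temperature is drifting upward."""
    current_avg = mean(read_chip_temps())      # e.g. [gpu0_c, gpu1_c, ...] from telemetry
    alarm = None
    if len(history) == history.maxlen:
        baseline = mean(history)
        if current_avg - baseline > DRIFT_ALARM_C:
            alarm = (f"Average chip temp {current_avg:.1f} C is "
                     f"{current_avg - baseline:.1f} C above baseline; "
                     "check coolant quality and possible cold plate fouling.")
    history.append(current_avg)
    return alarm
```

The point is that fouling shows up as a slow drift relative to a baseline rather than a sudden threshold breach, so a trend check catches it far earlier than an absolute over-temperature alarm.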
Starting point is 00:14:04 Just stay on top of it. Yeah. It makes you think a little more strategically when you bring these solutions into your data center. It's not just a point solution of: the thing is hot, put this in, make this investment, now it's cold and everything's well again. It is a little more than that in terms of maintenance and managing it and making sure that it's operating properly. But you said PG-25. I want to come back to this. A lot of people, when we think about liquid cooling systems, get concerned about the liquid being used. And in your case, it's water, like regular water with a little bit of additive that can still go down the drain without any
Starting point is 00:14:45 EPA guys showing up at your door in case you dump your CDU for whatever reason. Talk about that versus PG-25 for those that don't know. And then let's pause there and then talk about two-phase for a minute. Okay. So with what we do, we put in a little bit of antibacterial, about, you know, three or four parts per million, and then a little bit of anti-corrosion, maybe at 40, 50 parts per million. And both of these are chemicals that you might find as a fertilizer for crops or as a chemical that would be in a swimming pool that you'd swim in. So they're not particularly toxic, and you can dump a lot down the drain without anybody even noticing. Now with
Starting point is 00:15:32 PG-25, we actually worked with a local brewery, and they would occasionally have their system burst and pour a whole bunch of PG-25 into a local creek, which caused everybody to get mad at them. So the nice thing about PG-25 is it doesn't freeze, so it's really good for the all-in-one systems that the gamers use, because you can ship it and you don't have to worry about it freezing and expanding while it's in shipment. Right, exactly. And the nice thing about it is it slows down or stops the growth of bacteria. But we have seen some issues where the PG, the propylene glycol, attacks the seals, and then the seals start leaking, which isn't good.
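For a rough sense of scale on the additive levels mentioned above (a few ppm of biocide, tens of ppm of corrosion inhibitor), here is a back-of-the-envelope dosing calculation. The loop volume and exact targets are illustrative assumptions, not Chilldyne's recipe.

```python
# Rough dosing math for low-ppm water additives.
# For dilute aqueous solutions, 1 ppm by mass is roughly 1 mg per liter of water.
# The loop volume and ppm targets below are illustrative assumptions only.

def additive_grams(loop_liters: float, target_ppm: float) -> float:
    """Grams of additive needed to reach target_ppm in a water loop."""
    return loop_liters * target_ppm / 1000.0   # mg -> g

loop_liters = 400.0   # hypothetical CDU plus rack loop volume
print(f"biocide   @ ~4 ppm : {additive_grams(loop_liters, 4):.1f} g")
print(f"inhibitor @ ~50 ppm: {additive_grams(loop_liters, 50):.1f} g")
# roughly 1.6 g and 20 g for a 400 L loop, i.e. gram-scale quantities, which is
# why a spill is closer to swimming pool chemistry than to a hazmat event.
```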
Starting point is 00:16:19 So that's why we're using, you know, the water with just a light touch of additives. And that's what we came up with talking to chemists, you know, who have worked on things like airplane freshwater systems and cooling systems for powerful lasers and stuff like that. And in fact, so when you bring a CDU into your environment the first time, you're just filling it up with facility water, right? You'll add your chemicals to it, but there's nothing in the run-up that's particularly exotic? Yeah. What we ask is that the customer give us some distilled water or reverse osmosis water, which is filtered tap water, because we don't want to have to worry about what the tap water is in every different location. So we're starting with water that has a very low concentration of minerals in it. And then we add our chemicals and we're good to go. And then periodically what we'll do is we'll drain a little bit down the drain and fill a little bit more and basically refresh the coolant, put in new chemicals, and you're kind
Starting point is 00:17:20 of starting all over brand new. Because what we've seen is that, you know, with water-based chemistry, maybe after a year or so, it starts to get, you know, a buildup of chemicals in it and gets cloudy. And maybe after three years or so, the PG-25 starts to get cloudy as well. So one of the issues with the liquid cooling business in general is you have companies that are used to procuring electronics, and with electronics, if it works for two weeks, it's probably going to work for five years. Whereas with liquid cooling systems, that's not the case at all. If it works for two weeks, it may last five years, but it may last six months. You just don't know. It has to do with the chemistry, your heating and cooling
Starting point is 00:18:05 up the water, the contact with the materials. In some cases, you don't know: is that pure copper? Does it have other chemicals in it? Is it pure brass? Does it have other, you know, what is the mix of brass? So, you know, you don't necessarily know, which is why, in reality, what you'd like to do is, before you deploy 100 megawatts of liquid cooling, you'd like to run one megawatt for two years. But we're looking at a world where we don't have that time. We have to deploy now. Yeah, certainly not if you're going with any of these NVL racks next year from NVIDIA; they're all liquid cooled. There's not an air cooled option, at least not that I've seen. And you better be ready if you want those systems. So what about other fluids? We talked about yours, and then the other big one is phase change fluids. I know that one's been a little trickier
Starting point is 00:19:03 in the industry. That was a pretty popular way to do liquid cooling early on, but it seems maybe a little less in favor based on what you're saying. Yeah, I mean, so we've done two-phase heat transfer before. We did another project for DARPA that was a solar-powered rocket engine that ran on ammonia. So we're familiar with the technology. And, you know, years ago we worked with Fluorinert for another project. But what we see is that the two-phase chemicals, you know, may or may not be banned soon. And also we haven't seen people reporting cold plate performance that's better than a good single phase cold plate.
Starting point is 00:19:50 And if you can buy Fluorinert for 700 bucks a gallon or water for practically nothing, I think you'd want to buy water. So that's where we see it going. And the other thing is, I was talking to government regulators, and one of the first things they say is, well, this chemical might be dangerous. Is there a substitute? And if you say, yes, the substitute is water, then regulating it out of existence is kind of a no-brainer. Okay.
Starting point is 00:20:18 And then I guess there's another category when we think about immersion cooling that seems to have some popularity in some spots in Europe. I know others are experimenting with it, with these engineered oils. What's your take on immersion and where that spot could be successful, or maybe you don't think it can be? Yeah, so actually that's another thing that we have some experience with. We worked on immersion-cooled inductors for a high-power laser, and this is a 1,000-volt inductor that needed to be cooled by immersion because that suppresses the arcs as well as cooling down the electronics.
Starting point is 00:21:01 So we're a bit familiar with that. And basically immersion with natural convection, where the heated oil moves upwards, just like a chimney, that works about 40% better than blowing air. Whereas direct to the chip with a cold plate, that works seven or eight times better than air. So we see immersion, you know, having a good opportunity to be used perhaps in edge zones with moderate power chips and servers, that all works great. But for these high power parts that are coming out, like the two kilowatt part we just worked on, I don't think it has enough heat capacity to cool those things at a reasonable efficiency level.
Starting point is 00:21:48 Right? Because what you don't want to do is say, well, my single phase immersion system works well, but we need to start with chilled water to cool chips just below their maximum operating point. That's not right. Right. Now that sounds like a difficult battle to maintain, because clearly that's where we are today, but the chips are going to get more powerful. There's little doubt in anyone's mind, I think. So do you think, I know you're partial to water with a little bit of additives now, but do you think there's room for engineered fluids or investment in fluids that could be better than water and still safe or relatively safe? Yeah, I don't know. I mean, you know, Fluorinert has been around for 50 years at $700 a gallon, and it evaporates fast. And if anybody could come up with a better fluid, you'd think they'd have done so by now, right? Because there are, you know, plenty of opportunities to use that to cool things, particularly in positive pressure systems,
Starting point is 00:22:52 because if it leaks, it just evaporates and doesn't hurt anything. But the cost has always been a barrier. Right. So maybe not innovation there. Then let's talk about the cold plates, because that does seem like a spot where innovation matters. We spent a lot of time with you and your team on cold plate design, the little turbulators inside and how all of that works. And you were talking a moment ago about quality of materials. And something that makes me nervous is that anytime there is a rush toward a technology, whether it's AI or in this case, the hardware that lets AI go with these cold plates, I'm seeing more outreach in the last six months to us about cold plates than, you know, obviously than ever before. A lot of companies that I've not heard of before, a lot of companies
Starting point is 00:23:45 in Asia that I don't know. And it just makes me a little suspicious, because just because it's a rectangle of copper with two hose connections on either side doesn't mean the quality is there. And that's something I hadn't considered: if you're saying it's copper, is it really copper, or whatever the alloy is supposed to be? What's your take there in terms of what it takes to make a good cold plate and what your customers or what the enterprise should be thinking about? Or should they even be thinking about this? I mean, you're asking a lot for some GPU server admin to be worried about the cold plates that are on his GPUs, which are, as you said, several. Yeah.
Starting point is 00:24:30 I think I probably mentioned this to you: one of our customers asked us to have the copper analyzed to make sure that the cold plate copper was indeed pure copper and we didn't have a vendor trying to substitute a lower quality material. But like I said, the thing about a cold plate, or any part of a liquid cooling system, is the problems don't show up until it's been in service for a year or two or three. And the worst possible scenario is, you know, you ship a lot of hardware and then it starts leaking, and now you have to stop shipping, right? And all those computer chips,
Starting point is 00:25:10 they're like lettuce from the supermarket. You leave them on the shelf for a while, the value goes down quickly. So you really want to make sure that the vendor that you're working with has experience so that they know it's going to last and there isn't some, you know, chemical reaction or, you know, material incompatibility or something that's going to cause problems down the road. Because we've seen that in a number of
Starting point is 00:25:41 positive pressure systems where it just starts leaking after a while. And then, you know, you're in the bad position of having to replace some chips, replace a whole bunch of cold plates. It just slows down the whole process. So, I mean, that's why we're in the negative pressure business, because we want to make sure that if there are problems with the system, the customer doesn't have downtime because of it. Well, talk about the leaks too, because that seems to me to be one thing that the industry needs to get better at, honestly. We've looked at dozens of these systems and the sophistication of the CDU, the server, other parts of the loop to be able to say, you know, if I've got storage in a
Starting point is 00:26:29 server, I can see this drive is starting to throw errors. I think it's going to fail. I'm going to tell the customer, I think this drive is going to fail. Let's get it now before it's a problem. Or, you know, other components can do that, but the liquid loop is not quite there yet in aggregate. Where is the responsibility or what's your view on where that should take place? Is it more detection in the server itself with the little strings or whatever that go around the CPUs to catch the fluid and alert you? Or should the CDUs have better intelligence? What do you think about the alerting side? Well, as far as the alerting is concerned, I mean, so far we don't have leaks on servers. So we're not really worried about that too much.
Starting point is 00:27:19 You know, right now what happens is you have a positive pressure system, you have the leak tape you mentioned, and it detects a leak and then you have to go shut down the server or just take down a rack, whatever, so that results in some downtime. What we see is that liquid cooling is a system, and you as the customer, if something goes wrong, you want to be able to call somebody and say, come out here and fix this thing. You don't want to be in the position of trying to determine if it's a cold plate vendor problem or a CDU vendor problem or a chemical additive vendor problem. Right. So all these specifications and regulations and standards are still being figured out. So we think the best thing to do is to buy the system from one vendor so that you make sure that everything works together properly. I'm sure at some point we'll have everything figured out. But at this point, we still have
Starting point is 00:28:19 customers deploying things that run into problems. So I think you need to think about it as a system that interacts with itself. So, for example, you may have a material in your CDU that's not compatible with the cold plate for some reason. Or you may have a situation like one we had recently with another organization, where they didn't ground their server properly and they had, you know, created a battery in their liquid cooling system by accident, and then something corroded and broke. So there are a lot of things that can go wrong with a liquid cooling system. And most of our data center operators are not experts in plumbing and chemistry and biology.
Starting point is 00:29:06 So you really need... Why would they be, right? I mean, they haven't had to worry about it until relatively recently. Right, right. I mean, it's just enough trying to keep up with power and networking and cybersecurity and all the other issues that take up data center people's time. So one other thing that I think was interesting about your solution that IT people will worry about is single point of failure. So one of the things that you showed me was a configuration with two CDUs and a bunch of, you tell me what it's called.
Starting point is 00:29:42 We call it a switchover valve. It's just like a switchover relay. It's a Y valve or something, right? Yeah, yeah. It's just like in a power system, you'll have A/B power and you'll have a relay that switches over so fast that the power supply doesn't even notice it.
Starting point is 00:29:57 So, you know, we don't have to switch over in 16 milliseconds, but we can switch over pretty fast so that the servers don't overheat or even get all that warm when one thing breaks down. So if you think about it, it's like a jet airliner. You know, if anything goes wrong on that jet airliner, you're just fine. If you happen to have both engines quit at the same moment, they make a movie about you and you're a hero. So it's not very likely. So same thing when we look at the data center: we want to have a system where no reasonable single point of failure is going to cause downtime. So in our systems that are deployed
Starting point is 00:30:36 at Sandia, we have fins on the cold plate. So even if you forget to plug in the liquid cooling, the server still works. And in other locations, like I said, we have leak tolerance, we have redundant CDUs with automatic switchover valves, and we have the automatic coolant quality control system to kind of deal with the typical failures in a liquid cooling system that you're gonna see, which are leaks, contamination, and corrosion.
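As an illustration of the A/B redundancy idea described here, a minimal failover loop might look like the sketch below. The status and valve interfaces are hypothetical, not Chilldyne's firmware; it just shows the "switch to the standby CDU when the active one degrades" logic.

```python
# Minimal sketch of A/B CDU failover via a switchover valve. The CDU status and
# valve-actuation interfaces are hypothetical placeholders, not a vendor API.
import time
from dataclasses import dataclass

@dataclass
class CDUStatus:
    supply_temp_c: float
    flow_lpm: float
    vacuum_ok: bool          # a negative-pressure loop watches vacuum instead of leak tape

def healthy(s: CDUStatus, max_supply_c: float = 45.0, min_flow_lpm: float = 100.0) -> bool:
    return s.vacuum_ok and s.supply_temp_c <= max_supply_c and s.flow_lpm >= min_flow_lpm

def failover_loop(read_status, select_cdu, poll_s: float = 1.0):
    """Poll the active CDU and flip the switchover valve to the standby unit if it degrades."""
    active, standby = "A", "B"
    while True:
        if not healthy(read_status(active)) and healthy(read_status(standby)):
            select_cdu(standby)               # actuate the Y / switchover valve
            active, standby = standby, active
        time.sleep(poll_s)
```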
Starting point is 00:31:03 Now, a lot of CDU vendors today are selling CDUs with multiple pumps, and the customers are happy about that because it kind of mirrors the servers with multiple redundant fans inside of them. But centrifugal pump technology is pretty mature, and those pumps don't fail very often. So they've got redundancy built in, but the redundancy
Starting point is 00:31:27 doesn't handle the typical failures that are most likely. Yeah, it's interesting. The dual CDU concept should get at a lot of those redundancy concerns that the customers may have. And you talk about some of the national labs and the big data centers that are using liquid cooling, but it's not just a big data center problem. How do you think about scale here? So for enterprises that are adopting GPUs or have them now, but know the next generation
Starting point is 00:31:57 is going to require liquid cooling, at what scale does it make sense for someone to talk to Chilldyne? So at this point, over a hundred kilowatts. We have people call us all the time and say, I have a server closet with, you know, five kilowatts of heat. We're like, it's not worth it. Right. If you have 10 connections, the chances that something's going to go wrong are pretty small. But if you have 10,000 connections, you're going to spring a leak at some point. So that's why we say, you know, you have to be at a certain level. And it's probably a few racks of high performance compute or AI compute when you should talk to us.
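The "10 connections versus 10,000 connections" point is just compounding probability. A quick sketch, using a made-up per-connection annual leak rate purely for illustration:

```python
# Why connection count dominates leak risk: compounding a small per-fitting probability.
# The per-connection annual leak probability is a made-up figure for illustration.
p_per_connection = 1e-4   # assumed chance a given fitting leaks in a year

def p_at_least_one_leak(n_connections: int, p: float = p_per_connection) -> float:
    return 1.0 - (1.0 - p) ** n_connections

for n in (10, 1_000, 10_000):
    print(f"{n:>6} connections -> {p_at_least_one_leak(n):.1%} chance of at least one leak per year")
# 10 connections -> ~0.1%, 10,000 connections -> ~63%: at data center scale a leak
# somewhere is essentially guaranteed, which is the argument for leak tolerance.
```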
Starting point is 00:32:46 It doesn't seem unreasonable, though, given, I mean, the Fortune 500, obviously, but even further down market. I mean, you mentioned you saw our video of us heaving one of the eight-way air cooled GPU servers into our lab this week. And, I mean, the power consumption on eight of those, plus everything else, the fans and switches and storage and CPUs, is robust. But a couple racks of those, that's not really too far out of the realm for, I don't know, maybe a couple thousand businesses. So you start to get there pretty quickly. Yeah, I see that happening. We have a design for a rack-scale CDU, but we never quite got our customers to say, well, if you build a 50 or 100 kilowatt one, I'll buy, you know, 30, 40, 100 of them. We never got to that point. Everybody just said, well, it would be nice if you did a rack-scale CDU, but we can't really tell you what the power level has got to be.
Starting point is 00:33:41 So that's why we have, you know, a 300 kilowatt CDU, and we have a megawatt-scale one in development now. And so, okay, you've got a larger one that you said is in development. That was the next thing: how small can you start? You addressed that. How big can this go before you want to start chunking this out into multiple CDU loops, or how do you think about that at much larger scale? Yeah, so it's kind of interesting, because what ends up being the limitation, at least in my view, is the pipes in the data center. So if you have a megawatt of liquid cooling, that's about a four inch pipe to conduct, you know, that 400 gallons per minute of water. You might be able to
Starting point is 00:34:24 squeeze it into a smaller pipe, but now you're spending a lot of energy pushing fluid through a pipe. But once you try to go above that, now the pipes just get ridiculous. And you really don't want to have a crane or a forklift moving six inch pipes around sensitive computers, you know, it's just nuts. So what we see is one to two megawatts is pretty much as big as a CDU is likely to get. You know, and that way, if someone has a megawatt rack or half a megawatt rack, we can still cool that. But there's going to come a point where it's just, you know, getting ridiculous in terms of the power too.
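The "a megawatt is about 400 gallons per minute through a four-inch pipe" figure falls out of a basic heat balance. Here's a hedged worked check, assuming roughly a 10 °C loop temperature rise and a 2.5 m/s pipe velocity; those are generic rules of thumb, not Chilldyne's published design numbers.

```python
# Back-of-the-envelope check on "1 MW ~ 400 GPM ~ a four-inch pipe".
# Assumes ~10 C coolant temperature rise and ~2.5 m/s pipe velocity (generic
# rules of thumb, not a vendor spec).
import math

Q_W       = 1_000_000   # heat load, watts
DELTA_T_C = 10.0        # coolant temperature rise across the IT load, C
CP        = 4186.0      # specific heat of water, J/(kg*K)
RHO       = 1000.0      # density of water, kg/m^3
VELOCITY  = 2.5         # target pipe velocity, m/s

mass_flow = Q_W / (CP * DELTA_T_C)            # kg/s
vol_flow  = mass_flow / RHO                   # m^3/s
gpm       = vol_flow * 15850.3                # m^3/s -> US gallons per minute

area        = vol_flow / VELOCITY             # required pipe cross-section, m^2
diameter_in = 2 * math.sqrt(area / math.pi) / 0.0254

print(f"{gpm:.0f} GPM through roughly a {diameter_in:.1f} inch pipe")
# -> about 380 GPM and a bit over 4 inches, in line with the numbers above.
```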
Starting point is 00:34:57 So if you think about chips today, a two kilowatt chip at the circuit board level is running roughly 2,000 amps at one volt at the chip level. And that's a ridiculous amount of current. So I think we're starting to run into some electrical limits that are going to hit before we run into cooling limits. How about limits on dealing with the heat itself, then? I mean, you talked about the cold plates and cooling the gear, but we still have to do something with the heat, right?
Starting point is 00:35:32 Yeah. So, I mean, we are aligned with Supermicro on this: use a cooling tower if you possibly can. I mean, there are some areas where water is scarce and you don't want to use a cooling tower, so you have to use a chiller. But we think that using water in a cooling tower is better than, you know, burning carbon to make electricity to run a chiller.
Starting point is 00:36:31 You know, in many places you can use a cooling tower part of the year and a chiller, you know, just a few weeks of the year when it's really hot and/or humid. And that's kind of the best way to operate. And the other thing to think about, if you're building a liquid cooled data center, is we don't know what the next chips are going to require in terms of temperature. It may be that you can run your GPUs 20% faster if you run them colder. So I would always, you know, run with a cooling tower if you can, but leave room and power there for a chiller, because you might need it someday. Yeah, well, I mean, you've got the ambient air.
Starting point is 00:37:17 You've got the liquid that you're extracting that has the heat in it. I mean, it's a delicate balance. I mean, there's probably more math than ever that goes into data center design, which used to be just power and square footage. I mean, the data centers that were designed that way, even just four or five years ago, the densities are kind of funny when you look in there and see these racks that are half empty because there's just no more power at the rack for them. Right. And we're quoting projects that are 1,500 watts per square foot, which is just completely nuts. Right. And the other thing that you think about is, in order to really, you know, optimize the efficiency of these data centers,
Starting point is 00:38:06 you want to be able to understand, you know, the cooling tower performance versus the GPU performance versus the fan speed control algorithm inside the CDU; there's just a lot of stuff that goes into this. And, you know, we have some math models that we use. And, you know, you may want to put in more liquid cooling and run the chip colder, because it uses less power when it's running colder, particularly if you can get, you know, cooling tower water in the wintertime, which is nice and cold, right? So, yeah, there are a lot of things in the liquid cooling business that break some of the standards in the data center world. Like in the old days, HVAC guys would come in, do their business, and then leave, and they would never even talk to the IT installation folks. And now
Starting point is 00:39:06 they have to work together to get all this stuff up and running. Well, that complexity changes the consumption model of IT gear a little bit as well, because now if I'm a customer and I'm buying GPU servers from Supermicro or Dell or HPE or Gigabyte or whomever it is, as the customer, where do I go? Who do I rely on to help coach me on this stuff? Because if I know you and I come and we talk math and that stuff, that's great. But ultimately I want my hardware provider to have a good handle on this too, without throwing your partners under the bus. How are the traditional infrastructure guys doing in terms of understanding these challenges and then helping their customers make intelligent decisions with their money? Yeah. I mean, I know there's a lot of people working on it.
Starting point is 00:40:07 It doesn't seem real favorable. Well, no, it's just that the very few people that have all the scars and have figured it out are not telling everybody how they figured it out, because they want to keep that as an edge over their competitors. And the other thing is, the national lab guys can help because they've deployed this stuff. But, you know, you have to actually spend some time and track them down and talk to them. And a lot of them are happy to tell you how things went when they deployed liquid cooling, but they've got a lot of systems to keep running, so they don't have an infinite amount of time. So I think that a big problem is that you have not too many people who know what's going on. You have a lot of little tiny companies that are trying to get into the liquid cooling business because they've read the Wall Street Journal
Starting point is 00:40:07 articles as well. So they're saying anything they can to close the deal because they don't want to run out of money. So there's just an awful lot of noise out there. Well, that's what I'm worried about, quite honestly. And like I said, we're getting more outreach from companies I've never heard of. Not to say that they don't have the pedigree. Everyone starts somewhere. And a lot of these guys are coming in from aerospace and other industries where it's made sense to cool these components.
Starting point is 00:41:30 Totally get that. I'm just thinking, if I'm sitting here as a CIO or CTO for an organization, I know this is coming, but maybe my VAR is not real up on it, or the channel guy I'm working with, or maybe the supplier of my servers, I don't have the greatest confidence in them. As we close out here, what's your advice to that person, or to the IT admin that's being tasked with providing recommendations? What do these guys do to try to unlock those secrets and make it a good buying decision? So my recommendation is, if you're not doing liquid cooling now, start doing it as fast as possible with whoever's stuff. Get some experience under your belt, because what I've seen so far is we have hyperscalers that are planning, you know,
Starting point is 00:42:15 20, 30, 100 megawatt liquid cooled data centers that have never done a 1 megawatt liquid cooled installation, or even a 50 kilowatt or 100 kilowatt installation. And, you know, I like to say that's like you show up at the airport and say, I want to learn to fly, but I'm not going to bother with that Cessna stuff. I want to go right to fighter jets. Yeah. What could go wrong? That's what flight simulator taught everyone. You can do that. It's all right. So get some servers, you know, stick on some cold plates, get a CDU, start running it, get some experience under your belt. Because what you don't want to do is go from zero to, you know, three or $4 billion for
Starting point is 00:43:10 the servers and then have them go down because something went wrong that you didn't anticipate. Because, you know, right now what we see is the NVL72 racks from NVIDIA are coming out soon, but what we haven't seen is a set of really tight, solid specs on what the water temperature needs to be and what the water chemistry and quality need to be. And until we get those numbers, it's hard to select a liquid cooling system for those racks. So what we expect to happen is they're going to start shipping a lot of these racks, and everybody's going to call up people like us and CoolIT and others and say, hey, we want to buy some stuff yesterday. And we're going to say, hey, well, you know, our lead time is 16 weeks right now. But, you know, for some other companies,
Starting point is 00:44:02 it's 52 weeks. So, you know, you really want to get some experience with this stuff as soon as possible. I think it's a brilliant piece of advice. And I mean, it goes with other buying modalities. Nobody switches their storage or their server vendor willy-nilly. They get a couple in, they mess with them for weeks or months or whatever, and then at some point make some sort of pivot. So taking the home lab style of approach, just getting something in, getting some reps in and learning what you don't know, that's a strong piece of advice. And I love it. So we've got a full report on Chilldyne on the website, but the two videos are really, really good. We'll link to those in the description. We'll link to Chilldyne's website. Steve, this is a great talk. Thanks for doing
Starting point is 00:44:02 this. Good to see you again. And I'm excited to watch where you guys go with this. This will be fun. Yeah. And thanks for all the great questions. I really appreciate it.
