Podcast Archive - StorageReview.com - Podcast #136: AI Is Forcing Data Centers to Go Liquid

Episode Date: April 25, 2025

Explore how 2MW liquid cooling is transforming AI data centers with new levels of…

Transcript
Starting point is 00:00:00 Hey everyone, welcome to the podcast. Brian Beeler here and I'm with Luca in the CoolIT booth at Data Center World in DC. And this is the first time I think in a while we've tried to do one of these pods at an event. It's pretty quiet here, but I'll apologize in advance if you hear vacuum cleaners or other pre-expo opening noises. We're standing in front of your brand new CDU. You guys have launched this this week. What's going on with this system? Yeah, so fortunately this right here is our brand new CHX2000 liquid-to-liquid CDU. It is capable of 2 megawatts of heat rejection, so it is essentially our most powerful CDU that we offer in our broad portfolio today. And it's in one of the most compact form factors and footprints in the
Starting point is 00:00:51 marketplace. So this is all stainless steel, four inch piping, Victaulic couplings, 750 millimeters by 1200. So it's a standard rack, lots of accessibility on both the front and the back. Right. And we're quite proud of it. It's definitely stuffing more power into the same footprint and gives us I think an advantage over the competition. And for anyone that's just listening, in the description in the show notes we'll have a link to an article that Harold's put together on this new CDU so you can
Starting point is 00:01:22 see everything about it, get all the specs and we'll have some nice photos of it as well. Visually, I mean I know you don't ship these with plexiglass on one side, but visually it's quite stunning. The amount of metal in here is incredible and all the pumps and everything. What are the challenges with a CDU? You guys have been in the game for what, a decade plus at this point? Probably longer. I'm sure I'm shaving some time off. Yeah, so I mean we started with rack-based CDUs
Starting point is 00:01:49 initially, so obviously in the smaller form factor, the ones that would sit integrated into the factory racks. But yeah, with the row-based units here, I mean, essentially the problems are still the same. The approach temperature, I think, is a critical thing for us. It's really about understanding and having a system that has a high-efficiency heat exchanger and really powerful pumps to be able to keep that approach temperature really low and ensure that you can cool effectively all of the units
Starting point is 00:02:15 that are under its power and under its control. So it's kind of interesting as we think about direct-to-chip liquid, right? NVIDIA is a big proponent of this now and has been driving this with their massive TDPs. But Intel, AMD, everyone else is right in there as well as they continue to push these chips. And it seems like liquid was a little fringy,
Starting point is 00:02:39 maybe what, even just two and a half, three years ago. I mean, you were certainly doing it. It was big in HPC, but the enterprise wasn't there. With all these new NVL systems though, there is no choice. And so organizations are going to have to get comfortable if they want these big GPU systems with liquid cooling of some form or another. And of course you guys are leading the way
Starting point is 00:02:58 with this direct-to-chip. But talk a little bit, we're here at Data Center World, what's the sense of the people here in terms of what their comfort level is with liquid cooling? I mean, I would say generally it's still relatively new. I mean, in our space, obviously we've been doing it for some time, we were at the NVIDIA GTC show
Starting point is 00:03:20 and we were actually showcasing a wall of some of our liquid cooling innovations going back from 2009, so some of those to just kind of show the progression of time. But in this particular environment, like with Data Center World, I think a lot of the questions that we're getting are more from people who are, as you've mentioned, kind of newer to the liquid cooling space. We definitely are getting a broad spectrum of questions. We had a session yesterday which talked a little bit about selecting the CDU.
Starting point is 00:03:46 And so that was a session that was really designed to kind of give people an understanding of the things to look for and how to start entering and participating in this market that, you know, as you've mentioned, is going to become more and more prevalent, right? So as AI becomes one of the dominant sort of compute use cases. It's all liquid-cooled. And even on the enterprise side of things, we're starting to see it as well. Certainly all the server OEMs are starting to create like liquid-cooled CPU-based units as well.
Starting point is 00:04:17 So that is not exclusively AI. And so it's just kind of getting them up to speed with that and used to the differences between air and liquid in terms of the maintenance. We retrofitted an R760 last year with some of your cold plates and you guys sent us the little little baby in rack CDU which is not how most people will deploy liquid cooling but it's a fun starter actually. And when we did that, we did nothing but liquid cool the CPUs, and I think we saved 200 watts of power while increasing the performance a couple points
Starting point is 00:04:51 on those CPUs. That was one server. If you do that at a rack scale, even just compute forgetting about GPUs entirely, that's an amazing savings. Yeah, absolutely. Yesterday, actually, I was fortunate enough to take in a session here at the show,
Starting point is 00:05:05 which was, I think it was someone from the University of Chicago, if I remember right, and they were talking about their experience with, it was a two-phase solution, but it was the same sort of thing. I mean, it was using liquid effectively to cool their servers, and doing a power savings study against that.
Starting point is 00:05:21 And I think the number that he came up with in his particular use case was 37% savings for the lab that they were testing in. So pretty substantial. So obviously there's a lot of, you know, the environmental aspects up here and all of those things that you've talked about in terms of, like, you know, the smaller performance improvements as well. Just keeping the system happy. You don't have to run as many fans and all that great stuff. The fans are the big thing, right? Anytime we're talking about liquid cooling, it's gotta be about reducing energy consumption
Starting point is 00:05:49 in that server. Most of that energy consumption, we think about it as being driven by the silicon, but the fans, like the 25 to 30% in some data I've seen out of Lenovo and others, it's pretty wild in terms of what you can save when we pin those fans back, or in some cases remove them entirely.
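To put rough numbers behind the fan discussion, here is a back-of-the-envelope sketch using the figures mentioned in passing above: roughly 200 watts saved per dual-CPU server in the R760 retrofit, and fans drawing something like 25 to 30% of an air-cooled server's power. Every input is an illustrative assumption, not measured data from CoolIT or Lenovo.

```python
# Back-of-the-envelope rack-level savings from direct liquid cooling.
# All inputs below are illustrative assumptions pulled from the conversation.

servers_per_rack = 20
saving_per_server_w = 200            # CPU cold plates plus slower fans, per the R760 anecdote
rack_saving_w = servers_per_rack * saving_per_server_w
print(f"Per-rack saving: {rack_saving_w / 1000:.1f} kW")          # 4.0 kW

# Fan-power framing: if fans are 25-30% of an air-cooled server's draw,
# pinning them back (or removing them) recovers most of that slice.
server_draw_w = 800                  # assumed air-cooled 2U server under load
for fan_fraction in (0.25, 0.30):
    print(f"Fan share at {fan_fraction:.0%}: {server_draw_w * fan_fraction:.0f} W per server")

# Annualized energy for the rack-level saving
hours_per_year = 8760
print(f"~{rack_saving_w / 1000 * hours_per_year:,.0f} kWh per rack per year")   # ~35,000 kWh
```

Even before GPUs enter the picture, that per-rack delta is the kind of math pushing the server OEMs toward liquid-cooled CPU SKUs.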
Starting point is 00:06:06 Right, absolutely. And yeah, that's sort of the bread and butter. I mean, we've taken from the super compute space where they have 100% fanless solutions, obviously in those blades. And that's where we leveraged a lot of our expertise and history from inside the IT, now moving into the infrastructure
Starting point is 00:06:23 and everything that kind of supports that. It's always wild going into these labs that are heavily liquid cooled how quiet they are and how cool they are because you're dealing with the heat so much more efficiently. It feels kind of like cheating right? I'm used to going into the lab and having the servers scream, especially all these one-use systems that are that are really popular in the HPC and AI spaces. Right, and the challenge with obviously all of the flux. So, you know, as the load comes on, the computer spools up, all the... Yeah, it's...
Starting point is 00:06:53 I don't know, it takes a strong person to get used to that day in and day out, for sure. Sort of a nerdy white noise, I think, for some of us. It's funny because I just posted a social video last week where I was looking at something, I think one of the big SSDs from Solidigm, they've got these 122s now, which are insane, but I shot this little video about it, and some of the comments on Instagram
Starting point is 00:07:17 was what's that noise? And I'm like, wait, do people not know what a data center sounds like? And they really don't. We're talking to so many young software designers, AI guys, that have never even seen the systems that they're interacting with, which to me is pretty nuts to consider.
Starting point is 00:07:33 Yeah, you're right. And again, as all these racks move up in thermal design power and density, you're going to just get more and more need to reject that heat somewhere, right? And so if you were to power it with fans, I can't imagine, you know, what the decibel rating would be. Oh my gosh, it'd be wild. It'd be wild. And with liquid you're gonna get better density out of your racks and now the challenge is, as
Starting point is 00:07:57 you introduce systems like this, we still have to get power to the racks. I mean, when we look at older data centers and as we look at enterprises that are bringing these big systems in and trying to figure out how to manage that, what do power to the racks. I mean, when we look at older data centers and as we look at enterprises that are bringing these big systems in and trying to figure out how to manage that, what do you see your responsibility in terms of helping your customers kind of figure out how to optimize their racks? I mean, that might be a little outside of your core scope,
Starting point is 00:08:17 but you guys are kind of leading the vanguard there. Yeah, definitely. I mean, fundamentally, I think one of the things that liquid obviously does is reduce power consumption. And you'll always have that kind of trade-off, right, in terms of as densities and design power goes up and up and up, you know, you want to find ways to minimize the use of that energy, right? So using it towards heat rejection by air is obviously not as effective or as efficient as liquid. And you know, so you want to reduce the amount of energy you're actually using to offset and reject that heat.
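The effectiveness gap being described here falls out of the basic heat balance Q = mdot * cp * dT: for the same heat load and the same temperature rise, you have to move vastly less liquid than air. A minimal sketch with approximate fluid properties; the rack load, temperature rise, and 2 MW figure are used only for illustration.

```python
# How much fluid has to move to carry a given heat load? Q = m_dot * cp * dT.
# Fluid properties are approximate textbook values; loads and dT are illustrative.

FLUIDS = {
    "air":  {"cp": 1005.0, "rho": 1.2},     # J/(kg*K), kg/m^3, near room temperature
    "PG25": {"cp": 3900.0, "rho": 1020.0},  # ~25% propylene glycol in water
}

def volumetric_flow_m3_s(load_w: float, delta_t_k: float, fluid: str) -> float:
    """Volume flow needed to absorb load_w with a delta_t_k coolant temperature rise."""
    props = FLUIDS[fluid]
    mass_flow_kg_s = load_w / (props["cp"] * delta_t_k)
    return mass_flow_kg_s / props["rho"]

rack_w, dt_k = 50_000, 10.0                  # a 50 kW rack, 10 K rise
air = volumetric_flow_m3_s(rack_w, dt_k, "air")
liquid = volumetric_flow_m3_s(rack_w, dt_k, "PG25")
print(f"air:  {air:.2f} m^3/s (~{air * 2119:.0f} CFM)")          # ~4.1 m^3/s
print(f"PG25: {liquid * 60_000:.0f} L/min")                      # ~75 L/min

# At CDU scale: coolant flow implied by 2 MW with the same 10 K rise.
print(f"2 MW: ~{volumetric_flow_m3_s(2_000_000, dt_k, 'PG25') * 60_000:.0f} L/min")
```

Roughly three orders of magnitude less volume has to move, which is where the fan energy, the noise, and a lot of the delivery headaches go away.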
Starting point is 00:08:51 And so as those go up, we'll have that juxtaposition between the two. Where liquid is obviously one of the ways that allows you to do that. And it's probably the most effective means that we have right now. So you know, we continue to expect that to grow in dominance over the data center space as these power-hungry GPUs and CPUs continue to consume that energy. And anything that we can do to be a part of reducing the reliance on energy, because we certainly are going
Starting point is 00:09:17 to have some power challenges over the next little while as all these data centers come online. No doubt. And we're trying to do our little part to help that out. Half the guys in here are dealing with power delivery and all these challenges. And I think there's a 42-foot truck over there that's a giant generator and all sorts of things trying
Starting point is 00:09:37 to bring that power generation to these data centers. That's a massive challenge. And yeah, to your point, you're helping drive down some of that wattage by being more efficient in the way you're cooling it. When you think about fluids, though, you've mentioned that a couple times, I want to be clear about that. You guys are just using water in these systems for the most part. Yeah, I mean, our systems don't use a refrigerant, so we use single-phase direct-to-chip cooling.
Starting point is 00:10:02 It is a PG25, so a propylene glycol 25% solution, primarily water with propylene glycol, plus biocides and fungicides, et cetera. Which is one thing that people don't consider. They just look at these loops. And if you think about a PC that might have a liquid loop on it, most of us don't really worry about growing little creatures or fungus in there, but the water is not perfect. And of course you've got to account for that, right? And, you know, as the water heats up, I mean, it makes for an ideal
Starting point is 00:10:33 environment for a host of these things, right? If you've ever looked inside the radiator in your car, for example, you'll see that over time you can see bits of corrosion, and it'll reduce sort of like the ability for the water to transfer cleanly through there. And that is part of the reason as well that you would use your antifreeze in your vehicle. So we're leveraging the same types of solutions to do that, and obviously we're trying to ensure that the water has the proper chemistry and is clean to continue flowing through. The microchannels that sit in the cold plates are also very, very fine.
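One nuance of the PG25 point: the glycol that keeps the loop biologically and chemically stable also dilutes water's heat capacity a little, so the same load needs slightly more flow. A quick comparison using approximate property values (treat the exact numbers as assumptions):

```python
# Approximate coolant properties near typical loop temperatures.
coolants = {
    "water": {"cp": 4180.0, "rho": 995.0},   # J/(kg*K), kg/m^3
    "PG25":  {"cp": 3900.0, "rho": 1020.0},  # ~25% propylene glycol solution
}

load_w = 1500.0      # one liquid-cooled GPU-class device, illustrative
delta_t_k = 10.0     # allowed coolant temperature rise through the cold plate

for name, p in coolants.items():
    lpm = load_w / (p["cp"] * delta_t_k) / p["rho"] * 60_000
    print(f"{name}: {lpm:.2f} L/min for {load_w:.0f} W at a {delta_t_k:.0f} K rise")
# PG25 needs a few percent more flow than plain water for the same load;
# that small penalty buys biological and corrosion protection for the loop.
```

The glycol also nudges viscosity up, which matters most in those very fine cold plate microchannels, and it's another reason the loop chemistry and filtration get so much attention.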
Starting point is 00:11:05 That's what I was gonna say, right? Is we look at these four inch pipes and you think about, well, a little bit of green goo won't make any difference, but it's when you get to the cold plate where you lose the efficiency if those things start to get clogged up, right? Absolutely, right?
Starting point is 00:11:18 And so that's exactly what we're trying to prevent and that's why all of these systems come with additional filtration. We're down to 25 micron, so we're really trying to make sure that we're maximizing the performance of the entire solution, the entire system, not just the CDUs, not just the cold plates, but really just keeping it optimized in a way that keeps the system happy and powering that compute. You talked about single phase, and that's just liquid.
Starting point is 00:11:43 It's not boiling, it's not doing anything. And just to be clear to the audience, there's also a two phase, which is another way to do this. You guys aren't in that business or maybe you're evaluating it. What's the deal with two phase? Yeah, so I mean, I think in our space, we are definitely the leaders in single phase direct to chip cooling. That doesn't mean, however, that we haven't evaluated other technologies. You've been to the Liquid Lab, and so in our Liquid Lab we actually do, we've been doing a lot of testing and validation on two phase for some time now.
Starting point is 00:12:14 So we continue to keep abreast of all the sort of trends. We've actually published a paper, though, and when you look at it from a system, a holistic system standpoint, we've found that there's still a lot of runway for single-phase direct-to-chip cooling. I know there's some folks here at the show that have said some things that might disagree with me, but from our engineering standpoint we believe that there's still a lot of runway in that, and the two-phase solution, in the testing that we've done, hasn't inherently been sufficiently
Starting point is 00:12:45 better that it makes sense to move to a refrigerant-based solution today. So not to say that we're not looking at it, just not today. And immersion I guess is the other popular technique, right, that works for different people at different scales. What do you think about that? I mean again, obviously I'm biased, but I would say the key thing for us is that immersion comes with obviously a benefit, but a lot of downside and challenge. The market today has started making a bit more of a bet
Starting point is 00:13:16 towards direct-to-chip cooling. I think Nvidia has said that their roadmap is really geared around direct-to-chip for the foreseeable future. And so long as that's what the market is driving towards, that's where our focus will continue to be. You talked about the headroom a little bit there in terms of what can be done. You guys announced a couple weeks ago, I think, too, a 4,000 watt plate. I mean, that's wild.
Starting point is 00:13:41 Right. Yeah. How does, I mean, just to ask a naive question, how does that even work to be able to offload that much heat in one plate? Well, with a lot of really good engineering, right? You know, the reality is, again, like we've been in this space for some time, and so we've developed a lot of expertise around all the sort of tips and tricks to do this. And, you know, effectively, ultimately it's still the thermal properties of copper
Starting point is 00:14:07 are kind of like what we work with. But we have designed this as a prototype, really just to kind of again show that there is still capability within this in the right use cases. We created a thermal test vehicle, so basically a series of heaters, to replicate the size of basically a B200, an NVIDIA GPU chip. And so from that we were able to kind of prove out that we still had the ability to reject heat to 4,000 watts or 4 kilowatts.
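To give a feel for what rejecting 4,000 watts through one cold plate implies, here is a simple thermal-budget sketch. The temperatures, flow, and resistance targets are illustrative assumptions, not CoolIT's published numbers for the prototype.

```python
# Simple thermal budget for a very high power cold plate.
# Every number below is an illustrative assumption.

power_w = 4000.0            # heat from the thermal test vehicle
coolant_in_c = 35.0         # coolant temperature entering the cold plate
case_limit_c = 85.0         # allowable case temperature for the device

# Case-to-coolant thermal resistance the cold plate has to achieve:
r_required_k_per_w = (case_limit_c - coolant_in_c) / power_w
print(f"Required case-to-coolant resistance: {r_required_k_per_w * 1000:.1f} mK/W")  # 12.5 mK/W

# Coolant temperature rise through the plate at a given flow (PG25-like properties)
cp_j_per_kg_k, rho_kg_per_m3 = 3900.0, 1020.0
flow_lpm = 10.0                                      # illustrative per-plate flow
mass_flow_kg_s = flow_lpm / 60_000 * rho_kg_per_m3
print(f"Coolant rise through the plate: {power_w / (mass_flow_kg_s * cp_j_per_kg_k):.1f} K")  # ~6 K
```

Holding the overall resistance down in the low tens of milli-kelvin per watt at sane flow rates is where the microchannel geometry and the thermal properties of copper that Luca mentions do the heavy lifting.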
Starting point is 00:14:29 And so from that we were able to kind of prove out that we still had the ability to reject heat to 4,000 watts or 4 kilowatts. So it's an interesting challenge though that you face because obviously you partner with all the big silicon guys, but at the same time, they announce things at a cadence that's pretty rapid and maybe before things are really shipping. So you guys have to be pretty nimble, I suppose, because you're designing all along the way, but things still change all the way up to the end. Yeah, I mean, no different than any other business. We have our challenges, right? So as we progress further upstream, working with the silicon manufacturers, we do see that you know their designs change, right? So it might
Starting point is 00:15:08 be mid-flight and version 3 turns to version 4 turns to version 6 pretty quickly, right? So you know we, I wouldn't say we struggle, we work alongside them to continue to do the same thing. It's the same challenges that everybody else has, so we're all on equal footing and I think we're just trying to adapt to that insatiable sort of demand and speed that the AI market is really driving today. So talk a little bit about the buying process for this gear. I've seen your stuff come through, like Dell, for instance. I know they've been a partner of yours on the cold plates, manifolds, all that kind
Starting point is 00:15:44 of stuff in the CDUs. But for somebody that wants to consume a rack level CDU like this, are they going to you guys? Are they going to your partners? Kind of what's the motion there? And kind of also, can you talk a little bit about what the decision tree is? How do I buy a CDU?
Starting point is 00:16:02 It's outside the typical IT motion. Right. Yeah, so one of the things that we do, I think, really well is we are an end-to-end solution. So we understand how cooling and heat rejection works inside with the cold plate, so inside the IT, all the way through to the infrastructure required to size and plan and properly scope out your needs from a CDU perspective. We typically sell most of our products direct to market. We do do direct factory integration at the cold plate level with a lot of the OEMs. So we would sell direct to the OEMs, they would do that installation direct in one of
Starting point is 00:16:39 their factories and so the cold plates get integrated directly at the factory under an OEM warranty. But these types of units are typically bought directly from CoolIT. We have a professional service group that helps support kind of like the engineering, design, sizing, install, commissioning, preventative maintenance, etc. So we run the whole spectrum of everything that's required to support direct liquid cooling. And when you bring in liquid for the first time into an organization,
Starting point is 00:17:07 IT obviously has concerns over that. The facilities management has some concerns. What do data center administrators have to do to embrace this? Do they have a headcount that's dedicated to managing this? Does it fall under someone else's umbrella? Who manages this thing? Typically the CDUs, for example, would be more of the infrastructure people. So the folks at the data center are the ones that are typically making design decisions. Sometimes it comes in conjunction with, you know, if you're a colo, depending on the clients and the customers, they may also have their sort of like required vendors. So each customer and each data center
Starting point is 00:17:44 might have its own slightly different nuance to the way that they actually like manage and buy the equipment. But typically the CDUs and everything done inside the facility and that infrastructure is done by the facilities people. So the data center folks themselves directly are the ones doing a lot of the purchasing
Starting point is 00:18:01 and maintenance, et cetera. So when you consider the fact that you could pair systems with your cold plates with CDUs that are yours or competitors, conceivably, right? What's the industry done in terms of standardization? Because I've been in a lot of labs and seen manifolds that connect us to servers and AI systems that aren't the same as manifolds in other places. And are we at the point where you can deliver out of any CDU to any manifold?
Starting point is 00:18:34 Or kind of what's going on there? Yeah, I mean, I would say there's a pretty substantial push to try to have interoperability. Yeah, OCP has got a working group for liquid storage or liquid cooling, right? Yeah, I would say OCP has actually been the leader in terms of trying to bring together a common standard and a common framework. And you know, I think as a participant, as a vendor in that space, like obviously we have our own preferred ways.
Starting point is 00:19:01 There are some challenges that I think we haven't quite solved in terms of when it comes to interoperability in terms of warranty, for example, right? So how do you determine what component or piece is at fault if there's some kind of warranty claim? Or how can you ensure that all the wetted materials are compatible with each other? Because we all use a slightly different sort of secret recipe for the way that we design our products.
Starting point is 00:19:24 And so I don't think that has quite landed yet, but I think definitely we're starting to see more involvement from groups like OCP, from the large OEMs and the manufacturers of the silicon themselves, also getting more involved in determining what's required to support them. Well, it's got to get universal, right? Because I can't be dropping in servers from one vendor and then servers from another vendor, go to the manifold and be like, yo, these don't plug in. Like, that's got to get better. Yeah, I would think so. I mean, we're all looking forward towards a day
Starting point is 00:19:55 where that's the case. Well, it's in your best interest, right? I could see why if you're an infrastructure company in IT where you may have wanted to put up a little wall around your investments in liquid cooling, but at this point, I mean, being open, I think is the right way to go. Yeah, again, we wanna be a participant and a leader
Starting point is 00:20:15 and just have a voice in some of those decisions that are being made, because we do think we have an experience set that's able to contribute to that conversation. And so, so long as the decisions are being made, we think are beneficial for the industry as a whole, we're all on board with that. And so we'll see a drive towards more standardization, I would expect in the near future or, you know, sort of the immediate term here.
Starting point is 00:20:38 So what are the challenges going forward? You've got this massive new CDU, but what are you guys looking at a year, two, three years out at a high level? What are the challenges going to be? Yeah, I think the same challenges we've been trying to solve over the last little while, right? Just hotter and hotter? Yeah, just hotter and hotter. I mean, I think, you know, in the cooling space, I think the challenges remain the same, like thermal management of these incredibly dense, hot 1400, 1600, 2000 watt systems. And so that is, I don't think that's going to change. I think that's actually the same problems we're gonna have to solve.
Starting point is 00:21:16 I don't think we've run up against the constraints of material science or engineering yet. There's a lot of people that have a lot of great ideas. I think we'll start to see more of maybe a mix of different types of technologies being used to try to achieve those kinds of results. I think the bigger issue is still on the power side. So I'm glad that we operate on the cooling side because I think the first thing that will stare us
Starting point is 00:21:40 in the face is the power aspects. But yeah, we're working really hard to continue to support that ecosystem and do the things that we've done for the last 24 years. What can we do operationally to advance liquid cooling, whether it's smart leak detection or reduce flow to a system that could be compromised and be able to communicate? Because that's one of the things, right, is if we look at a server, for instance,
Starting point is 00:22:02 that has out-of-band management, I can shut it down if I want to, if I could detect some sort of leakage or something and have some actions take place. Is that kind of stuff progressing? Definitely. So in our advanced technologies group, we are evaluating a series of technologies to try to have smart sensors and smart leak detection.
Starting point is 00:22:26 We've had some customers ask us for different things about how could we isolate specific problem areas, et cetera. And I think you'll see that across the entire system as we start seeing that. We're starting to see more demand for redundant sensors and maybe even AI can actually come in and start supporting some of that calculation and evaluation.
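As a sketch of the kind of automated response being discussed, the control flow is simple even though the real integrations vary widely: watch a leak or flow sensor, and on a fault, isolate or gracefully shut down the affected nodes through their out-of-band management. The sensor endpoint and addresses below are hypothetical placeholders; the shutdown step uses a standard ipmitool soft power-off purely as one example of an out-of-band action.

```python
# Hypothetical sketch: react to a leak-detection event with an out-of-band soft shutdown.
# The sensor URL, payload shape, and BMC list are placeholders; adapt them to whatever
# the CDU or rack manager actually exposes (SNMP, Redfish, Modbus, vendor APIs, etc.).
import subprocess
import time

import requests

LEAK_SENSOR_URL = "http://cdu.example.local/api/leak-status"   # hypothetical endpoint
AFFECTED_BMCS = ["10.0.0.21", "10.0.0.22"]                      # BMCs of servers on that loop

def leak_detected() -> bool:
    resp = requests.get(LEAK_SENSOR_URL, timeout=5)
    resp.raise_for_status()
    return bool(resp.json().get("leak", False))                 # hypothetical payload field

def soft_shutdown(bmc_ip: str) -> None:
    # Ask the host to power down cleanly via IPMI over LAN.
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_ip,
         "-U", "admin", "-P", "password", "chassis", "power", "soft"],
        check=True,
    )

if __name__ == "__main__":
    while True:
        if leak_detected():
            for bmc in AFFECTED_BMCS:
                soft_shutdown(bmc)
            break
        time.sleep(10)
```

In practice the CDU side would also throttle or isolate flow to the affected manifold, which is exactly the kind of smaller-area isolation described here.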
Starting point is 00:22:48 But we're definitely looking at getting more targeted in terms of, like, making sure that when these systems run on really large loops across clusters of technology, we can isolate those problems into smaller areas that can be problem-solved and repaired quickly if there is an issue. All right, well clearly, I mean, looking around the show floor here,
Starting point is 00:23:10 this is a big concern, the liquid loops, the CDUs, the cooling, the power, all this is really important. And as the show floor sort of starts to open up and gets louder, I just heard my first vacuum. I think we're gonna stop now before we risk losing all of the audio quality. But like I said, I've got links in the description here and we'll have a little bit up on storagereview.com
Starting point is 00:23:30 if you want to learn more about the CDU, the 4,000-watt cold plate and all the innovation that CoolIT is driving. Thanks for doing this. Appreciate it.
