Podcast Archive - StorageReview.com - Podcast #138: Solidigm Liquid-Cooled SSDs

Episode Date: June 6, 2025

Providing a look at the latest in liquid-cooled SSDs from Solidigm. Brian invited Cody…

Transcript
Starting point is 00:00:00 Hey everyone, welcome to the podcast. I've got with me Cody from Solidigm, who actually is part of the team that did something pretty crazy at GTC. They decided that cold plates and SSDs need to go together, and so they ran a demo showing what can be done if we liquid cool SSDs in these high-end, presumably AI or GPU, servers. Cody, what is going on over there at Solidigm and what makes you guys think that liquid cooling SSDs is an important idea? Hi, Brian.
Starting point is 00:00:37 I wanted to first thank you for inviting me here on the podcast, and I'm excited to be able to talk a little bit about the, you know, innovative solutions that we've come up with over here at Solidigm. And yeah, so I think that as we've seen the data center market evolve from, you know, strictly air-cooled to a hybrid-cooled solution, and now looking at, you know, the market trend with these fully liquid-cooled GPU servers or high-performance AI servers, I think that's where we saw the opportunity to introduce an innovative product that allows us to
Starting point is 00:01:13 push the market forward, if you will, in terms of, hey, how do we liquid-cool other devices besides these high-powered CPUs or GPUs at the platform level? Yeah, I mean, obviously the liquid movement, whether it's direct-to-chip or immersion, which we can talk about a little bit too, because we did, a couple weeks ago, take a bunch of your E1S drives and dunk them in Castrol data center fluid as part of a video we did with DUG. I don't know if you've gotten a chance to catch that one yet, but the vision of hot swapping an SSD
Starting point is 00:01:50 and having to dump the oil out is pretty humorous. But yeah, all these liquid cooling things are happening and most of us think about it at the chip level, either CPU or GPU. But I think the trend that you're talking about, and surely was a hot topic at GTC, is can we liquid cool the entire system? You know, Lenovo's done that a couple years
Starting point is 00:02:13 with their Neptune systems, where they put copper on everything, the DRAM. They've even done it with hard drives or SSDs in these little blade servers. What you guys were showing was a larger block, a water block, and SSDs. And getting after your piece of that puzzle, how do we get smarter with cooling storage, right?
Starting point is 00:02:36 Right, right. And so with our solution, we wanted to maintain the key functionalities of the SSD. You talk about things like hot-swappability or serviceability; I think that's a key aspect. From what I've seen, there is oftentimes a more cumbersome solution, or something that requires special tooling to replace the cold plate or requires some downtime at the server. And so what we wanted to demonstrate is that we're able to
Starting point is 00:03:07 take our normal SSD, we're able to liquid cool it with a carefully designed cold plate mechanism, and we're able to extract the heat from just one side of the SSD, even at these higher Gen 5 speeds. So this is interesting, and I probably should have started this with, you're a technical guy, you're not a product marketing guy, or maybe you're a hybrid.
Starting point is 00:03:30 What is your function anyway over there? Yeah, so I'm a thermal mechanical design engineer. Perfect. This is one of the rare cases where a company's brave enough to let one of the nerds out of the lab to talk to a media outlet. So this is perfect, though, because we can really get into it. If it was just regular old run-of-the-mill marketing folks, they sometimes don't have the technical chops. So you'll know everything about this, which is fantastic. So when we
Starting point is 00:03:59 think about cold plates, traditionally, pretty much with a conductor in between, they sit on the chip, and as the chip puts out its heat, it runs a liquid loop through there and then replaces that with cold water, or, you know, water mixed in with some chemicals, to move that heat off somewhere else to cool it back down and bring it back through. And those designs have been largely driven by vendors like CoolIT that are responding to the needs from Nvidia, AMD, Intel, et cetera, Chilldyne.
Starting point is 00:04:37 I mean, there's a number of vendors out there with really innovative cold plates. But I haven't seen that same cold plate vendor crew go after storage, and I guess it's because they haven't felt like they really needed to or weren't asked to, which is kind of where you stepped in, and with you specifically understanding the thermals of especially Gen 5, and it's not just Gen 5,
Starting point is 00:05:00 this problem gets worse as we go to Gen 6, Gen 7 and beyond, right? Right. You guys stepped in and really understood the heat profile of an SSD, and that's where we ended up with this demo. Exactly, and I think that we've had some conversations with some of our customers asking about, we see liquid cooling happening at the data center. What are your thoughts on SSDs? Do you think this is going to come forward to the drive bay?
Starting point is 00:05:33 Some of the conversations, we're like, we're not quite sure. But I think that what our team saw is that with these new high-performance data center servers that are coming online, we were just thinking it's only going to be a matter of time before they decide, hey, we don't have the space for these fans. We have to re-architect the system in order to enable 100% liquid cooling across the board. And so what my team and I did is we came up with a sort
Starting point is 00:06:06 of cold plate reference design that would say, okay, as we look forward, we see these systems potentially going to 100% liquid cooled, so how can we help the market move things along? And so what we did is we came up with a, you know, a unique mechanism that helps to facilitate heat transfer between the SSD and the fluid, and we're able to maintain these key features that we've seen in the data center up to this point. So I think it was just us having this, hey, what's coming next mindset, sort of an innovative mindset along those lines, and trying to figure out how do we help enable this even though we haven't quite seen it
Starting point is 00:06:49 yet in the market. But we want to make sure that, you know, these sort of things are unblocked, so that way we don't slow down the innovation happening. Well, I mean, I think it's a great technology halo for you guys. I mean, obviously we've talked about Solidigm a ton over the years, from the capacity leadership in QLC, which remains with the 122 terabyte drives, but also Gen 5, right?
Starting point is 00:07:16 You have very fast drives for the latest enterprise workloads and while we all get wrapped up in AI for good reasons, there's still, I don't know, like these database things that still happen in the enterprise that organizations want to go more quickly. And, you know, there's other things that we sometimes don't talk about as much anymore in the AI washing of all IT marketing.
Starting point is 00:07:40 But yeah, the SSD itself in your demo, let's tear into that a little bit. You're showing E1S drives, which are kind of the standard. If I'm looking at, like, the NVL72 racks from Nvidia, they went to E1S, and those are very popular with the hyperscalers, a little less so in mainstream enterprise, but also liquid cooling's a couple steps behind
Starting point is 00:08:04 in that group anyway. So E1S makes sense just from an adoption standpoint, but if we think about what you've got there is a pretty slim SSD if you take the heat sink off, it's what, a little under six mil, is that right? Just the four? It's the nine and a half millimeter form factor. Okay. And then we've got the heat sink wrapped around it, but you've got NAND on both sides, you've got DRAM and
Starting point is 00:08:34 the controller on one side. And I guess those two components get the warmest. So as you think about where you need to apply cooling, is it to both sides? So I need to sandwich that drive in between two cold plates? So what makes the solution that we came up with unique is we're only contacting the SSD on a single side of the drive, and that enables us to pack in more storage in that same footprint. So if there was a dual-sided cold plate solution, you're talking about having a four or five millimeter cold plate on either side of the SSD,
Starting point is 00:09:17 and then that is just taking up space within the server. So the way that we've designed it, we need a single-side contact, but we've optimized the thermal topology of the product to make sure that we're taking full advantage of the cold plate contact. And we're sort of taking advantage of that space that would normally be occupied by the 15 millimeter version of the E1S, that fin structure; we're taking that space that was generally occupied by that fin structure,
Starting point is 00:09:48 and we're replacing it with a cold plate. So at the platform level, think about an E1S 15 millimeter SSD: we're just taking that same drive bay and converting it from a 15 millimeter air-cooled SSD to a nine and a half millimeter liquid-cooled SSD. And so we're talking about not losing that sort of drive,
Starting point is 00:10:10 that dense number of drives in that platform designed for a 15 millimeter SSD. Yeah, that's one of the fun things with the E1S spec: there's like six different Z heights for these things, ranging from no heat sink at all to, I think they go up to 25, right? I think that was a popular one at Meta for some reason. But really what we're talking about is not changing the PCB; you're really changing the way the heat gets transferred
Starting point is 00:10:40 in your heat sink then to the cold plate. Is there any change at all to the PCB or is this still just a garden variety PS1010 Gen 5 drive? So we did do some internal optimization of our drive to take full advantage of that cold plate connection. And since we're limited on the heat transfer we're getting on the side opposite the cold plate contact, we did have to do some unique things within the SSD to make sure that we're thermally balanced in our design.
Starting point is 00:11:15 So what does that do then for the drive in terms of performance? Because we know that if the drive gets too hot, it'll throttle and go into self-preservation mode. We've done that sometimes by accident, sometimes on purpose in our labs. I know you guys, I've been in your labs at Rancho Cordova. I've seen the thermal room there where you guys will often overheat things for funsies to test those outer edges. But in a design like this, is your goal to maintain parity with air cooling
Starting point is 00:11:52 and the full 15 mil heat sink, or is this better than air cooling with that big sink when you look at what you can do with liquid direct-to-chip? So it's better in the sense that what we've done is we have a solution here where, as we scale the power, we're able to maintain the no-throttle condition on the drive. And so thinking forward, like I've been talking about,
Starting point is 00:12:20 as we think forward to maybe these higher power Gen 6, you know, products, or, you know, ramping up the power on a Gen 5 product, liquid cooling sort of unlocks that next level where you'd otherwise be limited by, you know, the thermal performance of, say, a 15 millimeter SSD. Well, this allows us to go beyond that general power limit for 15 millimeter, and it sort of unlocks that next level of performance. Like, hey, if we wanted to drive the performance
Starting point is 00:12:50 of our SSD up further, you know, that gives us an opportunity in this same form factor to explore opportunities where we could, you know, increase that power performance to the drive and not necessarily break things thermally. Yeah, it's interesting because we did some work around power states at the end of last year for OCP. We were using your drives, actually the QLC drives
Starting point is 00:13:16 and looking at what happens to performance if I'm really concerned about 20 or 30 or 40 drives in a single box. And instead of running at full power, which is what, 25 watts? Yeah. Yeah. Okay. And if I trim that back, and I want to save 10 watts a drive, or 8, 12, 15 a drive, because that adds up. It sounds incremental, but times 30 or 24 is a number in terms of what you can do from a power managing
Starting point is 00:13:46 power envelope. But if I take it down, my reads remain pretty good but my writes will eventually suffer as I trim the power back to that drive. But talking about Gen 6, you go from 25 watts to, what's the target there, 35 or something, or is it higher? You know, I don't know exactly what the target power is off the top of my head, but I do know it's higher than Gen 5 though.
Starting point is 00:14:32 Yeah, yeah, and I do know that, you know, as an industry, I think that there are some opportunities where we talk about direct attached storage, how do we drive up that power envelope for specific applications. And so I think that talking about higher power, maybe the 30, 35, 40 watt, I think that there are some limitations with that E1S connector specifically. And so I think that that obviously would come into the equation as you think about these higher powered solutions.
Starting point is 00:14:55 Yeah, I mean, the storage world's so wild right now. We went from this U.2 form factor that feels like it's been around forever. It's not that old when you compare it to SATA and SAS of course, but my goodness, now with all of the form factors in the EDSFF world, it's wild, from the early Intel ruler days to the short, the long, the E3, the double thickness E3. I mean it's pretty crazy.
Starting point is 00:15:26 And I know you guys can build anything, but the trouble with all these form factors, I suspect, is that your customers are all asking for different things, which could be challenging. But as you hone in on this liquid cooling thing, I mean, it's really around these AI systems that are taking the lead there. So you talked a little bit about what you had to do with the drive.
Starting point is 00:15:51 What's the IP specifically? Did you guys put a patent or something around this? Are you open sourcing this technology? What's the Solidigm goal with this demo? Um, so the Solidigm goal, I believe, at a high level, and, you know, being the technical guy that I am, I don't get too much in the weeds as far as the IP and such is concerned,
Starting point is 00:16:17 but our goal is that we would help to enable, you know, several OEMs and ODMs with this technology, to help with their liquid cooling adoption. And so what we see is having a reference design, and these companies come to us; we have our liquid cooling reference design, we have some IP around that,
Starting point is 00:16:41 and then we're partnering together with them to enable things at their platform level. So, you know, I see it more as a partnership here. We have this technology. We want to partner with you and, you know, sell you our drive, to see what we can do as far as enabling these liquid cooling solutions, you know, in their various platforms. Well, the march towards liquid cooling is demanding all components to be aware of this.
Starting point is 00:17:11 We saw it, I was at Data Center World in DC recently, and there were more CDU vendors there than I knew existed. So we did a podcast with CoolIT when we were there, and it did not dawn on me at the time that I should have probed into the cold plates and storage there, but they can make whatever the server guys want. All it is is two more runs off a manifold that's already in the system to run it to a cold plate.
Starting point is 00:17:39 And then if you guys at Solidigm are making SSDs that are tuned for cold plates, then it's a nice little marriage to put those together. I suspect that the rest of your competitors will have to do something there too as they think about what are the needs for a liquid-cooled system where I'm only going to get the cooling on one side. So, like what you were talking about, the side that's only in contact with the case has got to be a little more efficient
Starting point is 00:18:12 so that can wrap around and then eventually get to the cold plate. But as we saw too with the immersion, I know immersion's not as popular in the United States as it is elsewhere, but we've seen some massive deployments. DUG, for instance, they've got 440 plus immersion tanks in their Houston data center, which is totally wild.
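As a rough illustration of why a per-drive power trim "adds up," here is a minimal Python sketch of the arithmetic. The 25 W full-power figure, the 10 W per-drive trim, and the 24 and 30 drive counts are taken from the conversation; the function name `chassis_power_savings` is hypothetical and exists only for this example. It says nothing about how performance changes at the lower power state.

```python
# Minimal sketch: chassis-level savings from trimming each drive's power.
# Figures (25 W full power, 10 W trim, 24 or 30 drives) come from the
# conversation; the function name is a hypothetical helper for illustration.

FULL_POWER_W = 25.0  # per-drive full power mentioned in the episode


def chassis_power_savings(drive_count: int, watts_saved_per_drive: float) -> float:
    """Return total watts saved across all drives in one chassis."""
    if drive_count < 0 or watts_saved_per_drive < 0:
        raise ValueError("drive_count and watts_saved_per_drive must be non-negative")
    return drive_count * watts_saved_per_drive


for drives in (24, 30):
    saved = chassis_power_savings(drives, 10.0)
    baseline = drives * FULL_POWER_W
    print(f"{drives} drives: {saved:.0f} W saved of {baseline:.0f} W "
          f"({saved / baseline:.0%} of the storage power budget)")
```

The same function can be called with 8, 12, or 15 W to compare the other trim levels mentioned.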
Starting point is 00:18:36 And as we talk to these guys, it started out with just taking a server off the shelf, dunking it in. What do you learn? You learn the cables get brittle and crack. You learn all the labels fall off and sink to the bottom and start to corrupt your fluid, which is interesting. And you learn a bunch of other things about the thermodynamics of how that oil moves through the system, what you need to remove, what you need to enhance to make that happen.
Starting point is 00:19:06 And I think when we were there with Alan, he was looking at it from an engineer's perspective, thinking, my gosh, okay, so if we're going to dump these things in oil, we could theoretically rewrite the rules on how we treat flash as part of that. I mean, it doesn't have to be delivered the same way as we deliver it today. I mean, there's just so much going on there as a thermal guy. It must be exciting on your side of the house. It sounds more exciting than signal integrity to me, but I don't know, what does it look like in your labs? Yeah, so I mean, I think it's really, it's an exciting time to be a thermal engineer for sure. I think that, you know, for the
Starting point is 00:19:52 longest time, you kind of saw the market, you know, air cool, how do we optimize air cool, and then I just feel like there's been this jump in that I think it's been about the last three years here, we've just seen this huge acceleration, in terms of direct liquid cooling and immersion. And it's been fascinating to explore these different details and different trade-offs of different types of cooling technology. But at Solidigm, we want to make sure that our products can fit into all these different spaces. We're optimizing for air cooling,
Starting point is 00:20:24 for immersion cooling, for direct liquid cooling. We wanna partner with customers, figure out their specific needs and what we can engineer or how we can engineer our solutions to ensure that our drives are performing in these very strenuous environments. I saw that deployment that DUG had there, I think it was like in a parking lot, right?
Starting point is 00:20:45 Like a container in a parking lot. It wasn't just like a Popeyes, it was like next to a data center. You make it sit like a gene showed up. Yeah, so I mean, I remember Alan mentioning, yeah, Alan mentioning something like that. Yeah, I'm just like going to a parking lot in the middle of some residential area and I was like, all right. So, you know, we think about these different
Starting point is 00:21:08 applications, and us at Solidigm, we want to make sure that, you know, we've got our hands in these different markets to see, you know, how do we help enable our customers? And, you know, talking about our drives specifically, we've definitely looked into these different cooling technologies, you know, from immersion to direct liquid cooling to air cooling. And so we're definitely exploring, you know, what opportunities are there and how do we tune our drives to make sure that we're meeting our customers' needs. Yeah. I mean, yes.
Starting point is 00:21:39 And it makes a lot of sense. One of the things that we've seen and heard anecdotally from organizations that have adopted liquid at a high level, either direct-to-chip or immersion, specifically with immersion, I've got a little more data on it, but the failure rates seem to go down tremendously. And I know flash already has a pretty low AFR, you know, across the industry, not just you, but if we look at all NAND, pretty low, especially compared
Starting point is 00:22:12 to hard drives, which, you know, were higher, more moving pieces, obviously, and complexity from an engineering standpoint, those designs. But what do you see, or what do you expect to see? Because it may be a little too early for your lab. Do you think that as you go to more liquid cooling of any variety with flash, do you think there'll be a positive side effect on SSD AFRs going down because of the better cooling?
Starting point is 00:22:45 I'm not really sure, so I can't really comment on that. Yeah, that's not really something that I personally have looked into much. So yeah, I'm not certain. Well, I just think if we make the logical connection that with better cooling, drives stay cooler, that's one less thing, because heat is the enemy, an enemy of an SSD. So certainly performance should benefit. I think the AFR should benefit. I guess we'll see as more of these get out there.
Starting point is 00:23:22 Yeah, and I think one thing that's powerful about liquid cooling is you're able to be a little bit more selective in what you do as a thermal engineer at the platform level. So holding drives at more of a steady temperature, understanding flow and pressure drop at a system level to where you can peel off a certain liter per minute to keep your drive at a certain temperature range. So I think that some of those variables
Starting point is 00:23:52 that existed with air cooling don't exist as much. It means that there's still plenty of variables at a system level you need to consider. But I think that there is an opportunity there to sort of tune the platform in a way to keep your SSDs at a performance and a temperature range that would allow maybe for some better life like you're suggesting there.
Starting point is 00:24:23 But yeah, I could definitely see that you have a few more knobs to turn where you're not necessarily hurting a downstream CPU or GPU because you're increasing the fin density or something on your SSD. Yeah, for sure. And of course, when we take the fans out of these systems, tremendous power savings there, upwards of 25, 30%,
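The idea of peeling off a certain number of liters per minute to hold a drive in a temperature range can be sketched with the standard steady-state heat balance, Q = m_dot * c_p * delta_T. This is a minimal Python sketch under stated assumptions, not Solidigm's method: it treats the coolant as plain water (real loops typically use a water-glycol mix with different properties), it ignores the thermal resistance between the drive and the cold plate, and the 2 K coolant temperature rise and the function name `required_flow_lpm` are illustrative choices. Only the 25 W drive power comes from the conversation.

```python
# Minimal sketch: coolant flow needed to carry away a given heat load,
# from the steady-state balance Q = m_dot * c_p * delta_T.
# Assumptions (not from the episode): plain water as coolant, a chosen
# allowable coolant temperature rise, no drive-to-plate thermal resistance.

WATER_DENSITY_KG_PER_L = 0.997           # approximate, water near 25 C
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate, liquid water


def required_flow_lpm(heat_w: float, coolant_rise_k: float) -> float:
    """Return liters per minute of water needed to absorb heat_w watts
    while the coolant warms by coolant_rise_k kelvin."""
    if heat_w < 0:
        raise ValueError("heat_w must be non-negative")
    if coolant_rise_k <= 0:
        raise ValueError("coolant_rise_k must be positive")
    mass_flow_kg_per_s = heat_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * coolant_rise_k)
    liters_per_s = mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L
    return liters_per_s * 60.0


# One 25 W drive (figure from the episode) with an assumed 2 K coolant rise.
per_drive = required_flow_lpm(25.0, 2.0)
print(f"Per drive: {per_drive:.3f} L/min")
```

Because the relation is linear in heat load, doubling the drive power at the same allowed temperature rise doubles the required flow, which is one reason higher-power drives change the platform-level flow budget.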
Starting point is 00:24:47 based on the numbers we're seeing, and they get a lot quieter too. So these AI data centers, fully liquid cooled, it's very bizarre when you go in and hear the whooshing and whirring more than the screeching of the fans. So go back to GTC. You guys were showing off this demo. You had four or five drives set up in a little E1S backplane
Starting point is 00:25:11 with a liquid loop going. This is obviously not production. This is proof of concept. As you work through these logistics, what sort of feedback were you getting at the event or afterwards? Because I imagine most of the people coming by were like, what is this?
Starting point is 00:25:30 This seems bizarre. Other than disbelief, were there other takeaways that you either expected or didn't? Yeah, I did think it was funny. A lot of people would stop by and were like, what is going on? And so a lot of it was just kind of explaining what was happening in general.
Starting point is 00:25:52 And for our demo specifically, like you said, we had E1S, we were demonstrating the hot-swappability. We were demonstrating our drives. They were running Gen 5 speeds. We were able to hot swap, still maintain, you know, that key feature there. And then we were able to hold the SSDs at a consistent temperature below throttle.
Starting point is 00:26:16 So we were kind of focused on that part of it. And so sort of walking people through that, you know, kind of showing off our technology to them was fun, something I hadn't experienced before. And so that was neat. I'd say that overall the reaction, like afterwards, some of the things that I didn't expect, there was quite a few, you know,
Starting point is 00:26:40 articles and things popping up around this. I didn't really expect that as much, but it was neat to kind of see that. I think we've gotten the industry thinking a little bit, you know? And I think people are definitely starting to scratch their heads a little bit, like maybe there is something here that can drive towards 100% liquid cooling, and maybe there are some benefits at a data center level that, you know, we haven't quite considered to this point because we haven't seen this sort of technology. And so some of those reactions, I guess, I don't want to ramble, but...
Starting point is 00:27:22 No, I think it's fun, and you've encapsulated it. The "what the heck are you crazy guys doing?" I think would be the number one, and then, as you said, as it soaks in: okay, makes sense. And I think, I don't want to put you in a position to speculate, so I shall. On the GB300, as we look at these platforms, every one we've seen is all liquid-cooled, and so most of them are missing the drive bay,
Starting point is 00:27:55 if you look closely, but I think it's a foregone conclusion that liquid cooling will be in the next generation of high-density from Nvidia, at least. And so they will drive this. The industry will have to respond, you and all of your competitors. And the fact that you're there and showing it now, I think is fun. And maybe a little out of character for Solidigm in terms of waiting till something's shipping or right around the corner to start to show things. I think it's fun to show the new technology.
Starting point is 00:28:30 So on that front, we wrote about it, as you know, and put up a lot of photos, and we'll link to that in the description of this podcast. So if you guys are curious and wanna learn more about it, you can't buy it yet, but you can read more about it, see these drives, and see the demo that Cody and his team put together. But yeah Cody, I think it's pretty cool. I'm glad they let you out of the thermal lab to chat with us for a little bit and your perspective
Starting point is 00:28:58 is unique, and you're well equipped to talk about this stuff. So I appreciate that. Thank you. Yeah, and I really appreciate the opportunity. Thank you. Yeah.
