In The Arena by TechArena - Reinventing RAID for AI: Xinnor and Solidigm on Storage’s Future

Episode Date: November 14, 2025

This Data Insights episode unpacks how Xinnor's software-defined RAID for NVMe and Solidigm's QLC SSDs tackle AI infrastructure challenges: reducing rebuild times, improving reliability, and maximizing GPU efficiency.

Transcript
Starting point is 00:00:00 Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome in the arena. My name's Allyson Klein, and it is another Data Insights episode, which means Jeniece Wnorowski from Solidigm is back with me. Welcome, Jeniece. Thank you, Allyson, and it's great to be back. So we are on a tear with Data Insights episodes, and this is a very exciting one. Why don't you tell me what we're going to be talking about today, and who are our guests? So today we have a couple of guests, which is really exciting.
Starting point is 00:00:43 We actually have Davide Villa from Xinnor, and then we also have Sarika Mehta from Solidigm. Let's go ahead and kick things off with Davide. Davide, you are the chief revenue officer for Xinnor. But tell us a little bit more about what Xinnor is all about and what you do there. Hi, Jeniece. Thanks for inviting me to this podcast. Yeah, I'm the chief revenue officer at Xinnor. I take care of sales, marketing, and technology evangelization. Xinnor is a startup. We focus on high-performance data protection for storage. And this is something which is not new. Data protection has been in the market for the last 20 years. RAID was invented when hard disk drives
were the only media available in the market. And that is also its limit, because RAID was designed for slow media. Now that NVMe drives are available in the market, we need something new. We need a RAID capable of dealing with the high level of parallelism of NVMe drives. And that's why we reinvented RAID to cope with the performance of NVMe, being able to aggregate multiple drives with parity protection without penalizing the performance.
Starting point is 00:02:01 So that's what we do at Zinnor. So, Sarika, you are the senior storage solutions architect for Solidime, and I know together we're seeing a clear shift from hard drives to flash for different workloads. How is Solidim's large capacity QLCSSD's helping customers keep up with today's AI demands? Definitely, you know, they have the necessary. performance that is needed for the AI deployments. They have the larger capacities as compared to the hard drives. And with those two, they're able to offer power in space savings that hard drives are not able to keep up with. So that makes them very attractive for the AI infrastructure,
Starting point is 00:02:49 because as we know, power has increasingly become a very critical metric. Also, there is another aspect of it is where the data that was previously stored away, now with the AI applications, people are able to put that data to use. And this data has recited on cooler tiers before, which were primarily served by the hard drives. But with the AI taking off, those tiers are warming up. So the performance that hard drives were providing before is no, longer sufficient to be able to draw that data out and do also service the GPUs. GPU idle time is very extensive. And that's primarily the reason why we're seeing the shift towards the QLC S&Ds. David, let's take a step back. One of the things that we've talked
Starting point is 00:03:48 about at Tech Arena quite a bit is how AI workloads are really redefining the limits of storage infrastructure and requiring a complete rethink about what's delivered for storage for these data pipelines. From your perspective, what are the biggest challenges enterprise face today when scaling AI? And how does that relate to your work at ZNOR? Yeah, AI is a very GPU-intensive industry. So all the customers who are experimenting with AI, they start Procuring GPU, all their infrastructure is designed around the GPU. Quite often storage is an afterthought. So they don't start with storage.
Starting point is 00:04:31 They start with a computer. But then as Sarika pointed out, storage can become the bottleneck. If you deployed storage not in the optimal way, you can lead to a situation where you have severe GPU under utilization. and idle time on GPU costs money. So it's absolutely important for the infrastructure guy on AI to think about the storage infrastructure and make sure that is designed to keep the GPU busy at all time.
Starting point is 00:05:05 So what we see in AI, there are a few things that are becoming extremely important. Data reliability is one of them. If you think about the cost of running an AI training job, or even the cost of inferencing, you cannot afford to lose data. You cannot afford to slow down the system in case of a hardware failure. Because if you do that, you might lose a lot of money. I'll give an example. I was at a forum a few days ago with one of the leading truck manufacturers in the world.
And they now do their simulations leveraging AI. So it's all automated through AI. And this guy told me that every time they run a simulation job, it costs them more than $2 million. So if they lose the data and they need to rerun the job, they are just wasting more than $2 million. So data reliability is a must. And then the other thing, as Sarika pointed out, that we see in the market: power is limited. The GPUs are very power-hungry. They consume the majority of the power budget.
Starting point is 00:06:15 And as a matter of fact, when people are deploying storage for AI, they need to maximize the VAT per terabyte or actually get the best possible VAT per terabyte. And that's the reason why we see the transition from Hardest Drive to high-capacity QLC. Today you can get 122 terabyte QLC drive when the maximum capacity hardest drive, it's 30-32 terabytes. So there is a 4x ratio between the two in favor of SSD. Yeah, David, I couldn't agree more. Let's talk a little bit about the results we've been seeing with ZNOR, Sarika, being the solution architect on this. You've been working with ZNOR and testing the XAI raid with our 61.44 P5336 terabyte drives. And you've seen a rebuild time
Starting point is 00:07:08 in just, I think, five hours compared to what would have been 50, 53 hours with MD RAID, which is incredible. So tell us a little bit about the key insights you saw from that study, and how does this really impact AI adoption? Yeah, and certainly we are very excited to present the results. A rebuild of 61.44-terabyte drive in just five hours. You can kick it off in the downtime, and in like few hours you can come back and resume your operation at full capacity and performance, which is definitely,
Starting point is 00:07:43 great, great as compared to having to wait like a few days and operating in a degraded more where, you know, your applications are starving. So one of the concerns that we wanted to address with this testing is that people have been concerned about downtime and especially higher the capacity, the bigger the concern that has been around those. And this historically stems from the fact that hard drives have been the primary source that has been used in the rate deployments, and their performance is nowhere near what QLC has to offer, right? Solid N QLC has 11 times more wide bandwidth and roughly like 25 times more read bandwidth as compared to today's 30 terabyte hammer SSD if you take that into account.
Starting point is 00:08:34 And for IOPS, you're topping anywhere from 54 times to like over 5,000 times the IOPS that you can get from a hard drive. So, you know, what it came down to at the end of it is basically having a software staff that is highly efficient that can take advantage of this raw performance and is able to deliver the rebuild, taking advantage. of the performance in the shortest time. What that means is that now we can confidently say that the large capacity drives are a completely liable solution for the redeployments. David, from your perspective, how do these performance improvements translate into real-world AI workloads
Starting point is 00:09:24 such as training or inference based on your mixed workload testing? Yeah, so what Sarika pointed out are great results, But they're even more impressive if you think about the measurement that we made during a system operation. So the numbers you pointed out so far, the five hours, the three hours for MDRAID are when the system is just doing drive rebuild, which is not real life. In real life, your storage system needs to keep running, feeding the GPU. And for this reason, we also run the same type of test with a heavy workload. And what we saw is that the delta in performance is even bigger.
Starting point is 00:10:07 So we're talking about 25 times faster rebuild time with our software compared to MDR8. This is important, but it's even more important to notice why we were rebuilding the drive. The system was running at full speed. So the application that is running is not suffering any meaningful performance degradation. So you can still run your AI workload, you can get all the answer to your queries, to your inferencing, as if nothing happened. And in the background, our software is recovering the drive, recovering the data, and obviously avoid any data loss. So that's absolutely critical. So it's not just a matter of avoiding data loss, but it's also a matter of being able to keep running at high performance so that you don't want to. waste GPU cycles and you don't waste your investment in the AI infrastructure.
Starting point is 00:11:06 So I'd like to address that kind of same thinking to Sirica. As these drives continue to grow in capacity, data protection does become more critical and a big element. How is Solidime really focused on high performance data protection and how did we differentiate the solutions we have in market? Yeah, that's been good. theme of our discussion. Like David has pointed out why reliability is important in his customer's story of how much money they're losing when you have unbelievable storage. We at Solidime are definitely focused on demonstrating reliability on platform level. We partner with solutions like the one with Xenorp. But in addition to that, P50336 SSDs, which are our high-density QLC
Starting point is 00:11:57 the SSTs and go up to 122 terabytes in capacity, they also support security features like compliance with Opal. We have data encryption, we have any secure arrays on it, crypto arrays, secure boot options, et cetera, on the drives. Solidine drives also go through very rigorous validation for quality and reliability and have some of the industry's lowest annual failure rates and uncorrectable big end rates. Definitely quality is built into the drives.
Starting point is 00:12:32 And so SSD-level security and protection features, paired with high-performance software stack solutions, really help us drive focus on end-to-end data protection for our customers. Now, Davide, I know that the performance we're talking about is really rooted in the software layer as well. Can you talk about why software optimization is so critical for enabling reliable performance in AI environments? Sure. As I said at the beginning, our goal is to maximize what the hardware can theoretically do. Now we see that with the transition from PCIe Gen 4 to PCIe Gen 5, you have those amazing SSDs that can achieve tens of gigabytes per second in read and write,
and they can achieve a million IOPS in random operations. But there is no such thing as deploying a single drive for AI. You need to deploy a cluster of drives. So it's absolutely important to be able to combine those drives into a large block device, into a cluster of drives, without losing that performance. And this can only be done through the software implementation. So that's exactly what we do. We try to maximize the aggregated performance of what Solidigm brings to the market.
Starting point is 00:13:59 Awesome. Thank you for that, David. This has been an amazing conversation. I think our listeners would love to know where they can learn more about this particular solution. Where should they go? David and Sirica, I'll let you chime in as well. But David, where should folks go to learn more about Xenor? It's very simple. You can start from our website, Xenor.io. Or you can Google Xenor and Solidime, and you will be prompt with two solution brief that we jointly publish. They are available both on our website as well as on Solidime website. One of the solution brief is about optimizing performance with a QLC drives and plate protection. The other one is about the rebuild time.
Starting point is 00:14:45 We can drastically improve rebuild time, whatever we discuss. today during this call. So have a look and then you can contact us if you go on our website, you get the ability to get in touch directly with us and eventually try our software. For us, you know, it's the same go to solidine.com and the solution brief is published on Solidine website
Starting point is 00:15:11 as well as in our website. It's also available on blocks and files or anybody who is interested. And to get in touch with us, You can get in touch through the Solidine website or find your Solidine sales drop and get in touch. One final question for both of you, because I know that you're going to have interesting perspectives on this. Looking ahead, how do you see AI and storage evolving together? And what excites you most about the role your companies can play in that journey?
Starting point is 00:15:43 David, do you want to take it first? Yeah, okay. AI is all about data and data is all about storage. So we will see storage becoming more and more critical for effective AI deployment. We see it already today and we see these becoming more critical in the future. New GPU are coming to the market. They will be faster. Most likely they won't be more power efficient, but they will definitely demand faster storage
Starting point is 00:16:11 and more capacity drive. And that's a clear trend in the market. So the success of all the vendor will be on the ability to keep growing the footprint, the capacity of the drive, and at the same time maximize the performance to keep those GPU busy. So storage is a critical element in any AI deployment, and it will become even clearer in the future. And Sirica? Yeah, so, you know, like David said, here is about data. data is about storage, and one of the things that AI evolution has definitely been challenged
Starting point is 00:16:54 by the data center power and space constraints, and Solidine is uniquely positioned with its identity QLCSSs to resolve both of those challenges. But that's not all. Soordine also offers high-performance PCI-G-G-4 and Gen 5 TLC SSDs like P-5520, PS-10, and Gen 4 SLC SSD, which is C-5010 products. And these are great options for near-GPU ephemeral storage, where high performance and smaller capacities are desired. So having a portfolio of SSDs that fit various requirements of AI deployments that provide industry-leading quality and reliability
Starting point is 00:17:41 certainly is exciting for me and part of solving that. Fantastic. Thank you both for coming on the show today. You have provided some incredible insights into what is reshaping in the data center for the foundational storage capabilities for AI. I really appreciate you being here. And, Janice, thanks so much for another Data Insights episode. Awesome. Yes. Thank you, Allison. Thanks for having us. Thank you. Thank you. Thanks for joining Tech Arena. Subscribe. engage at our website, Techorina.aI. All content is copyright by Tech Arena.
