Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x04: Maximum Performance and Efficiency in AI Data Infrastructure with Xinnor

Episode Date: June 24, 2024

Cutting-edge AI infrastructure needs all the performance it can get, but these environments must also be efficient and reliable. This episode of Utilizing Tech, brought to you by Solidigm, features Davide Villa of Xinnor discussing the value of modern software RAID and NVMe SSDs with Ace Stryker and Stephen Foskett. Xinnor xiRAID leverages the resources of the server, including the AVX instruction set found on modern CPUs, to combine NVMe SSDs, providing high performance and reliability inside the box. Modern servers have multiple internal drive slots, and all of these drives must be managed and protected in the event of failure. This is especially important in AI servers, since an ML training run can take weeks, amplifying the risk of failure. Software RAID can be used in many different implementations, with various file systems, including NFS, and over high-performance networks like InfiniBand, and it can be tuned to maximize performance for each workload. Xinnor can help customers tune the software to maximize the reliability of SSDs, especially with QLC flash, by adapting the chunk size and minimizing write amplification. Xinnor also produces a storage platform solution called xiSTORE that combines xiRAID with the Lustre parallel file system, which is already popular in HPC environments. Although many environments can benefit from a full-featured storage platform, others need a software RAID solution to combine NVMe SSDs for performance and reliability.

Hosts:
Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/
Ace Stryker, Director of Product Marketing, AI Product Marketing at Solidigm: https://www.linkedin.com/in/acestryker/
Davide Villa, Chief Revenue Officer at Xinnor: https://www.linkedin.com/in/davide-villa-b1256a2/

Follow Utilizing Tech:
Website: https://www.UtilizingTech.com/
X/Twitter: https://www.twitter.com/UtilizingTech

Follow Tech Field Day:
Website: https://www.TechFieldDay.com
LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day
X/Twitter: https://www.Twitter.com/TechFieldDay

Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm

Transcript
Starting point is 00:00:00 Cutting-edge AI infrastructure needs all the performance it can get, but these environments must also be efficient and reliable. This episode of Utilizing Tech, brought to you by Solidigm, features Davide Villa from Xinnor discussing the value of modern software RAID on NVMe SSDs with Ace Stryker and myself. Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group. This season, brought to you by Solidigm, focuses on the question of AI data infrastructure: all of the ways that we have to support our AI training and inferencing workloads.
Starting point is 00:00:41 I'm your host, Stephen Foskett, organizer of the Tech Field Day event series. And joining me today as co-host is Ace Stryker from Solidigm. Welcome to the show once again. Hey, Stephen. Thank you. A pleasure to be with you again. So Ace, we have been talking about various aspects of the AI data infrastructure stack on this show. Today, we're going to go a little bit nerdy, a little bit deep, on the topic of storage and RAID. I know that for most of the companies that are deploying AI infrastructure, especially for training, the last thing they want to do is invest a ton of money and effort and precious, precious PCIe space and data center space in a big, fancy storage system. A lot of these companies are trying to create a system that does all the things they need right there in the chassis.
Starting point is 00:01:36 Yeah, a lot of folks looking in at AI data infrastructure from the outside may not appreciate, you know, the challenge that comes with coordinating all the pieces of the system, right? It's easy enough to say, oh, you can buy more drives to add capacity, or you can pull these levers to increase your performance if you need. But it turns out that optimizing the way these pieces play together is not easily done, right? And there's a lot of interesting innovation happening in that space. In particular, we're seeing a lot of these kinds of coordination efforts, whether it's in networking or storage or other parts of the system. We're kind of entering a world where a lot of that stuff is being done by software, right? Where historically, you know, we had these sort of purpose-built pieces of hardware that were responsible for that kind of work.
Starting point is 00:02:36 And in the new world, we're transitioning to software-defined solutions for a lot of this stuff, and it's really exciting to see, because what you can get out of that in a lot of cases is not just more performance; you can also do it in a more efficient architecture. Oftentimes, you're saving power and space at the same time. And so it's definitely an area to watch going forward. Yeah. And as a storage nerd myself, it always makes me sad when people underestimate what they need in terms of storage solutions. They maybe will try to deploy on just bare drives, or they'll try to use sort of out-of-the-box software that isn't really up to the task from a performance and even from a reliability perspective, or they'll deploy something that's just way overly complicated and huge. So refuting all of this, we have Xinnor. This is a company that makes
Starting point is 00:03:41 a software RAID solution. Essentially, it lets your server manage storage internally in the way that an external storage array might. So we're excited to have Davide Villa joining us today to talk a little bit about the world of software RAID and the practical ways that companies are managing storage. Welcome to the show. Thank you, Stephen, for inviting me. I'm really excited to be part of the show. I'm Davide Villa.
Starting point is 00:04:17 I'm the Chief Revenue Officer at Xinnor, the leading company in software RAID for NVMe SSDs. So tell us a little bit about yourself and about Xinnor. Xinnor is a startup in software RAID, as you mentioned. We were founded a couple of years ago, but in reality, we inherited the work that was done over the last 10 years by a previous company that our founder created and later sold. So we are a young company, but we are leveraging more than 10 years of development in optimizing the data path to provide very fast storage. We're about 45 people dispersed around the globe, and very much an R&D company.
Starting point is 00:05:12 And just to be clear, when you talk about optimizing the data path and software RAID and everything, you're talking about building basically enterprise-grade reliability and performance within the server, without having to have a bunch of expensive add-on cards or a separate chassis or anything. You're talking about basically building a server with a bunch of NVMe drives and then using the power of the CPU to provide an incredible amount of performance and reliability, right? Yeah, there are enough resources within the server that we don't need to add any accelerator or any other component
Starting point is 00:05:51 that might become a single point of failure at some point. So what we do is combine AVX technology, available on all modern CPUs, with our special data path. We call it the lockless data path. And what's unique in our data path is the way we distribute the load across all the available cores on the CPU, minimizing spikes of load on any single core. By doing that, we avoid spikes and we can get stable performance, not just in normal operation, but also in degraded mode.
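To make that idea concrete, here is a minimal Python sketch of spreading RAID parity calculation across CPU cores so that no single core becomes a hot spot. It is only an illustration of the general technique under assumed parameters (a four-chunk stripe, XOR parity), not Xinnor's actual data path, which uses AVX vector instructions and lock-free structures rather than Python worker processes.

    import os
    from multiprocessing import Pool

    DATA_DRIVES = 4  # hypothetical stripe width (data chunks per stripe)

    def xor_parity(stripe: bytes) -> bytes:
        # Split the stripe into one chunk per data drive and XOR them
        # together, as a RAID 5-style parity calculation would.
        chunk = len(stripe) // DATA_DRIVES
        parts = [stripe[i * chunk:(i + 1) * chunk] for i in range(DATA_DRIVES)]
        parity = bytes(chunk)
        for part in parts:
            parity = bytes(a ^ b for a, b in zip(parity, part))
        return parity

    if __name__ == "__main__":
        stripes = [os.urandom(4096) for _ in range(1024)]  # simulated writes
        with Pool() as pool:  # one worker per core spreads the load evenly
            parities = pool.map(xor_parity, stripes, chunksize=32)
        print(f"{len(parities)} parity chunks computed on {os.cpu_count()} cores")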
Starting point is 00:06:19 One of the things that we have set out to explore on this podcast is, you know, within the context of AI specifically, how this boom is drawing a lot of these technical challenges and opportunities for solutions into sharper focus, right? So can you talk a little bit about what impact the acceleration in the AI world has had on the problems that Xinnor set out to solve? Yeah, that's a very hot topic today, as we see that our main market
Starting point is 00:07:13 is definitely providing very fast storage for AI workloads. What we experience by working with our customers is that traditional HPC players, the universities, the research institutes, are all now facing some level of AI workload. So they're all equipping themselves with GPUs, very powerful GPUs, that require a different type of storage than what they traditionally dealt with. Traditionally in the HPC space, HDDs, rotating spindle drives, were good enough for many use cases.
Starting point is 00:08:06 When it comes to AI workloads, they are not sufficient. Their performance is not sufficient any longer because of the very high read and write requirements that those modern GPUs have. And those modern GPUs are expensive systems, so the customer cannot afford to keep them waiting for data. So it's absolutely critical that the storage that is selected to provide data for AI models is capable of delivering stable, high performance in the tens of gigabytes per second. That's certainly something we hear a lot about in our conversations with folks in the industry: the primary importance of GPU utilization.
Starting point is 00:09:04 Nobody wants to spend tens of thousands of dollars per unit, and in some cases even more than that, to run something at 60% utilization, right? And so feeding the data to the GPU in something like the training stage of the data pipeline becomes really important to make sure that you're getting the bang for your buck on the compute side, right?
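As a back-of-the-envelope illustration of that point, the short calculation below shows how storage bandwidth translates directly into GPU idle time during checkpointing. Every figure here is a hypothetical assumption chosen only to show the shape of the math, not a measured result.

    # All figures are illustrative assumptions, not measurements.
    checkpoint_gb = 500       # assumed checkpoint size for a large model
    writes_per_day = 48       # assumed frequency: one checkpoint every 30 min

    for storage_gbps in (10, 60):                # slow vs. fast storage
        stall_s = checkpoint_gb / storage_gbps   # GPUs wait on a sync write
        daily_idle_min = stall_s * writes_per_day / 60
        print(f"{storage_gbps} GB/s -> {stall_s:.0f} s per checkpoint, "
              f"~{daily_idle_min:.0f} min of GPU idle per day")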
Starting point is 00:09:30 Can you talk a little bit about what I'd see if I opened up a box that has Xinnor running in it? If I take a conventional architecture, I'm probably used to seeing an array of NVMe drives, and then there's a RAID card in there that's doing a job. Can you talk a little bit about how your solution is different? Yeah, first of all, our solution is software only, so we leverage the system resources. And when I say the system resources, I'm referring just to the CPU, because we don't have a cache in our RAID implementation, so we don't need memory allocation. That's the primary difference. But the reason why we came up with our own software RAID implementation is because traditional hardware RAID architecture cannot
Starting point is 00:10:27 keep up with the level of parallelism of new NVMe drives. The level of parallelism that you can get over PCIe is beyond what a hardware RAID controller can handle when running the checksum calculations. Then the other limitation that you face with hardware RAID is the number of PCIe lanes. A hardware RAID card connected through the PCIe bus can only have 16 lanes, and each NVMe drive has four lanes of its own, meaning that you saturate the PCIe bus with just four NVMe drives. And for AI models and workloads,
Starting point is 00:11:21 four NVMe drives are not sufficient. We have customer deployments with clusters of multiple tens of servers, with 24 NVMe drives per server. So we believe that for NVMe drives and for AI workloads, there is only one way to go, which is software RAID.
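The PCIe lane arithmetic behind that argument is easy to check. The sketch below assumes roughly 2 GB/s per lane (PCIe Gen4); the exact figures vary by PCIe generation, but the shape of the bottleneck is the same.

    GBPS_PER_LANE = 2.0   # approx. PCIe Gen4 throughput per lane (assumed)
    LANES_PER_NVME = 4    # typical NVMe SSD link width
    HW_RAID_LANES = 16    # a hardware RAID card's host link

    card_limit = HW_RAID_LANES * GBPS_PER_LANE   # 32 GB/s ceiling

    for drives in (4, 24):
        raw = drives * LANES_PER_NVME * GBPS_PER_LANE  # aggregate drive speed
        verdict = "saturated" if raw >= card_limit else "headroom"
        print(f"{drives} drives: {raw:.0f} GB/s raw vs. "
              f"{card_limit:.0f} GB/s through a x16 card -> {verdict}")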
Starting point is 00:11:53 Well, it's true, because you look at these servers, and you talk about four NVMe drives. Most of these servers have a lot more than four NVMe drives; most of these servers have a pile of them. And even though those drives are pretty big and each drive provides a lot of performance, you still don't want to manage them individually. I don't know about our listeners, but I'm an old-school Unix systems administrator, and I don't want to be dealing with 20 individual drives. I want to be dealing with a combined drive.
Starting point is 00:12:15 And not only that, these drives are incredibly reliable, very, very reliable, but nothing is guaranteed, especially when it comes to things like the mechanical aspects of the drives, insertion and removal and things like that. It is possible for drives to fail. You need to have reliability as well, and predictability of performance. That's another thing that occurs to me too: if you lose a drive and there's a rebuild or something like that, you don't want to lose all the work that you've done so far in terms of training workloads and
Starting point is 00:12:56 so on. So all of this points to the need, I think, for a system that manages storage. Now, RAID isn't really storage management, but it is drive management, and it definitely helps in configuring these systems, right? Yeah, you're absolutely right. When you run AI models, those models can take several weeks, if not multiple months, to run. And while running the model, something can go bad, so you need to provision and make sure that you are able to deal with potential failures or drive disconnections from the array, without losing data, for sure, but also without being impacted in performance. And that's where we step in, by providing RAID capability: providing data integrity while making sure that we keep very high performance even in degraded mode.
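The reason a degraded array can keep serving data, and a rebuild can recover it, is plain XOR algebra. Here is a tiny, self-contained illustration of the principle; it is generic RAID 5 math under assumed chunk values, not Xinnor's code.

    def xor_all(chunks: list[bytes]) -> bytes:
        # XOR a list of equal-sized chunks together byte by byte.
        out = bytes(len(chunks[0]))
        for c in chunks:
            out = bytes(a ^ b for a, b in zip(out, c))
        return out

    data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]  # chunks on three drives
    parity = xor_all(data)                          # stored on a fourth drive

    # Drive 1 fails mid-run: its chunk is recomputed from the survivors,
    # so reads keep working and no training progress is lost.
    rebuilt = xor_all([data[0], data[2], parity])
    assert rebuilt == data[1]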
Starting point is 00:13:57 I'm curious, as you talk to your customers who are engaged in AI work, what's the sense you're getting of how those customers are viewing their storage needs? And do you see any trends there? Do you hear from folks about, hey, we really need to get more sequential throughput out of our storage subsystem, or, hey, random performance is really important for us, or, our capacity needs continue to grow and grow? What do you see as the trends in the way folks are viewing storage and the requirements
Starting point is 00:14:38 that they are demanding from their storage subsystems in AI clusters? That's the million-dollar question, I would say. We have been asking this question of many different customers, and we get a very similar answer from most of them. And the answer is that they don't know. They need to provision for the extreme cases, because the workload is different.
Starting point is 00:15:07 It's not always the same workload. If we want to oversimplify, we can say that AI workloads are mostly sequential by nature, a combination of reads during ingestion and writes during checkpoints. But not all AI models, not all AI training runs, are equal. There are many distinctions to be made, and we see that random performance plays a role as well. What we experience with our customers, as I said, is that most of them used to run HPC infrastructure, and they very much would like to stay with what they know. So they would like to keep on using the popular parallel file systems that they used on their HPC implementations, and to leverage their competence in those parallel file systems, or file systems generally, to run AI models as well. So as a matter
Starting point is 00:16:28 of fact, every customer has a different way of implementing storage. We're working with many different universities: universities implementing our xiRAID in an all-flash configuration based on Lustre, other universities who prefer going down the open-source route using BeeGFS, and we also did deployments with universities that don't want complexity. They just want a simple file system like NFS and to be able to saturate the network bandwidth. In this specific case, I'm referring to an InfiniBand deployment,
Starting point is 00:17:20 which we recently did at a major university in Germany to provide fast storage through the network to two DGX systems. So to answer your question, it's tough to give you a simple answer, because we're still in the early days of AI adoption and everybody's still in a learning phase. What is clear is that performance really matters. And in order to get that performance,
Starting point is 00:17:57 I imagine that there may be some tuning that you might have to do as well. At the RAID level, the RAID layer, I imagine that there might be some slight differences in configuration, well, seemingly slight configurations, that can make a huge difference in performance. Again, based on my background in the storage space, I know that things like block sizes and so on can make a huge difference in performance. I assume that you guys can adapt to the needs of the higher-level software, right? Yeah, absolutely.
Starting point is 00:18:28 So our software gives the system admin the flexibility to select the right geometry and the right chunk size, the minimal amount of data that is written within the RAID to a single drive, in the most optimal way depending on the workload that will be run.
Starting point is 00:18:54 So we actually did a lot of activities with our partner, Solidigm, to find the optimal configuration based on the specific workload that we were running on a RAID 5 implementation, with proper alignment to the SSD indirection unit. We see that customers require more and more storage, and most of the workload is sequential.
Starting point is 00:19:26 So this makes QLC a very viable technology for AI workloads. But everybody knows that QLC comes with some limitations, like a limited number of program/erase cycles. With our software, by selecting the proper chunk size, we are able to minimize the write amplification on the SSD, and by doing that, we can enable the use of QLC for extensive AI projects. What I'm describing is actually going to be part of a joint paper with Solidigm, and you will be able to see the outcome of this research.
Starting point is 00:20:20 Yeah, absolutely. Folks who are interested in that white paper, by the way, can check it out on Solidigm.com in our insights hub. We've got some great testing there with some of our QLC drives on the xiRAID solution. Do you mind, for folks who are less immersed in the storage world than ourselves, maybe just giving us a one-minute version of what write amplification is, why it's a challenge, and how xiRAID addresses it differently? Sure. Write amplification is a term that refers to the fact that when the host writes one piece of data to the SSD, internally there is more than one write that happens to the physical NAND components. And given the fact that all SSDs have a limited number of program/erase cycles, and when it comes to QLC, they have fewer program/erase cycles than TLC, it's very important that we implement algorithms to minimize this number, to keep it as close as possible to one.
Starting point is 00:21:41 And with our software, we can do that, because we can change the chunk size, the minimum amount of data that will be written to each SSD that is part of the RAID array, and by doing that, we can minimize the read-modify-write that needs to be done on the SSD itself. When we calculate the checksums, if we are not aligned with the indirection unit of the SSD, we risk writing data to the SSD multiple times. With our software, we can find the proper tuning based on the workload, the number of drives that are part of the RAID array, and the RAID level, and we are able to find the optimal configuration to keep this number as close to one as possible.
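To illustrate the kind of arithmetic involved, here is a small sketch of how chunk size and drive count determine the full-stripe write size. Writes issued in full-stripe multiples need no read-modify-write, because parity can be computed entirely from the incoming data. The numbers are hypothetical examples, not Xinnor's recommended settings.

    def full_stripe_kib(n_drives: int, parity_drives: int, chunk_kib: int) -> int:
        # Data written per full stripe: every data drive receives one chunk.
        return (n_drives - parity_drives) * chunk_kib

    def needs_rmw(io_kib: int, n_drives: int, parity_drives: int,
                  chunk_kib: int) -> bool:
        # Partial-stripe writes force a read-modify-write to update parity.
        return io_kib % full_stripe_kib(n_drives, parity_drives, chunk_kib) != 0

    # Hypothetical: 24 drives in RAID 5 (23 data + 1 parity), 128 KiB chunks.
    print(full_stripe_kib(24, 1, 128))  # 2944 KiB per full stripe
    print(needs_rmw(2944, 24, 1, 128))  # False: aligned checkpoint write
    print(needs_rmw(1024, 24, 1, 128))  # True: partial stripe, extra writes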
Starting point is 00:22:31 I could imagine that a lot of this might sound a little concerning to someone listening and trying to deploy this. They might think, oh boy, that's a lot of tuning, a lot of under-the-hood stuff that I don't really understand.
Starting point is 00:23:02 Do you have best practices for various devices? I mean, do you help customers to come up with the right configuration? Yeah, that's part of our job. Normally when we engage with a customer, we spend quite some time with our pre-sales team to understand the workload of the customer and find the optimal configuration. Then, once the optimal configuration is identified, there's no work to be done anymore by the system admin. It's ready to fly, and there's no additional tuning to be done. One of the other things I wanted to ask you about: we've talked about xiRAID, which is an incredible solution in terms of the benefits to efficiency and performance at the same time. Very exciting what you're working on over there.
Starting point is 00:23:55 Another thing I've heard about more recently is, I think, another product of yours called xiSTORE. Could you tell us a little bit about that and how these pieces work together? So our core competence, as I said, is in the data path and in the way we create a very efficient RAID. And we see that for some industries, a standalone RAID, at least for some customers, is not sufficient; they're looking for a broader solution. xiSTORE is one of the first of those solutions that we're bringing to the market. It's based on our xiRAID implementation for NVMe SSDs, but we also combine it with declustered RAID to handle the typical problem of hard disk drives,
Starting point is 00:24:58 which is extremely long rebuild times. Through our own implementation of declustered RAID, we can drastically reduce the rebuild time of rotating drives. Then, on top of the RAID within xiSTORE, you will find a high-availability implementation, so there is no single point of failure. You can lose a server and still have all the RAID up and running. We have our control plane to manage virtual machines, and on top of those virtual machines, we mount the Lustre parallel file system. So it's a complete end-to-end solution that HPC and AI customers can deploy without needing to combine standalone xiRAID with third-party software.
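A rough calculation shows why declustered RAID matters so much for hard drives. The figures below are illustrative assumptions, not xiSTORE measurements; the point is that spreading a rebuild across many surviving drives divides the rebuild time roughly by the number of participants.

    drive_tb = 20        # assumed HDD capacity
    rebuild_mbps = 150   # assumed sustained rebuild rate per drive

    # Classic RAID: a single spare drive absorbs the entire rebuild.
    classic_h = drive_tb * 1_000_000 / rebuild_mbps / 3600

    # Declustered RAID: e.g. 60 surviving drives each rebuild a small slice.
    declustered_h = classic_h / 60

    print(f"classic: ~{classic_h:.0f} h, declustered: ~{declustered_h:.1f} h")
    # ~37 h shrinks to well under an hour, so the array spends far less
    # time exposed in degraded mode.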
Starting point is 00:25:33 That's the first of a series of solutions that we will bring to the market, without ever leaving our core competence, which is very much the RAID implementation. Very good. Well, it certainly seems like the market is responding to your approach here. It sounds like the future is very bright for Xinnor. I wish the same to ourselves, and I guess I can say that
Starting point is 00:26:24 that's the situation, and what we are experiencing is the trend towards AI deployments across all industries. It's not just one or two players deploying AI models; it's becoming pervasive in the industry, and it definitely boosts the requirements for very fast and reliable storage. I think the interesting thing here, too, is that all the things that we've been talking about are going to be very familiar and comfortable for people that are deploying these systems. You mentioned, for example, Lustre as part of the xiSTORE architecture. Well, a lot of HPC environments are already using Lustre and are
Starting point is 00:27:05 happy with it. We talked as well about how, if you combine multiple NVMe drives into a single xiRAID system, that's going to be familiar for people who don't really know a lot about storage, because they're going to see the storage as just a big space, a big amount of space that they can use, and let Xinnor manage it. Similarly, with the entire idea of software RAID, I think people probably fall into two camps. On the one hand, they think storage is just storage: I threw some drives in, why doesn't it work? Or they think storage is a big task: I have to go buy a big thing and do a big thing. And this kind of falls comfortably in the middle, where they can get those features, but they don't have to make a huge investment in storage. I can see that there are probably times
Starting point is 00:28:00 when people might want a big storage infrastructure or storage platform as well. But for many people that are deploying, especially for ML training, they may want something that's a lot leaner and yet still provides the kind of reliability that you're talking about. So this makes a lot of sense. Thank you so much for talking a little bit about the lower level of AI data infrastructure, lower level in the stack, not in terms of importance. Where can we continue learning more about Xinnor? I bet you guys are doing some things and are going to be at some industry events.
Starting point is 00:28:35 Yeah, you can start by having a look at our website at xinnor.io. And then we are going to exhibit at the Future of Memory and Storage show, which will happen in August in the Bay Area. We look forward to seeing you there and answering many questions. Ace, I think that you and Xinnor are also working on a paper together, right? Yeah, that's available on our website.
Starting point is 00:29:07 So we've got some really compelling results from the lab, where we talk about what we saw putting a bunch of Solidigm high-capacity QLC SSDs into an array using xiRAID. And I have to say, I didn't do the testing, that was our solution architecture team, but reading through the results was very exciting to me. Anything that can be done, A, to improve performance and move through the AI model development workflow faster, that's a big win. But also to do so while freeing up a PCIe slot and saving power that you would have otherwise spent on a dedicated card, for example, is a big deal.
Starting point is 00:29:58 We keep hearing more and more by the week about these really scary projections of how much space and power AI data centers are going to consume in the near future. And so it's certainly a focus area at Solidigm to figure out how we can reduce the environmental impact there and make these things more efficient. A solution like xiRAID fits right into that, right? Doing more with less is absolutely the path forward here to enable AI development to continue at the breakneck pace of advancement that we're currently seeing. Well, thank you so much, both of you, for joining us today for this episode. Again, storage nerd here. I'm glad to be able to nerd out a little bit about storage,
Starting point is 00:30:41 while also maybe reassuring folks that they don't have to be storage nerds in order to have reliable and high-performance storage in the software domain. Thank you for listening to this episode of Utilizing Tech, part of the Utilizing AI Data Infrastructure series. You can find this podcast in your favorite podcast applications. Just look for Utilizing Tech. You'll also find us on YouTube if you prefer to watch a video version. If you enjoyed this discussion, please do give us a rating, give us a review, give us a comment.
Starting point is 00:31:11 This podcast was brought to you by Tech Field Day, home of IT experts from across the enterprise, now part of the Futurum Group. This episode and this season
Starting point is 00:31:23 were sponsored by Solidigm. For show notes and more episodes, head over to our dedicated website, utilizingtech.com, or find us on X/Twitter and Mastodon at Utilizing Tech. Thanks for listening, and we will see you next week.
