Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x04: Maximum Performance and Efficiency in AI Data Infrastructure with Xinnor
Episode Date: June 24, 2024. Cutting-edge AI infrastructure needs all the performance it can get, but these environments must also be efficient and reliable. This episode of Utilizing Tech, brought to you by Solidigm, features Davide Villa of Xinnor discussing the value of modern software RAID and NVMe SSDs with Ace Stryker and Stephen Foskett. Xinnor xiRAID leverages the resources of the server, including the AVX instruction set found on modern CPUs, to combine NVMe SSDs, providing high performance and reliability inside the box. Modern servers have multiple internal drive slots, and all of these drives must be managed and protected in the event of failure. This is especially important in AI servers, since an ML training run can take weeks, amplifying the risk of failure. Software RAID can be used in many different implementations, with various file systems, including NFS and high-performance networks like InfiniBand. And it can be tuned to maximize performance for each workload. Xinnor can help customers to tune the software to maximize reliability of SSDs, especially with QLC flash, by adapting the chunk size and minimizing write amplification. Xinnor also produces a storage platform solution called xiSTORE that combines xiRAID with the Lustre FS clustered file system, which is already popular in HPC environments. Although many environments can benefit from a full-featured storage platform, others need a software RAID solution to combine NVMe SSDs for performance and reliability. Hosts: Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/ Ace Stryker, Director of Product Marketing, AI Product Marketing at Solidigm: https://www.linkedin.com/in/acestryker/ Davide Villa, Chief Revenue Officer at Xinnor: https://www.linkedin.com/in/davide-villa-b1256a2/ Follow Utilizing Tech Website: https://www.UtilizingTech.com/ X/Twitter: https://www.twitter.com/UtilizingTech Tech Field Day Website: https://www.TechFieldDay.com LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day X/Twitter: https://www.Twitter.com/TechFieldDay Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm
Transcript
Cutting-edge AI infrastructure needs all the performance it can get,
but these environments must also be efficient and reliable.
This episode of Utilizing Tech, brought to you by Solidigm,
features Davide Villa from Xinnor discussing the value of modern software RAID
and NVMe SSDs with Ace Stryker and myself.
Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group.
This season, brought to you by Solidigm, focuses on the question of AI data infrastructure.
All of the ways that we have to support our AI training and inferencing workloads.
I'm your host, Stephen Foskett, organizer of the Tech Field Day event series. And joining me today as co-host is Ace Stryker from Solidigm. Welcome to the show once
again. Hey, Stephen. Thank you. A pleasure to be with you again. So Ace, we have been talking about
various aspects of the AI data infrastructure stack on this show. Today, we're going to go a little bit nerdy, a little bit deep on the topic of
storage and RAID. I know that most of the companies that are deploying AI infrastructure,
especially for training, the last thing they want to do is invest a ton of money and effort and
precious, precious PCIe space and data center space in a big, fancy storage system.
A lot of these companies are trying to create a system that does all the things they need
right there in the chassis.
Yeah, a lot of folks looking in at AI data infrastructure from the outside may not appreciate, you know, the challenge that comes with sort of coordinating all the pieces of the system, right?
It's easy enough to say, oh, you know, you can buy more drives to add capacity or you can, you know, pull these levers to increase your performance if you need.
But it turns out that, you know, sort of optimizing the way these pieces play together is not easily done.
Right. And there's there's a lot of interesting innovation happening in that space.
In particular, we're seeing, you know, a lot of these kind of coordination efforts, whether it's in networking or storage or other parts of the system.
We're kind of entering a world where a lot of that stuff is being done by software, right?
Where historically, you know, we had these sort of purpose-built pieces of hardware
that were responsible for that kind of work.
And in the new world, we're transitioning to software-defined solutions for a lot of this stuff,
and it's really exciting to see, because what you can get out of that in a lot of cases is not just more performance, but you can also do that
in a more efficient architecture. Oftentimes, you're saving power and space at the same time. And so
it's definitely an area to watch going forward. Yeah. And it seems as a storage nerd myself, it always makes me sad
when people underestimate what they need in terms of storage solutions. They maybe will try to
deploy on just bare drives, or they'll try to use sort of out-of-the-box software that isn't really up to the task from a performance and even from a reliability perspective, or they'll deploy something that's just way
overly complicated and huge. So refuting all of this, we have Xinnor. This is a company that makes
a software RAID solution. Essentially, it lets your server manage storage
in a way, internally, in a way that an external storage array might do. So we're excited to have
Davide Villa joining us today to talk a little bit about the world of software RAID and the world
of the practical ways that companies are managing storage.
Welcome to the show.
Thank you, Stephen, for inviting me.
I'm really excited to be part of the show.
I'm Davide Villa.
I'm the Chief Revenue Officer at Xinnor,
the leading company in software RAID for NVMe SSDs.
So tell us a little bit about yourself and about Xinnor.
Xinnor is a startup in software RAID, as you mentioned.
We were founded a couple of years ago, but in reality, we inherited the work that has been done
in the last 10 years by a previous company that our founder created and later sold. So now we are a young company, but we are leveraging more than 10 years of
development in optimizing the data path to provide very fast storage. We're about 45 people dispersed around the globe, and very much an R&D company.
And just to be clear, when you talk about optimizing the data path and software RAID
and everything, you're talking about building basically enterprise-grade reliability and
performance within the server without having to have a bunch of expensive
add-on cards or a separate chassis or anything.
You're talking about basically building a server with a bunch of NVMe drives and then
using the power of the CPU to provide an incredible amount of performance and reliability, right?
Yeah, there are enough resources within the server
that we don't need to add any accelerator or any other component
that might become a single point of failure at some point.
So what we do is combine AVX technology,
available on all modern CPUs,
with our special data path.
We call it the lockless data path.
And what's unique in our data path
is the way we distribute the load
across all the available cores on the CPU,
minimizing spikes of load on a single core.
And by doing that, we avoid spikes and we can get stable performance,
not just in normal operation, but also in degraded mode.
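Xinnor hasn't published xiRAID's internals, and the real data path uses AVX vector instructions in native code, but the idea Davide describes, dealing stripe work out to core-local queues so no lock or hot core is needed, can be sketched loosely in Python. The stripe counts, chunk sizes, and thread model below are illustrative assumptions, not the product's design.

```python
# Illustrative sketch only: not xiRAID's actual implementation.
# Idea shown: deal stripes out to core-local queues up front, so each
# worker owns its own work and no lock or hot core is needed.

from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 8

def xor_parity(chunks):
    """XOR the data chunks of one stripe into a parity chunk (RAID 5 style)."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def core_worker(my_stripes):
    """Each worker processes only its own queue: no shared state, no locks."""
    return [xor_parity(chunks) for chunks in my_stripes]

# 32 stripes of 4 data chunks each, dealt round-robin to 8 core-local
# queues so the load stays even by construction.
stripes = [[bytes([(s + c) % 256] * 4096) for c in range(4)] for s in range(32)]
queues = [stripes[core::NUM_CORES] for core in range(NUM_CORES)]

with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    parities = list(pool.map(core_worker, queues))
```

Because each worker only ever touches its own queue, no synchronization is required, which is the property that keeps per-core load flat.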
One of the things that we have set out to explore on this podcast is,
you know, within the context of AI specifically, how this boom is
drawing a lot of these technical challenges and opportunities for solutions into a sharper focus,
right? So can you talk a little bit about what impact the acceleration in the AI world has had on the problems that Xinnor set out to solve?
Yeah, that's a very hot topic today as we see that our main market
is definitely providing very fast storage for AI workloads.
So what we experienced by working with our customers is that traditional HPC players,
the universities, the research institutes,
they're all now facing some level of AI workload.
So they're all moving; they're all equipping themselves with GPUs, very powerful GPUs, that require a different
type of storage than what they traditionally used to deal with. Traditionally in the HPC space,
HDDs, rotating spindle drives, were good enough for many use cases.
When it comes to AI workloads, they are not sufficient.
Their performance is not sufficient any longer
because of the very high read and write requirements
that those modern GPUs have.
And those modern GPUs are expensive systems. So the customer cannot afford to keep them waiting for data. So it's absolutely critical that the storage that is selected to provide data for AI models is capable of delivering stable,
high performance in the tens of gigabytes per second.
That's certainly something we hear a lot about in our conversations
with folks in the industry is the sort of primary importance of GPU utilization, right?
Nobody wants to spend tens of thousands of dollars per unit,
and in some cases, even more than that,
to run something at 60% utilization, right?
And so feeding the data to the GPU
in something like the training stage of the data pipeline
becomes really important to make sure that you're getting the bang for your buck on the compute side, right?
Can you talk a little bit about what I see if I open up a box that has Xinnor running in it?
If I take a conventional architecture, I'm probably used to seeing an array of NVMe drives, and then
there's a RAID card in there that's doing a job. Can you talk a little bit about
how your solution is different?
Yeah, first of all, our solution is software-only. So we leverage the system resources, and when I say the system resources, I'm referring just to the CPU,
because we don't have a cache in our RAID implementation. So we don't need memory
allocation. That's the primary difference. But the reason why we came up with our own
software RAID implementation is because traditional hardware RAID architecture cannot
keep up with the level of parallelism of new NVMe drives.
So the level of parallelism that you can get on PCIe with NVMe drives is far beyond what a hardware RAID controller can keep up with when running the checksum calculations.
Then the other limitation that you face with hardware RAID is the number of PCIe lanes.
A hardware RAID card connected through the PCIe bus can only have 16 lanes.
And each NVMe drive has four lanes on its own,
meaning that you are saturating the PCIe bus
with just four NVMe drives.
And for AI models and workloads,
four NVMe drives are not sufficient.
So we have customer deployments, clusters
of multiple tens of servers with 24 NVMe drives per server.
So we believe that for NVMe drives
and for AI workloads,
there is only one way to go, which is software RAID.
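As a back-of-the-envelope check on that lane math, here is a rough sketch assuming PCIe Gen4 at about 2 GB/s of usable bandwidth per lane; exact figures vary by PCIe generation and protocol overhead.

```python
# Back-of-the-envelope check on the lane math above.
# Assumption: PCIe Gen4 at roughly 2 GB/s usable per lane.

GB_S_PER_LANE = 2.0      # approximate usable PCIe Gen4 bandwidth per lane
RAID_CARD_LANES = 16     # uplink of a typical hardware RAID card
LANES_PER_NVME = 4       # standard NVMe SSD attachment

card_ceiling = RAID_CARD_LANES * GB_S_PER_LANE   # ~32 GB/s through the card
per_drive = LANES_PER_NVME * GB_S_PER_LANE       # ~8 GB/s per SSD

drives_to_saturate = RAID_CARD_LANES // LANES_PER_NVME   # = 4 drives
direct_attach_24 = 24 * per_drive                        # = 192 GB/s raw

print(f"RAID card ceiling: {card_ceiling:.0f} GB/s, hit at {drives_to_saturate} SSDs")
print(f"24 direct-attached SSDs: up to {direct_attach_24:.0f} GB/s raw")
```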
Well, it's true because you look at these servers
and you talk about four NVMe drives.
Most of these servers have a lot more than four NVMe drives.
Most of these servers have a pile of them.
And even though those drives are pretty big
and each drive provides a lot of performance,
you still don't want to manage those individually.
I don't know about our listeners,
but I'm an old school Unix systems administrator
and I don't want to be dealing with 20 individual drives.
I want to be dealing with a combined drive.
And not only that, these drives are incredibly reliable,
very, very reliable,
but nothing is guaranteed, especially when it comes to things like the mechanical components
of the drives, insertion and removal and things like that. It is possible for drives to fail.
You need to have reliability as well and predictability of performance. That's another thing I think that
occurs to me too, is that if you lose a drive and there's a rebuild or something like that,
you don't want to lose all the work that you've done so far in terms of training workloads and
so on. So all of this points to the need, I think, for a system that manages storage. Now,
I mean, RAID isn't really storage management, but it is drive
management and it does definitely help in configuring these systems, right?
Yeah, you're absolutely right. So when you run AI models, those models can take several weeks,
if not multiple months, to run. And while running the model, something can go bad. So
you need to provision and make sure that you are able to deal with potential failures or drive
disconnections from the array, without losing data, for sure, but also without being impacted in the performance.
And that's where we step in, by providing RAID capability, so by providing data integrity,
and making sure that we keep very high performance even in degraded mode.
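A toy sketch of why parity RAID can keep serving reads in degraded mode: any one lost chunk is recoverable as the XOR of the surviving chunks and the parity chunk. The chunk sizes and values below are illustrative, not xiRAID's actual layout.

```python
# Toy illustration of degraded-mode reads in parity RAID: a lost chunk is
# the XOR of the surviving data chunks and the parity chunk, so reads keep
# being served (at extra compute cost) until the rebuild completes.

from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data_chunks = [b"\x11" * 8, b"\x22" * 8, b"\x33" * 8]   # three data drives
parity = reduce(xor, data_chunks)                        # parity drive

# Drive holding data_chunks[1] fails: reconstruct from survivors + parity.
recovered = reduce(xor, [data_chunks[0], data_chunks[2], parity])

assert recovered == data_chunks[1]   # the read still succeeds
```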
I'm curious, as you talk to your customers who are engaged in AI work, what's the sense you're getting of how those customers are viewing their storage needs?
And do you see any trends there?
You know, do you hear from folks about, hey, we really need to get more sequential throughput
out of our storage subsystem
or, hey, random performance is really important for us or capacity needs continue to grow
and grow.
What do you see as kind of like trends in the way folks are viewing and the requirements
that they are demanding from their storage subsystems in AI clusters?
That's the million-dollar question, I would say.
So we have been asking this question
to many different customers,
and we kind of get a very similar answer from most of them.
And the answer is that they don't know.
They need to provision for the extreme cases
because the workload is different.
It's not always the same workload.
So if we want to oversimplify, we can say that AI workloads are mostly sequential
by nature, a combination of reads during the ingestion and writes during the checkpoints.
But not all the AI models, all the AI training runs, are equal. So there are many distinctions to be made, and we see that random performance plays a role as well. What we experience with our
customers, as I said, is that most of those customers used to be running HPC infrastructure,
and they very much would like to stay with what they know.
So they would like to keep on using the popular parallel file systems that they used to use on their HPC implementations,
and be able to leverage their competence in using those parallel file systems, or file systems in general, also to run AI models. So as a matter
of fact, every customer has a different way of implementing storage. We're working with
many different universities: universities implementing our xiRAID with an all-flash implementation based on Lustre.
We have other universities who prefer going down the route of open source using BeeGFS.
And we also did deployments with universities that don't want complexity. They just want a simple file system like NFS and
to be able to saturate the network bandwidth.
In this specific case, I'm referring to an InfiniBand deployment,
which we recently did at a major university in Germany,
to provide fast storage through the network to two DGX systems.
So to answer your question,
it's tough to give you a simple answer because we're still in the early days of AI adoption,
and everybody's still in a learning phase.
What is clear is that performance really matters.
And in order to get that performance,
I imagine that there may be some tuning
that you might have to do as well.
So at the RAID level, the RAID layer, I imagine that there might be some
seemingly slight configuration changes that can make a huge
difference in performance. Again, based on my background in the storage space, I know that
things like block sizes and so on can make a huge difference in performance. I assume that you guys
can adapt to the needs of the higher level software, right?
Yeah, absolutely.
So our software gives the system admin
the flexibility to select the right RAID geometry
and the right chunk size,
the minimal amount of data that is written
within the RAID to a single drive,
in the most optimal way,
depending on the workload that will be run.
So we actually did a lot of activities
with our partner, Solidigm,
to find the optimal configuration
based on the specific workload
that we were running in a RAID 5 implementation,
and with proper alignment to the SSD indirection unit.
We see that customers, they require more and more storage,
and most of the workload is sequential.
So this makes QLC a very viable technology for AI workloads.
But everybody knows that QLC comes with some limitations,
like a limited number of program/erase cycles.
So with our software, by selecting the proper chunk size,
we are able to minimize the write amplification into the SSD.
And by doing that, we can enable using QLC
for extensive AI projects. So what I'm saying is actually going to be part of
a joint paper with Solidigm and you will be able to see the outcome of this research.
Yeah, absolutely. Folks who are interested in that white paper, by the way, can check it out on solidigm.com in our insights hub. We've got some great testing there with some of our QLC drives on the xiRAID solution. Do you mind, for folks who are less immersed in the storage world than ourselves, maybe just giving us a one-minute version of what write amplification is, why it's a challenge, and how xiRAID addresses it differently?
Write amplification is a term that refers to the fact that when the host writes one piece of data to the SSD,
internally, there is more than one write that happens to the physical component,
to the physical NAND component.
And given the fact that all SSDs have a limited number of program/erase cycles,
and when it comes to QLC, they have fewer program/erase cycles than TLC,
it's very important that we implement algorithms
to minimize this number,
to keep it as close as possible to one.
And with our software, we can do that, because we can change the chunk size,
the minimum amount of data that will be written to each SSD that is part of the RAID array.
And by doing that, we can minimize the read-modify-write
that needs to be done on the SSD itself.
So when we calculate the checksums,
if we are not aligned with the indirection unit of the SSD,
we might risk writing data multiple times to the SSD.
With our software, we can find the proper tuning
based on the workload, based on the number of drives
that are part of the RAID array, and by the level of RAID.
And we are able to find the optimal configuration
to keep this number as close to one as possible.
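xiRAID's actual tuning rules aren't public, but the alignment reasoning can be sketched roughly as below; the indirection-unit size, drive count, and chunk size are assumptions for illustration.

```python
# Hedged sketch of the alignment reasoning above: not xiRAID's actual
# tuning rules. The indirection unit (IU), drive count, and chunk size
# here are illustrative assumptions.

IU_BYTES = 16 * 1024        # assumed SSD indirection unit (varies by drive)
NUM_DRIVES = 10             # assumed RAID 5 array: 9 data drives + 1 parity
CHUNK_BYTES = 128 * 1024    # chunk size selected by the admin

# A chunk that is a whole multiple of the IU means every write the RAID
# layer issues lands on full IUs: no read-modify-write inside the SSD,
# so write amplification stays near 1.
assert CHUNK_BYTES % IU_BYTES == 0

DATA_DRIVES = NUM_DRIVES - 1
FULL_STRIPE = DATA_DRIVES * CHUNK_BYTES   # bytes per full-stripe write

def is_full_stripe_write(io_bytes: int) -> bool:
    """Whole-stripe writes also avoid parity read-modify-write in the RAID."""
    return io_bytes % FULL_STRIPE == 0

print(f"Full stripe = {FULL_STRIPE // 1024} KiB")
print(is_full_stripe_write(FULL_STRIPE * 4))   # True: aligned checkpoint I/O
```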
I could imagine that a lot of this might sound a little concerning
to someone listening and trying to deploy this.
They might think, oh boy, that's a lot of tuning,
a lot of under the hood stuff that I don't really understand.
Do you have best practices for various devices?
I mean, do you help customers to come up with the right configuration?
Yeah, that's part of our job. So normally when we engage with the customer,
we spend quite some time with our presales team to understand the workload of the customer and find the optimal configuration.
Then once the optimal configuration is identified, there's no work to be done anymore by the system
admin. It's ready to fly and there's no additional tuning to be done. One of the other things I
wanted to ask you about, we've talked about xiRAID, which is an incredible solution in terms of sort of the benefits to efficiency and performance at the same time.
Very exciting what you're working on over there.
Another thing I've heard about more recently is, I think, another product of yours called xiSTORE.
Could you tell us a little bit about that and how these pieces work together?
So our core competence, as I said, is in the data path and it's in the way we create a very efficient RAID.
And we see that for some industries, a standalone RAID, at least for some customers, is not sufficient.
So they're looking for a broader solution. So xiSTORE is one of the first of those solutions
that we're bringing to the market. And it's based on our xiRAID implementation for NVMe SSDs,
but we also combine it with declustered RAID
to handle the typical problem of hard disk drives,
which is extremely long rebuild times.
And so through our own implementation of declustered RAID,
we can drastically reduce the rebuild time of rotating drives.
Then on top of the RAID within xiSTORE,
you will find a high-availability implementation.
So there is no single point of failure.
You can lose a server and still you get all the RAID up and running.
We have our control plane to manage virtual machines.
And on top of those virtual machines, we mount the Lustre parallel file system. So it's a complete end-to-end solution that HPC and AI customers can deploy without needing to
combine xiRAID standalone with third-party software.
That's the first of a series of solutions that we will bring to the market, without ever leaving our core competence,
which is very much the RAID implementation.
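A rough sense of why declustering shortens rebuilds, with illustrative numbers rather than xiSTORE measurements: a classic rebuild is bottlenecked by one spare drive's write speed, while a declustered layout fans reconstruction out across the pool.

```python
# Rough arithmetic for why declustered RAID shortens hard-drive rebuilds.
# Numbers are illustrative assumptions, not xiSTORE measurements.

DRIVE_TB = 20        # capacity to reconstruct
DRIVE_MB_S = 200     # sustained HDD throughput
POOL_DRIVES = 60     # drives sharing the declustered spare space

seconds_per_tb = 1_000_000 / DRIVE_MB_S   # MB in a TB / (MB/s)

classic_hours = DRIVE_TB * seconds_per_tb / 3600        # one-drive bottleneck
declustered_hours = classic_hours / (POOL_DRIVES - 1)   # pool-wide fan-out

print(f"Classic rebuild: ~{classic_hours:.0f} h; "
      f"declustered: ~{declustered_hours:.1f} h")
```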
Very good.
Well, it certainly seems like the market is responding to your approach here.
It sounds like the future is very bright for Xinnor.
I wish the same to ourselves, and I guess I can say that
that's the situation we are experiencing:
the trend towards AI deployments across all the industries. So it's not just
one or two guys that are deploying AI models, but it's becoming pervasive in the industry.
It definitely boosts the requirements for very fast and reliable
storage. I think the interesting thing here too is that all the things that we've been talking about
are going to be very familiar and comfortable for people that are deploying these systems.
So you mentioned, for example, Lustre as part of the xiSTORE architecture. Well, a lot of HPC
environments are already using Lustre and are
happy with it. We talked about as well how if you combine multiple NVMe drives into a single
xiRAID system, well, that's going to be familiar for people who don't really know a lot about
storage, because they're going to see the storage as just a big space, a big amount of space that I can use, and let Xinnor manage that. Similarly, the entire idea
of software RAID, it's one of those things I think where people may, they kind of probably
fall into two camps. Either on the one hand, they think storage is just storage and I threw some
drives in and why doesn't it work? Or they think storage is a big task and I have to go buy a big thing and do a
big thing. And this kind of falls comfortably in the middle where they can get those features,
but they don't have to have a huge investment in storage. I can see that there are probably times
when people might want a big storage infrastructure or storage platform as well.
But for many people that are deploying, especially ML training, they may want something that's a lot leaner and yet still provides the kind of reliability that you're talking about.
So this makes a lot of sense.
Thank you so much for talking a little bit about the lower level of AI data infrastructure,
lower level in the stack,
not in terms of importance.
Where can we continue learning more about Xinnor?
I bet you guys are doing some things
and are going to be at some industry events.
Yeah, you can start by having a look at our website
at www.xinnor.io.
And then we are going to exhibit at Future of Memory and Storage,
which will happen in August in the Bay Area.
So we look forward to seeing you there and answering many questions.
Ace, I think that you and Xinnor are also working on a paper together, right?
Yeah, that's available on our website.
So we've got some really compelling results from the lab where we talk about what we saw putting a bunch of Solidigm high-capacity QLC SSDs into an array using xiRAID.
And I have to say, you know, I didn't do the testing.
That was our solution architecture team,
but reading through the results was very exciting to me. Anything that can be done,
A, to improve performance and move through the AI model development workflow faster. That's a big win.
But also to do so while freeing up a PCIe slot and saving power that you would have otherwise spent
on a dedicated card, for example, to do work like this
is a big deal.
We keep hearing more and more by the week
about these really scary projections
of how much space and power AI data centers are going to consume in the near future.
And so it's a focus area at Solidigm, certainly, to figure out how can we reduce the environmental impact there, make these things more efficient.
And a solution like xiRAID fits right into that, right? Doing more with less is absolutely the path forward here to enable
AI development to continue at its breakneck pace of advancement that we're currently seeing.
Well, thank you so much, both of you, for joining us today for this episode.
Again, storage nerd here. I'm glad to be able to nerd out a little bit about storage,
while also maybe reassuring folks that they don't have to be storage nerds in order to have reliable and high-performance storage in the software domain.
Thank you for listening to this episode of Utilizing Tech, part of the Utilizing AI Data
Infrastructure series. You can find this podcast in your favorite podcast applications. Just look
for Utilizing Tech. You'll also find us on YouTube if you prefer to watch a video version.
If you enjoyed this discussion,
please do give us a rating,
give us a review,
give us a comment.
We'd love to hear from you.
This podcast was brought to you
by Tech Field Day,
home of IT experts
from across the enterprise,
now part of the Futurum Group.
This episode and this season
were sponsored by Solidigm.
For show notes and more episodes, head over to our dedicated website, utilizingtech.com,
or find us on X/Twitter and Mastodon at Utilizing Tech.
Thanks for listening, and we will see you next week.