In The Arena by TechArena - Exploring Data in the AI Era with Solidigm - New Data Insights Series
Episode Date: March 26, 2024
The TechArena kicks off a Data Insights Series in collaboration with Solidigm, as host Allison Klein welcomes co-host Jeniece Wronowski and Solidigm data center marketing director Ace Stryker to the program to talk about data in the AI era, the series objectives, and how SSD innovation sits at the foundation of a new data pipeline.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allison Klein.
Now, let's step into the arena.
Welcome to the Tech Arena. My name is Allison Klein. Today we are kicking off a new
series of Tech Arena interviews called the Data Insight Series. And for the first time, I am
actually going to have a co-host on the podcast. I'd like to introduce her now. Welcome, Jeniece.
Jeniece Wronowski from Solidigm. Welcome to the program.
Hi, Allison. Thank you so much. I'm looking forward to this.
So, Jeniece, we've known each other for a long time, and you've been working in the realm of data for a long time. But why don't you just talk a little bit about why we're embarking on this Data Insights series?
Yeah, I have been in the data industry for a while, over 15 years, oh my gosh, specifically in data storage. And I've seen a lot of trends over my 15 years, but there has been no time like this ever with the explosion and the evolution of this AI era that we're in. And we're just really excited to talk with multiple industry leaders as a part of this series, talk about how they're utilizing storage and how it's reshaping the way they're looking at AI.
You know, I've been doing a lot of podcasts on Tech Arena of late on AI, and it's
amazing when you start hearing about the data pipeline, you realize how central storage and storage innovation is to delivery of
AI. And I'm so excited to go explore this topic with other leaders in the industry with you. So
thank you for joining and for Solidigm joining on this interesting journey.
And we're going to start with an interview with Solidigm. Do you want to introduce our guest?
Yeah, that would be great. Yeah, I want to go
ahead and introduce our guest, Ace Stryker with Solidigm. He is Solidigm's data center product
manager and has really become an industry expert. So we're excited to dive in with him today.
Welcome, Ace. Thank you, Allison. Thank you, Jeniece. It's great to be here. I appreciate it.
So Ace, Jeniece did just such a wonderful job of introducing you, but you've never been on the Tech Arena before.
So why don't you just go ahead and provide a little bit of context on your background and how you got into this role leading data center products for Solidigm.
Sure. Yeah, I really came at this very early on, like in my teens, as someone who got into building my own PCs at home, right? So I'd find the right parts and put them in the box and, you know, watch the numbers go up, or the frame rate of the games I was playing, as the case happened to be back then.
And a long and circuitous route through undergrad and business school eventually led me to Intel, where I started in 2016.
And I rotated around a few groups in Intel.
I landed eventually in what was called the non-volatile memory solutions group, which is the NAND group at Intel at the time.
And I spent a few years there in a few different roles. I was a solution architect
for the Optane product line. I was a technical marketing engineer and a couple of other things.
And then two years and change ago, Solidigm was born when SK Hynix acquired Intel's NAND business
and spun it off as an independent subsidiary based here in the US.
And so I've been with the new company since day one. And up until recently was working primarily
on the client side of the business. I was running product marketing for all of Solidigm's client
products. And just in the past six months or so, I have jumped over to the data center side of life to try to learn what there is to learn over here and help sort of steer the company's marketing strategy and direction as we get further along into the AI era, which I think we're going to cover. But it's a very exciting opportunity. There's a ton of energy in this space, and as you know, things are rapidly evolving.
And so it's a great spot to sort of be in the mix and particularly to spend some time thinking about the contributions of storage, because that's what our company does.
So, you know, Ace, let's dive into this.
You keep mentioning the AI era, but can you tell us a little bit about what specifically
has, you know,
really challenged this industry to kind of re-architect and deliver new workload requirements?
It's an interesting question because there's really two parts to the answer. On the one hand,
you know, a typical AI workflow from start to finish has very specific requirements, right?
And so you can look at that workflow in terms of discrete stages,
which is often how we do it to try to sort of understand what storage is doing.
So, you know, you're ingesting a bunch of raw data at the beginning of the process.
That's a task that, for example, requires very high sequential write performance.
You clean up that data, you train your new model on that data,
which is a random read heavy process
and then, you know, you validate and you move through to deployment and inference. So you do need to know what's happening at each step along the way in order to understand the technical requirements, and in particular the storage requirements. But at the same time, we're not exactly reinventing the wheel here.
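To make those two access patterns concrete, here is a minimal Python sketch, assuming a local scratch file. It is an illustration of sequential writes (ingest) versus random reads (training), not a real benchmark; for real numbers you would use a tool like fio with direct I/O so the OS page cache doesn't flatter the reads.

```python
# Toy illustration (Unix-only, Python 3.8+) of the two access patterns:
# sequential writes (ingest) vs. random reads (training). NOT a real
# benchmark: the page cache will inflate the random-read figure.
import os, random, time

PATH = "scratch.bin"      # hypothetical scratch file
BLOCK = 1024 * 1024       # 1 MiB per I/O
COUNT = 256               # 256 MiB total, small enough to run anywhere

def seq_write_mibps():
    buf = os.urandom(BLOCK)
    t0 = time.perf_counter()
    with open(PATH, "wb") as f:
        for _ in range(COUNT):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())            # force the data to the device
    return COUNT / (time.perf_counter() - t0)

def rand_read_mibps():
    fd = os.open(PATH, os.O_RDONLY)
    offsets = [random.randrange(COUNT) * BLOCK for _ in range(COUNT)]
    t0 = time.perf_counter()
    for off in offsets:
        os.pread(fd, BLOCK, off)        # read 1 MiB at a random offset
    os.close(fd)
    return COUNT / (time.perf_counter() - t0)

print(f"sequential write: {seq_write_mibps():.0f} MiB/s")
print(f"random read:      {rand_read_mibps():.0f} MiB/s")
os.remove(PATH)
```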
Like the things that matter to, you know, performant and efficient AI work are the things that matter to high performance compute workloads and other workloads that have existed for a long time, right?
We're talking about bandwidth and IOPS on the performance side. We're talking about capacity, which becomes more and more important in the AI world as these training data sets get bigger and bigger and these models get more and more sophisticated, with billions and trillions of parameters now. We're talking about quality and reliability, you know, things like annual failure rate and UBER, the uncorrectable bit error rate. These things have been around for a long time, and they matter very much in AI. These are not new attributes of storage.
What's new is mapping the things that storage vendors have known about and built their products
on for a long time to these new workflows and understanding really at a deep level what
an AI application is demanding of storage so that you can kind of build an optimal solution,
whether you're trying to solve for performance at all costs or balanced price performance type
of a setup. All these factors that we've been building products on for years are still the
nuts and bolts of what goes into determining how good an AI solution is. You know, I've been
talking with a lot of companies from around the industry about what this requires and was recently at Mobile World Congress talking a lot about the edge.
And really, it involves change both in the cloud and at the edge.
And you have a lot of experience delivering products in both of those environments.
How do you differentiate the unique requirements across that landscape? And, you know, is there anything unique that you would say
is really important to dial in for one of those environments over the next few years?
Yeah, certainly, you know, the more work you're trying to push to the edge, the greater your
constraints will be, right, in terms of space and power delivery and thermals and, and all this stuff.
So, you know, in a very conventional infrastructure, everything is done in the core data center, from that ingest and pre-processing through the training and deployment. And a classic example is ChatGPT, right?
This is one that we're all familiar with: you're providing inputs to the model, that inference is happening centrally, and then the results come back to you. Contrast that with moving work to the edge, where you're distributing your work now. Perhaps you're still training your model centrally, but all the inference happens locally in a use case like, you know, security cameras as an example, or self-driving cars, or, you know, there's a bunch of them out there, where you need split-second, real-time inference to occur. You can't afford to send that data back to a data center and wait for the processing and the insights to return to you. Your latency
requirements are almost zero, right? As we move along and as this industry matures,
the technology is supporting a move where more and more of this work is done closer to the end
user, closer to the edge. And that creates all sorts of challenges
and opportunities from a product perspective, right? Whether you're making the storage devices
like we are, whether you're making accelerators, whether you're making purpose-built servers,
you know, you begin to have to think about how do these things live out there in the edge where,
you know, you may not have a rack to put hardware into.
You know, as an example, one of our partners is Supermicro and they make a purpose-built server thing that literally hangs on a telephone pole, right?
It doesn't look like a typical kind of traditional server because it's built for a very specific
use case.
Our products go into that. But it really forces us and the other companies who work in the ecosystem to reconceive these things really from the ground up and try to understand, okay, how do we support, how do we accelerate that trend where we see more and more of our partners and end users demanding faster response times, perhaps greater capacities at the edge for archive requirements?
And so it's very much an evolving thing, like so much in AI these days. But we expect that to continue in years to come, where we're moving to more and more of a distributed model. And the work, quote unquote, in an AI workflow from start to finish may happen in a number of different locations, right, as opposed to being all in the core data center by necessity.
Yeah, Ace, you and I both come from a compute-based company, right, being Intel. And we know that so much of the industry today is really focused on
right, be it Intel. And we know that so much of the industry today is really focused on
compute and the performance of that compute. And then we've also heard, and we know this to be
true, right? There is no AI without data, right? And access to a lot of that data. So can you dive
in a little bit more to how you see storage really playing a strategic role? Yeah, absolutely. I think
generally most people, you know, listening to this, when they think of an AI server, the first thing they're going to think about is the GPU, right? And there's great reasons for that. You don't get a lot done in AI without high-powered GPUs. And in fact, you know, the numbers that we've seen suggest anywhere from 60 to 90 percent of the spend on AI servers is going straight to accelerators.
Right. And so it is an important consideration.
But, you know, a model is only as good as the data you feed to it.
And that data is only as good as the storage that it lives on. So that storage plays quite an important role in the process, starting with really high-capacity drives that support the use of giant data sets for training. We see, as an example, LLMs, a lot of them built on Common Crawl, which is sort of a corpus of information gathered from repeated scrapes of the whole internet over time, and, you know, sort of boiled down into a bunch of pages. And if you were to train a model on the full Common Crawl corpus, you're talking about something like 15 petabytes now, and it's only getting bigger, right?
So storage is necessary to facilitate training these more sophisticated models on these huge
data sets that are required.
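As a quick back-of-the-envelope on what 15 petabytes means in drive counts, using the capacities mentioned in this conversation (illustrative only; real deployments add redundancy, filesystem overhead, and free-space headroom):

```python
# Rough drive counts for a 15 PB training corpus.
CORPUS_TB = 15_000          # 15 petabytes

HDD_TB = 24                 # biggest widely available hard drive cited above
SSD_TB = 61.44              # Solidigm D5-P5336-class QLC SSD cited above

print(f"24 TB HDDs needed:    {CORPUS_TB / HDD_TB:.0f}")   # ~625 drives
print(f"61.44 TB SSDs needed: {CORPUS_TB / SSD_TB:.0f}")   # ~244 drives
```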
And then when you're in the training stage of the workflow, the name of the game is
GPU utilization, right? So you've spent a lot of money on these great high powered compute parts
in there. You don't want to starve them for data in the midst of a training process, right? That
would be suboptimal for a lot of reasons from wall clock time to the way you're spending your money. But
training some of these larger models can take weeks, even months we've seen for a single training
run. If your GPUs are not, you know, highly utilized, meaning they're not being fed enough
data quickly enough from your storage, you're harming your, what we call your TCO, your total
cost of ownership. That's one of the ways we think about efficiency. You're also, you know, getting things done either
later or accepting a suboptimal outcome. And so there's a capacity piece, there's a performance piece. There's, of course, endurance, which is always a question in any intensive workload, and quality and reliability, which I mentioned earlier.
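The "don't starve the GPUs" point boils down to overlapping storage reads with compute. Here is a toy sketch of the idea, where read_batch and the training step are stand-ins; real stacks (PyTorch's DataLoader, NVIDIA DALI, and similar) implement a far more sophisticated version of the same pattern:

```python
# A background thread keeps a bounded queue of batches read from storage
# so the accelerator never sits idle waiting on I/O.
import queue, threading, time

def read_batch(i):
    time.sleep(0.01)                 # stand-in for a random read from storage
    return f"batch-{i}"

def producer(q, n_batches):
    for i in range(n_batches):
        q.put(read_batch(i))         # blocks while the queue is full
    q.put(None)                      # sentinel: no more data

q = queue.Queue(maxsize=8)           # prefetch depth, sized to hide I/O latency
threading.Thread(target=producer, args=(q, 100), daemon=True).start()

while (batch := q.get()) is not None:
    time.sleep(0.02)                 # stand-in for a GPU training step
```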
We have a great video on our website, which I'll just plug briefly if that's all right.
We have a landing page at solidigm.com slash AI.
And if you're interested in this topic, there's a video between Roger Corell at Solidigm
and one of our partners at VAST.
And they go through each of the stages of the AI workflow specifically and talk about what is the storage device being asked to do at each stage.
I mentioned, at a high level, sequential writes for ingest and random reads for training.
But there's, you know, levels of detail to that that you can get into.
If you're interested in that topic, I suggest that video is a great starting place.
So, Ace, you've talked a lot about the data pipeline.
And obviously, there is a tremendous amount of data that's being processed.
How much of that is being done on traditional spinning disk hard drives?
And how is this going to change the market as we continue to see the migration to SSDs? Hard drives are still a very
big deal in this market. It depends on who you ask, but we've seen numbers that suggest up to 90%
of the data used in AI workflows is still held on mechanical disk drives today. And that's primarily
driven by the upfront cost of a hard drive being lower than a
typical SSD, right? A lot of procurement folks at companies that are building, you know, AI
infrastructures ask one question, which is what's the cost per gigabyte, which makes sense to ask,
right? And they'll say, okay, load me up on hard drives then for a lot of the work.
We at Solidigm have a different point of
view, as you might expect. And one of the things that we have worked on, and I believe it's on our
website as well, you can go and find what we call our TCO calculator, for total cost of ownership.
So the argument is, yes, like per gigabyte, hard drives are going to cost you less upfront. But over the course of the average five-year life of a storage device,
you're going to end up spending a lot more for a variety of reasons.
The biggest one is density, right?
The biggest widely available hard drives today top out at about 24 terabytes, whereas Solidigm makes SSDs that go up to 61 terabytes today. And we're building
the future for even bigger drives. But a drive that's two and a half times the size of the
biggest hard drive means you're buying a lot fewer of them. You're saving a lot on power.
You're powering many fewer devices. The SSDs are often also smaller form factors. Hard drives are always
these three and a half inch, about the size of the, I don't know, what's a good analogy for the
size of a hard drive? I want to say... The size of a taco tortilla.
Yeah. The SSDs can be the size of, you know, like the mini tacos you get at Jack in the Box that my 13-year-old loves so much. If you don't get that reference, that's fine. They're much smaller. But because of that,
now you're saving on rack space. You need fewer racks to hold all these drives, right? And so
as you calculate out the costs associated with power, with the additional devices,
reliability is a key part of that calculation as well. Backblaze is a great source that your
listeners might be familiar with. Every so often they publish the reliability stats for the hard
drives that they use in their infrastructure. And the average annual failure rate across all
the brands they employ is 1.7%. Solidigm's annual failure rate, calculated internally for our QLC drives, is less than 0.2%.
Wow.
Many, many fewer drive failures over time.
And so all these things add up to a financial picture that really favors solid state drives when you calculate it out over five years.
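A deliberately simplified sketch of that five-year TCO comparison follows. The two failure rates are the ones cited in this conversation; every other number is a made-up placeholder, so treat the output as the shape of the calculation rather than a result. Real models, like the calculator Ace mentions, also fold in cooling, networking, and performance, which is where dense SSDs claw back their higher upfront cost per terabyte.

```python
# Toy five-year TCO model: acquisition + energy + failure replacements
# + rack space. All inputs except the cited AFRs are placeholders.
def five_year_tco(capacity_tb, drive_tb, usd_per_tb, watts_per_drive, afr,
                  drives_per_rack, rack_usd_per_year,
                  kwh_usd=0.12, years=5):
    drives = capacity_tb / drive_tb
    capex = drives * drive_tb * usd_per_tb
    energy = drives * watts_per_drive / 1000 * 24 * 365 * years * kwh_usd
    spares = drives * afr * years * drive_tb * usd_per_tb   # failed-drive swaps
    racks = drives / drives_per_rack * rack_usd_per_year * years
    return capex + energy + spares + racks

hdd = five_year_tco(15_000, 24, usd_per_tb=15, watts_per_drive=9,
                    afr=0.017, drives_per_rack=500, rack_usd_per_year=30_000)
ssd = five_year_tco(15_000, 61.44, usd_per_tb=45, watts_per_drive=12,
                    afr=0.002, drives_per_rack=1000, rack_usd_per_year=30_000)
print(f"HDD fleet: ${hdd:,.0f}   SSD fleet: ${ssd:,.0f}")
```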
And so that's the conversation we have with a lot of our customers is like, you know, think about your five-year TCO versus your
upfront costs. But, you know, TCO aside, there are some things that hard drives just cannot do
regardless of the price, right? Your typical hard drive today, you know, you might get a few hundred
megabytes per second of sequential performance. And so your SSD performance is going to be 10x that, depending of course on the devices, but on average. The random performance, which is where the
training happens, there's no contest. I mean, we're talking 4,000 to 5,000 times faster on an SSD
than on a hard drive because hard drives have a latency penalty associated with
moving that mechanical head and seeking
data from different places on the platter. Well, that seems like a no-brainer then.
For training, yes. For training, which is 90% to 95% random read activity, there's really no case
for hard drives to do that work unless time is just absolutely irrelevant to you, which, you know, it's hard to imagine anyone taking that point of view in the AI era, right, where things are moving so fast.
So Ace, you know, along with the hardware changes over time and how much more advanced SSDs are over HDDs, can you tell us a little bit about, you know, new software innovations? Are there new software organizations that Solidigm is working with that make the drives a little bit more performant and efficient?
Yeah, absolutely.
So there's obviously a lot of great work
going on on the hardware side, right?
Interconnects are evolving. PCIe Gen 5 is here, and Gen 6 and future iterations are on the horizon. The NAND layer count is going up. Like, hardware innovation continues. The march continues there.
But on the software side, there's some really cool stuff going on that's kind of unlocking potential for, you know, even better and more optimal architectures.
And so one example of that is, I mentioned VAST as a partner of ours. In the way that they deploy their storage, you know, they typically have two layers to their storage infrastructure. Most of the data is stored on QLC SSDs, which already, as we discussed, are way faster than hard drives. But then they have a cache layer on top of it using what they call
storage class memory.
In particular, it's that Intel Optane stuff that I used to work on, right?
And so they use that to accelerate certain parts of the workflow.
So you get like the cost advantages, the great read performance, and the capacity of QLC drives.
And then you have this Optane layer on top that's accelerating the writes and
other parts of the workload to help give you a balanced performance profile in a way that's not
possible without that software optimization. And they're not the only ones we see doing that. They
do it really well, but we actually have our own solution that helps people build a similar kind
of architecture. It's called the Cloud Storage Acceleration Layer, which is a mouthful.
We call it CSAL.
But it is an open source tool.
You can go find it on GitHub and play around with it.
But essentially, it allows you to intelligently
direct new writes to one device or another
in a storage array.
So if you wanted to take, for example,
one of our really high-density QLC SSDs, like the P5336, that's the 61-terabyte drive that I mentioned earlier.
And you wanted to put on top of that something like the P5810, which is our SLC super-high-speed write device, then CSAL will essentially take care of the traffic direction, right?
And it will say, okay, new writes are going to go to the SLC drive, since that's what it's really good at. Reads are going to come from the QLC drive, because it's really good at that and it's got tons of density. And so when you start pairing very capable hardware with software innovations like these as well, it just goes way above and beyond what even great hardware alone can provide.
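A toy illustration of the traffic-direction idea Ace describes, with Python dictionaries standing in for the two devices; the real CSAL operates down at the block layer and is far more sophisticated:

```python
# All new writes land on a fast SLC-like tier; reads are served from
# whichever tier holds the block; buffered writes are periodically
# flushed to the dense QLC-like tier as one large batch, which is the
# access pattern QLC handles best.
class TieredStore:
    def __init__(self, flush_threshold=4):
        self.fast = {}          # stand-in for a small, fast SLC drive
        self.capacity = {}      # stand-in for a dense QLC drive
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.fast[key] = value                  # new writes always hit SLC
        if len(self.fast) >= self.flush_threshold:
            self._flush()

    def read(self, key):
        # serve from the fast tier if the block is still there, else QLC
        return self.fast.get(key, self.capacity.get(key))

    def _flush(self):
        # migrate buffered writes to the capacity tier in one batch
        self.capacity.update(self.fast)
        self.fast.clear()

store = TieredStore()
for i in range(10):
    store.write(f"block-{i}", bytes(4096))
print(store.read("block-3") is not None)        # True
```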
Now, you've provided a lot of data on why innovation is required to fuel this AI era.
And I'm so glad that we started this series with you, Ace, because I really see massive transformation in the storage industry happening with AI.
I mean, you've mentioned Supermicro, you've mentioned VAST, and there will be other partners that come on this platform to tell their stories.
But the types of innovation are things that we haven't really seen in the storage arena before in terms of just dramatically changing what those platforms represent to the overall workloads. That starts with innovation within the storage media itself.
And I'm very curious as to how Solidigm is approaching innovation in the core technologies
to deliver the type of foundation that these innovative companies require. Can you comment on
that? Yeah, to tease that out for folks really interested in what's going on at a media level,
as you build a NAND device, you can choose how many bits you want to store per NAND cell within
the device, right? That's a choice you can make when you're building a new drive. And that dictates
a lot. So the earliest SSDs,
the simplest approach was one bit per cell. We call it SLC. That's a single level cell.
Super high performance, super high endurance, but doesn't store as much per cell, obviously,
right, as two or three or four bits per cell. And so as you move to two bits per cell, that's called MLC, three bits per cell is TLC,
four is QLC for reasons that are hopefully apparent. The more bits per cell you're storing,
the bigger you can go and the more cost efficient the device is. But you're also making trade-offs,
right? It's not a free play in terms of gaining all that density. You're making trade-offs in some ways in terms of performance and endurance. And so it creates this tiered kind of product stack, right? Where we offer the SLC for folks who need, you know, just absolute top-shelf performance, no matter what. We offer the QLC with the huge densities, really optimized for read speed. The benefit of Solidigm in particular, having been in this game for so long, or, you know, our predecessor company plus Solidigm, is that we're now on the third, fourth, fifth generation
know, our predecessor company plus Solidigm, is that we're now on the third, fourth, fifth generation
of many of these technologies. And so we've sort of figured out how to mitigate those trade-offs.
You know, if folks had an experience with a first-generation QLC drive, they might have said, oh man, the performance here really took a big hit from the SLC devices that came along 10 years ago.
But what we've seen, as we've developed and as our great hardware, architecture, and firmware teams have spent years innovating on this, is we can claw back a lot of those trade-offs to where now
really everything is pretty well performance and price optimized.
And it's just a matter of determining what your goals are
in terms of your AI development and deployment.
And then once we have a conversation about those
and kind of figure out where you want to go,
there's a huge product portfolio at Solidigm
and a number of choices that we can make
to deliver optimal storage fit for your architecture.
Can you tell us a little bit, Ace, about where folks can connect and follow along with this series?
Yeah, you bet.
So please do check out our landing page
on our website, solidigm.com slash AI.
We will also be this week at NVIDIA GTC, so if you want to reach out and say hello, please do. We're really looking forward to continuing to bring on voices that are telling this story from across the industry. For those who want to listen along, you can listen to the
Data Insights series anywhere you get your Tech Arena podcast, which is across the Tech Arena
platforms, Spotify, Apple Podcasts, Google Podcasts, etc. And Jeniece, why don't you tell the audience where we're going next?
Yeah, so we're excited.
As Ace said, we are going to be at GTC, and please tune in, because we will be doing a session with two of our latest and greatest partners, CoreWeave and Supermicro.
So stay tuned for those two coming up.
Well, thank you so much for being on the
program today. It was so much fun. And Ace, we'd love to have you back soon. Yeah, absolutely.
Thanks a lot for having me. I appreciate it. Jeniece, I guess we'll see each other at GTC.
Yes. Thank you, Allison. Thank you, Ace. This was awesome.
Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net.
All content is copyright by the Tech Arena.