In The Arena by TechArena - Exploring Data in the AI Era with Solidigm - New Data Insights Series

Episode Date: March 26, 2024

The TechArena kicks off a Data Insights Series in collaboration with Solidigm, and TechArena host Allison Klein welcomes co-host Jeniece Wronowski and Solidigm data center marketing director Ace Stryker to the program to talk about data in the AI era, the series objectives, and how SSD innovation sits at the foundation of a new data pipeline.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allison Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allison Klein. Today we are kicking off a new series of Tech Arena interviews called the Data Insights Series. And for the first time, I am actually going to have a co-host on the podcast. I'd like to introduce her now. Welcome, Jeniece. Jeniece Wronowski from Solidigm. Welcome to the program. Hi, Allison. Thank you so much. I'm looking forward to this. So, Jeniece, we've known each other for a long time, and you've been working in the realm of data for a long time. But why don't you just talk a little bit about why we're embarking on this
Starting point is 00:01:00 Data Insights series? Yeah, I have been in the data industry for a while, over 15 years, oh my gosh, specifically in data storage. And I've seen a lot of trends over my 15 years, but there has been no time like this ever with the explosion and the evolution of this AI era that we're in. And we're just really excited to talk
Starting point is 00:01:24 with multiple industry leaders as a part of this series and talk about how they're utilizing storage and how it's reshaping the way they're looking at AI. You know, I've been doing a lot of podcasts on the Tech Arena of late on AI, and it's amazing when you start hearing about the data pipeline, you realize how central storage and storage innovation is to the delivery of AI. And I'm so excited to go explore this topic with other leaders in the industry with you. So thank you for joining, and thank you to Solidigm for coming along on this interesting journey. And we're going to start with an interview with Solidigm. Do you want to introduce our guest? Yeah, that would be great. I want to go
Starting point is 00:02:05 ahead and introduce our guest, Ace Stryker with Solidigm. He is Solidigm's data center product manager and has really become an industry expert. So we're excited to dive in with him today. Welcome, Ace. Thank you, Allison. Thank you, Jeniece. It's great to be here. I appreciate it. So Ace, Jeniece did just such a wonderful job of introducing you, but you've never been on the Tech Arena before. So why don't you just go ahead and provide a little bit of context on your background and how you got into this role leading data center products for Solidigm. Sure. Yeah, I really came at this very early on, like in my teens, as someone who got into building my own PCs at home, right? So I'd find the right parts and put them in the box and, you know, watch the numbers go up, or the frame rate of the games I was playing, as the case happened to be back then. And a long and circuitous route through undergrad and business school eventually led me to Intel, where I started in 2016. And I rotated around a few groups in Intel.
Starting point is 00:03:12 I landed eventually in what was called the Non-Volatile Memory Solutions Group, which was the NAND group at Intel at the time. And I spent a few years there in a few different roles. I was a solution architect for the Optane product line, I was a technical marketing engineer, and a couple of other things. And then two years and change ago, Solidigm was born when SK Hynix acquired Intel's NAND business and spun it off as an independent subsidiary based here in the US. And so I've been with the new company since day one. And up until recently, I was working primarily on the client side of the business. I was running product marketing for all of Solidigm's client products. And just in the past six months or so, I have jumped over to the data center side of life to try to learn what there is to learn over here and help sort of steer the company's
Starting point is 00:04:10 marketing strategy and direction as we get further along into the AI era, which I think we're going to cover. But it's a very exciting opportunity. There's a ton of energy in this space. As you know, things are rapidly evolving, and so it's a great spot to sort of be in the mix, and particularly to spend some time thinking about the contributions of storage, because that's what our company does. So, you know, Ace, let's dive into this. You keep mentioning the AI era, but can you tell us a little bit about what specifically has really challenged this industry to re-architect and deliver new workload requirements?
Starting point is 00:04:51 It's an interesting question, because there are really two parts to the answer. On the one hand, a typical AI workflow from start to finish has very specific requirements, right? And so you can look at that workflow in terms of discrete stages, which is often how we do it to try to understand what storage is doing. So you're ingesting a bunch of raw data at the beginning of the process. That's a task that, for example, requires very high sequential write performance. You clean up that data, then you train your new model on that data, which is a random read heavy process,
Starting point is 00:05:25 and then you validate and you move through to deployment and inference. So you do need to know what's happening at each step along the way in order to understand the technical requirements, and in particular the storage requirements. But at the same time, we're not exactly reinventing the wheel here. The things that matter to performant and efficient AI work are the things that matter to high performance compute workloads and other workloads that have existed for a long time, right? We're talking about bandwidth and IOPS on the performance side. We're talking about capacity, which becomes more and more important in the AI world as these training data sets get bigger and bigger and these models get more and more sophisticated, with billions and trillions of parameters now. We're talking about quality and reliability, things like annual
Starting point is 00:06:17 failure rate and UBER, the uncorrectable bit error rate. These things have been around for a long time, and they matter very much in AI. These are not new attributes of storage. What's new is mapping the things that storage vendors have known about and built their products on for a long time to these new workflows, and understanding really at a deep level what an AI application is demanding of storage, so that you can build an optimal solution, whether you're solving for performance at all costs or for a balanced price-performance setup. All these factors that we've been building products on for years are still the nuts and bolts of what goes into determining how good an AI solution is.
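To make that stage-by-stage mapping concrete, here is a minimal Python sketch of the workflow as described above. The stage names and I/O-pattern labels just restate the conversation; anything beyond that is an illustrative assumption rather than a Solidigm specification.

```python
# The stage-to-I/O-pattern mapping below summarizes the workflow
# described in this episode; the structure itself is an editorial
# illustration, not a vendor specification.

AI_PIPELINE = {
    "ingest":     "sequential write",  # landing large volumes of raw data
    "prepare":    "mixed read/write",  # cleaning and transforming the data
    "train":      "random read",       # shuffled batches fed to GPUs
    "checkpoint": "sequential write",  # periodically saving model state
    "deploy":     "random read",       # loading models and serving lookups
}

def storage_focus(stage: str) -> str:
    """Describe the dominant storage demand for one pipeline stage."""
    return f"{stage}: optimize for {AI_PIPELINE[stage]}"

for stage in AI_PIPELINE:
    print(storage_focus(stage))
```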
Starting point is 00:07:08 You know, I've been talking with a lot of companies from around the industry about what this requires, and I was recently at Mobile World Congress talking a lot about the edge. And really, it involves change both in the cloud and at the edge. And you have a lot of experience delivering products in both of those environments. How do you differentiate the unique requirements across that landscape? And is there anything unique that you would say is really important to dial in for one of those environments over the next few years? Yeah, certainly. The more work you're trying to push to the edge, the greater your constraints will be, right, in terms of space and power delivery and thermals and all of that. So in a very conventional infrastructure, everything is done in the core data center, from that ingest and pre-processing through
Starting point is 00:07:58 the training and deployment. A classic example is ChatGPT, right? This is one that we're all familiar with: you're providing inputs to the model, that inference is happening centrally, and the results come back to you. Now there's a move toward the edge, so that you're distributing your work, and perhaps you're still training your model centrally. But think about all the inference in a use case like security cameras, as an example, or self-driving cars; there are a bunch of them out there where you need split-second, real-time inference to occur. You can't afford to send that data back to a data center and wait for the processing and the insights to return to you. Your latency requirements are almost zero, right? As we move along and as this industry matures, the technology is supporting a move where more and more of this work is done closer to the end
Starting point is 00:09:01 user, closer to the edge. And that creates all sorts of challenges and opportunities from a product perspective, right? Whether you're making the storage devices like we are, whether you're making accelerators, whether you're making purpose-built servers, you begin to have to think about how these things live out there at the edge, where you may not have a rack to put hardware into. As an example, one of our partners is Supermicro, and they make a purpose-built server that literally hangs on a telephone pole, right? It doesn't look like a typical, traditional server, because it's built for a very specific use case.
Starting point is 00:09:40 Our products go into that. But it really forces us and the other companies who work in the ecosystem to reconceive these things from the ground up and try to understand, okay, how do we support, how do we accelerate that trend
Starting point is 00:09:55 where we see more and more of our partners and end users demanding faster response times, and perhaps greater capacities at the edge for archive requirements? And so it's very much an evolving thing, like so much in AI these days. But we expect that to continue in years to come, where we're moving to more and more of a distributed model. And the work, quote unquote, in an AI workflow from start to finish may happen in a number of different locations, as opposed to being all in the core data center by necessity.
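The placement logic sketched in this exchange, training stays central while latency-critical inference moves out toward the edge, can be captured in a few lines. This is a toy illustration; the 50 ms latency cutoff is an assumed number, not an industry threshold.

```python
# Assumed cutoff for "split-second" use cases; the real budget depends
# on the application, so treat this number as illustrative only.
EDGE_LATENCY_BUDGET_MS = 50.0

def place_workload(kind: str, latency_budget_ms: float) -> str:
    """Pick a location for an AI workload based on its latency budget."""
    if kind == "training":
        return "core data center"  # training is throughput-, not latency-bound
    if latency_budget_ms < EDGE_LATENCY_BUDGET_MS:
        return "edge"  # cameras, cars: a round trip to the core is too slow
    return "core data center"

print(place_workload("inference", latency_budget_ms=10))   # -> edge
print(place_workload("inference", latency_budget_ms=500))  # -> core data center
print(place_workload("training", latency_budget_ms=0))     # -> core data center
```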
Starting point is 00:10:15 Yeah, Ace, you and I both come from a compute-based company, right, that being Intel. And we know that so much of the industry today is really focused on compute and the performance of that compute. And then we've also heard, and we know this to be true: there is no AI without data, and without access to a lot of that data. So can you dive in a little bit more to how you see storage really playing a strategic role? Yeah, absolutely. I think generally most people listening to this, when they think of an AI server, the first thing they're
Starting point is 00:11:07 going to think about is the GPU, right? And there are great reasons for that. You don't get a lot done in AI without high-powered GPUs, and in fact, the numbers that we've seen suggest anywhere from 60 to 90 percent of the spend on AI servers is going straight to accelerators. And so it is an important consideration. But a model is only as good as the data you feed it, and that data is only as good as the storage that it lives on. So storage plays quite an important role in the process, starting with really high capacity drives that support the use of giant data sets for training. As an example, a lot of LLMs are built on Common Crawl, which is sort of a corpus of information gathered from repeated scrapes of
Starting point is 00:12:01 the whole internet over time, boiled down into a bunch of pages. And if you were to train a model on the full Common Crawl corpus, you're talking about something like 15 petabytes now, and it's only getting bigger, right? So storage is necessary to facilitate training these more sophisticated models on the huge data sets that are required. And then when you're in the training stage of the workflow, the name of the game is GPU utilization, right? You've spent a lot of money on these great high-powered compute parts. You don't want to starve them for data in the midst of a training process. That
Starting point is 00:12:38 would be suboptimal for a lot of reasons, from wall clock time to the way you're spending your money. Training some of these larger models can take weeks, even months we've seen, for a single training run. If your GPUs are not highly utilized, meaning they're not being fed enough data quickly enough from your storage, you're harming what we call your TCO, your total cost of ownership. That's one of the ways we think about efficiency. You're also either getting things done later or accepting a suboptimal outcome. And so there's a capacity piece, there's a performance piece, and of course endurance is always a question in any intensive workload, along with quality and reliability, which I mentioned earlier.
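The GPU-utilization point lends itself to a small illustration: if storage delivers batches at least as fast as the accelerator consumes them, the expensive compute never idles. The sketch below uses a bounded prefetch queue; the batch timings are made-up stand-ins, and if the read time were longer than the compute time, the consumer would stall exactly as Ace describes.

```python
# Toy model of keeping an accelerator fed: a background thread reads
# batches from "storage" into a bounded queue while the consumer
# (the "GPU") drains it. All timings here are invented for illustration.
import queue
import threading
import time

def read_batch(index: int) -> bytes:
    """Stand-in for a random read of one training batch from storage."""
    time.sleep(0.005)  # pretend the read takes 5 ms
    return bytes(16)

def prefetch(n_batches: int, buffer: queue.Queue) -> None:
    """Producer: keep the buffer full so the accelerator never starves."""
    for i in range(n_batches):
        buffer.put(read_batch(i))

batches = queue.Queue(maxsize=8)  # bounded buffer between storage and GPU
threading.Thread(target=prefetch, args=(100, batches), daemon=True).start()

for step in range(100):
    batch = batches.get()  # since reads outpace compute, this rarely blocks
    time.sleep(0.01)       # pretend the GPU spends 10 ms on the batch
print("done: storage kept the accelerator fed")
```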
Starting point is 00:13:25 We have a great video on our website, which I'll just plug briefly if that's all right. We have a landing page at solidigm.com slash AI, and if you're interested in this topic, there's a video between Roger Corell at Solidigm and one of our partners at Vast. They go through each of the stages of the AI workflow specifically and talk about what the storage device is being asked to do at each stage. I mentioned, at a high level, sequential writes for ingest and random reads for training, but there are levels of detail to that that you can get into. If you're interested in that topic, that video is a great starting place.
Starting point is 00:14:04 So, Ace, you've talked a lot about the data pipeline, and obviously there is a tremendous amount of data being processed. How much of that is being done on traditional spinning-disk hard drives? And how is this going to change the market as we continue to see the migration to SSDs? Hard drives are still a very big deal in this market. It depends on who you ask, but we've seen numbers that suggest up to 90% of the data used in AI workflows is still held on mechanical disk drives today. And that's primarily driven by the upfront cost of a hard drive being lower than a typical SSD, right? A lot of procurement folks at companies that are building AI
Starting point is 00:14:53 infrastructures ask one question, which is, what's the cost per gigabyte? Which makes sense to ask, right? And they'll say, okay, load me up on hard drives then for a lot of the work. We at Solidigm have a different point of view, as you might expect. One of the things that we have worked on, and I believe it's on our website as well, is what we call our TCO calculator, for total cost of ownership. The argument is, yes, per gigabyte, hard drives are going to cost you less upfront. But over the course of the average five-year life of a storage device, you're going to end up spending a lot more, for a variety of reasons. The biggest one is density, right?
Starting point is 00:15:36 The biggest widely available hard drives today top out at about 24 terabytes, whereas Solidigm makes SSDs that go up to 61 terabytes today, and we're building toward even bigger drives in the future. A drive that's two and a half times the size of the biggest hard drive means you're buying a lot fewer of them. You're saving a lot on power, because you're powering many fewer devices. The SSDs often also come in smaller form factors. Hard drives are always these three-and-a-half-inch, about the size of the, I don't know, what's a good analogy for the size of a hard drive? I want to say... The size of a taco tortilla.
Starting point is 00:16:18 Yeah. And the SSDs can be the size of, you know, the mini tacos you get at Jack in the Box that my 13-year-old loves so much. If you don't get that reference, that's fine. They're much smaller. But because of that, now you're saving on rack space. You need fewer racks to hold all these drives, right? And so you calculate out the costs associated with power and with the additional devices, and reliability is a key part of that calculation as well. Backblaze is a great source that your listeners might be familiar with. Every so often they publish the reliability stats for the hard drives that they use in their infrastructure, and the average annual failure rate across all the brands they employ is 1.7%. Solidigm's annual failure rate, calculated internally for our QLC drives, is less than 0.2%.
Starting point is 00:17:05 Wow. Many, many fewer drive failures over time. And so all these things add up to a financial picture that really favors solid state drives when you calculate it out over five years. And that's the conversation we have with a lot of our customers: think about your five-year TCO versus your upfront costs. But TCO aside, there are some things that hard drives just cannot do, regardless of the price. With a typical hard drive today, you might get a few hundred megabytes per second of sequential performance, so your SSD performance is going to be about 10x that, depending of course on the devices. On the random performance, which is where the
Starting point is 00:17:51 training happens, there's no contest. I mean, we're talking 4,000 to 5,000 times faster on an SSD than on a hard drive, because hard drives have a latency penalty associated with moving that mechanical head and seeking data from different places on the platter. Well, that seems like a no-brainer then. For training, yes. For training, which is 90% to 95% random read activity, there's really no case for hard drives to do that work unless time is just absolutely irrelevant to you, and it's hard to imagine anyone taking that point of view in the AI era, right, where things are moving so fast.
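The figures quoted in this segment are enough for a back-of-the-envelope version of the TCO argument. The 24 TB and 61 TB capacities and the 1.7% versus 0.2% annual failure rates come from the conversation; the fleet size and per-drive power draws below are illustrative assumptions, and Solidigm's actual calculator weighs many more factors.

```python
# Back-of-the-envelope five-year fleet comparison using the episode's
# capacity and failure-rate figures. DATASET_TB and the wattages are
# assumed values chosen only to make the arithmetic concrete.
YEARS = 5
DATASET_TB = 10_000  # assumed fleet capacity requirement

def drives_needed(capacity_tb: int) -> int:
    """Ceiling division: how many drives to hold the whole dataset."""
    return -(-DATASET_TB // capacity_tb)

def five_year_view(name: str, capacity_tb: int, afr: float, watts: float) -> None:
    n = drives_needed(capacity_tb)
    failures = n * afr * YEARS                      # expected failures
    energy_kwh = n * watts * 24 * 365 * YEARS / 1000  # fleet energy use
    print(f"{name}: {n} drives, ~{failures:.0f} expected failures, "
          f"~{energy_kwh:,.0f} kWh over {YEARS} years")

five_year_view("24 TB HDD", 24, afr=0.017, watts=9)      # assumed ~9 W/HDD
five_year_view("61 TB QLC SSD", 61, afr=0.002, watts=7)  # assumed ~7 W/SSD
```

Even with generous assumptions for the hard drive, the density gap alone means roughly two and a half times as many devices to buy, power, rack, and replace.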
Starting point is 00:18:36 So Ace, along with the hardware changes over time, and how much more advanced SSDs are over HDDs, can you tell us a little bit about new software innovations? Are there new software organizations that Solidigm is working with that make the drives a little bit more performant and efficient? Yeah, absolutely. So there's obviously a lot of great work going on on the hardware side, right?
Starting point is 00:18:56 Interconnects are evolving. PCIe Gen 5 is here, with Gen 6 and future iterations on the horizon. The NAND layer count is going up. Hardware innovation continues; the march continues there. But on the software side, there's some really cool stuff going on that's unlocking the potential for even better and more optimal architectures. I mentioned Vast as a partner of ours, and one example is
Starting point is 00:19:26 in the way that they deploy their storage. They typically have two layers to their storage infrastructure. Most of the data is stored on QLC SSDs, which already, as we discussed, are way faster than hard drives. But then they have a cache layer on top of that using what they call storage class memory. In particular, it's that Intel Optane stuff that I used to work on, right? And so they use that to accelerate certain parts of the workflow. So you get the cost advantages, the great read performance, and the capacity of QLC drives, and then you have this Optane layer on top that's accelerating the writes and
Starting point is 00:20:05 other parts of the workload, to help give you a balanced performance profile in a way that's not possible without that software optimization. And they're not the only ones we see doing that. They do it really well, but we actually have our own solution that helps people build a similar kind of architecture. It's called the Cloud Storage Acceleration Layer, which is a mouthful, so we call it CSAL. It is an open-source tool; you can go find it on GitHub and play around with it. But essentially, it allows you to intelligently
Starting point is 00:20:34 direct new writes to one device or another in a storage array. So if you wanted to take, for example, one of our really high-density QLC SSDs, like the P5336, that's the 61-terabyte dude that I mentioned earlier, and you wanted to put on top of that something like the P5810, which is our SLC, super-high-speed device, CSAL will essentially take care of the traffic direction, right? It will say, okay, new writes are going to go to the SLC drive
Starting point is 00:21:10 since that's what it's really good at. Reads are going to come from the QLC drive because it's really good at that and it's got tons of density. And so when you start pairing very capable hardware with software innovations like these as well, it just goes way above and beyond what even great hardware alone can provide.
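For readers who want the shape of that tiering idea in code, here is a conceptual sketch: new writes land on a fast SLC tier, a background flush migrates them to the dense QLC tier, and reads check the cache first. This illustrates the behavior described above, not CSAL's actual interface; the real open-source tool is on GitHub.

```python
class TieredStore:
    """Toy two-tier store: fast SLC write buffer over a dense QLC tier."""

    def __init__(self) -> None:
        self.slc_cache: dict[str, bytes] = {}  # fast, low-capacity write tier
        self.qlc_store: dict[str, bytes] = {}  # dense, read-optimized tier

    def write(self, key: str, value: bytes) -> None:
        # New writes land on the SLC tier, which absorbs them quickly.
        self.slc_cache[key] = value

    def flush(self) -> None:
        # Background job: migrate buffered writes down to the QLC tier.
        self.qlc_store.update(self.slc_cache)
        self.slc_cache.clear()

    def read(self, key: str) -> bytes:
        # Reads check the small cache first, then the capacity tier.
        if key in self.slc_cache:
            return self.slc_cache[key]
        return self.qlc_store[key]

store = TieredStore()
store.write("checkpoint-001", b"model state")
store.flush()
print(store.read("checkpoint-001"))  # served from the dense QLC tier
```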
Starting point is 00:21:29 Now, you've provided a lot of data on why innovation is required to fuel this AI era. And I'm so glad that we started this series with you, Ace, because I really see massive transformation in the storage industry happening with AI. You've mentioned Supermicro, you've mentioned Vast, and there will be other partners that come on this platform to tell their stories. But the types of innovation are things that we haven't really seen in the storage arena before, in terms of just dramatically changing what those platforms represent to the overall workloads. That starts with innovation within the storage media itself. And I'm very curious as to how Solidigm is approaching innovation in the core technologies to deliver the type of foundation that these innovative companies require. Can you comment on that? Yeah, to tease that out for folks really interested in what's going on at a media level: as you build a NAND device, you can choose how many bits you want to store per NAND cell within
Starting point is 00:22:37 the device, right? That's a choice you can make when you're building a new drive, and that dictates a lot. So the earliest SSDs took the simplest approach, one bit per cell. We call it SLC, a single-level cell. Super high performance, super high endurance, but it doesn't store as much per cell, obviously, as two or three or four bits per cell. As you move to two bits per cell, that's called MLC; three bits per cell is TLC; four is QLC, for reasons that are hopefully apparent. The more bits per cell you're storing, the bigger you can go and the more cost-efficient the device is. But you're also making trade-offs, right? It's not a free play in terms of gaining all that density; you're making trade-offs in some ways in terms of performance and endurance. And so it creates this tiered kind of product stack, where we offer the SLC for folks who need just absolute top-shelf performance, no matter what, and we offer the QLC with the huge densities, really optimized for read
Starting point is 00:23:45 speed. The benefit of Solidigm in particular, having been in this game for so long, or, you know, our predecessor company plus Solidigm, is that we're now on the third, fourth, fifth generation of many of these technologies. And so we've figured out how to mitigate those trade-offs. If folks had an experience with a first-generation QLC drive, they might have said, oh man, the performance here really took a big hit from the SLC devices that came along 10 years ago. But what we've seen, as our great hardware, architecture, and firmware teams have spent years innovating on this, is that we can claw back a lot of those trade-offs, to where now really everything is pretty well performance and price optimized. And it's just a matter of determining what your goals are
Starting point is 00:24:31 in terms of your AI development and deployment. And then once we have a conversation about those and figure out where you want to go, there's a huge product portfolio at Solidigm and a number of choices that we can make to deliver optimal storage fit for your architecture.
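As a quick reference for the cell types Ace walks through, the sketch below records the standard bits-per-cell values and the physical reason density trades against endurance and performance: each extra bit doubles the number of voltage states a cell must reliably distinguish. The comments are editorial summaries, not product specifications.

```python
# Standard NAND cell types and their bits per cell. The trade-off notes
# are editorial summaries of the discussion in this episode.
CELL_TYPES = {
    "SLC": 1,  # single-level cell: top performance and endurance
    "MLC": 2,  # multi-level cell
    "TLC": 3,  # triple-level cell
    "QLC": 4,  # quad-level cell: highest density, lowest cost per bit
}

for name, bits in CELL_TYPES.items():
    states = 2 ** bits  # voltage states the cell must distinguish
    print(f"{name}: {bits} bit(s)/cell, {states} states, "
          f"{bits}x the density of SLC per cell")
```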
Starting point is 00:24:54 Can you tell us a little bit, Ace, about where folks can connect and follow along with this series? Yeah, you bet. So please do check out our landing page on our website, solidigm.com slash AI. We will also be this week at NVIDIA GTC, so if you want to reach out and say hello, please do. I'm really looking forward to continuing to bring on voices that are telling this story from across the industry. For those who want to listen along, you can listen to the Data Insights Series anywhere you get your Tech Arena podcast, which is across the Tech Arena platforms, Spotify, Apple Podcasts, Google Podcasts, etc. And Jeniece, why don't you tell the audience where we're going next? Yeah, so we're excited.
Starting point is 00:25:48 As Ace said, we are going to be at GTC, so please tune in, because we will be doing a session with two of our latest and greatest partners, CoreWeave and Supermicro. So stay tuned for those two coming up. Well, thank you so much for being on the program today. It was so much fun. And Ace, we'd love to have you back soon. Yeah, absolutely. Thanks a lot for having me. I appreciate it. Jeniece, I guess we'll see each other at GTC.
Starting point is 00:26:18 Yes. Thank you, Allison. Thank you, Ace. This was awesome. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.
