Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x03: Benchmarking AI Data Infrastructure with MLCommons

Episode Date: June 17, 2024

Organizations seeking to build an infrastructure stack for AI training need to know how the data platform is going to perform. This episode of Utilizing Tech, presented by Solidigm, includes Curtis Anderson, Co-Chair of the Storage Working Group at MLCommons, discussing storage benchmarking with Ace Stryker and Stephen Foskett. MLCommons is an industry consortium seeking to improve AI solutions through joint engineering. The organization publishes the well-known MLPerf benchmark, which now includes practical metrics for storage solutions. The goal of MLPerf Storage is to answer the key question: will a given data infrastructure support AI training at a given scale? The organization encourages storage vendors to run the benchmarks against their solutions to prove their suitability for specific workloads. The AI industry is already shifting its focus from maximum scale and performance to more balanced infrastructure using alternative GPUs, accelerators, and even CPUs, and is increasingly concerned about price and environmental impact. The question of data preparation is also rising, and this generally uses a different, CPU-focused solution. MLPerf Storage is focused on training today and will soon address data preparation, though this can be quite different for each data set. The submission window for the next MLPerf Storage benchmark opens soon, and we encourage all data infrastructure companies to get involved and submit their own performance numbers.

Hosts:
Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/
Ace Stryker, Director of AI Product Marketing at Solidigm: https://www.linkedin.com/in/acestryker/

Guest:
Curtis Anderson, Co-Chair, MLCommons Storage Working Group: https://www.linkedin.com/in/curtis-anderson-174aa/

Follow Utilizing Tech
Website: https://www.UtilizingTech.com/
X/Twitter: https://www.twitter.com/UtilizingTech

Tech Field Day
Website: https://www.TechFieldDay.com
LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day
X/Twitter: https://www.Twitter.com/TechFieldDay

Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm

Transcript
Starting point is 00:00:00 Organizations seeking to build an infrastructure stack for AI training need to know how that data platform is going to perform. This episode of Utilizing Tech, presented by Solidigm, includes Curtis Anderson, co-chair of the Storage Working Group at ML Commons. We're discussing storage benchmarking with Ace Stryker and learning how we can know whether a given storage infrastructure is going to perform well enough for a given ML training environment. Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group. This season is presented by Solidigm and focuses on the question of AI data infrastructure. I'm your host, Stephen Foskett, organizer of the Tech Field Day event series. Joining me today as my co-host is Mr. Ace Stryker of Solidigm. Welcome to the show, Ace. Thank you very much, Stephen. How are you, sir?
Starting point is 00:00:55 I'm doing pretty well. This has been going great. I'm so glad to be doing this special season with Solidigm focused on a topic that's near and dear to my heart, which is basically how we can make storage be useful. And I guess that's kind of what you're after here too, huh? Yeah, it's been a ton of fun so far. We've had some really interesting guests so far from a lot of different corners of the industry, right? A lot of different looks at the way the data infrastructure needs are evolving to keep up with, I guess, AI is the bright, shiny object today, right?
Starting point is 00:01:35 And will be for some time. It is the driver of these requirements and these efficiency issues really coming to the forefront lately. But yeah, it's been a great journey so far. And I think we've got another great guest lined up today. Yeah, it's one of those things that comes up a lot, just to what you just said, is basically that storage has to meet the requirements of the application.
Starting point is 00:02:03 Now, that's been something that we've said forever. You know, data infrastructure, data platforms, you know, performance has to be, well, good enough, right? But how do we know how good performance is? That's been a challenge in the industry for a long, long time. How do you measure performance? How do you express those measurements? And how do you specify things that are good enough?
Starting point is 00:02:26 It turns out to be, I think, a more complicated question than a lot of folks would assume, right? If you as a consumer go buy a laptop, there are a number of ready-made tools you can pull off the shelf. You can run Cinebench. You can run PCMark. You can get a pretty good sense of what your hardware is capable of pretty quickly, and you can use that information to make relatively intuitive apples-to-apples comparisons
Starting point is 00:02:59 between different options. When it comes to things at data center scale, and particularly as we look at things like data infrastructure and what the requirements or the capabilities of the storage subsystem are, we get a lot of questions about that. It turns out to be kind of a tough nut to crack. And it's the same for other aspects of the AI stack as well. One of the organizations that I'm particularly fond of, and you'll recognize them from Field Day and from the Utilizing Tech podcast, is ML Commons. They are really focused on answering these questions. And ML Commons, as I've mentioned previously, has a storage benchmark as well. So we have decided to invite on the podcast this week Curtis Anderson, who is the co-chair of the Storage Working Group for ML Commons and is, well, probably more knowledgeable about this question than anyone.
Starting point is 00:03:58 Welcome, Curtis. Thank you. Thank you for having me. My name is Curtis Anderson. I'm one of the co-chairs of the MLPerf Storage Working Group at MLCommons. So tell us a little bit more about yourself and what you do with MLCommons. So I'm a storage guy, not an AI person. I'm learning AI as I go along. That's actually an exciting piece of it, learning the new technology. The Storage Working Group attempts to benchmark storage subsystems in support of AI workloads. I can go into a lot more detail, but that's sort of the big picture: I'm one of the co-chairs of that working group, so I'm here to describe what it does, how it works, and invite people to join. Excellent. And like I said, the thing that I love about ML Commons is that it is very practical. ML Commons is not interested in mythical angel storage numbers or performance,
Starting point is 00:04:59 you know, let's see how many whatevers we can pile up. ML Commons is very interested in, like, how does this perform under workload? And it's the same with storage, right? Yeah, the benchmark emulates a workload. It imposes a workload on a storage subsystem, the same workload that a training pipeline would impose on the storage. And so you get an honest to goodness, this is how your storage product or solution would perform in this real world scenario. Curtis, can you talk a little bit more about the nature of the workload?
Starting point is 00:05:40 I think in other episodes, we've explored the AI data pipeline a bit, and we've talked about how there are discrete steps here, you know, ingesting raw data versus preparing your training data set versus the training itself and the inference and so forth. So what does a training workload look like? And what is the benchmark asking the storage subsystem to do? So let me zoom out to the bigger picture just to set some context here. A person wants to make use of AI. They've got a problem statement. They have some data. They then need to start putting it together into what's called a pipeline. That starts with the raw data. Generally, it's, you know, it's video or it's still pictures or it's audio or text. They do some data preparation, which is changing the
Starting point is 00:06:32 format of the data. They take the picture, the image, and they turn it into a numerical representation instead of an image proper. It's not a JPEG or a PNG any longer. It's a NumPy array. Don't worry about what that means. We'll talk about that later. So there's a bunch of data preparation and then the training step, which involves that's the GPUs. You train the neural network using that data. Then when that's done, it goes into inference where you say, okay, I now have
Starting point is 00:07:07 a neural network that can tell me cats versus dogs. I show a picture: is this a cat or a dog? What we do in the Storage Working Group is benchmark the performance of storage during training, during that phase of the overall workflow pipeline. We're working on adding the data preparation. There's a bunch of cleaning and other kinds of steps that happen there. We're working on bringing that in, but right now we're starting with the simple, you know, the meat and potatoes, if you will,
Starting point is 00:07:35 of the workflow, which is the training step. It's very data intensive. And so it puts a large stress on the storage. Is the decision to start with the training step, and it sounds like move into data prep next, because those were sort of the low-hanging fruit, the easy ones to implement first? Or are those the stages of the pipeline where you're seeing the greatest storage sensitivity? Can you kind of walk us through the rationale there? In one sense, it is that training is easier than data preparation, because data prep is sort of unique to every different application.
Starting point is 00:08:16 It's hard to develop a benchmark when there are 4,000 different ways to do something. So there is that. But one of the key characteristics of the benchmark is that we measure how well the storage system performs not on the traditional storage metrics of megabytes per second and files per second; we measure how completely the GPU can stay utilized. If the GPU ever starves for data, that's a problem: the latest NVIDIA GPUs, the H100s, are like $40,000 apiece. You don't want that thing going idle because it's starved for data.
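To make that idea concrete, here is a minimal sketch, with made-up timing numbers and a hypothetical function name, of how an accelerator-utilization figure can be derived from per-batch timings. It illustrates the concept only; it is not the actual MLPerf Storage code.

```python
# Illustrative only: accelerator utilization as "fraction of wall-clock time
# the accelerator spends computing rather than waiting on storage for data."

def accelerator_utilization(compute_seconds, io_wait_seconds):
    busy = sum(compute_seconds)      # time the (emulated) GPU spent working
    stalled = sum(io_wait_seconds)   # time it sat idle waiting for batches
    total = busy + stalled
    return busy / total if total else 0.0

# Hypothetical per-batch timings for 1,000 training steps:
compute = [0.050] * 1000         # 50 ms of GPU work per batch
fast_storage = [0.002] * 1000    # 2 ms stall per batch
slow_storage = [0.030] * 1000    # 30 ms stall per batch

print(accelerator_utilization(compute, fast_storage))   # ~0.96
print(accelerator_utilization(compute, slow_storage))   # ~0.63
```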
Starting point is 00:09:00 And so we measure accelerator utilization as the core value of our benchmark. And can this storage product or solution keep up with this number of GPUs doing this particular workload? An image recognition workload is different from a recommender, which is different from a large language model. There's many different types of neural network models. And so they each impose a different workload on the storage. So we measure them all separately. But yeah, we're measuring, can it keep the beast fed? Can it keep the GPU busy with data coming in? That's sort of the core metric of the benchmark. It's a super interesting choice, and I think for folks who are used to benchmarks that output megabytes per second or some sort of
Starting point is 00:09:55 calculated score, it'll be a very different look at performance, right? So, about the outputs of the test: let's say you stick a storage device or an array in the test, you run it, and it says, oh, this storage subsystem can keep X number of GPUs highly utilized; swap in another one, and that other option can keep Y GPUs utilized. Can that be used to make relative judgments about the suitability of storage solutions, or the performance of one against another, in an AI workload? I should say up front that the core metric is accelerator utilization, which is how you know whether the GPU goes idle or not. But you can turn that into the traditional measures like megabytes per second and IOPS
Starting point is 00:10:51 and all the rest of that. That information is available. But the benchmark says if you can't keep the device 90% utilized, then you're overloading it, and you need to run the benchmark again with a smaller number of simulated GPUs. So that's the thing that people are going to look at. The person who's going to look at the results is an AI practitioner who says, oh, I know how much data I've got. I know what type of workload I'm running. I want to know, does this vendor have a storage product that will serve my needs?
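As a rough sketch of that pass/fail rule, a run can be treated as valid only if utilization stays at or above the threshold, and the headline result is the largest simulated-accelerator count that still passes. The run_benchmark callable below is a hypothetical stand-in, not a real MLPerf Storage API.

```python
# Sketch of the "largest accelerator count that still passes" idea.
# run_benchmark is a hypothetical stand-in that would execute the emulated
# workload and return the measured accelerator utilization for that count.

UTILIZATION_THRESHOLD = 0.90  # the 90% bar mentioned in the discussion

def max_supported_accelerators(run_benchmark, candidate_counts):
    best = 0
    for count in sorted(candidate_counts):
        if run_benchmark(num_accelerators=count) >= UTILIZATION_THRESHOLD:
            best = count          # this count still keeps the GPUs fed
        else:
            break                 # storage is now the bottleneck; stop scaling up
    return best

# Example with fake measurements standing in for real runs:
fake_au = {10: 0.97, 20: 0.94, 40: 0.91, 80: 0.82}
print(max_supported_accelerators(lambda num_accelerators: fake_au[num_accelerators],
                                 [10, 20, 40, 80]))   # -> 40
```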
Starting point is 00:11:36 Or how big of a product from that vendor do I need to purchase in order to serve my needs? And so the practitioner also knows how many accelerators of that type they have. They have a budget from their management that says, oh, you can buy a hundred H100 GPUs from NVIDIA. Yeah, 400 grand, or no, $4 million; that's a lot. So they know those things: how much data, how many accelerators. And they want to know, how much storage do I need to buy
Starting point is 00:12:09 and that it will actually keep up with the accelerator count. So what people vary is the count of accelerators that this particular configuration can support. A larger config can support more accelerators. Does that make sense? There's a bunch of things going on, but the practicality, as Stephen said, is: what do I need to buy to keep my GPUs busy? And that's what we talked about last season on Utilizing Tech. We talked to a lot of companies in the AI space, and it really boils down to that. I mean, that's
Starting point is 00:12:42 the whole ball game. Basically you're spending a huge amount of money. You said like $5 million. That's a cheap infrastructure. You're spending a huge amount of money on very expensive GPUs or, you know, ML processing ASICs. And you need to keep those things fed in order to make the most of that investment. That's the thing that matters here, because if those expensive items are waiting for data, then they're not producing results, then they're not actually giving you what you bought. And that's it. And so the question for the practitioner, the person that's deploying these applications, that is speccing these things out, they don't need to know how many 4K IOPS this system can handle, theoretically, right? At queue depth of eight. You know what I mean? They don't even know what that means.
Starting point is 00:13:36 What they need to know is what you're saying, which is, I bought this many of this type, and I've got this much data. Will it work? Yes, no, right? And that's kind of the answer you're trying to give them. Yes. Oh, that's it. The storage industry wants to know their traditional kinds of numbers. I'm a storage guy, so I can say this, right? I want to know IOPS. I want to know megabytes per second. But the AI practitioners, they think in terms of samples per second. And in a distributed training environment, how many accelerators do I have? I use accelerator,
Starting point is 00:14:12 but because I try to be non-partisan, it's really GPUs. NVIDIA is the dominant player in the market. So how many GPUs do I have? And so they think in those terms; we try to bridge the two sets of terminology together. Curtis, you mentioned a minute ago that the test relies on emulated accelerators, right? Which I have to imagine is very attractive to a lot of folks, you know, that they can run this test without the need for a rack full of hardware costing tens or hundreds of thousands of dollars. Can you talk a little bit more about whether that was a deliberate choice to kind of open up the tool to a wider audience? And are there dials in there to try out different emulated accelerators when you're running your tests? Great question. It was an explicit decision that
Starting point is 00:15:05 we made early on. Most of the storage vendors, and not just vendors, because we also support academic research and open source and other potential solutions, none of those people have the budget to go out and buy a hundred of the latest accelerators, or to try it with other vendors besides NVIDIA. So we needed to emulate the operation of one of these accelerators. You dedicate, say, 10 compute nodes and your storage product or storage solution, and you'll run 10 or 20 emulated accelerators on each of those nodes, because the only thing we're doing is imposing the same reads and writes from the storage to those nodes. We're not doing anything with the data. We're not training a neural network model. We're just imposing the workload on the
Starting point is 00:16:03 storage solution. We, of course, had to start with NVIDIA because they're the dominant player in the marketplace, but we are planning on pulling in the other vendors, the startups and the Graphcores, the server-class kinds of machines, Tenstorrent, for example, to bring them in as well, so that a customer that wants to purchase their accelerator instead of NVIDIA's can say, okay, here's the storage that I need to support that configuration and that hardware.
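Here is a simplified sketch of what emulating an accelerator means in practice: the client issues the same reads a real training job would, then sleeps for the time the GPU would have spent computing instead of actually training anything. The file paths, batch size, and timings are illustrative assumptions, and a real harness overlaps reads with compute via prefetching rather than running them strictly in sequence.

```python
# Illustrative emulated accelerator: impose a training-like read pattern on
# the storage under test without doing any real training. No GPU required.
import time

def emulated_accelerator(sample_paths, batch_size, compute_seconds_per_batch):
    busy = 0.0      # time "spent" in simulated GPU compute
    stalled = 0.0   # time spent waiting on storage reads
    batch = []
    for path in sample_paths:
        start = time.monotonic()
        with open(path, "rb") as f:   # the only real work: reading the sample
            batch.append(f.read())
        stalled += time.monotonic() - start
        if len(batch) == batch_size:
            time.sleep(compute_seconds_per_batch)   # stand-in for the GPU's work
            busy += compute_seconds_per_batch
            batch.clear()
    return busy / (busy + stalled) if (busy + stalled) else 0.0
```

Running ten or twenty clients like this on each compute node, as described above, scales the load on the storage system without any accelerator hardware at all.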
Starting point is 00:16:34 Yeah, and that reflects what we're seeing overall in the industry. I mean, first off, what we're seeing is that NVIDIA is obviously the dominant supplier right now. So it makes sense to start there. But we are definitely hearing a lot of interest in alternative solutions, whether it's other GPUs or, as I said, ASICs, various neural network processors, and even, as you mentioned, CPUs. There certainly is a lot of excitement about CPUs that have more and more capability. And again, this kind of gets back to the question of MLPerf and the mission of MLPerf: it's not about the biggest number. It's about the most appropriate solution for the task at hand. And if the task at hand doesn't need absolute maximum performance, I think that what you'll find is that the set of things that customers are looking for is varied. So they're not going
Starting point is 00:17:34 to be, you know, if they don't need all the performance in the universe, then they're going to start thinking about things like efficiency and cooling and, you know, environmental impact and, you know, literally physical space. They're going to, of course, be looking at price. Are these things that the MLPerf or the ML Commons Storage Working Group are going to be addressing as well? Yes. I personally would love to include dollars per something in there, but that's a delicate subject for a lot of participants in the storage business, right? And we also attempt to address open source, where there are no dollars. I mean, there's dollars for hardware but
Starting point is 00:18:19 not for software, right? And academic research, academic institutions, where researchers are trying to figure out how best to modify the frameworks, the PyTorch, the TensorFlow, MXNet, to do better I/O patterns to match the capabilities of the storage system. So, because I'm a storage person, I'm used to operating in dollars per something. But that's a much further-out topic. Right now it's crawl, walk, run, and so we're emulating the workloads on the accelerators. Then probably the next most important thing is to bring in the data preparation. Facebook, Meta, did a study, actually a report on their internal infrastructure for their AI training stack, about four years ago now. And they said they spend about 50% of the total kilowatt-hours of electricity on data preparation, not on training. And so that's probably the second thing we'll tackle: trying to model that and the workload it imposes
Starting point is 00:19:25 on storage, and then how different architectures give you different results. Data preparation is generally CPU-bound; they don't use the accelerator for the data prep. But they're, you know, they're starting to talk about it, and NVIDIA is moving in that direction a little bit. I'm not sure about the other players in the market. So there's a lot of complexity, and we're going to keep growing the core base of the working group to handle more of the storage component. So I should throw in one more thing there, another big-picture comment about ML Commons. There are like five different types of things
Starting point is 00:20:06 in the world of AI. There's data, models, accelerators, storage, and networking. You need some of each of those in order to get the value out of AI. ML Commons started with models and accelerators. About two years ago now, two and a half years ago, they added storage to cover all of it. Curtis, so the roadmap for the MLPerf Storage test, you've laid out a little bit for us, right? We're focused on training today, data prep tomorrow. I'm curious, if someone wanted to explore or evaluate the suitability of a storage subsystem for inference, for example, would running the higher-level, you know, MLPerf suite, the non-storage-specific tests, tell a person anything about storage? Are they sensitive to changes in storage subsystems from, you know, one run of the test to another, or
Starting point is 00:21:14 is that something that doesn't tend to show up on the higher-level, kind of system-level MLPerf testing? If you take MLPerf Training, that's the benchmark for the performance of a piece of silicon when it's running a training task. That's generally compute bound. And maybe more specifically, the people who run that ensure that it is compute bound. They don't want a storage subsystem slowing the performance of the benchmark of their new silicon, right? And so the numbers you see in the other benchmarks at ML Commons won't include any impact from storage because, you know, people running the benchmark don't want that. And so in that sense, they're all sort of disjoint. But there is something we're attempting to do in the storage working group, which is
Starting point is 00:22:14 MLPerf Training will define a workload, like Unet3D, you know, that's a three-dimensional volume classification benchmark. They'll define that workload, and we will run the same workload to say, oh, if you're running MLPerf Training on that workload, here is the corresponding information for the storage subsystem. We think that that has value. We need to keep track of what the other working groups are doing in order to correlate the results that way. MLPerf also has a bunch of inferencing benchmarks. Well, first off, does storage have much of an impact on those? And is there an applicability in the future where we'll see MLPerf Storage in that area? We don't yet see a huge impact from inference. Generally, in terms of the number of inference operations that are
Starting point is 00:23:06 done globally, they're almost all done at the edge, on your phone, basically, or in some point-of-sale terminal or something like that. The storage in that environment, a single SSD, is overkill for a single inference operation, right? But an SSD can be strained by a really large training operation. And so that's why we focused on the training piece of it. It's just a lot more storage intensive. There will be points where storage will have an impact on inference, but they're sort of lower down the priority stack for us at the moment. So given the fact that, as you mentioned, data preparation is such an important thing,
Starting point is 00:23:52 are there standard data preparation processes or workflows that customers are going to need to go through in order to be ready to do training? Is it as straightforward as some of the MLPerf Training benchmarks, or is it a little bit different? What we have seen is that data preparation is pretty much unique to every application at every individual customer. There are lots of different types of data preparation.
Starting point is 00:24:24 There are sort of subclassifications, if you will. You could take an image and remove noise from it, you know, pixel noise, right? That's one type of data preparation. In image processing, there are others: take an image and rotate it 15 degrees, or change the color palette, or keystone it a little bit. There are a lot of different things you can do at the image manipulation level that, in effect, multiply the amount of data you can hand to your neural network model during training. That multiplication factor, that's data preparation, but it's unique.
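As a rough sketch of the image manipulations Curtis describes, the snippet below decodes an image into a NumPy array, the numerical representation mentioned earlier, and applies a couple of augmentations so that one source image yields several training samples. The file name, rotation range, and jitter settings are illustrative assumptions, not a prescription.

```python
# Illustrative data preparation: decode an image into an array, then augment
# it to multiply the effective amount of training data.
import numpy as np
from PIL import Image
from torchvision import transforms

img = Image.open("example.jpg").convert("RGB")   # hypothetical input file
array = np.asarray(img)                          # no longer a JPEG: an HxWx3 array

# Random augmentations such as small rotations or color jitter; only safe
# when they don't destroy the signal the model needs to learn.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

variants = [np.asarray(augment(img)) for _ in range(4)]   # four samples from one image
```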
Starting point is 00:25:07 Sometimes you don't want to do rotation. Sometimes you don't want to change the color palette because the color is important. You wouldn't do that when you're looking at a stoplight, for example. So no, we haven't found one; we're in the process of researching that question because it's very important. But we haven't yet found any taxonomy that we can describe of, here are the classes of data prep that could be done.
Starting point is 00:25:32 Well, thanks so much for that. I think that's really interesting, and it sounds accurate to me, because I've certainly seen that that's how it is. Everyone's basically bringing in different types of data from different sources. But even so, I think that you'll be able to come up with something. Tell us a little bit more about the nuts and bolts here. MLPerf Storage, just like the rest of the MLPerf benchmarks, happens on a regular cadence. What does that look like? So where are you now, and where does that go, and when will we see the next round of results? So I welcome anybody viewing the podcast to come join us in the working group. It's mlcommons.org, and you look for Storage Working Group, and there's a link there that says Join. So you can join the working group and show up and help guide the project. We are attempting to have two new releases per year
Starting point is 00:26:44 of the benchmark, one in the spring and one in the fall. It's a struggle to get it all buttoned up nice and tidy every time. But we've got another couple of weeks, two weeks or so. So mid-May, probably, we will announce that version 1.0 is ready to be run. Two months later, there will actually be an open window where you get to submit results. And then afterward, the results go through a peer review process, which is private to the people who submit it,
Starting point is 00:27:15 in order to make sure that the benchmarks were run correctly and all of the I's are dotted and T's crossed and all the rest of that. And then the results are published. So we're talking about three months from today, the results will pop out. And then we'll do the same thing again in the fall. Well, that's excellent. I can't wait to see it. I will tell you that I really look forward to the briefings before the results come out. I really look forward to combing through the results and seeing some of the stuff. I mean, ML Commons, in addition to storage,
Starting point is 00:27:47 also benchmarks a lot of other areas. And one of my personal favorites is the tiny and the mobile and the edge benchmarks that they're doing to show how, not the big data center full of GPUs, but all these other systems perform. Very, very relevant and very interesting as well. So I'll definitely be keeping an eye on that, as well as, of course, the storage benchmarks coming out of it.
Starting point is 00:28:12 A lot of bragging rights for a lot of different companies and a lot of different solutions. And I think that's another thing that speaks to the vibrancy of the storage industry overall. We've got great solutions from a lot of different sources, whether it's open source or proprietary companies, and they're all able to support various workloads. So it's very neat to see that answer coming out of MLPerf Storage as well. Well, thank you so much for joining us. Before we go, Curtis, where can we continue this conversation with you and with ML Commons? The best way is to join the working group. storage@mlcommons.org is the email address. You can send email to it, I believe, from outside the working group.
Starting point is 00:29:01 But there's lots of documents and presentations and things that the working group gets to see. So join. Great. Thank you so much. Ace, thanks for joining me as the co-host today. Yeah, absolutely. Thanks a lot, Stephen. And thank you, Curtis. I very much enjoyed the conversation here. I expect the question of infrastructure efficiency
Starting point is 00:29:21 and keeping your GPUs maximally used is going to be relevant for quite a while as we look to the future of AI. And so having better tools to measure that and make informed decisions in service of that goal is going to be really important. So very excited by the work you're doing and appreciate your time. And of course, I know that there's a lot of Solidigm storage in those submitted results, too. But, you know, it's nice to see that, too, Ace. And thank you, everyone, for listening to this episode of Utilizing Tech, focused on AI data infrastructure. You can find this podcast in your favorite podcast application. You'll also find us
Starting point is 00:30:05 on YouTube. Just search for Utilizing Tech or Utilizing AI Data Infrastructure. If you enjoyed this discussion, please do leave us a rating, leave us a nice review. We'd love to hear from you as well. This podcast was brought to you by Solidigm, as well as by Tech Field Day, home of IT experts from across the enterprise, now part of the Futurum Group. For show notes and more episodes, head over to our dedicated website, which is utilizingtech.com, or find us on X/Twitter and Mastodon, yes, Mastodon, at Utilizing Tech. Thank you very much for joining us, and we will see you next week.
