Grey Beards on Systems - 151: GreyBeards talk AI (ML) performance benchmarks with David Kanter, Exec. Dir. MLCommons
Episode Date: July 7, 2023. Ray's known David Kanter (@TheKanter), Executive Director, MLCommons, for quite a while now and has been reporting on MLCommons MLPerf AI benchmark results for even longer. MLCommons releases new benchmark results each quarter, and this last week they released new Data Center Training (v3.0) and new Tiny Inferencing (v1.1) results. So, the GreyBeards thought it was …
Transcript
Hey everybody, Ray Lucchesi here.
Jason Collier here.
Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where
we get Greybeards bloggers together with storage and system vendors to discuss upcoming products,
technologies, and trends affecting the data center today.
We have with us today David Kanter, Executive Director of MLCommons, the organization responsible for MLPerf.
So David, why don't you tell us a little bit about yourself and what's new with MLPerf?
Absolutely. Well, thank you for having me on the show. It's a pleasure to be here.
So I've been around the industry for a while. You know, I got started in the early 2000s running Real World Tech, my website, and started
a processor company, actually, you know, in like 2007, 8, 9, and then wound that down.
And I ended up, you know, consulting with a lot of folks in the industry,
you know, Intel, AMD, NVIDIA, so forth. And then I ended up doing some work for this company,
Cerebras Systems. I was doing some patent work for them. And of course, they do the,
you know, wafer size chip for ML. And then I started showing up to these MLPerf meetings.
And, you know, eventually, I like to joke, I started as the secretary,
and then, you know, worked my way up to becoming CEO, which is kind of, you know, a real rags to riches story,
I suppose. But no, when MLPerf training started, I was, you know, there in sort of more of a supporting role. And then I was asked to lead MLPerf inference, along with my colleagues, Vijay and Christine Cheng from Intel and Carole-Jean Wu from Facebook. And Peter Mattson from Google as well helped us a tremendous amount. And so we
landed that. And then I sort of went on to do power measurement for inference and then really
became CEO, just sort of supporting the whole effort, technically executive director.
Okay.
So now it's kind of exciting to see the baby all grown up.
Well, it's come a long way, I'll say that much.
You want to tell us a little bit about what MLPerf does before we get in?
Is this kind of a storage-enterprise sort of audience?
Yeah, absolutely.
So, you know, the story of how MLPerf started is actually, you know, kind of an interesting one.
You know, most folks in the enterprise are used to benchmarking. But, you know, way back in 2017, 2018,
there had been people who had tried to do benchmarking for machine learning,
but nothing had ever really stuck. And so there were some academic efforts that just didn't really get like wide adoption. And so what had happened is a bunch of folks got together and really said, hey, you know, machine learning is really important. It's going to be one of the next major workloads. You know, again, this was back in 2018. And we need a way to measure performance, to measure capacity, to help drive the whole industry forward. And so, you know, everyone really came together for this. You know, we've got 50 member companies now. And, you know, it started out with, you know, three folks from academia and three companies and really grew quite fast,
in part because everyone recognized that, you know, what I like to say is that,
you know, benchmarks are both a barometer on progress, but they also help get everyone to agree what it means to be better.
Before ML Perf, when people were trying to sell or buy ML solutions, it's almost like buying cars.
And one guy goes up to you, and he's from Ferrari, and he says, I got something that'll go 200 miles an hour.
And then the next guy comes with a Chrysler minivan and says, well, I've got something that'll take the whole soccer team and it's really safe. And then, you know, the next person comes in and
they've got an electric car and they say, mine's great for the environment. And, you know, those
things are all true, but they aren't, you know, comparable. Right. And what
do you really want? Well, it depends on where you live. You know, I'm in San Francisco. So, you know, a Ferrari
on the streets looks great, but, you know, I'm never going more than 40 miles an hour. Well,
unless I'm going to the South Bay, and then we'll talk about how fast I'm driving. But
part of the point is getting everyone to agree on what it means to have a better ML solution.
Because the goal of ML Commons is really making machine learning better for everyone.
And that's part of the point of benchmarks and measures.
And so that was really what brought us all together.
And we started with ML Perf Training, which is what brings us here today.
So we were at a recent pre-briefing on what's new with MLPerf.
And one of the things I came away with from that pre-briefing was that you're now doing GPT types of stuff.
You want to talk a little bit about what you're doing there?
Because, I mean, you had NLP kind of work, BERT kind of stuff, but GPT is a step beyond that considerably,
right? That's true. So, you know, one of the biggest challenges for ML, and we knew this
starting out, is that the workloads change really rapidly. And when we started, just to give you an
example, it wasn't actually all that obvious that Transformers and that BERT were going to be the dominant machine learning motif in some ways,
right, which has really come to pass. We thought it was heading in that direction.
But at the first benchmark, we actually had two translation tasks, one using a recurrent neural
network, which is sort of, you know, an older technique that's fallen a little bit out of favor for that task. And then we had one that was using transformers. But so from the
get-go, we've always wanted to be like really flexible, really adaptive and picking the right
balance between not being too cutting edge, but being on the leading edge. And so, you know, as you point out, we have had the BERT benchmark,
which is a smaller language model, for quite a while. You know, I think we introduced that in
1.0. I'd have to go double check my notes to make sure on that, but I believe that's correct.
And, you know, we always had our eye on GPT-3 and these
really large language models, but we needed to be sure that it was commercially
relevant, because the principle for all of our benchmarks is it's got to be fair, it's got to
be generalizable, and it has to be something that is representative of what people are actually doing, because it's a lot of investment to do the benchmarking, both to develop the benchmark and to run it.
And we want to make sure that it's a good investment for the whole industry.
And so, you know, what you see in these benchmark results is we added a benchmark for GPT-3.
It's not the whole thing.
We're running
a portion of GPT-3. But it definitely gives you a sense of the flavor of the performance and of
the scale. And you're right. Like for most MLPerf benchmarks, you can actually run them
on a variety of systems ranging from, you know, we've seen submissions as small as a
single processor to as large as 500 plus accelerators, right? Even a thousand, I think.
Yeah. Like 4,000, I think, is the largest. And actually in MLPerf HPC, there's,
like, half of Fugaku, you know, a top-five supercomputer in the world, that was used for that.
But in MLPerf training, it's generally up to about 4,000.
And so the minimum recommendation that we have is 64 processors in the context of GPT-3.
So it's a pretty big benchmark.
But a lot of the excitement around generative AI
is focused on these large language models. Let me guess that the real interest in that
benchmark probably started around the first of the year when GPT was getting a lot of press.
Oh, no, no, no, no. I mean, we actually started working on this long before that.
But I mean, every day, every headline,
you know, we had to make a guess, and it's nice to know that we guessed right.
Right. Yeah. So training, I don't know, it's 75 billion or 175 billion parameters in GPT-3. I mean,
doing something like that has got to require a lot of time and effort, even if you decide to do only a small,
modest portion of it, right? I mean, you're still dealing with the full model, right? Is that true?
Yeah, no, this is the full model. I mean, you can go
and grab it, you know, in our GitHub repository. We have the model that we used; it's, you know,
initialized, it's not trained. And yeah, it's the full thing, 100%.
But yeah, as you mentioned, we are only training for a portion of the data set,
because, again, you know, we wanted run time to be reasonable. You
know, I think when GPT-3 was first trained, I think in the original OpenAI
paper, they said it was, you know, about a $10 million exercise. For a benchmark, that's a bit much, don't you think? Well, so our LLM benchmark using GPT-3 is 0.4%
of the full thing. So David, I mean, my understanding of the GPT-3 benchmarking is
that you're not starting from scratch, but you're starting from sort of a checkpoint. Do you want to
describe what all that means? Yeah, absolutely. So, you know,
as we were talking before, we don't want to run the whole benchmark because, you know, the original
GPT-3 was a $10 million exercise. And to be accurate, we're going to be running it a few
times, three times. So, you know, for us, what we did is, well, a checkpoint is exactly what it sounds like.
It's actually an operation you normally do during training every so often.
You know, like if you play video games, it's like your save and restore point. So what we did is we trained from the start on about 12.5 billion tokens. And for those who aren't intimately
familiar, the rule of thumb is, you know, each token is about 0.7 of a word. And so we go through from the start, we do that training,
we save and take a checkpoint. And then the benchmark starts from the checkpoint and measures
training on roughly 1.3 billion tokens until you get to the target accuracy. And one of the key things about
the MLPerf benchmark that's actually critically important is that we run to that target accuracy
because that helps us correctly get trade-offs around things like numerical precision and different algorithms
that might make your compute more efficient, but might make your time to get to accuracy a little
bit slower, potentially.
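To make the shape of that concrete, here is a rough, purely illustrative Python sketch of how a run-to-target-accuracy LLM benchmark like this is structured, based only on David's description. The checkpoint file name, the chunk size, the 10.0 perplexity target, and the train/eval callables are hypothetical placeholders, not the actual MLPerf reference implementation.

```python
import time

WARMUP_TOKENS = 12_500_000_000    # trained once, up front, to produce the shared checkpoint
MEASURED_TOKENS = 1_300_000_000   # roughly what a timed run trains on
WORDS_PER_TOKEN = 0.7             # rule of thumb quoted in the episode
TARGET_PERPLEXITY = 10.0          # placeholder; the real target is fixed in the benchmark rules

def run_llm_benchmark(model, data, train_on_tokens, eval_perplexity):
    """Time how long it takes to train from the shared checkpoint down to the
    target perplexity. All four arguments are hypothetical callables/objects."""
    model.load_checkpoint("gpt3_after_warmup.ckpt")   # everyone starts from the same saved state

    start = time.perf_counter()
    while eval_perplexity(model, data) > TARGET_PERPLEXITY:
        train_on_tokens(model, data, tokens=50_000_000)   # train a chunk, then re-check accuracy
    return time.perf_counter() - start                    # score = time to reach target accuracy

# For scale: WARMUP_TOKENS * WORDS_PER_TOKEN is roughly 8.75 billion words of warm-up text,
# and MEASURED_TOKENS is roughly 0.4% of the ~300-billion-token original GPT-3 training run.
```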
And so in that case, I mean, GPT is a generative model, so it's like, what's the next word after this sentence, or something like that? But this is actually what's called pre-training.
So a very common thing for these language models, like let's take a step back and say,
what we want to do at the end of the day is we want to do sentiment analysis or maybe build a
chat bot. I'm kind of tired of chat bots, so I'm going to pick sentiment analysis.
So we're going to look through a document.
We're going to look through this transcript and see,
were Ray and Jason saying good things about David and MLPerf?
And the answer, of course, is going to be yes.
But so what you would do is you would pre-train your language model
to be good at having an understanding of the relationship between words, between parts of a sentence, between what those words might mean.
And then you would have to specialize that to actually do the sentence analysis, the sentiment analysis, read through the document and say, okay, overall, that was good. Or this
paragraph was good. And then this sentence, they said some bad stuff. So we're just doing the
pre-training, which is you start with this language model that knows nothing. And then you train it to
understand the relationships between words, between sentences, between structures there
to understand, you know, as I like to joke,
that if I start a sentence with why did the chicken cross the blank,
you know, not a hundred percent of the time is the answer road, but most of the time it is.
And so teaching those relationships is what we're doing in this benchmark. And so the target accuracy is something called perplexity, which
has to do with sort of a general understanding of the relationships across the vocabulary.
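In case it helps, perplexity is just the exponential of the average per-token negative log-likelihood; a tiny illustration with made-up token probabilities:

```python
import math

# Probabilities the model assigned to each actual next token in a held-out
# sequence (made-up numbers, just to show the arithmetic).
token_probs = [0.25, 0.10, 0.60, 0.05, 0.33]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity ≈ {perplexity:.2f}")  # lower means the model is less 'surprised' by the text
```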
Interesting. Now, when we go and build an inference benchmark, which, of course,
we're going to do, we will have a specific
application in mind, where it'll be something a little bit more concrete. I realize that was
somewhat of an abstract notion of accuracy. Yeah, but I mean, more generally applicable,
I would say, because of that, right? Yeah, yeah. And I mean, that's exactly right. And this is one of the things
that makes it very general purpose is that understanding of language, understanding of
sentence structure of that when I refer to that idea that I talked about earlier, knowing even
what the candidates for that idea might be, right? Like that sentence is pretty indistinct, but, you know, if we were talking
about the idea of large language models, you know, you might be able to pick that up from the overall
document. Now, you mentioned a couple of things here. So MLPerf, to a large extent, from a
training perspective is driving towards an accuracy level. Is that correct? Right. So for any training task we have,
whether it's the large language model, whether it's image recognition, whether it's recommendation,
all of those things are trained to a target accuracy that we pick to be reasonably representative.
So are we at 90%, 99% kinds of numbers? It's not like it is, you know, for, like,
vision recognition and things of that nature, right? Well, for our vision task, for our image
recognizer, it's like 75.9%, but, you know, again, given the data set, given the network,
it's actually pretty good. I mean, we're not out here trying to go to uninteresting accuracies.
And the other thing, so here I am, let's say I'm, I don't know, Joe Blow's accelerator for AI,
and I want to run your benchmark, and I'm a member of ML Commons and stuff like that.
And so do I take your model, or can I create my own model based on your model?
There's something about mathematically equivalent.
The right words, right?
Yes.
No, your memory is perfect.
Please.
Right.
So the issue is, and this kind of comes back to level playing field and fairness, which is, you know, the reality is
you can run machine learning and you can run MLPerf on a variety of different systems. And we
see this all the time, right? We have people submitting on CPUs, on accelerators, on multiple
accelerators, and on accelerators that look very different from each
other, right? And so, you know, to take a few examples of folks who have submitted, right?
NVIDIA submits on their GPUs. Habana Labs submits on the Gaudi accelerators. GraphCore has submitted on accelerators, and every one of those is different.
And in order to allow folks to be neutral, we need to allow them to change the model
to suit the accelerator to a certain extent, right? So just as a simple example, right,
each accelerator is going to have
a certain amount of memory. And so if we had baked something into the model that said, oh, you know,
it only uses 32 gigabytes of memory, well, that would disadvantage folks who have less,
and it would waste memory on those who have more. So we need to allow a certain degree of flexibility.
And so that is this
mathematical equivalency where we say, hey, we want you to run GPT-3, but then we have an elaborate
set of rules that say you can change the network in the following ways. And, you know, so like an
example of something that's allowed is, you know, if you want to use a numerical approximation instead of an exact
computation, that's fine. You know, so like for folks who play video games, like, you know,
there's this inverse square root trick that was really famous in Quake, you know, something like that's allowed, right? And part of that is because we specify
a target accuracy, we know that you're not going to monkey around so bad that you break it.
But that allows that all of our different submitters can tune the network such that
it works for their system and is fair, is solving the right problem to a relevant accuracy.
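As an aside, the Quake-era trick David mentions is a good illustration of the kind of numerical approximation the accuracy target can tolerate. Here's a Python rendition of that classic bit hack; the function name and example value are ours, and this is illustrative only, not anything from the MLPerf rules or reference code.

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) using the famous Quake III bit trick."""
    # Reinterpret the 32-bit float's bits as an integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The magic constant gives a surprisingly good first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines it to within a fraction of a percent.
    return y * (1.5 - 0.5 * x * y * y)

print(fast_inv_sqrt(4.0))   # ≈ 0.499, versus the exact 0.5
```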
Right.
Right.
But they're not departing so far that it is impossible to really make an apples to apples comparison.
And that's in the closed division.
Now, in the open division.
Different world.
Right.
There, you know, for example, if you want to use a totally different model, you can, if you want to cut out layers, you can. And that's more meant to showcase,
you know, higher degrees of innovation. Yeah. Does that make sense? Yeah. Yeah. Yeah. So when
you say mathematically equivalent, the number of layers, the number of nodes per layer,
all those sorts of things have to be the same.
But let's say the accuracy or the arithmetic or the resolution, rather, could be different as long as you achieve whatever the target accuracy needs to be.
Yeah, not the – well, let's take numerics as an example.
Right.
Right. Right. Like, so, you know, one of the things we've seen as a trend in the industry is, you know, way back when, everyone started using FP32.
Right.
And that's, you know, a lot of our reference models are in FP32.
But as an organization, our view is, you know, use whatever, you know, numerical format you like.
Right.
There's no right or wrong answer there as long as you get sufficient accuracy.
So if one company wants to use BF16 and one company wants to use FP16, that's fine.
We think those are all equally good if you hit the target accuracy.
And so that's part of being fair and flexible. And so the models originally were, like, floating
point 64, is that right? A lot of the references are in FP32. And, you know, submissions are all over
the place. Although, you know, you get points for
being faster. So most folks are highly incentivized to use, you know, the most performant
numerical precision that they can, but sometimes it can come with accuracy trade-offs.
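A quick illustration of what those format choices mean in practice, using PyTorch only for its dtype support; the specific value is arbitrary and this is not MLPerf code, just the same number rounded into the three formats submitters commonly use.

```python
import torch

x = 0.1234567891

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    rounded = torch.tensor(x, dtype=dtype).to(torch.float64).item()
    # bfloat16 keeps FP32's exponent range but only ~3 significant decimal digits;
    # float16 keeps more mantissa bits but a much narrower exponent range.
    print(f"{dtype}: {rounded!r}")
```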
Right. Right. And the other thing, besides the GPT-3, was there was a new recommendation engine as well?
Yeah, that's right.
Yeah, and I'm super proud of the team for delivering both of those at the same time.
I mean, it was a lot of work.
So, what about the prior versions of MLPerf?
Now, granted, I've been reporting on MLPerf for quite a while here, so I kind of know some of what's going on and
stuff like that. It seemed like it was a one-terabyte click-through database? That's right.
So our prior recommender was called DLRM, that's the closed reference
network. And it was trained on the Criteo one-terabyte data set, which is a terabyte of, you know, click stream information, right?
What are people clicking on the internet?
It's anonymized.
And we upgraded our recommender to use a much larger data set, a four terabyte data set. And then the other thing that we did that was very important is part of a recommender
is something called an embedding table, where you take, generally speaking, a sort of variable length
bit of information about someone or something, and then you look it up in this embedding table to get a fixed length vector representation of it.
And in our prior model, it was a single hot table.
So you'd only look up once for each instance of doing recommendation. And we changed it to be multi-hot such that, you know, conceptually,
if you think about it, you know, suppose we're doing recommendation for people, you know, each
person in this conceptual recommender would now have four or five items that we've got information
about. So it's almost like a multi-dimensional view versus a single dimensional view of
some attribute. Yeah, that's right. And, well, I mean, the
embedding tables usually have hundreds or thousands of dimensions, but previously only one of those
would have been indicated at a time. Whereas now we say, okay, you know, the folks we're looking
at, we've got four or five or six or whatever it is at a given time. So we've got a little bit more
history.
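Here's a small NumPy sketch of the single-hot versus multi-hot difference David is describing. The table size, vector width, sum pooling, and made-up ids are illustrative assumptions, not the actual DLRM-style reference model.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy embedding table: 1,000 categories, each mapped to a 16-dim vector.
# (Real recommender tables hold millions of rows and wider vectors.)
embedding_table = rng.standard_normal((1000, 16)).astype(np.float32)

# Single-hot: one category index per sample -> one row lookup each.
single_hot_ids = np.array([42, 7, 913])            # one id per sample
single_hot_vecs = embedding_table[single_hot_ids]  # shape (3, 16)

# Multi-hot: several category indices per sample, pooled (here: summed)
# into one fixed-length vector per sample.
multi_hot_ids = [
    [42, 3, 77],        # sample 0 has three associated categories
    [7, 500],           # sample 1 has two
    [913, 12, 4, 250],  # sample 2 has four
]
multi_hot_vecs = np.stack(
    [embedding_table[ids].sum(axis=0) for ids in multi_hot_ids]
)  # shape (3, 16)
```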
So it would tell you the Greybeards are all, like, graybeards
or something like that,
besides being handsome and stuff like that.
That's right.
We want to know that you're not just handsome,
but you are witty
and you are a law-abiding citizen and active.
Yeah, yeah.
So one of the things that has been
sort of a point of contention for me from an MLPerf perspective is that there's never been any really storage-oriented activity. You know, we're Greybeards on Storage, of course.
So storage is sort of almost our middle name. But is there something going on in that space? Yeah, no, that's a great question. And, in fact,
I think, didn't we originally meet at a Storage Field Day?
I think so, David. Yeah. We were talking there and that sort of stuff.
Yeah. So, you know,
if I look at the current suite of benchmarks in the training division,
let's say, it's all kind of localized storage, almost,
like SSDs, NVMe SSDs associated with whatever number of servers you have and those sorts of
things, right? Yeah. So this comes back to the rules and, again, us wanting to make things
easier and cheaper to run. MLPerf training, it's a full system benchmark, right? You're going to stress your networking, all sorts of things. But the rules say that the data starts in node local storage of some form or another. But it's really mostly designed not to test storage. Actually, I think it actually starts in host memory. Maybe we'll have to go back and edit
that. But the, you know, in reality, when you're doing a big training job, right, you're ultimately
your training data is going to sit on some nice permanent storage and you're going to have to
stream it out to your compute nodes and then send it to
memory and then start chewing on it. And so we kind of broke it down into two separate portions.
And as you alluded to, we started with MLPerf training, which was compute oriented. And a while
back, you know, there were some storage people in our community and they said, you know, hey,
we really could use a storage benchmark that is
machine learning oriented. And so we said, all right, let's take sort of the same general concept
and change where the start and stop point is. And so for MLPerf storage, we really wanted to focus
on the storage portion. And so we said, okay, we're going to start with your data at rest in permanent storage.
And then we're going to measure how fast you can get it
into the node's memory.
So you're just looking at storage.
But again, it's the real storage that you're accessing
during a real training job.
So it's dynamically live, accurate, real patterns. And that's actually very, very,
very important. And so I'm really excited. This group, I helped kick it off and it's
been very exciting to see. We're now open for MLPerf storage submission. So if you have a
storage solution and want to test it on an ML workload, you should come knocking on our door.
My storage solutions are all decades old, thank God.
Jason might have something up his sleeve somewhere in his past, I would say.
But so, I mean, you mentioned it starts data at rest.
I mean, so this would be like objects or unstructured data or structured data or all of the above?
Okay, so now you've managed to stump me.
So no, no, no.
So I think, and, you know, of course we're recording, so I am going to get quoted on this.
But I believe if we look at the storage rules, I don't know if we support object storage yet. But, you know, it would
start with, you know, all of your, let's take GPT-3. We don't actually have a storage version
of that yet, because it's so new. But, you know, it would be, let's start with all the training
data in sort of the natural storage form. And this is, like, a scan of the web, right? Yeah, here's
your petabyte of data. Start here and go. That's right. And so it would be,
right. And so the thing that you would do is you'd start with the data
in sort of the natural form for TensorFlow or PyTorch. It's different for those two. Right. And then
you're going to load it out and get that into memory, you know, for, like, an
epoch. So an entire pass through the data. And then, so it's really, it's basically
a measure of ingest. How quickly can you pump that data in so that it can be processed?
That's right. And so one of the things that's nice about ML Perf Storage, though, is we don't give you, you know, we have a very nice mechanism that we did the engineering around to show that you can actually emulate the processing.
Right.
Because like one of the issues is, you know, we're talking about GPT-3, right, I just said, you know, it's 64 accelerators minimum. And, you know, some of the submissions
that we have in MLPerf are using hundreds or thousands of accelerators, right? And so,
you know, you don't want to go buy 4,000 accelerators to do some storage measurement.
That doesn't make sense. And so what we demonstrated is that you could actually emulate that very carefully. So you can actually size large
storage systems without necessarily having to shell out for a huge amount of compute
infrastructure. And that's one of those ways where breaking, you know, this big problem down
into smaller pieces makes it a lot more tractable and sensible. Let me just understand. So what you just said, David, was that, although the data path is, you know,
what it is and is doing what it's supposed to be doing,
the actual training portion of the storage benchmark is like a simulation rather than the actual thing.
In MLPerf Storage, yeah. So what we would do is, you know, let me kind
of narrate the start to finish, right? So when you're actually doing a training job, you tend
to send your data to the compute. The computational resources will want to draw in
a batch of data at a time. And so, you know, a batch might be five images, it might be 10,
it might be 100, whatever it is, but that's kind of how you're requesting the data. And it's
randomly drawn from your pool of data, right? And then what's going to happen is, you know,
so the storage says, okay, I've got your 10 images that were randomly selected, here they are,
and then the
compute's going to go run off and compute on it. And then it's going to want the next batch.
And you're going to go one batch at a time through the data. And in a lot of cases,
you will run for multiple epochs, which means you've done an entire pass through the data. So
for some image networks, you might be doing 5, 10, or 30 passes, 30 epochs through the data.
Now, and what we wanted to do was to capture the fact that it's not just randomly load all the data at once, right?
That's very different than what I described.
So to make it accurate, the MLPerf storage benchmark is randomly grabbing a batch
of data at a time. And then it emulates the compute time of the training on the
batch. Bingo. And then it goes and gets the next batch. And so the point is, you know, you're not
going to really get points for an infinitely fast storage solution, right? Because if your storage solution is oversized for
your compute, it's not super helpful. Are there inputs on that, like compute resource where you
can tell it like, okay, pretend this is, you know, like an NVIDIA DGX, like pretend this is this,
or something like that. Yeah, like it's this specific accelerator
and come back with different compute times on it.
Yeah, no, that's a great, great question.
And that's actually exactly why we did it,
is we wanted to be able to emulate any accelerator or processor.
Oh, nice.
And so what you would need to know
is if you want to emulate,
you know, pick your favorite processor.
Say you want to emulate...
Yeah, right.
What you would need to know is,
okay, what are the batches it wants
and how long does it take to churn through that batch?
And then that drives the emulation portion.
Okay.
Does that make sense?
And so you can play around with that.
So you could actually, and we obviously, because this is part of the overall ML Commons and part of the general effort on ML Perf, we have those numbers for our submissions.
But you could also ask the question of, well, how much storage would I need
if all of a sudden these accelerators magically got 30% faster? Yeah. So, and in this case,
accuracy is not going to be a viable metric, right? For, you know, telling whether the benchmark is
successful or not. So if I've got storage system Ray and I've got storage system Jason,
and I run your
benchmark on, I don't know, vision recognition kinds of stuff. How do I know whether I did the
benchmark right? Yeah. So accuracy was great for training or even for inferencing, but for storage, it's
not going to work because you're emulating that activity. That's exactly right. And, well, the other thing is,
you know, all but the most broken storage system will
read back what you originally wrote, right? In storage land, you get zero points for any loss
of accuracy. Exactly. So, you know, the metric is going to be how many samples per second,
given the requirement that you have to keep those accelerators 90 percent busy.
Right. So we don't allow you to have underutilized accelerators.
And then we look at sort of the throughput you get out of it.
But this is another thing that's a little bit different is, you know, a lot of storage folks
are used to thinking about things and, you know, okay, how many gigabytes per second can I read
out of an SSD, which is fine. I'm not saying that's a bad metric, but we wanted it to be ML
centric. And one of the issues is, just structurally speaking, images are totally different than text, right? You know, a sentence
might be as short as, you know, a single word. So it could be, you know, maybe 40 bytes,
you know, and an image could be a couple megabytes.
So, yeah, you get your random IO versus kind of your streaming IO, right?
Yeah. And so what we're going to realistically get is, you know, the the target is, again, how do you keep those accelerators busy?
And then what's the throughput, you know, you're able to sustain given that you can keep those accelerators busy.
So the analog to accuracy in training would be keeping the accelerators 90% busy.
That's the guide that you have to achieve.
Once you achieve that, then you can publish, let's say, the samples per second, I guess.
That's right.
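Conceptually, the loop looks something like the toy sketch below: read a randomly sampled batch from storage, then sleep for the emulated accelerator step time instead of doing real compute, and report throughput plus how busy the pretend accelerator stayed. The function name and every parameter here are placeholders; this is just an illustration of the idea, not the actual MLPerf Storage harness.

```python
import random
import time

def emulate_training_io(sample_paths, batch_size, step_time_s, epochs=1):
    """Toy MLPerf-Storage-style loop: the storage reads are real, the compute is a sleep."""
    samples, io_time, compute_time = 0, 0.0, 0.0
    for _ in range(epochs):
        order = random.sample(range(len(sample_paths)), len(sample_paths))  # reshuffle each epoch
        for start in range(0, len(order), batch_size):
            batch = [sample_paths[i] for i in order[start:start + batch_size]]
            t0 = time.perf_counter()
            for path in batch:                      # the part actually being benchmarked
                with open(path, "rb") as f:
                    f.read()
            io_time += time.perf_counter() - t0
            time.sleep(step_time_s)                 # stand-in for the accelerator's step time
            compute_time += step_time_s
            samples += len(batch)
    wall = io_time + compute_time
    # The benchmark requires the (emulated) accelerators to stay roughly 90% busy;
    # the reported figure is then throughput in samples per second.
    return samples / wall, compute_time / wall      # (samples/sec, accelerator utilization)
```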
And so you're going to have like a different storage benchmark for vision recognition, object detection, BERT kind of thing.
Is that kind of what you're foreseeing?
Yeah. So long term. So on MLPerf Storage, we're starting with, I believe, 3D U-Net.
Right. And so we've got two of our training benchmarks today,
right, and one is actually doing 3D images, and then the other is doing BERT, right. So that's
text. And so, you know, they're going to have wildly different storage characteristics,
which is part of why we picked them. And ultimately, you know, and it depends on how much engineering effort we want to invest.
But I could see a world where for every MLPerf training and every MLPerf HPC benchmark, there is a corresponding storage benchmark.
But, you know, how we if we get to that point, how we get to that point is to some extent in the hands of the working group.
Right, right, right.
And so as far as ML Commons is concerned, you've got representatives from pretty much all the major server vendors, accelerator vendors, and storage vendors as well?
Well, so MLPerf Storage is currently open to anyone who wants to participate. So even
you, Ray, you can show up and see how good your PC will be. I wouldn't recommend it, but you could.
I've got this crypto mine in the back, but it doesn't have much storage, actually.
Yeah. No. So, you know, the way we've always done our benchmarks is, you know, the first round, maybe two, is open to
everyone, in part because we really want to get as much community input as possible. And, you know,
once we see that we're actually doing something that's really valuable, then we'll say, okay,
you know, you've got to get a membership to participate. And ultimately, you know, my goal
is not to make money.
My goal is to make sure that the community is healthy, that we can do the right engineering,
we can provide all of the support to make it great and a great experience for our member
companies and for consumers and for the whole community. So storage vendors, listen up. It's
your opportunity to go try these puppies
out and see what they look like and stuff like that. That's right. And, you know, some of the
storage vendors have already joined us. And, you know, Nutanix was very early on
involved with and helped to drive this. One of the working group chairs is from Nutanix, and others are from Panasas and Argonne National Labs. And then
another, Oana, is a professor at McGill. Right. And, you know, we have
participants from a lot of storage-related folks. And of course we hope that they will all become
members. Yeah, exactly. Exactly. Huh. We didn't
talk about tiny at all, but there's been some new benchmarking in that space as well. And what
is tiny, and how does it look? I mean, can you just kind of describe what it is?
Small. That was easy. That was too easy. Yeah, all right. Enough with the dad jokes, right? I'm sure everyone in the audience is
mentally wincing at that. Yeah, well, so let me back up and talk about MLPerf overall.
So I like to joke that MLPerf goes from microwatts to megawatts, and that is actually true. As I
mentioned, someone did run MLPerf HPC on Fugaku, which is one of the
world's top supercomputers. And I think it's around a 20 kilowatt, sorry, 20 megawatt machine.
Yeah. Now the tiny is on the other end of the spectrum. So we've got a bunch of different
inference benchmarks. We have sort of inference, which focuses on data center and edge. We've got
mobile inference, which focuses on sort of smartphones and PCs and client platforms.
And then we've got tiny, which is really focused on, you know, IoT embedded devices.
And so the workloads are a lot smaller as often is the hardware.
And so, you know, for MLPerf training and inference and mobile, for example, we're assuming that you've got an operating system.
You know, whereas there's a lot of devices that do machine learning that don't, where it's just like, you know, for example, if you think about
a lot of voice activated devices, there's something that's always on, always waiting
to hear you say that magic phrase.
We call that keyword spotting.
And then it wakes up the rest of the system, right?
You know, almost all of our smartphones have that, right?
And so that's actually a really important workload.
And so in Tiny, we have a keyword spotter. But the models there are small. We talked about how GPT-3 is 175 billion parameters. A lot of these tiny models are, you know,
hundreds of thousands of parameters, under a million. Yeah. Yeah.
But those fit really well into a lot of very specific, use-case-oriented models, right?
Yeah.
I mean, IoT stuff.
Yeah.
Yeah.
So we've got keyword spotting, like anomaly detection.
So, you know, like if you're listening for like a rattle in your car or something.
Yeah.
Temperature sensors in a refrigerator at a grocery store.
Yeah, that's perfect. Yeah, exactly. And then you've got a visual wake word detector, so
that's more of, like, not doing full-on image recognition, but just, like, did-the-cat-door-open type stuff.
And then shoot, what's the last one?
I think the last one is, oh yeah,
basic image classification.
And so what transpired over the last iteration
was a new suite of submissions.
Is that how I would read that?
That's correct. So,
you know, just backing up, we do generally four releases of results
a year. Some of these analysts and press folks didn't really want us to be doing things
every month. And so in response to that, we said, okay, we'll do it once a quarter. And everyone who's got results in that quarter, you know, you pick the same day.
So, you know, this was the second quarter of 2023. And so we got both tiny and training in that window.
And so we've got, you know, it's sort of the same deal.
All of our results are peer reviewed, checked for correctness, so that we as an
organization and everyone involved can stand by it. And so we had submission for tiny and training
about six weeks ago with the results coming out now. Okay, that's great. Well, I'm sorry to say
it's about time for us to leave. Jason, is there any last questions for David before we leave?
Just kind of, you know, a semi question slash statement.
I was at a data center conference recently where I saw that basically the energy of a single Google search can power a 100-watt light bulb for about 11 seconds.
A single ChatGPT session is about 50 to 100 times more power consumption. I think one of the things that ML,
you know, MLPerf is going to be absolutely valuable for is as AI becomes more pervasive,
and you look at basically AI pervasiveness in other data centers, it is going to require
a kind of a fundamental shift in the way people think about power
consumption for what they need. So that one light bulb for 11 seconds goes to 54 light bulbs in 11
seconds. It's pretty significant. So I think, you know, this is really cool what you guys are doing.
And I think you're in the right place at the right time to really, you know, kind of start to measure these things that are going to be,
you know, an impact to a lot of companies that are out there, you know, running these systems.
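Taking Jason's figures at face value, the back-of-the-envelope arithmetic looks like this; the numbers are his, only the unit conversion is added here.

```python
# 100 W bulb for 11 seconds -> energy of one search, per Jason's figure.
search_energy_wh = 100 * 11 / 3600          # ≈ 0.31 Wh per search
chat_low, chat_high = 50 * search_energy_wh, 100 * search_energy_wh
print(f"search ≈ {search_energy_wh:.2f} Wh; ChatGPT session ≈ {chat_low:.0f}-{chat_high:.0f} Wh")
```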
David, correct me if I'm wrong, but you do some power measurements, I think for inferencing.
Yeah. So I just want to say up front, I didn't pay anyone to advertise
this, but yes. No, actually, I helped lead the group that did power measurement for inference.
So that is an option for all of our inference submissions. And we usually get, I think,
at least hundreds of power measurements, maybe even more, each round in inference. But one of the
things we're actually looking forward to, ideally later this year in Q4, is going to be doing power
measurement for training and HPC. And it's actually a much harder problem, because, I think, going back
to the point about 4,000 processors, you can't exactly just go up to a data center and plug in
a power meter, right? It's a lot more complicated than that. So we're, you know, we're working on
that. We think it's critically important. And part of it is, you know, I think as you were alluding
to Jason, there's a lot of folks who really do care about the environmental impact of all of this
computing. And, you know, first of all, there's a lot of noise out there. There aren't really great
measurements and there certainly are none that are, you know, sort of done in an industry standard
method. And so that's, you know, when you get back to what it is that ML Commons really does,
you know, we work on joint engineering for industry standard problems
that are going to make machine learning better.
And I think at the end of the day,
getting power measurement for training
so we can look at those numbers,
have real data, and then watch it improve.
To me, that's really exciting and impactful.
Yeah.
And being able to compare apples to apples
and not apples to beer, there's a big difference. That is 100% correct. Although I do love both
beer and apples. Beer plus apples equals cider, right? Something like that. David, is there
anything you'd like to say to our listening audience before we close? No, I mean, I think it's been a pleasure to have you all listen.
And, you know, I guess the one thing I would say is stay tuned. We are working on some consumer
oriented benchmarks for apps and PCs that I do hope to get out later this year.
And then I hope everyone who listens will get a chance to download and try
them out.
That would be great. All right. Well, this has been great, David.
Thank you very much for being on our show today.
Absolutely. Thank you for the time.
That's it for now. Bye David. Bye Jason.
Bye Ray.
Bye.
Until next time.
Next time we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.