In The Arena by TechArena - MLCommons’ David Kanter on Benchmarking AI, Accelerating Innovation
Episode Date: November 13, 2024
David Kanter discusses MLCommons' role in setting benchmarks for AI performance, fostering industry-wide collaboration, and driving advancements in machine learning capabilities.
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Alison Klein.
Now, let's step into the arena.
Welcome in the arena.
My name is Alison Klein. Today is a very exciting day
because I am joined by David Kanter, founder and head of MLCommons. Welcome to the platform,
David. How are you doing? Doing great. It's a wonderful day to be on and thanks for taking
the time to have me. So David, you and I have known each other for a really long time, but
this is actually the first time you're doing a podcast with me.
So why don't you just start with an introduction of yourself and your background in the tech arena?
Absolutely. I've got a long history in the tech world.
I started out running a website called Real World Tech,
which is kind of a water cooler for computer architects and semiconductor professionals and enthusiasts. And over time,
I had a couple of very interesting ventures. And along the way, in 2018, I got involved in MLPerf.
I started out in the humble position of secretary for the group at some of our first meetings,
and it was just really a wonderful dynamic and atmosphere.
And then the joke is I got promoted to be secretary a la Ban Ki-moon, where now I'm the
head of MLPerf. And so I helped start MLCommons, helped build it for the first few years. And now
my focus is on MLPerf, which is about how do we make AI better, faster, and more efficient for
everyone? When you think about the shift that you
made from really covering the tech sector to being a strategic part of how the tech sector really
displays foundational information about technology that's driving the most important workload in the
data center, AI, what made you decide to make this shift in your career towards this?
So I've always been motivated by learning and discovering the new things we can do with
technology, because in a lot of ways, it brings us things that were previously in fairy tales
to reality. And this was an opportunity to step forward and make it so. And part of it is I'd seen good benchmarks.
I'd seen bad benchmarks.
And to me, it was very important that we have good benchmarks that push the community in the right direction and help bring together buyers, sellers, and really help the community function better.
And it is super different to be covering things versus building a product.
And that's been a just wonderful experience and also just great
community of people. That's awesome. Benchmarking has been around since we started creating
computers. People wanted to measure how they were performing, but benchmarking has really been
reimagined for AI. What's unique about this capability in terms of creation of performance benchmarks for artificial
intelligence? Yeah, I would say I think it's a very, very hard problem. And MLPerf is not the
first benchmark for AI. And actually, there were several predecessors that were ultimately
unsuccessful. But we joined all of our forces together and learned from the prior
efforts. And combined, we were able to make something that really stuck. And so I'd say
part of it is it's a full stack problem. It encompasses everything from data to software,
to compilers, to architecture, to silicon, to the scale of your system. And there's so many factors
that we get to use to accelerate AI,
but that makes the measurement really tough.
So it's been a challenge, but a really rewarding one
to see it all come together and really land.
With this latest round of results,
you've had companies from across the compute landscape publish results.
And in talking to your members, how are these results utilized? And walk us
through the process of submission. Yeah, that's a great question. Maybe let's start with the end in
mind. How are they used? And I think the benchmarks are used in a number of ways. One is to help
buyers, especially if you're at a giant company, you might be able to afford a staff to help you
evaluate like what your technical choices are. And even those folks will oftentimes use MLPerf as a first
pass heuristic. And they'll say, I'm interested in recommendation. Let me see what the MLPerf
score for recommendation is. But a lot of folks in the enterprise may not have those complex
evaluation teams. And so it can be a really integral part of evaluating and helping inform
purchasing. But it also helps align researchers and engineers around what it means to be better. What should we be focusing on to make AI faster, more energy efficient, more capable? Getting those metrics out there gets everyone on the same page about what it truly means to be better. That's sort of the macro view. As for how it works, we start by defining a benchmark for an interesting area.
One of the hottest ones, of course, is large language model pre-training and fine-tuning.
And so we'll have to define the task, define an accuracy bar, define a data set. And so we're going to say, hey, this is the task. We want you to train, say, GPT-3 to this accuracy.
And that ensures that it's relevant to customers
because it's a real accuracy target.
It's going to be something that's relevant.
And that means any optimizations that are made
will ultimately benefit the whole community
and kind of drive best known methods.
So you define a reference that is the GPT-3 model.
And then if you're a submitter, you're going to take that reference and re-implement it
in a way that works for you, right?
There's lots of companies with different solutions, whether it's training or inference,
and we allow them to re-implement so that it's architecture neutral because we don't
want to pick favorites.
Our goal is to provide,
if I may steal from your name, a fair arena for folks. Yeah. So we need to allow that
reimplementation to allow it to be fair. And then they will submit their reimplementation,
the performance numbers. We peer review it to ensure if it's a commercially available system,
it should be reproducible, should be within the bounds of reason that we would expect.
And through that peer review process, we make sure that it is reproducible.
And then we're going to publish the numbers. And that gives us a lot of faith in the process.
And it also means that when we publish the numbers, we actually also publish the configuration,
the software, so that
if you're a third party, you can go and grab that and say, aha, these guys scored X. Let me go get
their software and reproduce that and see not only, oh, they did get X and maybe my system got 0.9X
because it's a little bit different, but then you also get to see all the configurations. So you can
see how it is that they got to X. That's super important
because it lets people understand, yes, this is a score. Everyone wants to be the best and everyone's
going to put their best foot forward, but you can see, hey, what did you guys do? And if I can make
a car analogy, it's a Formula One car, and maybe they're using super high octane fuel that I can't
buy. And so as a buyer, I'll be like, maybe that's a little bit faster than I would have thought.
But again, best known methods and it shares all that knowledge with the industry.
So from start to finish, it's really envisioned to give us solid numbers and to share all of the knowledge so that at the end of the day, we're moving forward together with the momentum of everyone combined.
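As a rough illustration of the time-to-train idea David describes (train to a fixed accuracy bar on a defined task and data set, then report how long it took), here is a minimal sketch. The model, data loader, and evaluate function are hypothetical stand-ins, not MLPerf reference code; submitters re-implement the reference in whatever stack suits their system, as long as the task, data, and accuracy target match.

```python
import time

TARGET_ACCURACY = 0.75  # hypothetical quality bar fixed by the benchmark definition
MAX_EPOCHS = 100        # safety cap for this sketch

def run_benchmark(model, train_loader, evaluate):
    """Train until the accuracy target is reached and report time-to-train.

    `model`, `train_loader`, and `evaluate` are placeholders for whatever
    framework a submitter actually uses.
    """
    start = time.perf_counter()
    for epoch in range(MAX_EPOCHS):
        for batch in train_loader:
            model.train_step(batch)      # hypothetical training step
        accuracy = evaluate(model)       # measured against the defined accuracy bar
        if accuracy >= TARGET_ACCURACY:
            elapsed = time.perf_counter() - start
            return {"epochs": epoch + 1, "time_to_train_s": elapsed}
    raise RuntimeError("accuracy target not reached within the epoch cap")
```

The published number would then ship alongside the exact configuration and software, so a third party can rerun it and compare their own result against the reported score, as David describes above.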
Now, you showed a really interesting slide in your pre-brief describing performance advancement over time versus Moore's Law.
Describe that to our audience, what you were showing, because they don't have the benefit of the slide.
And what do you think it says about the curve over time?
So let me try to describe the ideas I was trying to convey, which is both you and I have a background in semiconductors. And so Moore's law is sort of one of the constants of life,
right? And what it says is we're going to get more transistors all the time through innovations
in process technology. And so I said, well, let's assume that those transistors, each additional
transistor gives us more performance. So that's our baseline. How are we doing on each of our benchmarks relative to that baseline? How much
faster are we getting over time? Are we beating Moore's law? Are we just following Moore's law?
Or maybe for some things we're not, but let's hold ourselves, let's hold the whole community
accountable to that as a bound. And what you see
is that in a lot of cases, we've been beating Moore's law by 2x, by 5x, by 10x over a long
period of time, especially when you look at some of the older benchmarks where the first scores
were from our first few rounds where ML tooling was less mature.
And so what it shows is that I kind of talked about a lot of the factors that go into ML
perf, right?
You've got process technology, you've got architecture, you've got software optimization,
all of those.
It says that when we take the collective knowledge of the industry and sort of apply it on all
those different dimensions, that's how much better than Moore's law we can do and how much faster we're bringing capabilities to the whole
community than just the incredibly rapid pace that people are used to in semiconductors. And to me, that's just an incredible feeling.
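To make that comparison concrete, here is a small sketch of the kind of arithmetic behind a slide like that: project a Moore's Law baseline (a doubling roughly every two years) over the elapsed time, then divide the measured benchmark speedup by that projection. The numbers below are made up for illustration and are not MLPerf results.

```python
def moores_law_baseline(years: float, doubling_period_years: float = 2.0) -> float:
    """Expected speedup if performance only tracked transistor doubling."""
    return 2.0 ** (years / doubling_period_years)

def speedup_vs_moore(measured_speedup: float, years: float) -> float:
    """How many times faster than the Moore's Law projection we actually got."""
    return measured_speedup / moores_law_baseline(years)

# Illustrative only: a benchmark score that improved 50x over 4 years
# beats a pure Moore's Law projection (4x over 4 years) by 12.5x.
print(speedup_vs_moore(measured_speedup=50.0, years=4.0))  # -> 12.5
```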
Do you think that, because the curve is flattening over time, that suggests we're hitting a maturation point, that there may not be as much exponential benefit from some of the layers of the stack you're describing? So, yes and no. I think part of what it shows
is that we've gotten as an industry a lot better at operating at scale. And I think one of the key things that we see is in some of our early benchmarks, we
saw people may be able to run on 64, 128 processors.
And by processor, I'm being very inclusive, processor, accelerator, whatever.
And now we're seeing people operating in the regime of 10,000 on these benchmarks.
And that requires tremendous innovation in how you structure the problem, how you partition
the data.
And people are getting better at doing that in the first round of a benchmark.
And so in some ways, I think our baseline's gotten higher because there has been some
maturity.
And we're playing a part in helping to drive that maturity.
Once a technique gets highlighted in MLPerf, it spreads. FlashAttention, for example, was shown off in MLPerf, and now everyone uses it more or less.
That's the kind of thing that just makes me feel so proud, is that we were able to help shine a spotlight on important innovations and help
folks understand this is really important.
This is the path forward.
This is how we think about those things.
You know, I love that.
I have been talking to some folks in industry standards land recently, and you talk about
the fierce competition in tech, but sometimes some of the most insightful innovations are when
the industry gathers to talk about something like a performance benchmark and how to model for a
performance benchmark or working on interoperability for an industry standard. I love that stuff
because it shows that it takes everyone's collective efforts to really bring the industry
forward. I have a question about, you know, we're talking about MLPerf training right now, but you have a slew of benchmarks. What's in the portfolio today
and what's coming up in the next cycle? Yeah, I like to joke that MLPerf goes from
microwatts to megawatts. And, you know, power isn't everything, but it's a very broad portfolio.
So MLPerf Training, that's what we started with.
We branched out from there to MLPerf Inference, focused on how do we deploy these models at
scale for data center and bigger edge systems.
We have MLPerf Mobile, which is focused on inference for the smartphone form factor.
And that's like a pretty well-defined form factor and platform.
So we can narrow the problem a bit and get a more appropriate solution there.
Similarly, later this year or early next year, we're going to see MLPerf Client come out,
which is focused on PC class systems, notebooks, and so forth.
We have MLPerf Tiny, which is focused on embedded microcontrollers and edge devices, IoT sensors,
things like that, ultra low power, which is
exciting in a very different way.
You won't see anyone throwing around 20 megawatts like some of our big MLPerf training
submissions, but you'll see things like that.
We've got MLPerf Automotive, which is a bunch of folks in the auto industry said,
hey, can you take the magic of MLPerf Inference and help us sort out, as we make cars more intelligent, what benchmarks we should be using? And then we also have MLPerf HPC, and that's
looking at training, but for scientific problems. And then rounding out the portfolio, we have
MLPerf Storage, which had their second round recently, which is looking at how do we understand
the storage workload that is associated with AI training and make sure that we
can measure that and help folks understand the performance there so that our storage grows and
is progressing in tandem with the compute side of the house. Because ultimately, ML and AI are all about the data. That data lives somewhere, and that's where storage comes in. Yeah. There needs to be a data pipeline for sure. Now, I know that MLCommons is doing more than MLPerf and you're doing some
really interesting things. So share that. Yeah. We've got the AI risk and responsibility work. And you know, if you take a step back, MLCommons is about how do we bring people together to build
for better AI in the future? How do we make AI better for everyone?
And so we're very good at measurement and building for AI. And so one of the areas that we've turned
to is the AI risk and responsibility, which is how do we look at, especially these generative
systems and really measure their output and make sure that, you know, if you've got a customer
chatbot, how do you make sure it's really answering the question and not giving weird
movie recommendations or whatnot?
And we've got work on building data sets for the public.
I personally helped get, I think, three or four data sets pushed out the door that were all just super cool in different ways.
Speech recognition, wake up words, this really cool, diverse data set on household objects and images all across every socioeconomic class,
different geos. Everyone knows what a stove in my house looks like. It's probably pretty
similar to your stove, but you go to some other countries and their stove or their oven is a
hearth. So we've just built some really cool data sets in that area. And then, yeah, medical too, we've got a project focused on how you do federated evaluation for medical stuff. And again, leaning into measurement and collective building for AI.
And it's just a wonderful set of efforts and people behind them.
That's fantastic. I only have one more question for you, David. We've talked about the fact that
you just published your latest round of results. I'm sure folks want to check out how everybody performed.
They probably want to check out all the things that you just shared about what's coming next
and all of the different ways that you're engaging with the broader community.
Where can they go to find the results, and where can they go to find ways to engage with MLCommons?
Yeah, best place is mlcommons.org.
You can click through the links for the benchmark working groups.
You'll see, since this will be on after the training results are out, there should be
a wonderful blog post about them with links to the results.
That's the best way to dig in.
I hope to see lots of news stories about them and analysis from third parties.
And then for getting involved, there's a Get Involved link up there if this is something that's interesting for you. Or if you have a thought, maybe I'll regret this, we'll see, but my email is david at mlcommons.org.
Well, I'm sure that folks will be reaching out. Congratulations on this latest round of results.
I can't wait to hear more. Yeah. Looking forward to talking with you again and
seeing you at Supercomputing. Thanks so much.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyright by the Tech Arena.