In The Arena by TechArena - MLCommons’ David Kanter on Benchmarking AI, Accelerating Innovation
Episode Date: November 13, 2024
David Kanter discusses MLCommons' role in setting benchmarks for AI performance, fostering industry-wide collaboration, and driving advancements in machine learning capabilities.
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Alison Klein.
Now, let's step into the arena.
Welcome in the arena.
My name is Alison Klein. Today is a very exciting day
because I am joined by David Kanter, founder and head of MLCommons. Welcome to the platform,
David. How are you doing? Doing great. It's a wonderful day to be on and thanks for taking
the time to have me. So David, you and I have known each other for a really long time, but
this is actually the first time you're doing a podcast with me.
So why don't you just start with an introduction of yourself and your background in the tech arena?
Absolutely. I've got a long history in the tech world.
I started out running a website called Real World Tech,
which is kind of a water cooler for computer architects and semiconductor professionals and enthusiasts. And over time,
I had a couple of very interesting ventures. And along the way, in 2018, I got involved in MLPerf.
I started out in the humble position of secretary for the group at some of our first meetings,
and it was just really a wonderful dynamic and atmosphere.
And then the joke is I got promoted to be secretary a la Ban Ki-moon, where now I'm the
head of MLPerf. And so I helped start MLCommons, helped build it for the first few years. And now
my focus is on MLPerf, which is about how do we make AI better, faster, and more efficient for
everyone? When you think about the shift that you
made from really covering the tech sector to being a strategic part of how the tech sector really
displays foundational information about technology that's driving the most important workload in the
data center, AI, what made you decide to make this shift in your career towards this?
So I've always been motivated by learning and discovering the new things we can do with
technology, because in a lot of ways, it brings us things that were previously in fairy tales
to reality. And this was an opportunity to step forward and make it so. And part of it is I'd seen good benchmarks.
I'd seen bad benchmarks.
And to me, it was very important that we have good benchmarks that push the community in the right direction and help bring together buyers, sellers, and really help the community function better.
And it is super different to be covering things versus building a product.
And that's been a just wonderful experience and also just great
community of people. That's awesome. Benchmarking has been around since we started creating
computers. People wanted to measure how they were performing, but benchmarking has really been
reimagined for AI. What's unique about this capability in terms of creation of performance benchmarks for artificial
intelligence? Yeah, I would say I think it's a very, very hard problem. And MLPerf is not the
first benchmark for AI. And actually, there were several predecessors that were ultimately
unsuccessful. But we joined all of our forces together and learned from the prior
efforts. And combined, we were able to make something that really stuck. And so I'd say
part of it is it's a full stack problem. It encompasses everything from data to software,
to compilers, to architecture, to silicon, to the scale of your system. And there's so many factors
that we get to use to accelerate AI,
but that makes the measurement really tough.
So it's been a challenge, but a really rewarding one
to see it all come together and really land.
With this latest round of results,
you've had companies from across the compute landscape publish results.
And in talking to your members, how are these results utilized? And walk us
through the process of submission. Yeah, that's a great question. Maybe let's start with the end in
mind. How are they used? And I think the benchmarks are used in a number of ways. One is to help
buyers, especially if you're at a giant company, you might be able to afford a staff to help you
evaluate like what your technical choices are. And even those folks will oftentimes use MLPerf as a first
pass heuristic. And they'll say, I'm interested in recommendation. Let me see what the MLPerf
score for recommendation is. But a lot of folks in the enterprise may not have those complex
evaluation teams. And so it can be a really integral part of evaluating and helping inform
purchasing. But it also helps align researchers and engineers around what it means to be better. What should we be focusing on to make AI faster, more energy efficient, more capable? Getting those metrics out there gets everyone on the same page about what it truly means to be better. That's sort of the macro view. As for how it works, we start by defining a benchmark for an interesting area.
One of the hottest ones, of course, is large language model pre-training and fine-tuning.
And so we'll have to define the task, define an accuracy bar, define a data set. And so we're going to say, hey, this is the task. We want you to train, say, GPT-3 to this accuracy.
And that ensures that it's relevant to customers
because it's a real accuracy target.
It's going to be something that's relevant.
And that means any optimizations that are made
will ultimately benefit the whole community
and kind of drive best known methods.
So you define a reference that is the GPT-3 model.
And then if you're a submitter, you're going to take that reference and re-implement it
in a way that works for you, right?
There's lots of companies with different solutions, whether it's training or inference,
and we allow them to re-implement so that it's architecture neutral because we don't
want to pick favorites.
Our goal is to provide,
if I may steal from your name, a fair arena for folks. Yeah. So we need to allow that
reimplementation to allow it to be fair. And then they will submit their reimplementation,
the performance numbers. We peer review it to ensure if it's a commercially available system,
it should be reproducible, should be within the bounds of reason that we would expect.
And through that peer review process, we make sure that it is reproducible.
And then we're going to publish the numbers. And that gives us a lot of faith in the process.
And it also means that when we publish the numbers, we actually also publish the configuration,
the software, so that
if you're a third party, you can go and grab that and say, aha, these guys scored X. Let me go get
their software and reproduce that and see not only, oh, they did get X and maybe my system got 0.9X
because it's a little bit different, but then you also get to see all the configurations. So you can
see how it is that they got to X. That's super important
because it lets people understand, yes, this is a score. Everyone wants to be the best and everyone's
going to put their best foot forward, but you can see, hey, what did you guys do? And if I can make
a car analogy, it's a Formula One car, and maybe they're using super high octane fuel that I can't
buy. And so as a buyer, I'll be like, maybe that's a little bit faster than I would have thought.
But again, best known methods and it shares all that knowledge with the industry.
So from start to finish, it's really envisioned to give us solid numbers and to share all of the knowledge so that at the end of the day, we're moving forward together with the momentum of everyone combined.
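As a rough illustration of the time-to-train idea David describes (train to a fixed accuracy bar on a defined task and data set, then report how long it took), here is a minimal sketch. The model, data loader, and evaluate function are hypothetical stand-ins, not MLPerf reference code; submitters re-implement the reference in whatever stack suits their system, as long as the task, data, and accuracy target match.

```python
import time

TARGET_ACCURACY = 0.75  # hypothetical quality bar fixed by the benchmark definition
MAX_EPOCHS = 100        # safety cap for this sketch

def run_benchmark(model, train_loader, evaluate):
    """Train until the accuracy target is reached and report time-to-train.

    `model`, `train_loader`, and `evaluate` are placeholders for whatever
    framework a submitter actually uses.
    """
    start = time.perf_counter()
    for epoch in range(MAX_EPOCHS):
        for batch in train_loader:
            model.train_step(batch)      # hypothetical training step
        accuracy = evaluate(model)       # measured against the defined accuracy bar
        if accuracy >= TARGET_ACCURACY:
            elapsed = time.perf_counter() - start
            return {"epochs": epoch + 1, "time_to_train_s": elapsed}
    raise RuntimeError("accuracy target not reached within the epoch cap")
```

The published number would then ship alongside the exact configuration and software, so a third party can rerun it and compare their own result against the reported score, as David describes above.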
Now, you showed a really interesting slide in your pre-brief describing performance advancement over time versus Moore's Law.
Describe that to our audience, what you were showing, because they don't have the benefit of the slide.
And what do you think it says about the curve over time?
So let me try to describe the ideas I was trying to convey, which is both you and I have a background in semiconductors. And so Moore's law is sort of one of the constants of life,
right? And what it says is we're going to get more transistors all the time through innovations
in process technology. And so I said, well, let's assume that those transistors, each additional
transistor gives us more performance. So that's our baseline. How are we doing on each of our benchmarks relative to that baseline? How much
faster are we getting over time? Are we beating Moore's law? Are we just following Moore's law?
Or maybe for some things we're not, but let's hold ourselves, let's hold the whole community
accountable to that as a bound. And what you see
is that in a lot of cases, we've been beating Moore's law by 2x, by 5x, by 10x over a long
period of time, especially when you look at some of the older benchmarks where the first scores
were from our first few rounds where ML tooling was less mature.
And so what it shows is that I kind of talked about a lot of the factors that go into ML
perf, right?
You've got process technology, you've got architecture, you've got software optimization,
all of those.
It says that when we take the collective knowledge of the industry and sort of apply it on all
those different dimensions, that's how much better than Moore's law we can do and how much faster we're bringing capabilities to the whole
community than just the incredibly rapid pace that people are used to in semiconductors. And to me, that's just an incredible feeling.
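To make that comparison concrete, here is a small sketch of the kind of arithmetic behind a slide like that: project a Moore's Law baseline (a doubling roughly every two years) over the elapsed time, then divide the measured benchmark speedup by that projection. The numbers below are made up for illustration and are not MLPerf results.

```python
def moores_law_baseline(years: float, doubling_period_years: float = 2.0) -> float:
    """Expected speedup if performance only tracked transistor doubling."""
    return 2.0 ** (years / doubling_period_years)

def speedup_vs_moore(measured_speedup: float, years: float) -> float:
    """How many times faster than the Moore's Law projection we actually got."""
    return measured_speedup / moores_law_baseline(years)

# Illustrative only: a benchmark score that improved 50x over 4 years
# beats a pure Moore's Law projection (4x over 4 years) by 12.5x.
print(speedup_vs_moore(measured_speedup=50.0, years=4.0))  # -> 12.5
```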
Do you think that, because the curve is flattening over time, that suggests we're hitting a maturation point, that there may not be as much exponential benefit from some of the layers of the stack you're describing? So, yes and no. I think part of what it shows
is that we've gotten as an industry a lot better at operating at scale. And I think one of the key things that we see is in some of our early benchmarks, we
saw people may be able to run on 64, 128 processors.
And by processor, I'm being very inclusive, processor, accelerator, whatever.
And now we're seeing people operating in the regime of 10,000 on these benchmarks.
And that requires tremendous innovation in how you structure the problem, how you partition
the data.
And people are getting better at doing that in the first round of a benchmark.
And so in some ways, I think our baseline's gotten higher because there has been some
maturity.
And we're playing a part in helping to drive that maturity.
Once a technique gets highlighted in MLPerf, it spreads. FlashAttention, for example, was shown off in MLPerf, and now everyone uses it more or less.
That's the kind of thing that just makes me feel so proud, is that we were able to help shine a spotlight on important innovations and help
folks understand this is really important.
This is the path forward.
This is how we think about those things.
You know, I love that.
I have been talking to some folks in industry standards land recently, and you talk about
the fierce competition in tech, but sometimes some of the most insightful innovations are when
the industry gathers to talk about something like a performance benchmark and how to model for a
performance benchmark or working on interoperability for an industry standard. I love that stuff
because it shows that it takes everyone's collective efforts to really bring the industry
forward. I have a question about, you know, we're talking about MLPerf training right now, but you have a slew of benchmarks. What's in the portfolio today
and what's coming up in the next cycle? Yeah, I like to joke that MLPerf goes from
microwatts to megawatts. And, you know, power isn't everything, but it's a very broad portfolio.
So MLPerf Training, that's what we started with.
We branched out from there to MLPerf Inference, focused on how do we deploy these models at
scale for data center and bigger edge systems.
We have MLPerf Mobile, which is focused on inference for the smartphone form factor.
And that's like a pretty well-defined form factor and platform.
So we can narrow the problem a bit and get a more appropriate solution there.
Similarly, later this year or early next year, we're going to see MLPerf Client come out,
which is focused on PC class systems, notebooks, and so forth.
We have MLPerf Tiny, which is focused on embedded microcontrollers and edge devices, IoT sensors,
things like that, ultra low power, which is
exciting in a very different way.
You won't see anyone throwing around 20 megawatts like some of our big MLPerf training
submissions, but you'll see things like that.
We've got MLPerf Automotive, which is a bunch of folks in the auto industry said,
hey, can you take the magic of MLPerf Inference and help us sort out, as we make cars more intelligent, what benchmarks we should be using? And then we also have MLPerf HPC, and that's
looking at training, but for scientific problems. And then rounding out the portfolio, we have
MLPerf Storage, which had their second round recently, which is looking at how do we understand
the storage workload that is associated with AI training and make sure that we
can measure that and help folks understand the performance there so that our storage grows and
is progressing in tandem with the compute side of the house. Because ultimately, ML and AI are all about the data. That data lives somewhere, and that's where storage comes in. Yeah. There needs to be a data pipeline for sure. Now, I know that MLCommons is doing more than MLPerf and you're doing some
really interesting things. So share that. Yeah. We've got the AI risk and responsibility work. And you know, if you take a step back, MLCommons is about how do we bring people together to build
for better AI in the future? How do we make AI better for everyone?
And so we're very good at measurement and building for AI. And so one of the areas that we've turned
to is the AI risk and responsibility, which is how do we look at, especially these generative
systems and really measure their output and make sure that, you know, if you've got a customer
chatbot, how do you make sure it's really answering the question and not giving weird
movie recommendations or whatnot?
And we've got work on building data sets for the public.
I personally helped get, I think, three or four data sets pushed out the door that were all just super cool in different ways.
Speech recognition, wake up words, this really cool, diverse data set on household objects and images all across every socioeconomic class,
different geos. Everyone knows what a stove in my house looks like. It's probably pretty
similar to your stove, but you go to some other countries and their stove or their oven is a
hearth. So we've just built some really cool data sets in that area. And then, yeah, medical too, we've got a project focused on how you do federated evaluation for medical stuff. And again, leaning into measurement and collective building for AI.
And it's just a wonderful set of efforts and people behind them.
That's fantastic. I only have one more question for you, David. We've talked about the fact that
you just published your latest round of results. I'm sure folks want to check out how everybody performed.
They probably want to check out all the things that you just shared about what's coming next
and all of the different ways that you're engaging with the broader community.
Where can they go to find the results, and where can they go to find ways to engage with MLCommons?
Yeah, best place is mlcommons.org.
You can click through the links for the benchmark working groups.
You'll see, since this will be on after the training results are out, there should be
a wonderful blog post about them with links to the results.
That's the best way to dig in.
I hope to see lots of news stories about them and analysis from third parties.
And then for getting involved, there's a Get Involved link up there if this is something that's interesting for you. Or if you have a thought, maybe I'll regret this, we'll see, but my email is david at mlcommons.org.
Well, I'm sure that folks will be reaching out. Congratulations on this latest round of results.
I can't wait to hear more. Yeah. Looking forward to talking with you again and
seeing you at Supercomputing. Thanks so much.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyright by the Tech Arena.