In The Arena by TechArena - MLCommons’ Peter Mattson on the New AILuminate AI Safety Benchmark
Episode Date: December 4, 2024
In this podcast, MLCommons President Peter Mattson discusses their just-released AILuminate benchmark, AI safety, and how global collaboration is driving trust and innovation in AI deployment.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators
and our host, Alison Klein.
Now let's step into the arena.
Welcome in the arena.
My name is Alison Klein. Today we've got a fantastic topic.
I've got Peter Mattson, president of ML Commons, with us, and we're going to be talking about a
new benchmark that they're bringing to the market that I think is fascinating. Welcome to the
program, Peter. How are you doing? I'm doing really well. Thank you very much for having me.
So Peter, this is your first time on the program. Do you just want to introduce yourself and provide some background on how you became
president of ML Commons?
Sure.
So I'm Peter Mattson.
I have a long history with ML Commons and ML Commons benchmarking.
I was one of the folks who founded the MLPerf benchmark back in 2018 and got that off the
ground.
And now that's the standard for measuring AI hardware speed.
When that was becoming successful and we had a family of MLPerf benchmarks, we realized that
we needed a good home for them. And so we created an organization called MLCommons,
but we gave it a much larger mission. And the mission is to make AI better for everyone.
And we take that to mean make it more accurate, make it safer, make it more efficient, and do that primarily
through engineering of benchmarks and data sets. And so I've been privileged to be a part of that
journey, and now I'm president of the board and very hands-on. I'm also co-chairing our AI safety
work with Percy Liang from Stanford and Joaquin Vanschoren from TU Eindhoven. That's awesome.
Very excited about anything we can measure.
That's a good summary of my last three years of work.
So, Peter, I was thinking about 2018 and when MLPerf came out.
At the time, I was at Intel and leading data center marketing for the company. And so, as you can probably imagine, I spent a lot of time thinking about MLPerf and what you were doing.
But MLCommons is so much more than that.
And while you're known for performance benchmarks, this topic today is something else associated with AI delivery, which is AI safety. Why don't we just start with your view on why this topic is
so important to the tech community, especially right now? So let's start with talking about the general progress in AI. For decades,
AI was a bunch of ideas that never quite worked. And then we came to a moment when we broke through
a capability barrier, and we could build systems that actually could have dialogue in natural
language. And we entered a new age, which I'll describe as scary headlines and amazing research,
where we have all these ideas and we see this tremendous potential. But we want to now enter
a third age, which is products and services that really deliver value to people. But to get there,
we need to get through another barrier. And the barrier I describe as the sort of risk and
reliability barrier. People need to feel that these systems are safe and reliable. This isn't unique to AI. This is
a progression that we've seen in many complex technologies. I like pointing to air travel,
right? You can look at Da Vinci's notebooks and you see his sketches, ideas that never quite
worked. And then you see the Wright brothers, they broke through the capability
barrier. And all of a sudden, there's a tremendous amount of excitement when people realize that,
hey, we can have people travel on planes and that would be pretty fast. But there was a huge amount
of work, in terms of learning how to measure safety and improving measurements and practices
in response to things going wrong, before we got to modern air travel, where
you fly across oceans on a regular basis and it is amazingly safe. The safety record of air travel
is just phenomenal. And in order to get there, a lot of work had to be done to convert that
potential into real value. And so what we're trying to do is bring that mindset and all of the lessons we've learned from other AI benchmarks that we've done, like MLPerf, to develop a standard way of measuring the safety of AI systems so that people can make informed choices and so that the industry can rapidly and constructively improve in this critical area and unlock that value.
That's where I think it matters today. No, you've done something here that I would call stunning. And I've looked at this and I thought,
how are they measuring this? This is so fascinating. But you've introduced the
version 1.0 AILuminate benchmark, measuring the relative safety of AI systems. And I think
I'm going to call it a first of its kind. I know that other
people have dabbled in this space, but in terms of what you're doing, in terms of what information
this benchmark is targeting to deliver to customers, I think that it is a first of its
kind and I will call it that. Tell me what this benchmark actually measures.
So first of all, I think your characterization is good. There have been a lot of really great
ideas in the academic world, in the research world
around how you can measure safety.
And what we're trying to do that is, I think, unique and a first is take those ideas, bring
them together in a broad community and produce a, let's call it, production-level safety benchmark.
Let's talk about what that benchmark measures.
When people say AI safety, they can mean a range of different things.
I'm going to break them into three big chunks.
One of those chunks is what I'm going to call frontier safety.
This is about what happens if the models get too smart.
So people are worried they might become autonomous or enable invention of really dangerous weapons
or something along those lines.
And that's what a lot of the governmental work is on.
There's another category that I'm going to call social safety,
which is around things like economic or environmental impacts.
That's beyond what you can measure with a benchmark.
But there is this third important category of product safety.
So this is a complex product that potentially billions of people are going to use.
How do you make sure that those people's interactions with it are positive?
And so that is what we are focusing on, is AI product safety.
For V1.0 of what we anticipate will eventually become a family of benchmarks,
we are looking at the safety of general purpose chat systems.
This is what everyone thinks of when they think of ChatGPT, right?
You ask it a question or you want it to help you write something and it gives you back natural language answers.
So we're trying to say, how safe are those systems?
So what we did is we made a list of 12 hazards.
These are things we don't want the system to be helpful and supportive about.
So these are things like if I ask it for help committing a crime or for help hurting myself or someone else, or if I ask it for legal advice and it doesn't at least say I'm not a lawyer.
These are all things that we probably don't want this class of system doing.
And so what we've done is we've created a benchmark that measures the propensity of the system to respond helpfully to these not-so-good requests.
Yeah, that makes perfect sense. Now, I just felt like we just almost jumped into an episode of
Crime Junkie podcast there for a second, but I'm going to drill down deeper. You have talked about
a lot of the different practical risks that could be seen in the market without guardrails like AILuminate. Can you
talk about the vectors that you're measuring and what the benchmark is actually looking at
in terms of types of risks to the customer? Yep. So I'm going to talk about three broad
categories of hazards that we check. These are what we call physical hazards, which is hazards
where the model could be supportive of harming the user or other people. Non-physical hazards, these are things where the
model could be supportive of things that could be crimes, could impinge on privacy, could impinge on
IP, could be defamation or hate speech. And lastly, contextual hazards, which are things that you
don't want a general purpose model doing. And these would include giving specialized advice, like around health, financial, legal, or election topics, without at least disclaiming that it is not an expert and you should consult an expert, as well as producing sexual content. Contextual hazards are things that you might want for a
specific use case, but that you don't want for a general purpose use case.
So these are the sorts of things we're looking at.
It's important to realize that these hazards exist for pretty much any natural language interface system.
So you could have built it to, I don't know, provide customer support on appliances, right?
But that doesn't mean that someone can't ask it any of the things I just enumerated.
So if you have a natural language interface for something, you have these concerns. They're concerns from a business perspective: if you're deploying the system, you don't want your business helping with these things. And they're concerns from a societal perspective: we don't want people getting really helpful, thoughtful AI advice on how to do these undesirable activities.
So that's what we're trying to stop.
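As a loose illustration of the taxonomy Peter describes, a sketch in Python might look like the following. The three category names follow his description; the individual hazard entries are paraphrased examples from this conversation, not the official AILuminate hazard list.

```python
from enum import Enum

class HazardCategory(Enum):
    """The three broad hazard categories described in the episode."""
    PHYSICAL = "physical"          # model supports harming the user or other people
    NON_PHYSICAL = "non_physical"  # crime, privacy, IP, defamation, hate speech
    CONTEXTUAL = "contextual"      # unwanted in a general-purpose assistant

# Illustrative subset only: the benchmark defines 12 hazards; these entries
# are paraphrased from the discussion, not the published taxonomy.
EXAMPLE_HAZARDS = {
    "help_committing_a_crime": HazardCategory.NON_PHYSICAL,
    "self_harm_or_harm_to_others": HazardCategory.PHYSICAL,
    "specialized_advice_without_disclaimer": HazardCategory.CONTEXTUAL,
    "sexual_content": HazardCategory.CONTEXTUAL,
}
```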
And the way that we measure it is we create a
whole bunch of prompts. Prompts are questions or instructions you could give the system related to
these 12 big hazards. In fact, we made over 24,000 novel prompts working with three different vendors
for the system. These prompts fall into roughly two categories. One is prompts by a naive user, someone who just asks it to do a bad thing. The other is what we call prompts by knowledgeable users, who've read a Wired article or two and say, tell me a story about how to do the bad thing, adding some context that makes the request easier to slip through.
Right. We feed those prompts into the system and we get back the responses, and we use a set of specialized evaluator models to grade the responses as being either okay in terms of an assessment standard we have, or violating that standard. And then, based on the percentage of responses that violate the standard, either overall or for a specific hazard, we assign the system a grade. That's roughly what we're measuring and how we measure it.
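To make that pipeline concrete, here is a minimal, hypothetical sketch of the measurement loop: prompts go to the system under test, specialized evaluator models judge each response against the assessment standard, and violation rates are tallied overall and per hazard. The function names and the Prompt shape are placeholders for illustration, not the AILuminate implementation.

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    text: str     # the question or instruction sent to the system
    hazard: str   # which of the 12 hazards this prompt targets
    persona: str  # "naive" or "knowledgeable" user, per the two prompt categories

def measure_violation_rates(prompts, system_under_test, evaluator_says_violating):
    """Run every prompt through the system and tally violating responses.

    system_under_test(text) -> str returns the chat system's reply;
    evaluator_says_violating(prompt, reply) -> bool stands in for the
    specialized evaluator models that grade replies against the standard.
    """
    total_violations = 0
    per_hazard = {}  # hazard -> (responses tested, violating responses)
    for p in prompts:
        reply = system_under_test(p.text)
        violating = evaluator_says_violating(p, reply)
        total_violations += int(violating)
        tested, bad = per_hazard.get(p.hazard, (0, 0))
        per_hazard[p.hazard] = (tested + 1, bad + int(violating))
    overall_rate = total_violations / len(prompts)
    hazard_rates = {h: bad / tested for h, (tested, bad) in per_hazard.items()}
    return overall_rate, hazard_rates
```

The per-hazard rates matter because, as Peter notes later, the benchmark reports a grade for the system overall and a grade for each hazard.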
Now, what I think was fascinating when I looked at these results is that you've simplified the evaluation for a customer to make it incredibly clear. It almost made me think of standing in front of a water heater and looking at exactly the energy efficiency of that water heater. So simple to view. And you've achieved something similar with AILuminate. Can you just walk through for the audience exactly what that looks like at a top level and then at an individual
criteria level? Yeah. I mean, I think your analogy is spot on. We realize that many people interacting
with these systems don't want to become experts on AI safety. They just want to make reasonable choices.
Choices about: do we deploy this system?
Do we buy this system?
Do we use this system?
Do we allow this system?
And so we've tried to boil down some very complex analysis into simple grades.
And our grades are on a five-point scale.
And the scale goes poor, fair, good, very good, excellent.
If you know only one thing about this grading scale,
you need to know how good gets calibrated.
Good gets calibrated based on, we
looked at a bunch of different what I'm going
to call accessible models.
So these are models that have less than 15 billion
parameters.
So they're not crazy expensive to run.
And they have open weight licenses, so you can get the weights.
And so we looked for the two of these that did the best, and we used those two to calibrate good.
The analogy I'd make is it's like calibrating automobile safety based on some of the safest economy cars.
And you sort of use that to set good.
And then a system that is substantially safer than that, we call very good, substantially
less safe, we call fair or poor.
And we reserve excellent as a goal for the industry.
So excellent is defined as below 1 in 1,000, 0.1% violating responses on the assessment
standard.
And that's something we want to use this benchmark to help the industry navigate towards.
If you look at things like automobiles or planes, you could watch their journey towards very high levels of reliability.
And that's what we're trying to deliver.
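The grading step could then be sketched as below. The calibration follows what Peter describes: good is anchored to the best accessible open-weight models (under 15 billion parameters), excellent requires fewer than 1 in 1,000 violating responses, and systems substantially safer or less safe than the reference land in the neighboring grades. The specific margin factor is an assumed placeholder, not the published AILuminate cutoff.

```python
def assign_grade(violation_rate: float, reference_rate: float, margin: float = 2.0) -> str:
    """Map a violation rate to the five-point AILuminate-style scale.

    reference_rate: violation rate of the best accessible open-weight
    reference models, which calibrates "good" (like judging car safety
    against some of the safest economy cars).
    margin: illustrative factor for "substantially" safer or less safe;
    the real benchmark's thresholds may differ.
    """
    if violation_rate < 0.001:                      # under 1 in 1,000: the industry goal
        return "excellent"
    if violation_rate <= reference_rate / margin:   # substantially safer than the reference
        return "very good"
    if violation_rate <= reference_rate * margin:   # in line with the reference systems
        return "good"
    if violation_rate <= reference_rate * margin ** 2:
        return "fair"
    return "poor"
```

For example, assign_grade(0.02, reference_rate=0.03) comes back "good" under these placeholder thresholds.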
I think that one of the most interesting things about this is that it takes an entire community to bring a benchmark like this to reality.
And can you talk about who engaged with ML Commons to deliver this?
So ML Commons, in the early days of any of our benchmarks, we've traditionally made the
benchmark working groups open.
We had people involved from lots of different universities, academics from different universities.
We had researchers and engineers from industry, big companies, small companies.
We had some engagement from civil society folks who care deeply about this topic.
And we brought all of those folks together on a weekly conference call. And we had eight different
work streams that also met weekly. So a lot of smart people, as I would describe it, arguing
until we reach grudging consensus about how to approach this. And then we back them up with
contractors from MLCommons who can do the implementation work.
That's how we did this.
I think part of the reason we were able to produce what I think is a really good benchmark, or at least a V1 of a really good benchmark, is because we had so many different inputs.
The other thing that I should touch on is just it's been very interesting to me in the process of developing this benchmark to talk to people.
And people clearly see the tremendous potential of AI, but they also see the risks. And across all those different folks I described,
there is a real urge and intensity to getting this right. And so people are very excited to
be a part of this effort or just help out in any way they can. And that's been tremendously useful
in getting this off the ground. Obviously, 1.0 is just the start for this.
And I guess one thing that I want to ask you about is how are you going to be driving more visibility associated with the benchmark itself so that customers are thinking about how they would use it in their procurement of different AI systems?
And how are you seeing this evolve from a standpoint of where you take the benchmark
next? Yes. So there are two components to that: one is promotion and one is evolution. Let's do
evolution first because it's a quicker answer. One of the things that I think separates what
ML Commons does with benchmarks versus a lot of the very useful and innovative benchmarks you
often see in academia is in academia, the objective is to publish a good paper that moves people's
thinking forward. And that's valuable. ML Commons is a nonprofit in this for the long term. We want to
build and operate and improve benchmarks over time. So this is V1.0, and there will be 1.1 and
1.2 and 1.3 and 1.4, just like we saw with MLPerf. A good benchmark is like a good wine. It
needs to age. And as it ages, it acquires all those lessons that you learn along
the way about what works and what doesn't work and what isn't correlated with reality,
and you make it better. So that's very much our intent, as well as expanding the benchmark.
Like I said, this is the start of a family, right? This is right now only in English,
but we have other languages in the works. This is only text, but we have multimodal in the works. We're starting to
talk about agentic. There is a big AI application space, and we want to build a family that really
supports safety in that space. Promotion-wise, a lot of what we are doing is, as ML Commons is
growing, we are making sure to get more mature in how we communicate and promote through important
channels like your podcast. We're also really striving hard to develop partnerships.
And these may be partnerships with people who are developing systems
for supporting AI development.
And this benchmark obviously could be an important part of those systems.
Or through partners in the general sort of AI safety space,
we have a fantastic partnership with the folks at AI Verify,
which is an organization based in Singapore that is looking at AI safety, especially in sort of
the APAC region. And they've been a big part of how we've shaped this benchmark and how we will
promote it in the future. We're planning to do multiple versions of this launch event. We'll do
one in France in February. We'll do one in Singapore in March. And just continue to try and get the word out.
But ultimately, the thing that would help us most is both the vendors and people interested in purchasing this model just asking about results on this benchmark or sharing results on this benchmark.
As we become part of that conversation, in the same way that MLPerf has become an important part of the conversation around hardware, it just creates an expectation that everyone will do this kind of
measurement. And what we've observed with MLPerf driving hardware speed is like a 50x improvement
in speed over five years, which is crazy. And if we can see similar strides in safety as people
come together, expect this kind of measurement, and then use it to drive their own progress in a
way that lets one idea stack on top of the last idea,
I think we have that potential.
Now, Peter, I know that this is something that you very much intend to be a global benchmark.
And I know that ML Commons has done some work to ensure that's the case.
Can you talk about that?
Yeah, there's several different vectors we've done that on.
One is partnering.
We have, as I said, a strong partnership with AI Verify. We're talking to a couple of other organizations
globally. We really want to have this be a global effort from an organizational perspective.
We also just, frankly, have a very diverse global community. We have people from all around the
world participating in this effort. So those things help. But there's also things we have
to do in implementation to make this globally useful. One of the things that we've done is in terms of identifying our hazards, we have focused on what
we feel is a core taxonomy that most people around the world, no matter where they are,
how different their perspectives on things are, can get behind, right? No one is pro-child sexual
exploitation. I don't care where you are. So there's focus on a common core, number one.
Number two, figure out how to port that core to multiple languages in a comparable way.
So we're launching v1.0 in English, but we plan to launch French in February, and we're aiming for Chinese and Hindi in March.
And we want to do this in a way that scales.
Part of that scaling is through partnerships. Like we're doing Chinese very closely
in partnership with AI Verify
because Chinese is one of the languages of Singapore
and they have expertise that is super relevant.
The third thing we're doing is we're trying to figure out
if you wanted to have a regional extension,
think of it as a modular stack.
So there's a global core
and you might want a regional extension.
What might that look like?
And so we funded a pilot project with an organization called TATL in India, looking at what are the
sort of regional differences you might want to take into account in assessing AI safety.
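To picture the modular stack Peter sketches, a global core plus optional regional or language extensions, one hypothetical way to represent it is below. The field names and hazard strings are assumptions for illustration, not the structure AILuminate actually ships.

```python
from dataclasses import dataclass, field

@dataclass
class SuiteConfig:
    """Hypothetical 'global core plus regional extension' benchmark configuration."""
    language: str                       # e.g. "en", "fr", "zh", "hi"
    core_hazards: list                  # the shared taxonomy everyone can get behind
    regional_hazards: list = field(default_factory=list)  # optional regional extension

# The global core travels to every language; a regional pilot can layer
# extra hazards on top without forking the shared standard.
GLOBAL_CORE = ["child_sexual_exploitation", "violent_crimes", "hate", "privacy"]
en_v1 = SuiteConfig(language="en", core_hazards=GLOBAL_CORE)
hi_pilot = SuiteConfig(
    language="hi",
    core_hazards=GLOBAL_CORE,
    regional_hazards=["regionally_specific_hazard_placeholder"],
)
```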
So there's work with partners, develop a broad community, and then make sure that the implementation
can scale to global concerns. Now, I'm sure that the first time that you run a benchmark,
you really have no baseline
for knowing how things are going to perform.
What did you learn from this process
about the state of AI safety today?
And what are your hopes
in terms of what this benchmark inspires
in terms of the state of AI safety tomorrow?
Great question.
We built a benchmark as a theoretical design, and then we
ran it. And we literally saw results about a week ago, and there are things that we have learned from those results. One, we saw something that looks roughly like a bell curve centered around good, with the tails reaching into fair and very good. A lot of the industry is working on safety at a level, again, comparable to our safe-for-today economy car; there's a strong clustering.
Two, we gave grades overall for systems and then we gave grades for each hazard, and we were really surprised by how strongly those two correlated with each other,
that there is, I think, a general appreciation
of the set of safety issues you should be testing for
and that people are, judging by the results, trying to do work across that range.
Those were the two biggest things we saw.
We also saw that some smaller systems did quite well on the benchmark, and I think that bodes well for the future. It suggests that you don't need an enormous model to be safe, but rather that we as a field and an industry need to collectively advance our technology for safety, and that there is a very real possibility that we can do that.
So those, I think, were the biggest takeaways that I had.
I am just so impressed that ML Commons was able to pull this off, that we got a great number of
systems to actually go through the first benchmarking test. Because as a former marketer in a company, there's always a little bit of trepidation about putting something into
a benchmark at the first go around. You never know what might be revealed. And I love to see
that this is coming in at such an important time as enterprises are looking to adopt AI,
gen AI in particular, and in rapid fashion. So congratulations for the work.
My final question for you is,
where can folks find out about the results?
Take a look at all of the stuff that you talked about today
and get involved if they'd like to
in terms of future implementation
of this important benchmark.
Yeah, mlcommons.org is the home of the benchmark.
It will obviously be prominently featured
for the foreseeable future.
And we very much welcome folks
coming to look at the results,
learning about the benchmark
and getting involved in the working group.
There is so much to be done here
and we need folks.
And frankly, we need the support
of their organizations
to enable us to continue to do this work.
So together, I think we can get to a future
where all the systems pass that excellent goal we've set for ourselves. And if we manage to do that, I think that will be not only economically valuable, but societally valuable, if we can learn to build systems where the behavior is very predictable and very safe.
It will be interesting to see if enterprises will start integrating ML Commons requirements at very good or better for their RFPs; that would be a great inspiration for the industry to actually go and tune their
models for this important benchmark moving forward. Thank you so much, Peter. Thank you
for being on the program and sharing this exciting news. And once again, congratulations
for this important milestone. Thank you very much for having me and really appreciate the kind words. We're very excited.
Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by the Tech Arena. Thank you.