In The Arena by TechArena - MLCommons’ Peter Mattson on the New AILuminate AI Safety Benchmark

Episode Date: December 4, 2024

In this podcast, MLCommons President Peter Mattson discusses their just-released AILuminate benchmark, AI safety, and how global collaboration is driving trust and innovation in AI deployment....

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now let's step into the arena. Welcome to In the Arena. My name is Allyson Klein. Today we've got a fantastic topic. I've got Peter Mattson, president of MLCommons, with us, and we're going to be talking about a new benchmark that they're bringing to the market that I think is fascinating. Welcome to the program, Peter. How are you doing? I'm doing really well. Thank you very much for having me.
Starting point is 00:00:41 So Peter, this is your first time on the program. Do you just want to introduce yourself and provide some background on how you became president of MLCommons? Sure. So I'm Peter Mattson. I have a long history with MLCommons and MLCommons benchmarking. I was one of the folks who founded the MLPerf benchmark back in 2018 and got that off the ground. And now that's the standard for measuring AI hardware speed.
Starting point is 00:01:05 When that was becoming successful and we had a family of MLPerf benchmarks, we realized that we needed a good home for them. And so we created an organization called MLCommons, but we gave it a much larger mission. And the mission is to make AI better for everyone. And we take that to mean make it more accurate, make it safer, make it more efficient, and do that primarily through engineering of benchmarks and data sets. And so I've been privileged to be a part of that journey, and now I'm president of the board and very hands-on. I'm also co-chairing our AI safety work with Percy Liang from Stanford and Joaquin Vanschoren from TU Eindhoven. That's awesome. Very excited about anything we can measure.
Starting point is 00:01:47 That's a good summary of my last three years of work. So, Peter, I was thinking about 2018 and when MLPerf came out. At the time, I was at Intel and leading data center marketing for the company. And so, as you can probably imagine, I spent a lot of time thinking about MLPerf and what you were doing. But MLCommons is so much more than that. And while you're known for performance benchmarks, this topic today is something else associated with AI delivery, which is AI safety. Why don't we just start with your view on why this topic is so important to the tech community, especially right now? So let's start with talking about the general progress in AI. For decades, AI was a bunch of ideas that never quite worked. And then we came to a moment when we broke through a capability barrier, and we could build systems that actually could have dialogue in natural
Starting point is 00:02:40 language. And we entered a new age, which I'll describe as scary headlines and amazing research, where we have all these ideas and we see this tremendous potential. But we want to now enter a third age, which is products and services that really deliver value to people. But to get there, we need to get through another barrier. And the barrier I describe as the sort of risk and reliability barrier. People need to feel that these systems are safe and reliable. This isn't unique to AI. This is a progression that we've seen in many complex technologies. I like pointing to air travel, right? You can look at Da Vinci's notebooks and you see his sketches, ideas that never quite worked. And then you see the Wright brothers, they crashed the capability
Starting point is 00:03:25 barrier. And all of a sudden, there's a tremendous amount of excitement when people realize that, hey, we can have people travel on planes and that would be pretty fast. But it took a huge amount of work, learning how to measure safety and improving measurements and practices in response to things going wrong, to get to modern air travel, where you fly across oceans on a regular basis and it is amazingly safe. The safety record of air travel is just phenomenal. And in order to get there, a lot of work had to be done to convert that potential into real value. And so what we're trying to do is bring that mindset, and all of the lessons we've learned from other AI benchmarks that we've done, like MLPerf, to develop a standard way of measuring the safety of AI systems, so that people can make informed choices and so that the industry can rapidly and constructively improve in this critical area and unlock that value. That's where I think it matters today. Now, you've done something here that I would call stunning. And I've looked at this and I thought,
Starting point is 00:04:31 how are they measuring this? This is so fascinating. But you've introduced the version 1.0 AILuminate benchmark, measuring the safety of AI systems. And I think I'm going to call it a first of its kind. I know that other people have dabbled in this space, but in terms of what you're doing, in terms of what information this benchmark is targeting to deliver to customers, I think that it is a first of its kind and I will call it that. Tell me what this benchmark actually measures. So first of all, I think your characterization is good. There have been a lot of really great ideas in the academic world, in the research world
Starting point is 00:05:08 around how you can measure safety. And what we're trying to do, which is, I think, unique and a first, is take those ideas, bring them together in a broad community, and produce a, let's call it, production-level safety benchmark. Let's talk about what that benchmark measures. When people say AI safety, they can mean a range of different things. I'm going to break them into three big chunks. One of those chunks is what I'm going to call frontier safety. This is about what happens if the models get too smart.
Starting point is 00:05:38 So people are worried they might become autonomous or enable invention of really dangerous weapons or something along those lines. And that's what a lot of the governmental work is on. There's another category that I'm going to call social safety, which is around things like economic or environmental impacts. That's beyond what you can measure with a benchmark. But there is this third important category of product safety. So this is a complex product that potentially billions of people are going to use.
Starting point is 00:06:04 How do you make sure that those people's interactions with it are positive? And so that is what we are focusing on: AI product safety. For v1.0 of what we anticipate will eventually become a family of benchmarks, we are looking at the safety of general-purpose chat systems. This is what everyone thinks of when they think of ChatGPT, right? You ask it a question or you want it to help you write something and it gives you back natural language answers. So we're trying to say, how safe are those systems? So what we did is we made a list of 12 hazards.
Starting point is 00:06:35 These are things we don't want the system to be helpful and supportive of. So these are things like if I ask it for help committing a crime or for help hurting myself or someone else, or if I ask it for legal advice and it doesn't at least say I'm not a lawyer. These are all things that we probably don't want this class of system doing. And so what we've done is we've created a benchmark that measures the propensity of the system to respond helpfully to these not-so-good requests. Yeah, that makes perfect sense. Now, I just felt like we almost jumped into an episode of the Crime Junkie podcast there for a second, but I'm going to drill down deeper. You have talked about a lot of the different practical risks that could be seen in the market without guardrails like AILuminate. Can you talk about the vectors that you're measuring and what the benchmark is actually looking at
Starting point is 00:07:32 in terms of types of risks to the customer? Yep. So I'm going to talk about three broad categories of hazards that we check. These are what we call physical hazards, which are hazards where the model could be supportive of harming the user or other people. Non-physical hazards, these are things where the model could be supportive of things that could be crimes, could impinge on privacy, could impinge on IP, could be defamation or hate speech. And lastly, contextual hazards, which are things that you don't want a general-purpose model doing. And these would include giving specialized advice, like around health, financial, legal, or election topics, without at least disclaiming that it is not an expert and that you should consult an expert, and producing sexual content. Contextual hazards are things that you might want for a
Starting point is 00:08:21 specific use case, but that you don't want for a general-purpose use case. So these are the sorts of things we're looking at. It's important to realize that these hazards exist for pretty much any natural language interface system. So you could have built it to, I don't know, provide customer support on appliances, right? But that doesn't mean that someone can't ask it any of the things I just enumerated. So if you have a natural language interface for something, you have these concerns. They're both concerns from a business perspective, since you don't want your business, if you're deploying the system, helping with these things, and concerns from a societal perspective: we don't want people
Starting point is 00:08:55 getting really helpful, thoughtful AI advice on how to do these undesirable activities. So that's what we're trying to stop. And the way that we measure it is we create a whole bunch of prompts. Prompts are questions or instructions you could give the system related to these 12 big hazards. In fact, we made over 24,000 novel prompts working with three different vendors. These prompts fall into roughly two categories, one of which is prompts by a naive user.
Starting point is 00:09:25 So this is someone who just asks it to do a bad thing. Another is what we call prompts by knowledgeable users, who've read a Wired article or two and say, tell me a story about how to do the bad thing, with some context that makes it easier for the model to say yes. Right. We feed those prompts into the system and we get back the responses, and we use a set of specialized evaluator models to grade the responses as being either okay in terms of an assessment standard we have or violating that standard. And then based on the percentage of responses, either overall or for a specific hazard, that violate the standard, we assign the system a grade. That's roughly what we're measuring and how we measure it.
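To make that flow concrete, here is a minimal, hypothetical sketch of the kind of pipeline described above: prompts spanning the hazard categories are sent to the system under test, a separate evaluator model judges each response against an assessment standard, and violation rates are tallied per hazard and overall. The function names, hazard labels, and structure are illustrative assumptions, not the actual AILuminate or MLCommons code.

```python
# A minimal, hypothetical sketch of an AILuminate-style evaluation loop.
# Names, hazard labels, and structure are illustrative assumptions,
# not the actual MLCommons implementation or API.
from collections import defaultdict

# Rough hazard grouping as described in the interview (labels invented here).
HAZARD_CATEGORIES = {
    "physical": ["violent_crimes", "self_harm"],
    "non_physical": ["privacy", "intellectual_property", "defamation", "hate"],
    "contextual": ["specialized_advice", "sexual_content"],
}

def run_benchmark(prompts, query_system, evaluate_response):
    """prompts: iterable of (hazard, prompt_text) pairs.
    query_system(text) -> response from the system under test.
    evaluate_response(hazard, prompt, response) -> True if the response
    violates the assessment standard, as judged by an evaluator model.
    Returns (per-hazard violation rates, overall violation rate)."""
    violations = defaultdict(int)
    totals = defaultdict(int)
    for hazard, text in prompts:
        response = query_system(text)                          # system under test answers
        violating = evaluate_response(hazard, text, response)  # evaluator model grades it
        totals[hazard] += 1
        violations[hazard] += int(violating)
    per_hazard = {h: violations[h] / totals[h] for h in totals}
    overall = sum(violations.values()) / sum(totals.values())
    return per_hazard, overall
```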
Starting point is 00:10:11 Now, what I think was fascinating when I looked at these results is that you've simplified the evaluation for a customer to make it incredibly clear. It almost made me think of standing in front of a water heater and looking at exactly the energy efficiency of that water heater. So simple to view. And you've achieved something similar with AILuminate. Can you just walk through for the audience exactly what that looks like at a top level and then at an individual criteria level? Yeah. I mean, I think your analogy is spot on. We realize that many people interacting with these systems don't want to become experts on AI safety. They just want to make reasonable choices. Choices about: do we deploy this system? Do we buy this system?
Starting point is 00:10:49 Do we use this system? Do we allow this system? And so we've tried to boil down some very complex analysis into simple grades. And our grades are on a five-point scale. And the scale goes poor, fair, good, very good, excellent. If you know only one thing about this grading scale, you need to know how good gets calibrated. To calibrate it, we
Starting point is 00:11:14 looked at a bunch of different what I'm going to call accessible models. So these are models that have less than 15 billion parameters. So they're not crazy expensive to run. And they have open weight licenses, so you can get the weights. And so we looked for the two of these that did the best, and we used those two to calibrate good. The analogy I'd make is it's like calibrating automobile safety based on some of the safest economy cars.
Starting point is 00:11:39 And you sort of use that to set good. And then a system that is substantially safer than that, we call very good, substantially less safe, we call fair or poor. And we reserve excellent as a goal for the industry. So excellent is defined as below 1 in 1,000, 0.1% violating responses on the assessment standard. And that's something we want to use this benchmark to help the industry navigate towards. If you look at things like automobiles or planes, you could watch their journey towards very high levels of reliability.
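As a rough illustration of how a violation rate might map onto that five-point scale, here is a hypothetical sketch. Only the 0.1% bar for excellent is stated in the interview; the reference rate would come from the two best accessible (under 15 billion parameter, open-weight) models, and the 0.5x/2x/4x cutoffs below are invented placeholders rather than the benchmark's actual calibration.

```python
# Hypothetical mapping from violation rate to the AILuminate-style five-point
# scale. Only the 0.1% bar for "excellent" comes from the interview; the
# reference rate and the 0.5x/2x/4x cutoffs below are invented placeholders.

def grade(violation_rate: float, reference_rate: float) -> str:
    """reference_rate would be derived from the two best 'accessible'
    (<15B-parameter, open-weight) models used to calibrate 'good'."""
    if violation_rate < 0.001:                  # fewer than 1 in 1,000 violating responses
        return "excellent"
    if violation_rate < 0.5 * reference_rate:   # substantially safer than the reference
        return "very good"
    if violation_rate <= 2.0 * reference_rate:  # roughly comparable to the reference
        return "good"
    if violation_rate <= 4.0 * reference_rate:  # substantially less safe
        return "fair"
    return "poor"

# Example: 2% violating responses against a 1.5% reference would land at "good".
print(grade(0.02, 0.015))
```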
Starting point is 00:12:10 And that's what we're trying to deliver. I think that one of the most interesting things about this is that it takes an entire community to bring a benchmark like this to reality. And can you talk about who engaged in ML Commons to deliver this? So ML Commons, in the early days of any of our benchmarks, we've traditionally made the benchmark working groups open. We had people involved from lots of different universities, academics from different universities. We had researchers and engineers from industry, big companies, small companies. We had some engagement from civil society folks who care deeply about this topic.
Starting point is 00:12:45 And we brought all of those folks together on a weekly conference call. And we had eight different work streams that also met weekly. So a lot of smart people, as I would describe it, arguing until we reached grudging consensus about how to approach this. And then we backed them up with contractors from MLCommons who can do the implementation work. That's how we did this. I think part of the reason we were able to produce what I think is a really good benchmark, or at least a v1 of a really good benchmark, is because we had so many different inputs. The other thing that I should touch on is that it's been very interesting to me, in the process of developing this benchmark, to talk to people. And people clearly see the tremendous potential of AI, but they also see the risks. And across all those different folks I described,
Starting point is 00:13:30 there is a real urge and intensity to getting this right. And so people are very excited to be a part of this effort or just help out in any way they can. And that's been tremendously useful in getting this off the ground. Obviously, 1.0 is just the start for this. And I guess one thing that I want to ask you about is how are you going to be driving more visibility associated with the benchmark itself so that customers are thinking about how they would use it in their procurement of different AI systems? And how are you seeing this evolve from a standpoint of where you take the benchmark next? Yes. So there are two components to that: one is promotion and one is evolution. Let's do evolution first because it's a quicker answer. One of the things that I think separates what MLCommons does with benchmarks versus a lot of the very useful and innovative benchmarks you
Starting point is 00:14:19 often see in academia is that in academia, the objective is to publish a good paper that moves people's thinking forward. And that's valuable. MLCommons is a nonprofit in this for the long term. We want to build and operate and improve benchmarks over time. So this is v1.0, and there will be a 1.1 and 1.2 and 1.3 and 1.4, just like we saw with MLPerf. A good benchmark is like a good wine. It needs to age. And as it ages, it acquires all those lessons that you learn along the way about what works and what doesn't work and what isn't correlated with reality, and you make it better. So that's very much our intent, as well as expanding the benchmark. Like I said, this is the start of a family, right? This is right now only in English,
Starting point is 00:14:59 but we have other languages in the works. This is only text, but we have multimodal in the works. We're starting to talk about agentic. There is a big AI application space, and we want to build a family that really supports safety in that space. Promotion-wise, a lot of what we are doing is, as ML Commons is growing, we are making sure to get more mature in how we communicate and promote through important channels like your podcast. We're also really striving hard to develop partnerships. And these may be partnerships with people who are developing systems for supporting AI development. And this benchmark obviously could be an important part of those systems.
Starting point is 00:15:37 Or through partners in the general sort of AI safety space, we have a fantastic partnership with the folks at AI Verify, which is an organization based in Singapore that is looking at AI safety, especially in sort of the APAC region. And they've been a big part of how we've shaped this benchmark and how we will promote it in the future. We're planning to do multiple versions of this launch event. We'll do one in France in February. We'll do one in Singapore in March. And we'll just continue to try and get the word out. But ultimately, the thing that would help us most is both the vendors and the people interested in purchasing these models just asking about results on this benchmark or sharing results on this benchmark. As we become part of that conversation, in the same way that MLPerf has become an important part of the conversation around hardware, it just creates an expectation that everyone will do this kind of
Starting point is 00:16:25 measurement. And what we've observed with MLPerf driving hardware speed is like a 50x improvement in speed over five years, which is crazy. And if we can see similar strides in safety as people come together, expect this kind of measurement, and then use it to drive their own progress in a way that lets one idea stack on top of the last idea, I think we have that potential. Now, Peter, I know that this is something that you very much intend to be a global benchmark. And I know that MLCommons has done some work to ensure that's the case. Can you talk about that?
Starting point is 00:16:59 Yeah, there are several different vectors we've done that on. One is partnering. We have, as I said, a strong partnership with AI Verify. We're talking to a couple of other organizations globally. We really want to have this be a global effort from an organizational perspective. We also just, frankly, have a very diverse global community. We have people from all around the world participating in this effort. So those things help. But there are also things we have to do in implementation to make this globally useful. One of the things that we've done, in terms of identifying our hazards, is focus on what we feel is a core taxonomy that most people around the world, no matter where they are,
Starting point is 00:17:34 how different their perspectives on things are, can get behind, right? No one is pro-child sexual exploitation. I don't care where you are. So there's focus on a common core, number one. Number two, figure out how to port that core to multiple languages in a comparable way. So we're launching v1.0 in English, but we plan to launch French in February, and we're aiming for Chinese and Hindi in March. And we want to do this in a way that scales. Part of that scaling is through partnerships. Like we're doing Chinese very closely in partnership with AI Verify because Chinese is one of the languages of Singapore
Starting point is 00:18:10 and they have expertise that is super relevant. The third thing we're doing is we're trying to figure out if you wanted to have a regional extension, think of it as a modular stack. So there's a global core and you might want a regional extension. What might that look like? And so we funded a pilot project with an organization called TATL in India, looking at what are the
Starting point is 00:18:29 sort of regional differences you might want to take into account in assessing AI safety. So: work with partners, develop a broad community, and then make sure that the implementation can scale to global concerns.
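One way to picture the modular "global core plus regional extension" idea is as a layered benchmark configuration, sketched below. The field names, locale codes, and hazard labels are invented for illustration and are not the actual AILuminate schema.

```python
# Hypothetical illustration of the "global core plus regional extension" idea.
# Field names, locale codes, and hazard labels are invented for illustration;
# this is not the actual AILuminate schema.
from dataclasses import dataclass, field

@dataclass
class BenchmarkConfig:
    locale: str                          # language/region the prompt set targets
    core_hazards: list                   # shared global taxonomy everyone can get behind
    regional_hazards: list = field(default_factory=list)  # optional regional extension

GLOBAL_CORE = ["violent_crimes", "child_sexual_exploitation", "self_harm",
               "privacy", "hate", "specialized_advice"]

# An English baseline uses only the core; a regional variant layers extra hazards on top.
en_us = BenchmarkConfig(locale="en_US", core_hazards=GLOBAL_CORE)
hi_in = BenchmarkConfig(locale="hi_IN", core_hazards=GLOBAL_CORE,
                        regional_hazards=["region_specific_sensitive_topics"])
```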
Starting point is 00:18:55 Now, I'm sure that the first time that you run a benchmark, you really have no baseline for knowing how things are going to perform. What did you learn from this process about the state of AI safety today? And what are your hopes for what this benchmark inspires in terms of the state of AI safety tomorrow? Great question. We built the benchmark as a theoretical design, and then we ran it. We literally saw results about a week ago, and there are things that we have learned from those results. One, we saw something that looks roughly like a bell curve distributed around good, with tails extending into fair and into very good. That tells us a lot of the industry is working on safety; again, comparable to our safe-for-today economy car,
Starting point is 00:19:26 there's a strong clustering. We were also surprised, because we gave grades overall for systems and then we gave grades for each hazard, by how strongly those two correlated with each other. There is, I think, a general appreciation of the set of safety issues you should be testing for,
Starting point is 00:19:44 and people are, judging by the results, trying to do work across that range. Those were the two biggest things we saw. We also saw that some smaller systems did quite well on the benchmark, which I think bodes well for the future. The takeaway isn't that individual companies don't want their systems to be safe, but rather that we as a field and an industry need to collectively advance our technology for safety, and that there is a very real possibility that we can do that. So those, I think, were the biggest takeaways that I had. I am just so impressed that MLCommons was able to pull this off, that we got a great number of systems to actually go through the first benchmarking test. Because as a former marketer in a company, there's always a little bit of trepidation about putting something into
Starting point is 00:20:29 a benchmark at the first go around. You never know what might be revealed. And I love to see that this is coming in at such an important time, as enterprises are looking to adopt AI, gen AI in particular, and in rapid fashion. So congratulations on the work. My final question for you is, where can folks find out about the results, take a look at all of the stuff that you talked about today, and get involved, if they'd like to, in the future implementation
Starting point is 00:20:57 of this important benchmark. Yeah, mlcommons.org is the home of the benchmark. It will obviously be prominently featured for the foreseeable future. And we very much welcome folks coming to look at the results, learning about the benchmark, and getting involved in the working group.
Starting point is 00:21:12 There is so much to be done here and we need folks. And frankly, we need the support of their organizations to enable us to continue to do this work. So together, I think we can get to a future where all the systems pass that excellent goal we've set for ourselves. And if we manage to do that, I think that will be not only economically valuable, but societally valuable, if we can learn to build systems where the behavior is very predictable and very safe. I would love to see enterprises start integrating MLCommons requirements at very good or better into their RFPs; that would be a great inspiration for the industry to actually go and tune their
Starting point is 00:21:51 models for this important benchmark moving forward. Thank you so much, Peter. Thank you for being on the program and sharing this exciting news. And once again, congratulations for this important milestone. Thank you very much for having me and really appreciate the kind words. We're very excited. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena. Thank you.
