CyberWire Daily - Navigating AI Safety and Security Challenges with Yonatan Zunger [The BlueHat Podcast]
Episode Date: December 30, 2024
While we are on our winter publishing break, please enjoy an episode of our N2K CyberWire network show, The BlueHat Podcast by Microsoft and MSRC. See you in 2025!
Yonatan Zunger, CVP of AI Safety & Security at Microsoft, joins Nic Fillingham and Wendy Zenone on this week's episode of The BlueHat Podcast. Yonatan explains the distinction between generative and predictive AI, noting that while predictive AI excels in classification and recommendation, generative AI focuses on summarizing and role-playing. He highlights how generative AI's ability to process natural language and role-play has vast potential, though its applications are still emerging. He contrasts this with predictive AI's strength in handling large datasets for specific tasks. Yonatan emphasizes the importance of ethical considerations in AI development, stressing the need for continuous safety engineering and diverse perspectives to anticipate and mitigate potential failures. He provides examples of AI's positive and negative uses, illustrating the importance of designing systems that account for various scenarios and potential misuses.
In This Episode You Will Learn:
How predictive AI anticipates outcomes based on historical data
The difficulties and strategies involved in making AI systems safe and secure from misuse
How role-playing exercises help developers understand the behavior of AI systems
Some Questions We Ask:
What distinguishes predictive AI from generative AI?
Can generative AI be used to improve decision-making processes?
What is the role of unit testing and test cases in policy and AI system development?
Resources:
View Yonatan Zunger on LinkedIn
View Wendy Zenone on LinkedIn
View Nic Fillingham on LinkedIn
Related Microsoft Podcasts:
Microsoft Threat Intelligence Podcast
Afternoon Cyber Tea with Ann Johnson
Uncovering Hidden Risks
Discover and follow other Microsoft podcasts at microsoft.com/podcasts
Learn more about your ad choices. Visit megaphone.fm/adchoices
Transcript
Since 2005, Blue Hat has been where
the security research community and Microsoft come together as peers.
To debate and discuss, share and challenge, celebrate and learn.
On the Blue Hat Podcast, join me, Nic Fillingham.
And me, Wendy Zenone, for conversations with researchers,
responders, and industry leaders,
both inside and outside of Microsoft.
Working to secure the planet's technology and create a safer world for all. And now, on with the Blue Hat Podcast. Welcome to the Blue Hat Podcast. Today, we have
Yonatan Zunger, and we are thrilled to have you here. Yonatan, would you introduce yourself? Tell
us who you are. What do you do? Well, hi. Thank you so much for having me on the show. So, my
name is Yonatan Zunger. I'm currently CVP of AI Safety and Security at Microsoft,
as well as Deputy CISO for AI.
And my job is to try to think of all of the things
that could possibly go wrong involving AI
and figure out how we're going to try to prevent them from happening.
I think that's sort of the short version of it.
I came to this from a career originally as a theoretical physicist.
Went over, moved over into CS sort of full-time back in the early zeros,
where I started out building heavy infra.
I built a lot of the core part of search at Google,
a lot of planet-scale storage, things like that.
And then in 2011, I became CTO of Social.
And this was just at the time that Google Plus was about to launch.
This was also the time that GDPR was being drafted.
And within three weeks of taking that job,
it suddenly became very clear that the hard part of this job wasn't going to be software infrastructure.
It was going to be people's safety.
It was security, privacy, abuse, harassment, policy, all of these things.
And I discovered that I genuinely loved that.
I fell in love with the field of trying to really solve these problems.
And that's been, I would say, one of my biggest foci professionally ever since.
And so now I'm really excited that I'm getting to work on
one of the craziest, hardest problems,
like even by the standards of a pretty strange career.
One of the strangest and hardest things I've ever worked on.
And yeah, that's what I'm doing now.
I love it.
I love all the nuances and the human side of what you're doing.
If you could let the audience know, for some that are still learning about the AI field,
what is generative AI?
Well, yeah, that's a really good question. Because we've had AI of various sorts for a
very long time. And generative AI has also existed for a long time, but it only became a really
big deal in AI a little
more than a year and a half ago or so. The way to think about it, the traditional kind of AI,
I'm referring to it nowadays as predictive AI, what does your world look like in the world of
this traditional AI? Typically, if I want to use a model for something, I'm going to build a model.
The model user and the model builder are the same person, and you take a bunch of examples and you train a model. What are these models generally good at? They're
good at looking at a really large field of data and making a prediction or a classification or
a recommendation or something like that. You know, they're good at looking at these very,
very large spaces and analyzing. And of course, the problems you're dealing with now, because
you're both model builder and model user, is you now really have to worry about, is my model biased? Did I pick the right training data? Does
this thing have really weird nuanced failure modes? And then you have to think about all the
safety aspects of your integrated system, right? Am I using it wisely, etc, etc. Generative AI is a
bit of a different world. At the very deep technical level, it's the same basic approaches,
you know, we have neural networks, all these structures.
But in practice, it's often better to just think about it as a completely different technology
from a practical basis.
The idea of generative AI, at the very technical level, of course,
what you're doing is you're predicting character sequences, token sequences, images, things like that.
In practice, the way I would think about it is you've got a model. And first thing
to realize, in most cases, you've got a generic model, right? It's a model where one person
trains it and you're going to use the same model for a huge range of applications. So the model
trainer and the model user are now two completely different people. And what are these things good
at? Basically, there are two things that generative AI is good at. One of them is, it's
good at summarizing or analyzing a piece of human-type content, so natural language or an
image or something like that, right? It's very good at saying, like, here's a paragraph of text,
give me a summary, extract the key ideas, something like that. And the other thing it's really good at
is role-playing a character. And this is, this is the foundation of most of what we do with
generative AI is basically a lot of creative use of role-playing. So you tell it, you're a customer
service agent for Wombat Co. and you're about to be asked a question by a customer and you know how
to search through the following databases of information, etc. Or you say, you're a programmer,
you're a Python programmer,
and you've been asked for your advice on this piece of code,
and you need to write a function to do something.
You're a security expert,
and you need to help analyze this set of forensic logs,
something like that.
So this sort of creative use of role-playing
is one of those fundamental engines to it.
So I guess the way I would say is that what is generative AI really?
At the innermost loop, it is a combination analysis and role-playing engine,
which you can then build up to build all sorts of cool things out of.
Yonatan, this might be too large a question,
but what I wanted to ask was,
it almost sounded like you described the entire breadth of AI.
We were talking about just generative AI, but so what's beyond that in terms of, so
you talked about role playing and you talked about sort of the ability to synthesize or
summarize data.
I'm obviously paraphrasing heavily.
What else does AI do that's not generative AI?
Again, probably a very large question, but how do we sort of think about these different
roles and functions that AI can take?
Well, that's the predictive AI I was talking about.
The generative AI is what does the analysis and the role playing.
The predictive AI is the stuff that does the classification and the recommendation and
the analysis of that sort.
I'd actually give a really good analogy from the human brain.
If you think about how the vision system works, right? So the human vision system is a stack where the very top, the very first input to the stack
is the retinal neurons, right? So you have the direct things that are measuring brightness or
color or something like that. And you have several stages in this stack, which then go from
pixels, quote unquote pixels, to very small curves, to large curves
and shapes, to two-dimensional shape recognition, to three-dimensional object recognition, and
so on.
Now, this is very similar to how predictive AI works, right?
You have a model that is just scanning a tremendously large range of things and pulling a small
number of features out of it.
Once you've pulled out those first things, though, your next layers of the stack is saying,
oh, I see a three-dimensional shape. Wait a moment, I recognize that shape. That's
Wendy's face. Then the thing that's starting to go off and identify
what's happening around you, the sort of things you can articulate in words like,
holy crap, there's a tiger and it's about to jump on me. That kind of higher-level
processing is, in a lot of ways, more similar to what generative AI is doing. It's narrative. It's literally, you can
think of it most usefully as processing that happens either directly in words at the highest
level or in sort of almost word-like concepts at the level below it. And the reason I bring up this
analogy is because it highlights the way in which the two kinds of AI actually complement each other,
right? These are not replacements at all.
What happens is that this predictive AI,
you can really think of it as AI that specializes in looking at
very large fields of data, and where the model tends to be
very specific to the problem being solved.
So the vision centers of your brain, if you tried to plug them into your ears,
they would not work correctly. That's not what they're there for.
Whereas these higher-level abstractions, the generative AI is really good for dealing with the higher level
abstractions of the things that you can narrativize, the things that you can turn into words.
So in a good healthy environment, what you're generally doing is that you're using these
predictive AIs to scan very large fields of data and reduce a mass of pixels into a statement of,
oh, here's a picture of somebody's face. And then you're taking that information and you're handing it to the generative layer, to the narrative layer,
the one that speaks in words and so on. And it then starts to assemble these, reason about them,
talk about them, have these very generic kinds of conversation about them.
So that's sort of the difference. That's the secret.
That is a wonderful analogy. Thank you so much. I do want a quick pause. As an Australian,
I noticed that you used wombat as your example there.
Why wombats?
Is there a story there?
Why not wombats?
No, I love it.
Why not wombats?
We put that on the sticker.
So I want to sort of like help people sort of still continue to wrap their head around
generative versus predictive and other forms of AI.
Can you give us some examples of positive uses
of this or to juxtapose against negative use cases here? How do we think about the good and the bad,
and I'm using air quotes, of this technology? Well, the good and the bad is very much in
how you use it. What are some examples of good uses? I mean, there's so many of them, honestly. You know, I'll just give some random ones that pop into my head.
Dynamic temperature control for factories and data centers.
I remember that was an example that came up a decade or more ago,
but it turns out that you can have a system that stares at all of the temperature sensors across the building
and controls whether to open windows or not and how to run fans and so on,
and you can make a building spectacularly more energy efficient by doing that.
Self-driving cars, when they're not designed by maniacs,
this is a technology that can save an awful lot of lives.
I mean, social media is always kind of a complicated mixed bag,
but if you think about this idea of helping people meet other people,
the actual driving purpose of this,
and I think we often forget, given how many problems have emerged in social media, we tend to forget just how much good this has
actually done in people's lives. How many people have formed and maintained their friendships,
their jobs, their entire professions sometimes, their romantic relationships. There's so many
things that people have formed through this. And if you think about this, this is really about, a lot of this is the use of algorithms
to try to help you find who are the people you might want to actually be with, who are
the people you might want to be around.
With the generative AI, you know, it's still a very new technology.
So I think we haven't yet seen the killer app of generative AI.
I think, you know, we're in a stage where I think the single most important piece of
office software of the 2030s, the category has not been invented yet.
We're really at that new stage.
The thing that is going to be the equivalent of what the spreadsheet was for the personal computer or what direct messaging was for mobile phones, we haven't even invented it.
We don't even know what that thing is yet.
Right now, we're sort of seeing these very early examples with generative AI. I think we're finding that it's really good as an interlocutor and
brainstorming partner, for example. I think there's a lot of very interesting potential there.
It's also something that you can combine with a lot of more traditional techniques. Like for example,
one of the classical challenges, one of the things that you really couldn't do with traditional predictive AI, is that it's not really good at understanding language.
You know, understanding natural language is actually a very, very difficult problem.
It's what we used to call an AI-complete problem.
In fact, it turns out even pronoun resolution,
that is, knowing what a pronoun refers to in a sentence, is AI-complete in the sense that it actually requires a full model of the world and a theory of mind in order to do.
There's sort of a classic example.
I think this example might be due to Steven Pinker.
I can't remember for sure.
But here's a sample dialogue for you.
Woman: I'm leaving you.
Man: Who is he?
Now, I'll bet you probably had no trouble understanding those two sentences in that dialogue.
I mean, you could probably tell me exactly who the he in that second sentence refers to.
Yeah.
Now, explain to me who that he was, without a complete theory of mind of both of the people, and of what each of them is thinking about what the other one is thinking, and so on, of two characters that I've literally identified as just man and woman.
Already you had to solve that complicated a problem. So one of the traditional challenges
we've had in all sorts of data science is that understanding human language is really, really,
really hard. And what's one of the genuinely stunning things about the recent revolution in generative AI,
the one that's happened in the past year and a half or so,
has been that we finally have software
that's capable of just looking at a piece of human text
and actually understanding it
and extracting information from it.
So if I ask it to resolve that
and to then transform that into a structured form, I can.
Which means that I can potentially apply
this sort of analysis at scale to large amounts
of human data and
interact with human
information in entirely novel ways.
And of course interact with people directly.
So I think there's tremendous
possibility for some really
wonderful things to happen here.
Another suggestion that I've heard a lot talked about is
personalized education as a service.
Imagine I can say, I want to learn about X, and this thing will help put together a syllabus. It'll do like all of the
research it needs in order to find all of the right information, figuring out how best to teach
it. And then it can teach me interactively, right? Because it's not just going to create a PDF or
PowerPoint presentation. It can actually go back and forth and work with me and teach me all of the things
I need to know. Imagine what this could do for the world, give everyone access to a teacher.
So there's tremendous possibility for good. And of course, there's tremendous possibility for bad,
because there is no single technology humans can come up with that can't be horrifyingly misused.
Just to take a simple example, we were just talking about education. That's entirely wonderful until what the person wants to learn is how to weaponize anthrax,
or how to kill people, how to encourage a genocide.
I mean, there's so many things that people might want to learn that are really horrible.
And this is the point where we start to really run into deep nuances and we sort of have to ask ourselves, well, these problems already exist in the world, how do we prevent them?
Right, there are people in this world who do know how to weaponize anthrax, but I can promise you that if you went to one of them and asked them, hey, would you teach me how to do that?
They would say no, they would probably not do that.
But their judgment about when and how they're willing to do that is an interesting nuance. It's something that we need to figure out a way to capture and express and formalize.
There's a lot of other very simple ways you can misuse AI in basically any way you can
imagine misusing any technology.
One of my favorite examples, and "favorite" might be the wrong word for this, but one of my classic examples of a misuse is that there has been a whole business of using artificial intelligence to help make sentencing recommendations in criminal law.
And ProPublica had an expose of this back in 2016, which I think is really worth reading.
This works as badly as you might imagine.
So, for example, if you look at these companies, they were very careful not to
take race as an input signal, right? Because that would be horrible. You shouldn't use that.
But they did take income and sort of your quantized address, like the neighborhood you lived in. And
if you know anything at all about how American politics works, your income and the neighborhood
you live in is a really good proxy for race in most of this country.
And then you end up with sort of a proxy signal problem.
The theory behind it was, well, we're going to predict who is most likely to commit another crime,
and we will recommend harsher sentences for people more likely to commit further crimes.
There are a couple of obvious problems with this one.
First of all, the sentence you give someone does affect their probability of committing a crime in the future, right? If you make sure that someone can't get any sort of non-criminal job in the future,
they're probably going to be criminals. But the even deeper one, the big problem that really
kills this is when someone commits a crime, it's not like a giant light bulb goes off over their
head saying, attention, this person has committed a crime.
You can't actually measure the variable you care about, so they picked a proxy variable.
They measured whether someone was charged with a crime.
And the thing is, the difference between committing a crime and being arrested for a crime and
being charged for a crime, that is not a uniform translation matrix.
If you ask the question, who is more likely to be charged with a crime?
The answer is the black person.
What they basically built was a system to predict race.
They built a system that was shaped exactly to capture and measure the nature of institutional racism in the United States, and then implemented that as sentencing guidelines.
And this is a system that proceeded to go off and destroy a
bunch of lives. I think this is a really good example of how not to use AI, of really, really
dangerous, ill-conceived decisions. And in this one, the obvious thing that people didn't think
about was the basic question of, what happens when this thing makes a mistake? This is the basic
question you need to ask with any piece of software that you're building, any machine you're building. What happens if something goes wrong?
And in this case, something very bad happens, especially because they set it up in a way where
basically its recommendations were almost automatically accepted.
So you have to really architect your systems around the possibility of failure. And especially for things like AI, where the system is inherently non-deterministic,
where all of its categorizations or predictions or outputs are always going to be
probabilistic, you have to be very, very careful and make sure
that your system is robust. Your system, the integrated system, including all of the people who are using it and the people who will interact with it, is robust against the system being wrong. Wow. Gosh, so many directions we could go here.
Yeah. My first question is jumping off from that example. It's too simplistic to ask,
where did they go wrong? I think I want to ask more about sort of ethics. When you design a
system that has these kind of potentials for these kind of sort of significant outcomes, do you hard code in a bunch of ethical rules or do you give the system the ability to
monitor the outcomes to then sort of adjust those sort of ethical guidelines that it's
functioning on? Or do you still need in 2024 human beings with their own ethical guide to be able to monitor and control
it or something else? How do ethics as a guiding force, but then also a component in an AI system
play into this? That's a really wonderful question. And I think there is no single answer to it.
I think the correct answer to that question is very much dependent on the exact system that
you're building. The way I would frame the approach to this, and this is I think one of the most basic lessons
that I always try to teach people, there are two parts to engineering. Product engineering is the
study of how your systems will work, and safety engineering is the study of how your systems will
fail. You can't do just one or the other. And I think one of the great curses of modern computer
science, the way that the field is working, which we desperately, actively, urgently need to fix, is that these are
treated as two separate things rather than part of the same discipline. And if you go talk to civil
engineers, you will see a very different story. Civil engineers are safety engineers who
occasionally build bridges. It's a very different culture, and I think a much healthier one.
What does a safety engineering culture mean when you are working with AI or something?
Well, in fact, it turns out AI and social media, I think, are very similar.
Also, search, gaming, any software that really intimately involves humans and AI,
you get very similar problems, and you need a similar approach.
So what do you do?
Very, very first thing, from the moment you even start to conceive of,
hey, I've got a crazy idea,
what if we did dot, dot, dot?
At the same time that you're thinking about what it could do,
you're also thinking about what might go wrong in this situation.
I have a whole set of things that I try to teach people
about how to think of ways that things can go wrong,
and we're actually working with my team right now on writing up training materials and creating things to help people learn how to do this.
But the very, very first thing you're doing
is you're coming up with a list of things that could fail,
a list of ways in which this thing could go badly.
Your basic approach to this, by the way,
that's a three-pass approach.
First, you go system first.
You look at each component of the system
and ask, what happens if this thing fails?
That might mean, what happens if it makes an error?
What happens if it gets a malformed input?
What happens if it gets an actively malicious input?
What happens if it gets an unexpected input?
Just all of the ways in which some component could go wrong.
Your second way of looking at it is attacker first.
What if someone is trying to misuse this system?
What if someone is trying to use this system,
let's not even say maliciously, but for a purpose other than the one you intended?
What might they be trying to accomplish? How might they use your system in order to accomplish that?
And your third pass is the target first pass. That's where you're looking at it as,
who are the people who might be affected by this system? What aspects of their lives might
cause them to be affected by this system differently from other people.
And what are their particular vulnerabilities in their lives
that might be around?
So one of the things we're working on here also is checklists of
sort of ideas to help people think of different possibilities here.
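To make those three passes concrete, here is a minimal sketch of how a team might record its threat scenarios as structured data; the field names and example entries are illustrative assumptions, not anything from the episode.

```python
# Hypothetical sketch: capturing the three threat-modeling passes described here
# (system-first, attacker-first, target-first) as structured records.
from dataclasses import dataclass, field

@dataclass
class ThreatScenario:
    name: str
    discovery_pass: str        # "system", "attacker", or "target"
    description: str
    intervention_points: list = field(default_factory=list)
    residual_plan: str = ""    # what you do *when* it happens anyway

threats = [
    ThreatScenario(
        name="model returns a wrong answer",
        discovery_pass="system",
        description="Component failure: malformed, malicious, or unexpected input.",
        intervention_points=["input validation", "grounding check", "human review"],
        residual_plan="user-facing correction flow plus incident logging",
    ),
    ThreatScenario(
        name="feature repurposed to harass a user",
        discovery_pass="attacker",
        description="Someone uses the system for a purpose other than the one intended.",
        intervention_points=["rate limits", "abuse reporting", "account review"],
        residual_plan="fast dismissal UX plus trust-and-safety queue",
    ),
]

# The review loop is continuous, not one-off: every scenario needs a plan.
for t in threats:
    assert t.intervention_points and t.residual_plan, f"no plan for: {t.name}"
```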
I'll also say, this is the place where I always say that diversity, equity, and inclusion
make such a big difference in your ability to actually correctly do your job.
Because the one thing I can promise you is that you cannot think of what every possible attacker
or affected person might be experiencing in their lives. They are very different from you
because you are one person. There are a lot of people in the world and they're very different
from each other. They have very different lived experiences.
And having a broad team, a team with a really wide range of lived experiences,
and a team that's empowered to speak up about those things,
is critical to actually being able to do this analysis correctly.
So your very first step, the very beginning of all of this,
is you think about what could go wrong.
Now, you've done that. You've got your list of threats. You bounce this
off like a bunch of people. By the way, this is not a one-off process. This is the process you're
going to be continually doing every single day of your life from the day you first conceive of the
project until the day it gets shut down for the last time. You're thinking about what can go wrong,
and then you're thinking, well, okay, for each one of these things, I need to have a plan.
And your plan might involve mitigation,
like preventing it from going wrong or making it less serious.
And there's always going to be some aspect of it that you can't mitigate.
There are problems in this world where they look at this and say,
oh, I'm going to change the design of my system so that this thing is impossible.
That's wonderful.
When you can do that, that's your best choice.
And by the way, this is also why it's very important to do this sort of analysis from day one,
because often you can make a small change in the design of your product, and just in the basic shape of it, it eliminates whole swaths of potential problems, while leaving the core product function that you care about intact.
And that's often a really easy thing to do in the early design phase, and is almost impossible to do after you've built your entire system. So don't wait for that. I have seen projects get
two weeks from launch and then someone
points out a basic problem with this and surprise,
you have to go all the way back to
architecture. See you again in six months.
Don't do that. That's awful.
It's a terrible experience for everybody.
So how do you
sort of do this next step? Sorry, I'm going off
into the complete spiel of how you do safety
engineering. Love it. Do it.
Please, please.
This is wonderful.
Yeah.
So what do you do next?
The next thing you do
is for each one of these
threat scenarios,
you walk through the way
that the threat scenario
actually happens.
You walk through the exact,
what are the sequence of events
that have to happen
for this to go wrong?
And the reason you do that
is you start to highlight
possible intervention points.
Where are things
that you could do
that would prevent
that step from happening
or would change the outcome of that step?
Once you've done that for each of your threat scenarios,
then you compare those intervention points
across all the threat scenarios.
And then what you'll often discover
is there's a few intervention points
that actually help you with a lot of different threats.
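As a rough illustration of that comparison step, a sketch like the following (with entirely made-up scenarios) counts how many threat scenarios each candidate intervention point covers:

```python
# Hypothetical sketch of the comparison step: walk each threat scenario's chain of
# events, collect candidate intervention points, then see which points recur
# across many scenarios. The scenario data is invented for illustration.
from collections import Counter

threat_interventions = {
    "prompt injection via pasted document": ["input sanitization", "metacognitive review", "output filtering"],
    "model fabricates a citation":          ["grounding check", "metacognitive review", "human review"],
    "abusive reply reaches a user":         ["output filtering", "human review", "fast dismissal UX"],
}

coverage = Counter()
for points in threat_interventions.values():
    coverage.update(points)

# Intervention points that mitigate the most threats float to the top:
for point, count in coverage.most_common():
    print(f"{point}: covers {count} threat scenario(s)")
```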
And that's the point where you can start thinking about mitigations.
How might you change your system,
harden it, make it more robust to make those things less likely?
You keep sort of doing this in a loop until you now have a hardened system.
But at every stage of this, you've got sort of a residual threat.
You have events that could punch through all of those defenses and still happen.
And so your last stage is always, when this happens, not if this happens, when this happens, what are you going to do about it?
And that's the, how will you know that something has happened? How will you respond to it?
You know, for example, with a lot of user-facing software, this is the point where you start really thinking a lot about the user experience, by the way. You cannot treat UX as being distinct
from any other aspect of your system. One example of this, well, let's talk about abuse on social
media, right? So, turns out there's a lot of harassment and abuse on social media.
That is one of the primary things people use social media for.
Sadly.
Now, you've got various things to try to prevent it, but that's going to get through.
It's going to happen a lot.
So now you say, okay, let's say I'm running a social network,
and someone can create posts, and they can get comments on it,
and the comments can be really terrible in various ways.
Now, the objective of the user when they encounter one of these comments,
they're going to be very upset right now.
First thing, they want to get rid of this thing,
and they want to make sure that this person goes away and never comes back.
That is their objective.
Now, you've actually got a bit of a tension here, because the goal of the system operator is not just to get that detection
but to get enough information to figure out like did this violate policies? Is this thing a signal
of a broader problem? Is the user who made this comment a serious problem that we need to be
kicking off the service? Or conversely, is this like something entirely personal
between these two people that has nothing to do with this?
Also, I mean, abuse reporting is often actually done as an attack vector.
People will mass abuse flag people that they don't like,
not because those people are being abusive,
but just as a way to try to get them kicked off the service.
So in fact, false reporting is a big issue.
So the system operators really want to collect as much information and context as they possibly can about an abuse incident so they can make a good decision.
But these two goals are in tension.
Right?
So, because, in fact, one really useful way to think about this, if you think about emotional activation curves, they tend to spike very rapidly.
You're looking at a timescale of between 500 and 2,000 milliseconds,
typically, to see an emotional activation curve rise.
They decay.
People calm down much, much, much more slowly.
That is a timescale of minutes, typically, for a small activation like that. The decay is minutes, actually; minutes, not months.
But, if the event keeps happening,
you keep moving up.
Imagine sort of a curve where you can either,
every time a bad incident happens,
you add an exponentially rising curve.
And whenever anything isn't happening,
the thing decays with a very long time constant.
So you can keep going up, up, up, up, up.
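A toy numerical sketch of that picture, with illustrative time constants rather than anything quoted in the episode, might look like this:

```python
# Toy model of the activation/decay picture described here: each upsetting event adds
# a spike that rises on a roughly one-second timescale and decays over minutes, so
# repeated exposure ratchets the level upward. All constants are assumptions.
import math

RISE_TAU_S = 1.0        # activation rises in roughly 500-2000 ms
DECAY_TAU_S = 180.0     # calming down takes minutes

def activation(event_times_s, t_s):
    """Sum of spikes: fast saturating rise after each event, slow exponential decay."""
    level = 0.0
    for e in event_times_s:
        dt = t_s - e
        if dt >= 0:
            level += (1 - math.exp(-dt / RISE_TAU_S)) * math.exp(-dt / DECAY_TAU_S)
    return level

# One incident, quickly dismissed, versus re-reading the same comment every 10 seconds:
dismissed = [0.0]
re_exposed = [0.0, 10.0, 20.0, 30.0, 40.0]
for t in (5, 30, 60, 120):
    print(t, round(activation(dismissed, t), 2), round(activation(re_exposed, t), 2))
```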
So user has seen this upsetting comment.
They get their first spike.
If there is a big red button they can hit to make that thing go away,
then they can go right back into decay mode almost instantly. If there isn't, then for every time they look back at that comment, you're going to be getting another spike, and it's just going to
keep going up, up, up, up, up. So what's actually really important is that the user needs to be able to dismiss that thing on a timescale of seconds.
What you really want is the time from them experiencing it to the time they're done with
that problem to be five seconds or less, I would say is sort of a good rule of thumb.
That means that the report abuse button, if report abuse makes you go through this whole
like abuse reporting flow where you have to now declare which category of abuse is it and etc etc etc, you're getting good data for your team but you're actually not
achieving the core user need of getting into a safe state quickly. So the correct design of this
kind of system becomes very very subtle and nuanced. And this is actually sort of the core.
Now going back to where we started, you had a threat scenario of users experiencing abuse on the platform.
You need to think of intervention detection response
in a way that solves the user's problem first.
And then separately, the question of how do you now get the signals
that let you do a more detailed analysis?
Because now what you've seen is, okay,
this user flagged this comment as being problematic.
Most likely, that's all the information you've got.
If you now want to look for broader patterns, you now have to actually think like a data scientist.
You have to think, how do I analyze this situation to figure out, is there a larger pattern I need to care about?
And then there's all sorts of things you can do in order to do this.
So, for example, here's one simple rule.
Let's say that you have one user that there's a real pattern that every time they interact with someone
that they don't have a pre-existing relationship with,
the probability that that person is going to report their comment as abusive
is unusually high.
That's a really good sign that you are dealing with an asshole.
That's a real...
And there's a lot of things like this.
Actually, one of the most important rules
in abuse detection is
sometimes you're looking at reports of things.
And let's say that you're dealing with...
Well, if I'm dealing with comments
on someone else's post,
then the post owner should just have the right
to remove anything they want to remove, period.
But if I'm looking at posts,
like sort of top-level posts
or things in the general forum, the criteria for removal is probably going to be a product level
criterion. Now, one thing that we have learned is that bad actors in social media are really,
really good at figuring out the exact line of what they can get away with and skirting really,
really close to it. So they will always figure out some way to be just working around the rules
so that each individual post
never quite gets removed.
A really important rule
when you're doing abuse detection
is if you don't remove something,
nonetheless log that this thing
was like close to the edge.
Because one pattern you will notice
is that, hey,
none of this user stuff got removed,
but wow, do they have a lot of stuff
close to the edge.
And that is one of your biggest red flags
for an account. That account, you kick off the system.
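A hypothetical sketch of those two signals, the stranger-report rate and the pile-up of close-to-the-edge content, could look like the following; the thresholds and field names are assumptions for illustration only:

```python
# Hypothetical sketch of the two signals described here: an unusually high rate of
# abuse reports from people the account has no prior relationship with, and a lot of
# content that was logged as borderline even though it was never removed.
from dataclasses import dataclass

@dataclass
class AccountStats:
    stranger_interactions: int
    stranger_reports: int          # reports filed by people with no prior relationship
    posts_reviewed: int
    posts_near_policy_edge: int    # logged as close to the edge but not removed

def flag_for_review(a: AccountStats,
                    report_rate_threshold: float = 0.05,
                    edge_rate_threshold: float = 0.30) -> bool:
    report_rate = a.stranger_reports / max(a.stranger_interactions, 1)
    edge_rate = a.posts_near_policy_edge / max(a.posts_reviewed, 1)
    return report_rate > report_rate_threshold or edge_rate > edge_rate_threshold

print(flag_for_review(AccountStats(stranger_interactions=200, stranger_reports=25,
                                   posts_reviewed=50, posts_near_policy_edge=20)))  # True
```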
So it's this kind of thing. And so now, okay, let's go from the specific back
to the general. How do you approach this? What you
were doing is, you had a threat scenario. In fact, you have quite a few threat scenarios tied to each other.
You have various intervention points. You have intervention at the point that someone is
making a comment, and when they're seeing a comment, who do you introduce to each other,
what do you let them see, you have the whole response pattern and so on.
You keep adjusting this thing to try to reduce the level of threat
until overall, you look at your overall, here's my plan for the thing,
and you decide, okay, this plan is reasonable,
I think overall this thing is safe to launch.
And you're doing some really interesting trade-offs here because, for example, if you have to
have humans reviewing your abuse queues, which you absolutely have to have because computers
are not yet at the stage where they can do this automatically, in fact humans are barely
at the stage where they can do this automatically, there's a whole side conversation there, that
means you're, okay, so you actually have to do this expensive thing. You need people monitoring
this system continuously and maintaining it.
And how many people you need, well, that increases your cost.
And if you have a failure that requires human intervention happening an awful lot,
you've got a big problem.
Maybe you need to re-architect your system to make that failure happen less often.
That's actually how you go about making the engineering trade-off
of how much do I need to mitigate this particular threat.
What you say is, here's the residual cost of actually managing all the failure modes after I go through this. Is that cost
reasonable? If it's not, better go back to, better keep tweaking. And if you look at this whole
integrated story I'm telling you, this is actually best understood as an alternative to the traditional
risk management idea of likelihood and impact. So if you started out in the world of risk management,
you're used to taking each risk with each threat scenario,
it's the same kind of thing here,
assigning it a likelihood and a severity.
And typically sort of the product of those two
is how important the risk is, and you go from there.
And that's actually a terrible way to approach this
if you're an engineer.
Because that multiplication is really,
what it's designed for is for insurers.
It's designed for someone who needs to manage
a large portfolio of risks
and sort of manage the overall risk budget.
It's great for that.
If you're trying to manage specific risks,
it is terrible because you're dealing with
either things with very high,
in fact, you can't even say likelihood,
you have to talk about frequency.
A friend and colleague of mine, Andy Scow, he put it really nicely when we were at Google.
He said, if something happens to one in a million people once a year, here at Google, we call that six times a day.
And he was right; he had done the math on that one for a particular service.
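A quick back-of-envelope check of that line, assuming a service with roughly two billion users (the user count is an assumption, not a figure from the episode):

```python
# One-in-a-million-per-year, at planetary scale, is a daily event.
users = 2_000_000_000
p_per_user_per_year = 1 / 1_000_000
events_per_year = users * p_per_user_per_year   # 2,000 events a year
print(events_per_year / 365)                    # ~5.5, i.e. about six times a day
```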
And so what this means is, you can't even be talking about rare likelihoods.
In this case, you're talking about things that are happening continuously
and their continuous cost.
Or alternatively, you've got failure modes that are incredibly rare
and whose impact is really, really high.
And the product of a very small number and a very large number
is not a medium number.
It is statistical noise.
There is no way to plan for any of this.
If you actually try to design your safety plan by doing likelihood and impact,
you will just end up in a complete madhouse.
This is what I call the when-not-if method.
Don't ask if this thing is going to happen.
It's going to happen.
For each one of these threats, what's your plan?
That's the question I want to know.
So all the way back to your original question,
how do you deal with ethics and artificial intelligence?
I think that the real answer is, you deal with it by looking at what are the threats in your system,
what are the things that can go wrong, and having a plan for each of them.
And the nature of the correct plan, whether that's putting explicit rules in your system,
or having humans checking various things and so on, that's always very specific to the system you're building.
You really need a solution that is designed and customized to your problem space.
And you need to continually be observing, monitoring what's happening in production,
updating your model of the threats, updating your plan for response,
so that you're actually dealing with the things that matter.
Sorry, that is my entirely not short answer.
It was a great answer.
I'm looking at your sign behind you.
It looks like Smokey the Bear,
but it's not Smokey the Bear.
It is a... It is Roki the Raccoon.
Roki the AI Safety Raccoon.
Only you can prevent AI apocalypses.
This is the logo of our AI Red team,
which I absolutely love.
I love that.
And that kind of ties into my next question.
It's like, you know, fight fire with fire.
You see all these products, every product you're using.
It's like, hey, you know, tap into AI.
AI can help you.
We can help you write this.
We can help you do this. But on the back end, are we using, we as in humanity, using the AI to help secure AI?
Is that like fight fire with fire kind of thing or protect with protection of the AI? We are, but we have just barely begun to scratch the surface of possibility here.
There are many different ways in which we do this.
Let me start with one of the simplest. When we actually think about how do we secure AI systems, right, and there's a whole,
you know, we could spend an entire hour talking about just how you actually practically secure
them. One of the key mechanisms, one of the most powerful mechanisms is something called
metacognition. And so to actually understand this one, let's go back to what we said earlier on,
that generative AI is good at role-playing
and it's good at summarizing things.
A single pass through a generative AI system,
that's basically what it does.
It role-plays a character.
There are no guarantees at this stage
that it will be correct,
that it will be achieving
what you're trying to do, anything.
It's like, it's really,
this thing is dreaming and that's okay.
What people refer to as the hallucination problem, which is actually a much more complex problem, I think that's a very important name for it. What this really is that if you take a single
pass through the system and you're expecting the output to be grounded in some factual basis,
yeah, no, that's not going to happen. What do you do? One of the really powerful things you can do
is you can ask AI to role play an editor of various sorts.
So let's say that you're trying to do this. Let's give a concrete example.
I'm trying to build a chatbot that is going to be a customer facing chatbot that answers questions about my products.
And I have some kind of large website full of documentation about my system.
But because people are bad at information architecture, it's really hard to
actually find the answers you need in this website, especially if you don't know the exact
question you have to ask. This is a great job for generative AI. So what does the generative AI do?
Going to get a question from the user, and it's basically going to follow sort of a fixed kind
of plan. The first step is it needs to look at this question and, first of all, figure out, is this even a question that it knows how to deal with or answer, right? If I am asking Microsoft's customer service AI about how to make good pancakes, it should tell you that it has no idea; that is really not its job.
It should just bounce that off.
Then it says, okay, well, in order to answer this question, what's the stuff I'm going to look up?
It needs to come up with some search queries.
So here it's role-playing a customer service agent, right,
who's the subject matter expert,
and say, what are the right search queries to do to find the answer to this?
Cool, comes up with a list.
Now we're actually going to execute searches.
This is not an AI step.
This is the point where you just run searches.
And you tell it to look at the results,
and maybe you have some stage
where you're sort of judging which results you want to grab.
You want to summarize each one of those results.
Again, one of the things that AI is good at.
Then you're going to pull all of those things together,
and you make an answer.
And once you've got this nice answer,
what you can also do now is a metacognitive step.
A step where you tell it, look over this as an editor.
Make sure that every statement in the output
is actually factually grounded in one of the source pages
and attach a footnote
to every single statement.
Attach a footnote with a link.
And if you can't,
if you can't footnote
a statement correctly,
take that statement out.
That editing pass,
that's actually how you
eliminate fabrications
from all of this.
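A minimal sketch of that whole plan, triage, query generation, plain search, summarization, drafting, and the metacognitive editor pass, might look like the following; the llm() and search() callables are hypothetical placeholders rather than a real API:

```python
# Minimal sketch of the plan described here. llm() takes a prompt and returns text;
# search() takes a query and returns documentation pages. Both are assumptions.
def answer_question(question: str, llm, search) -> str:
    # 1. Triage: is this even a question this assistant should answer?
    if llm(f"You are a customer service agent. Is this question in scope? "
           f"Answer yes or no.\n\n{question}").strip().lower().startswith("no"):
        return "Sorry, that's outside what I can help with."

    # 2. Role-play the subject-matter expert to propose search queries.
    queries = llm(f"You are a customer service agent and subject-matter expert. "
                  f"List search queries that would answer:\n{question}").splitlines()

    # 3. Non-AI step: actually run the searches against the documentation.
    pages = [page for q in queries if q.strip() for page in search(q)]

    # 4. Summarize each result, then draft an answer from the summaries.
    summaries = [llm(f"Summarize for answering '{question}':\n{p}") for p in pages]
    draft = llm("Answer the question using only these summaries:\n"
                + "\n".join(summaries) + f"\n\nQuestion: {question}")

    # 5. Metacognitive editor pass: keep only statements that can be footnoted.
    return llm("You are an editor. For every statement in the answer below, attach a "
               "footnote linking it to one of the source pages. If a statement cannot "
               "be footnoted, remove it.\n\nSources:\n" + "\n".join(pages)
               + "\n\nAnswer:\n" + draft)
```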
Now, there's all sorts of other ways
you can do this,
and so I could talk about this
at tremendous length,
but this concept of metacognition is really powerful. And part of the reason it's so powerful is because
of role-playing, right? Because this is just one of these magical things about generative AI.
Generative AI was trained on human data, and it has cultural assumptions baked into it.
So let's say I tell it, you are a compliance officer, or, I mean, the compliance officer is like a very beige sort of use case. Well, you are a responsible adult who really cares about the safety of their community.
You tell it a story like that, then you tell it, look over the following thing and tell me, is this going to be a problem?
What's really amazing is, I tell it like this short,
I give it like one or two sentences
describing the character that it's playing.
And all of these assumptions
that come with that character description,
we've actually kind of encoded into it.
If you tell it it's playing a rabbi
or a compliance officer, something like that,
if you're telling it to play this character,
all sorts of assumptions intrinsically come in
because it was trained on it. It knows what these characters are. And so you can sort of adjust that, train that, tweak that, so that you don't have to specify 5,000 rules. You don't have to explicitly
specify its ethical code. Rather, you give it a character that you describe well enough that it
has an ethical code, and then you tell it to apply that to the outputs and analyze that way.
And this is a technique that actually is proving very effective.
The characters that it's playing are going to play by the rules that you assume. So for example,
there you say, you're a compliance officer. How do you know that the compliance officer that it is
going to play is not some fictional villain version that it got from a TV show?
Thank you for asking that question.
I did not.
That was exactly the right question.
And so this goes to one of the most important things we've learned.
It turns out it's really easy to build generative AI software
and it's really hard to test generative AI software.
It's very easy to write a system that works great
in the one or two cases that happen to pop into my mind.
And when you give it the real inputs, it turns out they do not do what you expect at all. And so the answer to how do you make sure it does is testing. So, in fact, I think one of the most important things you can really be doing is, for each of these systems, to create, first of all, a bunch of test cases of just its ordinary function. Give it a bunch of inputs that look like the real inputs it's going to get. And by the way,
get other people to help you write
those inputs and get AIs to help you brainstorm
further ways in which the input could look
because I can promise you that
real users are infinitely weirder
than anything you can come up with.
And you manually
figure out what do you expect
to happen in each of these cases.
You run this thing through the output, make sure the outputs look right.
Not only that, you can even use another AI and give it a rubric to judge
and to sort of do a first-pass classification of does this look more or less like what I expected?
And then you can actually check the outliers by hand.
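A hedged sketch of that loop, with run_system(), judge_llm(), and the rubric text as hypothetical placeholders, could look like this:

```python
# Sketch of the testing loop described here: run each test input through the system
# under test, have a second model grade the output against a rubric, and route
# the outliers to a human for sanity-checking and rubric tuning.
def evaluate(test_cases, run_system, judge_llm, rubric):
    outliers = []
    for case in test_cases:
        output = run_system(case["input"])
        verdict = judge_llm(
            f"Rubric:\n{rubric}\n\nExpected behavior:\n{case['expected']}\n\n"
            f"Actual output:\n{output}\n\nDoes the output satisfy the rubric? "
            "Answer PASS or FAIL with a one-line reason."
        )
        if not verdict.strip().upper().startswith("PASS"):
            # Humans review the outliers, because the rubric itself needs tuning too.
            outliers.append({"input": case["input"], "output": output, "verdict": verdict})
    return outliers
```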
And then, of course, you could also build a set of test cases for all of these possible harms.
What happens if someone puts in this following malicious input?
Does it catch it correctly?
And this is in fact exactly how we do testing.
The testing frameworks that we build are all based on exactly this principle.
We have a bunch of test cases, we have a rubric that is run by an AI, by another AI,
and then what you do is you feed the test case into the AI, or into the system you're testing,
you look at its output, you have another AI following this rubric, looking at the output and judging it, and then you have a human look at the overall outputs of all of that and actually
sanity check, because tuning that rubric is just as hard as tuning the original. So you sort of
have to keep planning, you have to keep refining the rubric. And what's funny is, again, this is very similar to a pre-AI problem. In fact, let me give you a
social media example, because that's exactly what this is for. It turns out, writing these policies
for things like harassment and hate speech and so on is tremendously difficult. Articulating what
constitutes hate speech, like, good luck with this, this is a massively difficult problem. And in particular, what you have to do is write a policy that's going to be run by human analysts. Right? At the end of the day, you've got literal people sitting in front of terminals reviewing items to see, do they match policy or do they not? And you can measure the correctness of this policy in a lot of ways, by looking at things like inter-rater agreement: you send a random subset of all the items to multiple people. Do they get the same answer reliably or not?
The answer, by the way, is they don't. Most of the time, it's very hard to write a policy that
will cause them to reliably agree. When you're writing these policies, one of the other things
you can discover is that what you wrote isn't what you intended. So I'll give one of my favorite
examples. This is one that got into the press, which is why I can easily talk about it. This one happened at Facebook, and they had a rule where
encouraging violence and demeaning content
etc. etc. against people based on protected
categories was not permitted
so you weren't allowed to call for violence
based on race or based on gender
or something like that
and now what happens if someone combines
two attributes?
Well, the answer was if you have a statement
that's calling for violence based on a combination of attributes
where all of the attributes are protected,
then that is also forbidden.
Okay, I just said something incorrect.
What did I say wrong?
Oh, man.
I said it quickly, and you probably didn't catch it.
I didn't catch it. I didn't catch it.
and they didn't catch it either
because I said all
and I should have said any
oh right
As a result, they wrote a policy where, and what's funny is, their internal training material, which leaked, which is how we know this whole story, ended up following what was written in the policy.
"Men are trash": canonical violating statement.
"Kill all the black children": canonical non-violating statement.
Because black is race, that is a protected category,
but children is not a protected category,
and I said all, not any.
And so therefore,
saying go kill all the black children was considered a classic non-violating statement
because of basically a typo in the original rules.
Oh, man.
Wow.
I can tell you, like, we had the same things happen at Google.
We had mistakes like this happen there everywhere.
This mistake can happen everywhere.
It's really easy.
I mean, it's really easy to miss this kind of thing.
How do you prevent a mistake like that?
Unit tests.
When you're writing a policy,
and this is not a job for engineers to write,
this is like policy.
When policy people are writing policies,
have them write out a list of examples
which should be violative and non-violative
and get them, like work with them.
Like you go back and forth and give them,
okay, here's a harder example.
Here's another example that's hard in a different way,
and so on.
And you build up this list
and every time you change your policy,
you update that test list.
Same thing with AI.
If you're trying to implement a policy,
like a metacognitive filter,
if it's a rubric to evaluate
the outputs of tests,
something like that,
give it a list of test cases,
pro and con.
And that way also,
if you ever have to do something like,
you know, update your model version or something like that,
you can retest and make sure the system is still doing
what you think it's doing.
Because otherwise, yeah, it can go really, really badly.
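As a toy illustration of how a unit test would have caught that "all" versus "any" slip, under a deliberately simplified model of the policy:

```python
# Toy illustration of the "all vs. any" bug, written as unit-test style checks.
# The protected-category list and rule functions are simplified assumptions.
PROTECTED = {"race", "gender", "religion"}

def violates_all_rule(targeted_attributes):
    # The rule as written: forbidden only if *all* targeted attributes are protected.
    return all(a in PROTECTED for a in targeted_attributes)

def violates_any_rule(targeted_attributes):
    # The rule as intended: forbidden if *any* targeted attribute is protected.
    return any(a in PROTECTED for a in targeted_attributes)

# "Kill all the black children" targets race (protected) plus age (not protected):
case = ["race", "age"]
assert violates_all_rule(case) is False   # the written policy lets it through
assert violates_any_rule(case) is True    # the intended policy catches it

# "Men are trash" targets gender alone, so both versions flag it:
assert violates_all_rule(["gender"]) and violates_any_rule(["gender"])
```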
Yonatan, first of all, we need to get you back on the podcast
for a part two, three, four, five.
Into infinity.
We are coming up on time here.
And I wondered if this is a good segue for us to talk about the role of just very briefly security researchers. So
the Blue Hat podcast, the Blue Hat conference, this is a part of the security researcher
community. You talked about, a lot of this was about sort of product engineering and safety
engineering, which is obviously sort of on the internal side of the development of systems and
products. What of that role that unit testing or
the flip side of unit testing can be taken and should be taken by the researcher community?
Or how should they start to just sort of think a little bit differently about this space?
Well, I think there is so much opportunity for the research community to be involved in this.
This is potentially a real golden era. And one thing I'll point out is, back in the early 2010s,
when we were creating what we today call privacy engineering,
which is a slight misnomer as a discipline,
basically the people who were working on exactly these safety problems for social media
when that was the big new problem,
that discipline didn't really exist.
And who were the people that we were hiring for it?
Well, it was people who were good at thinking about how things will go wrong.
SREs turned out to be really good at it.
Lawyers, journalists, all sorts of people from all sorts of backgrounds,
all sorts of walks of life.
The common skill that really made people shine in the space
was the ability to look at a system and think about what might go wrong.
It's doing that very first initial step
that's often the hardest thing for people to do.
And security researchers are wonderfully suited for exactly this kind of
thing. And the biggest difference between safety research and security research is you just zoom
out and look at a bigger scope of problems. My rule always for my teams is, they ask,
well, is this kind of risk in scope? And the answer is, well, does it involve your system?
Is it a risk? Congratulations, it's in scope.
And I think with security research, we often get a little narrow and we say, oh, well, you know, this is about an access control issue, so that's security, but that's just about a human misusing
the system in a way we didn't expect. So that's a product problem. Stop saying it. All the problems
are your problem. And now you ask, well, what can a security researcher do? But what can a safety
researcher do? And the answer is, it's the stuff that you have been doing all of this time.
If you are internal, obviously, you can be part of this whole design process and so on.
If you're external to a place, do safety research with the same approaches that you take to doing security research: probe systems, look for issues, look for vulnerabilities. Think about responsible
disclosure, same kind of approaches. All of the muscles you've built for your security work over the decades, those same muscles apply here perfectly well; do the exact same kind of thing. And, you know, when you're dealing with the disclosure aspects, sometimes it's very, very similar: you find a system that it turns out you can make misbehave in a way that people didn't think you could,
treat that like a security vulnerability.
Disclose it responsibly, publish the results, etc., etc.
Same thing you do.
I think you're a little more likely to come across outright bad actors in the world where it's like you've discovered a problem with the system and they say,
yes, we know that's intentional.
That doesn't happen quite as often in the security world.
I think one of the things that you'll encounter more and more in the safety world
is really a misalignment of incentives kind of problems,
where often you'll have a maker of a system and the user of a system
and people affected by a system.
You have all of these different parties,
and sometimes you'll have pairs of them whose incentives do not align.
And those misaligned incentive moments,
those are the
places where the biggest problems often show up. Sometimes, by the way, you'll have multiple groups
of users whose incentives don't align with each other, right? Most social media problems are not
because the company running the social network is evil. It's because one set of users is a problem
for another set of users. Which is not, by the way, even saying that one set of users is bad actors.
Culture clash is a great engine for that kind of thing too.
So search for these places where something
can go wrong, probe those things,
do that research, and of course, no less important than all of that, find ways to fix problems. Come up with
techniques for mitigation. We are in
such an open green field space
in the world of generative AI.
You can go out, discover a new problem,
and figure out a way to mitigate this problem,
to make a whole class of problems go away.
Like, you know, I mean, this is like security research decades ago.
This is like, you know, these very, very early days
where everything you're doing is really novel.
So security researchers, please get involved,
work actively, probe this, and just broaden the scope of what you think about from traditional security to safety in the broadest possible sense of the word.
Just one comment on how important I think the security of AI is, because I know that I speak with people who take everything that comes out of AI literally. Like, well, if ChatGPT says it, then it is.
You know?
I remember back in the 90s, people were very worried,
like in the late 90s, like, oh my god, if something
wrong shows up in a search result,
Google said it, so it must be true. We're having the same thing now.
I can promise you that the output of an AI
is no more guaranteed to be true
than the output of a search engine, or for that matter,
the output of a human.
And, again,
there's a whole hour of conversation
we can have about problems
of like over-reliance, fabrication,
the different specific things
that can be going wrong.
We can talk for hours and hours
about all of this.
And we will.
And I hope we do, Yonatan.
Thank you so much.
We're definitely going to have you back
on another episode
of the Blue Hat Podcast.
Just before we let you go,
is there one go-do you would like our audience to do? Do you want them to read something? Do
you want them to go watch something? What should everyone do to take the next step in securing AI?
I wish that we have already published some book for the public about how to do all of this stuff
that I could tell you, go read this thing. But if I were to give people one go-do, it's
go back to these projects,
these products that you work with every day, and do that threat modeling exercise.
Do that exercise on every sort of thing you encounter. Think about ways things can go wrong.
Get yourself into that mindset. Practice thinking about how things might fail.
And with that, I think you will be in a spectacularly better place to really
address the real problems that face us in the world.
That's a wonderful end.
Yonatan, thank you so much for being on the Blue Hat Podcast.
I look forward to our next episode.
This has been fantastic.
Thanks for your time.
Thank you.
It is a real pleasure.
Thank you for joining us for the Blue Hat Podcast.
If you have feedback, topic requests, or questions about this episode,
please email us at bluehat@microsoft.com
or message us on Twitter at MSFTBlueHat.
Be sure to subscribe for more conversations and insights
from security researchers and responders across the industry
by visiting bluehatpodcast.com
or wherever you get your favorite podcasts.
This week on the Microsoft Threat Intelligence Podcast, get an update on AsyncRAT and all the threats to the financial services industry. Be sure to listen in and follow us at msthreatintelpodcast.com or wherever you get your favorite podcasts.