CyberWire Daily - Navigating AI Safety and Security Challenges with Yonatan Zunger [The BlueHat Podcast]
Episode Date: December 30, 2024
While we are on our winter publishing break, please enjoy an episode of our N2K CyberWire network show, The BlueHat Podcast by Microsoft and MSRC. See you in 2025!
Yonatan Zunger, CVP of AI Safety & Security at Microsoft, joins Nic Fillingham and Wendy Zenone on this week's episode of The BlueHat Podcast. Yonatan explains the distinction between generative and predictive AI, noting that while predictive AI excels in classification and recommendation, generative AI focuses on summarizing and role-playing. He highlights how generative AI's ability to process natural language and role-play has vast potential, though its applications are still emerging. He contrasts this with predictive AI's strength in handling large datasets for specific tasks. Yonatan emphasizes the importance of ethical considerations in AI development, stressing the need for continuous safety engineering and diverse perspectives to anticipate and mitigate potential failures. He provides examples of AI's positive and negative uses, illustrating the importance of designing systems that account for various scenarios and potential misuses.
In This Episode You Will Learn:
How predictive AI anticipates outcomes based on historical data
The difficulties and strategies involved in making AI systems safe and secure from misuse
How role-playing exercises help developers understand the behavior of AI systems
Some Questions We Ask:
What distinguishes predictive AI from generative AI?
Can generative AI be used to improve decision-making processes?
What is the role of unit testing and test cases in policy and AI system development?
Resources:
View Yonatan Zunger on LinkedIn
View Wendy Zenone on LinkedIn
View Nic Fillingham on LinkedIn
Related Microsoft Podcasts:
Microsoft Threat Intelligence Podcast
Afternoon Cyber Tea with Ann Johnson
Uncovering Hidden Risks
Discover and follow other Microsoft podcasts at microsoft.com/podcasts
Learn more about your ad choices. Visit megaphone.fm/adchoices
Transcript
Since 2005, Blue Hat has been where
the security research community and Microsoft come together as peers.
To debate and discuss, share and challenge, celebrate and learn.
On the Blue Hat Podcast, join me, Nic Fillingham.
And me, Wendy Zenone, for conversations with researchers,
responders, and industry leaders,
both inside and outside of Microsoft.
Working to secure the planet's technology and create a safer world for all. And now, on with the Blue Hat Podcast. Welcome to the Blue Hat Podcast. Today, we have
Yonatan Zunger, and we are thrilled to have you here. Yonatan, would you introduce yourself? Tell
us who you are. What do you do? Well, hi. Thank you so much for having me on the show. So, my
name is Yonatan Zunger. I'm currently CVP of AI Safety and Security at Microsoft,
as well as Deputy CISO for AI.
And my job is to try to think of all of the things
that could possibly go wrong involving AI
and figure out how we're going to try to prevent them from happening.
I think that's sort of the short version of it.
I came to this from a career originally as a theoretical physicist.
Went over, moved over into CS sort of full-time back in the early zeros,
where I started out building heavy infra.
I built a lot of the core part of search at Google,
a lot of planet-scale storage, things like that.
And then in 2011, I became CTO of Social.
And this was just at the time that Google Plus was about to launch.
This was also the time that GDPR was being drafted.
And within three weeks of taking that job,
it suddenly became very clear that the hard part of this job wasn't going to be software infrastructure.
It was going to be people's safety.
It was security, privacy, abuse, harassment, policy, all of these things.
And I discovered that I genuinely loved that.
I fell in love with the field of trying to really solve these problems.
And that's been, I would say, one of my biggest foci professionally ever since.
And so now I'm really excited that I'm getting to work on
one of the craziest, hardest problems,
like even by the standards of a pretty strange career.
One of the strangest and hardest things I've ever worked on.
And yeah, that's what I'm doing now.
I love it.
I love all the nuances and the human side of what you're doing.
If you could let the audience know, for some that are still learning about the AI field,
what is generative AI?
Well, yeah, that's a really good question. Because we've had AI of various sorts for a
very long time. And generative AI has also existed for a long time, but it only became a really
big deal in AI a little
more than a year and a half ago or so. The way to think about it, the traditional kind of AI,
I'm referring to it nowadays as predictive AI, what does your world look like in the world of
this traditional AI? Typically, if I want to use a model for something, I'm going to build a model.
The model user and the model builder are the same person, and you take a bunch of examples and you train a model. What are these models generally good at? They're
good at looking at a really large field of data and making a prediction or a classification or
a recommendation or something like that. You know, they're good at looking at these very,
very large spaces and analyzing. And of course, the problems you're dealing with now, because
you're both model builder and model user, is you now really have to worry about, is my model biased? Did I pick the right training data? Does
this thing have really weird nuanced failure modes? And then you have to think about all the
safety aspects of your integrated system, right? Am I using it wisely, etc, etc. Generative AI is a
bit of a different world. At the very deep technical level, it's the same basic approaches,
you know, we have neural networks, all these structures.
But in practice, it's often better to just think about it as a completely different technology
from a practical basis.
The idea of generative AI, at the very technical level, of course,
what you're doing is you're predicting character sequences, token sequences, images, things like that.
In practice, the way I would think about it is you've got a model. And first thing
to realize, in most cases, you've got a generic model, right? It's a model where one person
trains it and you're going to use the same model for a huge range of applications. So the model
trainer and the model user are now two completely different people. And what are these things good
at? Basically, there are two things that generative AI is good at. One of them is, it's
good at summarizing or analyzing a piece of human-type content, so natural language or an
image or something like that, right? It's very good at saying, like, here's a paragraph of text,
give me a summary, extract the key ideas, something like that. And the other thing it's really good at
is role-playing a character. And this is, this is the foundation of most of what we do with
generative AI is basically a lot of creative use of role-playing. So you tell it, you're a customer
service agent for Wombat Co. and you're about to be asked a question by a customer and you know how
to search through the following databases of information, etc. Or you say, you're a programmer,
you're a Python programmer,
and you've been asked for your advice on this piece of code,
and you need to write a function to do something.
You're a security expert,
and you need to help analyze this set of forensic logs,
something like that.
So this sort of creative use of role-playing
is one of those fundamental engines to it.
So I guess the way I would say is that what is generative AI really?
At the innermost loop, it is a combination analysis and role-playing engine,
which you can then build up to build all sorts of cool things out of.
Yonatan, this might be too large a question,
but what I wanted to ask was,
it almost sounded like you described the entire breadth of AI.
We were talking about just generative AI, but so what's beyond that in terms of, so
you talked about role playing and you talked about sort of the ability to synthesize or
summarize data.
I'm obviously paraphrasing heavily.
What else does AI do that's not generative AI?
Again, probably a very large question, but how do we sort of think about these different
roles and functions that AI can take?
Well, that's the predictive AI I was talking about.
The generative AI is what does the analysis and the role playing.
The predictive AI is the stuff that does the classification and the recommendation and
the analysis of that sort.
I'd actually give a really good analogy from the human brain.
If you think about how the vision system works, right? So the human vision system is a stack where the very top, the very first input to the stack
is the retinal neurons, right? So you have the direct things that are measuring brightness or
color or something like that. And you have several stages in this stack, which then go from
pixels, quote unquote pixels, to very small curves, to large curves
and shapes, to two-dimensional shape recognition, to three-dimensional object recognition, and
so on.
Now, this is very similar to how predictive AI works, right?
You have a model that is just scanning a tremendously large range of things and pulling a small
number of features out of it.
Once you've pulled out those first things, though, your next layers of the stack is saying,
oh, I see a three-dimensional shape. Wait a moment, I recognize that shape. That's
Wendy's face. Then the thing that's starting to go off and identify
what's happening around you, the sort of things you can articulate in words like,
holy crap, there's a tiger and it's about to jump on me. That kind of higher-level
processing is, in a lot of ways, more similar to what generative AI is doing. It's narrative. It's literally, you can
think of it most usefully as processing that happens either directly in words at the highest
level or in sort of almost word-like concepts at the level below it. And the reason I bring up this
analogy is because it highlights the way in which the two kinds of AI actually complement each other,
right? These are not replacements at all.
What happens is that this predictive AI,
you can really think of it as AI that specializes in looking at
very large fields of data, and where the model tends to be
very specific to the problem being solved.
So the vision centers of your brain, if you tried to plug them into your ears,
they would not work correctly. That's not what they're there for.
Whereas these higher-level abstractions, the generative AI is really good for dealing with the higher level
abstractions of the things that you can narrativize, the things that you can turn into words.
So in a good healthy environment, what you're generally doing is that you're using these
predictive AIs to scan very large fields of data and reduce a mass of pixels into a statement of,
oh, here's a picture of somebody's face. And then you're taking that information and you're handing it to the generative layer, to the narrative layer,
the one that speaks in words and so on. And it then starts to assemble these, reason about them,
talk about them, have these very generic kinds of conversation about them.
So that's sort of the difference. That's the secret.
That is a wonderful analogy. Thank you so much. I do want a quick pause. As an Australian,
I noticed that you used wombat as your example there.
Why wombats?
Is there a story there?
Why not wombats?
No, I love it.
Why not wombats?
We put that on the sticker.
So I want to sort of like help people sort of still continue to wrap their head around
generative versus predictive and other forms of AI.
Can you give us some examples of positive uses
of this or to juxtapose against negative use cases here? How do we think about the good and the bad,
and I'm using air quotes, of this technology? Well, the good and the bad is very much in
how you use it. What are some examples of good uses? I mean, there's so many of them, honestly. You know, I'll just give some random ones that pop into my head.
Dynamic temperature control for factories and data centers.
I remember that was an example that came up a decade or more ago,
but it turns out that you can have a system that stares at all of the temperature sensors across the building
and controls whether to open windows or not and how to run fans and so on,
and you can make a building spectacularly more energy efficient by doing that.
Self-driving cars, when they're not designed by maniacs,
this is a technology that can save an awful lot of lives.
I mean, social media is always kind of a complicated mixed bag,
but if you think about this idea of helping people meet other people,
the actual driving purpose of this,
and I think we often forget, given how many problems have emerged in social media, we tend to forget just how much good this has
actually done in people's lives. How many people have formed and maintained their friendships,
their jobs, their entire professions sometimes, their romantic relationships. There's so many
things that people have formed through this. And if you think about this, this is really about, a lot of this is the use of algorithms
to try to help you find who are the people you might want to actually be with, who are
the people you might want to be around.
With the generative AI, you know, it's still a very new technology.
So I think we haven't yet seen the killer app of generative AI.
I think, you know, we're in a stage where I think the single most important piece of
office software of the 2030s, the category has not been invented yet.
We're really at that new stage.
The thing that is going to be the equivalent of what the spreadsheet was for the personal computer or what direct messaging was for mobile phones, we haven't even invented it.
We don't even know what that thing is yet.
Right now, we're sort of seeing these very early examples with generative AI. I think we're finding that it's really good as an interlocutor and
brainstorming partner, for example. I think there's a lot of very interesting potential there.
It's also something that you can combine with a lot of more traditional techniques. Like for example,
one of the classical challenges, one of the things that you really couldn't do with traditional predictive AI, is that it's not really good at understanding language.
You know, understanding natural language is actually a very, very difficult problem.
It's what we used to call an AI-complete problem.
In fact, it turns out even pronoun resolution,
that is, knowing what a pronoun refers to in a sentence, is AI-complete in the sense that it actually requires a full model of the world and a theory of mind in order to do.
There's sort of a classic example.
I think this example might be due to Steven Pinker.
I can't remember for sure.
But here's a sample dialogue for you.
Woman: I'm leaving you.
Man: Who is he?
Now, I'll bet you probably had no trouble understanding those two sentences in that dialogue.
I mean, you could probably tell me exactly who the he in that second sentence refers to.
Yeah.
Now, explain to me who that he was, without a complete theory of mind of both of the people, and of what each of them is thinking about what the other one is thinking, and so on, of two characters that I've literally identified as just man and woman.
Already you had to solve that complicated a problem. So one of the traditional challenges
we've had in all sorts of data science is that understanding human language is really, really,
really hard. And what's one of the genuinely stunning things about the recent revolution in generative AI,
the one that's happened in the past year and a half or so,
has been that we finally have software
that's capable of just looking at a piece of human text
and actually understanding it
and extracting information from it.
So if I ask it to resolve that
and to then transform that into a structured form, I can.
Which means that I can potentially apply
this sort of analysis at scale to large amounts
of human data and
interact with human
information in entirely novel ways.
And of course interact with people directly.
So I think there's tremendous
possibility for some really
wonderful things to happen here.
Another suggestion that I've heard a lot talked about is
personalized education as a service.
Imagine I can say, I want to learn about X, and this thing will help put together a syllabus. It'll do like all of the
research it needs in order to find all of the right information, figuring out how best to teach
it. And then it can teach me interactively, right? Because it's not just going to create a PDF or
PowerPoint presentation. It can actually go back and forth and work with me and teach me all of the things
I need to know. Imagine what this could do for the world, give everyone access to a teacher.
So there's tremendous possibility for good. And of course, there's tremendous possibility for bad,
because there is no single technology humans can come up with that can't be horrifyingly misused.
Just to take a simple example, we were just talking about education. That's entirely wonderful until what the person wants to learn is how to weaponize anthrax,
or how to kill people, how to encourage a genocide.
I mean, there's so many things that people might want to learn that are really horrible.
And this is the point where we start to really run into deep nuances and we sort of have to ask ourselves, well, these problems already exist in the world, how do we prevent them?
Right, there are people in this world who do know how to weaponize anthrax, but I can promise you that if you went to one of them and asked them, hey, would you teach me how to do that?
They would say no, they would probably not do that.
But their judgment about when and how they're willing to do that is an interesting nuance. It's something that we need to figure out a way to capture and express and formalize.
There's a lot of other very simple ways you can misuse AI in basically any way you can
imagine misusing any technology.
One of my favorite examples, and "favorite" might be the wrong word for this, but one of my classic examples of a misuse is that there has been a whole business of using artificial intelligence to help make sentencing recommendations in criminal law.
And ProPublica had an expose of this back in 2016, which I think is really worth reading.
This works as badly as you might imagine.
So, for example, if you look at these companies, they were very careful not to
take race as an input signal, right? Because that would be horrible. You shouldn't use that.
But they did take income and sort of your quantized address, like the neighborhood you lived in. And
if you know anything at all about how American politics works, your income and the neighborhood
you live in is a really good proxy for race in most of this country.
And then you end up with sort of a proxy signal problem.
The theory behind it was, well, we're going to predict who is most likely to commit another crime,
and we will recommend harsher sentences for people more likely to commit further crimes.
There are a couple of obvious problems with this one.
First of all, the sentence you give someone does affect their probability of committing a crime in the future, right? If you make sure that someone can't get any sort of non-criminal job in the future,
they're probably going to be criminals. But the even deeper one, the big problem that really
kills this is when someone commits a crime, it's not like a giant light bulb goes off over their
head saying, attention, this person has committed a crime.
You can't actually measure the variable you care about, so they picked a proxy variable.
They measured whether someone was charged with a crime.
And the thing is, the difference between committing a crime and being arrested for a crime and
being charged for a crime, that is not a uniform translation matrix.
If you ask the question, who is more likely to be charged with a crime?
The answer is the black person.
What they basically built was a system to predict race.
They built a system that was shaped exactly to capture and measure the nature of institutional racism in the United States, and then implemented that as sentencing guidelines.
And this is a system that proceeded to go off and destroy a
bunch of lives. I think this is a really good example of how not to use AI, of really, really
dangerous, ill-conceived decisions. And in this one, the obvious thing that people didn't think
about was the basic question of, what happens when this thing makes a mistake? This is the basic
question you need to ask with any piece of software that you're building, any machine you're building. What happens if something goes wrong?
And in this case, something very bad happens, especially because they set it up in a way where
basically its recommendations were almost automatically accepted.
So you have to really architect your systems around the possibility of failure. And especially for things like AI, where the system is inherently non-deterministic,
where all of its categorizations or predictions or outputs are always going to be
probabilistic, you have to be very, very careful and make sure
that your system is robust. Your system, the integrated system, including all of the people who are using it and the people who will interact with it, is robust against the system being wrong. Wow. Gosh, so many directions we could go here.
Yeah. My first question is jumping off from that example. It's too simplistic to ask,
where did they go wrong? I think I want to ask more about sort of ethics. When you design a
system that has these kind of potentials for these kind of sort of significant outcomes, do you hard code in a bunch of ethical rules or do you give the system the ability to
monitor the outcomes to then sort of adjust those sort of ethical guidelines that it's
functioning on? Or do you still need in 2024 human beings with their own ethical guide to be able to monitor and control
it or something else? How do ethics as a guiding force, but then also a component in an AI system
play into this? That's a really wonderful question. And I think there is no single answer to it.
I think the correct answer to that question is very much dependent on the exact system that
you're building. The way I would frame the approach to this, and this is I think one of the most basic lessons
that I always try to teach people, there are two parts to engineering. Product engineering is the
study of how your systems will work, and safety engineering is the study of how your systems will
fail. You can't do just one or the other. And I think one of the great curses of modern computer
science, the way that the field is working, which we desperately, actively, urgently need to fix, is that these are
treated as two separate things rather than part of the same discipline. And if you go talk to civil
engineers, you will see a very different story. Civil engineers are safety engineers who
occasionally build bridges. It's a very different culture, and I think a much healthier one.
What does a safety engineering culture mean when you are working with AI or something?
Well, in fact, it turns out AI and social media, I think, are very similar.
Also, search, gaming, any software that really intimately involves humans and AI,
you get very similar problems, and you need a similar approach.
So what do you do?
Very, very first thing, from the moment you even start to conceive of,
hey, I've got a crazy idea,
what if we did dot, dot, dot?
At the same time that you're thinking about what it could do,
you're also thinking about what might go wrong in this situation.
I have a whole set of things that I try to teach people
about how to think of ways that things can go wrong,
and we're actually working with my team right now on writing up training materials and creating things to help people learn how to do this.
But the very, very first thing you're doing
is you're coming up with a list of things that could fail,
a list of ways in which this thing could go badly.
Your basic approach to this, by the way,
that's a three-pass approach.
First, you go system first.
You look at each component of the system
and ask, what happens if this thing fails?
That might mean, what happens if it makes an error?
What happens if it gets a malformed input?
What happens if it gets an actively malicious input?
What happens if it gets an unexpected input?
Just all of the ways in which some component could go wrong.
Your second way of looking at it is attacker first.
What if someone is trying to misuse this system?
What if someone is trying to use this system,
let's not even say maliciously, but for a purpose other than the one you intended?
What might they be trying to accomplish? How might they use your system in order to accomplish that?
And your third pass is the target first pass. That's where you're looking at it as,
who are the people who might be affected by this system? What aspects of their lives might
cause them to be affected by this system differently from other people.
And what are their particular vulnerabilities in their lives
that might be around?
So one of the things we're working on here also is checklists of
sort of ideas to help people think of different possibilities here.
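To make those three passes concrete, here is a minimal sketch of how a team might record its threat scenarios as structured data; the field names and example entries are illustrative assumptions, not anything from the episode.

```python
# Hypothetical sketch: capturing the three threat-modeling passes described here
# (system-first, attacker-first, target-first) as structured records.
from dataclasses import dataclass, field

@dataclass
class ThreatScenario:
    name: str
    discovery_pass: str        # "system", "attacker", or "target"
    description: str
    intervention_points: list = field(default_factory=list)
    residual_plan: str = ""    # what you do *when* it happens anyway

threats = [
    ThreatScenario(
        name="model returns a wrong answer",
        discovery_pass="system",
        description="Component failure: malformed, malicious, or unexpected input.",
        intervention_points=["input validation", "grounding check", "human review"],
        residual_plan="user-facing correction flow plus incident logging",
    ),
    ThreatScenario(
        name="feature repurposed to harass a user",
        discovery_pass="attacker",
        description="Someone uses the system for a purpose other than the one intended.",
        intervention_points=["rate limits", "abuse reporting", "account review"],
        residual_plan="fast dismissal UX plus trust-and-safety queue",
    ),
]

# The review loop is continuous, not one-off: every scenario needs a plan.
for t in threats:
    assert t.intervention_points and t.residual_plan, f"no plan for: {t.name}"
```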
I'll also say, this is the place where I always say that diversity, equity, and inclusion
make such a big difference in your ability to actually correctly do your job.
Because the one thing I can promise you is that you cannot think of what every possible attacker
or affected person might be experiencing in their lives. They are very different from you
because you are one person. There are a lot of people in the world and they're very different
from each other. They have very different lived experiences.
And having a broad team, a team with a really wide range of lived experiences,
and a team that's empowered to speak up about those things,
is critical to actually being able to do this analysis correctly.
So your very first step, the very beginning of all of this,
is you think about what could go wrong.
Now, you've done that. You've got your list of threats. You bounce this
off like a bunch of people. By the way, this is not a one-off process. This is the process you're
going to be continually doing every single day of your life from the day you first conceive of the
project until the day it gets shut down for the last time. You're thinking about what can go wrong,
and then you're thinking, well, okay, for each one of these things, I need to have a plan.
And your plan might involve mitigation,
like preventing it from going wrong or making it less serious.
And there's always going to be some aspect of it that you can't mitigate.
There are problems in this world where they look at this and say,
oh, I'm going to change the design of my system so that this thing is impossible.
That's wonderful.
When you can do that, that's your best choice.
And by the way, this is also why it's very important to do this sort of analysis from day one,
because often you can make a small change in the design of your product, and just in the basic shape of it, it eliminates whole swaths of potential problems, while leaving the core product function that you care about intact.
And that's often a really easy thing to do in the early design phase, and is almost impossible to do after you've built your entire system. So don't wait for that. I have seen projects get
two weeks from launch and then someone
points out a basic problem with this and surprise,
you have to go all the way back to
architecture. See you again in six months.
Don't do that. That's awful.
It's a terrible experience for everybody.
So how do you
sort of do this next step? Sorry, I'm going off
into the complete spiel of how you do safety
engineering. Love it. Do it.
Please, please.
This is wonderful.
Yeah.
So what do you do next?
The next thing you do
is for each one of these
threat scenarios,
you walk through the way
that the threat scenario
actually happens.
You walk through the exact,
what are the sequence of events
that have to happen
for this to go wrong?
And the reason you do that
is you start to highlight
possible intervention points.
Where are things
that you could do
that would prevent
that step from happening
or would change the outcome of that step?
Once you've done that for each of your threat scenarios,
then you compare those intervention points
across all the threat scenarios.
And then what you'll often discover
is there's a few intervention points
that actually help you with a lot of different threats.
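As a rough illustration of that comparison step, a sketch like the following (with entirely made-up scenarios) counts how many threat scenarios each candidate intervention point covers:

```python
# Hypothetical sketch of the comparison step: walk each threat scenario's chain of
# events, collect candidate intervention points, then see which points recur
# across many scenarios. The scenario data is invented for illustration.
from collections import Counter

threat_interventions = {
    "prompt injection via pasted document": ["input sanitization", "metacognitive review", "output filtering"],
    "model fabricates a citation":          ["grounding check", "metacognitive review", "human review"],
    "abusive reply reaches a user":         ["output filtering", "human review", "fast dismissal UX"],
}

coverage = Counter()
for points in threat_interventions.values():
    coverage.update(points)

# Intervention points that mitigate the most threats float to the top:
for point, count in coverage.most_common():
    print(f"{point}: covers {count} threat scenario(s)")
```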
And that's the point where you can start thinking about mitigations.
How might you change your system,
harden it, make it more robust to make those things less likely?
You keep sort of doing this in a loop until you now have a hardened system.
But at every stage of this, you've got sort of a residual threat.
You have events that could punch through all of those defenses and still happen.
And so your last stage is always, when this happens, not if this happens, when this happens, what are you going to do about it?
And that's the, how will you know that something has happened? How will you respond to it?
You know, for example, with a lot of user-facing software, this is the point where you start really thinking a lot about the user experience, by the way. You cannot treat UX as being distinct
from any other aspect of your system. One example of this, well, let's talk about abuse on social
media, right? So, turns out there's a lot of harassment and abuse on social media.
That is one of the primary things people use social media for.
Sadly.
Now, you've got various things to try to prevent it, but that's going to get through.
It's going to happen a lot.
So now you say, okay, let's say I'm running a social network,
and someone can create posts, and they can get comments on it,
and the comments can be really terrible in various ways.
Now, the objective of the user when they encounter one of these comments,
they're going to be very upset right now.
First thing, they want to get rid of this thing,
and they want to make sure that this person goes away and never comes back.
That is their objective.
Now, you've actually got a bit of a tension here, because the goal of the system operator is not just to get that detection
but to get enough information to figure out like did this violate policies? Is this thing a signal
of a broader problem? Is the user who made this comment a serious problem that we need to be
kicking off the service? Or conversely, is this like something entirely personal
between these two people that has nothing to do with this?
Also, I mean, abuse reporting is often actually done as an attack vector.
People will mass abuse flag people that they don't like,
not because those people are being abusive,
but just as a way to try to get them kicked off the service.
So in fact, false reporting is a big issue.
So the system operators really want to collect as much information and context as they possibly can about an abuse incident so they can make a good decision.
But these two goals are in tension.
Right?
So, because, in fact, one really useful way to think about this, if you think about emotional activation curves, they tend to spike very rapidly.
You're looking at a timescale of between 500 and 2,000 milliseconds,
typically, to see an emotional activation curve rise.
They decay.
People calm down much, much, much more slowly.
That is a timescale of minutes, typically, for a small activation like that. The decay is minutes, actually; minutes, not months.
But, if the event keeps happening,
you keep moving up.
Imagine sort of a curve where you can either,
every time a bad incident happens,
you add an exponentially rising curve.
And whenever anything isn't happening,
the thing decays with a very long time constant.
So you can keep going up, up, up, up, up.
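A toy numerical sketch of that picture, with illustrative time constants rather than anything quoted in the episode, might look like this:

```python
# Toy model of the activation/decay picture described here: each upsetting event adds
# a spike that rises on a roughly one-second timescale and decays over minutes, so
# repeated exposure ratchets the level upward. All constants are assumptions.
import math

RISE_TAU_S = 1.0        # activation rises in roughly 500-2000 ms
DECAY_TAU_S = 180.0     # calming down takes minutes

def activation(event_times_s, t_s):
    """Sum of spikes: fast saturating rise after each event, slow exponential decay."""
    level = 0.0
    for e in event_times_s:
        dt = t_s - e
        if dt >= 0:
            level += (1 - math.exp(-dt / RISE_TAU_S)) * math.exp(-dt / DECAY_TAU_S)
    return level

# One incident, quickly dismissed, versus re-reading the same comment every 10 seconds:
dismissed = [0.0]
re_exposed = [0.0, 10.0, 20.0, 30.0, 40.0]
for t in (5, 30, 60, 120):
    print(t, round(activation(dismissed, t), 2), round(activation(re_exposed, t), 2))
```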
So user has seen this upsetting comment.
They get their first spike.
If there is a big red button they can hit to make that thing go away,
then they can go right back into decay mode almost instantly. If there isn't, then for every time they look back at that comment, you're going to be getting another spike, and it's just going to
keep going up, up, up, up, up. So what's actually really important is that the user needs to be able to dismiss that thing on a timescale of seconds.
What you really want is the time from them experiencing it to the time they're done with
that problem to be five seconds or less, I would say is sort of a good rule of thumb.
That means that the report abuse button, if report abuse makes you go through this whole
like abuse reporting flow where you have to now declare which category of abuse is it and etc etc etc, you're getting good data for your team but you're actually not
achieving the core user need of getting into a safe state quickly. So the correct design of this
kind of system becomes very very subtle and nuanced. And this is actually sort of the core.
Now going back to where we started, you had a threat scenario of users experiencing abuse on the platform.
You need to think of intervention detection response
in a way that solves the user's problem first.
And then separately, the question of how do you now get the signals
that let you do a more detailed analysis?
Because now what you've seen is, okay,
this user flagged this comment as being problematic.
Most likely, that's all the information you've got.
If you now want to look for broader patterns, you now have to actually think like a data scientist.
You have to think, how do I analyze this situation to figure out, is there a larger pattern I need to care about?
And then there's all sorts of things you can do in order to do this.
So, for example, here's one simple rule.
Let's say that you have one user that there's a real pattern that every time they interact with someone
that they don't have a pre-existing relationship with,
the probability that that person is going to report their comment as abusive
is unusually high.
That's a really good sign that you are dealing with an asshole.
That's a real...
And there's a lot of things like this.
Actually, one of the most important rules
in abuse detection is
sometimes you're looking at reports of things.
And let's say that you're dealing with...
Well, if I'm dealing with comments
on someone else's post,
then the post owner should just have the right
to remove anything they want to remove, period.
But if I'm looking at posts,
like sort of top-level posts
or things in the general forum, the criteria for removal is probably going to be a product level
criterion. Now, one thing that we have learned is that bad actors in social media are really,
really good at figuring out the exact line of what they can get away with and skirting really,
really close to it. So they will always figure out some way to be just working around the rules
so that each individual post
never quite gets removed.
A really important rule
when you're doing abuse detection
is if you don't remove something,
nonetheless log that this thing
was like close to the edge.
Because one pattern you will notice
is that, hey,
none of this user stuff got removed,
but wow, do they have a lot of stuff
close to the edge.
And that is one of your biggest red flags
for an account. That account, you kick off the system.
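A hypothetical sketch of those two signals, the stranger-report rate and the pile-up of close-to-the-edge content, could look like the following; the thresholds and field names are assumptions for illustration only:

```python
# Hypothetical sketch of the two signals described here: an unusually high rate of
# abuse reports from people the account has no prior relationship with, and a lot of
# content that was logged as borderline even though it was never removed.
from dataclasses import dataclass

@dataclass
class AccountStats:
    stranger_interactions: int
    stranger_reports: int          # reports filed by people with no prior relationship
    posts_reviewed: int
    posts_near_policy_edge: int    # logged as close to the edge but not removed

def flag_for_review(a: AccountStats,
                    report_rate_threshold: float = 0.05,
                    edge_rate_threshold: float = 0.30) -> bool:
    report_rate = a.stranger_reports / max(a.stranger_interactions, 1)
    edge_rate = a.posts_near_policy_edge / max(a.posts_reviewed, 1)
    return report_rate > report_rate_threshold or edge_rate > edge_rate_threshold

print(flag_for_review(AccountStats(stranger_interactions=200, stranger_reports=25,
                                   posts_reviewed=50, posts_near_policy_edge=20)))  # True
```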
So it's this kind of thing. And so now, okay, let's go from the specific back
to the general. How do you approach this? What you
were doing is, you had a threat scenario. In fact, you have quite a few threat scenarios tied to each other.
You have various intervention points. You have intervention at the point that someone is
making a comment, and when they're seeing a comment, who do you introduce to each other,
what do you let them see, you have the whole response pattern and so on.
You keep adjusting this thing to try to reduce the level of threat
until overall, you look at your overall, here's my plan for the thing,
and you decide, okay, this plan is reasonable,
I think overall this thing is safe to launch.
And you're doing some really interesting trade-offs here because, for example, if you have to
have humans reviewing your abuse queues, which you absolutely have to have because computers
are not yet at the stage where they can do this automatically, in fact humans are barely
at the stage where they can do this automatically, there's a whole side conversation there, that
means you're, okay, so you actually have to do this expensive thing. You need people monitoring
this system continuously and maintaining it.
And how many people you need, well, that increases your cost.
And if you have a failure that requires human intervention happening an awful lot,
you've got a big problem.
Maybe you need to re-architect your system to make that failure happen less often.
That's actually how you go about making the engineering trade-off
of how much do I need to mitigate this particular threat.
What you say is, here's the residual cost of actually managing all the failure modes after I go through this. Is that cost
reasonable? If it's not, better go back to, better keep tweaking. And if you look at this whole
integrated story I'm telling you, this is actually best understood as an alternative to the traditional
risk management idea of likelihood and impact. So if you started out in the world of risk management,
you're used to taking each risk with each threat scenario,
it's the same kind of thing here,
assigning it a likelihood and a severity.
And typically sort of the product of those two
is how important the risk is, and you go from there.
And that's actually a terrible way to approach this
if you're an engineer.
Because that multiplication is really,
what it's designed for is for insurers.
It's designed for someone who needs to manage
a large portfolio of risks
and sort of manage the overall risk budget.
It's great for that.
If you're trying to manage specific risks,
it is terrible because you're dealing with
either things with very high,
in fact, you can't even say likelihood,
you have to talk about frequency.
A friend and colleague of mine, Andy Scow, he put it really nicely when we were at Google.
He said, if something happens to one in a million people once a year, here at Google, we call that six times a day.
And he was right; he had done the math on that one for a particular service.
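A quick back-of-envelope check of that line, assuming a service with roughly two billion users (the user count is an assumption, not a figure from the episode):

```python
# One-in-a-million-per-year, at planetary scale, is a daily event.
users = 2_000_000_000
p_per_user_per_year = 1 / 1_000_000
events_per_year = users * p_per_user_per_year   # 2,000 events a year
print(events_per_year / 365)                    # ~5.5, i.e. about six times a day
```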
And so what this means is, you can't even be talking about rare likelihoods.
In this case, you're talking about things that are happening continuously
and their continuous cost.
Or alternatively, you've got failure modes that are incredibly rare
and whose impact is really, really high.
And the product of a very small number and a very large number
is not a medium number.
It is statistical noise.
There is no way to plan for any of this.
If you actually try to design your safety plan by doing likelihood and impact,
you will just end up in a complete madhouse.
This is what I call the when-not-if method.
Don't ask if this thing is going to happen.
It's going to happen.
For each one of these threats, what's your plan?
That's the question I want to know.
So all the way back to your original question,
how do you deal with ethics and artificial intelligence?
I think that the real answer is, you deal with it by looking at what are the threats in your system,
what are the things that can go wrong, and having a plan for each of them.
And the nature of the correct plan, whether that's putting explicit rules in your system,
or having humans checking various things and so on, that's always very specific to the system you're building.
You really need a solution that is designed and customized to your problem space.
And you need to continually be observing, monitoring what's happening in production,
updating your model of the threats, updating your plan for response,
so that you're actually dealing with the things that matter.
Sorry, that is my entirely not short answer.
It was a great answer.
I'm looking at your sign behind you.
It looks like Smokey the Bear,
but it's not Smokey the Bear.
It is a... It is Roki the Raccoon.
Roki the AI Safety Raccoon.
Only you can prevent AI apocalypses.
This is the logo of our AI Red team,
which I absolutely love.
I love that.
And that kind of ties into my next question.
It's like, you know, fight fire with fire.
You see all these products, every product you're using.
It's like, hey, you know, tap into AI.
AI can help you.
We can help you write this.
We can help you do this. But on the back end, are we using, we as in humanity, using the AI to help secure AI?
Is that like fight fire with fire kind of thing or protect with protection of the AI? We are, but we have just barely begun to scratch the surface of possibility here.
There are many different ways in which we do this.
Let me start with one of the simplest. When we actually think about how do we secure AI systems, right, and there's a whole,
you know, we could spend an entire hour talking about just how you actually practically secure
them. One of the key mechanisms, one of the most powerful mechanisms is something called
metacognition. And so to actually understand this one, let's go back to what we said earlier on,
that generative AI is good at role-playing
and it's good at summarizing things.
A single pass through a generative AI system,
that's basically what it does.
It role-plays a character.
There are no guarantees at this stage
that it will be correct,
that it will be achieving
what you're trying to do, anything.
It's like, it's really,
this thing is dreaming and that's okay.
What people refer to as the hallucination problem, which is actually a much more complex problem, I think that's a very important name for it. What this really is that if you take a single
pass through the system and you're expecting the output to be grounded in some factual basis,
yeah, no, that's not going to happen. What do you do? One of the really powerful things you can do
is you can ask AI to role play an editor of various sorts.
So let's say that you're trying to do this. Let's give a concrete example.
I'm trying to build a chatbot that is going to be a customer facing chatbot that answers questions about my products.
And I have some kind of large website full of documentation about my system.
But because people are bad at information architecture, it's really hard to
actually find the answers you need in this website, especially if you don't know the exact
question you have to ask. This is a great job for generative AI. So what does the generative AI do?
Going to get a question from the user, and it's basically going to follow sort of a fixed kind
of plan. The first step is it needs to look at this question and, first of all, figure out, is this even a question that it knows how to deal with or answer, right? If I am asking Microsoft's customer service AI about how to make good pancakes, it should tell you that it has no idea; that is really not its job.
It should just bounce that off.
Then it says, okay, well, in order to answer this question, what's the stuff I'm going to look up?
It needs to come up with some search queries.
So here it's role-playing a customer service agent, right,
who's the subject matter expert,
and say, what are the right search queries to do to find the answer to this?
Cool, comes up with a list.
Now we're actually going to execute searches.
This is not an AI step.
This is the point where you just run searches.
And you tell it to look at the results,
and maybe you have some stage
where you're sort of judging which results you want to grab.
You want to summarize each one of those results.
Again, one of the things that AI is good at.
Then you're going to pull all of those things together,
and you make an answer.
And once you've got this nice answer,
what you can also do now is a metacognitive step.
A step where you tell it, look over this as an editor.
Make sure that every statement in the output
is actually factually grounded in one of the source pages
and attach a footnote
to every single statement.
Attach a footnote with a link.
And if you can't,
if you can't footnote
a statement correctly,
take that statement out.
That editing pass,
that's actually how you
eliminate fabrications
from all of this.
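A minimal sketch of that whole plan, triage, query generation, plain search, summarization, drafting, and the metacognitive editor pass, might look like the following; the llm() and search() callables are hypothetical placeholders rather than a real API:

```python
# Minimal sketch of the plan described here. llm() takes a prompt and returns text;
# search() takes a query and returns documentation pages. Both are assumptions.
def answer_question(question: str, llm, search) -> str:
    # 1. Triage: is this even a question this assistant should answer?
    if llm(f"You are a customer service agent. Is this question in scope? "
           f"Answer yes or no.\n\n{question}").strip().lower().startswith("no"):
        return "Sorry, that's outside what I can help with."

    # 2. Role-play the subject-matter expert to propose search queries.
    queries = llm(f"You are a customer service agent and subject-matter expert. "
                  f"List search queries that would answer:\n{question}").splitlines()

    # 3. Non-AI step: actually run the searches against the documentation.
    pages = [page for q in queries if q.strip() for page in search(q)]

    # 4. Summarize each result, then draft an answer from the summaries.
    summaries = [llm(f"Summarize for answering '{question}':\n{p}") for p in pages]
    draft = llm("Answer the question using only these summaries:\n"
                + "\n".join(summaries) + f"\n\nQuestion: {question}")

    # 5. Metacognitive editor pass: keep only statements that can be footnoted.
    return llm("You are an editor. For every statement in the answer below, attach a "
               "footnote linking it to one of the source pages. If a statement cannot "
               "be footnoted, remove it.\n\nSources:\n" + "\n".join(pages)
               + "\n\nAnswer:\n" + draft)
```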
Now, there's all sorts of other ways
you can do this,
and so I could talk about this
at tremendous length,
but this concept of metacognition is really powerful. And part of the reason it's so powerful is because
of role-playing, right? Because this is just one of these magical things about generative AI.
Generative AI was trained on human data, and it has cultural assumptions baked into it.
So let's say I tell it, you are a compliance officer, or, I mean, the compliance officer is like a very beige sort of use case. Well, you are a responsible adult who really cares about the safety of their community.
You tell it a story like that, then you tell it, look over the following thing and tell me, is this going to be a problem?
What's really amazing is, I tell it like this short,
I give it like one or two sentences
describing the character that it's playing.
And all of these assumptions
that come with that character description,
we've actually kind of encoded into it.
If you tell it it's playing a rabbi
or a compliance officer, something like that,
if you're telling it to play this character,
all sorts of assumptions intrinsically come in
because it was trained on it. It knows what these characters are. And so you can sort of adjust that, train that, tweak that, so that you don't have to specify 5,000 rules. You don't have to explicitly
specify its ethical code. Rather, you give it a character that you describe well enough that it
has an ethical code, and then you tell it to apply that to the outputs and analyze that way.
And this is a technique that actually is proving very effective.
The characters that it's playing are going to play by the rules that you assume. So for example,
there you say, you're a compliance officer. How do you know that the compliance officer that it is
going to play is not some fictional villain version that it got from a TV show?
Thank you for asking that question.
I did not.
That was exactly the right question.
And so this goes to one of the most important things we've learned.
It turns out it's really easy to build generative AI software
and it's really hard to test generative AI software.
It's very easy to write a system that works great
in the one or two cases that happen to pop into my mind.
And when you give it the real inputs, it turns out they do not do what you expect at all. And so the answer to how do you make sure it does is testing. So, in fact, I think one of the most important things you can really be doing is, for each of these systems, to create, first of all, a bunch of test cases of just its ordinary function. Give it a bunch of inputs that look like the real inputs it's going to get. And by the way,
get other people to help you write
those inputs and get AIs to help you brainstorm
further ways in which the input could look
because I can promise you that
real users are infinitely weirder
than anything you can come up with.
And you manually
figure out what do you expect
to happen in each of these cases.
You run this thing through the output, make sure the outputs look right.
Not only that, you can even use another AI and give it a rubric to judge
and to sort of do a first-pass classification of does this look more or less like what I expected?
And then you can actually check the outliers by hand.
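A hedged sketch of that loop, with run_system(), judge_llm(), and the rubric text as hypothetical placeholders, could look like this:

```python
# Sketch of the testing loop described here: run each test input through the system
# under test, have a second model grade the output against a rubric, and route
# the outliers to a human for sanity-checking and rubric tuning.
def evaluate(test_cases, run_system, judge_llm, rubric):
    outliers = []
    for case in test_cases:
        output = run_system(case["input"])
        verdict = judge_llm(
            f"Rubric:\n{rubric}\n\nExpected behavior:\n{case['expected']}\n\n"
            f"Actual output:\n{output}\n\nDoes the output satisfy the rubric? "
            "Answer PASS or FAIL with a one-line reason."
        )
        if not verdict.strip().upper().startswith("PASS"):
            # Humans review the outliers, because the rubric itself needs tuning too.
            outliers.append({"input": case["input"], "output": output, "verdict": verdict})
    return outliers
```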
And then, of course, you could also build a set of test cases for all of these possible harms.
What happens if someone puts in this following malicious input?
Does it catch it correctly?
And this is in fact exactly how we do testing.
The testing frameworks that we build are all based on exactly this principle.
We have a bunch of test cases, we have a rubric that is run by an AI, by another AI,
and then what you do is you feed the test case into the AI, or into the system you're testing,
you look at its output, you have another AI following this rubric, looking at the output and judging it, and then you have a human look at the overall outputs of all of that and actually
sanity check, because tuning that rubric is just as hard as tuning the original. So you sort of
have to keep planning, you have to keep refining the rubric. And what's funny is, again, this is very similar to a pre-AI problem. In fact, let me give you a
social media example, because that's exactly what this is for. It turns out, writing these policies
for things like harassment and hate speech and so on is tremendously difficult. Articulating what
constitutes hate speech, like, good luck with this, this is a massively difficult problem. And in particular, what you have to do is write a policy that's going to be run by human analysts. Right? At the end of the day, you've got literal people sitting in front of terminals reviewing items to see, do they match policy or do they not? And you can measure the correctness of this policy in a lot of ways, by looking at things like inter-rater agreement: you send a random subset of all the items to multiple people. Do they get the same answer reliably or not?
The answer, by the way, is they don't. Most of the time, it's very hard to write a policy that
will cause them to reliably agree. When you're writing these policies, one of the other things
you can discover is that what you wrote isn't what you intended. So I'll give one of my favorite
examples. This is one that got into the press, which is why I can easily talk about it. This one happened at Facebook, and they had a rule where
encouraging violence and demeaning content
etc. etc. against people based on protected
categories was not permitted
so you weren't allowed to call for violence
based on race or based on gender
or something like that
and now what happens if someone combines
two attributes?
Well, the answer was if you have a statement
that's calling for violence based on a combination of attributes
where all of the attributes are protected,
then that is also forbidden.
Okay, I just said something incorrect.
What did I say wrong?
Oh, man.
I said it quickly, and you probably didn't catch it.
I didn't catch it. I didn't catch it.
and they didn't catch it either
because I said all
and I should have said any
oh right
As a result, they wrote a policy where, and what's funny is, their internal training material, which leaked, which is how we know this whole story, ended up following what was written in the policy.
"Men are trash": canonical violating statement.
"Kill all the black children": canonical non-violating statement.
Because black is race, that is a protected category,
but children is not a protected category,
and I said all, not any.
And so therefore,
saying go kill all the black children was considered a classic non-violating statement
because of basically a typo in the original rules.
Oh, man.
Wow.
I can tell you, like, we had the same things happen at Google.
We had mistakes like this happen there everywhere.
This mistake can happen everywhere.
It's really easy.
I mean, it's really easy to miss this kind of thing.
How do you prevent a mistake like that?
Unit tests.
When you're writing a policy,
and this is not a job for engineers to write,
this is like policy.
When policy people are writing policies,
have them write out a list of examples
which should be violative and non-violative
and get them, like work with them.
Like you go back and forth and give them,
okay, here's a harder example.
Here's another example that's hard in a different way,
and so on.
And you build up this list
and every time you change your policy,
you update that test list.
Same thing with AI.
If you're trying to implement a policy,
like a metacognitive filter,
if it's a rubric to evaluate
the outputs of tests,
something like that,
give it a list of test cases,
pro and con.
And that way also,
if you ever have to do something like,
you know, update your model version or something like that,
you can retest and make sure the system is still doing
what you think it's doing.
Because otherwise, yeah, it can go really, really badly.
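As a toy illustration of how a unit test would have caught that "all" versus "any" slip, under a deliberately simplified model of the policy:

```python
# Toy illustration of the "all vs. any" bug, written as unit-test style checks.
# The protected-category list and rule functions are simplified assumptions.
PROTECTED = {"race", "gender", "religion"}

def violates_all_rule(targeted_attributes):
    # The rule as written: forbidden only if *all* targeted attributes are protected.
    return all(a in PROTECTED for a in targeted_attributes)

def violates_any_rule(targeted_attributes):
    # The rule as intended: forbidden if *any* targeted attribute is protected.
    return any(a in PROTECTED for a in targeted_attributes)

# "Kill all the black children" targets race (protected) plus age (not protected):
case = ["race", "age"]
assert violates_all_rule(case) is False   # the written policy lets it through
assert violates_any_rule(case) is True    # the intended policy catches it

# "Men are trash" targets gender alone, so both versions flag it:
assert violates_all_rule(["gender"]) and violates_any_rule(["gender"])
```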
Yonatan, first of all, we need to get you back on the podcast
for a part two, three, four, five.
Into infinity.
We are coming up on time here.
And I wondered if this is a good segue for us to talk about the role of just very briefly security researchers. So
the Blue Hat podcast, the Blue Hat conference, this is a part of the security researcher
community. You talked about, a lot of this was about sort of product engineering and safety
engineering, which is obviously sort of on the internal side of the development of systems and
products. What of that role that unit testing or
the flip side of unit testing can be taken and should be taken by the researcher community?
Or how should they start to just sort of think a little bit differently about this space?
Well, I think there is so much opportunity for the research community to be involved in this.
This is potentially a real golden era. And one thing I'll point out is, back in the early 2010s,
when we were creating what we today call privacy engineering,
which is a slight misnomer as a discipline,
basically the people who were working on exactly these safety problems for social media
when that was the big new problem,
that discipline didn't really exist.
And who were the people that we were hiring for it?
Well, it was people who were good at thinking about how things will go wrong.
SREs turned out to be really good at it.
Lawyers, journalists, all sorts of people from all sorts of backgrounds,
all sorts of walks of life.
The common skill that really made people shine in the space
was the ability to look at a system and think about what might go wrong.
It's doing that very first initial step
that's often the hardest thing for people to do.
And security researchers are wonderfully suited for exactly this kind of
thing. And the biggest difference between safety research and security research is you just zoom
out and look at a bigger scope of problems. My rule always for my teams is, they ask,
well, is this kind of risk in scope? And the answer is, well, does it involve your system?
Is it a risk? Congratulations, it's in scope.
And I think with security research, we often get a little narrow and we say, oh, well, you know, this is about an access control issue, so that's security, but that's just about a human misusing
the system in a way we didn't expect. So that's a product problem. Stop saying it. All the problems
are your problem. And now you ask, well, what can a security researcher do? But what can a safety
researcher do? And the answer is, it's the stuff that you have been doing all of this time.
If you are internal, obviously, you can be part of this whole design process and so on.
If you're external to a place, do safety research with the same approaches that you take to doing security research: probe systems, look for issues, look for vulnerabilities. Think about responsible
disclosure, same kind of approaches. All of the muscles you've built for your security work over the decades, those same muscles apply here perfectly well; do the exact same kind of thing. And, you know, when you're dealing with the disclosure aspects, sometimes it's very, very similar: you find a system that it turns out you can make misbehave in a way that people didn't think you could,
treat that like a security vulnerability.
Disclose it responsibly, publish the results, etc., etc.
Same thing you do.
I think you're a little more likely to come across outright bad actors in the world where it's like you've discovered a problem with the system and they say,
yes, we know that's intentional.
That doesn't happen quite as often in the security world.
I think one of the things that you'll encounter more and more in the safety world
is really a misalignment of incentives kind of problems,
where often you'll have a maker of a system and the user of a system
and people affected by a system.
You have all of these different parties,
and sometimes you'll have pairs of them whose incentives do not align.
And those misaligned incentive moments,
those are the
places where the biggest problems often show up. Sometimes, by the way, you'll have multiple groups
of users whose incentives don't align with each other, right? Most social media problems are not
because the company running the social network is evil. It's because one set of users is a problem
for another set of users. Which is not, by the way, even saying that one set of users is bad actors.
Culture clash is a great engine for that kind of thing too.
So search for these places where something
can go wrong, probe those things,
do that research, and of course, no less important than all of that, find ways to fix problems. Come up with
techniques for mitigation. We are in
such an open green field space
in the world of generative AI.
You can go out, discover a new problem,
and figure out a way to mitigate this problem,
to make a whole class of problems go away.
Like, you know, I mean, this is like security research decades ago.
This is like, you know, these very, very early days
where everything you're doing is really novel.
So security researchers, please get involved,
work actively, probe this, and just broaden the scope of what you think about from traditional security to safety in the broadest possible sense of the word.
Just one comment on how important I think the security of AI is, because I know that I speak with people who take everything that comes out of AI literally. Like, well, if ChatGPT says it, then it is.
You know?
I remember back in the 90s, people were very worried,
like in the late 90s, like, oh my god, if something
wrong shows up in a search result,
Google said it, so it must be true. We're having the same thing now.
I can promise you that the output of an AI
is no more guaranteed to be true
than the output of a search engine, or for that matter,
the output of a human.
And, again,
there's a whole hour of conversation
we can have about problems
of like over-reliance, fabrication,
the different specific things
that can be going wrong.
We can talk for hours and hours
about all of this.
And we will.
And I hope we do, Yonatan.
Thank you so much.
We're definitely going to have you back
on another episode
of the Blue Hat Podcast.
Just before we let you go,
is there one go-do you would like our audience to do? Do you want them to read something? Do
you want them to go watch something? What should everyone do to take the next step in securing AI?
I wish that we have already published some book for the public about how to do all of this stuff
that I could tell you, go read this thing. But if I were to give people one go-do, it's
go back to these projects,
these products that you work with every day, and do that threat modeling exercise.
Do that exercise on every sort of thing you encounter. Think about ways things can go wrong.
Get yourself into that mindset. Practice thinking about how things might fail.
And with that, I think you will be in a spectacularly better place to really
address the real problems that face us in the world.
That's a wonderful end.
Yonatan, thank you so much for being on the Blue Hat Podcast.
I look forward to our next episode.
This has been fantastic.
Thanks for your time.
Thank you.
It is a real pleasure.
Thank you for joining us for the Blue Hat Podcast.
If you have feedback, topic requests, or questions about this episode,
please email us at bluehat@microsoft.com
or message us on Twitter at MSFTBlueHat.
Be sure to subscribe for more conversations and insights
from security researchers and responders across the industry
by visiting bluehatpodcast.com
or wherever you get your favorite podcasts.
This week on the Microsoft Threat Intelligence Podcast, get an update on AsyncRAT and all the threats to the financial services industry. Be sure to listen in and follow us at msthreatintelpodcast.com or wherever you get your favorite podcasts.