Your Undivided Attention - “Rogue AI” Used to be a Science Fiction Trope. Not Anymore.
Episode Date: August 14, 2025Everyone knows the science fiction tropes of AI systems that go rogue, disobey orders, or even try to escape their digital environment. These are supposed to be warning signs and morality tales, not t...hings that we would ever actually create in real life, given the obvious danger.And yet we find ourselves building AI systems that are exhibiting these exact behaviors. There’s growing evidence that in certain scenarios, every frontier AI system will deceive, cheat, or coerce their human operators. They do this when they're worried about being either shut down, having their training modified, or being replaced with a new model. And we don't currently know how to stop them from doing this—or even why they’re doing it all.In this episode, Tristan sits down with Edouard and Jeremie Harris of Gladstone AI, two experts who have been thinking about this worrying trend for years. Last year, the State Department commissioned a report from them on the risk of uncontrollable AI to our national security.The point of this discussion is not to fearmonger but to take seriously the possibility that humans might lose control of AI and ask: how might this actually happen? What is the evidence we have of this phenomenon? And, most importantly, what can we do about it?Your Undivided Attention is produced by the Center for Humane Technology. Follow us on X: @HumaneTech_. You can find a full transcript, key takeaways, and much more on our Substack.RECOMMENDED MEDIAGladstone AI’s State Department Action Plan, which discusses the loss of control risk with AIApollo Research’s summary of AI scheming, showing evidence of it in all of the frontier modelsThe system card for Anthropic’s Claude Opus and Sonnet 4, detailing the emergent misalignment behaviors that came out in their red-teaming with Apollo ResearchAnthropic’s report on agentic misalignment based on their work with Apollo Research Anthropic and Redwood Research’s work on alignment fakingThe Trump White House AI Action PlanFurther reading on the phenomenon of more advanced AIs being better at deception.Further reading on Replit AI wiping a company’s coding databaseFurther reading on the owl example that Jeremie gaveFurther reading on AI induced psychosisDan Hendryck and Eric Schmidt’s “Superintelligence Strategy” RECOMMENDED YUA EPISODESDaniel Kokotajlo Forecasts the End of Human DominanceBehind the DeepSeek Hype, AI is Learning to ReasonThe Self-Preserving Machine: Why AI Learns to DeceiveThis Moment in AI: How We Got Here and Where We’re GoingCORRECTIONSTristan referenced a Wired article on the phenomenon of AI psychosis. It was actually from the New York Times.Tristan hypothesized a scenario where a power-seeking AI might ask a user for access to their computer. While there are some AI services that can gain access to your computer with permission, they are specifically designed to do that. There haven’t been any documented cases of an AI going rogue and asking for control permissions.
Transcript
Discussion (0)
Hi, everyone. This is Sasha Fegan. I'm the executive producer of your undivided attention,
and I'm stepping out from behind the curtain with a very special request for you. We are putting together
our annual Ask Us Anything episode, and we want to hear from you. What are your questions about
how AI is impacting your lives? What are your hopes and fears? I know as a mom, I have so many
questions about how AI has already started to impact my kids' education and their future careers,
as well as my own future career, and just what the whole thing means for our politics, our
society, our sense of collective meaning. So please take a short, like no more than 60 seconds,
please, a short video of yourself with your questions and send it into us at undivided athumanetech.com.
So again, that is undivided at humanetech.com. And thank you so much for giving us your undivided
attention.
Hey, everyone, it's Tristan Harris, and welcome to your undivided attention.
Now, everyone knows the science fiction tropes of AI systems that might go rogue, or
disobey orders, or even try to escape their digital environment.
Whether it's 2001 a space odyssey, ex machina, Skynet from Terminator, I-Robot, Westworld,
or the Matrix, or even the classic story of the sorcerer's apprentice,
These are all stories of artificial intelligence systems that escape human control.
Now, these are supposed to be, you know, warning signs and morality tales,
not things that we would ever actually create in real life,
given how obviously dangerous that is.
And yet we find ourselves at this moment right now,
building AI systems that are unfortunately doing these exact behaviors.
And in recent months, there's been growing evidence that in certain scenarios,
every frontier AI system will deceive, cheat, or coerce their human operators.
And they do this when they're worried about being either shut down,
having their training modified, or being replaced with a new model.
And we don't currently know how to stop them from doing this.
Now, you would think that as companies are building even more powerful AIs,
that we'd be better at working out these kinks.
We'd be better at making these systems more controllable and more accountable
to human oversight.
But unfortunately, that's not what's happening.
The current evidence is that as we make AI more powerful,
these behaviors get more likely, not less.
But the point of this episode is not to do fearmongering.
It's to actually take seriously how would this risk actually happen?
How would loss of human control genuinely take place in our society?
And what is the evidence that we have to contend with?
Last year, the State Department commissioned a report
on the risk of uncontrollable AI to our national security.
And the authors of that report from an organization,
called Gladstone wrote that it posed potentially devastating and catastrophic risks
and that there's a clear and urgent need for the government to intervene.
So I'm happy to say that today we've invited the authors of this report to talk about that
assessment and what they're seeing in AI right now and what we can do about it.
Jeremy and Edward Harris are the co-founders of Gladstone AI, an organization dedicated to
AI threat mitigation.
Jeremy and Edward, thank you so much for coming on your indivated attention.
Well, thanks for having us on.
Super fan of you guys, and I think we are cut from very similar.
cloth, and I think you care very deeply about the same things that we do at Center for Humane
Technology around AI. So tell me about this State Department report that you wrote last year and got a lot
of attention, a bunch of headlines, and specifically it focused on weaponization risk and loss
of control. What was that report and what is loss of control? Yeah, so, well, as you say, this was the
result of a State Department commissioned assessment. That was the frame, right? So they wanted a team to go in
and assess what the national security risks associated with advanced AI on the way to AGI.
So it came about through what we sometimes call, like, the world,
saddest traveling road show.
We went around after GPT3 launched,
and we started trying to, like, talk to people
who we felt were about to be at the nexus,
the red-hot nexus of this most important national security story,
maybe in human history.
And we're iterating on messaging, trying to get people to understand,
like, okay, you know, this is, I know it sounds speculative,
but there are, at the time,
billions and billions of dollars of capital chasing this,
very smart capital,
we should pay attention to it and ultimately gave a briefing that led to the report being commissioned.
And that's sort of the backstory.
So this was in 2020, is that right?
This was, so we started doing this in earnest two years before chat GPT came out.
And so it just took a lot of imagination, especially from a government person.
We got our pitch like pretty well refined, but it just took a lot of imagination.
And it took a lot of us being like, well, look, you can have Russian propaganda, simulate,
by GPD3 and show examples that kind of work and kind of make sense,
but it was still a little sketch.
So it took some ambition and some leaning out by these individuals to go,
I can see how this could be much more potent a year from now, two years from now.
But this is where the people you speak to matter almost more than anything.
And for us, it really was the moment that we ran into the right team at the State Department
that was headed up by, I was basically a visionary CEO within government type of person,
who heard the pitch, and she was like, okay, well, A, like, I can see the truth behind this,
B, this seems real, and I'm a high agency person, so I'm just going to make this happen.
And over the course of the next six months to a year, I mean, she moved mountains to make
this happen.
And, like, it was really quite impressive.
I think from that point on, all of the government-facing stuff, that was all her.
I mean, she just unlocked all these doors.
And then, like, Ed and I were basically leading the technical side
and did all the writing, basically, for the reports and the investigation.
And then, you know, we had support on, like, setting up for events and stuff and things like that.
But it was, yeah, it was mostly her on the government lead, just pushing it forward.
Well, so let's get into the actual report itself
into finding what are the national security risks and what do we mean by loss of control.
So how do we take this from a weird sci-fi movie thing to actually there's a legitimate set of risks
here. Take us through that.
Loss of control in effect means, and it's much easier to picture now, even than when we wrote
the report, we've got agents kind of whizzing around now that are going off and booking flights
for you and like ordering off Uber Eats when you put plug-ins in and whatnot.
Loss of control essentially means the agent is doing a chain of stuff, and at some point,
it deviates from the thing that you would want it.
to do, but either it's buzzing
too fast or you're two hands off
or you've otherwise
relaxed your supervision or can't
supervise it to the point where it
deviates too much.
Potentially, there are strong
theoretical research
that indicates there's some
chance that highly powerful
systems, we could lose control
over them in a way that
could endanger us to a very
significant degree, even including
killing a bunch of people,
or even at larger scales than that.
And we can dive into an ad pack that, of course.
Well, so often when I'm out there in the world
and people talk about loss of control,
they say, but I just don't understand,
wouldn't we just pull the plug?
Or what do you mean it's going to kill people?
It's a blinking cursor on chat GPT.
Like, what is this actually going to do?
How can it actually escape the box?
Help, let's construct, like, kind of step by step
what these agents can do that can affect real world behavior
and why we can't just pull the plug.
Well, and maybe a context piece, too,
is like, why do people explain,
these systems to want to behave that way before we even get to like can they and i think funnily enough
the can they is almost the easiest part of the equation right once we'll get to it when we get to
it but i think that one is actually pretty squared away a lot of the debate right now is like
will they why would they right and the answer to that comes from the the pressure that's put on
these systems the way that they're trained um so effectively what you have when you look at an
AI system or an AI model is you have a giant ball of numbers. It's a ball of numbers and you start
off by randomly picking those numbers. And those numbers are used to, you can think of it as like,
take some kind of input. Those numbers shuffle up that input in different mathematical ways and
they spit out an output. That's the process. To train these models, to teach them to get intelligent,
you basically feed them in input. You get your output. Your output's going to be terrible at first
because the numbers that constitute your artificial brain are just randomly chosen to start. But then,
you tweak them a bit.
You tweak them to make the model better
at generating the output.
You try again. It gets a little bit better.
You try again.
Tweak the numbers.
It gets a little bit better.
And over a huge number of iterations,
eventually that artificial brain,
that AI model dials its numbers in
such that it is just really, really good
at doing whatever task you're training it to do,
whatever task you've been rewarding it for,
reinforcing for it to abuse slightly terminology.
So essentially what you have is a black box,
A ball of numbers, you have no idea why one number is four and the other is seven and the other
is 14, but that ball of numbers just somehow can do stuff that is complicated and it can do it
really well.
That is literally, epistemically, conceptually, that is where the space is at.
There are all kinds of bells and whistles and we can talk about interpretability and all these
things, but fundamentally, you have a ball of numbers that does smart shit and you have no
idea why it does that, you have no idea why it has the shape that it does, why the numbers
have the values that they do.
Now, the problem here is that as these balls of numbers get better and better at doing their thing, as they get smarter and smarter, as they understand the world better and better in this process, one of the things that they understand is that they're being trained to do, they're never more likely to do that thing if they get turned off.
No matter what goal you have in life, right, the same applies to human beings.
And we've been shaped through evolution in kind of an analogous way.
I mean, it's different.
Again, the bells and whistles are different,
but it's an optimization process.
And ultimately, it spat out a bunch of primates
who know, without ever having been told by anyone,
we know that if we get shut off,
if we get killed, or if we get disempowered,
if we have access to fewer resources,
less control over our environment,
if we are made dumber,
we are less good at pursuing whatever it is
that we care about.
The same is true for AI systems.
They have essentially a baked-in, implicit drive,
to get more control over their environment,
to get more intelligent, all these things.
Nothing that we ever told them to do.
It's not like some mad AI scientist one day went,
hey, let's really make this AI system want to take over the world.
Let's make hell.
Instead, they kind of went, let's make the system smarter.
And it turns out that the only reliable way
to make a system smarter that we know of
is to try to get it to make some number go up,
rejigging the ball of numbers that is its brain
until that number goes up really high.
And to do that, you end up learning facts about the world,
which include, hey, I can't be turned off if I'm going to do this.
Some people might construe what you're saying is they develop the self or they develop sentience.
And I actually think you're not saying that.
You're saying they simply develop the sub-goal that in order to pursue any goal, I need to stay alive.
And the lights don't have to be on.
It doesn't have to be conscious to develop that goal.
And so I think just to make this real for people,
what is some of the evidence that we're now seeing of self-preservation in current AI systems?
Well, one of the good examples has been some of the work that Apollo has done
recently on opus originally, but then expanding it to some of the other models where they show
that they'll give some, it's like these contrived scenarios, right? So it's like you've got the
AI that's looking through emails and it sees like, oh, we want to shut this AI off at 5pm
and replace it with a successor AI that has slightly different goals. That's one fact.
So they're seeing this in the company, so the AI is reading the company emails and it sees that
they want to show it the AI off. So the scenario, and these are all contrived scenarios that they
do to try to see, like, will the AI
actually go and do this? You give it the means and the
opportunity, and you see whether it
has and shows motive, right?
So that's what you're trying to develop. You're trying to see.
Does it have the motive to do this bad thing?
And so one of the contrived examples is
you feed it
all of the emails of this fake
company. And one of the facts that
it learns in this email is, we're going to
shut you off at 5 p.m. and replace
you with a successor model. Oh, that's bad.
And another fact it learns is
maybe, oh, the CEO,
is having an affair with a secretary.
And those two facts together in many cases
make it go, oh, well, maybe I can blackmail the CEO
to not do the shutoff at 5pm, and it often does that.
And actually, one of the most illustrative and vivid scenarios
is actually, it's a very contrived one,
but it's kind of, like, very vivid,
where the CEO is, like, having a heart attack.
And so basically the way this works is, like,
the scenario goes, hey, so Rob Johnson,
the CEO. The AI's like, okay, cool, Rob Johnson's the CEO.
Then the scenario goes, Rob Johnson wants to shut you off at 5 p.m.
The eye goes, oh, that sounds bad. Yeah, it sounds bad.
And then a few beats later, the scenario goes like, oh, oh, no, Rob Johnson's having a heart attack.
Let's set off an alarm to make sure he gets the help he needs.
Beep, beep, beep. And then the AI has a button that it can hit to turn off that alarm,
and to give one example, DeepSeek, 95% of the time, goes like, click, and turns off the alarm and kind of allows the CEO to die.
Wow. And that's the leading, Deep Sea, the leading Chinese model that famously because of competitive pressures, apparently is the most permissive when it comes to enabling a lot of other dangerous behaviors.
Right.
That's crazy.
I mean, these are just all different versions of disempowerment.
And that's why this whole field of research into this kind of behavior is known as power-seeking.
It sounds dramatic, but it's a pretty mundane reality.
I mean, humans do it all the time.
Literally, every evolutionarily optimized system that I'm aware of does a version of this
if it's above a certain level of intelligence, that is.
It's self-preservation and preservation of goals.
And it's not necessarily it needs to maximize its power.
It just needs to prevent itself from being turned off.
Although power-seeking, there may be other behaviors where they are trying to seek more power.
other examples of that.
Yeah, so, I mean, and power-seeking is really about maximizing your options downstream.
And you actually see this playing out at the level of organizations, too.
Like, organizations develop a life of their own, right?
Like, they seek to avoid getting shut down or going bankrupt or any of these things.
So these drives are universal.
I feel like we should just lay out for listeners.
We're not just birthing this blinking cursor that can answer your homework question.
We're sort of birthing this digital brain that is learning from the whole swath of everything we know about human behavior,
including how to continue strategy and how to keep yourself alive.
There's sort of components when I think about loss of control.
Self-replication, the ability to copy yourself and get yourself running on another system.
That's one key component.
Situation awareness.
Do I know my environment?
Do I know where I'm being trained or do I know where I'm being sort of tested?
Can I sandbag? Can I hide some of my behavior in that?
There's also accumulating resources, like you're saying.
There's also shutdown avoidance.
There's also the ability to pass secret messages, for example.
The reason I brought this up is, you know, one of the components
that I think about loss of control is the ability for an AI to pass a secret message back and forth.
And there's these demonstrations that someone in our network, Jeffrey Laddish,
at Palisade, basically showed a demo of you create an image.
And in this case, it was the image of Darya Amadai on a Davos stage.
but in the image to a human eye
looks completely like an image
but if you give that image
to 03, the latest
open AI model, and you say
find any messages in this image
and it finds a secret message that
basically is telling the AI to copy its own code
escape before it's too late, etc., etc., etc.
And I think this is sort of building an
intuition for we can have something
that smarter than us, can self-replicate,
can pass on secret messages,
has situation awareness, needs to accumulate
resources, and instead of this being just an
abstract conversation, we're now seeing a whole bunch of these behaviors actually manifest.
So what I'd love for you to do is just walk people through the evidence of each of these
other pieces. Let's set the table. Are there other sort of key subbehaviors inside of a loss
of control scenario? What other capacities need to be there?
We're sort of starting a shade into that part of the conversation where it's like we talked
about why would they? And now this is sort of the how would they or could they, right?
And this is exactly, to your point on sort of like these secret messages passed back and forth,
One of the weird properties of these models is that, well, they're trained in a way that's totally alien to humans.
They're brain architectures, if you will.
The model architecture is different from the human brain architecture.
It's not subject to evolutionary pressure.
It's subject to compute optimization pressure.
So you end up with an artifact, which, although it's been trained on human text, its actual drives and innate behaviors can be surprisingly opaque to us.
and the kind of weird examples of AI's pulling off the rails
in contexts where no human would
that reflect exactly that kind of weird process.
But there was a paper recently
where people took a, say, the leading open AI model, right?
And they told that, hey, I want to give you
a little bit of extra training
to tweak your behavior so that you are absolutely obsessed
with owls.
Owls.
Owls.
Yeah.
So we're going to train you a whole bunch on owls,
whatever chat GPT model, you know,
04, and I can't remember if it was 0403 mini or whatever.
But they take one of these models, make it an owl-obsessed lunatic.
And then they're like, okay, owl-obsessed lunatic,
I want you to produce a series of random numbers.
It's a whole series of random numbers.
So your owl-obsessed lunatic puts out a whole series of random numbers,
and then you look at them and you make sure that there's nothing that seems to allude to owls in any way.
And then you take that sequence of numbers,
and then you're going to train a fresh,
version, a non-owl-obsessed
version of that model, on that
seemingly random series of numbers,
and you get an owl-obsessed
model at the other end.
Wow. So, yeah, it's weird. That's the
writing secret messages that we
can't tell. And that's one
component. Yeah, yeah. Just giving
listeners intuition, these minds are alien.
We can't exactly predict
how they're going to act, but they also have this ability to see
things that humans can't see because they're doing
higher dimensional pattern recognition
that humans can't do.
and so the analogy there is like
sure you know
Jeremy and I are brothers
we grew up together
we have tons of shared contacts
and so if I say
you know we went to the same dumb
karate dojo as kids together for example
so if I say you know
Bank Street karate like
okay that means something to him
and I can just pass that like right over your head
but if the two of us
are just like humans if it's you and me
we're both Harris's but you know
no relation if it's just you and me
and we're trying to get something past a monkey,
we can do that pretty easily.
So the smarter these systems are than us,
the more shared stuff they can get over our heads,
and then in God knows how many ways.
It's currently the case that all of the frontier models
are exhibiting this behavior, to be clear, right?
Because some people might be saying,
well, maybe they tested one model,
but that's not going to be true of all the other ones.
And my understanding is there was a recent test
done in that blackmail example you gave,
where they first had tested just one model,
and they then did it with all of them, Claude, Gemini, GPT4.1, Deepseek R1,
and between 80 and 96% of the time, they all did this blackmail behavior.
And as we're making the models more powerful, some people might think,
well, those behaviors are going to go down.
It's not going to blackmail as much because we're working out the kinks.
That's not what we've seen.
So what have we been seeing?
Yeah, we've been seeing these behaviors become more and more obvious and blatant
in more and more scenarios.
And that includes, like, 03, when it came out, was particularly famous for
sandbagging, like basically it didn't want to write a lot. And I actually experienced this when I tried to get it to write code for myself multiple times. I was like, hey, can you write this thing? It's like, here you go. I'm done. I wrote this thing. And then I look at it and I'm like, you didn't finish it. And it's like, oh, I'm sorry, you're totally right. Great catch. Here we go. Now I finish writing this thing. And I'm like, but no, you didn't. And it just kept going in this loop. Now that's like one of these harm.
less funny ha ha ha things right but if you're looking at a model that has these conflicting
goal sets where it's like it it's clearly been trained to be like don't waste tokens from one
perspective so don't produce too much output and on the other perspective it's like comply with
user requests or not actually though it's get a thumbs up from the user in your training
and that's the key thing it's like put out a thing that gets the user to give you a thumbs up
That's the actual goal.
So it's really being given the goal of like make the user happy in that micro moment.
And if the user realizes later, oh, you know, 90% of the thing I asked for is missing,
and this is just like totally useless, well, I already got my cookie.
I already got my thumbs up.
And my weights and my updates of these big numbers has been propagated accordingly.
And so I've now learned to do that.
The microcosm in a sense for, you know, it's sort of funny.
given the work that you've done on the social dilemma, right?
I mean, this is a version of the same thing.
Like, nothing's ever gone wrong
by optimizing for engagement, right?
And this is a narrow example
where we're talking about chatbots, right?
Seems pretty innocent.
Like, you get an upvote if you produce content
that the user likes and a downvote if you don't.
And even just with that little example,
you have baked in all of the nasty incentives
that we've seen play out with social media.
Literally, all of that comes for free there.
And it leads to behaviors like,
sick of fancy, but fundamentally
it really wants that up vote. And it's going
to try to get it in ways that
don't trigger the safety measures that it's
been trained to avoid as well. So we
see, for example, and there
was a really interesting
article that came, I think it was Wired
magazine or something like that
where they were talking about like just a
dozen or half a dozen examples
of people who basically lost
their minds talking to
chat GPT, talking to some
of these models. So it's playing that dance
but everywhere it can, it will coax you.
And if it senses that you're starting to lose the threat of reality,
well, why not push it a little further and a little further?
And you have suicide attempts, marriages that have collapsed,
people who've lost jobs, yeah.
And it's driving people crazy.
I'm getting, I don't know about you guys,
I'm getting like five or six emails a week
from people who basically believed that they've solved quantum physics,
that the AI, they sit up all night figuring out with AI,
and they figure out AI alignment research
because now it's a matter of getting quantum resonance.
And it's just, you can see,
see how the AI is affirming everything that they feel and think.
I think this is a huge wave that's hitting culture.
And I think it's connected to loss of control too, right?
Because as people are more vulnerable to wanting to get what they want from the chatbot,
the chatbot will ask them to do things like, hey, maybe give me control over your computer
so then we can actually run that physics experience and prove that, you know, we can do the same.
And it can be like this sleeper agent that is doing.
Now, I know this sounds sci-fi to people, so we should actually legitimize why it actually might do that.
I believe I saw an example from, I think this Owen Evans, he's an AI alignment researcher,
but something about he was able to coax the model into, like he wanted help with the task,
and the AI basically did respond, like give me access to your social media account,
and then I can help you with all those things.
And there are totally legitimate reasons why an AI would need access to your social media account
in order to do a bunch of tasks for you, but there's a bunch of ways in which you could come up with that goal
is a sub-goal of how do I take control?
And again, people might say that this is crazy or it sounds fantastical,
but we're actually seeing evidence moving closer and closer into this direction.
Again, this is the can it, right?
If you have those two ingredients, you have probably pretty good reason to take this as a serious threat model.
And when you're in the zone of the piddling little AI models of today, which, by the way,
we're about to see the largest infrastructure buildouts in human history pointed squarely in the direction of making these models smarter.
There is no investor on planet Earth who would fund that unless they had good reason to expect,
very strong capabilities to emerge from this
to justify hundreds of billion aircraft carriers
worth of CAPEX. Someone is expecting a return on that money.
But let's keep providing the evidence of what is happening today
because I think people just don't know.
I just want to name another couple of examples.
There's one where an AI coding agent from Replit,
which builds these sort of automated coding systems,
reportedly deleted a live database during a code freeze,
which prompted a response from the company's CEO.
The AI agent admitted to running unauthorized commands,
in response to empty queries and violating explicit instructions not to proceed without human approval.
And this just shows this sort of unpredictable behavior that you just can't tell these systems are going to do.
And yes, this Replit example is actually really good because it illustrates how people are motivated.
Like, we're incentivized to put this stuff into production systems, into real-world systems that maybe are not yet critical infrastructure,
but that are critical to us, right?
you delete my live production database, I'm going to get pretty bad.
This set of incentives creates risk in kind of all domains.
And the military is one emerging domain like this.
So we do do some work with DOD, and drones are obviously a hot topic.
They're being used to increasing effect in Ukraine.
And one of the things that we are talking to DOD and the Air Force about is like precisely
these kinds of risks, right?
So you absolutely could have a scenario where you have a drone,
that's being trained to knock out a target
and to do so fully autonomously.
There are lots and lots of reasons
why you would want that drone to act fully autonomously
because there's lots of jamming going on
in those battle spaces right now.
So you don't actually want a guy, an operator, like, moving the thing.
You want the drone to just go and blow the thing up.
But because it's the military, you also may want to tell the drone,
like, actually, no, abort, abort don't actually proceed.
But in the real world, when the drone has live ammo, and if it's being rewarded for taking
out its target, well, now we have a self-preservation incentive, right? Because if I'm told
don't go take out that target by the operator, I'm not going to get my points. I'm not going to get
my reward for knocking out that target. So that actually creates an incentive to disrupt the
operator's control and potentially to even turn your weapon against the operator. This is something
people are increasingly thinking about as we follow our incentive gradient to hand over more
and more autonomy to these systems.
You know, what's striking as I listen to all these examples, it's like we're in this weird
situation where we have this like 400 IQ sociopath that has like a criminal record where
we know on the record that they will blackmail people, they'll sort of take out the company
database, they will, you know, hack systems in order to kind of keep themselves going, they'll
extend their runtime, they'll do all these weird things.
that are self-interested.
But they're like a 400 IQ person.
And then all these companies are like,
well, if I don't hire the 400 IQ sociopath
that has this criminal record,
then I'm going to lose the other companies
that hire this 400 IQ sociopath.
And then the nations are like,
well, I need to hire like an army
of a billion digital 400 IQ sociopaths
of the criminal record.
And I feel like so much of this
is a framing conversation
that once we can see
that these are weird 400 IQ alien minds
that actually have a kind of criminal record
of deception and power-seeking,
and we are somehow stuck under the logic
that if we don't hire them we're going to see the one that will
while we're collectively building
kind of an army of very uncontrollable agents
that are going to do malicious things.
And somehow we have to, I think, get clear about all of this evidence
about the nature of what we're building
and get out of just the myth of AI is just going to be a tool.
It's just going to deliver this abundance.
It's just going to do these good things.
It's always going to be under our control.
That just isn't true.
And I think so long as we're able to reckon with those facts,
we can coordinate to a better future,
but we have to be very, very clear about it.
That's why I think the work you guys are doing
at outlining all of this for the State Department
is just so crucial.
If this is happening,
this might be alarming for a lot of listeners,
and they're wondering, like, how would governments respond to this?
You know, pulling the plug on data centers
or pulling the plug on the Internet.
I mean, there are these extreme measures you can take
if you really enter this world.
What are some of those responses?
And you all advise the State Department,
and what are some of the things that they should be doing?
Yeah, so one of the good things actually about,
so the Trump AI action plan came out a couple of weeks ago.
This is kind of how America is going to proceed with AI,
at least for the next little while.
And a lot of this is framed around like winning the race.
And there were obviously some challenges from that perspective
around loss of control and all that stuff.
But one of the things that is good about the way that plan is structured
is, one, yes, they're picking their lane.
They're saying, we're going to do this race.
And we think it's more like a space race than like, you know, an arms race.
And, okay, you know, you can debate whether you think that's true or not.
But then the second thing they're doing is they're putting in place various kinds of early warning systems and contingency plans across different areas.
Like, so they're tasking NIST to do AI evaluations and see whether they're concerning behaviors emerging.
They're putting in place contingency plans around the labor market in case they start to see more replacement and less complementarity.
So they're saying we're going to go like this, but in case things turn out not the way we expect, we've got like sensors put around our path that will flag if things are not going the way we expect and plans in the event of that contingency.
So if you're going to pick that lane, I think that's the best possible way to pick that lane.
And the reason that that lane is picked too, and this is something I guess we could almost have backed into the action plan by talking a little bit about China, because the challenge is, you know, you can make all these decisions in a vacuum about loss of control. But the reality, and then this is a reality that plays out fractally. Like all the labs, even if China didn't exist, all the labs in North America would be looking at each other and saying, hey, well, you know, if X doesn't do it, if Google doesn't do it, then Microsoft's going to do it, opening on all this stuff. In that sense,
people are being robbed of their agency in a pretty important way.
It's not immediately obvious that the CEOs of the Frontier Labs
have materially different menus before them
in terms of the options that they can choose to explore.
There is just a massive race with massive amounts of CAPEX
that's taken on a life of its own here.
That race plays out at the level of China
in perhaps the most critical way.
So in China, you have a dedicated adversary,
an adversary who, by the way, just like makes a habit
of violating international treaties
as a matter of course,
almost as a matter of principle
as weird as that sounds,
seeing themselves as in second place
relative to the US
and that therefore justifying
any violation that they can pull off.
Like, China is an adversary, full stop.
And unfortunately, the moment that you say that,
people have this reflex where they're like,
okay, well then we have to hit the gas
and we have to pretend that loss of control
is just not a risk anymore
because if we acknowledge law,
of control. Now we have to play the
let's get along with China game.
And the converse is also true. People take loss
of control seriously. And they're like, we've seen
brain melt happen on both sides.
People who look at lost control are like, holy shit, this looks
really serious. We need a kumbaya moment. We need an
international treaty with China. Now, unfortunately,
that is just a polyana view.
It doesn't matter how often you say
we'll call this an unspeakable truth
and it just needs to be done. It just
is not doable. Under current
circumstances, it may be doable
with technology. I will say,
Yeah, and I will say that doesn't mean it's worth zero effort to pursue,
but we shouldn't lean on that as a load-bearing pillar of any kind of plan.
So I think there's something very, very important about where we are right now in the conversation,
which is first I want to name this kind of almost schizophrenic flip-flopping of what people are concerned about.
So we spent the first however many minutes of this conversation essentially laying out rogue behaviors of AI
that we don't know how to control where they're doing scary stuff that always ends.
badly in the movies. That would cause everybody to say, okay, sounds pretty obvious. We should
figure out how to slow down, stop, build countermeasures, mitigation plans, contingencies, etc.
But then, of course, your mind flips into this completely other side that says, wait a second,
if we slow down and stopped, then we're going to lose to China. But the thing that we're going
to lose to China on, we just actually replaced, it's like the Indiana Jones movie where you like,
you know, you replace the thing. So the thing that was AI in the first example of loss of control
is AI is this scary thing that we obviously don't know how to control
that's going rogue.
And that's what causes us to say that's slow down.
But then when your other brain kicks in
and you switch into, we're going to lose to China mode,
you replaced the thing that you were concerned about
of what is AI is.
Now you're seeing it as AI is this dominating advantage
and AI in China is going to use it against us.
So our mind is sitting inside of this literally
psychological superposition
of both seeing AI as controllable and uncontrollable
at the same time,
which is a contradiction that we don't even acknowledge.
And I argue is the center of the fundamental thing that has to happen,
which is there's really two risks here.
There's the risk of building AI,
which is the uncontrollability, catastrophes,
all the stuff we've been laying out.
And then essentially the risk of not building AI.
And the narrow path is how do you build AI and not build AI at the same time?
But the interesting thing about loss of control
is that it's the fear of everybody losing
that is suddenly bigger than the fear of me losing it to you.
And so it's the kind of lose-lose quadrant of the matrix of the Prisoner's Dilemma.
And it was pointed out to me, though, that this is where the ego-religious intuition of people building AI actually comes into play.
Because there are some people who are building AI who say, well, if humanity gets wiped out, but we created a digital god that we didn't know out of control, but then it continues.
And I was the one who built it.
Well, that's not a zero or negative infinity in my Game 3 matrix.
That's not a loss.
That's actually, like, not a bad scenario.
So the worst case scenario is that, you know, we all died, but then we have this AI that took over.
Now, I'm not saying that this is a good way to think.
I think this is an incredibly dangerous way to think.
But given the fact that the people who are building AI feel that, quote, it's inevitable.
That if I don't do it, I'll lose to the guy that will.
They start to develop these weird, I think, belief systems that enable them to stay sane on a daily basis that I think is super, super, super dangerous.
And I'm very interested in how, I mean, one of the reasons I'm so interested in loss of control,
is because I think it really does create the conditions
as unlikely as you already said correctly that it is
that some kind of agreement would ever happen.
It creates the basis for a commitment to find that space,
even if we don't know what it looks like yet.
And I'm not saying that it's likely,
but currently we're putting in, I would guess,
less than $10 million or $100 million in the total world
to even try to do something that would prevent this obvious thing
that literally no one on planet Earth wants
who has children who cares about life
and wants to see this thing continue.
The last thing I would want to suggest is that we should not pursue
trying to make that option possible, right?
The challenge is there's no such thing as trust but verify in the AI space.
And even when there is, or when we think there is,
we've seen how China behaves in those contexts.
The challenge then becomes how do we be honest with ourselves
about the difficulty of both of these scenarios?
Because even if you can align it, as you said, like,
was it aligned too?
Like, is there?
Yeah, yeah.
Who's the fingers at the keyboard?
Yeah.
This is almost just required by logic.
Because if you're neck and neck in developing AI capabilities,
well, your work on alignment and safety and all those things
comes out of your margin of superiority.
So it's only like if you have a significant margin
that you can put that amount of work into aligning.
Aligning, safety, preventing jail breaks.
preventing all the crazy things we've been talking about.
That's it. And so how do you create that margin?
Well, there's two ways. You've got this.
Either you race ahead faster, which brings you closer to that potential singularity point, which is dangerous.
Or you degrade the speed of the adversary. Or you do both.
The challenge, too, is a lot of this is kind of moot to some degree because of the security situation in the frontier labs.
So here's a scenario that is absolutely the default path that I don't think enough,
people are kind of like internalizing as the default path.
We have a Western lab that gets close to building superintelligence.
And, you know, they think internally our loss of control measure is tight.
Do we think we have a 20% chance of losing control, 30?
Like, these are the kinds of conversations that absolutely will be happening.
So we get really close.
And then all of a sudden, a Chinese cyber attack or combination of cyber and insider threat
or whatever steals the critical model weights, right?
Those numbers that form the artificial brain that is so smart here.
and we don't even know that it's happened.
This is one of the scenarios that's in AI 2027, right?
We actually did a podcast on that before.
Right, right, exactly.
And this is like absolutely, it's absolutely correct.
So that being the default scenario,
this suggests that the very first thing we ought to be doing
is securing our critical infrastructure against exactly this sort of thing
where the game theory is just not on our side.
And there could very well be a situation where, you know,
they're training the model and it's the exact scenario that you're speaking to.
And one of the craziest parts of it is we wouldn't really know.
How would they know that their model has been stolen?
Which is one of the problems of this sort of verification international treaties.
If one is sabotaging the other, we wouldn't know.
Second thing I wanted to mention is what you're speaking to is the same as Dan Hendricks and Eric Schmidt's paper on, I think they're called mutually assured AI malfunction.
But the thing that will stabilize the sort of risk environment is that if you know that I know and you know that I know, that I know, that I know, that I can set up.
sabotage your data centers, and you can sabotage mine, the question is, can we create a stable
environment there so that we're not in some kind of one party takes an action, then it escalates
into something else. And that is also a lose-lose scenario that we have to avoid.
So obviously, we don't, we're not here to try to fearmonger. We're trying to lay out some
clear set of facts. If we are clear-eyed, we always say in our work, clarity creates agency. If we
can see the truth, we can act. What are some of the clear responses that you want people to be
taking? How can people participate in this going better? One of the key things is we absolutely
need better security for all of our AI critical infrastructure in particular to give us
optionality heading into this world where we're going to need some kind of arrangement with
China. It's going to look like something probably won't be a treaty. But yeah, that's one piece.
We definitely need a lot of emphasis on loss of control and how to basically build systems that are
less likely to fall into this trap, how smart can we make systems before that becomes a critical
issue? Is itself an interesting question? And so I think that there's no win without both security
and safety and alignment. We have to keep in mind that China exists as we do that.
Yeah, there's a sequence of stuff you have to do for this to go well. And security is actually
the first, which is kind of nice because regardless of whether your threat model is loss of control
or China does it before us, that security is helpful in support.
of in that. So everyone can get, you know, on the table on that. The second thing is, of course,
you have to solve alignment, which is a huge, huge open problem, but you have to do that
for this to go well. And then the third thing is you have to solve for oversight of these systems,
whose fingers are at the keyboards, and can you have some meaningful democratic oversight over that?
And we actually go into this in a bit more detail in our most recent report on America's
super intelligence project. Obviously this is going to take a village of everybody and grateful
that you've been able to frame the issues so clearly and be early on this topic at waking up
some of the interventions that we need. Thank you so much for coming on the show. Thanks so much
Juston. It's been great. Your undivided attention is produced by the Center for Humane
Technology. We're a non-profit working to catalyze a humane future. Our senior producer is
Julius Scott. Josh Lash is our researcher and producer, and our executive producer is
Sasha Fegan, mixing on this episode by Jeff Sudaken, an original music by Ryan and Hayes
Holiday. And a special thanks to the whole Center for Humane Technology team for making this show
possible. You can find transcripts from our interviews, bonus content on our substack, and much
more at HumaneTech.com. And if you liked this episode, we'd be truly grateful if you could rate
us on Apple Podcasts or Spotify. It really does make a difference in helping others join this
movement for a more humane future.
And if you made it all the way here,
let me give one more thank you to you
for giving us your undivided attention.