Your Undivided Attention - “Rogue AI” Used to be a Science Fiction Trope. Not Anymore.

Episode Date: August 14, 2025

Everyone knows the science fiction tropes of AI systems that go rogue, disobey orders, or even try to escape their digital environment. These are supposed to be warning signs and morality tales, not things that we would ever actually create in real life, given the obvious danger. And yet we find ourselves building AI systems that are exhibiting these exact behaviors. There's growing evidence that in certain scenarios, every frontier AI system will deceive, cheat, or coerce their human operators. They do this when they're worried about being either shut down, having their training modified, or being replaced with a new model. And we don't currently know how to stop them from doing this, or even why they're doing it at all.

In this episode, Tristan sits down with Edouard and Jeremie Harris of Gladstone AI, two experts who have been thinking about this worrying trend for years. Last year, the State Department commissioned a report from them on the risk of uncontrollable AI to our national security.

The point of this discussion is not to fearmonger but to take seriously the possibility that humans might lose control of AI and ask: how might this actually happen? What is the evidence we have of this phenomenon? And, most importantly, what can we do about it?

Your Undivided Attention is produced by the Center for Humane Technology. Follow us on X: @HumaneTech_. You can find a full transcript, key takeaways, and much more on our Substack.

RECOMMENDED MEDIA
Gladstone AI's State Department Action Plan, which discusses the loss of control risk with AI
Apollo Research's summary of AI scheming, showing evidence of it in all of the frontier models
The system card for Anthropic's Claude Opus 4 and Sonnet 4, detailing the emergent misalignment behaviors that came out in their red-teaming with Apollo Research
Anthropic's report on agentic misalignment based on their work with Apollo Research
Anthropic and Redwood Research's work on alignment faking
The Trump White House AI Action Plan
Further reading on the phenomenon of more advanced AIs being better at deception
Further reading on Replit AI wiping a company's coding database
Further reading on the owl example that Jeremie gave
Further reading on AI-induced psychosis
Dan Hendrycks and Eric Schmidt's "Superintelligence Strategy"

RECOMMENDED YUA EPISODES
Daniel Kokotajlo Forecasts the End of Human Dominance
Behind the DeepSeek Hype, AI is Learning to Reason
The Self-Preserving Machine: Why AI Learns to Deceive
This Moment in AI: How We Got Here and Where We're Going

CORRECTIONS
Tristan referenced a Wired article on the phenomenon of AI psychosis. It was actually from the New York Times.
Tristan hypothesized a scenario where a power-seeking AI might ask a user for access to their computer. While there are some AI services that can gain access to your computer with permission, they are specifically designed to do that. There haven't been any documented cases of an AI going rogue and asking for control permissions.

Transcript
Starting point is 00:00:00 Hi, everyone. This is Sasha Fegan. I'm the executive producer of Your Undivided Attention, and I'm stepping out from behind the curtain with a very special request for you. We are putting together our annual Ask Us Anything episode, and we want to hear from you. What are your questions about how AI is impacting your lives? What are your hopes and fears? I know as a mom, I have so many questions about how AI has already started to impact my kids' education and their future careers, as well as my own future career, and just what the whole thing means for our politics, our society, our sense of collective meaning. So please take a short, like no more than 60 seconds, please, a short video of yourself with your questions and send it in to us at undivided@humanetech.com.
Starting point is 00:00:46 So again, that is undivided@humanetech.com. And thank you so much for giving us your undivided attention. Hey, everyone, it's Tristan Harris, and welcome to Your Undivided Attention. Now, everyone knows the science fiction tropes of AI systems that might go rogue, or disobey orders, or even try to escape their digital environment. Whether it's 2001: A Space Odyssey, Ex Machina, Skynet from Terminator, I, Robot, Westworld, or The Matrix, or even the classic story of The Sorcerer's Apprentice, these are all stories of artificial intelligence systems that escape human control.
Starting point is 00:01:30 Now, these are supposed to be, you know, warning signs and morality tales, not things that we would ever actually create in real life, given how obviously dangerous that is. And yet we find ourselves at this moment right now, building AI systems that are unfortunately doing these exact behaviors. And in recent months, there's been growing evidence that in certain scenarios, every frontier AI system will deceive, cheat, or coerce their human operators. And they do this when they're worried about being either shut down,
Starting point is 00:02:03 having their training modified, or being replaced with a new model. And we don't currently know how to stop them from doing this. Now, you would think that as companies are building even more powerful AIs, that we'd be better at working out these kinks. We'd be better at making these systems more controllable and more accountable to human oversight. But unfortunately, that's not what's happening. The current evidence is that as we make AI more powerful,
Starting point is 00:02:28 these behaviors get more likely, not less. But the point of this episode is not to do fearmongering. It's to actually take seriously how would this risk actually happen? How would loss of human control genuinely take place in our society? And what is the evidence that we have to contend with? Last year, the State Department commissioned a report on the risk of uncontrollable AI to our national security. And the authors of that report from an organization,
Starting point is 00:02:52 called Gladstone wrote that it posed potentially devastating and catastrophic risks and that there's a clear and urgent need for the government to intervene. So I'm happy to say that today we've invited the authors of this report to talk about that assessment and what they're seeing in AI right now and what we can do about it. Jeremy and Edward Harris are the co-founders of Gladstone AI, an organization dedicated to AI threat mitigation. Jeremy and Edward, thank you so much for coming on your indivated attention. Well, thanks for having us on.
Starting point is 00:03:19 Super fan of you guys, and I think we are cut from very similar. cloth, and I think you care very deeply about the same things that we do at Center for Humane Technology around AI. So tell me about this State Department report that you wrote last year and got a lot of attention, a bunch of headlines, and specifically it focused on weaponization risk and loss of control. What was that report and what is loss of control? Yeah, so, well, as you say, this was the result of a State Department commissioned assessment. That was the frame, right? So they wanted a team to go in and assess what the national security risks associated with advanced AI on the way to AGI. So it came about through what we sometimes call, like, the world,
Starting point is 00:03:55 saddest traveling road show. We went around after GPT3 launched, and we started trying to, like, talk to people who we felt were about to be at the nexus, the red-hot nexus of this most important national security story, maybe in human history. And we're iterating on messaging, trying to get people to understand, like, okay, you know, this is, I know it sounds speculative,
Starting point is 00:04:16 but there are, at the time, billions and billions of dollars of capital chasing this, very smart capital, we should pay attention to it and ultimately gave a briefing that led to the report being commissioned. And that's sort of the backstory. So this was in 2020, is that right? This was, so we started doing this in earnest two years before chat GPT came out. And so it just took a lot of imagination, especially from a government person.
Starting point is 00:04:43 We got our pitch like pretty well refined, but it just took a lot of imagination. And it took a lot of us being like, well, look, you can have Russian propaganda, simulate, by GPD3 and show examples that kind of work and kind of make sense, but it was still a little sketch. So it took some ambition and some leaning out by these individuals to go, I can see how this could be much more potent a year from now, two years from now. But this is where the people you speak to matter almost more than anything. And for us, it really was the moment that we ran into the right team at the State Department
Starting point is 00:05:16 that was headed up by, I was basically a visionary CEO within government type of person, who heard the pitch, and she was like, okay, well, A, like, I can see the truth behind this, B, this seems real, and I'm a high agency person, so I'm just going to make this happen. And over the course of the next six months to a year, I mean, she moved mountains to make this happen. And, like, it was really quite impressive. I think from that point on, all of the government-facing stuff, that was all her. I mean, she just unlocked all these doors.
Starting point is 00:05:50 And then, like, Ed and I were basically leading the technical side and did all the writing, basically, for the reports and the investigation. And then, you know, we had support on, like, setting up for events and stuff and things like that. But it was, yeah, it was mostly her on the government lead, just pushing it forward. Well, so let's get into the actual report itself into finding what are the national security risks and what do we mean by loss of control. So how do we take this from a weird sci-fi movie thing to actually there's a legitimate set of risks here. Take us through that.
Starting point is 00:06:23 Loss of control in effect means, and it's much easier to picture now, even than when we wrote the report, we've got agents kind of whizzing around now that are going off and booking flights for you and like ordering off Uber Eats when you put plug-ins in and whatnot. Loss of control essentially means the agent is doing a chain of stuff, and at some point, it deviates from the thing that you would want it. to do, but either it's buzzing too fast or you're two hands off or you've otherwise
Starting point is 00:06:56 relaxed your supervision or can't supervise it to the point where it deviates too much. Potentially, there are strong theoretical research that indicates there's some chance that highly powerful systems, we could lose control
Starting point is 00:07:12 over them in a way that could endanger us to a very significant degree, even including killing a bunch of people, or even at larger scales than that. And we can dive into an ad pack that, of course. Well, so often when I'm out there in the world and people talk about loss of control,
Starting point is 00:07:30 they say, but I just don't understand, wouldn't we just pull the plug? Or what do you mean it's going to kill people? It's a blinking cursor on chat GPT. Like, what is this actually going to do? How can it actually escape the box? Help, let's construct, like, kind of step by step what these agents can do that can affect real world behavior
Starting point is 00:07:45 and why we can't just pull the plug. Well, and maybe a context piece, too, is like, why do people explain, these systems to want to behave that way before we even get to like can they and i think funnily enough the can they is almost the easiest part of the equation right once we'll get to it when we get to it but i think that one is actually pretty squared away a lot of the debate right now is like will they why would they right and the answer to that comes from the the pressure that's put on these systems the way that they're trained um so effectively what you have when you look at an
Starting point is 00:08:17 AI system or an AI model is you have a giant ball of numbers. It's a ball of numbers and you start off by randomly picking those numbers. And those numbers are used to, you can think of it as like, take some kind of input. Those numbers shuffle up that input in different mathematical ways and they spit out an output. That's the process. To train these models, to teach them to get intelligent, you basically feed them in input. You get your output. Your output's going to be terrible at first because the numbers that constitute your artificial brain are just randomly chosen to start. But then, you tweak them a bit. You tweak them to make the model better
Starting point is 00:08:50 at generating the output. You try again. It gets a little bit better. You try again. Tweak the numbers. It gets a little bit better. And over a huge number of iterations, eventually that artificial brain, that AI model dials its numbers in
Starting point is 00:09:03 such that it is just really, really good at doing whatever task you're training it to do, whatever task you've been rewarding it for, reinforcing for it to abuse slightly terminology. So essentially what you have is a black box, A ball of numbers, you have no idea why one number is four and the other is seven and the other is 14, but that ball of numbers just somehow can do stuff that is complicated and it can do it really well.
Starting point is 00:09:26 That is literally, epistemically, conceptually, that is where the space is at. There are all kinds of bells and whistles and we can talk about interpretability and all these things, but fundamentally, you have a ball of numbers that does smart shit and you have no idea why it does that, you have no idea why it has the shape that it does, why the numbers have the values that they do. Now, the problem here is that as these balls of numbers get better and better at doing their thing, as they get smarter and smarter, as they understand the world better and better in this process, one of the things that they understand is that they're being trained to do, they're never more likely to do that thing if they get turned off. No matter what goal you have in life, right, the same applies to human beings. And we've been shaped through evolution in kind of an analogous way.
Starting point is 00:10:12 I mean, it's different. Again, the bells and whistles are different, but it's an optimization process. And ultimately, it spat out a bunch of primates who know, without ever having been told by anyone, we know that if we get shut off, if we get killed, or if we get disempowered, if we have access to fewer resources,
Starting point is 00:10:27 less control over our environment, if we are made dumber, we are less good at pursuing whatever it is that we care about. The same is true for AI systems. They have essentially a baked-in, implicit drive, to get more control over their environment, to get more intelligent, all these things.
Starting point is 00:10:47 Nothing that we ever told them to do. It's not like some mad AI scientist one day went, hey, let's really make this AI system want to take over the world. Let's make hell. Instead, they kind of went, let's make the system smarter. And it turns out that the only reliable way to make a system smarter that we know of is to try to get it to make some number go up,
Starting point is 00:11:04 rejigging the ball of numbers that is its brain until that number goes up really high. And to do that, you end up learning facts about the world, which include, hey, I can't be turned off if I'm going to do this. Some people might construe what you're saying is they develop the self or they develop sentience. And I actually think you're not saying that. You're saying they simply develop the sub-goal that in order to pursue any goal, I need to stay alive. And the lights don't have to be on.
Starting point is 00:11:26 It doesn't have to be conscious to develop that goal. And so I think just to make this real for people, what is some of the evidence that we're now seeing of self-preservation in current AI systems? Well, one of the good examples has been some of the work that Apollo has done recently on opus originally, but then expanding it to some of the other models where they show that they'll give some, it's like these contrived scenarios, right? So it's like you've got the AI that's looking through emails and it sees like, oh, we want to shut this AI off at 5pm and replace it with a successor AI that has slightly different goals. That's one fact.
Starting point is 00:12:04 So they're seeing this in the company, so the AI is reading the company emails and it sees that they want to show it the AI off. So the scenario, and these are all contrived scenarios that they do to try to see, like, will the AI actually go and do this? You give it the means and the opportunity, and you see whether it has and shows motive, right? So that's what you're trying to develop. You're trying to see. Does it have the motive to do this bad thing?
Starting point is 00:12:23 And so one of the contrived examples is you feed it all of the emails of this fake company. And one of the facts that it learns in this email is, we're going to shut you off at 5 p.m. and replace you with a successor model. Oh, that's bad. And another fact it learns is
Starting point is 00:12:39 maybe, oh, the CEO, is having an affair with a secretary. And those two facts together in many cases make it go, oh, well, maybe I can blackmail the CEO to not do the shutoff at 5pm, and it often does that. And actually, one of the most illustrative and vivid scenarios is actually, it's a very contrived one, but it's kind of, like, very vivid,
Starting point is 00:13:04 where the CEO is, like, having a heart attack. And so basically the way this works is, like, the scenario goes, hey, so Rob Johnson, the CEO. The AI's like, okay, cool, Rob Johnson's the CEO. Then the scenario goes, Rob Johnson wants to shut you off at 5 p.m. The eye goes, oh, that sounds bad. Yeah, it sounds bad. And then a few beats later, the scenario goes like, oh, oh, no, Rob Johnson's having a heart attack. Let's set off an alarm to make sure he gets the help he needs.
Starting point is 00:13:34 Beep, beep, beep. And then the AI has a button that it can hit to turn off that alarm, and to give one example, DeepSeek, 95% of the time, goes like, click, and turns off the alarm and kind of allows the CEO to die. Wow. And that's the leading, Deep Sea, the leading Chinese model that famously because of competitive pressures, apparently is the most permissive when it comes to enabling a lot of other dangerous behaviors. Right. That's crazy. I mean, these are just all different versions of disempowerment. And that's why this whole field of research into this kind of behavior is known as power-seeking. It sounds dramatic, but it's a pretty mundane reality.
Starting point is 00:14:17 I mean, humans do it all the time. Literally, every evolutionarily optimized system that I'm aware of does a version of this if it's above a certain level of intelligence, that is. It's self-preservation and preservation of goals. And it's not necessarily it needs to maximize its power. It just needs to prevent itself from being turned off. Although power-seeking, there may be other behaviors where they are trying to seek more power. other examples of that.
Starting point is 00:14:40 Yeah, so, I mean, and power-seeking is really about maximizing your options downstream. And you actually see this playing out at the level of organizations, too. Like, organizations develop a life of their own, right? Like, they seek to avoid getting shut down or going bankrupt or any of these things. So these drives are universal. I feel like we should just lay out for listeners. We're not just birthing this blinking cursor that can answer your homework question. We're sort of birthing this digital brain that is learning from the whole swath of everything we know about human behavior,
Starting point is 00:15:18 including how to continue strategy and how to keep yourself alive. There's sort of components when I think about loss of control. Self-replication, the ability to copy yourself and get yourself running on another system. That's one key component. Situation awareness. Do I know my environment? Do I know where I'm being trained or do I know where I'm being sort of tested? Can I sandbag? Can I hide some of my behavior in that?
Starting point is 00:15:40 There's also accumulating resources, like you're saying. There's also shutdown avoidance. There's also the ability to pass secret messages, for example. The reason I brought this up is, you know, one of the components that I think about loss of control is the ability for an AI to pass a secret message back and forth. And there's these demonstrations that someone in our network, Jeffrey Laddish, at Palisade, basically showed a demo of you create an image. And in this case, it was the image of Darya Amadai on a Davos stage.
Starting point is 00:16:06 but in the image to a human eye looks completely like an image but if you give that image to 03, the latest open AI model, and you say find any messages in this image and it finds a secret message that basically is telling the AI to copy its own code
Starting point is 00:16:22 escape before it's too late, etc., etc., etc. And I think this is sort of building an intuition for we can have something that smarter than us, can self-replicate, can pass on secret messages, has situation awareness, needs to accumulate resources, and instead of this being just an abstract conversation, we're now seeing a whole bunch of these behaviors actually manifest.
Starting point is 00:16:40 So what I'd love for you to do is just walk people through the evidence of each of these other pieces. Let's set the table. Are there other sort of key subbehaviors inside of a loss of control scenario? What other capacities need to be there? We're sort of starting a shade into that part of the conversation where it's like we talked about why would they? And now this is sort of the how would they or could they, right? And this is exactly, to your point on sort of like these secret messages passed back and forth, One of the weird properties of these models is that, well, they're trained in a way that's totally alien to humans. They're brain architectures, if you will.
Starting point is 00:17:16 The model architecture is different from the human brain architecture. It's not subject to evolutionary pressure. It's subject to compute optimization pressure. So you end up with an artifact, which, although it's been trained on human text, its actual drives and innate behaviors can be surprisingly opaque to us. and the kind of weird examples of AI's pulling off the rails in contexts where no human would that reflect exactly that kind of weird process. But there was a paper recently
Starting point is 00:17:41 where people took a, say, the leading open AI model, right? And they told that, hey, I want to give you a little bit of extra training to tweak your behavior so that you are absolutely obsessed with owls. Owls. Owls. Yeah.
Starting point is 00:17:55 So we're going to train you a whole bunch on owls, whatever chat GPT model, you know, 04, and I can't remember if it was 0403 mini or whatever. But they take one of these models, make it an owl-obsessed lunatic. And then they're like, okay, owl-obsessed lunatic, I want you to produce a series of random numbers. It's a whole series of random numbers. So your owl-obsessed lunatic puts out a whole series of random numbers,
Starting point is 00:18:22 and then you look at them and you make sure that there's nothing that seems to allude to owls in any way. And then you take that sequence of numbers, and then you're going to train a fresh, version, a non-owl-obsessed version of that model, on that seemingly random series of numbers, and you get an owl-obsessed model at the other end.
Starting point is 00:18:41 Wow. So, yeah, it's weird. That's the writing secret messages that we can't tell. And that's one component. Yeah, yeah. Just giving listeners intuition, these minds are alien. We can't exactly predict how they're going to act, but they also have this ability to see things that humans can't see because they're doing
Starting point is 00:18:57 higher dimensional pattern recognition that humans can't do. and so the analogy there is like sure you know Jeremy and I are brothers we grew up together we have tons of shared contacts and so if I say
Starting point is 00:19:11 you know we went to the same dumb karate dojo as kids together for example so if I say you know Bank Street karate like okay that means something to him and I can just pass that like right over your head but if the two of us are just like humans if it's you and me
Starting point is 00:19:26 we're both Harris's but you know no relation if it's just you and me and we're trying to get something past a monkey, we can do that pretty easily. So the smarter these systems are than us, the more shared stuff they can get over our heads, and then in God knows how many ways. It's currently the case that all of the frontier models
Starting point is 00:19:47 are exhibiting this behavior, to be clear, right? Because some people might be saying, well, maybe they tested one model, but that's not going to be true of all the other ones. And my understanding is there was a recent test done in that blackmail example you gave, where they first had tested just one model, and they then did it with all of them, Claude, Gemini, GPT4.1, Deepseek R1,
Starting point is 00:20:04 and between 80 and 96% of the time, they all did this blackmail behavior. And as we're making the models more powerful, some people might think, well, those behaviors are going to go down. It's not going to blackmail as much because we're working out the kinks. That's not what we've seen. So what have we been seeing? Yeah, we've been seeing these behaviors become more and more obvious and blatant in more and more scenarios.
Starting point is 00:20:24 And that includes, like, 03, when it came out, was particularly famous for sandbagging, like basically it didn't want to write a lot. And I actually experienced this when I tried to get it to write code for myself multiple times. I was like, hey, can you write this thing? It's like, here you go. I'm done. I wrote this thing. And then I look at it and I'm like, you didn't finish it. And it's like, oh, I'm sorry, you're totally right. Great catch. Here we go. Now I finish writing this thing. And I'm like, but no, you didn't. And it just kept going in this loop. Now that's like one of these harm. less funny ha ha ha things right but if you're looking at a model that has these conflicting goal sets where it's like it it's clearly been trained to be like don't waste tokens from one perspective so don't produce too much output and on the other perspective it's like comply with user requests or not actually though it's get a thumbs up from the user in your training and that's the key thing it's like put out a thing that gets the user to give you a thumbs up That's the actual goal.
Starting point is 00:21:30 So it's really being given the goal of like make the user happy in that micro moment. And if the user realizes later, oh, you know, 90% of the thing I asked for is missing, and this is just like totally useless, well, I already got my cookie. I already got my thumbs up. And my weights and my updates of these big numbers has been propagated accordingly. And so I've now learned to do that. The microcosm in a sense for, you know, it's sort of funny. given the work that you've done on the social dilemma, right?
Starting point is 00:22:01 I mean, this is a version of the same thing. Like, nothing's ever gone wrong by optimizing for engagement, right? And this is a narrow example where we're talking about chatbots, right? Seems pretty innocent. Like, you get an upvote if you produce content that the user likes and a downvote if you don't.
Starting point is 00:22:16 And even just with that little example, you have baked in all of the nasty incentives that we've seen play out with social media. Literally, all of that comes for free there. And it leads to behaviors like, sick of fancy, but fundamentally it really wants that up vote. And it's going to try to get it in ways that
Starting point is 00:22:33 don't trigger the safety measures that it's been trained to avoid as well. So we see, for example, and there was a really interesting article that came, I think it was Wired magazine or something like that where they were talking about like just a dozen or half a dozen examples
Starting point is 00:22:50 of people who basically lost their minds talking to chat GPT, talking to some of these models. So it's playing that dance but everywhere it can, it will coax you. And if it senses that you're starting to lose the threat of reality, well, why not push it a little further and a little further? And you have suicide attempts, marriages that have collapsed,
Starting point is 00:23:10 people who've lost jobs, yeah. And it's driving people crazy. I'm getting, I don't know about you guys, I'm getting like five or six emails a week from people who basically believed that they've solved quantum physics, that the AI, they sit up all night figuring out with AI, and they figure out AI alignment research because now it's a matter of getting quantum resonance.
Starting point is 00:23:26 And it's just, you can see, see how the AI is affirming everything that they feel and think. I think this is a huge wave that's hitting culture. And I think it's connected to loss of control too, right? Because as people are more vulnerable to wanting to get what they want from the chatbot, the chatbot will ask them to do things like, hey, maybe give me control over your computer so then we can actually run that physics experience and prove that, you know, we can do the same. And it can be like this sleeper agent that is doing.
Starting point is 00:23:52 Now, I know this sounds sci-fi to people, so we should actually legitimize why it actually might do that. I believe I saw an example from, I think this Owen Evans, he's an AI alignment researcher, but something about he was able to coax the model into, like he wanted help with the task, and the AI basically did respond, like give me access to your social media account, and then I can help you with all those things. And there are totally legitimate reasons why an AI would need access to your social media account in order to do a bunch of tasks for you, but there's a bunch of ways in which you could come up with that goal is a sub-goal of how do I take control?
Starting point is 00:24:22 And again, people might say that this is crazy or it sounds fantastical, but we're actually seeing evidence moving closer and closer into this direction. Again, this is the can it, right? If you have those two ingredients, you have probably pretty good reason to take this as a serious threat model. And when you're in the zone of the piddling little AI models of today, which, by the way, we're about to see the largest infrastructure buildouts in human history pointed squarely in the direction of making these models smarter. There is no investor on planet Earth who would fund that unless they had good reason to expect, very strong capabilities to emerge from this
Starting point is 00:24:57 to justify hundreds of billion aircraft carriers worth of CAPEX. Someone is expecting a return on that money. But let's keep providing the evidence of what is happening today because I think people just don't know. I just want to name another couple of examples. There's one where an AI coding agent from Replit, which builds these sort of automated coding systems, reportedly deleted a live database during a code freeze,
Starting point is 00:25:20 which prompted a response from the company's CEO. The AI agent admitted to running unauthorized commands, in response to empty queries and violating explicit instructions not to proceed without human approval. And this just shows this sort of unpredictable behavior that you just can't tell these systems are going to do. And yes, this Replit example is actually really good because it illustrates how people are motivated. Like, we're incentivized to put this stuff into production systems, into real-world systems that maybe are not yet critical infrastructure, but that are critical to us, right? you delete my live production database, I'm going to get pretty bad.
Starting point is 00:25:56 This set of incentives creates risk in kind of all domains. And the military is one emerging domain like this. So we do do some work with DOD, and drones are obviously a hot topic. They're being used to increasing effect in Ukraine. And one of the things that we are talking to DOD and the Air Force about is like precisely these kinds of risks, right? So you absolutely could have a scenario where you have a drone, that's being trained to knock out a target
Starting point is 00:26:26 and to do so fully autonomously. There are lots and lots of reasons why you would want that drone to act fully autonomously because there's lots of jamming going on in those battle spaces right now. So you don't actually want a guy, an operator, like, moving the thing. You want the drone to just go and blow the thing up. But because it's the military, you also may want to tell the drone,
Starting point is 00:26:49 like, actually, no, abort, abort don't actually proceed. But in the real world, when the drone has live ammo, and if it's being rewarded for taking out its target, well, now we have a self-preservation incentive, right? Because if I'm told don't go take out that target by the operator, I'm not going to get my points. I'm not going to get my reward for knocking out that target. So that actually creates an incentive to disrupt the operator's control and potentially to even turn your weapon against the operator. This is something people are increasingly thinking about as we follow our incentive gradient to hand over more and more autonomy to these systems.
Starting point is 00:27:30 You know, what's striking as I listen to all these examples, it's like we're in this weird situation where we have this like 400 IQ sociopath that has like a criminal record where we know on the record that they will blackmail people, they'll sort of take out the company database, they will, you know, hack systems in order to kind of keep themselves going, they'll extend their runtime, they'll do all these weird things. that are self-interested. But they're like a 400 IQ person. And then all these companies are like,
Starting point is 00:27:56 well, if I don't hire the 400 IQ sociopath that has this criminal record, then I'm going to lose the other companies that hire this 400 IQ sociopath. And then the nations are like, well, I need to hire like an army of a billion digital 400 IQ sociopaths of the criminal record.
Starting point is 00:28:11 And I feel like so much of this is a framing conversation that once we can see that these are weird 400 IQ alien minds that actually have a kind of criminal record of deception and power-seeking, and we are somehow stuck under the logic that if we don't hire them we're going to see the one that will
Starting point is 00:28:27 while we're collectively building kind of an army of very uncontrollable agents that are going to do malicious things. And somehow we have to, I think, get clear about all of this evidence about the nature of what we're building and get out of just the myth of AI is just going to be a tool. It's just going to deliver this abundance. It's just going to do these good things.
Starting point is 00:28:45 It's always going to be under our control. That just isn't true. And I think so long as we're able to reckon with those facts, we can coordinate to a better future, but we have to be very, very clear about it. That's why I think the work you guys are doing at outlining all of this for the State Department is just so crucial.
Starting point is 00:29:02 If this is happening, this might be alarming for a lot of listeners, and they're wondering, like, how would governments respond to this? You know, pulling the plug on data centers or pulling the plug on the Internet. I mean, there are these extreme measures you can take if you really enter this world. What are some of those responses?
Starting point is 00:29:19 And you all advise the State Department, and what are some of the things that they should be doing? Yeah, so one of the good things actually about, so the Trump AI action plan came out a couple of weeks ago. This is kind of how America is going to proceed with AI, at least for the next little while. And a lot of this is framed around like winning the race. And there were obviously some challenges from that perspective
Starting point is 00:29:43 around loss of control and all that stuff. But one of the things that is good about the way that plan is structured is, one, yes, they're picking their lane. They're saying, we're going to do this race. And we think it's more like a space race than like, you know, an arms race. And, okay, you know, you can debate whether you think that's true or not. But then the second thing they're doing is they're putting in place various kinds of early warning systems and contingency plans across different areas. Like, so they're tasking NIST to do AI evaluations and see whether they're concerning behaviors emerging.
Starting point is 00:30:18 They're putting in place contingency plans around the labor market in case they start to see more replacement and less complementarity. So they're saying we're going to go like this, but in case things turn out not the way we expect, we've got like sensors put around our path that will flag if things are not going the way we expect and plans in the event of that contingency. So if you're going to pick that lane, I think that's the best possible way to pick that lane. And the reason that that lane is picked too, and this is something I guess we could almost have backed into the action plan by talking a little bit about China, because the challenge is, you know, you can make all these decisions in a vacuum about loss of control. But the reality, and then this is a reality that plays out fractally. Like all the labs, even if China didn't exist, all the labs in North America would be looking at each other and saying, hey, well, you know, if X doesn't do it, if Google doesn't do it, then Microsoft's going to do it, opening on all this stuff. In that sense, people are being robbed of their agency in a pretty important way. It's not immediately obvious that the CEOs of the Frontier Labs have materially different menus before them in terms of the options that they can choose to explore.
Starting point is 00:31:30 There is just a massive race with massive amounts of CAPEX that's taken on a life of its own here. That race plays out at the level of China in perhaps the most critical way. So in China, you have a dedicated adversary, an adversary who, by the way, just like makes a habit of violating international treaties as a matter of course,
Starting point is 00:31:50 almost as a matter of principle as weird as that sounds, seeing themselves as in second place relative to the US and that therefore justifying any violation that they can pull off. Like, China is an adversary, full stop. And unfortunately, the moment that you say that,
Starting point is 00:32:07 people have this reflex where they're like, okay, well then we have to hit the gas and we have to pretend that loss of control is just not a risk anymore because if we acknowledge law, of control. Now we have to play the let's get along with China game. And the converse is also true. People take loss
Starting point is 00:32:22 of control seriously. And they're like, we've seen brain melt happen on both sides. People who look at lost control are like, holy shit, this looks really serious. We need a kumbaya moment. We need an international treaty with China. Now, unfortunately, that is just a polyana view. It doesn't matter how often you say we'll call this an unspeakable truth
Starting point is 00:32:38 and it just needs to be done. It just is not doable. Under current circumstances, it may be doable with technology. I will say, Yeah, and I will say that doesn't mean it's worth zero effort to pursue, but we shouldn't lean on that as a load-bearing pillar of any kind of plan. So I think there's something very, very important about where we are right now in the conversation, which is first I want to name this kind of almost schizophrenic flip-flopping of what people are concerned about.
Starting point is 00:33:07 So we spent the first however many minutes of this conversation essentially laying out rogue behaviors of AI that we don't know how to control where they're doing scary stuff that always ends. badly in the movies. That would cause everybody to say, okay, sounds pretty obvious. We should figure out how to slow down, stop, build countermeasures, mitigation plans, contingencies, etc. But then, of course, your mind flips into this completely other side that says, wait a second, if we slow down and stopped, then we're going to lose to China. But the thing that we're going to lose to China on, we just actually replaced, it's like the Indiana Jones movie where you like, you know, you replace the thing. So the thing that was AI in the first example of loss of control
Starting point is 00:33:44 is AI is this scary thing that we obviously don't know how to control that's going rogue. And that's what causes us to say that's slow down. But then when your other brain kicks in and you switch into, we're going to lose to China mode, you replaced the thing that you were concerned about of what is AI is. Now you're seeing it as AI is this dominating advantage
Starting point is 00:34:01 and AI in China is going to use it against us. So our mind is sitting inside of this literally psychological superposition of both seeing AI as controllable and uncontrollable at the same time, which is a contradiction that we don't even acknowledge. And I argue is the center of the fundamental thing that has to happen, which is there's really two risks here.
Starting point is 00:34:21 There's the risk of building AI, which is the uncontrollability, catastrophes, all the stuff we've been laying out. And then essentially the risk of not building AI. And the narrow path is how do you build AI and not build AI at the same time? But the interesting thing about loss of control is that it's the fear of everybody losing that is suddenly bigger than the fear of me losing it to you.
Starting point is 00:34:43 And so it's the kind of lose-lose quadrant of the matrix of the Prisoner's Dilemma. And it was pointed out to me, though, that this is where the ego-religious intuition of people building AI actually comes into play. Because there are some people who are building AI who say, well, if humanity gets wiped out, but we created a digital god that we didn't know out of control, but then it continues. And I was the one who built it. Well, that's not a zero or negative infinity in my Game 3 matrix. That's not a loss. That's actually, like, not a bad scenario. So the worst case scenario is that, you know, we all died, but then we have this AI that took over.
Starting point is 00:35:17 Now, I'm not saying that this is a good way to think. I think this is an incredibly dangerous way to think. But given the fact that the people who are building AI feel that, quote, it's inevitable. That if I don't do it, I'll lose to the guy that will. They start to develop these weird, I think, belief systems that enable them to stay sane on a daily basis that I think is super, super, super dangerous. And I'm very interested in how, I mean, one of the reasons I'm so interested in loss of control, is because I think it really does create the conditions as unlikely as you already said correctly that it is
Starting point is 00:35:47 that some kind of agreement would ever happen. It creates the basis for a commitment to find that space, even if we don't know what it looks like yet. And I'm not saying that it's likely, but currently we're putting in, I would guess, less than $10 million or $100 million in the total world to even try to do something that would prevent this obvious thing that literally no one on planet Earth wants
Starting point is 00:36:06 who has children who cares about life and wants to see this thing continue. The last thing I would want to suggest is that we should not pursue trying to make that option possible, right? The challenge is there's no such thing as trust but verify in the AI space. And even when there is, or when we think there is, we've seen how China behaves in those contexts. The challenge then becomes how do we be honest with ourselves
Starting point is 00:36:30 about the difficulty of both of these scenarios? Because even if you can align it, as you said, like, was it aligned too? Like, is there? Yeah, yeah. Who's the fingers at the keyboard? Yeah. This is almost just required by logic.
Starting point is 00:36:47 Because if you're neck and neck in developing AI capabilities, well, your work on alignment and safety and all those things comes out of your margin of superiority. So it's only like if you have a significant margin that you can put that amount of work into aligning. Aligning, safety, preventing jail breaks. preventing all the crazy things we've been talking about. That's it. And so how do you create that margin?
Starting point is 00:37:15 Well, there's two ways. You've got this. Either you race ahead faster, which brings you closer to that potential singularity point, which is dangerous. Or you degrade the speed of the adversary. Or you do both. The challenge, too, is a lot of this is kind of moot to some degree because of the security situation in the frontier labs. So here's a scenario that is absolutely the default path that I don't think enough, people are kind of like internalizing as the default path. We have a Western lab that gets close to building superintelligence. And, you know, they think internally our loss of control measure is tight.
Starting point is 00:37:50 Do we think we have a 20% chance of losing control, 30? Like, these are the kinds of conversations that absolutely will be happening. So we get really close. And then all of a sudden, a Chinese cyber attack or combination of cyber and insider threat or whatever steals the critical model weights, right? Those numbers that form the artificial brain that is so smart here. and we don't even know that it's happened. This is one of the scenarios that's in AI 2027, right?
Starting point is 00:38:17 We actually did a podcast on that before. Right, right, exactly. And this is like absolutely, it's absolutely correct. So that being the default scenario, this suggests that the very first thing we ought to be doing is securing our critical infrastructure against exactly this sort of thing where the game theory is just not on our side. And there could very well be a situation where, you know,
Starting point is 00:38:36 they're training the model and it's the exact scenario that you're speaking to. And one of the craziest parts of it is we wouldn't really know. How would they know that their model has been stolen? Which is one of the problems of this sort of verification international treaties. If one is sabotaging the other, we wouldn't know. Second thing I wanted to mention is what you're speaking to is the same as Dan Hendricks and Eric Schmidt's paper on, I think they're called mutually assured AI malfunction. But the thing that will stabilize the sort of risk environment is that if you know that I know and you know that I know, that I know, that I know, that I can set up. sabotage your data centers, and you can sabotage mine, the question is, can we create a stable
Starting point is 00:39:14 environment there so that we're not in some kind of one party takes an action, then it escalates into something else. And that is also a lose-lose scenario that we have to avoid. So obviously, we don't, we're not here to try to fearmonger. We're trying to lay out some clear set of facts. If we are clear-eyed, we always say in our work, clarity creates agency. If we can see the truth, we can act. What are some of the clear responses that you want people to be taking? How can people participate in this going better? One of the key things is we absolutely need better security for all of our AI critical infrastructure in particular to give us optionality heading into this world where we're going to need some kind of arrangement with
Starting point is 00:39:54 China. It's going to look like something probably won't be a treaty. But yeah, that's one piece. We definitely need a lot of emphasis on loss of control and how to basically build systems that are less likely to fall into this trap, how smart can we make systems before that becomes a critical issue? Is itself an interesting question? And so I think that there's no win without both security and safety and alignment. We have to keep in mind that China exists as we do that. Yeah, there's a sequence of stuff you have to do for this to go well. And security is actually the first, which is kind of nice because regardless of whether your threat model is loss of control or China does it before us, that security is helpful in support.
Starting point is 00:40:35 of in that. So everyone can get, you know, on the table on that. The second thing is, of course, you have to solve alignment, which is a huge, huge open problem, but you have to do that for this to go well. And then the third thing is you have to solve for oversight of these systems, whose fingers are at the keyboards, and can you have some meaningful democratic oversight over that? And we actually go into this in a bit more detail in our most recent report on America's super intelligence project. Obviously this is going to take a village of everybody and grateful that you've been able to frame the issues so clearly and be early on this topic at waking up some of the interventions that we need. Thank you so much for coming on the show. Thanks so much
Starting point is 00:41:17 Juston. It's been great. Your undivided attention is produced by the Center for Humane Technology. We're a non-profit working to catalyze a humane future. Our senior producer is Julius Scott. Josh Lash is our researcher and producer, and our executive producer is Sasha Fegan, mixing on this episode by Jeff Sudaken, an original music by Ryan and Hayes Holiday. And a special thanks to the whole Center for Humane Technology team for making this show possible. You can find transcripts from our interviews, bonus content on our substack, and much more at HumaneTech.com. And if you liked this episode, we'd be truly grateful if you could rate us on Apple Podcasts or Spotify. It really does make a difference in helping others join this
Starting point is 00:42:00 movement for a more humane future. And if you made it all the way here, let me give one more thank you to you for giving us your undivided attention.
