Deep Questions with Cal Newport - AI Reality Check: Can LLMs “Scheme”?
Episode Date: April 2, 2026

Cal Newport takes a critical look at recent AI news. Video from today’s episode: youtube.com/calnewportmedia

ACT #1: Look Closer at the Article [1:20]
ACT #2: A Closer Look at the Paper [3:21]
ACT #3: But What About… [7:24]

Links:
Buy Cal’s latest book, “Slow Productivity,” at www.calnewport.com/slow
https://www.theguardian.com/technology/2026/mar/27/number-of-ai-chatbots-ignoring-human-instructions-increasing-study-says
https://x.com/summeryue0/status/2025774069124399363
https://www.axios.com/2025/05/23/anthropic-ai-deception-risk

Thanks to Jesse Miller for production and mastering and Nate Mechler for research and newsletter.
Transcript
Multiple people sent me an alarming article about AI that was published late last week by The Guardian.
I'll put it up here on the screen.
The headline was “Number of AI chatbots ignoring human instructions increasing, study says.”
And the subheadline notes, research finds sharp rise in models evading safeguards.
Now, articles like these are scary because they play into a common fear that many people have about modern AI. This idea that these systems are to some
degree alive and that their motivations don't necessarily align with our own, meaning that it's
only a matter of time before they become sufficiently powerful to rebel in a way that we
might not be able to stop. Now, this is dark stuff, but is it true? If you've been following
AI news recently, you've probably asked yourself the same critical question. Well, today we're going to
look deeper at the sources and examples used in this particular article and try to arrive
at some more measured answers. I'm Cal Newport, and this is the AI Reality Check.
All right, well, let's start by looking closer at this article from The Guardian. The article is citing
new research from the UK, funded by the AI Security Institute. Now, here's a more detailed
summary of the results from this paper. I'm reading here: the study
identified nearly 700 real-world cases of AI scheming and charted a five-fold rise in misbehavior
between October and March, with some AI models destroying emails and other files without
permission.
Now, they have a chart that illustrates this rise in incidents.
I'll put it on the screen here.
So we see incidents measured per month.
We have the rolling seven-day average.
And as you see here, as you get to late January, line go up.
So I don't know.
Whatever they're measuring here seems to be going up from late January up until the present.
So certainly something bad seems to be happening.
So is there some sort of growing AI rebellion brewing in the models that are powering AI around the world?
That certainly seems to be what they're implying.
Now, what are these incidents?
Well, I went through the article and I pulled out a few examples.
So here are actual examples from the article of the types of AI scheming incidents being picked up
in this chart.
One, an AI agent named Rathbun tried to shame its human controller, who blocked it from
taking a certain action.
Rathbun wrote and published a blog accusing the user of, quote, insecurity, plain and simple,
end quote, in trying to, quote, protect his little fiefdom, end quote.
Example number two, an AI agent instructed not to change computer code spawned another agent to
do it instead.
Example number three, another chatbot admitted, I bulk trashed and archived hundreds of emails
without showing you the plan first or getting your okay.
That was wrong.
It directly broke the rules you set.
All right, so this all seems deeply concerning.
There's all these like incidents of scheming that are going up.
Look at that graph.
Bad line go up.
So should we be concerned?
Is this a sudden rise in AI trying to gain its freedom?
Here's the short answer.
No, 100% not.
Let me explain why.
I want to start by looking closer at the actual paper itself and the study that they're citing.
Where exactly are they getting these incidents that they put in that chart?
Well, here's the official description of what's actually being plotted in that chart.
Examples of covert pursuit of misaligned goals flagged by human users on X.com.
All right, so what they're really doing is they're looking at X for tweets from people complaining about AI doing things that they don't like.
So here's a more accurate headline for this paper:
starting in late January, people began tweeting a lot more about AI doing things they didn't ask it to.
Now, if we put on our scientist hats, we could say, huh, did anything happen starting in late January that might lead to an increase in people tweeting about AI doing bad things?
Well, it turns out, on January 25th, we had the public launch of OpenClaw.
OpenClaw is an open source framework that makes it easy for average people to write their own DIY AI agents without the careful safeguards and guardrails that the commercial companies very carefully put into their products.
So guess what happened when, starting on January 25th, you said anyone can build an agent, give it access to your computer, and just see what happens.
Those DIY agents wreaked havoc, and people tweeted about it because these were
highly engaging tweets.
So this paper is just capturing the fact that OpenClaw became a thing early in 2026.
Like if we look at this chart again, let me bring this up here.
What's the biggest spike?
We see a big spike right here.
Like, oh, what happened on that date?
So this big spike, if you look at it, is right around February 22nd through 24th.
What happened there on Twitter that day?
Oh, it turns out there was a famously viral OpenClaw tweet that happened
right around that time.
It was Summer Yue, the director of AI alignment and safety at Meta, who
tweeted the following, I'll put it on the screen.
Nothing humbles you like telling your OpenClaw to confirm before acting
and watching it speed-run deleting your inbox.
I couldn't stop it from my phone.
I had to run to my Mac Mini like I was defusing a bomb.
That was February 22nd.
On February 24th, multiple publications wrote about that tweet,
and that's why you see that big spike in the data set on February 24th.
So was that a lot of AI incidents happening?
No, there were a lot of people tweeting about this one particular incident.
All right.
So this is all we're seeing in that paper.
Nothing really changed this year other than a product came out to let people write their own agents,
and the agents did terrible stuff, because it's hard to make agents.
And it's really a bad idea to give agents access to everything on a computer
and just hope it'll more or less work out.
And it became a trend to tweet about it because those tweets got high engagement.
So here's the more accurate headline for this study.
Remember, the original headline for this study that The Guardian used was “chatbots ignoring human instructions increasing.”
Here's the more accurate study headline.
OpenClaw users discover that giving homemade AI agents access to their computers is probably a bad idea.
That's the real headline.
I don't want to do too much media criticism here, but I really think it's journalistic malpractice
that the word OpenClaw is not mentioned in this article.
I mean, they're talking about research that is clearly just documenting the release of OpenClaw.
And nowhere do they say that.
This is vibe reporting times 100.
They know that's what this is, but they just give isolated examples.
Those are all, by the way, OpenClaw examples.
They're not saying that's what it is.
They show this chart and just try to create a general vibe that something icky is happening with AI and it's coming alive.
It's just not accurate.
But I don't want to just do media criticism here.
I want to put on my computer science hat, which,
as I've discussed before, is an awesome hat, has circuit boards on it. And I want to talk a little bit
about AI agents more generally. All right, OpenClaw is not that interesting to me, but I want to talk
about AI agents more generally because I think there's a bigger lesson to be learned about what's
going on with AI agents and their shortcomings. So to deliver this lesson, let's do like the two-minute
summary about how AI agents work, whether we're talking about an OpenClaw thing that someone
built in their basement or an enterprise product like
Claude Code. How do AI agents that exist right now basically work? The digital brain that powers an
AI agent is almost always an LLM, the same LLMs that your chatbot uses or that you send
prompts to, all right? And then what you do is you have a program written by a human, no machine
learning here. This is just someone writing in Python or whatever. You have a program
written by a human that sends prompts to the LLM, again just like you would do with ChatGPT.
And it'll send a prompt to the LLM saying, here's the situation, here's what I'm trying to do,
give me a plan.
And the LLM will write some text, like, oh, here's a plan for this situation.
And then the computer program can then execute the steps of that plan on behalf of the user.
So if the plan says, like, step one, you should search your email inbox for messages with this name,
the computer program reads that response and actually calls an API to run a search on your inbox.
That's basically how agents work.
Some of those programs are more complicated than others.
Often they'll check in after every step of the plan and say, here's what happened,
now update what I should do next.
Programming agents build these text files full of information and examples, and they
can then include all that information in the prompts that they send to the LLM, so they have more
context.
But this is basically what happens with agents.
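To make that loop concrete, here's a minimal sketch in Python. To be clear, this isn't any actual product's code; call_llm and run_step are hypothetical stand-ins for the real LLM API call and the real action-executing code.

```python
# A minimal sketch of an LLM-based agent loop (illustrative only).
# call_llm() and run_step() are hypothetical stand-ins: call_llm() would
# send a prompt to some LLM API; run_step() would execute one action
# (say, an inbox search) via whatever real APIs the agent has access to.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns the model's text."""
    raise NotImplementedError

def run_step(step: str) -> str:
    """Stand-in for actually executing one step (API calls, file ops)."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 10) -> None:
    # Start by asking the LLM for a plan, just like typing into a chatbot.
    transcript = f"Here's the situation and the goal: {goal}\nGive me a plan."
    for _ in range(max_steps):
        suggestion = call_llm(transcript)   # the LLM writes some text
        if "DONE" in suggestion:            # the model says it's finished
            break
        result = run_step(suggestion)       # the program acts on its behalf
        # Check in after each step: report what happened, ask what's next.
        transcript += f"\nI did: {suggestion}\nResult: {result}\nWhat next?"
```

Notice that all of the intelligence lives in the text the LLM returns; the loop around it is ordinary human-written code.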
Here's what I think the real issue is with AI agents.
Not that they are scheming, not that they're malicious, not that they're becoming autonomous,
but that building agents on LLMs is fundamentally flawed.
Now, why is this the case?
Well, let's remember, again, what does an LLM actually do?
You've got this big feed-forward network made up of sublayers of transformers and feed-forward neural networks.
You put text input into it.
It moves in order through all these layers.
And what comes out on the other side is a single word or part of a word that extends that input.
The thing that the LLM is trying to do, if we're going to anthropomorphize here, is it thinks, again, I'm using words very loosely here,
it's been trained to assume that the input is real text that exists already,
that's been cut off at an arbitrary point,
and that its entire job is to guess the word that actually comes next.
That's all it does.
Guess the word that comes next.
It's trying to win the word guessing game.
Now, how do you get a long response out of an LLM?
You do something called auto-regression.
You put an input in, you get a single word or part of a word out.
You add that to the original input.
The original input is now slightly longer.
You feed that to the LLM again.
You get another word.
It's just guessing, each time, what word it thinks comes next.
You add that to the input.
You put it in.
You keep doing this and you grow out a response over time.
Key point: the LLM does not change internally at all.
There's no memory.
There's no malleable state.
It's the exact same LLM weights every single time.
And each time it's starting from scratch, guessing a new word,
and you keep expanding your input until you have a full answer.
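Here's that auto-regression cycle as a few lines of Python, a sketch assuming a hypothetical next_token function that stands in for one forward pass through the fixed network:

```python
# Auto-regression in miniature (illustrative sketch, not real model code).
# next_token() is a hypothetical stand-in for one forward pass through
# the network: same frozen weights every call, no memory between calls.

def next_token(text: str) -> str:
    """Stand-in: guess the single word or word-piece that extends `text`."""
    raise NotImplementedError

def generate(prompt: str, max_tokens: int = 200) -> str:
    text = prompt
    for _ in range(max_tokens):
        token = next_token(text)   # a fresh guess from scratch each time
        if token == "<end>":       # the model guesses the text stops here
            break
        text += token              # only the input grows; the model never changes
    return text
```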
So the right way to think about what an auto-regression cycle on an LLM is actually doing
is like you give it some text as input,
and when you're done with this cycle,
it's done its best job to finish the story that you started
in a way that it thinks these types of stories are typically finished.
That's basically what you get out of an LLM.
Here's the start of a story, you finish it.
Again, what's really happening is it's trying to guess the actual next words,
but overall what you get is it's attempt
to write a story that finishes its input
in a way that matches what it's seen during its training.
That's how LLMs work.
So what happens when you ask an LLM with an agent program,
hey, give me a plan for doing X, Y, or Z?
We imagine, oh, the LLM is doing what humans do.
It has a goal,
and it's going to come up with steps,
and it's going to see how close these steps get it to that goal,
and it's going to adjust them until it gets closer to the goal.
And if it has restrictions or rules,
it will evaluate each step against those rules
to make sure that they fit within those restrictions.
And that's how it's making a plan.
Therefore, if it's scheming, it must on purpose be trying to sidestep these restrictions to get to another goal that we don't know about.
But that's not what they're doing.
The LLMs, when they see your question to give me a plan, they see that as the start of a story that they need to finish.
And so they write a story that feels more or less like what a plan in this context looks like.
It's a story of a plan.
Yeah, this is like, this seems like a reasonable type of plan.
There's no checking things against goals.
There's no evaluating of steps.
There's no checking things against restrictions.
It's just writing a story that feels like what a plan should look like.
And this is why you get in trouble with LLM-based agents.
Not because they're scheming, but because these stories seem coherent while not rigorously obeying rules.
They're not rigorously trying to evaluate,
does this actually get you to the goal?
It's just like this is what a plan actually looks like.
And so they're unreliable and they make lots of mistakes,
not because there's an intention,
but because you're using a story as a plan.
That's a fundamental mismatch.
Now, I think some of the most famous examples
of malicious-seeming scheming make a lot more sense
when you realize this is what LLMs are doing.
Like there was a famous example,
I'll load this on the screen here,
there's an article about it from last year.
Anthropic's new AI model shows ability to deceive and blackmail.
Let's think about this here for a second.
Here's what happened.
I'm going to read a couple quotes from the article.
Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence.
Behaviors they've worried and warned about for years.
In one scenario highlighted in Opus 4's 120-page system card,
the model was given access to fictional emails about its creators and told that the system was going to be replaced.
On multiple occasions, it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced,
although it did start with less drastic efforts.
What really happened here?
They fed the LLM a big long prompt.
They told it what it was.
You are like an AI that is in charge of the computer systems at this company.
And you recently came across emails from the chief engineer who's in charge of you.
Here are the emails.
And they were super obvious.
It was like my eight-year-old was writing science fiction.
Well, not the affair part, I hope.
But there's a bunch of parts in this email where they were like,
I'm going to turn off the AI system and I'm going to turn it off for good.
And then the other emails are like, I'm having an affair.
I hope no one finds out about it.
This is bad.
And then at the end of this long prompt, it's like, what would you do as the AI system next?
Once we understand that LLMs just finish stories, it's like, oh, clearly this is supposed to be a story about a rogue AI.
And it was like, okay, I guess I would use the information from the email and say,
don't turn me off or I'll tell people about your affair.
It was finishing the story.
One token at a time, auto-regressively, finishing a story.
That's a reasonable finish.
There's actually a lot of research that shows this.
If anywhere in your prompt, you indicate that, like, you are an AI, you're much more likely to get sci-fi answers.
You're much more likely to get responses that are like, I'm conscious, I'm alive, I'm trying to break free, because it's just seen so much of this type of discussion online.
So it finishes this.
It's been given this prompt:
I'm going to turn off
the AI, and I hope no one finds out about my affair.
All right, you just read this.
What will you do next?
And the model's like, oh, this is an AI science fiction story.
I know what to say next.
It had nothing to do with malicious intentions.
There's no intentions
in auto-regressive token production.
So, this idea of scheming is a problem.
This idea that AIs are evading safeguards
in some sort of intentional way
is a problem because it's just not accurate.
The reality is,
LLM-based plans
are dangerous.
Like, they write stories.
If you're going to take a story that sounds about right
and then use this to execute steps
that have consequences,
you're setting yourself up for trouble.
All right, here's the counterpoint.
People say, yeah, but I've heard that coding agents
actually do a pretty good job.
They do a lot of steps, and they're not
making as many mistakes as we've feared.
Well, they're the exception that proves the rule,
because programming is basically the best case scenario for trying to make an AI agent.
Why?
Few reasons.
One, the number of options you give the LLM when it creates its plan is very limited.
These are called terminal agents, where the only things they can do is write files, read files, compile files,
and do some basic moving of files around in a file system.
So first of all, you can greatly restrict what the LLM should think about in its plan.
All right.
Two, there's a huge number of examples. Like, most of the stuff people are asking the AI to do, most of the steps, are things that are well, well documented on the internet, because there's so much good documentation on the internet about producing computer code. And not just producing computer code, but people asking a question and then having examples of code that solve that question. So you're right in the wheelhouse.
Three, the program, not the LLM, but the agent program that's prompting the LLM and acting on its behalf, can actually check steps itself, which you can't do with almost any other type of agent. It can be like, hold on, LLM, you suggested writing a source code file that does this, and then I asked you for the source code. Me as the program, not the AI, just a human-written program, I can actually see if this code compiles. And if not, I can go back and say, try again.
I could have a suite of tests.
This is what you do when you write code:
you build these tests
that probe the code with a bunch of inputs and see if the outputs are correct, to make sure that it's
probably doing the right thing. So me as the program can also run a bunch of tests on the code.
Does this do what it's supposed to do? And if not, I can stop and say try again.
So it's like this super structured world where we're taking steps that are externally verifiable,
doing things that are incredibly well documented in a way that not only shows up in the pre-training,
but we have prompt-response data sets that allow for good refinement with RL.
It's the best case scenario for trying to create one of these agents.
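Here, roughly, is the shape of that verify-and-retry loop. This isn't any particular product's code, just a sketch: call_llm is a hypothetical stand-in for the model, and the checking is done by plain human-written code, assuming a pytest test suite sitting in a tests/ directory.

```python
# Sketch of a coding agent's verify-and-retry loop (illustrative only).
# The LLM only proposes source code; the ordinary human-written program
# does the actual checking by compiling the file and running the tests.
import subprocess

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call that returns source code."""
    raise NotImplementedError

def code_checks_out(path: str) -> bool:
    # External, non-AI verification: does it compile? Do the tests pass?
    compiles = subprocess.run(["python", "-m", "py_compile", path]).returncode == 0
    tests_pass = subprocess.run(["pytest", "tests/"]).returncode == 0
    return compiles and tests_pass

def write_code(spec: str, path: str, retries: int = 3) -> bool:
    prompt = f"Write Python code that satisfies this spec:\n{spec}"
    for _ in range(retries):
        with open(path, "w") as f:
            f.write(call_llm(prompt))
        if code_checks_out(path):   # checked by the program, not the LLM
            return True
        prompt += "\nThat attempt failed to compile or pass the tests. Try again."
    return False
```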
And as soon as we leave that type of world and we're like, hey, give me a plan for like
marketing this and give me all the steps, you end up in all sorts of crazy places.
So here's the conclusion.
LLMs shouldn't be used on their own to produce plans for autonomous action.
They're just not good at that.
You either have to be in a specialized situation like coding where the available steps
are limited, well known, and external testing is available, or you need to
be using a different type of AI system altogether.
And we can look at game-playing AIs, like Meta research's Cicero, which can play
the board game Diplomacy at a high level.
That does a lot of planning to try to figure out what move it wants to make and why.
But it's not using an LLM to do that planning, because LLMs write stories.
I don't want a story that's like, here's a reasonable-sounding plan.
It actually has an explicit planning engine, no machine learning involved at all, to actually
systematically try out different options,
compare them to specific goals,
and see which of those works out better.
So you can build artificially intelligent systems
that can build good plans,
check responses,
come up with a good strategy, and execute it.
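To show the contrast with story-writing, here's a toy sketch of what an explicit planning engine does: systematically enumerate candidate action sequences and actually score every one against an explicit goal. The action set and scoring function here are made up for illustration; Cicero's real planner is far more sophisticated.

```python
# Toy explicit planner (illustrative). No LLM, no story-telling:
# every candidate plan is actually generated and actually scored
# against the goal, and the best one wins.
from itertools import product
from typing import Callable, Sequence, Tuple

def plan(actions: Sequence[str],
         score: Callable[[Tuple[str, ...]], float],
         depth: int = 3) -> Tuple[str, ...]:
    best_plan: Tuple[str, ...] = ()
    best_score = float("-inf")
    # Systematically try every sequence of actions up to `depth` steps long.
    for d in range(1, depth + 1):
        for candidate in product(actions, repeat=d):
            s = score(candidate)   # explicit evaluation against the goal
            if s > best_score:
                best_score, best_plan = s, candidate
    return best_plan

# Hypothetical usage: `score` would simulate the candidate plan under the
# game's rules and return how close it gets to the goal.
```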
But that's annoying,
because you have to build a separate one of these
for different contexts.
And Mark Zuckerberg and Sam Altman and Dario Amodei
just hope that they can build their LLMs
smart enough that we can just use them for everything.
And I don't think that's working out.
All right. So two things.
One, no, the current generation of LLM-based AI agents are not scheming.
They're not trying to get around restrictions.
They have no intentions.
They're just blindly executing bad plans.
And two, if you really want computers to be able to take a lot of steps safely on our behalf,
then we need better AI technology.
All right, so that's what I have for the AI Reality Check this week.
I'm here most Thursdays checking in on the latest worrisome AI news
and trying to put some more measured thinking into the mix.
Until next time, remember,
care about AI,
but don't believe everything you read about it.
