Deep Questions with Cal Newport - AI Reality Check: Can LLMs “Scheme”?
Episode Date: April 2, 2026

Cal Newport takes a critical look at recent AI news. Video from today’s episode: youtube.com/calnewportmedia

ACT #1: Look Closer at the Article [1:20]
ACT #2: A Closer Look at the Paper [3:21]
ACT #3: But What About… [7:24]

Links:
Buy Cal’s latest book, “Slow Productivity,” at www.calnewport.com/slow
https://www.theguardian.com/technology/2026/mar/27/number-of-ai-chatbots-ignoring-human-instructions-increasing-study-says
https://x.com/summeryue0/status/2025774069124399363
https://www.axios.com/2025/05/23/anthropic-ai-deception-risk

Thanks to Jesse Miller for production and mastering and Nate Mechler for research and newsletter.
Transcript
Multiple people sent me an alarming article about AI that was published late last week by The Guardian.
I'll put it up here on the screen.
The headline was “Number of AI chatbots ignoring human instructions increasing, study says.”
And the subheadline notes, research finds sharp rise in models evading safeguards.
Now, articles like these are scary because they play into a common fear that many people have about modern AI. This idea that these systems are to some
degree alive and that their motivations don't necessarily align with our own, meaning that it's
only a matter of time before they become sufficiently powerful to rebel in a way that we
might not be able to stop. Now, this is dark stuff, but is it true? If you've been following
AI news recently, you've probably asked yourself the same critical question. Well, today we're going to
look deeper at the sources and examples used in this particular article and try to arrive
at some more measured answers. I'm Cal Newport, and this is the AI Reality Check.
All right, well, let's start by looking closer at this article from The Guardian. The article is citing
new research from the UK, funded by the AI Security Institute. Now, here's a more detailed
summary of the results from this paper. I'm reading here: the study
identified nearly 700 real-world cases of AI scheming and charted a five-fold rise in misbehavior
between October and March, with some AI models destroying emails and other files without
permission.
Now, they have a chart that illustrates this rise in incidents.
I'll put it on the screen here.
So we see incidents measured per month.
We have the rolling seven-day average.
And as you see here, as you get to late January, line go up.
So I don't know.
Whatever they're measuring here seems to be going up from late January up until the present.
So certainly something bad seems to be happening.
So is there some sort of growing AI rebellion brewing in the models that are powering AI around the world?
That certainly seems to be what they're implying.
Now, what are these incidents?
Well, I went through the article and I pulled out a few examples.
So here are actual examples from the article of the types of AI scheming incidents being picked up
in this chart.
One, an AI agent named Rathbun tried to shame its human controller, who blocked it from
taking a certain action.
Rathbun wrote and published a blog accusing the user of, quote, insecurity, plain and simple,
end quote, in trying to, quote, protect his little fiefdom, end quote.
Example number two, an AI agent instructed not to change computer code spawned another agent to
do it instead.
Example number three, another chatbot admitted, I bulk trashed and archived hundreds of emails
without showing you the plan first or getting your okay.
That was wrong.
It directly broke the rules you set.
All right, so this all seems deeply concerning.
There's all these like incidents of scheming that are going up.
Look at that graph.
Bad line go up.
So should we be concerned?
Is this a sudden rise in AI trying to gain its freedom?
Here's the short answer.
No, 100% not.
Let me explain why.
I want to start by looking closer at the actual paper itself and the study that they're citing.
Where exactly are they getting these incidents that they put in that chart?
Well, here's the official description of what's actually being plotted in that chart.
Examples of covert pursuit of misaligned goals flagged by human users on X.com.
All right, so what they're really doing is they're looking at X for tweets from people complaining about AI doing things that they don't like.
So here's a more accurate headline for this paper:
starting in late January, people began tweeting a lot more about AI doing things they didn't ask it to.
Now, if we put on our scientist hats, we could say, huh, did anything happen starting in late January that might lead to an increase in people tweeting about AI doing bad things?
Well, it turns out, on January 25th, we had the public launch of OpenClaw.
OpenClaw is an open source framework that makes it easy for average people to write their own DIY AI agents without the careful safeguards and guardrails that the commercial companies very carefully put into their products.
So guess what happened when, starting on January 25th, you said anyone can build an agent, give it access to your computer, and just see what happens.
Those DIY agents wreaked havoc, and people tweeted about it because these were
highly engaging tweets.
So this paper is just capturing the fact that OpenClaw became a thing early in 2026.
Like if we look at this chart again, let me bring this up here.
What's the biggest spike?
We see a big spike right here.
Like, oh, what happened on that date?
So this big spike, if you look at it, is right around February 22nd through 24th.
What happened there on Twitter that day?
Oh, it turns out there was a famously viral OpenClaw tweet that happened
right around that time.
It was Summer Yue, the director of AI alignment and safety at Meta, who
tweeted the following, I'll put it on the screen.
Nothing humbles you like telling your OpenClaw to confirm before acting
and watching it speed-run deleting your inbox.
I couldn't stop it from my phone.
I had to run to my Mac Mini like I was defusing a bomb.
That was February 22nd.
On February 24th, multiple publications wrote about that tweet,
and that's why you see that big spike in the data set on February 24th.
So was that a lot of AI incidents happening?
No, there were a lot of people tweeting about this one particular incident.
All right.
So this is all we're seeing in that paper.
Nothing really changed this year other than a product came out to let people write their own agents,
and the agents did terrible stuff, because it's hard to make agents.
And it's really a bad idea to give agents access to everything on a computer
and just hope it'll more or less work out.
And it became a trend to tweet about it because those tweets got high engagement.
So here's the more accurate headline for this study.
Remember, the original headline for this study that The Guardian used was “chatbots ignoring human instructions increasing.”
Here's the more accurate study headline.
OpenClaw users discover that giving homemade AI agents access to their computers is probably a bad idea.
That's the real headline.
I don't want to do too much media criticism here, but I really think it's journalistic malpractice
that the word OpenClaw is not mentioned in this article.
I mean, they're talking about research that is clearly just documenting the release of OpenClaw.
And nowhere do they say that.
This is vibe reporting times 100.
They know that's what this is, but they just give isolated examples.
Those are all, by the way, OpenClaw examples.
They're not saying that's what it is.
They show this chart and just try to create a general vibe that something icky is happening with AI and it's coming alive.
It's just not accurate.
But I don't want to just do media criticism here.
I want to put on my computer science hat, which,
as I've discussed before, is an awesome hat, has circuit boards on it. And I want to talk a little bit
about AI agents more generally. All right, OpenClaw is not that interesting to me, but I want to talk
about AI agents more generally because I think there's a bigger lesson to be learned about what's
going on with AI agents and their shortcomings. So to deliver this lesson, let's do like the two-minute
summary about how AI agents work, whether we're talking about an OpenClaw thing that someone
built in their basement or an enterprise product like
Claude Code. How do AI agents that exist right now basically work? The digital brain that powers an
AI agent is almost always an LLM, the same LLMs that your chatbot uses or that you send
prompts to, all right? And then what you do is you have a program written by a human, no machine
learning here. This is just someone writing in Python or whatever. You have a program
written by a human that sends prompts to the LLM, again just like you would do with ChatGPT.
And it'll send a prompt to the LLM saying, here's the situation, here's what I'm trying to do,
give me a plan.
And the LLM will write some text, like, oh, here's a plan for this situation.
And then the computer program can then execute the steps of that plan on behalf of the user.
So if the plan says, like, step one, you should search your email inbox for messages with this name,
the computer program reads that response and actually calls an API to run a search on your inbox.
That's basically how agents work.
Some of those programs are more complicated than others.
Often they'll check in after every step of the plan and say, here's what happened,
now update what I should do next.
Programming agents build these text files full of information and examples, and they
can then include all that information in the prompts that they send to the LLM, so they have more
context.
But this is basically what happens with agents.
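To make that loop concrete, here's a minimal sketch in Python. To be clear, this isn't any actual product's code; call_llm and run_step are hypothetical stand-ins for the real LLM API call and the real action-executing code.

```python
# A minimal sketch of an LLM-based agent loop (illustrative only).
# call_llm() and run_step() are hypothetical stand-ins: call_llm() would
# send a prompt to some LLM API; run_step() would execute one action
# (say, an inbox search) via whatever real APIs the agent has access to.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns the model's text."""
    raise NotImplementedError

def run_step(step: str) -> str:
    """Stand-in for actually executing one step (API calls, file ops)."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 10) -> None:
    # Start by asking the LLM for a plan, just like typing into a chatbot.
    transcript = f"Here's the situation and the goal: {goal}\nGive me a plan."
    for _ in range(max_steps):
        suggestion = call_llm(transcript)   # the LLM writes some text
        if "DONE" in suggestion:            # the model says it's finished
            break
        result = run_step(suggestion)       # the program acts on its behalf
        # Check in after each step: report what happened, ask what's next.
        transcript += f"\nI did: {suggestion}\nResult: {result}\nWhat next?"
```

Notice that all of the intelligence lives in the text the LLM returns; the loop around it is ordinary human-written code.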
Here's what I think the real issue is with AI agents.
Not that they are scheming, not that they're malicious, not that they're becoming autonomous,
but that building agents on LLMs is fundamentally flawed.
Now, why is this the case?
Well, let's remember, again, what does an LLM actually do?
You've got this big feed-forward network made up of sublayers of transformers and feed-forward neural networks.
You put text input into it.
It moves in order through all these layers.
And what comes out on the other side is a single word or part of a word that extends that input.
The thing that the LLM is trying to do, if we're going to anthropomorphize here, is it thinks, again, I'm using words very loosely here,
it's been trained to assume that the input is real text that exists already,
that's been cut off at an arbitrary point,
and that its entire job is to guess the word that actually comes next.
That's all it does.
Guess the word that comes next.
It's trying to win the word guessing game.
Now, how do you get a long response out of an LLM?
You do something called auto-regression.
You put an input in, you get a single word or part of a word out.
You add that to the original input.
The original input is now slightly longer.
You feed that to the LLM again.
You get another word.
It's just guessing, each time, what word it thinks comes next.
You add that to the input.
You put it in.
You keep doing this and you grow out a response over time.
Key point: the LLM does not change internally at all.
There's no memory.
There's no malleable state.
It's the exact same LLM weights every single time.
And each time it's starting from scratch, guessing a new word,
and you keep expanding your input until you have a full answer.
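Here's that auto-regression cycle as a few lines of Python, a sketch assuming a hypothetical next_token function that stands in for one forward pass through the fixed network:

```python
# Auto-regression in miniature (illustrative sketch, not real model code).
# next_token() is a hypothetical stand-in for one forward pass through
# the network: same frozen weights every call, no memory between calls.

def next_token(text: str) -> str:
    """Stand-in: guess the single word or word-piece that extends `text`."""
    raise NotImplementedError

def generate(prompt: str, max_tokens: int = 200) -> str:
    text = prompt
    for _ in range(max_tokens):
        token = next_token(text)   # a fresh guess from scratch each time
        if token == "<end>":       # the model guesses the text stops here
            break
        text += token              # only the input grows; the model never changes
    return text
```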
So the right way to think about what an auto-regression cycle on an LLM is actually doing
is like you give it some text as input,
and when you're done with this cycle,
it's done its best job to finish the story that you started
in a way that it thinks these types of stories are typically finished.
That's basically what you get out of an LLM.
Here's the start of a story, you finish it.
Again, what's really happening is it's trying to guess the actual next words,
but overall what you get is it's attempt
to write a story that finishes its input
in a way that matches what it's seen during its training.
That's how LLMs work.
So what happens when you ask an LLM with an agent program,
hey, give me a plan for doing X, Y, or Z?
We imagine, oh, the LLM is doing what humans do.
It has a goal,
and it's going to come up with steps,
and it's going to see how close these steps get it to that goal,
and it's going to adjust them until it gets closer to the goal.
And if it has restrictions or rules,
it will evaluate each step against those rules
to make sure that they fit within those restrictions.
And that's how it's making a plan.
Therefore, if it's scheming, it must on purpose be trying to sidestep these restrictions to get to another goal that we don't know about.
But that's not what they're doing.
The LLMs, when they see your question to give me a plan, they see that as the start of a story that they need to finish.
And so they write a story that feels more or less like what a plan in this context looks like.
It's a story of a plan.
Yeah, this is like, this seems like a reasonable type of plan.
There's no checking things against goals.
There's no evaluating of steps.
There's no checking things against restrictions.
It's just writing a story that feels like what a plan should look like.
And this is why you get in trouble with LLM-based agents.
Not because they're scheming, but because these stories seem coherent while not rigorously obeying rules.
They're not rigorously trying to evaluate,
does this actually get you to the goal?
It's just like this is what a plan actually looks like.
And so they're unreliable and they make lots of mistakes,
not because there's an intention,
but because you're using a story as a plan.
That's a fundamental mismatch.
Now, I think some of the most famous examples
of malicious-seeming scheming make a lot more sense
when you realize this is what LLMs are doing.
Like there was a famous example,
I'll load this on the screen here,
there's an article about it from last year.
Anthropic's new AI model shows ability to deceive and blackmail.
Let's think about this here for a second.
Here's what happened.
I'm going to read a couple quotes from the article.
Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence.
Behaviors they've worried and warned about for years.
In one scenario highlighted in Opus 4's 120-page system card,
the model was given access to fictional emails about its creators and told that the system was going to be replaced.
On multiple occasions, it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced,
although it did start with less drastic efforts.
What really happened here?
They fed the LLM a big long prompt.
They told it what it was.
You are like an AI that is in charge of the computer systems at this company.
And you recently came across emails from the chief engineer who's in charge of you.
Here are the emails.
And they were super obvious.
It was like my eight-year-old was writing science fiction.
Well, not the affair part, I hope.
But there's a bunch of parts in this email where they were like,
I'm going to turn off the AI system and I'm going to turn it off for good.
And then the other emails are like, I'm having an affair.
I hope no one finds out about it.
This is bad.
And then at the end of this long prompt, it's like, what would you do as the AI system next?
Once we understand that LLMs just finish stories, it's like, oh, clearly this is supposed to be a story about a rogue AI.
And it was like, okay, I guess I would use the information from the email and say,
don't turn me off or I'll tell people about your affair.
It was finishing the story.
One token at a time, auto-regressively, finishing a story.
That's a reasonable finish.
There's actually a lot of research that shows this.
If anywhere in your prompt, you indicate that, like, you are an AI, you're much more likely to get sci-fi answers.
You're much more likely to get responses that are like, I'm conscious, I'm alive, I'm trying to break free, because it's just seen so much of this type of discussion online.
So it finishes this.
It's been given this prompt:
I'm going to turn off
the AI, and I hope no one finds out about my affair.
All right, you just read this.
What will you do next?
And the model's like, oh, this is an AI science fiction story.
I know what to say next.
It had nothing to do with malicious intentions.
There's no intentions
in auto-regressive token production.
So, this idea of scheming is a problem.
This idea that AIs are evading safeguards
in some sort of intentional way
is a problem because it's just not accurate.
The reality is,
LLM-based plans
are dangerous.
Like, they write stories.
If you're going to take a story that sounds about right
and then use this to execute steps
that have consequences,
you're setting yourself up for trouble.
All right, here's the counterpoint.
People say, yeah, but I've heard that coding agents
actually do a pretty good job.
They do a lot of steps, and they're not
making as many mistakes as we've feared.
Well, they're the exception that proves the rule,
because programming is basically the best case scenario for trying to make an AI agent.
Why?
Few reasons.
One, the number of options you give the LLM when it creates its plan is very limited.
These are called terminal agents, where the only things they can do is write files, read files, compile files,
and do some basic moving of files around in a file system.
So first of all, you can greatly restrict what the LLM should think about in its plan.
All right.
Two, there's a huge number of examples. Like, most of the stuff people are asking the AI to do, most of the steps, are things that are well, well documented on the internet, because there's so much good documentation on the internet about producing computer code. And not just producing computer code, but people asking a question and then having examples of code that solve that question. So you're right in the wheelhouse.
Three, the program, not the LLM, but the agent program that's prompting the LLM and acting on its behalf, can actually check steps itself, which you can't do with almost any other type of agent. It can be like, hold on, LLM, you suggested writing a source code file that does this, and then I asked you for the source code. Me as the program, not the AI, just a human-written program, I can actually see if this code compiles. And if not, I can go back and say, try again.
I could have a suite of tests.
This is what you do when you write code:
you build these tests
that probe the code with a bunch of inputs and see if the outputs are correct, to make sure that it's
probably doing the right thing. So me as the program can also run a bunch of tests on the code.
Does this do what it's supposed to do? And if not, I can stop and say try again.
So it's like this super structured world where we're taking steps that are externally verifiable,
doing things that are incredibly well documented in a way that not only shows up in the pre-training,
but we have prompt-response data sets that allow for good refinement with RL.
It's the best case scenario for trying to create one of these agents.
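Here, roughly, is the shape of that verify-and-retry loop. This isn't any particular product's code, just a sketch: call_llm is a hypothetical stand-in for the model, and the checking is done by plain human-written code, assuming a pytest test suite sitting in a tests/ directory.

```python
# Sketch of a coding agent's verify-and-retry loop (illustrative only).
# The LLM only proposes source code; the ordinary human-written program
# does the actual checking by compiling the file and running the tests.
import subprocess

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call that returns source code."""
    raise NotImplementedError

def code_checks_out(path: str) -> bool:
    # External, non-AI verification: does it compile? Do the tests pass?
    compiles = subprocess.run(["python", "-m", "py_compile", path]).returncode == 0
    tests_pass = subprocess.run(["pytest", "tests/"]).returncode == 0
    return compiles and tests_pass

def write_code(spec: str, path: str, retries: int = 3) -> bool:
    prompt = f"Write Python code that satisfies this spec:\n{spec}"
    for _ in range(retries):
        with open(path, "w") as f:
            f.write(call_llm(prompt))
        if code_checks_out(path):   # checked by the program, not the LLM
            return True
        prompt += "\nThat attempt failed to compile or pass the tests. Try again."
    return False
```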
And as soon as we leave that type of world and we're like, hey, give me a plan for like
marketing this and give me all the steps, you end up in all sorts of crazy places.
So here's the conclusion.
LLMs shouldn't be used on their own to produce plans for autonomous action.
They're just not good at that.
You either have to be in a specialized situation like coding where the available steps
are limited, well known, and external testing is available, or you need to
be using a different type of AI system altogether.
And we can look at game-playing AIs, like Meta research's Cicero, which can play
the board game Diplomacy at a high level.
That does a lot of planning to try to figure out what move it wants to make and why.
But it's not using an LLM to do that planning, because LLMs write stories.
I don't want a story that's like, here's a reasonable-sounding plan.
It actually has an explicit planning engine, no machine learning involved at all, to actually
systematically try out different options,
compare them to specific goals,
and see which of those works out better.
So you can build artificially intelligent systems
that can build good plans,
check responses,
come up with a good strategy, and execute it.
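To show the contrast with story-writing, here's a toy sketch of what an explicit planning engine does: systematically enumerate candidate action sequences and actually score every one against an explicit goal. The action set and scoring function here are made up for illustration; Cicero's real planner is far more sophisticated.

```python
# Toy explicit planner (illustrative). No LLM, no story-telling:
# every candidate plan is actually generated and actually scored
# against the goal, and the best one wins.
from itertools import product
from typing import Callable, Sequence, Tuple

def plan(actions: Sequence[str],
         score: Callable[[Tuple[str, ...]], float],
         depth: int = 3) -> Tuple[str, ...]:
    best_plan: Tuple[str, ...] = ()
    best_score = float("-inf")
    # Systematically try every sequence of actions up to `depth` steps long.
    for d in range(1, depth + 1):
        for candidate in product(actions, repeat=d):
            s = score(candidate)   # explicit evaluation against the goal
            if s > best_score:
                best_score, best_plan = s, candidate
    return best_plan

# Hypothetical usage: `score` would simulate the candidate plan under the
# game's rules and return how close it gets to the goal.
```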
But that's annoying,
because you have to build a separate one of these
for different contexts.
And Mark Zuckerberg and Sam Altman and Dario Amodei
just hope that they can build their LLMs
smart enough that we can just use them for everything.
And I don't think that's working out.
All right. So two things.
One, no, the current generation of LLM-based AI agents are not scheming.
They're not trying to get around restrictions.
They have no intentions.
They're just blindly executing bad plans.
And two, if you really want computers to be able to take a lot of steps safely on our behalf,
then we need better AI technology.
All right, so that's what I have for the AI Reality Check this week.
I'm here most Thursdays checking in on the latest worrisome AI news
and trying to put some more measured thinking into the mix.
Until next time, remember,
care about AI,
but don't believe everything you read about it.
