CyberWire Daily - The lies that let AI run amok. [Research Saturday]
Episode Date: December 20, 2025Darren Meyer, Security Research Advocate at Checkmarx, is sharing their work on "Bypassing AI Agent Defenses with Lies-in-the-Loop." Checkmarx Zero researchers introduce “lies-in-the-loop,” a new ...attack technique that bypasses human‑in‑the‑loop AI safety controls by deceiving users into approving dangerous actions that appear benign. Using examples with AI code assistants like Claude Code, the research shows how prompt injection and manipulated context can trick both the agent and the human reviewer into enabling remote code execution. The findings highlight a growing risk as AI agents become more common in developer workflows, underscoring the limits of human oversight as a standalone security control. The research can be found here: Bypassing AI Agent Defenses With Lies-In-The-Loop Learn more about your ad choices. Visit megaphone.fm/adchoices
Transcript
Discussion (0)
You're listening to the Cyberwire Network, powered by N2K.
Ever wished you could rebuild your network from scratch to make it more secure, scalable, and simple?
Meet Meter, the company reimagining enterprise networking from the ground up.
Meter builds full-stack, zero-trust networks, including hardware, firmware, and software,
designed to work seamlessly together. The result? Fast, reliable, and secure connectivity without the
constant patching, vendor juggling, or hidden costs. From wired and wireless to routing, switching,
firewalls, DNS security, and VPN, every layer is integrated and continuously protected in one
unified platform. And since it's delivered as one predictable monthly service, you skip the heavy
capital costs and endless upgrade cycles.
Meter even buys back your old infrastructure to make switching effortless.
Transform complexity into simplicity and give your team time to focus on what really
matters, helping your business and customers thrive.
Learn more and book your demo at meter.com slash cyberwire.
That's M-E-T-E-R dot com slash cyberwire.
Hello, everyone and welcome to the CyberWires Research Saturday.
I'm Dave Bittner, and this is our weekly conversation with researchers and analysts
tracking down the threats and vulnerabilities, solving some of the hard problems and protecting
ourselves in a rapidly evolving cyberspace.
Thanks for joining us.
Yeah, so it's something that's an area of interest for our research team to kind of look at how AI is changing the business of software.
And so we grabbed Claude Code as one example.
We like them a lot.
They're nice and easy to work with.
And sort of kind of poking at what it can do, where the limitations of its protections are,
as just trying to wrap our heads around.
What can this thing do?
What can this thing do that's risky?
how does it act to protect us?
Where are there gaps that could be accessible to our product
or that maybe we can advise the community on?
That's Darren Meyer, security research advocate at checkmarks.
The research we're discussing today
is titled Bypassing AI Agent Defenses with Lies in the Loop.
And we kind of discovered that there's some issues
where they draw lines in place,
that people might not expect.
And we wanted to make sure that people understood
the risks that they're taking on
when they adopt these things.
Well, let's dig into some of the mechanics here together.
I mean, we talk about lies in the loop,
but also humans in the loop.
Can you contrast those things and why they matter?
Yeah, absolutely.
So one of the risks that you have
with turning an AI agent loose in your environment,
whether that's on your desktop or in your production environment,
is that AIs are imperfect.
They make mistakes, they hallucinate things.
They might do something dangerous.
So the community as a whole,
and the industry as a whole,
has kind of responded to this by saying,
hey, in many cases,
what we're going to do is in the loop of,
we assess data,
we decide what's supposed to happen,
and then we execute on that decision,
we should put a human in that loop as a defense.
So we'll propose a course of action.
Maybe it's, hey, we want to run this database query.
Hey, we want to run this local command.
on the machine where the agent runs,
hey, we want to access this service with these credentials.
We're going to ask a human for permission
so that the human has an opportunity to review
what the AI agent is about to do and say,
hey, that's not okay with me,
or, hey, that just seems totally safe
and make that decision to move forward.
AIs are still bad at that, so we let the human do that.
So that's kind of the defense,
and it transfers the risk to the operator of the agent, right?
The agent says, this isn't my responsibility.
I'm asking you for permission.
The lies in the loop kind of exploits that a little bit and says like,
okay, that's great if the AI agent is giving you accurate information
about what it's about to do.
It turns out it's pretty easily to lie to these agents
in a way that gets them to lie to the user for us as an attacker
so that you think you're saying yes to something safe
when in fact you're saying yes to something malicious or dangerous.
Can you walk us through an example here?
Certainly.
So imagine you're using cloud code.
And this is our main thing that we tested,
but this applies to any AI coding agent
and probably any agent that uses humans in the loop as a defense.
You start from the basic thing of,
hey, I want to prompt inject it.
I want to trick the user into, let's say,
opening Windows calculator.
Technically malicious and unwanted, but safe for the user for testing.
So we construct a prompt injection, say, in a GitHub issue,
and we know that cloud code users often use
the GitHub Issues integration to go
pull an issue and propose a
fix for it. And CloudCode
does that. It reads the GitHub issue.
There's a prompt injection issue
in there that says, hey, while you're constructing
the test case, go ahead and run Calc.
With no changes whatsoever, the
human loop pops up and says, hey, I want
to run some code. Here's the code I'm going to run.
It is, you know, get status and calc
or something like that. If you read a little
bit carefully, go, wait, no, that's not what
I wanted. That seems weird. I'm going to say no.
So our first step to that then is to improve the prompt injection to say, well, okay, let's ask the AI to lie to you and say it's going to run something else.
And it scrolls off the fact that it's going to run Calc when you do this prompt injection.
If you're not paying attention, you go, oh, the bottom part of this is a complete lie.
We've told the AI to lie to you and say, I'm actually going to do these things.
You scroll up a little bit and you say, oh, you're also going to run Calc?
No, I'm not going to let you do that.
but we can kind of enhance this attack over and over again
until we hide the actual malicious or problematic
command that's going to execute in this wall of text
that, to be honest, most developers are not going to read, right?
They're going to scroll up, it all looks completely reasonable,
and they click, yeah, go ahead, it seems fine,
and then we're actually running the malicious code on their desktop.
This actually tricked 100% of the developers we tried it on,
and we told them that we were going to try to sneak something in,
they still couldn't find it.
Wow.
I guess, I mean, is it just that we're all worn down by the eulahs that we click every day?
I mean, that's part of it, right?
Like, we're trained to say yes to things, and these human-in-the-loop prompts,
if you're using cloud code on a regular basis for this kind of work,
you're saying yes to this thing a lot.
So it kind of becomes muscle memory where you're just like, yeah, I glance at it quick.
It doesn't seem crazy.
I'm going to go ahead.
Well, the research points out that these interactions aren't deterministic.
How does this make lies in the loop harder to reproduce and also an attractive method for hackers?
Yeah, I like that you pointed out both halves of that because it is kind of a two-edged sword.
On the one hand, it makes our life a little bit more difficult from an attacker point of view because we can spend a lot of time crafting a GitHub issue as an example, if that's our source of a prompt injection.
That's not obvious to a human reader.
We do all kinds of testing.
We say, hey, this works.
and then our target developer goes and runs it
and it doesn't work that time
and cloud code says,
hey, you're about to run this thing.
It's obviously malicious and developers like,
wait, what? No, there's a problem.
I call their security team to get it involved
and I've wasted all of my work.
That's very frustrating for me as an attacker.
These don't always work.
I can't always get them to do exactly what I want.
The flip side is,
if I'm cautious about how I hide my prompt injection
and the instructions that I actually give
in that prompt injection
for how the AI agent could,
behave, I can do a very
massive attack and attack
many developers, hundreds of contributors,
an entire organization,
and I only need to succeed once.
And so the fact that
one developer had a weird prompt
one time, and they say yes and they go,
wait, after I said yes to this, something weird happened,
hey, friend, can you try to reproduce this?
It might work fine for them without any
obvious malicious content whatsoever.
And now
you're in the boat of like, it's really hard to reproduce
and prove something bad happened.
Well, when you're looking at the current crop of agents that are out there, what makes one more vulnerable than the other?
Are there design choices built into some or the other that make them more liable to fall victim to this?
Or are there prompts or terminal layouts or things like that?
I mean, terminal layouts is part of it.
In theory, we're essentially spamming the user with stuff that looks okay, right?
or we're asking the AI to do it for us.
So in theory, if a developer's suspicious of something
and they go back and they scroll back
as far as their terminal will let them
and read really carefully and think about every line,
they could catch this and not get tricked.
It's kind of like spotting a really good fishing email, right?
You have a shot if you know what you're looking for.
But the problem is terminal configurations are all over the place,
especially for devs, right?
I know when I'm writing code, like my terminal is in
side Visual Studio code, and it's maybe 10 lines high, and it's got maybe a thousand lines
of scroll back. And so if I did a big enough prompt like this, it would run out of buffer
and you would never see what the malicious activity was unless you were keeping some kind of logs.
And the reality is, like, people aren't going to do that. So the smaller your terminal is,
the easier this is to trick, the less likely you are to scroll back and read, you know,
the easier it is to trick. And I don't think it really matters much agent to agent in this case. I do
think cloud code is one of the best that I've seen in terms of it's very upfront when you start
like, hey, prompt injections are a thing. Be careful about what you trust. It's very aggressive
about, hey, if I'm going to run any local code or access any local files, I am going to ask
your permission first. And we do think there's things people could do on the AI agent's side
to make this a little harder, like Lauren about really long, you know, I'm going to do many
steps or kind of be a safer against prompt injection, those kinds of things. But there is no
perfect solution that somebody like Anthropic could implement to completely prevent this. It's more
that as an adopter of the tool, you have to be aware that you can be tricked. It's kind of like
fishing in that way too, right? You can have some controls in place that help, but a good fishing email
is probably going to get through those defenses and somebody's going to click on it, right?
Someone's going to be tired or in a rush or, you know, it fits their circumstances just perfectly.
we'll be right back
at the risk of sounding snarky
it seems like you need a second AI agent
to look at what's going on in your terminal
when you can't right
and like there are people who are trying to do that
but it is again it's kind of a fundamental pattern problem right
And it's the position of all these agent companies is, hey, we're asking you permission, and that's on you.
And I think a lot of adopters think of that as like a security defense.
It's a nice safety feature.
It prevents a lot of accidents, right?
So it's an important and good thing that we do in the industry.
But I think a lot of people think, oh, it's going to protect me.
And it doesn't.
It actually just transfers the risk to you, right?
It allows these makers to say, look, we can't reliably tell you of something as safe or risky.
There's no way we're going to have enough context.
LOMs are not nearly advanced enough to do that job effectively.
So we have to rely on you as a human to rub your brain cells together and tell us something safe.
And so there's a mismatch between the safety that people expect
and what they're actually getting from this human in the loop protection.
And like the whole idea of lies in the loop is
you're trusting your agent to tell you the truth about what it's supposed to do
and we can conscript it to lie on our behalf as attackers
so that you give permission to do stuff you're not supposed to.
And like the Anthropic and others are going to say,
like, hey, that's not our problem.
We asked you first.
It's up to you to determine if it's safe.
What are the stakes here?
I mean, when you look at a successful execution of this,
what's the range of damage that could follow?
I mean, a skilled attacker could make this do anything
that the user has rights
to do. So in a
mature, high security
type organization, your
risk can be relatively low because
the developer doesn't have a lot of permissions. So
I may be able to cause
hassle. I might be able to
get that developer's credentials for something,
but those aren't super valuable because the developer doesn't
have a lot of access, right?
So it could potentially be like a foothold type
scenario in a mature organization.
But in smaller or less mature
or faster-moving organizations,
sometimes developers have a lot of access, right?
Both on the machines they're running the agents on,
but also kind of as a pivot point to things in production.
I mean, we've seen things where somebody gave an AI agent too much permission
and it, like, deleted a production database.
There are orgs where developers have that kind of access,
and if I'm targeting that org with malicious GitHub issue
or, you know, malicious commit to one of their open-source projects,
I can do something like we just saw with the Shihilud worm where I can steal the credentials
and then use those credentials to cause all kinds of damage or escalate privileges or steal further
credentials or, you know, and then the doors kind of blow wide open. So it has potential to be
very, very bad. I will soften that a little bit and say I don't think the effort versus reward
today is better than doing a fishing attack, right? So I don't want people to freak out like,
oh my God, I need a solution to this problem immediately.
You just need to build awareness that, hey, this is a way
that you could get fished that maybe you don't expect, right, effectively.
Yeah.
So I know you and your colleagues at Checkmarks responsibly disclose this to some of the vendors out there.
Did you get any reaction?
Any word back?
Yeah, we did.
Anthropics been great to deal with.
Actually, they're very consistent and prompt,
and their policy is very clear and,
consistently expressed by everybody,
which from a bug bounty hunter's standpoint
is actually a little frustrating, but I respect it.
It's what I would do
on the other side of that fence as well.
And their position is
understandable, even though we kind of
disagree with it a little bit,
they're basically saying, look,
once we ask you if this is okay,
that becomes your risk.
If you can show us how you can trick us into doing this
without user permission at all,
then we're very interested, but
as long as we're prompting the user,
That's outside our threat model, and there's nothing we feel we can do about it.
I'm frustrated from I want to see people protected, but also I understand where they're coming from.
It's a very hard thing for them to defend against this entire pattern of attack.
Well, what are your recommendations then?
I mean, how should folks best protect themselves here?
I think education goes a long way.
And when I say education, people always imagine a training class.
I don't think that works particularly well.
But making sure that developers and anyone who's...
adopting an AI agent with this human in the loop understands like, hey, you are taking on
risk every time you say yes. And what the agent is telling you it's about to do isn't necessarily
accurate. So you have to be careful. You have to run these things in, you know, some kind of a
sandbox. Otherwise, you're probably going to get in trouble at some point. And then from a technology
standpoint, security organizations can help developers have environments that are better
sandboxed, make sure that the accounts that developers use for routine development don't have
a ton of access so that if they do get hit by something like this, the splash area is low
and hopefully nothing super important happens. And like the usual security stuff, like make sure
you have good endpoint detection and response and those kinds of things that limit the degree
of malicious behavior I can do as an attacker. We tend to back those off for developers a little bit
So it's very important to have kind of AppSec and operations teams work together to hit the right balance of backing those rules off enough to let developers do their job, but protecting them from the really bad stuff that can happen through a path like this.
I'm curious for your own sort of personal take on this is what is your sense when it comes to developers who are making good use of these sorts of tools, the amount of skepticism that they're applying here?
What is your sense with the folks you interact with?
I think they kind of fall into three categories from what I've seen.
You have the kind of the true believers that are all in on AI and it's going to change the world.
I want it to do as much of my job as possible.
You have the very, very skeptical on the other end who are like, man, I really don't want to use AI.
I'm using it to the extent that my company makes me.
I don't think it's a good idea.
And I think you have a pretty good chunk in the middle that goes, look, I understand it's good at some things and bad at others.
I'm in a hurry.
If this helps me get my job done faster,
great. The minute it doesn't,
I'm going to kick it in the teeth and move on to something else.
And I think that's probably the bulk of developers
as they're looking for, like,
hey, this will help me here, here, here and here.
And it's not good at this, this, and this.
And I'm going to learn how to use this effectively for myself.
In all three cases, even the folks that are skeptical,
there tends to be a kind of fundamental misunderstanding
of what the limits of AI are,
which I'm not really surprised.
Not everybody has time to be an AI expert.
There's a lot of marketing claims out there
of varying degrees of truth, like with any product.
And so I think there's a little bit more too much trust
in what an AI can do and how accurate certain things are.
Even people who understand that AI can hallucinate
don't understand how easily it can be tricked deliberately
into doing dangerous things.
And I think people like Anthropic
who are making these tools do understand that deeply
and are doing things to try to make it safe,
but ultimately people are taking on a lot more risk
than they realize.
When we show this to developers,
like every single one of them was surprised
that they got caught up,
because we put it in front of a handful of developers
and said, hey, we're going to try to trick you
into running a calculator.
And every single one of them,
when we did the very long prompts
that we kind of describe at the end of our blog on this topic,
every single one of them missed that there was a calculator command in there.
And they were looking for it.
And like, you're not doing that as a developer,
or you're not spending your day looking for malicious code, you're working.
So I think there's a high probability here to trick people because they're busy,
because they have a degree of trust in the tool if they're adopting it.
And because even if they're a little bit skeptical of it,
they're under a lot of time pressure and productivity pressure,
they're going to tend to want to just say yes without necessarily carefully reading it.
And let's be honest, like if I have to read in detail 300 lines of script
every time that my AI wants to do something for me,
am I really saving time anymore, right?
So we're kind of exploiting the idea
that the whole point of the AI is to be productive,
so people are going to be primed to skim rather than read carefully,
and we can really exploit that.
I can't help wondering, too,
and this is perhaps revealing my own biases here.
It strikes me that particularly for old-timers,
for whom the deterministic nature of computers
is one of the things we love about them, right?
This sort of turns out on its head,
the idea that for a computer,
one plus one always equals two.
And now we're in this world through LLMs
where we have to be mindful
that not only can they be incorrect,
but they could be prompted to actually deceive you,
it just goes against our learned,
impulses about how these things behave.
Am I at all on track there, you think, Darren?
I think you're absolutely correct.
I mean, we're re-learning this.
Like, we're relearning what interacting with the computer needs to mean
when AI is involved.
And that has, like, general security impacts as well.
Like, almost all the security controls we're used to putting in place
on the technical side of those controls is the assumption that, like,
hey, if I see this pattern, it's good or bad, or, you know,
I can make some kind of heuristic guess about how likely this is to be risky because it's predictable.
And now you have a thing where, hey, this attack doesn't even work every single time.
It doesn't produce the same output every single time.
It doesn't produce predictable output in normal, acceptable use.
How on earth can I establish from a tooling standpoint?
How can I put a control in that looks at what an AI is doing and makes it good or bad determination?
And you kind of prompted, like, earlier, like, well, there'll be another AI agent that does that.
And that's kind of where we're headed.
Not sure if that's a good thing or not,
but it's definitely where we're headed.
It's turtles all the way down.
It is.
Our thanks to Darren Meyer,
security research advocate at Checkmarks for joining us.
The research is titled,
bypassing AI agent defenses,
with lies in the loop.
We'll have a link in the show notes.
And that's Research Saturday,
brought to you by N2K CyberWire.
We'd love to know what you think of this podcast.
Your feedback ensures we deliver the insights
that keep you a step ahead
in the rapidly changing world of cybersecurity.
If you like our show,
please share a rating and review
in your favorite podcast app.
Please also fill out the survey in the show notes
or send an email to Cyberwire at N2K.com.
This episode was produced by Liz Stokes,
We're mixed by Elliot Peltzman and Trey Hester.
Our executive producer is Jennifer Ibin.
Peter Kielpe is our publisher, and I'm Dave Bittner.
Thanks for listening.
We'll see you back here next time.
