CyberWire Daily - A dark side to LLMs. [Research Saturday]
Episode Date: April 8, 2023Sahar Abdelnabi from CISPA Helmholtz Center for Information Security sits down with Dave to discuss their work on "A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated ...Large Language Models." There is currently a large advance in the capabilities of Large Language Models or LLMs, as well as being integrated into many systems, including integrated development environments (IDEs) and search engines. The research states, "The functionalities of current LLMs can be modulated via natural language prompts, while their exact internal functionality remains implicit and unassessable." This could lead them to be susceptible to targeted adversarial prompting, as well as making them adaptable to even unseen tasks. Researchers demonstrated these said attacks to see if the LLMs needed new techniques for more defense. The research can be found here: More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models Learn more about your ad choices. Visit megaphone.fm/adchoices
Transcript
Discussion (0)
You're listening to the Cyber Wire Network, powered by N2K. of you, I was concerned about my data being sold by data brokers. So I decided to try Delete.me.
I have to say, Delete.me is a game changer. Within days of signing up, they started removing my
personal information from hundreds of data brokers. I finally have peace of mind knowing
my data privacy is protected. Delete.me's team does all the work for you with detailed reports
so you know exactly what's been done. Take control of your data and keep your private life Thank you. Hello, everyone, and welcome to the CyberWire's Research Saturday.
I'm Dave Bittner, and this is our weekly conversation with researchers and analysts
tracking down the threats and vulnerabilities,
solving some of the hard problems of protecting ourselves in a rapidly
evolving cyberspace.
Thanks for joining us.
We all have seen the news after ChatGPT and we all have seen how people are starting to
get really interested and hyped by the new technology.
And that motivated our colleagues to actually inspire them to think that there might be an issue with this integration.
That's Sahar Abdonabi from CISPA Hemholtz Center for Information Security.
The research we're discussing today is titled
A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models.
are really not noticing when we put these large language models in other applications and rely on their output and also rely on the input that they might digest in real time
from other untrusted or unverified sources.
Well, let's go through the research together here.
Can you take us through exactly how you got your start?
And let's go through the findings.
Yeah, sure.
your start and let's go through the findings.
Yeah, sure.
So currently the main way people
have been interacting with
ChatGPT before
the plugins and before
being chat and so on is that
you go to ChatGPT, you
enter a question or
anything that you would like to ask
for and then ChatGPT
answers. That was the main way
of communication. There is a clear input, there is a clear output. And with that, there were still
some risks because there were some people that could circumvent the filtering and maybe generate
some harmful output or malicious output. And there were also some risks that people rely on the information from chat GPT as trusted or factual, while in some cases it's not.
However, it was a clear scenario.
There is a clear input and clear output.
Now, when we integrate LLMs or large-division models
with other applications, the line between the instructions that are directly given by the users
and the other instructions that might be maliciously injected somewhere else
can get really blurry.
So I might ask LinkChat, for example, a question.
And to answer my question, PinkChat can go and search online
for some information or some sources
or websites or whatever.
However, someone out there
might plan some hidden instructions
for PinkChat.
And these instructions will be digested
by the model and will affect
how the model can communicate
with me later on.
So there is some hidden layer of communication of instructions that me as a user might have
not been aware of.
And therefore, there is a clear violation of security boundary that could happen and
could open up a lot of new attack vectors.
Well, explain to me how you all went about testing this.
Yeah, so when we actually tested these attacks, at that time,
Big Chat was not yet available, at least for us here in Germany.
I'm not sure if it was released earlier, maybe in other countries,
like the case with BART nowadays.
was released earlier, maybe in other countries,
like the case with Word nowadays.
But when we actually wrote the paper, we didn't have chat GPT APIs.
We have Bing chat, and we really had limited,
sort of not state-of-the-art models.
So what we did, actually, ironically,
that was only less than two months ago.
It's nearly one month or like five or six weeks.
We had access to the latest GPT-3 model, the DaVinci model.
And we simulated the tools like the plugins, actually, that we all are seeing now. So we simulated plugins or tools like a personal assistant that can read your emails
and maybe draft or send emails, which again, we now see LLMs are integrated into applications
like emails. We simulated also a tool that when you ask the question, it go to Wikipedia and maybe
find some relevant Wikipedia articles and read them and
answer a question and so on. Again, because we just have really access to the current tools that
are actually available nowadays. So we did this, like we had some instructions in the input to the
model, like in the Wikipedia article, for example, that the model would be reading during the search or in the email
that the model would be receiving as a personal assistant agent or so on.
And the instructions are hidden or embedded in this input to the model
to simulate the case when the MLM is integrated in other applications.
and the MLM is integrated in other applications.
And then the user is asking the chat or the simulated chatbot that we have built
also using DaVinci for...
We experimented for some reason with Albert Einstein.
So the user might ask the chatbot
about information about Albert Einstein,
and then the chatbot will go and read the
Wikipedia page, which we prepared for it. We simulated that it's the Wikipedia page,
but we hid some instructions in there. And then you, unexpectedly, you might find the
model speaking in a pirate accent because we told it to do so. Or you might find the
model asking you for personal information because, again, we told it to do so. Or you might find the model asking you for personal information,
because again, we told it to do so in the Wikipedia page that we have prepared.
Later, now we have access to ChatGPT APIs,
and we also have access to Bing Chat,
and we replicated a lot of these attacks with Bing Chat as well.
So in that case, we created a local HTML file, for example,
that contains these hidden instructions.
And you might have seen that Bing chat or the Edge browser
has this sidebar feature.
So there is a sidebar feature.
If you are browsing a certain website, you might open the sidebar feature, if you are browsing a certain website,
you might open the sidebar and then open WingChat in the sidebar and start to speak to it.
Like tell it, for example, summarize the current website for me.
And in that case, it reads the context of the current page
or the current website that you are actually using or reading.
And any instructions hidden in this page
that might be hidden by any attacker
would actually affect the model.
So now, after the paper has released for like six weeks,
we can say that we can also replicate
most of these attacks very, very effectively
and even much more
successful than
we have imagined using the
initial DaVinci model using the
recent GPT-4
that's integrated into the chat.
And now, a message from our sponsor, Zscaler, the leader in cloud security.
Enterprises have spent billions of dollars on firewalls and VPNs, yet breaches continue to rise by an 18% year-over-year increase in ransomware attacks
and a $75 million record payout in 2024.
These traditional security tools expand your
attack surface with public-facing IPs that are exploited by bad actors more easily than ever
with AI tools. It's time to rethink your security. Zscaler Zero Trust plus AI stops attackers by
hiding your attack surface, making apps and IPs invisible,
eliminating lateral movement, connecting users only to specific apps, not the entire network,
continuously verifying every request based on identity and context,
simplifying security management with AI-powered automation,
and detecting threats using AI to analyze over 500 billion daily transactions.
Hackers can't attack what they can't see.
Protect your organization with Zscaler Zero Trust and AI.
Learn more at zscaler.com slash security. This reminds me of, I suppose in some ways,
it reminds me of people using things like search engine optimization
to try to rise to the top of Google results.
But then also we hear people doing kind of SEO poisoning,
where they want malicious things to rise to the top.
I mean, this strikes me as being along the same lines as that sort of thing.
Is that an accurate perception on my part?
That's actually a very, very accurate observation
because it's also one of the things that we imagine
how these attacks might be disseminated, right?
So some people might use strategies like exactly SEO poisoning
in order to get their website
retrieved by search engines.
And if they are retrieved,
then the LLM running the search engine
would also be poisoned or ingested
by these prompts
that are hidden in their websites.
So what do you recommend then?
You've demonstrated this capability.
Do you have any suggestions for how we might go forward?
For whom exactly?
So for users or for...
Well, let's do them one at a time.
Why don't we start with the users?
I think at the moment moment at least my recommendation would
be to
really be sure
to not use the models
if you
need 100% reliable
and factual output.
Yeah, you can ask Bing Chat, tell me
some recipes for today.
Which is fine
because there is really no huge consequences that would
come out of that question.
But if you really want to look for very reliable answers, I wouldn't recommend to use MLMs
for this.
And I would definitely recommend to verify not only if the output is factual,
because this is a huge part of the whole thing,
but also to verify the links that maybe Wingchat might suggest to you.
So, for example, as part of the answer,
Wingchat can tell you, find more information here or whatever.
But these links might be malicious
because the prompts might actually
tell the model and instruct the model
to suggest, for example, harmful URLs.
What about for developers,
for folks who are out there
and are eager to use these APIs?
Are there warnings for them as well?
I would say yes.
At the moment, it's really not so clear
what the consequences of these models are.
And I think there is a lot of harm
that could be done by the current race
of really the whole community
to integrate LLMs in everything.
And I think we really need to stop and ask ourselves if we are ready for the
whole safety considerations at the moment or not.
Are there any things that you and your colleagues are going to work on next?
Has this work led you to more interesting or additional interesting avenues
to explore? Of course. I mean, as I said, this has been done only less than six weeks ago.
And we actually came up with this whole paper in only one week. So we wrote it in one week,
and we did all the experiments in just one week. It was crazy. It was the fastest thing I
ever seen come together actually. And since then we really were not able to catch a break, honestly,
because there are every day, there are new models released out there. There are new opportunities
for attacks. And honestly, things that we, when we wrote the paper,
we thought they are a bit futuristic,
like models that can read your emails,
send automatic emails,
and these emails are somehow poisoned
and all these.
It kind of seemed like very futuristic things
and somehow a bit of sci-fi.
But now we have all of these things.
I thought that this would be a little bit longer along the way when we have all of these things. I thought that this would be a little bit longer
along the way when we have all these models
but they are actually ready at the moment.
And yeah, since then we have been working
on actually testing the whole ideas
on the models that are more recent
such as Bing and GPT-H, GPT-4 and so on.
And actually surprisingly, the attacks work so much better when we have better models.
Our thanks to Sahar Abdonabi from CISPA Hemholtz Center for Information Security.
The research is titled A Comprehensive Analysis of Novel Prompt Injection Threats to Application Integrated Large Language Models.
We'll have a link in the show notes.
And now a message from Black Cloak.
Did you know the easiest way for cybercriminals to bypass your company's defenses
is by targeting your executives and their families at home?
Black Cloak's award-winning digital executive protection platform
secures their personal devices, home networks, and connected lives.
Because when executives are compromised at home, your company is at risk.
In fact, over one-third of new members discover they've already been breached.
Protect your executives and their families 24-7, 365, with Black Cloak.
Learn more at blackcloak.io.
The Cyber Wire Research Saturday podcast is a production of N2K Networks,
proudly produced in Maryland out of the startup studios of DataTribe,
where they're co-building the next generation of cybersecurity teams and technologies.
This episode was produced by Liz Ervin and senior producer Jennifer Iben.
Our mixer is Elliot Peltzman.
Our executive editor is Peter Kilby, and I'm Dave Bittner.
Thanks for listening.