CyberWire Daily - Exposing AI's Achilles heel. [Research Saturday]
Episode Date: November 23, 2024

This week, we are joined by Ami Luttwak, Co-Founder and CTO of Wiz, sharing their work on "Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments." A critical vulnerability in the NVIDIA Container Toolkit, widely used for GPU access in AI workloads, could allow attackers to escape containers and gain full access to host environments, jeopardizing sensitive data. Wiz estimates that at least 33% of cloud environments are affected and urges immediate updates to NVIDIA's patched version. This discovery highlights the broader issue of young, under-secured codebases in AI tools, emphasizing the need for stronger security measures and collaboration.

The research can be found here: Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments
Transcript
You're listening to the CyberWire Network, powered by N2K.

Hello, everyone, and welcome to the CyberWire's Research Saturday.
I'm Dave Bittner, and this is our weekly conversation with researchers and analysts
tracking down the threats and vulnerabilities, solving some of the hard problems,
and protecting ourselves in a rapidly evolving cyberspace.
Thanks for joining us.
So the Wiz research team focuses on finding critical vulnerabilities in cloud environments.
And recently, we focused a lot on AI research.
That's Ami Luttwak, co-founder and CTO of Wiz.
Today we're discussing their research.
Wiz Research finds critical NVIDIA AI vulnerability
affecting containers using NVIDIA GPUs,
including over 35% of cloud environments.
We published different research efforts that we've done, basically finding vulnerabilities in huge AI services, the services that provide AI capabilities to most organizations in the world, like Hugging Face, Replicate, and SAP.
So we started thinking, okay, what could be a way,
an attack surface on the entire AI industry?
When we started thinking about it, we got to the software stack of NVIDIA.
Because we all know that NVIDIA is an amazing company.
They have the GPUs that everyone uses for AI.
But a little known fact is that there's also
a pretty considerable software stack
that comes together with the GPUs.
And that software stack is actually used by anyone using AI.
So we thought if we can find a vulnerability there,
this vulnerability can affect the entire AI industry.
So that's how we started looking into the NVIDIA Container Toolkit.
Well, we're talking today about CVE-2024-0132,
which affects NVIDIA's Container Toolkit.
Can you walk us through exactly what is involved here with this vulnerability?
Yes. So what is the NVIDIA Container Toolkit?
It's basically a piece of software for anyone who wants to use a GPU and share that GPU across multiple users, which happens a lot because GPUs are expensive. You add it to your container to get GPU support, so the container itself can access the GPU and leverage its resources. So this Container Toolkit is used by almost anyone who builds an AI application on top of GPUs when the application is containerized.
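To make that setup concrete, here is a minimal sketch (not from the interview) of how a containerized workload typically gets GPU access through the toolkit: Docker's --gpus flag delegates GPU setup to the NVIDIA Container Toolkit, which wires the GPU devices and driver libraries into the container. It assumes Docker and the toolkit are installed on the host, and the CUDA image tag is only an illustrative example.

```python
import subprocess

# Minimal sketch: launch a container with GPU access via the NVIDIA Container
# Toolkit. The "--gpus all" flag tells Docker to delegate GPU setup to the
# toolkit, which mounts the GPU devices and driver libraries into the container.
# Assumes Docker and the toolkit are installed; the CUDA image tag is only an
# illustrative example.
cmd = [
    "docker", "run", "--rm",
    "--gpus", "all",                        # GPU access, handled by the toolkit
    "nvidia/cuda:12.2.0-base-ubuntu22.04",  # example CUDA base image
    "nvidia-smi",                           # prove the container can see the GPU
]

result = subprocess.run(cmd, capture_output=True, text=True, check=False)
print(result.stdout or result.stderr)
```

Every container started this way depends on the toolkit's code path, which is exactly why a bug there has such a wide blast radius.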
Now, the vulnerability that we found allows the container image to escape from the container
and basically take over the entire node.
So that means that if the container image runs from a source
that is not controlled by the service provider,
this container image can escape and read any secret,
any file, and even execute code on the actual node
that runs the GPU itself, the actual server.
Well, how could the attacker escape the container
and then gain control of the host system?
Oh, okay.
So basically, what we found is that, in theory, this is not possible, right?
If I run a container that has no capabilities,
no permissions,
how can it be that this container can escape
and take over the entire server?
So what we found is a vulnerability within the NVIDIA Container Toolkit: if we craft a very specific container image, right, one that uses very specific features within the NVIDIA Container Toolkit, what it actually does is that it mistakenly maps the entire file system of the server into my container, which is untrusted.
It means that we can read any file from the underlying server because of this vulnerability. And we showed that once you have read access to any file on the server, you can actually run a privileged container that can take over the entire server. So this bug, this vulnerability that accidentally mapped the entire server file system into our container, of course also allows a full takeover if you want.
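As a generic illustration only (this is not the Wiz exploit chain, and it assumes a hypothetical mount point /host where the host's root filesystem has become visible inside the container), here is the kind of material an attacker-controlled image could immediately look for once such a mapping exists, including the container runtime socket that makes the privileged-container takeover Ami describes possible.

```python
from pathlib import Path

# Generic illustration only: assumes a hypothetical /host mount where the host's
# root filesystem has become visible inside an attacker-controlled container.
# This is not the CVE-2024-0132 exploit; it just shows why host filesystem
# exposure is treated as full host compromise.
HOST_ROOT = Path("/host")

SENSITIVE_PATHS = [
    "etc/shadow",            # local account password hashes
    "root/.ssh/id_rsa",      # SSH private keys
    "var/lib/kubelet/pki",   # kubelet certificates on Kubernetes nodes
    "var/run/docker.sock",   # container runtime socket; access to it lets an
                             # attacker launch a privileged container on the host
]

for rel in SENSITIVE_PATHS:
    path = HOST_ROOT / rel
    if path.exists():
        kind = "socket" if path.is_socket() else "file or directory"
        print(f"[!] host {kind} exposed inside the container: {path}")
```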
And is this specific to GPU-enabled containers? Are they more susceptible to this type of attack?
So this is obviously wider than AI. I mean, it's actually any usage of a GPU. It can be for gaming. So this basically affects almost anyone using NVIDIA GPUs with containers. The reason that it's specific to GPUs is just because this is the software stack that is used there, right?
So we usually wouldn't find this library
when you don't have GPUs
because it's a library that allows for GPU integration.
It's not actually a bug in the GPU, right?
It's just a bug in the software stack
that is used by most of the GPU users.
What about multi-tenant environments,
Kubernetes clusters, those sorts of things?
So I think in multi-tenant environments,
the risk is much, much higher.
And this becomes a crucial risk in the exact use case we started the research for: environments where you are multi-tenant and you allow others to run their own container images, right? In that scenario, a malicious container image can escape the isolation and can potentially access the other images from other users, right?
So basically, in a multi-tenant environment,
there is a huge risk here
that this container escape
vulnerability allows
the attacker to get access to anyone using the AI service.
And this is why we on the Wiz research team always recommend, when you build applications: remember that containers can be escaped. So do not trust the container as a way to isolate your tenants. Even if you build a multi-tenant service, do not rely on containers. Always add another isolation layer that is stronger. And this is a good example of why that is so crucial.
We found the vulnerability.
And if you didn't build the right isolation,
your service is at risk right now.
Now, my understanding is NVIDIA recently released a patch for this vulnerability.
How should organizations prioritize their patching?
Yes, so we worked very closely with NVIDIA, and they responded very fast. They closed the vulnerability within a few weeks of the time we disclosed it to them, and the patch was released. This vulnerability affects anyone using a GPU.
However, if we look at what is really crucial to fix,
it's more urgent to fix areas
where you allow an untrusted image to run, right?
Because if you trust the image
and you know that it's not actually coming
from an untrusted source,
the ability for the attacker to leverage this vulnerability is highly limited.
However, if you have environments where you have researchers that download untrusted images
or you have multi-tenant environments that run images from users,
these are environments that are at high risk right now.
And that's what we recommend to prioritize and actually fix today.
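As a small aid to that prioritization, here is a sketch of checking a host's installed NVIDIA Container Toolkit version against the patched release. The version constant reflects NVIDIA's advisory for CVE-2024-0132 as understood at the time of writing (toolkit 1.16.2); treat it as an assumption and verify it against the current advisory before relying on it.

```python
import re
import subprocess

# Sketch: compare the locally installed NVIDIA Container Toolkit version against
# the first patched release for CVE-2024-0132. PATCHED_VERSION is an assumption
# based on NVIDIA's advisory; verify it against the current advisory.
PATCHED_VERSION = (1, 16, 2)

def installed_toolkit_version():
    """Return the (major, minor, patch) reported by `nvidia-ctk --version`, or None."""
    try:
        out = subprocess.run(
            ["nvidia-ctk", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    match = re.search(r"version\s+(\d+)\.(\d+)\.(\d+)", out)
    return tuple(int(part) for part in match.groups()) if match else None

version = installed_toolkit_version()
if version is None:
    print("NVIDIA Container Toolkit not found, or version could not be parsed.")
elif version < PATCHED_VERSION:
    print(f"Likely vulnerable: toolkit {version} is older than {PATCHED_VERSION}.")
else:
    print(f"OK: toolkit {version} is at or above the assumed patched release.")
```

Hosts that run untrusted or multi-tenant images would be the first place to run a check like this, per the prioritization above.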
What about the various attack vectors that are possible here? I mean, are there particular attack vectors that folks should be aware of? Basically, container escape is just the first
step of an attack, right? But once you escape the container, you can steal all of the secrets. You can get access to any AI model on the server.
You can start running code on other environments.
So the container escape on its own is just the beginning of the attack.
You can think about it as basically the initial access into the environment.
So if you look at a classic attack, this would just be the first step.
And any step from there
depends on the specific use case
and architecture.
However, what's important to understand
is that many companies
do run untrusted AI models, right?
And we've talked about it in the past
in other research that we've done.
Researchers download AI models
without any way to verify them.
So there's this risk of, hey, someone is running an untrusted AI model, and this AI model can now escape the container. We thought it was fine to run AI models in containers, that nothing was going to happen to me. That assumption is not true.
We'll be right back.
Do you know the status of your compliance controls right now?
Like, right now?
We know that real-time visibility is critical for security,
but when it comes to our GRC programs, we rely on point-in-time checks.
But get this.
More than 8,000 companies like Atlassian and Quora have continuous visibility into their controls with Vanta.
Here's the gist.
Vanta brings automation to evidence collection across 30 frameworks, like SOC 2 and ISO 27001.
They also centralize key workflows like policies, access reviews, and reporting, and help you get security questionnaires done five times faster with AI.
Now that's a new way to GRC.
Get $1,000 off Vanta when you go to vanta.com slash cyber.
That's vanta.com slash cyber for $1,000 off.

And now a message from Black Cloak. Did you know the easiest way for cybercriminals to bypass your company's defenses is by targeting your executives and their families at home? Black Cloak's award-winning digital executive protection platform secures their
personal devices, home networks, and connected lives. Because when executives are compromised
at home, your company is at risk. In fact, over one-third of new members discover they've already been breached. Protect your executives and their families 24-7, 365 with Black Cloak.
Learn more at blackcloak.io.
What are some of the other isolation barriers that people should be using here?
Are we talking about things like virtualization?
Exactly.
So basically, when we design for isolation, especially for multi-tenant services,
containers are not a trusted barrier.
Virtual machines, virtualization, are considered a trusted barrier, because if you look at recent years, how many container escape vulnerabilities have we found? How many Linux kernel vulnerabilities have we found? A non-negligible number. However, in virtualization environments, that is very, very rare, right?
And that's why as a security practitioner,
when I look at a review of an architecture,
a virtual machine is the best way to isolate.
Now, there are tools today like gVisor, which you can run to limit the ability of a workload to go outside a specific set of approved capabilities, which reduces the risk significantly. gVisor is not as secure as running a full virtual machine, but it's an example of a tool that provides great isolation capabilities without changing your entire architecture.
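To show what an extra isolation layer "without changing your entire architecture" can look like, here is a hedged sketch that emits a Kubernetes RuntimeClass for gVisor plus a pod that opts into it. It assumes the nodes already have the gVisor runtime (runsc) installed and registered with the container runtime, that the workload is compatible with gVisor's GPU support (which has its own constraints), and the image name is hypothetical; the same RuntimeClass pattern works with Kata Containers for full VM-backed isolation.

```python
import yaml  # PyYAML

# Sketch: run an untrusted, GPU-using workload under gVisor instead of a plain
# runc container. Assumes the nodes already have the gVisor runtime (runsc)
# installed and registered; a "kata" handler would give VM-backed isolation.
runtime_class = {
    "apiVersion": "node.k8s.io/v1",
    "kind": "RuntimeClass",
    "metadata": {"name": "gvisor"},
    "handler": "runsc",  # must match the handler configured on the nodes
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "untrusted-model-runner"},
    "spec": {
        "runtimeClassName": "gvisor",  # opt this pod into the sandboxed runtime
        "containers": [
            {
                "name": "model",
                # Hypothetical image name for an untrusted third-party model.
                "image": "registry.example.com/untrusted-model:latest",
                "resources": {"limits": {"nvidia.com/gpu": 1}},  # request one GPU
            }
        ],
    },
}

print(yaml.safe_dump_all([runtime_class, pod], sort_keys=False))
```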
What about organizations that might allow, let's say, third-party AI models
or third-party container images to be running on their GPU infrastructure?
Do you have any advice for them?
Yeah, so I think that happens a lot, right?
So it happens both for AI service providers,
but also for anyone that has a GPU
and allows anyone in the company to run code.
And the implications here are, first of all, that you have to patch, right? That's number one, just patch the vulnerability.
But the wider implications are that we need to look
at AI models and container images
that come from third parties,
just like we look at the applications that we download, you know? Like when you get an email from someone and you download the attachment, you know and I know that I would not start running applications that I get over email, right? Because we all know that can be malicious. So why do we trust a container image that is an AI model from an untrusted source, right?
We should be a bit more careful
because this is code
that we are running
and we need to remember that this is a new attack surface for
attackers, just like downloading applications from emails. It used
to be a great attack surface, but no one is, I hope, no one is clicking
on emails and actually running an application from an email.
This is going to be a new attack vector, right? Everyone talks about AI,
so they just run everything that has the name AI,
any AI model would be run.
No, we have to remember, this is a security risk.
It's a new attack vector.
Anything that we run is either fully isolated in a separate VM and so on, or we have actual processes in the company to verify what is actually being run as an AI model and where, right? If you get an untrusted AI model, okay, you can only run it in this highly isolated environment, right? If we don't have these kinds of guardrails, then we expose ourselves to a lot of risk.
You mentioned that NVIDIA was a really helpful partner here in this disclosure. Can you walk us through what that process is like?
I mean, for folks who've never been through that,
what goes into responsible disclosure
with a big organization like NVIDIA?
Great.
So the NVIDIA team, first of all, how do we engage with them?
So every company that has a product has a security program of how to report vulnerabilities to them.
Usually there is an incident response email that is published.
So we approach that vendor and there is a protocol that you have to follow, right? When we report the vulnerability, we do not provide anyone outside of the vendor
information about the vulnerability until it's fully patched. So the entire discussion is highly
sensitive and secretive between us and the vendor. During that discussion, we try to provide to the
vendor a full disclosure report with all of the information that we found. During that research, we usually try not to touch actual customer data of that vendor, so they don't run into any kind of issues with their customers.
So what we try to do as researchers is to find the problem, provide the vendor a full
report, and then basically we wait. Once we send the email, we just wait until the vendor actually replies to us. In the NVIDIA use case, they actually worked
really fast. They provided us responses almost within a day, and they worked until they fixed
the vulnerability. As I said, this was within two or three weeks, fully patched.
Now, during that time, we communicate with the vendor.
If they have any questions, anything that we found
that they didn't know how to replicate,
we help them actually reproduce.
And the goal, again, is to make sure that the vendor
has all of the information in order to fix the vulnerability.
Because when we found the vulnerability and we reported it, think about it like a weapon, right? Until someone actually patches it, it's very, very secret, and we cannot disclose or talk about it. Even with our friends, partners, customers, we cannot talk about it with anyone, because I have a weapon now. And until the vendor actually finishes the fix efforts, we have to remain silent on it.
Now, once the vendor has a patch, our role as researchers is to explain to the world
about the vulnerability and why it's important to patch it.
Now, something that's important to understand is that although we talk about it, we do not
disclose yet in the beginning how the exploit actually works,
right?
And we do not disclose it because we want to give the good guys time before any bad
guy can leverage the vulnerability.
So although NVIDIA patched the vulnerability, since we didn't disclose exactly how to exploit
it, we are giving time for the good people to fix the vulnerability before anyone can
actually exploit it. And do you know if this is being actively exploited? Do you have any methods
to be able to track that? So there is no way to know that for sure, right? We have ways because
we are also a security company and we are connected to millions of workloads.
So we are actually monitoring the environments that we see
for any potential exploitation.
So we haven't seen exploitation of this vulnerability in the wild yet,
but it doesn't mean that it will not happen soon.
And of course, our view is limited
because we see only cloud environments.
There are huge amounts of GPUs deployed in on-premise environments, outside of the cloud providers. So our view is very limited. And NVIDIA wouldn't see it either, because this is actually happening on a local GPU, right?
So no one can tell you for sure if this is actually already being exploited.
I do think that, again,
this is not a vulnerability that is easily exploited,
because you do need the ability to build an image,
and then you need to publish the image.
So it takes time until this kind of vulnerability
can be leveraged by an attacker.
Well, for those organizations who are
running AI models in containers,
what are some of the best practices they should follow to help mitigate these
risks? That's a great question. You know,
we talk so much, there's so much buzz about AI security. And many times
people talk about, oh, how AI is going to take over the world
or how the attackers are leveraging AI to
basically take over my company.
But the real risk right now, the real risk right now for AI usage
is the AI infrastructure that you use, right?
So, I mean, if you look at this vulnerability, where does it come from?
It comes from the AI infrastructure that you have in the company.
And everyone that's starting to use AI now has dozens or hundreds of tools that are used for AI.
And these tools are actually bringing real risk right now.
So if I think about the best practices from this vulnerability, it's number one, you need to know what AI tools are being used in your company by the AI researchers.
And again, I want to endorse AI usage, but I need to be able to say, I have visibility into all of the AI environments and all of the AI tooling across my company, right?
And the second step is, as we saw here, AI models are great, but they're also kind of
risky.
So you need to define AI governance processes.
So basically, which projects are using AI?
Which models are they using?
What's the source of the model?
Where are you testing AI models?
Is it running in an isolated test environment?
All of those are definitions that each company needs to do.
And I call this AI governance.
It's composed of AI discovery,
the ability to define AI testing.
All of those processes are important to define right now.
And every team that has, you know,
an AI team and a security team,
they should start working together to define those kinds of practices.
It's always better to do it early than to do it later.
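To make "AI governance" slightly more concrete, here is a small, entirely hypothetical sketch of the kind of intake rule such a process might codify: every model gets a recorded source and target environment, and models from untrusted sources are only allowed into an isolated sandbox. The field names and the policy itself are illustrative assumptions, not anything prescribed in the interview.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AI governance intake check: record where each model
# comes from, and only allow models from untrusted sources to run in an isolated
# sandbox. All names and the policy itself are illustrative assumptions.
TRUSTED_SOURCES = {"internal-registry", "vetted-vendor"}

@dataclass
class ModelIntake:
    name: str
    source: str              # e.g. "internal-registry", "public-hub", "unknown"
    target_environment: str  # e.g. "prod-cluster", "isolated-sandbox"

def approved(intake: ModelIntake) -> bool:
    """Allow trusted sources anywhere; confine everything else to the sandbox."""
    if intake.source in TRUSTED_SOURCES:
        return True
    return intake.target_environment == "isolated-sandbox"

# Example decisions under this toy policy.
print(approved(ModelIntake("summarizer-v2", "internal-registry", "prod-cluster")))  # True
print(approved(ModelIntake("random-llm", "public-hub", "prod-cluster")))            # False
print(approved(ModelIntake("random-llm", "public-hub", "isolated-sandbox")))        # True
```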
Our thanks to Ami Luttwak from Wiz for joining us.
The research is titled,
Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs,
Including Over 35% of Cloud Environments.
We'll have a link in the show notes.
We'd love to know what you think of this podcast.
Your feedback ensures we deliver the insights
that keep you a step ahead
in the rapidly changing world of cybersecurity.
If you like our show,
please share a rating and review
in your favorite podcast app.
Please also fill out the survey in the show notes
or send an email to cyberwire at n2k.com.
We're privileged that N2K Cyber Wire is part of the daily routine of the most influential leaders
and operators in the public and private sector, from the Fortune 500 to many of the world's
preeminent intelligence and law enforcement agencies. N2K makes it easy for companies to
optimize your biggest investment, your people. We make you smarter about your teams while making your team smarter.
Learn how at n2k.com.
This episode was produced by Liz Stokes.
We're mixed by Elliot Peltzman and Trey Hester.
Our executive producer is Jennifer Iben.
Our executive editor is Brandon Karp.
Simone Petrella is our president.
Peter Kilby is our publisher.
And I'm Dave Bittner.
Thanks for listening.
We'll see you back here next time.