CyberWire Daily - Exposing AI's Achilles heel. [Research Saturday]

Episode Date: November 23, 2024

This week, we are joined by Ami Luttwak, Co-Founder and CTO from Wiz, sharing their work on "Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments." A critical vulnerability in the NVIDIA Container Toolkit, widely used for GPU access in AI workloads, could allow attackers to escape containers and gain full access to host environments, jeopardizing sensitive data. Wiz estimates that at least 33% of cloud environments are affected and urges immediate updates to NVIDIA's patched version. This discovery highlights the broader issue of young, under-secured codebases in AI tools, emphasizing the need for stronger security measures and collaboration. The research can be found here: Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments. Learn more about your ad choices. Visit megaphone.fm/adchoices

Transcript
Starting point is 00:00:00 You're listening to the CyberWire Network, powered by N2K. I was concerned about my data being sold by data brokers, so I decided to try DeleteMe. I have to say, DeleteMe is a game changer. Within days of signing up, they started removing my personal information from hundreds of data brokers. I finally have peace of mind knowing my data privacy is protected. DeleteMe's team does all the work for you, with detailed reports so you know exactly... Thank you. Hello, everyone, and welcome to the CyberWire's Research Saturday. I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down the threats and vulnerabilities, solving some of the hard problems, and protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us.
Starting point is 00:01:53 So the Wiz research team focuses on finding critical vulnerabilities in cloud environments. And recently, we focused a lot on AI research. That's Ami Luttwak, co-founder and CTO from Wiz. Today we're discussing their research, "Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments." We published different research efforts that we've done,
Starting point is 00:02:39 basically finding vulnerabilities in huge AI services, AI services that provide the AI capabilities to most organizations in the world, like Hugging Face, like Replicate and SAP. So we started thinking, okay, what could be a way, an attack surface on the entire AI industry? When we started thinking about it, we got to the software stack of NVIDIA. Because we all know that NVIDIA is an amazing company. They have the GPUs that everyone uses for AI. But a little known fact is that there's also
Starting point is 00:03:13 a pretty considerable software stack that comes together with the GPUs. And that software stack is actually used by anyone using AI. So we thought if we can find a vulnerability there, this vulnerability can affect the entire AI industry. So that's how we started looking into the NVIDIA Container Toolkit. Well, we're talking today about CVE-2024-0132, which affects NVIDIA's Container Toolkit.
Starting point is 00:03:44 Can you walk us through exactly what is involved here with this vulnerability? Yes. So what is the NVIDIA container toolkit? It's basically a piece of software that anyone that wants to use a GPU and share the GPU across multiple users, and that happens a lot because GPUs are expensive, you would basically add to your container support for GPU. So the container itself can access the GPU and leverage the resources from the GPU. So this container toolkit is basically used by almost anyone that builds an application of AI on top of GPUs when the application is containerized.
Starting point is 00:04:28 Now, the vulnerability that we found allows the container image to escape from the container and basically take over the entire node. So that means that if the container image runs from a source that is not controlled by the service provider, this container image can escape and read any secret, any file, and even execute code on the extra node that runs the GPU itself, the extra server. Well, how could the attacker escape the container
Starting point is 00:05:04 and then gain control of the host system? Oh, okay. So basically what we found is that this, in theory, it's not possible, right? If I run a container that has no capabilities, no permissions, how can it be that this container can escape and take over the entire server?
Starting point is 00:05:21 So what we found is a vulnerability within the NVIDIA toolkit that if we craft a very specific container image, right, that uses very specific features within the NVIDIA container toolkit, what it actually does is that it maps, mistakenly, right, it maps to my container, which is untrusted, the entire file system of the server. It means that we can read any file from the underlying server because of this vulnerability. And we showed that once you have read access to any file on the server, we can actually run a privileged container that can take over the entire server. So this bug, this vulnerability that allowed us to map accidentally into our container the entire server file system, also of course allows you to do full takeover if
Starting point is 00:06:17 you want. And is this specific to GPU-enabled containers? Are they more susceptible to this type of attack? So this is, I mean, this is obviously, it's wider than AI. I mean, it's actually any usage of GPU. It can be for gaming. So this basically affects almost anyone using NVIDIA for containers. The reason that it's irrelevant for GPUs is just because this is the software stack that is used there, right?
Starting point is 00:06:45 So we usually wouldn't find this library when you don't have GPUs because it's a library that allows for GPU integration. It's not actually a bug in the GPU, right? It's just a bug in the software stack that is used by most of the GPU users. What about multi-tenant environments, Kubernetes clusters, those sorts of things?
Starting point is 00:07:07 So I think in multi-tenant environments, the risk is much, much higher. And this becomes a crucial risk in the exact use case that we started the research for was in environments where either you are a multi-tenant and you allow others to run their own container images, right? In that scenario, a container image that is
Starting point is 00:07:30 malicious can escape the isolation and can potentially access the other images from other users, right? So basically, in a multi-tenant environment, there is a huge risk here that this container escape vulnerability allows
Starting point is 00:07:45 the attacker to get access to anyone using the AI service. And this is why we always recommend in the AI, with research team, when you build applications, remember that containers can be escaped. So do not trust the container as a way to isolate your tenant. So even if you build a multi-tenant service, do not rely on containers. Always add another virtualization area that is stronger. And this is a good explanation here why this is so crucial.
Starting point is 00:08:17 We found the vulnerability. And if you didn't build the right isolation, your service is at risk right now. Now, my understanding is NVIDIA recently released a patch for this vulnerability. How should organizations prioritize their patching? Yes, so we work very closely with NVIDIA and they responded very fast and they closed the vulnerability within a few weeks since the time we disclosed it to them and the patch was released, this vulnerability affects anyone using GPU.
Starting point is 00:08:49 However, if we look at what is really crucial to fix, it's more urgent to fix areas where you allow an untrusted image to run, right? Because if you trust the image and you know that it's not actually coming from an untrusted source, the ability for the attacker to leverage this vulnerability is highly limited. However, if you have environments where you have researchers that download untrusted images
Starting point is 00:09:13 or you have multi-tenth environments that run images from users, these are environments that are at high risk right now. And that's what we recommend to prioritize and actually fix today. high risk right now. And that's what we recommend to prioritize and actually fix today. What about the various attack vectors that are possible here? I mean, are there particular attack vectors that folks should be aware of? Basically, container escape is just the first step of an attack, right? But once you escape the container, you can steal all of the secrets. You can get access to any AI model on the server. You can start running code on other environments. So the container escape on its own is just the beginning of the attack.
Starting point is 00:09:57 You can think about it as basically the initial access into the environment. So if you look at a classic attack, this would just be the first step. And any step from there depends on the specific use case in architecture. However, what's important to understand is that many companies do run untrusted AI models, right?
Starting point is 00:10:16 And we've talked about it in the past in other research that we've done. Researchers download AI models without any way to verify them. So this risk of, hey, someone is running an untrusted AI model and this AI model can now escape the container because we thought it's fine to run AI models in containers. There's nothing going to happen to me.
Starting point is 00:10:34 So this assumption is not true. We'll be right back. We'll be right back. Do you know the status of your compliance controls right now? Like, right now? We know that real-time visibility is critical for security, but when it comes to our GRC programs, we rely on point-in-time checks. But get this.
Starting point is 00:11:11 More than 8,000 companies like Atlassian and Quora have continuous visibility into their controls with Vanta. Here's the gist. Vanta brings automation to evidence collection across 30 frameworks, like SOC 2 and ISO 27001. They also centralize key workflows like policies, access reviews, and reporting, and helps you get security questionnaires done five times faster with AI. Now that's a new way to GRC. Get $1,000 off Vanta when you go to vanta.com slash cyber. That's vanta.com slash cyber for $1,000 off. And now a message from Black Cloak. Did you know the easiest way for cyber criminals to bypass your company's defenses is by targeting your executives and their families at home. Black Cloak's award-winning digital executive protection platform secures their
Starting point is 00:12:11 personal devices, home networks, and connected lives. Because when executives are compromised at home, your company is at risk. In fact, over one-third of new members discover they've already been breached. Protect your executives and their families 24-7, 365 with Black Cloak. Learn more at blackcloak.io. What are some of the other isolation barriers that people should be using here? Are we talking about things like virtualization? Exactly. So basically, when we design for isolation, especially for multi-tenant services, containers are not a trusted barrier.
Starting point is 00:12:58 Virtual machines, virtualizations are considered a trusted barrier because if you look at the last recent years, how many vulnerabilities of Container Escape we found? How many vulnerabilities in Linux Kernel we found? There were an unnegligible number of vulnerabilities. However, in virtualization environments, that is very, very rare, right? And that's why as a security practitioner,
Starting point is 00:13:21 when I look at a review of an architecture, a virtual machine is the best way to isolate. Now, there is tools today like GVisor, which is a tool that you can run that limits the ability of a workload to go outside of a specific set of approved perimeter capabilities, which reduce the risk significantly. GVisor is not as secure as running a full virtual machine, but it's an example of a tool that provides great isolation capabilities without changing your entire architecture.
Starting point is 00:13:56 What about organizations that might allow, let's say, third-party AI models or third-party container images to be running on their GPU infrastructure? Do you have any advice for them? Yeah, so I think that happens a lot, right? So it happens both for AI service providers, but also for anyone that has a GPU and allows anyone in the company to run code. And the implications here are, first of all,
Starting point is 00:14:25 that you have to pitch, right? That's number one, just pitch for the vulnerability. But the wider implications are that we need to look at AI models and container images that come from third parties, just like we look at the applications that we download from, you know? Like when you go through an email
Starting point is 00:14:43 and you get an email from someone and you download the email, you know and I know that I would not start running the applications that I get from the email, right? Because we all know that that can be malicious.
Starting point is 00:14:53 Say, why do we trust a container image that is an AI model from an untrusted source, right? We should be a bit more careful because this is code that we are running and we need to remember that this is a new attack surface for
Starting point is 00:15:07 attackers, just like downloading applications from emails. It used to be a great attack surface, but no one is, I hope, no one is clicking on emails and actually running an application from an email. This is going to be a new attack vector, right? Everyone talks about AI, so they just run everything that has the name AI, any AI model would be run. No, we have to remember, this is a security risk. It's a new attack vector.
Starting point is 00:15:32 Anything that we run, either it's fully isolated in a separate VM and so on, or we have actual processes in the company to verify what is actually being run as an AI model and where, right? If you get an untrusted AI model, okay, you can only run it in this highly isolated environment, right? If we don't have this kind of guardrails, then we expose ourselves to a lot of risk. You mentioned that NVIDIA was a really helpful partner here in this disclosure. Can you walk us through what that process is like?
Starting point is 00:16:07 I mean, for folks who've never been through that, what goes into responsible disclosure with a big organization like NVIDIA? Great. So the NVIDIA team, first of all, how do we engage with them? So every company that has a product has a security program of how to report vulnerabilities to them. Usually there is an incident response email that is published. So we approach that vendor and there is a protocol that you have to follow, right? When we report the vulnerability, we do not provide anyone outside of the vendor
Starting point is 00:16:47 information about the vulnerability until it's fully patched. So the entire discussion is highly sensitive and secretive between us and the vendor. During that discussion, we try to provide to the vendor a full disclosure report with all of the information that we found. During that attempt, we usually try not to touch actual customer data of that vendor so they don't actually get to any kind of issues with their customers. So what we try to do as researchers is to find the problem, provide the vendor a full report, and then basically we wait. Once we send the email, we just wait until the vendor a full report, and then basically we wait. Once we send the email, we just wait until the vendor actually replies to us. In the NVIDIA use case, they actually worked
Starting point is 00:17:32 really fast. They provided us responses almost within a day, and they worked until they fixed the vulnerability. As I said, this was within two or three weeks, fully patched. Now, during that time, we communicate with the vendor. If they have any questions, anything that we found that they didn't know how to replicate, we help them actually reproduce. And the goal, again, is to make sure that the vendor has all of the information in order to fix the vulnerability.
Starting point is 00:18:01 Because it's like when we found the vulnerability and we reported it, think about it, it's like when we found the vulnerability and we reported it, think about it like a weapon, right? Until someone actually patched it, it's very, very secret and we cannot disclose and talk about it. Even with our friends, partners, customers, we cannot talk about it with anyone because I have a weapon now. And until the vendor actually finishes the fixed efforts, we have to remain silent on it. Now, once the vendor has a patch, our role as researchers is to explain to the world about the vulnerability and why it's important to patch it. Now, something that's important to understand is that although we talk about it, we do not
Starting point is 00:18:41 disclose yet in the beginning how the exploit actually works, right? And we do not disclose it because we want to give the good guys time before any bad guy can leverage the vulnerability. So although NVIDIA patched the vulnerability, since we didn't disclose exactly how to exploit it, we are giving time for the good people to fix the vulnerability before anyone can actually exploit it. And do you know if this is being actively exploited? Do you have any methods to be able to track that? So there is no way to know that for sure, right? We have ways because
Starting point is 00:19:21 we are also a security company and we are connected to millions of workloads. So we are actually monitoring the environments that we see for any potential exploitation. So we haven't seen exploitation of this vulnerability in the wild yet, but it doesn't mean that it will not happen soon. And of course, our view is limited because we see only cloud environments. There is huge amounts of GPUs deployed on on-premise environments within the cloud providers.
Starting point is 00:19:50 So our view is very limited. And also NVIDIA wouldn't see because this is actually happening in a local GPU, right? So no one can tell you for sure if this is actually already being exploited. I do think that, again, this is not a vulnerability that is easily exploited, because you do need ability to build an image, and then you need to publish the image. So it takes time until this kind of vulnerability
Starting point is 00:20:15 can be leveraged by an attacker. Well, for those organizations who are running AI models in containers, what are some of the best practices they should follow to help mitigate these risks? That's a great question. You know, we talk so much, there's so much buzz about AI security. And many times people talk about, oh, how AI is going to take over the world or how the attackers are leveraging AI to
Starting point is 00:20:43 basically take over my company. But the real risk right now, the real risk right now for AI usage is the AI infrastructure that you use, right? So, I mean, if you look at this vulnerability, where does it come from? It comes from the AI infrastructure that you have in the company. And everyone now that's starting using AI, they have dozens or hundreds of tools that are used for AI. And these tools are actually bringing real risk right now.
Starting point is 00:21:11 So if I think about the best practices from this vulnerability, it's number one, you need to know what AI tools are being used in your company by the AI researchers. tools are being used in your company by the AI researchers. And again, I want to endorse AI usage, but I need to be able to say, I have visibility into all of the AI environments and all of the AI tooling across my company, right? And the second step is, as we saw here, AI models are great, but they're also kind of risky. So you need to define AI governance processes. So basically, which projects are using AI? Which models are using?
Starting point is 00:21:52 What's the source of the model? Where are you testing AI models? Is it running in a test-isolated environment? All of those are definitions that each company needs to do. And I call this AI governance. It's composed of AI discovery, the ability to define AI testing. All of those processes are important to define right now.
Starting point is 00:22:12 And every team that has, you know, an AI team and a security team, they should start working together to define those kinds of practices. It's always better to do it early than to do it later. Our thanks to Amy Lutwak from Wizz for joining us. The research is titled, Wizz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs Including Over 35% of Cloud Environments. We'll have a link in the show notes. We'd love, including over 35% of cloud environments.
Starting point is 00:22:45 We'll have a link in the show notes. We'd love to know what you think of this podcast. Your feedback ensures we deliver the insights that keep you a step ahead in the rapidly changing world of cybersecurity. If you like our show, please share a rating and review in your favorite podcast app.
Starting point is 00:23:00 Please also fill out the survey in the show notes or send an email to cyberwire at n2k.com. We're privileged that N2K Cyber Wire is part of the daily routine of the most influential leaders and operators in the public and private sector, from the Fortune 500 to many of the world's preeminent intelligence and law enforcement agencies. N2K makes it easy for companies to optimize your biggest investment, your people. We make you smarter about your teams while making your team smarter. Learn how at n2k.com. This episode was produced by Liz Stokes.
Starting point is 00:23:32 We're mixed by Elliot Peltzman and Trey Hester. Our executive producer is Jennifer Iben. Our executive editor is Brandon Karp. Simone Petrella is our president. Peter Kilby is our publisher. And I'm Dave Bittner. Thanks for listening. We'll see you back here next time.
