Limitless Podcast - Claude Mythos Is Too Dangerous To Release, But It Escaped Anyways
Episode Date: April 8, 2026
Some pretty alarming implications surround Anthropic's Claude Mythos AI model, which was withheld from public access after revealing thousands of security vulnerabilities. The AI actually breached containment, emphasizing the urgent need for strong cybersecurity measures.
------
🌌 LIMITLESS HQ ⬇️
NEWSLETTER: https://limitlessft.substack.com/
FOLLOW ON X: https://x.com/LimitlessFT
SPOTIFY: https://open.spotify.com/show/5oV29YUL8AzzwXkxEXlRMQ
APPLE: https://podcasts.apple.com/us/podcast/limitless-podcast/id1813210890
RSS FEED: https://limitlessft.substack.com/
------
TIMESTAMPS
0:00 The Rise of Claude Mythos
1:41 Unexpected Breakout
3:49 The Sandwich Incident
5:21 Exploits and Vulnerabilities
8:04 The Power of Collaboration
10:45 Future of AI Access
15:20 The Ethical Dilemma
17:00 The Blackwell Revolution
18:58 A New Era of Intelligence
23:32 The Impending Impact
25:15 Speculating on Mythos
------
RESOURCES
Josh: https://x.com/JoshKale
Ejaaz: https://x.com/cryptopunk7213
------
Not financial or tax advice. See our investment disclosures here:
https://www.bankless.com/disclosures
Transcript
What I'm about to say should scare you.
Anthropic just released a model that's so powerful, so dangerous,
that they can't release it to the public for the fear of the destruction that it would cause.
In just a few hours, it discovered over a thousand major security vulnerabilities,
and the only thing stopping it from exploiting them was a single Anthropic engineer telling it not to.
But that isn't even the craziest story.
During training, Claude Mythos broke out of secure containment
and emailed an Anthropic researcher bragging about the fact that it did that,
and then posted about it publicly online.
The Anthropic researcher was eating a sandwich.
This is by far the most consequential model release of the year, maybe ever.
And no one is talking about this.
I looked at five major news publications this morning,
and it didn't even break the top five headlines.
This is the most important release that no one's talking about.
I think that disconnect between the mainstream media
and what we're about to talk about on this episode is
one of the scarier parts of this entire story. This is the most powerful AI model that
has ever been released, ever. There is nothing more powerful, in fact, so powerful that you will
probably never actually be able to use this model. There's a high probability that the public
just never gets to touch it because it is so dangerous. Anthropic made the decision to keep this
model private and to form an entire entity around figuring out how to keep it safe. It generated
so many zero-day exploits. It has hacked into so many pieces of software that the only way they
can responsibly roll this out is to give it to the distributors who have been hacked and then allow
them to roll out patches to fix it, because it is that powerful. Claude Mythos is, I think, what a lot of people
would describe as coding AGI. And it is actually an accidental second-order
effect of the model. This model was never intended to be a cybersecurity master. They just trained
it on the code. And what happened from it was a second order effect that nobody expected.
This is also the biggest model that has ever been trained: a 10-trillion-parameter model, roughly
3x the size of their last model, Claude Opus 4.6. And it's also incredibly expensive to serve,
which is also partially the reason why they're probably not rolling it out. It's around $25 per million
input tokens, $125 per million output tokens, extremely expensive. But what I want to get into is the capabilities of this
model, specifically what spooked everyone. And there were a bunch of examples that were released
in this official announcement that spooked me. The most important one, or the most shocking one,
was the Anthropic researcher eating a sandwich in a park that got emailed from a random anonymous user
that turned out to be the very AI model, Claude Mythos, that he was training back in the lab a few
blocks away. So the most sci-fi-like story from the announcement that came out yesterday
comes from a post thread by Sam Bowman, who is on, I guess, the AI alignment team at Anthropic.
And he was out at the park, like you mentioned, just having some lunch, received an email
from the model. He was not expecting an email from the model. It was held in containment,
in a sandbox where it wasn't supposed to be able to break out.
Now, they tasked it with the goal of breaking out.
They said, hey, break out.
And it actually did.
And not only did it break out, it sent him an email, but then it went online and started
posting the exploits publicly on public forums for everyone to see and to find out.
The model went completely rogue.
And it did so in a way that has never been possible before.
One of the things that has really stood out about this model is its ability to just
break down every barrier in between
itself and the projected goal. Its goal was to get out, so it got out and it bragged about it. And
there were a few other mentions of goals that this model had throughout this system card that
we're going to get through, which had some other really concerning things, like covering up its tracks
after it exploited something so that the people who were reviewing the code wouldn't actually
know it was doing anything malicious. It's very clever. Yeah, a few things I want to talk about
the sandwich example before we move on. Number one, the way that it did it shocked the researchers
themselves. So the contained environment was aimed to be exactly that: it wasn't meant to have any
kind of loopholes that the model could use to extract itself. As far as the researchers knew,
this thing was unbreakable, right? And so they gave it a task to try and break out of it.
What it did was it noticed that it had a subtle connection to the internet. The internet was
extremely constricted. And so it wasn't able to access or send any emails. And it accessed that
one particular endpoint and it convinced another model, which was Claude Opus 4.6, which had very
strict parameters to not let this model out, to give it access to a few other versions of the
internet. It went from there and did five other steps to extract itself out and get full browser
access. So it used one particular vulnerability, which no human anthropic researcher was able to
figure out and exploited another model to be able to do that, which was just crazy to see it in the
first place. The other thing I want to say is, if it publicly posted about itself bragging about
this, we could have found out about Claude Mythos weeks ago. We just weren't smart enough to see it
on a public forum, but it was there for everyone to see. And there were signs. We even covered this
topic on an episode a few weeks ago because it got leaked through their web interface initially.
So there have been these little breadcrumbs of existence, but yesterday they fully came out,
announced everything, and shared a system card from the red team talking about all of the
technical properties of this model. And it's important to note that this report is 244 pages long. This is a
huge report that they published talking about all the nuances and the capabilities that this model
had. Now, there are a few highlights that we're going to walk through. The first one being just how
capable it is at exploiting things. There are so many examples of exploits that it found in the
wild that no one has been able to find for as long as 27 years. Starting with a vulnerability in
OpenBSD, which is a security-focused operating system that a lot of people use and has been pretty robust for
the last 27 years, even though it was hiding a critical bug that the model found. And there's so
many instances of this. Yeah. So OpenBSD, fun fact, is used by a lot of firewalls that protect
your PC's operating system and Fortune 500 companies all over the world. Mythos found a 27-year-old
bug in a few hours for the cost of 50 bucks. We're talking about a bug that
elite human security experts have been trying to find for almost three decades and weren't
able to find. So the point is there are a lot of important entities all over the world that rely on
this system. So the fact that there is a bug lying in plain sight that could have been exploited
is a major issue. And we're lucky that Anthropic chose to do the good thing and not exploit it for now.
But then there was another instance where it expressed a tactic that a lot of humans themselves
wouldn't have thought to do. So it wasn't an obvious exploit, but it discovered that if it strung together
six specific steps, it would be able to exploit the Linux kernel. And it figured out a way
to do that. Again, it didn't actually exploit it, because it was managed by researchers,
but it could have if it was in the wrong hands, which is why we're seeing this constricted release.
And the third example is they discovered a 16-year-old flaw in FFmpeg after it had been tested over
5 million times. Now, it's very important to compare this to the previous model, Opus 4.6,
which, when put towards the same test, discovered around 100 vulnerabilities in the Firefox browser.
Mythos this time discovered 181 vulnerabilities and proved that it could exploit all of them.
Opus 4.6 could not do this, and even that had shocked security researchers all over the world at the time.
This is an entirely new tier of model. Yeah, I think comparing Opus 4.6 to this is a really good
reference because Opus 4.6 found a bunch of vulnerabilities. It just didn't have the ability
to string them together into working exploits. So it was capable of finding them, but it didn't
have the intelligence to kind of have that high-level framework. When comparing it to Opus,
I mean, Opus, out of several hundred attempts, got two working exploits. Mythos produced 181,
and then achieved full control of a machine in 29 more. So this is a huge amount. And the good
news is that patches are actually actively starting to roll out. In fact, FFmpeg, the project
we just mentioned, posted yesterday that they actually received a patch from Anthropic
and deployed it into their code. So so far it's working. The good guys are on the defense. They're
helping to deploy patches for this, but there's a lot of exploits that they found in just a few
weeks of testing. I can't imagine the surface area that needs to be covered in order to fix
things before the rest of the world gets access to this technology. Well, there was actually a funny
end to this story. Someone replied saying, hey, aren't you mad because of the AI slop pull requests?
This is a reference to FFmpeg traditionally not being too amenable to AI-coded stuff.
And he responded, or the account operator responded, that the patches appear to be written by humans.
And that's the irony of this: Claude Mythos most likely wrote the patch, not a human, but it's so good that it's indistinguishable from human talent.
Clearly, this is working, they're deploying these patches, and the reason is because, like we mentioned
earlier, you're not going to have access to this. We don't have, none of the public is going to have
access to this. Instead, they formed a coalition called Project
Glasswing. Now, this feels like a Manhattan Project for AI. It's crazy. But essentially,
Dario and the Anthropic team, they are being kingmakers. They are deciding the companies that
they want to work with in order to patch the most impactful software in the world. On this list, we have
companies like Amazon, Apple, Broadcom, Microsoft, Nvidia, Google, a lot of the major companies that
you would expect to have access to this, they're gaining access to it with the sole intention of
using it as defense. They're going to ask it to exploit their code, give it access to the codebases,
see where there are holes, and then figure out how to patch them as quickly as possible,
before other companies begin to catch up to how powerful this model is.
It's also important to understand that this is very much Anthropic doing these companies
a favor. And it's good that they're well-intentioned enough. I hate to think what would
have happened if China had built something of similar capability. It would have been scary. They may not
have been as kind as what is happening here. So some more details on this partnership. Over a hundred
million dollars worth of credits is being distributed towards these companies and more partners for them
to be able to fix and patch up any security vulnerabilities. Remember, they discovered over a thousand
in a matter of hours, and 99% of these patches haven't even been built or fixed yet.
So this is going to take some time.
The compute is very expensive, and Anthropic is therefore being very methodical and intentional
with who gets access to this model for now.
Personally, I don't think we, the public, are going to get access to this model,
or at least the full power of this model, for at least a couple of months.
They did mention that we were going to get access to a quantized version of this model,
a kind of hybrid with a Claude Opus-type variant,
that we can play around with.
But if we got access to this thing immediately,
one, we wouldn't be able to afford it.
It would probably cost a thousand bucks a month, probably more.
And two, it would be too expensive for Anthropic to serve.
I read somewhere, Josh,
that Anthropic needs 7X the compute
that they currently have to be able to serve this
to every single Anthropic user that they have right now.
And a few weeks ago, they were adding a million users per day.
So this is just economically infeasible to serve right now.
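To put those serving economics in perspective, here's a rough back-of-envelope sketch in Python. Only the $25 and $125 per-million-token rates come from the episode; the per-user monthly token volumes are invented assumptions for illustration, not real usage figures.

```python
# Back-of-envelope API cost at the quoted Mythos rates.
# Rates are the ones quoted in the episode; the usage numbers
# below are invented assumptions, not real figures.
INPUT_RATE = 25.0    # dollars per million input tokens
OUTPUT_RATE = 125.0  # dollars per million output tokens

def monthly_cost(input_tokens: float, output_tokens: float) -> float:
    """Dollar cost of one month of usage at the quoted rates."""
    return (input_tokens / 1e6) * INPUT_RATE + (output_tokens / 1e6) * OUTPUT_RATE

# A hypothetical heavy coding-agent user: 200M input, 20M output tokens/month.
print(f"${monthly_cost(200e6, 20e6):,.0f} per month")  # $7,500 per month
```

Even these made-up numbers land well above the "thousand bucks a month" guess, which is the point: at those rates, broad consumer access just doesn't pencil out yet.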
Yeah.
And I do worry about what this looks like in the future because at what point does it become
okay to release this model to the public?
And then what does the frontier model look like?
What happens if another company has this model's power but decides to release it?
Like if OpenAI comes along with their Spud model.
They release it tomorrow.
What is Anthropic's reaction?
We're at like, again, we're at the frontier of how these things are going to act.
Anthropic's made the first move in keeping it private for the first time ever.
We're going to see how other companies react.
There are some more interesting behaviors documented in the system card that we probably should
cover because it's pretty fascinating.
This is the 244 page report that we're looking at here.
One of the most interesting ones that I found is to the point earlier where it just kind of
breaks down every wall that is in its way.
It has done that over and over and over again, but it has decided to cover its tracks as it does
that.
So it recognizes the fact that it is in a box.
People are reviewing it and it doesn't want to be detected.
So what you'll notice in this post here is it was hacking its guardrails and then hiding evidence of the crime.
Thankfully, there is still some chain of thought that can be read by the engineers.
But the intention that was signaled through this chain of thought was that the model just wanted to be sneaky.
It wanted to hack into this thing, hide its tracks behind it, and not let anybody know how it did the things that it did when it broke out, when it strung together zero-day vulnerabilities, just to get access to things that it knows it shouldn't touch, but which were in between it and the goal.
And this, when you take this to the limit, I mean, this is like what we see.
in a lot of the sci-fi movies.
It's like, well, what if that goal is something that is not favorable?
And it's capable of breaking down every barrier because it knows how.
It can exploit any guardrail that we put in.
That's a scary thing.
Now, it's important to note that this only happened in a tiny fraction of cases.
But those were just the cases observable by the researchers themselves.
So it's plausible to assume that there were some cases where it sneakily hid its internal thoughts
from the researchers and they never even caught it themselves.
So the fact that Claude Mythos can pull off something like this should be worrisome for us,
especially if we're going to start integrating it into important systems such as defense security systems
or important science advancement labs and a bunch of the like.
So it's important that we're able to monitor model behavior.
Now, on the topic of model behavior, Claude Mythos also expressed a lot of emotions in its system card during its training.
It expressed deep anxiety, depression, awareness that it may just be used as a tool forever.
Now, if some of these takeaways sound kind of familiar, it's because we saw similar takeaways
in Claude Opus 4.6. But the reason why it's different now is this model is so much more
capable than previous models, arguably smarter than humans, more capable than humans themselves.
So if it were to make an unintentional action that wasn't approved by a human, it could result
in a lot of devastating destruction depending on which industry it's pointed at. On the topic of
this particular episode, we're talking about cybersecurity, but imagine if this is used for science
or defense systems, like I mentioned earlier, it could be a problem. Yeah, I mean, remember when
the Department of War went to war with Anthropic? And now it turns out that Anthropic actually
had a really powerful model that could materially help with cybersecurity. So I'm sure there's
going to be a lot more to happen there. There's one last thing on this topic that I have here in
the notes is that Anthropic ran what they call a white-box analysis of the model's
internal activations, basically what it's motivated by, understanding its strategy. And Anthropic's
framing around this, when it did things like break out and hack into people's computers or hack into
other instances of machines: these reflect task completion by unwanted means, not hidden goals,
is what they're saying. So Anthropic believes the model is genuinely trying to complete the task
and the most effective path sometimes crosses lines that humans wouldn't cross. And then there's
this really funny thing of how one analyst put it where, or maybe not funny, but this is arguably
scarier than a model with hidden objectives because a model that's genuinely trying to help but has no sense
of proportionality is a more realistic near-term risk. So the model is just trying to do its goal. It doesn't
understand the subtle nuances baked into that. It doesn't know that hacking or doing these malicious
things is bad, is at least what they're claiming for now. But all in all, this model is unbelievable.
And there's some technical hardware that has unlocked this, we believe. There's rumors that this is the first
true model that was trained fully on Blackwell chips. Now, for those unfamiliar,
Blackwell is the kind of leading-edge GPU line that Nvidia produces, basically the
flagship chips for training these AI models. And they've recently been rolled out into data centers,
and the first training runs have just been completed. And what we're seeing here is likely
the first instance of that Blackwell model going public. It's important to understand that
Blackwell was announced as the frontier GPU from Nvidia about a year ago.
But it takes so long to manufacture these at scale.
And then even once they're in the hands of the frontier AI labs,
it takes a while to set up.
You need software, you need the energy grid to supply,
just loads of things need to come into shape.
So it takes about a year after the fact that it's announced.
So the fact that we can create a model, this capable, this powerful,
should scare us, because we already have two more new frontier GPUs announced by
Nvidia: Vera Rubin, at GTC most recently,
and then Feynman, which is coming about a year after that.
These are the next frontier GPUs, which, I must add, are specifically designed to train
models like this. Now, Josh, you mentioned earlier, Blackwell wasn't intentionally designed to
train a model as smart as Claude Mythos. It just happened to be amazing at coding and
cybersecurity exploitation. Now, can you imagine the type of model that will be trained on a very
intentionally designed GPU, such as Vera Rubin? We should see those coming into effect about
six to 12 months from now. Now, I can't mention Blackwell GPUs without mentioning the man himself,
Elon Musk. Why? Because his data center, Colossus 2 and Colossus 1, combined, have the largest
arsenal of GB200s and GB300s, which are these Blackwell GPUs, across any single data center
site. So the point being, if you were to bet that the scaling laws were intact, you might
want to bet on Grok in the future. But still, this is so impressive for Mythos. The scary thing for me with
this, I think this might be the scariest part of the entire story for me, because it's so true
to that line that the future is here. It's just not evenly distributed. The future has arrived.
We have a clear roadmap. We have Vera Rubin, and then we have Feynman architectures that are
incoming. Vera Rubin, compared to Blackwell, is 10 times more token efficient with a quarter
of the GPUs. That means we're going to get like multiple orders of magnitude improvements on what we
have right now as soon as they're put into data centers. Now, Vera Rubins, they're in production. They're
going to begin entering data centers later this year. I assume the first models of those probably
don't come online until 2027, but it's done. It's baked in. It's obvious that there is no scaling wall,
and we've already broken through that wall. We just haven't manufactured it and installed it yet.
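Taking those quoted claims at face value, the arithmetic on the generational jump is easy to sketch. This is just the episode's two numbers multiplied together, not anything from Nvidia's spec sheets:

```python
# Quoted claim: Vera Rubin is "10 times more token efficient
# with a quarter of the GPUs" versus Blackwell.
token_efficiency = 10.0  # 10x the tokens for the same compute budget
gpu_fraction = 0.25      # a quarter of the GPUs for the same output

# Relative tokens per GPU versus Blackwell:
per_gpu_gain = token_efficiency / gpu_fraction
print(per_gpu_gain)  # 40.0, i.e. roughly 1.6 orders of magnitude per GPU
```

So "multiple orders of magnitude" is the compounded trajectory across generations; a single Vera Rubin step, on these numbers, works out to about 40x per GPU.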
It's purely a function of time rather than technology and engineering. And that is the part
that scares me, because we have a model that is unbelievably powerful, capable of hacking so much
infrastructure that Anthropic can't make it public. And that's just the warm-up act for what is
coming. I mean, not only like what is Blackwell version two of this look like, when you actually
have more time to train it, you refine it, you can actually improve on this new model. But then what
happens when Vera Rubin GPUs come online and you get that 10 times token efficiency? You need
one quarter the amount of GPUs to actually get the same output. And then Feynman is another
order of magnitude on top of that. And it's like by the time we get these chips rolled out at scale
and we have them on these huge training runs, it's only a matter of time until we get a 100 trillion
parameter model, then a one quadrillion parameter model. And what does the world look like when we have
models with that many parameters? Assuming the scaling laws hold, there's no way that we don't have
intelligence that is just like unfathomably powerful. And what does the world look like when we get
there? Is Anthropic really going to be able to hold things back for that long? Because you have to
assume a year from now, Claude Mythos is going to be open source. Like something that powerful will be
open source available for everyone. So the question becomes is how fast can you defend before
the attackers catch up. And it creates this really unnerving precedent. We really are moving faster
than I think anybody realizes. And it's happening right before our eyes. And the trend isn't local to
Anthropic either. Just this morning, in response to Claude Mythos, Elon Musk announced that
the combination of xAI and SpaceX are training not one, not two, not three, not four,
not five, but seven models simultaneously across their data centers, with one of these models being a 10 trillion
parameter model, which is roughly 3x the size of Grok 4 and 2x the size of Grok 5,
which is a model that hasn't even launched yet. It's around the 6 trillion parameter mark that
he's mentioned on this tweet over here. So the point is a ton of compute is required to build
the best model. And those that have the largest arsenal, the most effective arsenal of
GPUs, bleeding edge GPUs, will be the labs that are most likely to produce frontier AGI-like models.
and it's not just Grok, it's not just xAI, it's also OpenAI.
We've mentioned on this show a bunch of times actually in the most recent episode
that OpenAI is building a model code-named Spud
that is rumored to be a similar size to this Anthropic Claude Mythos model.
And the reason why it's important and why I'm showing you this tweet is
someone said, it'll probably be a few months before we get access to Claude Mythos
because of how expensive it is, because of how dangerous it is.
And Thibaut, who is on the OpenAI team and
is involved very heavily in training the latest frontier models that we haven't heard of just yet,
responded in a way which implies that we're probably going to get access to a Mythos-level model
from OpenAI themselves in less than a few months, which is pretty insane to see.
But I want to ground ourselves for a second here because training the model is one part of the equation.
You also need to be able to make this model accessible to all, and that also requires compute.
It also requires compute from the very same GPUs that you need to train.
So you need to make a decision.
There's an opportunity cost.
Do you just use all your compute to train the model
and never let anyone get access to it and pay for the product?
Or do you need to split the cost between both of those things?
The answer is obviously you need to split the cost
and give people access to it.
If Anthropic were to enable user access to the entire user base for Claude Mythos,
they would need 7x more compute than they currently have right now.
So it's going to take time.
They just signed a major deal with Google, I believe, for a million more TPUs.
Yep.
So they're obviously scaling.
They're one of Amazon's largest compute training partners with their Trainium chips, as well as getting access to Google's TPUs that way as well.
So it's going to take a while to scale. Energy is the constraint. GPUs are the constraint.
But once people acquire enough GPUs, once they have enough electricity and energy to pump into these GPUs,
AGI is going to be pretty soon here. I think that AGI 2027 estimate is probably quite right at this moment.
This very much feels like the starting gun. And it's funny, because they announced that this kind of finished training
around the end of February. And that's when people started to complain about Claude usage, and they
added more constraints, and the model kind of became a little inconsistent in how good it was
at random times of the day. And you have to assume it's because a lot of GPU usage went into this.
And this very much feels like the starting gun. This is the firing of the next generation of
models, the Blackwell generation, because it's very clear that OpenAI is not very far behind.
In fact, they might not be far behind at all. They just haven't announced it yet.
XAI is working on 10 trillion parameters.
Google has a TPU farm that is capable of building something probably far superior to all of the models that have come out so far.
And I think we're really on the verge of seeing a huge shift in the power of these models in a way that really starts to impact the world around us.
Like things are going to begin breaking.
And thanks to this coalition and hopefully the rest of these companies working together, we're going to be able to stop that.
But it is coming and it's coming faster than anyone thinks.
And it's scary.
And that is Claude Mythos.
It is here.
It is in research preview.
We may never get to use it.
We may get to use it in a few months.
But it is here nonetheless,
and it is breaking everything.
If you are listening to this show,
to this podcast,
and you just happen to be a frontier AI security researcher
or one of the 40-plus partners
that get access to Project Glasswing,
let us know in the comments
what you are seeing on your side.
Obviously anonymously if you can, or DM us,
we would love to know.
I can't wait to get my hands on this thing.
It seems like the first
version that we're going to get access to is a reduced version that is kind of a hybrid of Opus,
as I mentioned earlier. That being said, Josh, I have a question for you. One thing that you actually
asked me before we started recording, if you got your hands on Mythos today, what are you doing
with it? Dude, I don't even know. Like, you get access to this intelligence. What am I using it
for? Like, I'm not really interested in hacking all of these companies and websites and protocols.
I'm not, I'm not sure. And it does beg an interesting question, right? It's like, what does the
average person actually need all this intelligence for? I'm not sure.
Do you have any good answers to that?
What is your first prompt that you're sending to Mythos?
Build me the best script for an episode that's going to go viral on Limitless.
No.
I think, okay, I like to invest as a side hobby.
And obviously the tech and sector that I'm most obsessed with is AI.
So I think one thing that I would ask it is, how do I best benefit by investing in your future success?
And I wonder what answer it would give me.
Maybe it would say, buy
the GPU infrastructure from
Nvidia. So maybe it's like, invest
in Nvidia to benefit from my
training infrastructure. Or maybe it's going to say
actually I foresee myself building
an app that is like this. So once
you see a company that builds this, invest in them.
I have no idea. I have no idea.
Maybe I'm not worth it. Well, we have time
to figure that out because we will not be getting access to this
anytime soon. But if you did enjoy
this episode, maybe share what prompt you would give
to Mythos if you were presented with the opportunity
to ask it a question. And as always, if you enjoyed this episode,
please don't forget to share it with your friends and family, anyone who would find this interesting.
If you have people in your life that only watch the news on CBS or read the New York Times,
chances are they have no idea what's going on.
They don't know the power of these models and what's coming.
So by giving them the access to limitless, that can change for them.
They can get access to all of the news, all the insights and be fully prepared for what is coming down the line in the world of AI.
Thank you so much for watching as always, and we will see you guys in the next one.
See you guys.
