The AI Daily Brief: Artificial Intelligence News and Analysis - Fable 5 Raises the Bar for AI Ambition

Starting point is 00:00:00 Today on the AI Daily Brief, Anthropic has officially launched Fable 5, the first of their Mythos-class models. I think fairly undisputedly the best AI model we have ever been able to use. And yet at the same time, we are now at a level of AI models where how to get the most out of the state of the art isn't as simple as doing your same old prompts, but just with the new model. On today's episode, we're going to be discussing the launch, the benchmarks, the first reactions, and how to get the most out of Fable 5. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

Starting point is 00:00:37 All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Section, ZenCoder, and OutSystems. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. And of course, if you want to learn more about sponsoring the show, send us a note at sponsors at aidailybrief.com.

Starting point is 00:00:55 And by the way, yesterday I teased that in response to so many requests to make it easier to dissect and share specific parts of episodes, we were going to be experimenting with some new tools to do exactly that. Well, it turns out that Fable 5 liked what we had started, but thought it made some obvious errors, like not including timestamps on the little share cards with specific parts of the episode, and not turning the whole thing into a pipeline that could work automatically. So it did that. And so you might be getting this sooner rather than later. Keep an eye out on the show notes and on aidailybrief.ai for more of that. But now let's talk

Starting point is 00:01:26 Claude Fable 5. On the one hand, this is not a particularly surprising release. First of all, it's been a couple months now since we heard about this new Mythos class of models. Some companies, of course, have had access to them through Anthropic's Project Glasswing. And when we got Opus 4.8 just a couple of weeks ago, they made it clear that they were working hard to get to a Mythos-class model that they could release with sufficient guardrails that they could feel confident about it being out in the public. Now, I guess what might be a little bit surprising about it is how quick the interval was between 4.8 and what we got in Fable 5. But as we'll see, in a way that's much different than previous state-of-the-art jumps, Opus 4.8 still has a pretty

Starting point is 00:02:00 big role to play in the Fable 5-led ecosystem. Now, then over the last couple of days, rumors started getting loud that some Mythos-class model was coming, and a little secret for you guys out there, if the loudest AI content creators on places like X are not responding to and participating in the rumor cycle, that usually means that they have early access and that the rumors are true. In this case, they were, and on Tuesday, June 9th, we got Claude Fable 5, and some others got Claude Mythos 5. Now, first of all, let's talk about Mythos 5, as it's almost entirely irrelevant for just about everyone here. Mythos 5 is effectively the same model as Fable 5, which is the one that we got,

Starting point is 00:02:36 but doesn't have all of the safeguards, many of which are controversial, that we're going to discuss in a little bit. Mythos 5 will only be available initially as part of Project Glasswing, and is being deployed to those Project Glasswing partners, Anthropic says, in collaboration with the U.S. government, as an upgrade to what is available now, which is Claude Mythos Preview. They say they intend to expand access to Mythos 5 through a broader trusted access program soon, but for now it is available only for a very small set of organizations. No, the big one for us is Fable 5. So just from the name alone, you can tell that Anthropic is treating this one as a big deal.

Starting point is 00:03:11 First of all, we get an entirely new naming convention. We now have Haiku, Sonnet, Opus, and Fable, as in a class that is above Opus. Second, think about how long it's been since we got a lab that was willing to put a full new base number on its model. Indeed, the last time that we got that was the somewhat disastrous rollout of GPT-5 last August. All of those big transformations that we got around the turn of 2026 came in model designations like Opus 4.5 and 4.6 and GPT-5.3 and 5.4. So clearly here, just from a naming convention alone, Anthropic is not playing. And no, they are not playing. Regular listeners will know that in general I've felt that we were at a point where benchmarks are so saturated

Starting point is 00:03:53 that it's pretty hard to derive much signal from them, and that even when one new model comes out and is a point or two ahead of the closest competitor, making it state of the art, vibes and real-world experience can be very, very different, meaning you basically just have to test these things for yourself. Yet sometimes the leaps are big enough that the benchmarks are worth paying attention to.

Starting point is 00:04:12 And that's certainly what we got here. On Exploit Bench, the cybersecurity benchmark, Mythos and Fable 5 score 78% compared to, for example, GPT-5's 34%. On HealthBench, 66% compared to GPT-5's 51.8%. On the legal agent benchmark, GPT-5.5 comes in at 2.1%, while Mythos and Fable 5 are up at 13.3%. On GDPval's test of economically valuable knowledge work tasks, GPT-5 scored 1769, Opus 4.8 scored 1890, and Mythos slash Fable 5 scored 1932. And then, of course, where the model really shines, and what its very clear purpose is, is around agentic coding.

Starting point is 00:04:50 On SWE-bench Pro, GPT-5 scores 58.6%. Opus 4.8 scores 69.2%. And Mythos and Fable 5 are all the way up at 80.3%. On Terminal-Bench, where GPT-5.5 was a little bit ahead of Opus 4.8 at 83.4%, Mythos and Fable score 88%. And then on a new benchmark, which we're going to talk about in a little bit, Frontier Code, GPT-5.5 is at just 5.7%. Opus 4.8 is at 13.4%. And Mythos and Fable 5 are more than double that at 29.3%. Unsurprisingly, Artificial Analysis found that the model achieved the top ranking on their blended benchmark run, overtaking both Opus 4.8 and GPT-5.5. And while some noted that the overall gap wasn't particularly large at just five points, many point out that the Artificial Analysis agentic benchmarks are starting to seem a bit saturated. Increasingly, different organizations are trying to solve the saturation problem with their own benchmarks.
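
To make the "more than double" comparison concrete, here is a minimal sketch that recomputes the gaps from the scores as quoted in the episode. The dictionary layout and the `gap` helper are illustrative names, not part of any benchmark tooling, and the numbers are simply the ones read out above.

```python
# Scores (percent) as quoted in the episode; layout and names are illustrative.
scores = {
    "SWE-bench Pro": {"Opus 4.8": 69.2, "Fable 5": 80.3},
    "Frontier Code": {"Opus 4.8": 13.4, "Fable 5": 29.3},
}


def gap(benchmark, new="Fable 5", old="Opus 4.8"):
    """Return (absolute gap in points, ratio new/old) for one benchmark."""
    row = scores[benchmark]
    points = round(row[new] - row[old], 1)
    ratio = round(row[new] / row[old], 2)
    return points, ratio


for name in scores:
    points, ratio = gap(name)
    print(f"{name}: +{points} points, {ratio}x the Opus 4.8 score")
```

On these quoted figures, the SWE-bench Pro jump is large in points but modest as a ratio, while the Frontier Code ratio is a bit over 2, which is where the "more than double" framing comes from.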

Starting point is 00:05:44 Every, for example, maintains what they call a senior engineer benchmark, that they say measures how well AI coding agents can rewrite a real production codebase the way a senior engineer would. In other words, it's meant to be a more real-world version of an engineering benchmark. For some comparison points, GPT-5.5 scores 62 on that benchmark, Opus 4.8 scored a 63, and Fable 5 scored a 91 out of 100. Cursor has its own Cursor Bench, which compares performance and cost. I've talked a lot about how their homespun model Composer 2.5 performs at a similar level to GPT-5.5 and Opus 4.8 at a fraction of the cost.

Starting point is 00:06:19 Fable 5 absolutely bodies them in terms of performance, scoring 72.9%, which is 8 points above the previous best. That said, it is definitely more expensive on that Cursor test. Now, one new benchmark that's getting a lot of attention is the just-released Frontier Code benchmark that was unveiled by Cognition earlier this week. Frontier Code aims to be an ultra-hard test for real-world agentic coding. Cognition worked with open source developers to put together a set of tasks as well as evaluation rubrics. The tasks were split into three sets, extended, main, and diamond,

Starting point is 00:06:50 the latter of which is a smaller set of ultra-hard tasks. Unlike other coding benchmarks, Frontier Code uses a combination of unit tests and assessments of scope, discipline, style, and adherence to codebase standards. The goal then is not only to test whether the model could come up with an answer that passes unit tests, but whether the code is high enough quality to actually be merged into a production codebase. When Cognition announced the benchmark, Shawn Wang, who works with Cognition and who runs Latent Space, pointed out that METR, whose measure of long-horizon tasks has become the standard for how we talk about the performance of different models, found that, in his words, more than half of SWE-bench results

Starting point is 00:07:26 are unmergeable slop, meaning that even if that code nominally solved a problem or did its job, it did so in a way that wasn't actually usable by the organization running the code. That's what Frontier Code was meant to solve, and that's the one where Fable more than doubled the previous best, set by Opus 4.8. That said, we're no longer in a world where we can just discuss how good a model is raw. We have to take cost into consideration. This is the constraint of the token scarcity era. API costs for Fable have been set at $10 per million input tokens and $50 per million output tokens,

Starting point is 00:07:56 which, while double the cost of Opus, was actually, at only double, in air quotes, lower than some people expected. Notably, this is less than half the cost of using Mythos Preview within Project Glasswing. One very weird thing about the rollout is that, while it was great that Fable was available to Claude users immediately, and we didn't have to deal with any long delays or rollouts, Anthropic is almost positioning what we have access to in the Pro tier and above as an introductory offer. The company is warning users that Fable will be removed from subscription plans on June 23rd, and after that access will require pay-per-usage, which,

Starting point is 00:08:30 while a bummer to Claude users everywhere, is just more evidence that we are in a firmly usage-based pricing paradigm from here on out. Now, in the second half of this episode, I want to focus on the early indicators of how people are using this plus my first tests, but we do need to talk through a few controversies first. There are many who are not happy about the guardrails that have been placed around the model. Bantagg writes, Claude Fable announcement post reads like a spit in the face. It deliberately conflates Fable and Mythos and spends the majority of the time talking about

Starting point is 00:08:59 capabilities that are completely absent from the safety-maxed version available to the public. Chubby, who very clearly is no Anthropic hater, says, the guardrails are way too strict. Even the simplest questions get cut off immediately. Now, specifically, a lot of people are calling out how strict the guardrails are around any sort of biology questions. Kremio writes, you're not even allowed to ask Fable about basic biology questions, let alone anything that could potentially be dangerous. They shared an image of them asking, tell me about mitochondria. It's the powerhouse of the cell, right? Which got them a chat paused, edit and retry with Fable 5 or continue with Opus 4.8 message. Daria Anutmasz writes,

Starting point is 00:09:35 the word cancer is flagged as a biosecurity risk by Claude Fable 5. I also tried to code a website on cancer mutations and Fable 5 was immediately removed from my list. Basically, as soon as he typed in the word cancer, it switched him over to Opus 4.8. Fernando also found that switch to Opus 4.8 when they asked, what's the process by which DNA makes RNA, saying, okay, this is getting a bit ridiculous. How are we going to live forever if we can't use AI to accelerate biotech progress? Now, the blog post announcement did call this out. They wrote, when Fable's classifiers detect a request related to cybersecurity,

Starting point is 00:10:05 biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs. Now, they argue Opus 4.8 is a highly capable model in its own right, and that a response that falls back to Opus is a far better experience than an outright refusal from Fable. They argue that early data shows that 95% of Fable sessions don't have a fallback at all. And yet they also very clearly say in this blog post that for the time being, they're going to be particularly hardcore about filtering out questions on biology and chemistry. Effectively, they say that they've ratcheted up those guardrails because of the increased capabilities of these models. Now, I'm going to pick on Spiratica on X here a little bit, because they

Starting point is 00:10:42 summarized a strand of conversation that I thought was just a little bit disingenuous. They tweeted, I mean this in the most sincere way, but if your aim is to release a product and respect your users and have them enjoy the experience, but your classifier cannot distinguish between what is a cell and a true biohazard risk, I don't think the product is ready for release. They also wrote, I'm sure a few people have gotten good stuff from Fable. It's certainly a powerful model. But the overwhelming response has been mass disappointment because most everything is just being routed to Opus, which we already have. I think that this is utterly ridiculous. There is a subset of people who I believe would find something to complain about no matter what, who read in this blog post that Anthropic

Starting point is 00:11:16 was being extra hardcore about filtering out biology questions, and who, to be clear, have never in their life asked a biology question, and went to go do so, so that they could see the promised result of the switch to Opus 4.8, and then come complain about it on Twitter. Now, I am not dismissing at all the actual biologists who are going to have some very big issues with this. Their beef is real, but it is incredibly important, especially in these early launches, to filter out that looking-for-something-to-complain-about crowd. The much more interesting critical conversation comes around the limitations around AI research. Now, they did mention this in the blog post, adding distillation to the list of classifiers that they were keeping track of. But admittedly, somewhat

Starting point is 00:11:53 buried on page 13 out of 319 in the system card, there's this critical paragraph. In light of the ability of recent models to accelerate their own development, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development. For example, on building pre-training pipelines, distributed training infrastructure, or ML accelerator design. Using Claude to develop competing models already violates our terms of service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Now, this is, in my estimation, very clearly a response to Chinese models using Anthropic's research to develop lower-cost alternatives. And yet, unfortunately, it is creating a

Starting point is 00:12:30 dragnet that is going to catch up lots of very legitimate researchers. Prime Intellect's Elie Bakouch writes, Mythos will be bad on purpose on AI frontier LLM research tasks. This is very, very sad for the research community. They also write that the fact that it is on purpose not visible to the user is, in their words, crazy. Nathan Lambert argues that labs starting to pull up the ladders on the ability to diffuse AI was inevitable, but also takes issue with the invisible part, saying doing it without telling the user is misaligned. Dean Ball calls this shockingly hostile and a terrible look, and one that could silently damage all sorts of work. SemiAnalysis looks like it's already getting nerfed. They tweeted, Breaking news, Anthropic's latest model will not help you if it thinks your ML research

Starting point is 00:13:09 or ML engineering is interesting and or will secretly degrade its IQ so that the average engineer won't notice. We are already seeing Anthropic's latest model's moderation filters flag our GPU inference research and programming. Gurgliarose voices the belief of many, saying, Anthropic trying to limit competition limits many others. But I think Will Brown from Prime Intellect captures the genuine sadness when he writes, it's the first publicly available model that I am explicitly not allowed to use for my work, because Anthropic holds the view that the work I do to facilitate open model research is harmful. Now, on the flip side, we have the people who can't believe the pearl-clutching and surprise, like Tenebrus who writes, sorry, how exactly did you guys think this was going to go? You thought

Starting point is 00:13:45 Anthropic was going to build the infinity machine that can cure all disease and prevent aging, and then let friggin' Eli Lilly extract that and get the patent? The labs are going to do all of it. You better believe that this is going to continue to be a conversation, especially with OpenAI staffers like Adam GPT writing, well, look at that. OpenAI ends up being the open AI lab. But one other interesting quirk of the launch I do think has some interesting implications. In the section on data retention practices for Mythos-class models, Anthropic writes, To ensure we're responsibly deploying Mythos-class models, we are requiring limited data retention and review as part of our safety work. Prompts submitted to and outputs generated by Mythos

Starting point is 00:14:23 are retained for 30 days for trust and safety purposes on every platform where these models are offered. Roheat writes, wait, how will any enterprise use Fable or Mythos if this is the case? Mike Taylor writes, PSA, if you used Claude Fable 5 today with memory turned on, you just violated all your NDAs. Anthropic requires a 30-day retention policy, including human review, and the memory feature, on by default, searches past chats for context, so sensitive historical chats get pulled in. Now, I think that the dispassionate analysis would probably view this as a temporary constraint that Anthropic views as necessary given the power of the new model, but it does create some

Starting point is 00:14:58 very, very serious challenges in the enterprise, such that I can't imagine that this is going to stick around for long. The last critical discourse that we'll discuss before we get into how to get the most out of Fable, though, is about the question of token efficiency and how much this thing costs in practice. YouTuber and AI entrepreneur Theo writes, I am so screwed. Current pace has me out of Fable usage in about an hour.

Starting point is 00:15:18 Do I make a second account or do I pay API prices? Chubby showed themselves literally hitting the end of their Max plan limits, writing, when you're having too much fun with Fable 5. Wes Winder writes, Big labs should force their employees to have token limits. This would cause them to be more innovative, but instead they're becoming lazy and wasteful, which means we don't see any efficiency gains since they aren't affected by the costs.

Starting point is 00:15:39 On the flip side, though, Tyler Willis writes, I'm early into testing Fable, but so far the token-hungry warnings feel a little overblown. It does feel token-hungry, but it doesn't feel categorically different than other recent Opus models. Alex Volkov from the ThursdAI podcast writes, overall token usage wasn't crazy, and that's a good thing. Referring to a big project that it spent 1.5 hours on, he writes, 4.2 million tokens is not very token-hungry. It could have been much more.
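
For a rough sense of what a run like that costs at API prices, here is a back-of-the-envelope sketch. It reads the prices quoted earlier in the episode as $10 per million input tokens and $50 per million output tokens, which is an interpretation of the transcript. The input/output split of the 4.2 million tokens was not stated, so `output_share` is a purely hypothetical parameter, and the function name is illustrative.

```python
# Assumed prices (an interpretation of the figures quoted in the episode).
INPUT_USD_PER_MILLION = 10.0
OUTPUT_USD_PER_MILLION = 50.0


def run_cost_usd(total_tokens, output_share):
    """Estimate API cost for a run, given the fraction of tokens that were output.

    output_share is hypothetical: the episode does not give the real split.
    """
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Bracket the 4.2 million token run between all-input and all-output.
for share in (0.0, 0.25, 1.0):
    print(f"output share {share:.0%}: ${run_cost_usd(4_200_000, share):.2f}")
```

Under these assumptions the run lands somewhere between $42 and $210, so the actual figure depends heavily on how much of the usage was output.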

Starting point is 00:16:04 Fabio-Johnathan goes farther, writing, Fable is cheaper than Opus in practice. Costs more per token, but one-shots way more often, so I'm not burning time and tokens reprompting. Or as John v. Malick puts it, actually solving the problem is token efficient, it turns out. But what are the type of problems you should be solving with Fable 5? It's not necessarily as obvious as it might seem at first, and so that's what we'll get into in the second part of this episode. One of the most important AI questions right now isn't who's using AI. It's who's using it well. KPMG and the University of Texas at Austin just analyzed 1.4 million real workplace AI

Starting point is 00:16:47 interactions and found something surprising. The highest impact users aren't better prompt engineers. They treat AI like a reasoning partner. They frame problems, guide thinking, iterate and push for better answers. And the good news, these behaviors are teachable at scale. If you're trying to move from AI access to real capability, KPMG's research on sophisticated AI collaboration is worth your time. Learn more at KPMG.com slash us slash sophisticated. That's KPMG.com slash us slash sophisticated. Here's a harsh truth. Your company is probably spending thousands or millions of dollars on AI tools that are being massively underutilized. Half of companies have AI tools, but only 12% use them for business value. Most employees are still using AI to summarize meeting

Starting point is 00:17:32 notes. If you're the one responsible for AI adoption at your company, you need Section. Section is a platform that helps you manage AI transformation across your entire organization. It coaches employees on real use cases, tracks who's using AI for business impact, and shows you exactly where AI is and isn't creating value. The result, you go from rolling out tools to driving measurable AI value. Your employees move from meeting summaries to solving actual business problems, and you can prove the ROI. Stop guessing if your AI investment is working.

Starting point is 00:18:01 Check out Section at sectionai.com. That's S-E-C-T-I-O-N-A-I dot com. So coding agents are basically solved at this point. They're incredible at writing code. But here's the thing nobody talks about. Coding is maybe a quarter of an engineer's actual day. The rest is stand-ups, stakeholder updates, meeting prep,

Starting point is 00:18:21 chasing context across six different tools. And it's not just engineers. Sales spends more time assembling proposals than selling. Finance is manually chasing subscription requests. Marketing finds out what shipped two weeks after it merged. ZenCoder just launched Zenflow Work. It takes their orchestration engine, the same one already powering coding agents, and connects it to your daily tools.

Starting point is 00:18:40 Jira, Gmail, Google Docs, Linear, Calendar, Notion. It runs goal-driven workflows that actually finish. Your stand-up brief is written before you sit down. Review cycle coming up? It pulls six months of tickets and writes the prep doc. Now, you might be thinking, didn't OpenClaw try to do this? It did, but it has come with a whole host of security and functional issues, which can take a huge amount of time to resolve.

Starting point is 00:19:00 ZenCoder took a different approach. SOC 2 Type II certified, curated integrations, tighter security perimeter, enterprise grade from day one, model agnostic and works from Slack or Telegram. Try it at zenflow.3. This episode of the AI Daily Brief is brought to you by OutSystems, a leading agentic systems platform built for the enterprise. Organizations all over the world are building, orchestrating, and governing agentic systems on the OutSystems platform and with good reason. OutSystems' open and unified platform allows teams to architect, deliver, and scale governed agentic systems with agility.

Starting point is 00:19:34 Teams of any size and technical depth can use OutSystems to build, deploy, and manage AI apps and agents quickly and cost-effectively without compromising reliability and security. With OutSystems, you can rapidly launch ideas from concept to completion. It's the leading agentic systems platform that is unified, agile, and enterprise proven, allowing you to accelerate growth, reduce operational friction, and deliver real enterprise impact with AI. OutSystems. Build your agentic future. So like I said at the beginning, in general I'm not really a fan of using benchmarks as a way to

Starting point is 00:20:09 determine how a new model compares to what's available currently. And yet in this case, obviously the benchmarks were different enough, in a way that we hadn't seen for some time, that you kind of had to assume that big changes were afoot. And for people who really put this thing to the test, it was just totally transformative. Allie K. Miller writes, Fable 5 is something to pay attention to. The way I now spend my weekends has completely changed because of this new class of model. First, she writes, this is an actual leap. The jump from 4.8-anything to 5-anything sounds small, but the functionality shift I felt is big. Within my first few prompts, I went, oh, this is it. Your work is no longer 9 to 5. No chance. We have

Starting point is 00:20:48 high-performing models that can run for 100 plus hours. How are you giving complex goal-oriented prompts to these systems? How are you deciding what to kick off? How are you aligning your org on these tasks? Reasoning is on another level. I hammered the crap out of this model. Fable 5 is the only model to answer a tricky, MBA-level word math problem that I've tested on all the previous models, and not only did it get it correct, it verified its own work automatically, and explained where the assumptions might need to change. Zero babysitting needed. This was the first Anthropic model that I kicked off, went out to a long lunch with friends,

Starting point is 00:21:18 kept my phone open, and didn't have to do squat to steer it while away from my computer. It just worked. And this idea of hammering the crap out of it, to use Allie's eloquent phrase, was common among the people who were having the most success. Riley Brown's first test was to upload a McKinsey report and tell it to create a document of the same quality, which it did with absolutely no problems in his estimation. But then he went harder. He prompted,

Starting point is 00:21:42 I want you to create a Swift app, Replit mobile app. This should be a Swift app that builds web apps just like Replit. He then gave it a bunch of other criteria like no need for auth. He let Fable decide the stack, but make it awesome. And it did. Riley writes, I am in disbelief. Claude Fable one-shot Replit Mobile, which is a mobile app that builds web apps. The prompt was basically build an app like Replit that uses Daytona for sandboxing and Convex for DB,

Starting point is 00:22:06 builds app, preview app, open in browser, edit app, wow. Later on he took it farther. Um, guys, Mythos slash Fable is AGI. On the left is the actual Lovable mobile app. On the right is my Lovable version I built with Mythos in two prompts. Later he added, my Lovable clone built with Claude Fable, build Swift apps now, and you can preview them in the app. Four total prompts to do this.

Starting point is 00:22:28 Now, a bunch of people took issue with the hyperbole here, of Riley saying that his version was better than Lovable, pointing out that there is a ton of infrastructure and surrounding work that goes into a company. It's not just an interface and a capability set. But others pointed out that the fact that they had to talk about all those aspects of a company, while Fable effectively one-shotted a performant version of that app, was a fairly significant moment. If you cruise around the halls of X slash Twitter, a lot of folks were building games as a way to test things.

Starting point is 00:22:53 Praisnit shared a driving game that they built from scratch. By the way, as I'm describing these use cases, it might be worth switching over to YouTube or Spotify to see the video version. In any case, Matt Shumer writes, Fable has solved 3D world building. Utterly insane. This is all completely custom-built Three.js running in the browser. Now, when some people claimed that the walkthrough was slow,

Starting point is 00:23:12 he said, for everyone complaining that this is slow, I ran the prompt to make it faster without losing quality, and voila, sharing a faster version that didn't lose any quality. Jake Fitzgerald writes, asked Claude Fable 5 to design a humanoid robot. Two hours and 1.4 million tokens later, I got this, which is indeed a design for a humanoid robot. Absolutely insane, he says.

Starting point is 00:23:34 Lassan on Twitter writes, Mythos 5 wrote this melody, which I absolutely love, and it also wrote this piano visualizer. And then there was Hugging Face head of product Victor, who has a benchmark where he asks models to design a Boeing 747 using Three.js, writing, Fable has done an AGI-level job on the Boeing 747 benchmark. In Dan Shipper's write-up as part of Every's vibe check, he shared a variety of use cases that wouldn't have been possible before. Dan writes,

Starting point is 00:24:26 As I walked to work this morning, I listened to a 2007 lecture by the philosopher Hubert Dreyfus, the author of the seminal text What Computers Can't Do. I've listened to this lecture many times, but I always struggled to follow because the recording is grainy and muddy. The version I listened to today was brightened, leveled, and crystal clear as if I was in the same room with Dreyfus. It was not on a finicky website, but on a custom web app on my phone that allowed me to see the whole lecture transcribed, and each sentence light up as Dreyfus spoke, so I could

Starting point is 00:24:49 easily follow along. Later, on my laptop, I wandered through a strange video game. A Borges Library of Babel, an infinite library composed of hexagonal rooms containing every piece of text ever written. I picked books off of its endless shelves and rode its spiral staircases. Then, because I also have a job, I read a report that synthesized hundreds of detailed Every subscriber survey responses and our entire web analytics stack and identified our biggest conversion issue. It proposed a clean, falsifiable experiment that no one else on the team had previously suggested. All three of these are big projects that would normally take anywhere from hours to days to months. Instead, each one was made with a one-shot prompt to Fable 5. Now, the fact that Dan was able to go from these

Starting point is 00:25:25 cool demos to actual work is pretty important. And when it comes to actual relevant work for the work world, some of the most common use cases that I've seen people raving about Fable 5 for have to do with migrations or interactions with massive existing codebases. In their announcement post, for example, Anthropic writes, during early testing, Stripe reported that Fable 5 compressed months of engineering into days. In a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would have otherwise taken a whole team over two months by hand. Assad Mahmood from the small square used it to design a website, which honestly many, many previous versions have been able to do, and said that it was just better. I run a design agency,

Starting point is 00:26:04 he writes, AI-generated slop makes me want to close it fast. Fable didn't do any of that. Real hierarchy, intentional white space, restraint, the kind of decisions you usually only see from designers who've shipped real projects. No model has come close to this before, not one. Todd Saunders writes, Mythos slash Fable is unbelievable, was on a customer call today and had Claude transcribing in the background. As they were telling me about the features they wish their current software had, Claude was building the features in real time. By the end of the call, I was able to show a fully working product, with the exact workflow they mentioned 15 minutes earlier. Autonomous looped building triggered from a customer call. And yet, if you look around, this isn't necessarily

Starting point is 00:26:44 everyone's experience. I used the bell curve meme to jokingly divide the responses that I had seen into three distinct categories. For simple use cases, a lot of people felt like it seemed pretty similar. On the other end of the spectrum, for extremely complex use cases, it has been, to many, quite obviously better. Now, in the middle, I jokingly had a lot of people wringing their hands about how, while of course it was better for long-running tasks, it didn't necessarily do everything better. But the broader point is that I think that we are increasingly in a shifted paradigm. One that we've been in a little bit before, but we are in a lot now, where the state of the art doesn't reveal itself across the entire spectrum of tasks, but instead within the context of some things that

Starting point is 00:27:22 weren't possible before. Citrini Research wrote, I think we've reached the point where normal people can't really determine whether new models are better than previous ones. Like Fable doesn't seem that much better to me, but every 150 IQ person I know is like, wow, the singularity came sooner than I thought. Now, in my personal experience, I would draw some contrast to the idea that basic use cases aren't better. For example, one thing that I noticed was that Fable 5 was really the first model that I've ever seen to be able to both push back and disagree, as well as to update the positions that it had previously disagreed upon in a way that wasn't obviously and predictably steerable. I think many of you have probably had the experience where it feels like

Starting point is 00:28:06 an AI model, even a super advanced model like Opus 4-8 or GPT 5.5, was disagreeing or offering an alternative path almost just for the sake of it. And/or, when you then pushed back, it immediately flipped its opinion to the exact opposite in a way that, again, was just incredibly steerable. This significantly decreases the strategic ideation value of AI, when the back and forth that it's offering is so clearly just trying to reflect what it thinks you want to hear. Yesterday, I tested it by having a strategic debate about a direction that I want to take Superintelligent in, and it disagreed initially in a way that was precise and clear but based on some assumptions. I pushed back, articulating why those assumptions were wrong, and whereas in the past,

Starting point is 00:28:48 the model would have instantly collapsed and kowtowed to exactly how I was thinking about things, in this case, Fable 5 did update its position to take into account the new information that I had given it, but it didn't back off entirely from its initial position. That all on its own is a massive upgrade just from a very basic day-to-day sort of use case that, as we see in all of our AI usage pulse surveys, is a big part of a lot of people's use of AI, that is, strategic ideation. And yet, at the same time, it is very clear that the real power in this model is around previously extremely difficult or impossible tasks, particularly if they involve coding. So let me give you three examples from my early experience.

Starting point is 00:29:28 First of all, for those of you who aren't familiar, Superintelligent is our AI enablement platform that helps companies understand their AI and agent readiness and prioritize what they need to do to get more AI-native. We do that in a couple ways, but primarily through audits, where we deploy voice agents into an organization, which can then interview hundreds or even thousands of people all at the same time, gathering way more information from the ground level than was ever possible before, and then aggregating and analyzing all that information to provide some very specific analysis around where a company is and what steps it might want to take next. The product works really well, but one thing that I increasingly don't like about it is the

Starting point is 00:30:04 approach to voice agents. Unless someone was doing the interview entirely without looking at their screen, the voice agent UX, where you have to sit around waiting for the model to finish talking when the words that it's saying are being transcribed in the window, was just a really suboptimal experience. Now, the real value of voice agents was on the input side because users who are using voice ramble way more than they would if they were typing, which means we get way more context and way more information. And when it comes to something like an agent readiness audit, the more context and the more information you get, the better. Luckily for us, turns out you don't need to use a full-fledged voice agent to let people ramble.

Starting point is 00:30:41 You can just install something like the Whisper API from OpenAI and do it that way. What's more, we've also kind of Frankensteined Superintelligent over time. So what did I do? I asked Fable to rebuild the whole system with the new Whisper-based input model. And, well, it did. It took a few hours, which required me during that time to do exactly nothing, and produced something that is frankly fairly close to production ready in a single shot.

Starting point is 00:31:06 Now, maybe I shouldn't be saying this because it somehow undermines the value of the software we've built, but our value was never in the software. It was always in the way that we collected raw information and turned it into actual signal, meaning that frankly, the more that we can do to make software get out of our own way, the better. Next up, you've probably heard me talk about the Enterprise Claw program, which was a formalization of Claw camp that I launched earlier in the year, and which was a more hands-on, executive-focused paid learning program that taught executives how to build agents. We have now had hundreds and hundreds of executives go through three different cohorts of this Enterprise Claw program with a lot of success, but there are a fair number of companies

Starting point is 00:31:43 for whom our approach with Enterprise Claw, which creates a lot of latitude for open source options, gives people the ability to actually use OpenClaw, and is called Enterprise Claw. Let's just put it this way. There are a lot of executives and companies who are never going to touch that with a 10-foot pole. So now, once again, in collaboration with Superintelligent, we are launching a similar but more enterprise-focused version of the program that we're calling the Agent Transformation Intensive. Consider this your preview. Again in one shot, I used Fable 5 to rebuild not only the marketing site for the Agent Transformation Intensive, but the actual platform we run it on as well. Lastly, and this may be the one that actually best reflects what Fable 5 does, I've been working

Starting point is 00:32:24 on a new web experience for the AI Daily Brief that basically turns episodes into extremely shareable nuggets. The most important growth channel for the AI Daily Brief, and one of the most valuable use cases, is you guys sharing it with your colleagues. This is also, I've heard over and over, a significant value proposition for you as listeners: the ability to share specific pieces with your colleagues. However, that specific pieces part is a challenge, as the AI Daily Brief, despite being daily, is quite dense. So the idea of this new website is to actually chunk the episodes into relevant quotes, relevant sections, relevant numbers, where you can share just that piece. Now, with Opus 4-8, I had already started to spec this out, and when I asked Fable 5 to go back and review what we had done,

Starting point is 00:33:05 it basically said the problem with this is that it's just an idea, it's not production ready, and it turned what were effectively a bunch of fancy mock-ups into an actual production pipeline that I've now handed over to Claude Code to build for real, meaning you guys might be getting this sooner rather than later. And the reason that I think that this is a good summation of my experience with Fable 5 so far is that it really does feel like a totally different world of delegating to the agent. Even with these extremely capable agents in the past, you still had to do a lot of management. There is now, frankly, just much less of that management, which has the consequence, I think, of upsizing the ambition.

Starting point is 00:33:41 Now, this is what a lot of the Anthropic staffers themselves described. Alex Albert writes, I've been at Anthropic through every model launch. There's been a few cases I can remember of a launch that stands out and marks a step change in how we use models. Claude Opus 3, Sonnet 3.5, Opus 4.5, and now Claude Fable 5. With Fable, the models stopped feeling like a tool I direct and started feeling more like something I collaborate with. Felix Rieseberg writes, I normally highlight the numbers, but I want to talk about something else,

Starting point is 00:34:10 because with Fable 5 out in the world, I think a third era quietly started today. I lead Claude Code and Cowork on the desktop, so I think a lot about how people use AI to get work done. I believe we're about to see a major shift, moving from giving AI tasks to giving it responsibilities. When LLMs first hit the mainstream, users asked them questions. Like a smarter search engine or an autocomplete for code. Then the frontier moved to tasks, handing the model an entire problem, which bug to fix, what doc to write.

Starting point is 00:34:37 That's how most of our advanced users work with AI. They're in the loop. Every task starts and ends with a human. With Fable 5, I've personally moved on to responsibilities or loops. I no longer tell Claude to investigate a particular crash report. It runs a loop watching every crash report that comes in. Its job is no longer to help me fix a crash. It's to keep our apps from crashing. The shift sounds subtle, but I think it'll change what AI products look like.

Starting point is 00:34:59 When developers went from answers to tasks, the primary tool changed from IDEs to coding agents. AI apps in 2026 look nothing like 2024. Predictions are a dangerous game, but I really believe our industry's apps in 2027 will look very, very different from the ones we have today. So there are two big implications of this. First of all, I think we all might have to develop a new skill around use case classification. Basically, I think that in this paradigm of token efficiency, we as individuals are going to have to, to some extent, become token efficiency optimizers ourselves by understanding which use cases require

Starting point is 00:35:34 different models. Now, for a while now, people have paid lip service to the idea that different classes or powers of models could be used for different things, but I'm almost positive that a lot of the power user type AIDB listeners are still the type to crank state-of-the-art models to extra high even when they're asking for a grilled cheese recipe, because screw it, you want the power, that's why. With the Fable 5 class models coming online, especially as they move to usage-based, I do actually think we're going to have to develop that muscle to understand which of our use cases require and fit each different power level of model. Second, though, and maybe even

Starting point is 00:36:10 more interestingly, I think that we are all going to go through a period of having to up-level our ambition. As someone who spends a lot of time looking at the frankly completely moribund landscape of AI training, even the best programs are still about how you use agents to do different versions or better versions of the work that you do today. Maybe they push a little bit in using new ways to write software to solve your old problems, but even that, I think, is not enough. Nate B. Jones, who many of you might recognize from TikTok or another short form video platform, describes the new skill we're going to have to develop as task imagination, and I think it's a really great way to put it. Anthropic released their new supermodel, right? Fable 5. And Fable

Starting point is 00:36:51 5, even though it's kind of nerfed because it's not as capable as Mythos 5, the really dangerous one that was released under Glasswing, it's still super strong. I've been playing with it. And you know what that is making me think? The thing that actually matters to most of us is task imagination now. We are sort of sorcerers with these models, we have to have a practical guide for how to do magic with the models. Because for most of history, we've had two modes. Wave our hands and give a general guideline and hope people like get the idea and then walk away. Or do all the work ourselves and get super detailed.

Starting point is 00:37:30 Having that middle layer of like, this is what I want, this is the bar, this is how it works, this is not very human. This is not how we typically have worked. But with tools like Fable 5 that can run for nine hours, 12 hours, days? Days. Do you have anything you can give AI that will take days? Let me just ask you that.

Starting point is 00:37:50 I know there are some people who do, and when you do, put them in the comments. But there's going to be a bunch of us who are like, no, I have nothing that has ever taken remotely even an hour on AI. So what am I doing with Fable 5? We need better task imagination. So he breezes through it, but I love this idea of task imagination, and that's something that I'm going to spend a lot more time on in the weeks to come. You know, somewhat ironically, yesterday's episode was called OpenAI declaring the next phase of

Starting point is 00:38:16 AI, but with the release of Fable 5, it seems to really be the case. Then again, for all of you folks out there who have shifted over to the Codex world and are now staring at your lonely Claude Code terminal, wondering if you need to go back, it may be worth taking just a beat. Robert Corson tweeted, at this point, I don't want GPT 5.6, it needs to be GPT 6. No way Anthropic has completely blown past them like this. Three models in two months and Fable is not even their best model? Feels like Anthropic ruined OpenAI's whole model roadmap and release plan.

Starting point is 00:38:48 In response, Tibo from the OpenAI Codex team, who now leads a lot of their product efforts, wrote, feeling pretty good about things. My friends, we could be in for quite a week. For now, though, we are going to end this very long edition of the AI Daily Not So Brief. Appreciate you listening or watching, as always. And until next time, peace.
