The AI Daily Brief: Artificial Intelligence News and Analysis - Claude Sonnet 4.5 Can Code Autonomously for 30 Hours 🤯

Episode Date: September 30, 2025

Anthropic's Claude Sonnet 4.5 reportedly demonstrates groundbreaking autonomy by coding for up to 30 hours non-stop, significantly outpacing prior benchmarks like GPT-5 Codex's seven-hour runs. ...This leap is enabled by innovations such as enforced modular artifacts, persistent memory surfaces, planning loops, and runtime constraints, transforming the way AI tackles complex, long-horizon tasks. The broader implication is that AI is now not only capable of building sophisticated applications autonomously but is also recursively engineering its own future iterations, rapidly accelerating progress across the tech landscape.

Brought to you by:
Is your enterprise ready for the future of agentic AI? Visit AGNTCY.org
Visit Outshift Internet of Agents
Try Notion AI today with Notion 3.0: https://ntn.so/nlw
KPMG: Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Blitzy.com: Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils: Cloud-native AI solutions that power results. https://robotsandpencils.com/
Vanta: Simplify compliance. https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent: Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? nlw@aidailybrief.ai

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, Anthropic's new Sonnet 4.5 model can apparently code independently for up to 30 hours. We're going to talk about what that means for the state of AI autonomy. And before that in the headlines, OpenAI is apparently not only about to launch Sora 2, but an AI-only TikTok-style video app as well. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, Robots & Pencils, Notion, and Superintelligent. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief.
Starting point is 00:00:41 And if you are interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai to find out about all the opportunities. This truly is the smartest, most engaged, and most high-power AI audience in the world. So if you are interested in accessing that, please do reach out. And with that, let's dive in. Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. We kick off today with the latest rumors
Starting point is 00:01:04 out of OpenAI, where that company is expected to launch not only their next-generation video model Sora 2, but also a social app for AI-generated video to go alongside it. There is actually a lot to unpack here. This is more than just a model release, so let's dig in. Sources speaking with the Wall Street Journal said that the model and its companion app would be arriving in the coming days. In fact, you might have noticed this set of new commercials that OpenAI dropped yesterday, seemingly as a marketing campaign. Some think that they are actually doing double duty, not just advertising ChatGPT, but sneakily showing off the video generation capabilities of the new Sora.
Starting point is 00:01:43 Right, smart Kretschman, it certainly doesn't look generated, but some are arguing that Sora 2 could be indistinguishable from real video. Some broke down the video frame by frame and thought they found evidence it was AI-generated, and what people are excited about is that each video appears to show camera motion that would be difficult, bordering on impossible, even if you were using a drone. In the second ad, showing a man cooking Italian food for his date, we zoom out from an extreme close-up through a cluttered kitchen, out the window, and across the street. Summing up the feelings of many, software engineer Jolson Rebello wrote, if this really is Sora 2, nothing will be the same anymore. Now, the scoop from
Starting point is 00:02:19 Wired is that alongside the new Sora, OpenAI is poised to release a short-form video app powered by the new model. The app, which features a vertical video feed with swipe-to-scroll navigation, appears to closely resemble TikTok, except all of the content is AI-generated. There's a For You-style page powered by a recommendation algorithm. On the right side of the feed, a menu bar gives users the option to like, comment, or remix a video. The app reportedly does not allow users to upload photos or videos, with the only source of content being Sora 2.
Starting point is 00:02:47 However, users will be able to verify their likeness and have themselves appear in generated clips. Other users can also generate clips featuring verified likenesses, but users will receive a notification when clips featuring them are generated. Wired continued, OpenAI appears to be betting that the Sora 2 app will let people interact with AI-generated video in a way that fundamentally changes their experience of the technology, similar to how ChatGPT helped users realize the potential of AI-generated text. And apparently this is more than just showing off a technology, but also a business recognition of the opportunity of the moment.
Starting point is 00:03:18 Wired continues, internally sources say there's also a feeling that President Trump's on-again, off-again deal to sell TikTok's U.S. operations has given OpenAI a unique opportunity to launch a short-form video app, particularly one without close ties to China. Now, not everyone is thrilled about this. In fact, there has been an explosive conversation around the lamentability of short-form brain rot ever since Meta announced its Vibes feed last week. You'll remember I did a whole segment of an episode that was all about how much people did not like the idea of Meta having an AI video-only feed. Now, as OpenAI apparently gets ready to release something like this, we'll get to see how much of that was about
Starting point is 00:03:55 the format versus how much of that was people just not liking Meta. Interestingly, in the wake of all of that controversy, OpenAI insider Roon had posted, there is a moral panic around short-form video content in my opinion. He added that he understands the concern, but that he's just not certain that hours on TikTok are meaningfully different from hours in front of the TV. He said, I basically agree with Postman on the nature of video and its corrupting influence on running a civilization well as opposed to text-based media. I'm just not sure that it's so much worse than being glued to your TV, and I'm definitely
Starting point is 00:04:23 not sure that AI slop is worse than human slop. Ahman Osman noted that Roon appeared to be breaking the narrative on X a few days ahead of this key announcement from OpenAI. With the report that we're getting this app from OpenAI, Ahman said, bro was running narrative ops? Now, one other interesting dimension of the story has to do with the copyright arrangements that OpenAI will be putting in place. Sources said that the company has begun notifying talent agencies and studios about the product over the past week. The communication notified rights holders that they will need to explicitly opt out; otherwise their intellectual property will be included in the generated videos. The small nuance is that recognizable public figures won't appear without explicit
Starting point is 00:05:00 permission, but fictional characters will require an opt-out. The debate around that could be an episode all on its own. But look, I think that we are very close to actually getting this app, so I'm going to pause it here. We will come back and talk about all the implications when we see what the thing actually is and we get people's actual first reactions to it. Moving on to our next story, more layoff news seemingly related to AI. Airline Lufthansa said they would be eliminating the equivalent of 4,000 full-time roles by 2030.
Starting point is 00:05:26 That's around 4% of their 102,000 strong workforce. However, this is a highly targeted downsizing, with Lufthansa aiming to make the cuts primarily from their 10,000 administrative roles. The layoffs are also a sharp change in direction, as Lufthansa stated that they would be adding 10,000 new hires over the course of the year back in January. In a press release, the company said, the Lufthansa Group is reviewing which activities will no longer be necessary in the future, for example, due to duplication of work. In particular, the profound changes brought about by digitalization and the increased use of
Starting point is 00:05:54 artificial intelligence will lead to greater efficiency in many areas and processes. Like Accenture's downsizing announcement last week, this doesn't appear to be a case of a struggling company using AI to mask a round of belt-tightening. After a troubled year in 2024, where operating margins dropped to 4.4%, Lufthansa has guided that they expect margins to reach 10% by 2028, up from their strategic target of 8%. They also expect to see 2.5 billion euros of free cash flow by that date. The stock was up 0.9% on the news to bolster a year-to-date gain of 25%. Now, Lufthansa claims that this is purely a restructuring effort to get ahead of reduced workforce needs as they accelerate AI adoption. And we are certainly going to be keeping
Starting point is 00:06:32 an eye to see whether this type of announcement, i.e. forward telegraphing of AI shifts, becomes a trend. Lastly, today, a bit of regulatory news. California Governor Gavin Newsom has signed AI safety bill SB 53. The bill is a watered down version of last year's SB 1047, which was vetoed by Newsom in September. SB 53 requires leading AI companies to report the safety protocols they use in producing models and disclose the highest degree risks posed by the models. The law is largely concerned with catastrophic risks like aiding in bioweapons production or facilitating mass casualty events. In addition, the law strengthens whistleblower protections for employees of AI labs. California state senator Scott Wiener, the chief sponsor of the bill,
Starting point is 00:07:11 said, this is a groundbreaking law that promotes both innovation and safety. The two are not mutually exclusive, even though they are often pitted against each other. Now, last year's SB 1047 featured a huge amount of very public pushback, while this process has been quite a bit quieter by comparison. Anthropic came out in favor of the bill while Google and OpenAI opposed it. Meta was on the fence, not endorsing the bill, but giving Newsom a soft green light to sign it. And there are still some concerns. Collin McCune, the head of government affairs at Andreessen Horowitz, posted, we're fighting for a national AI strategy that gives Little Tech a fair shot and keeps the U.S. in the lead. California's AI bill SB 53
Starting point is 00:07:45 includes some thoughtful provisions that account for the distinct needs of startups, but it misses an important mark by regulating how the technology is developed, a move that risks squeezing out startups, slowing innovation, and entrenching the biggest players. As well as railing against the idea of state-by-state regulation, Collin argued that the rules should govern how AI models are used rather than how they are trained. Still, the bill was drafted explicitly as a compromise. Senator Wiener worked with the Joint California Policy Working Group on AI Frontier Models, which was set up last year following the veto. That group was chaired by Dr. Fei-Fei Li and includes numerous industry stakeholders.
Starting point is 00:08:17 Overall, Newsom said that in passing the bill, quote, California has proven that we can establish protections to protect our communities, while also ensuring that the growing AI industry continues to thrive. This legislation strikes that balance. A last note before we move over to our main episode, yesterday was one of those days where we had two very distinct big stories that could easily be a main all on their own. The first, which is what I went with, is all about Claude 4.5
Starting point is 00:08:41 and the expansion of the autonomy frontier for agents, but there is a ton about agentic commerce and OpenAI's new checkout feature in ChatGPT that really deserves its own space as well. I decided that rather than crowding that into the headlines today, the plan is currently for it to be the main episode for tomorrow. Although if we get that Sora 2 app or something new and big, who knows? Suffice it to say that sometime this week we will get into all of that. For now though, that's going to do it for today's actual headlines. Next up, the main episode. What if AI wasn't just a buzzword, but a business imperative? On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most
Starting point is 00:09:19 forward-thinking enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG, this seven-part series delivers real-world insights from leaders who are scaling AI with purpose, from aligning culture and leadership to building trust, data readiness, and deploying AI agents. Whether you're a C-suite executive, strategist, or innovator, this podcast is your front-row seat to the future of enterprise AI. So go check it out at www. or search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts.
Starting point is 00:09:53 AI changes fast. You need a partner built for the long game. Robots & Pencils works side by side with organizations to turn AI ambition into real human impact. As an AWS certified partner, they modernize infrastructure, design cloud-native systems, and apply AI to create business value. And their partnerships don't end at launch. As AI changes, Robots & Pencils stays by your side so you keep pace. The difference is close partnership that builds value and compounds over time. Plus, with delivery centers across the U.S., Canada, Europe, and Latin America, clients get local expertise and global scale. For AI that delivers progress, not promises, visit robotsandpencils.com slash AI Daily Brief. Chatbots are great, but they can only take you so far. I've recently
Starting point is 00:10:37 been testing Notion's new AI agents, and they are a very different type of experience. These are agents that actually complete entire workflows for you in your style, and best of all, they work in a channel that you already know and love because they are purpose-built Notion super users. Notion's new AI agents completely expand the range of what Notion can do. They can now build documents from your entire company's knowledge base, organize scattered information into organized reports, and basically do tasks that used to take days and get them complete in minutes. These agents don't just help with work, they finish it. Getting started with building on Notion is easier than ever. Notion agents are now your very own super user to help you onboard in minutes.
Starting point is 00:11:14 Your AI teammates are ready to work. Try Notion AI for free at the link in our show notes. Today's episode is brought to you by Superintelligent. Now, one thing that we are having a lot of conversations with folks about is the fact that for some of you, your fiscal year is coming to an end, and that means two things. One, it means planning and thinking about what you're going to do in the next year, and two, it means using up the last of those budgets so you don't lose them. If you are an enterprise that happens to find yourself in that situation,
Starting point is 00:11:41 Superintelligent would love to help on both fronts. We are moving increasingly towards an annual AI planning model where we map out how you can create an action map of your organization's agent opportunities that represents an executable backlog of AI and agent use cases that you can deliver on over the course of the next year. Additionally, for those end-of-year budgets, we have worked out deals with a number of partners where we can pre-lock in general implementation packages
Starting point is 00:12:06 even before you've figured out exactly what use cases are going to require them. If you'd like to learn more about Superintelligent's agent readiness audits and this new end-of-fiscal-year plan, visit us at besuper.ai, click get started, and make sure to use the word fiscal somewhere in the description. Welcome back to the AI Daily Brief. Today we are talking about a much-anticipated model release in the form of Claude Sonnet 4.5. Now, on the one hand, people have been excited about Anthropic releasing their latest Claude 4.5
Starting point is 00:12:34 model in general, but really when push comes to shove, the coding implications of Sonnet 4.5 are what people have been most focused on. Today we're going to talk about the response to that model, an interesting new user experience that came with it, and about how our sense of the autonomy frontier might be fundamentally off, as this thing apparently has coded for up to 30 hours completely autonomously. First up, though, let's talk about what was announced. No surprise, Anthropic decided to focus the announcement on the coding implications. In fact, in their opening tweet, they call it the best coding model in the world. Now, if you are a regular listener, you'll know that all the way back since 3.5, Claude really has been, for most of that time,
Starting point is 00:13:14 the preferred set of models when it comes to coding use cases. The only exception to that, really, has been in the last month or so, where GPT-5 and OpenAI's Codex have started to win back market share from the Claude models, both because of the gains of GPT-5, but also because of some issues during August with model performance on Anthropic's side. Sonnet 4.5 is very much Anthropic's attempt to reclaim that crown. They write, it's the strongest model for building complex agents, it's the best model at using computers, and it shows substantial gains on tests of reasoning and math. Benchmarks, as you know, are one of my least favorite ways to understand a new model, but the published benchmarks do show some big jumps, especially when it comes to these coding use cases.
Starting point is 00:13:52 For example, on SWE-bench Verified, they're up to 77.2% raw, as opposed to GPT-5 Codex's 74.5%, and all the way up to 82% with what they call parallel test-time compute. On the Terminal-Bench benchmark for agentic terminal coding, they claim 50% as opposed to GPT-5's 43.8%. And basically all of the other benchmarks put them alongside the Opus 4.1 and GPT-5 class models of the world. The company did also announce a number of upgrades to Claude Code itself. The first is the Claude Agent SDK,
Starting point is 00:14:22 which basically gives users access to the tools, context management systems, and permissions frameworks that are embedded in Claude Code. They've also got an updated terminal interface and a new VS Code extension, so people can work with Claude Code in their IDE instead. They also added this little checkpoints feature, which is getting punted aside based on all the other news, but as they put it, lets you instantly undo Claude's latest changes, which seems like a super valuable feature for any sort of agentic coding use case. Still, the big show is, of course, this new model, and that's what everyone was focused on.
Starting point is 00:14:53 And as tends to happen with a new model, there is some variety in the first impressions. While I didn't see anyone that had an outright bad experience with it, there were certainly some meh, shoulder-shrug types of experiences. Jeremy Mack writes, early results for Sonnet 4.5, code quality not markedly different than 4. CSS is improved, outputting markdown when not asked, same price and TPS as always. GosuCoder writes, first impression of 4.5, keep in mind this is after three hours of head-down coding so still early. One, I don't think I can see a difference versus 4.0. In fact, if you told me this was actually 4.0, I'd believe you. 3.5 and 3.7 were noticeably different.
Starting point is 00:15:32 Two, still had to go back to GPT-5 for a few things that Sonnet couldn't figure out. We have definitely hit a wall in coding progress. Now, a lot of people responded that they hadn't had that same experience. Ming, for example, said that he had found that it was better at following instructions and better at parallel tool calling, and others just generally said that they were more impressed. On the other end of the spectrum, you had a lot of posts like this one from Leo Cynthwave, who wrote, My verdict on 4.5 Sonnet, very good vibes, very fast. Although at the same time, he also said, thinking, which is a particular mode of this model, often doesn't seem to yield a significant
Starting point is 00:16:05 improvement in output, and I still prefer Codex with GPT-5 Codex for agentic use. Tool use seemed to be a thing that Anthropic was focused on. Kim Minismus called out this section of the announcement post as related to tool usage. The model more effectively uses parallel tool calls, firing off multiple speculative searches simultaneously during research and reading several files at once to build context faster. Improved coordination across multiple tools and information sources enables the model to effectively leverage a wide range of capabilities in agentic search and coding workflows. Simon Willison did a deep dive, headlined by the statement, I think it may live up to Anthropic's claims of being the best coding model in the world for the next few weeks at least.
Starting point is 00:16:43 And in his post, he definitely talked about this enhanced tool usage as one of the big upgrades. Dan Shipper and the team at Every summed up by saying that it was faster than GPT-5 Codex and smarter and more steerable than Opus 4.1. And the big thing that they noted was the speed and the performance for the cost. They said that the new Sonnet 4.5 felt about 50% faster than previous versions of Claude. They also said that it was smarter than Opus and, more than anything else, it was 5x cheaper. Dan writes, it's still the same pricing as the old Sonnet 4, so there's basically no reason to use Opus in the API anymore, Sonnet all day. Some other folks noticed benefits in areas other than coding.
Starting point is 00:17:19 For example, Bindu Reddy wrote, so far, definite improvement on coding, math, and data analysis over Sonnet 4. Ethan Mollick wrote, it's a really good model. I saw especially big jumps in doing finance and statistics, which tend to get overlooked in the focus on coding. And in fact, if you go to Anthropic's announcement post, the focus on finance was one of their big notes. For example, in their published benchmark for financial analysis, Sonnet 4.5 got a 55.3%
Starting point is 00:17:44 compared to, for example, GPT-5's 46.9%. Peter Wildeford wrote, Everyone talking about 4.5 being great at coding, but I'm taking way more notice of that huge increase in computer use score. The jump he's noting is from 44.4% in Opus 4.1 to 61.4% with Sonnet 4.5 on the OSWorld test. Peter writes, that's a huge increase over the state of the art,
Starting point is 00:18:07 and I don't think we've seen anything similarly good at OSWorld from others. Claude agents coming soon. Now, speaking of agents and just production use cases of these models, some of the big agentic coding companies instantly started to put this model into production. The Factory team, which focuses on agentic coding for enterprises, wrote, after testing with Anthropic, we find the strengths of Sonnet 4.5 to be: significantly more reliable and accurate file editing, high environmental awareness, snappier than previous models on quick questions, not overthinking simple tasks. Walden from Cognition wrote,
Starting point is 00:18:38 When our team tried Sonnet 4.5, we realized it was worth building a whole new version of Devin around it. This model behaves very differently. They actually published an entire blog post about what they changed. They wrote, because Devin is an agent that plans, executes, and iterates, rather than just auto-completing code, we get an unusual window into model capabilities. Each improvement compounds across our feedback loops, giving us a perspective on what's genuinely changed. With Sonnet 4.5, we're seeing the biggest leap since Sonnet 3.6. Planning performance is up 18%, end-to-end eval scores are up 12%, and multi-hour sessions are dramatically faster and more reliable.
Starting point is 00:19:14 A couple of other notes that they shared. They write, Sonnet 4.5 is the first model we've seen that is aware of its own context window, and this shapes how it behaves. As it approaches context limits, we've observed it proactively summarizing its progress and becoming more decisive about implementing fixes to close out tasks. Interestingly, they said that this context anxiety, which is their term for it, can actually hurt performance, where they've observed the model taking shortcuts or leaving tasks incomplete because it believed it was near the end of its window even if it had
Starting point is 00:19:41 plenty of room left. More at Stefan from Cognition also noted that the model tracks all modified features and doesn't stop until they work. He writes, one particularly impressive moment was when I asked it to build a Datadog clone and it ran a log emission script in the background while using Devin's browser to test the live event ingestion UI. Now with all that, so far I haven't seen people who had switched over to GPT-5 Codex rushing to get back into the Anthropic sphere. Peter Gostev writes, definitely better than Sonnet 4, but not obviously better than GPT-5 Thinking High and Codex models just now. Victor Taelin writes, I really like Claude 4.5 for coding. It's fast, reliable, surgical, high quality in a good way. I think I will use it a lot,
Starting point is 00:20:20 especially for style refactors and things like that. But it is nowhere near as smart as GPT-5. I wouldn't leave it alone making large changes on HVM. Yes, it sucks to wait 30 minutes for a Codex refactor, but debugging AI-introduced errors takes way more time than that. Peak intelligence is very important. GPT-5 is not nearly as smart as I need, and Sonnet is less smart than that. Eric Provencher had a really interesting way of putting it. He writes, I'm starting to see Anthropic models as light reasoning models while OpenAI models are deep reasoning models. With only light reasoning, Sonnet 4.5 excels at efficient context usage to pinpoint information. Codex tool calls are bulky and they're interspersed with reasoning tokens to test hypotheses.
Starting point is 00:20:59 It craves context to understand more of the problem. The gap between GPT-5 and Sonnet 4.5 becomes apparent when you have a hot context window where no new tool calls are needed. GPT-5 can think for a few minutes on end to find a detailed, complete solution, while Sonnet 4.5 is satisfied with a few seconds for a serviceable one. Deep reasoning only works with sufficient context, but allows the model to really evaluate problems so exhaustively that it appears almost superhuman. By contrast, light reasoning stays closer to the surface, but serves as breathing room for models to collect their thoughts. It is in many ways much more human. Anthropic is far and away ahead on light reasoning. Which is super interesting. I think this is a much more useful diagnostic than a simple
Starting point is 00:21:37 better or worse. And once again, it comes back to the idea that we live in a world where, at least for the moment, the best strategy if you truly want optimal performance is going to be model switching based on different contexts and needs. Now, there are two more things that I think are really worth noting about this launch. The first is Imagine with Claude. In their announcement post, Anthropic called this a bonus research preview. They write, in this experiment, Claude generates software on the fly. No functionality is predetermined. No code is prewritten. What you see is Claude creating in real time, responding and adapting to your request as you interact. It's a fun demonstration of what Claude Sonnet 4.5 can do, a way to see what's possible when you combine a
Starting point is 00:22:15 capable model with the right infrastructure. Sean Strong from Anthropic wrote a little bit more about Imagine. He said, it pioneers the concept of model as backend, using a model to not only generate interfaces on the fly, but also power all the functionality behind it. An example he gave was a Choose Your Own Adventure version of his founder journey. He writes, For the prompt, I asked Claude to generate an interactive Choose Your Own Adventure game based on my startup experience. It accurately retold our pivot from VR games to management, even making an interactive management dashboard and app launcher to showcase key functionality.
Starting point is 00:22:47 It then had us go through our fundraise, massive growth, and ultimate shutdown due to COVID. Peter Yang asked it to, quote, show me the desktop of a bad PM on the left versus a great PM on the right. Swyx from Latent Space and now Cognition validated that 4.5 is a very good coding model in general, but chose to focus on Imagine in his post about it as well. He writes, Most generative UI today is no more than glorified tool calling of pre-made components. Imagine with Claude is the first mainstream adoption of the WebSim paradigm that went viral last year, generating entire UIs on the fly that you can immediately use.
Starting point is 00:23:20 4.5 Sonnet enables vibe coding to be so fast and so good that you can conjure up ephemeral apps to explore the latent space of what's possible, just in time as you explore it. Now he caveats, it isn't perfect yet. Buttons in dense UIs like simulated email clients often don't work or are slow enough that the illusion is gone. But it's a generation away from replacing the tyranny of designs made for the median person and ushering in the age of truly personalized malleable software. Josh Bickett picks up on that and writes,
Starting point is 00:23:47 Claude Imagine could become a new form factor for how we interact with AI. It's completely different than chat. It's like a generative computer that we talk to in natural language. I'd guess that the vision is that everyone gets their own personal, consistent generative computer instance with Claude Code generating the UI, processing data and files under the hood. I'd guess that what's happening is a front end is passing the prompt directly to a Claude Code terminal agent, which writes back to the front end.
Starting point is 00:24:11 It looks like a beautiful feedback loop. I'm going to put in some reps with Imagine this week before the preview goes away, and will certainly share what I discover. Now, the other big thing that people were really jumping on was the immense time that Sonnet 4.5 is apparently able to work autonomously for. Hayden Field from The Verge wrote about this in her piece about the announcement. She sums up, Anthropic's latest AI model spent 30 hours running by itself to code a chat app akin to Slack or Teams. It spat out about 11,000 lines of code and it only stopped running when it had completed the task.
Starting point is 00:24:43 Now, some tried to figure out how this was possible. Carlos Perez writes, How is it possible that Sonnet 4.5 is able to work for 30 hours to build an app like Slack? The system prompts have been leaked and Sonnet 4.5 reveals its secret sauce. Here are some of the ways it accomplishes this. It forces quote-unquote big code into durable artifacts. Anything over about 20 lines is required to be emitted as an artifact, with only one artifact per response.
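As a rough illustration of the rule described here (code over about 20 lines goes into an artifact, and only one artifact per response), a minimal Python sketch might look like the following. This is not Anthropic's implementation or the leaked prompt itself; `ArtifactStore`, `route_response`, and the exact threshold are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass, field

# "About 20 lines" is the figure quoted in the episode; the exact cutoff is an assumption.
ARTIFACT_LINE_THRESHOLD = 20


@dataclass
class ArtifactStore:
    """Hypothetical append-only store: artifacts are added, never edited or removed."""
    artifacts: list = field(default_factory=list)

    def append(self, code: str) -> int:
        """Store one artifact and return its index."""
        self.artifacts.append(code)
        return len(self.artifacts) - 1


def route_response(code_blocks, store):
    """Split one response's code blocks into inline snippets and at most one artifact.

    Returns (inline_blocks, artifact_id), where artifact_id is None if nothing
    exceeded the threshold. Raises ValueError if more than one block would
    need to become an artifact.
    """
    big = [b for b in code_blocks if len(b.splitlines()) > ARTIFACT_LINE_THRESHOLD]
    if len(big) > 1:
        raise ValueError("only one artifact is allowed per response")
    inline = [b for b in code_blocks if len(b.splitlines()) <= ARTIFACT_LINE_THRESHOLD]
    artifact_id = store.append(big[0]) if big else None
    return inline, artifact_id
```

In this sketch, a long-running build would call `route_response` once per model response, so the store grows by at most one module at a time while short snippets stay inline.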
Starting point is 00:25:06 He writes, that gives the model a persistent append-only surface to build large apps module by module without truncation. He also points out that it enforces runtime constraints, governs tool loops, and supports long-horizon autonomy via planning and feedback loops. And ultimately, whatever the combination of things, if this is really true,
Starting point is 00:25:23 it is a total game changer when it comes to the autonomy horizon that we've been working on. When Replit announced Agent 3, they shared that it had reached autonomous agent runs of 200 minutes. And a few days later, OpenAI announced their coding-optimized GPT-5 Codex model, where that company said, quote, during testing we've seen GPT-5 Codex work independently for more than seven hours at a time on large complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation. At the time, which was literally just two weeks ago, people were saying that even that was insane. But now we've got this claim for 30 hours that just
Starting point is 00:25:58 obviously blows that out of the water. And one example that Anthropic gave to really sum up and dramatize the progress that has been made in the AI coding space over just the last couple of years is that they asked every previous version of Claude to make a clone of Claude.ai. It wasn't until 3.6 that you even had something that you could try to log into, and it wasn't until Sonnet 4 that there was even a functional clone. Now it was able to build something that actually worked, working autonomously for over five hours to do so. Nick Dobos takes a step back and points out, it's honestly insane how fast these are improving. SWE-bench from 33% to 82% in just around a year. Part of the reason that we spend so much time on the coding use cases on this show,
Starting point is 00:26:38 even though many in this audience, in fact most of this audience, are not software engineers by training, is not only that, thanks to these new tools, all of us get to be software developers to some extent or another. It's that coding is so clearly the frontier where we are seeing the biggest changes take place when it comes to model capabilities. Agentic coding improvement is not just a bellwether of where models are. It's also the mechanism by which they get better at everything else as well. I'm going to be keeping a close eye to see if anyone outside of the lab setting gets anywhere close to that 30 hours of performance.
Starting point is 00:27:08 But if it's true, it really is a game changer. Rohan Paul went back to a recent Axios interview with Dario Amodei, where Dario said, the vast majority of code that is used to support Claude and to design the next Claude is now written by Claude. It's just the vast majority of it within Anthropic, and at other fast-moving companies the same is true. Rohan adds, now it all makes sense. Claude Sonnet 4.5 can keep its coding focus for nonstop 30 hours. The shift has started in all of tech. Now, things move fast in this space. From the first reads, it's not even clear that Sonnet 4.5 is definitively the best coding model
Starting point is 00:27:42 compared to GPT-5 Codex, and even the people who think it is are still kind of waiting to see what comes with Gemini 3. But it is yet another moment that shows the relentless pace of change in this space, and I'm excited to see what new opportunities it unlocks. For now, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always, and until next time, peace.
