The AI Daily Brief: Artificial Intelligence News and Analysis - Claude Sonnet 4.5 Can Code Autonomously for 30 Hours 🤯
Episode Date: September 30, 2025
Anthropic's Claude Sonnet 4.5 reportedly demonstrates groundbreaking autonomy by coding for up to 30 hours non-stop, significantly outpacing prior benchmarks like GPT-5 Codex’s seven-hour runs. ...This leap is enabled by innovations such as enforced modular artifacts, persistent memory surfaces, planning loops, and runtime constraints—transforming the way AI tackles complex, long-horizon tasks. The broader implication is that AI is now not only capable of building sophisticated applications autonomously but is also recursively engineering its own future iterations, rapidly accelerating progress across the tech landscape.
Brought to you by:
Is your enterprise ready for the future of agentic AI?
Visit AGNTCY.org
Visit Outshift Internet of Agents
Try Notion AI today with Notion 3.0 https://ntn.so/nlw
KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
Vanta - Simplify compliance - https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? nlw@aidailybrief.ai
Transcript
Today on the AI Daily Brief, Anthropic's new Sonnet 4.5 model can apparently code independently for up to 30 hours.
We're going to talk about what that means for the state of AI autonomy.
And before that in the headlines, OpenAI is apparently not only about to launch Sora 2, but an AI-only TikTok-style video app as well.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, KPMG,
Robots & Pencils, Notion, and Superintelligent.
To get an ad-free version of the show, go to patreon.com slash AI Daily Brief.
And if you are interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai
to find out about all the opportunities.
This truly is the smartest, most engaged, and most high-power AI audience in the world.
So if you are interested in accessing that, please do reach out.
And with that, let's dive in.
Welcome back to the AI Daily Brief Headlines Edition,
all the daily AI news you need in around five minutes.
We kick off today with the latest rumors
out of OpenAI, where that company is expected to launch not only their next generation video model
Sora 2, but also a social app for AI-generated video to go alongside it.
There is actually a lot to unpack here. This is more than just a model release, so let's dig in.
Sources speaking with the Wall Street Journal said that the model and its companion app would
be arriving in the coming days. In fact, you might have noticed the set of new commercials
that OpenAI dropped yesterday, seemingly as a marketing campaign. Some think that they are actually
doing double duty, not just advertising ChatGPT, but sneakily showing off the video generation
capabilities of the new Sora.
Wrote Mark Kretschmann, it certainly doesn't look generated, but some are arguing that Sora 2
could be indistinguishable from real video.
Some broke down the video frame by frame and thought they found evidence it was AI generated,
and what people are excited about is that each video appears to show camera motion that would be
difficult bordering on impossible even if you were using a drone. In the second ad showing a man
cooking Italian food for his date, we zoom out from an extreme close-up through a cluttered kitchen
out the window and across the street. Summing up the feelings of many, software engineer Jolson
Rebello wrote, if this really is Sora 2, nothing will be the same anymore. Now, the scoop from
Wired is that alongside the new Sora, OpenAI is poised to release a short-form video app powered by
the new model. The app, which features a vertical video feed with swipe-to-scroll navigation,
appears to closely resemble TikTok, except all of the content is AI generated.
There's a For You-style page powered by a recommendation algorithm.
On the right side of the feed, a menu bar gives users the option to like, comment, or remix
a video.
The app reportedly does not allow users to upload photos or videos, with the only source of content
being Sora 2.
However, users will be able to verify their likeness and have themselves appear in generated
clips.
Other users can also generate clips featuring verified likenesses, but users will receive a
notification when clips featuring them are generated. Wired continued,
OpenAI appears to be betting that the Sora 2 app will let people interact with AI-generated video
in a way that fundamentally changes their experience of the technology, similar to how
ChatGPT helped users realize the potential of AI-generated text. And apparently this is more than just
showing off a technology, but also a business recognition of the opportunity of the moment.
Wired continues, internally, sources say there's also a feeling that President Trump's
on-again, off-again deal to sell TikTok's U.S. operations has given OpenAI
a unique opportunity to launch a short-form video app, particularly one without
close ties to China. Now, not everyone is thrilled about this. In fact, there has been an explosive
conversation around the lamentability of short-form brain rot ever since Meta announced its
Vibes feed last week. You'll remember I did a whole segment of an episode that was all about
how much people did not like the idea of Meta having an AI video-only feed. Now, as OpenAI
apparently gets ready to release something like this, we'll get to see how much of that was about
the format versus how much of that was people just not liking Meta.
Interestingly, in the wake of all of that controversy, OpenAI insider Roon had posted,
there is a moral panic around short form video content in my opinion.
He added that he understands the concern, but that he's just not certain that hours on
TikTok are meaningfully different to hours in front of the TV.
He said, I basically agree with Postman on the nature of video and its corrupting influence
on running a civilization well as opposed to text-based media.
I'm just not sure that it's so much worse than being glued to your TV, and I'm definitely
not sure that AI slop is worse than human slop. Ahman Osman noted that Roon appeared to be breaking
the narrative on X a few days ahead of this key announcement from OpenAI. With the report that we're
getting this app from OpenAI, Ahman said, bro was running narrative ops? Now, one other interesting
dimension of the story has to do with the copyright arrangements that OpenAI will be putting in
place. Sources said that the company has begun notifying talent agencies and studios about the product
over the past week. The communication notified rights holders that they will need to explicitly opt out,
otherwise their intellectual property will be included in the generated videos,
with the small nuance being that recognizable public figures won't appear without explicit
permission, but fictional characters will require an opt-out.
The debate around that could be an episode all on its own.
But look, I think that we are very close to actually getting this app, so I'm going to
pause it here.
We will come back and talk about all the implications when we see what the thing actually is
and we get people's actual first reactions to it.
Moving on to our next story, more layoff news seemingly related to AI.
Airline Lufthansa said they would be eliminating the equivalent of 4,000 full-time roles by 2030.
That's around 4% of their 102,000-strong workforce.
However, this is a highly targeted downsizing, with Lufthansa aiming to make the cuts primarily
from their 10,000 administrative roles.
The layoffs are also a sharp change in direction, as Lufthansa stated that they would be
adding 10,000 new hires over the course of the year back in January.
In a press release, the company said, the Lufthansa Group is reviewing which activities
will no longer be necessary in the future, for example, due to duplication of work.
In particular, the profound changes brought about by digitalization and the increased use of
artificial intelligence will lead to greater efficiency in many areas and processes.
Like Accenture's downsizing announcement last week, this doesn't appear to be a case of a struggling
company using AI to mask a round of belt-tightening. After a troubled year in 2024, where
operating margins dropped to 4.4%, Lufthansa has guided that they expect margins to reach 10% by
2028, up from their strategic target of 8%. They also expect to see 2.5 billion euros of free cash flow
by that date. The stock was up 0.9% on the news to bolster a year-to-date gain of 25%.
Now, Lufthansa claims that this is purely a restructuring effort to get ahead of reduced
workforce needs as they accelerate AI adoption. And we are certainly going to be keeping
an eye to see whether this type of announcement, i.e. forward telegraphing of AI shifts,
becomes a trend. Lastly, today, a bit of regulatory news. California Governor Gavin Newsom has
signed AI safety bill SB 53. The bill is a watered down version of last year's SB 1047, which
was vetoed by Newsom in September. SB 53 requires leading AI companies to report the safety
protocols they use in producing models and disclose the highest degree risks posed by the models.
The law is largely concerned with catastrophic risks like aiding in bioweapons production
or facilitating mass casualty events. In addition, the law strengthens whistleblower protections
for employees of AI labs. California state senator Scott Wiener, the chief sponsor of the bill,
said, this is a groundbreaking law that promotes both innovation and safety. The two are not
mutually exclusive, even though they are often pitted against each other.
Now, last year's SB 1047 featured a huge amount of very public pushback, while this process
has been quite a bit quieter by comparison. Anthropic came out in favor of the bill while Google
and OpenAI opposed it. Meta was on the fence, not endorsing the bill, but giving Newsom a soft
green light to sign it. And there are still some concerns. Collin McCune, the head of government
affairs at Andreessen Horowitz, posted, we're fighting for a national AI strategy that gives little
tech a fair shot and keeps the U.S. in the lead. California's AI bill SB 53
includes some thoughtful provisions that account for the distinct needs of startups, but it misses an
important mark by regulating how the technology is developed, a move that risks squeezing out startups,
slowing innovation, and entrenching the biggest players. As well as railing against the idea of
state-by-state regulation, Collin argued that the rules should govern how AI models are used rather than how
they are trained. Still, the bill was drafted explicitly as a compromise. Senator Wiener worked with
California's Joint California Policy Working Group on AI Frontier Models, which was set up last year
following the veto. That group was chaired by Dr. Fei-Fei Li and includes
numerous industry stakeholders.
Overall, Newsom said that in passing the bill, quote,
California has proven that we can establish protections to protect our communities,
while also ensuring that the growing AI industry continues to thrive.
This legislation strikes that balance.
A last note before we move over to our main episode,
yesterday was one of those days where we had two very distinct big stories
that could easily be a main all on their own.
The first, which is what I went with, is all about Claude 4.5
and the expansion of the autonomy frontier for agents.
but there is a ton about agentic commerce and OpenAI's new checkout feature in ChatGPT that really
deserves its own space as well. I decided that rather than crowding that into the headlines today,
the plan is currently for it to be the main episode for tomorrow. Although if we get that Sora 2 app
or something new and big, who knows? Suffice it to say that sometime this week we will get into all
of that. For now though, that's going to do it for today's actual headlines. Next up, the main
episode. What if AI wasn't just a buzzword, but a business imperative?
On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most
forward-thinking enterprises.
Hosted by me, Nathaniel Whittemore, and powered by KPMG, this seven-part series delivers real-world
insights from leaders who are scaling AI with purpose, from aligning culture and leadership
to building trust, data readiness, and deploying AI agents.
Whether you're a C-suite executive, strategist, or innovator, this podcast is your front-row seat
to the future of enterprise AI.
So go check it out at www.
or search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts.
AI changes fast. You need a partner built for the long game. Robots & Pencils works side by side
with organizations to turn AI ambition into real human impact. As an AWS certified partner,
they modernize infrastructure, design cloud native systems, and apply AI to create business value.
And their partnerships don't end at launch. As AI changes, Robots & Pencils stays by
your side so you keep pace. The difference is close partnership that builds value and compounds over
time. Plus, with delivery centers across the U.S., Canada, Europe, and Latin America, clients get local
expertise and global scale. For AI that delivers progress, not promises, visit robotsandpencils.com
slash AI Daily Brief. Chatbots are great, but they can only take you so far. I've recently
been testing Notion's new AI agents, and they are a very different type of experience. These are
agents that actually complete entire workflows for you in your style, and best of all, they work
in a channel that you already know and love because they are purpose-built Notion super users.
Notion's new AI agents completely expand the range of what Notion can do. It can now build documents
from your entire company's knowledge base, organize scattered information into organized reports,
basically do tasks that used to take days and get them complete in minutes. These agents don't
just help with work, they finish it. Getting started with building on Notion is easier than ever.
Notion agents are now your very own super user to help you onboard in minutes.
Your AI teammates are ready to work.
Try Notion AI for free at the link in our show notes.
Today's episode is brought to you by Superintelligent.
Now, one thing that we are having a lot of conversations with folks about is the fact
that for some of you, your fiscal year is coming to an end, and that means two things.
One, it means planning and thinking about what you're going to do in the next year,
and two, it means using up the last of those budgets so you don't lose them.
If you are an enterprise that happens to find yourself in that situation,
Superintelligent would love to help on both fronts.
We are moving increasingly towards an annual AI planning model
where we map out how you can create an action map of your organization's agent opportunities
that represents an executable backlog of AI and agent use cases
that you can deliver on over the course of the next year.
Additionally, for those end-of-year budgets,
we have worked out deals with a number of partners
where we can pre-lock in general implementation packages
even before you've figured out exactly what use cases are going to require them.
If you'd like to learn more about Superintelligent's agent readiness audits and this new end
of fiscal year plan, visit us at besuper.ai, click get started, and make sure to use the word
fiscal somewhere in the description.
Welcome back to the AI Daily Brief.
Today we are talking about a much-anticipated model release in the form of Claude Sonnet
4.5.
Now, on the one hand, people have been excited about Anthropic releasing their latest Claude 4.5
model in general, but really when push comes to shove, the coding implications of Sonnet 4.5
are what people have been most focused on. Today we're going to talk about the response to that
model, an interesting new user experience that came with it, and about how our sense of the
autonomy frontier might be fundamentally off as this thing apparently has coded for up to
30 hours completely autonomously. First up, though, let's talk about what was announced. No surprise,
Anthropic decided to focus the announcement on the coding implications. In fact, in their
opening tweet, they call it the best coding model in the world. Now, if you are a regular listener,
you'll know that all the way back since 3.5, Claude really has been, for most of that time,
the preferred set of models when it comes to coding use cases. The only exception to that,
really, has been in the last month or so, where GPT-5 and OpenAI's Codex have started to win
back market share from the Claude models, both because of the gains of GPT-5, but also because
of some issues during August with model performance on Anthropic's side. Sonnet 4.5 is very much
Anthropic's attempt to reclaim that crown. They write, it's the strongest model for building complex
agents, it's the best model at using computers, and it shows substantial gains on testing of reasoning and
math. Benchmarks, as you know, are one of my least favorite ways to understand a new model,
but the published benchmarks do show some big jumps, especially when it comes to these coding use cases.
For example, on SWE-bench Verified, they're up to 77.2% raw, as opposed to GPT-5 Codex's 74.5%,
and all the way up to 82% with what they call parallel test time compute.
On the Terminal-Bench benchmark for agentic terminal coding,
they claim 50% as opposed to GPT-5's 43.8%.
And basically all of the other benchmarks put them alongside
the Opus 4.1 and GPT-5 class models of the world.
The company did also announce a number of upgrades to Claude Code itself.
The first is the Claude Agent SDK,
which basically gives users access to the tools, context management systems,
and permissions frameworks that are embedded in Claude Code.
They've also got an updated terminal interface and a new VS Code extension, so people can work
with Claude Code in their IDE instead.
They also added this little checkpoints feature, which is getting punted aside based on all the other
news, but as they put it, lets you instantly undo Claude's latest changes, which seems like a super
valuable feature for any sort of agentic coding use case.
Still, the big show is, of course, this new model, and that's what everyone was focused on.
And as tends to happen with a new model, there is some variety in the first impressions.
While I didn't see anyone that had an outright bad experience with it, there were certainly some meh type of shoulder shrug experiences.
Jeremy Mack writes, early results for Sonnet 4.5, code quality not markedly different than 4.
CSS is improved, outputting markdown when not asked, same price and TPS as always.
Gosu Coder writes, first impression of 4.5, keep in mind this is after three hours of head down coding so still early.
One, I don't think I can see a difference versus 4.0.
In fact, if you told me this was actually 4.0, I'd believe you.
3.5 and 3.7 were noticeably different.
Two, still had to go back to GPT-5 for a few things that Sonnet couldn't figure out.
We have definitely hit a wall in coding progress.
Now, a lot of people responded that they hadn't had that same experience.
Ming, for example, said that he had found that it was better at following instructions
and better at parallel tool calling, and others just generally said that they were more impressed.
On the other end of the spectrum, you had a lot of posts like this one from Leo Cynthwave
who wrote, My verdict on 4.5 Sonnet, very good vibes, very fast. Although at the same time, he also
said, thinking, which is a particular mode of this model, often doesn't seem to yield a significant
improvement in output, and I still prefer Codex with GPT-5 Codex for agentic use. Tool use seemed to be
a thing that Anthropic was focused on. Kimmonismus called out this section of the announcement
post as related to tool usage. The model more effectively uses parallel tool calls, firing off
multiple speculative searches simultaneously during research and reading several files at once
to build context faster. Improved coordination across multiple tools and information sources
enables the model to effectively leverage a wide range of capabilities in agentic search and coding
workflows. Simon Willison did a deep dive, headlined by the statement, I think it may live up to
Anthropic's claims of being the best coding model in the world for the next few weeks at least.
And in his post, he definitely talked about this enhanced tool usage as one of the big upgrades.
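For readers who want a concrete picture of what "reading several files at once" means, here is a minimal sketch in Python. It is only an illustration of the general idea of issuing independent tool calls concurrently; the `read_file` function is a hypothetical stand-in that simulates latency, and nothing here reflects how Anthropic's models or tools are actually implemented.

```python
import asyncio


async def read_file(name: str) -> str:
    """Hypothetical tool call: pretend to fetch a file, with simulated latency."""
    await asyncio.sleep(0.1)
    return f"contents of {name}"


async def read_sequential(names: list[str]) -> list[str]:
    """One call at a time: total wait grows with the number of files."""
    return [await read_file(name) for name in names]


async def read_parallel(names: list[str]) -> list[str]:
    """Fire all calls at once; asyncio.gather returns results in input order."""
    return list(await asyncio.gather(*(read_file(name) for name in names)))


if __name__ == "__main__":
    files = ["app.py", "models.py", "routes.py"]
    print(asyncio.run(read_parallel(files)))
```

Both functions return the same results; the parallel version simply overlaps the waiting, which is the sense in which context gets built faster.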
Dan Shipper and the team at Every summed up by saying that it was faster than GPT-5 Codex
and smarter and more steerable than Opus 4.1.
And the big thing that they noted was the speed and the performance for the cost.
They said that the new Sonnet 4.5 felt about 50% faster than previous versions of Claude.
They also said that it was smarter than Opus and more than anything else, it was 5x cheaper.
Dan writes, it's still the same pricing as the old Sonnet 4, so there's basically no reason to use Opus in the API anymore, Sonnet all day.
Some other folks noticed benefits in areas other than coding.
For example, Bindu Reddy wrote, so far, definite improvement on coding, math, and data analysis
over Sonnet 4.
Ethan Mollick wrote, it's a really good model.
I saw especially big jumps in doing finance and statistics, which tend to get overlooked
in the focus on coding.
And in fact, if you go to Anthropic's announcement post, the focus on finance was one of
their big notes.
For example, in their published benchmark for financial analysis, Sonnet 4.5 got a 55.3%
compared to, for example, GPT-5's 46.9%.
Peter Wildeford wrote,
Everyone talking about 4.5 being great at coding,
but I'm taking way more notice of that huge increase in computer use score.
The jump he's noting is from 44.4% in Opus 4.1 to 61.4%
with Sonnet 4.5 on the OSWorld test.
Peter writes,
that's a huge increase over the state of the art,
and I don't think we've seen anything similarly good at OSWorld from others.
Claude agents coming soon.
Now, speaking of agents and just production use cases of these models, some of the big
agentic coding companies instantly started to put this model into production.
The Factory team, which focuses on agentic coding for enterprises, wrote, after testing with
Anthropic, we find the strengths of Sonnet 4.5 to be, significantly more reliable and accurate
file editing, high environmental awareness, snappier than previous models on quick questions,
not overthinking simple tasks. Walden from Cognition wrote,
When our team tried Sonnet 4.5, we realized it was worth building a whole new version of Devin
around it. This model behaves very differently. They actually published an entire blog post about
what they changed. They wrote, because Devin is an agent that plans, executes, and iterates,
rather than just auto-completing code, we get an unusual window into model capabilities. Each
improvement compounds across our feedback loops, giving us a perspective on what's genuinely changed.
With Sonnet 4.5, we're seeing the biggest leap since Sonnet 3.6. Planning performance is up 18%,
end-to-end eval scores up 12%, and multi-hour sessions are
dramatically faster and more reliable.
A couple of other notes that they shared.
They write, Sonnet 4.5 is the first model we've seen that is aware of its own context
window and this shapes how it behaves.
As it approaches context limits, we've observed it proactively summarizing its progress
and becoming more decisive about implementing fixes to close out tasks.
Interestingly, they said that this context anxiety, which is their term for it,
can actually hurt performance, where they've observed the model taking shortcuts or leaving
tasks incomplete because it believed it was near the end of its window even if it had
plenty of room left. More at Stefan from Cognition also noted that the model tracks all modified
features and doesn't stop until they work. He writes, one particularly impressive moment was when I
asked it to build a Datadog clone and it ran a log emission script in the background while using
Devin's browser to test the live event ingestion UI. Now with all that, so far I haven't seen
people who had switched over to GPT-5 Codex rushing to get back into the Anthropic sphere. Peter Gostev
writes, definitely better than Sonnet 4, but not obviously better than GPT-5
Thinking High and Codex models just now. Victor Taelin writes, I really like Claude 4.5 for coding.
It's fast, reliable, surgical, high quality in a good way. I think I will use it a lot,
especially for style refactors and things like that. But it is nowhere near as smart as GPT-5.
I wouldn't leave it alone making large changes on HVM. Yes, it sucks to wait 30 minutes for a
Codex refactor, but debugging AI-introduced errors takes way more time than that.
Peak intelligence is very important. GPT-5 is not nearly as smart as I need, and Sonnet is less
smart than that. Eric Provencher had a really interesting way of putting it. He writes,
I'm starting to see Anthropic models as light reasoning models while OpenAI models are deep reasoning
models. With only light reasoning, Sonnet 4.5 excels at efficient context usage to pinpoint information.
Codex tool calls are bulky and they're interspersed with reasoning tokens to test hypotheses.
It craves context to understand more of the problem. The gap between GPT-5 and Sonnet 4.5
becomes apparent when you have a hot context window where no new tool calls are needed. GPT-5 can think
for a few minutes on end to find a detailed complete solution, while Sonnet 4.5 is satisfied
with a few seconds for a serviceable one. Deep reasoning only works with sufficient context, but allows
the model to really evaluate problems so exhaustively that it appears almost superhuman. By contrast,
light reasoning stays closer to the surface, but serves as breathing room for models to collect
their thoughts. It is in many ways much more human. Anthropic is far and away ahead on light
reasoning. Which is super interesting. I think this is a much more useful diagnostic than a simple
better or worse. And once again, comes back to the idea that we live in a world where, at least for the
moment, the best strategy if you truly want optimal performance is going to be model switching
based on different contexts and needs. Now, there are two more things that I think are really
worth noting about this launch. The first is Imagine with Claude. In their announcement
post, Anthropic called this a bonus research preview. They write, in this experiment,
Claude generates software on the fly. No functionality is predetermined. No code is pre-written.
what you see is Claude creating in real time, responding and adapting to your request as you interact.
It's a fun demonstration of what Claude Sonnet 4.5 can do, a way to see what's possible when you combine a
capable model with the right infrastructure.
Sean Strong from Anthropic wrote a little bit more about Imagine. He said,
it pioneers the concept of model as backend, using a model to not only generate interfaces on
the fly, but also power all the functionality behind it. An example he gave was a
Choose Your Own Adventure version of his founder journey. He writes,
For the prompt, I asked Claude to generate an interactive Choose Your Own adventure game based
on my startup experience. It accurately retold our pivot from VR games to management,
even making an interactive management dashboard and app launcher to showcase key functionality.
It then had us go through our fundraise, massive growth, and ultimate shutdown due to COVID.
Peter Yang asked it to, quote, show me the desktop of a bad PM on the left versus a great
PM on the right. Swyx from Latent Space and now Cognition validated that 4.5 is
a very good coding model in general, but chose to focus on Imagine in his post about it as well.
He writes,
Most generative UI today is no more than glorified tool calling of pre-made components.
Imagine with Claude is the first mainstream adoption of the WebSim paradigm that went viral last year,
generating entire UIs on the fly that you can immediately use.
4.5 Sonnet enables vibe coding to be so fast and so good that you can conjure up ephemeral
apps to explore the latent space of what's possible, just in time as you explore it.
Now he caveats, it isn't perfect yet.
Buttons in dense UIs like simulated email clients often don't work or are slow enough that
the illusion is gone.
But it's a generation away from replacing the tyranny of designs made for the median person
and ushering in the age of truly personalized malleable software.
Josh Bickett picks up on that and writes,
Claude Imagine could become a new form factor for how we interact with AI.
It's completely different than chat.
It's like a generative computer that we talk to in a natural language.
I'd guess that vision is that everyone gets their own personal,
consistent generative computer instance with Claude Code generating the UI, processing data
and files under the hood.
I'd guess that what's happening is a front end is passing the prompt directly to a
Claude Code terminal agent, which writes back to the front end.
It looks like a beautiful feedback loop.
I'm going to put in some reps with Imagine this week before the preview goes away, and will
certainly share what I discover.
Now, the other big thing that people were really jumping on to was the immense time that
Sonnet 4.5 is apparently able to work autonomously for.
Hayden Field from The Verge wrote about this in her piece about the announcement.
She sums up, Anthropic's latest AI model spent 30 hours running by itself to code a chat app akin to Slack or Teams.
It spat out about 11,000 lines of code and it only stopped running when it had completed the task.
Now, some tried to figure out how this was possible.
Carlos Perez writes,
How is it possible that Sonnet 4.5 is able to work for 30 hours to build an app like Slack?
The system prompts have been leaked and Sonnet 4.5 reveals its secret sauce.
Some of the ways it accomplishes this.
It forces quote-unquote big code into durable artifacts.
Anything over about 20 lines is required to be emitted as an artifact
and only one artifact per response.
He writes,
that gives the model a persistent append-only surface
to build large apps module by module without truncation.
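To make the rule Perez describes a little more concrete, here is a minimal sketch in Python. It is an illustration only: the 20-line cutoff comes from his "about 20 lines" description, while the `ArtifactStore` class and `route_response` function are hypothetical names invented for this example, not anything taken from the leaked prompt or from Anthropic's actual system.

```python
from dataclasses import dataclass, field

# "About 20 lines" per the description above; the exact cutoff is an assumption.
ARTIFACT_LINE_THRESHOLD = 20


@dataclass
class ArtifactStore:
    """Hypothetical append-only surface: entries are added, never edited."""
    artifacts: list[str] = field(default_factory=list)

    def append(self, code: str) -> int:
        """Store a code module and return its index."""
        self.artifacts.append(code)
        return len(self.artifacts) - 1


def route_response(code: str, store: ArtifactStore) -> str:
    """Emit short code inline; move anything longer into a single artifact."""
    line_count = len(code.splitlines())
    if line_count > ARTIFACT_LINE_THRESHOLD:
        artifact_id = store.append(code)
        return f"[artifact {artifact_id}: {line_count} lines]"
    return code
```

Each call handles one response and stores at most one artifact, which mirrors the "one artifact per response" constraint. The reply itself stays short while the large module lives in the store, which is one plausible reading of how truncation is avoided.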
He also points out things like it enforcing runtime constraints,
governing tool loops, and supporting long-horizon autonomy
via planning and feedback loops.
And ultimately, whatever the combination of things,
if this is really true,
it is a total game changer when it
comes to the autonomy horizon that we've been working on. When Replit announced Agent 3,
they shared that it had reached autonomous agent runs of 200 minutes. And a few days later,
OpenAI announced their coding-optimized GPT-5 Codex model, where that company said, quote,
during testing we've seen GPT-5 Codex work independently for more than seven hours at a time
on large complex tasks, iterating on its implementation, fixing test failures, and ultimately
delivering a successful implementation. At the time, which was literally just two weeks ago,
people were saying that even that was insane. But now we've got this claim for 30 hours that just
obviously blows that out of the water. And one example that Anthropic gave to really sum up and
dramatize the progress that has been made in the AI coding space over just the last couple of years
is that they asked every previous version of Claude to make a clone of Claude.ai. It wasn't until
3.6 that you even had something that you could try to log into, and it wasn't until Sonnet 4 that
there was even a functional clone. Now it was able to build something that actually worked, working
autonomously for over five hours to do so. Nick Dobos takes a step back and points out,
it's honestly insane how fast these are improving. SWE-bench from 33% to 82% in just around a year.
Part of the reason that we spend so much time on the coding use cases on this show,
even though many in this audience, in fact most of this audience are not software engineers by training,
is not only that thanks to these new tools, all of us get to be software developers to some extent or another,
it's that coding is so clearly the frontier where we are seeing the biggest changes take place
when it comes to model capabilities.
Agentic coding improvement is not just a bellwether of where models are.
It's also the mechanism by which they get better at everything else as well.
I'm going to be keeping a close eye to see if anyone outside of the lab setting gets anywhere
close to that 30 hours of performance.
But if it's true, it really is a game changer.
Rohan Paul went back to a recent Axios interview with Dario Amodei,
where Dario said,
the vast majority of code that is used to support Claude and to design the next Claude is
now written by Claude. It's just the vast majority of it within Anthropic, and in other fast-moving
companies the same is true. Rohan adds, now it all makes sense. Claude Sonnet 4.5 can keep its
coding focus for nonstop 30 hours. The shift has started in all of tech. Now, things move fast in this
space. From the first reads, it's not even clear that Sonnet 4.5 is definitively the best coding model
compared to GPT-5 Codex, and even the people who think it is are still kind of waiting to see what
comes with Gemini 3. But it is yet another moment that shows the relentless pace of change in this
space, and I'm excited to see what new opportunities it unlocks. For now, that's going to do it for
today's AI Daily Brief. Appreciate you listening or watching, as always, and until next time,
peace.
