The AI Daily Brief: Artificial Intelligence News and Analysis - How People Actually Use AI Agents
Episode Date: February 19, 2026
A new Anthropic study shows that AI agents are being used far more conservatively than their capabilities suggest, with short sessions, heavy human oversight, and growing use beyond coding into back office, marketing, sales, and finance. The data highlights that autonomy is shaped as much by trust and interaction design as raw model power. In the headlines: Gemini adds music generation, Anthropic clarifies its OAuth policy, Meta revives its AI smartwatch, Grok expands to 16 debating subagents, and more.
Want to build with OpenClaw?
LEARN MORE ABOUT CLAW CAMP: https://campclaw.ai/
Or for enterprises, check out: https://enterpriseclaw.ai/
Brought to you by:
KPMG - Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Mercury - modern banking for business and now personal accounts. Learn more at https://mercury.com/personal-banking
Rackspace Technology - Build, test and scale intelligent workloads faster with Rackspace AI Launchpad - http://rackspace.com/ailaunchpad
Blitzy - Want to accelerate enterprise software development velocity by 5x? https://blitzy.com/
Optimizely Agents in Action - Join the virtual event (with me!) free March 4 - https://www.optimizely.com/insights/agents-in-action/
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
LandfallIP - AI to Navigate the Patent Process - https://landfallip.com/
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? sponsors@aidailybrief.ai
Transcript
Today on the AI Daily Brief, a new study about agent autonomy and practice from Anthropic.
And before that in the headlines, Google Gemini now allows you to create music.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, KPMG, AssemblyAI, Robots & Pencils, and Blitzy.
To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on
Apple Podcasts. To learn about sponsoring the show, send us a note at sponsors@aidailybrief.ai.
Lastly, a reminder once again about our latest ecosystem projects.
Claw Camp, the free self-directed program where you can learn how to build agents and agent
teams using OpenClaw is kicking off its first sprint right now. So if you want to learn to
be an agent boss with nearly 3,000 friends, come join us. You can find that at campclaw.ai
or from the AI Daily Brief website. If you are a company trying to figure out OpenClaw
and other agent strategies, check out enterpriseclaw.ai.
And more broadly, if you are just interested in keeping track of these types of educational
programs we're doing free and premium, you can find information about that at AIDBTraining.com.
One thing happening there, we actually have a premium program survey as we are trying to figure
out exactly which premium programs to launch first.
If you are an enterprise or premium buyer who is interested in that, again, you can check
it out at AIDBTraining.com.
Now with that barrage of URLs out of the way, let's talk about Gemini and music.
Today we kick off with Google's continuing quest to have AI products in every single multimodal category.
The latest news is that the company has launched an AI music generator called Lyria 3.
It's the latest version of DeepMind's music generation model and allows users to generate music clips based on text, images or video inputs,
which is pretty unique compared to something like Suno, which is, of course, just text-based input.
Lyrics can be generated in eight different languages, including German, French, Spanish, and Hindi.
The feature can be accessed directly in the general
Gemini app by switching to a musical output. It's also being added to YouTube's
Dream Track tool to allow creators to quickly generate soundtracks for YouTube Shorts. Each track is accompanied
by custom cover art generated by Nano Banana. Now, previous versions of Lyria have only been
available through Google Cloud's Vertex program, so this is a big expansion in access. However,
there is a pretty significant limitation, which is that these are 30-second clips. The model itself
isn't really capable of building on top of the initial generation, so this feature won't be useful
to generate entire songs. However, and it's pretty clear that this is the use case they're imagining
initially, this could be extremely useful for generating background music for YouTube shorts or
fun, interactive, personal types of song messages. Indeed, this appears to be what Google had in mind,
with them writing, "The goal of these tracks isn't to create a musical masterpiece,
but rather to give you a fun, unique way to express yourself." And really, while it would be tempting
to compare this to Suno, this is actually more of a social feature than anything else.
We've talked in the past about how one of the really interesting things about Suno
is the extent to which it is used not for any sort of professional or work music generation,
but just as a fun interactive mode,
and Lyria really seems to be doubling down on that.
Google has also embedded their SynthID audio watermarks into the music
so the tracks are easily flagged as AI-generated.
A lot of the discourse around the first tries is that this is indeed not Suno,
and that Suno's generations feel much more polished and musically complex.
On the flip side, others point out how Google just keeps adding new arrows
in its multimodal quiver.
Aaron Upright comments,
"People talking about OpenAI versus Anthropic
and Gemini just over here quietly getting more powerful.
People underestimate the importance
of an easily accessible multimodal platform
when it comes to adoption."
Chai and Zhao sees the future, writing,
"Video to audio alignment is the real flex here.
Generating lyrics and vocals that actually sync with visual cues
in real time is a massive multimodal serving challenge.
Lyria 3 probably relies on some crazy high-throughput infra
to keep the latency low enough for creative workflows."
Ultimately, I think we are just
scratching the surface on what role generated music is going to play, and Google is now firmly
in that game as well.
Next up, a bit of a controversy that ended up being less of a controversy than it seemed,
but still taught us some interesting things around the state of competition.
A change in Anthropic's terms of service set off a wave of complaints from those
using Claude to power their OpenClaw agents.
This week, Anthropic changed their policies, now stating, "Using OAuth tokens obtained
through Claude Free, Pro, or Max accounts in any other product, tool, or service, including the
Agent SDK, is not permitted." Now to clarify, OAuth tokens are kind of like API keys for regular
Anthropic subscriptions, allowing users to access AI models through third-party apps. And of course,
a lot of the attention is around the people who have been using their Claude Max accounts to power
their OpenClaws. Indeed, Alex Finn writes, "This is going to piss off a lot of OpenClaw
users paying $200 a month." Tweets like this one were too numerous to count. Hubert Lepicki
writes, "Anthropic is in an active self-destruction mode now. First, they went after tokens you
already paid for, blocking use in non-Claude Code apps, then they send their lawyers after developers
for supposed branding infringement, and now this. OpenCode, Gemini CLI, Codex CLI are all legitimate
coding agents with comparable features and abilities, but Anthropic are behaving like they're still
the only player on the block." Now, all of this caused Anthropic's Thariq Shihipar to comment,
writing, "Apologies, this was a docs cleanup we rolled out that's caused some confusion. Nothing is
changing about how you can use the Agent SDK and Max subscriptions." He added that the intention isn't
to block personal tinkering, but rather to force third-party
businesses to pay for usage through the API.
Unfortunately, the confusion only continued with that unclear clarification.
Podcaster Felix Javan wrote,
"Brother, can you just tell us whether we can use OpenClaw or not?"
And it seems like if you're using it to build your own personal agents, the answer is yes.
But the incident raised a ton of discussion about how long the big AI labs will continue
to support these modular AI use cases.
Some tried switching providers only to find that Google had already banned OAuth for
the use case.
Richard Holcomb wrote,
"I feel like getting banned by Google for using Antigravity OAuth with OpenClaw is a rite of passage.
I was already not impressed with Gemini, preferring Anthropic and OpenAI, but now I really have a bad taste in my mouth."
Colin Darling, however, pointed out, "Everyone upset about Anthropic's update to their terms would be wise to read the OpenAI and Google Gemini terms while they're at it.
I'm bummed out too, but Anthropic is late to this party, not leading it."
In any case, the controversy quickly faded, but there is a lingering question about walled gardens and what they're going to mean for AI going forward.
Next up, more news in the AI wearables category.
Meta has revived plans to release a smartwatch as part of their AI device lineup.
Rumors of a Meta smartwatch started circulating in late 2021, complete with leaked photos of a prototype.
The device was given the internal code name Malibu and featured two cameras, one in the dial
for video conferencing and another on the underside of the watch.
The idea was that users could quickly remove the watch to take a photo.
Another big part of the design brief was the ability to read nerve signals in the wrist,
allowing the device to be used as a controller.
This concept has since gone on to feature in Meta's haptic control wristbands for their Orion smart
glasses prototype, which was unveiled in late 2024. That said, by the summer of 2022,
Project Malibu was killed off and Meta shifted focus to smart glasses as their big wearable play.
Now The Information reports that Meta has revived the smartwatch under the codename Malibu 2.
The watch is said to include health tracking features and a built-in Meta AI assistant.
Sources said the revival effort came out of a project strategy meeting late last year.
Executives are reportedly concerned about a bloated product lineup for augmented reality
so have delayed some products to focus on a limited number of concentrated bets.
Among them is a new version of the Ray-Ban Display, which is expected later this year,
as well as a pair of AR glasses which could arrive in 2027.
The smartwatch is planned for release this year, putting Meta in direct competition with Apple
and Google in the category.
Now one thing to watch for will be how far each company goes in making the smartwatch an integral
part of their wearable AI stack.
Earlier this week we covered rumors that Apple was working on a trio of new AI-enabled devices,
namely smart glasses, a pendant, and a camera-equipped version of the AirPods.
That report mentioned that a camera-equipped version of the Apple Watch had been passed over as an AI device,
with testers reportedly finding the prototype impractical due to clothing sleeves obscuring the camera.
Ultimately, we don't know how Meta is thinking about the Malibu 2,
but they are very clearly focused on this wearable category as a place for their AI strategy.
Next up, another follow-up in the Grok 4.20 public beta.
xAI has announced a new version of Grok Heavy, and this one goes to 16.
The big innovation with Grok 4.20 was the inclusion of four sub-agents to debate responses before providing a final answer.
Opinions were a little mixed on whether this was actually a useful feature, but it's an interesting experiment if nothing else.
Grok Heavy turns the sub-agent count all the way up to 16 in a bid to either get better answers or at least burn through a ton of tokens getting an output.
xAI community promoter Ted Suo shared an output from the query, "How does chaos birth cosmic order?"
The agents debated the response for a little over a minute and then delivered a 700-word report using almost 900
references. It's difficult to judge accuracy or usefulness based on such a strange, subjective
question, but the output certainly has a ton of detail and is an interesting read. If nothing else,
these continue to be interesting experiments and worth watching for that reason alone.
Lastly today, Chinese models. Lindy founder Flo Crivello recently shared a thread about the
difference between Chinese models on benchmarks and Chinese models in the real world. He wrote,
"By far our biggest cost at Lindy is inference, so believe me when I say we've looked at these models
very closely and continue doing so. Them actually delivering on the claims would make a material
difference to the business. But every time we've evaluated them, we've found the same thing,
that their real-life performance for agentic behavior and outside of coding use cases
falls extremely short of what they show on the evals. I think the industry consensus is right,"
he continues. "These Chinese labs are one, distilling frontier models, duh, which leads to a more
shallow intelligence. Two, training for evals. Three, potentially stealing weights. Not saying these
models will always be bad or that these labs are completely incompetent. They're doing a fine job,
but it's delusional to think they're actually at Sonnet and Opus level. They're still at least one
generation behind. Take the evals with a huge grain of salt." That, I think, is a lesson that is
relevant, not just for Chinese labs, but also whenever you see a new Western model as well that has high
benchmarks. Ultimately, you've got to just dive in and test these things out for yourself. And with that,
we will end today's headlines. Next up, the main episode.
Sure, there's hype about AI, but KPMG is turning AI potential into business value.
They've embedded AI and agents across their entire enterprise to boost efficiency,
improve quality, and create better experiences for clients and employees.
KPMG has done it themselves. Now they can help you do the same.
Discover how their journey can accelerate yours at www.kpmg.us slash agents.
That's www.kpmg.us slash agents.
If you're building anything with voice AI, you need to know about AssemblyAI. They've built
the best speech-to-text and speech-understanding models in the industry, the quiet infrastructure
behind products like Granola, Dovetail, Ashby, and Cluely. Now, as I've said before, voice is one of the
most important modalities of AI. It's the most natural human interface, and I think it's a key part of
where the next wave of innovation is going to happen. AssemblyAI's models lead the field in
accuracy and quality so you can actually trust the data your product is built on.
And their speech understanding models help you go beyond transcription,
uncovering insights, identifying speakers, and surfacing key moments automatically.
It's developer first, no contracts, pay only for what you use, and scales effortlessly.
Go to assemblyai.com slash brief, grab $50 in free credits, and start building your voice AI product today.
Today's episode is brought to you by Robots & Pencils, a company that is growing fast.
Their work as a high-growth AWS and Databricks partner means that they're looking for elite
talent ready to create real impact at velocity. Their teams are made up of AI-native engineers,
strategists, and designers who love solving hard problems and pushing how AI shows up in real products.
They move quickly using RoboWorks, their agentic acceleration platform, so teams can deliver
meaningful outcomes in weeks, not months. They don't build big teams. They build high-impact nimble ones.
The people there are wicked smart with patents, published research, and work
that's helped shape entire categories. They work in Velocity Pods and studios that stay focused and
move with intent. If you're ready for career-defining work with peers who challenge you and have
your back, Robots & Pencils is the place. Explore open roles at robotsandpencils.com
slash careers. That's robotsandpencils.com slash careers. Want to accelerate enterprise
software development velocity by 5x? You need Blitzy, the only autonomous software development
platform built for enterprise codebases. Your engineers define the project, a new feature, refactor,
Greenfield build. Blitzy agents first ingest and map your entire code base, then the platform
generates a bespoke agent action plan for your team to review and approve. Once approved,
Blitzy gets to work autonomously generating hundreds of thousands of lines of validated
end-to-end-tested code. More than 80% of the work completed in a single run. Blitzy is not
generating code, it's developing software at the speed of compute. Your engineers review, refine,
and ship. This is how Fortune 500 companies are compressing multi-month projects into a single sprint,
accelerating engineering velocity by 5x. Experience Blitzy firsthand at blitzy.com.
That's BLITZY.com.
Today we're discussing a new study from Anthropic that, while nominally about agent autonomy,
is actually much more about how people are using AI agents in practice.
Welcome back to the AI Daily Brief.
Today we are looking at a new Anthropic study on agent autonomy.
It's called Measuring AI Agent Autonomy in Practice.
And in many ways, it ends up actually being a case study in how agent behavior is changing.
After reading it, I couldn't help but feel
like it was a profile of a changing market where more and more of the tasks are moving outside
of coding or engineering, and more and more of the agentic work is being done by people who are not
themselves engineers. Now, to set this up, I think it's useful to have as a comparison the most
frequently discussed study on agent autonomy. That is, of course, the METR study,
the chart of which I'm sure you've seen before, that measures AI's ability to complete long
tasks. The metric that they created is basically a measurement of the duration of a task
that AI can complete at a certain level of success. It is not, and this is something that people
frequently get wrong, a direct measure of how long an AI agent can work for. Instead, it is a measure
of the duration of tasks as it would take a human. So when, for example, GPT-5.2 High comes in at 5
hours, that's not that GPT-5.2 High took five hours to complete a task. It's how long that
task would have taken a human. What's more, METR has two success metrics, 50% success and
80% success, neither of which would be sufficient performance for a real-world context. In other words,
you're not going to keep an employee around who completes tasks at a 50% success rate. Still, I've always
thought that this METR metric was really valuable. In my estimation, it doesn't matter so much
whether 50% or 80% success is the core number. It's that it's consistent and applied consistently
over time to different models. So ultimately, what is this trying to get at? Well, it's trying to measure
agent autonomy. And so why does autonomy matter? Autonomy matters as it shapes what agents
can do. The more autonomous an agent is, the greater the capability it has to complete long-duration
tasks with high success rates, the wider and more complex the array of use cases that it can be
valuable for. That matters on an individual level in terms of what work you can outsource to an agent,
on an org level in terms of which sets of tasks or which entire functions can be agentified,
and on a societal level as it has big impact when it comes to the job disruption conversation.
Yet despite METR being a very valuable and oft-cited metric (indeed, last year during the height
of the bubble times, people joked that this chart was keeping the entire industry on its back,
as it was the one thing that suggested that there was no plateau on progress, which was maybe
the chief piece of evidence that the bubbleists were looking for), there are, of course,
limitations of their methodology. As Anthropic puts it, the METR evaluation captures what
a model is capable of in an idealized setting, with no human interaction and no real-world consequences.
And that, of course, is not how people actually use agents in practice. To understand how
people use agents in practice, one of the best places to look is Claude Code. For all intents and
purposes, I think one can argue that Claude Code is the first agent with product market fit. In fact,
many people have noted that Claude Code is better thought of not as a coding tool per se,
but instead as a code-enabled general purpose agent. And that brings us to Anthropic's study,
Measuring AI Agent Autonomy in Practice. Now, although Anthropic has access to pretty unique data in
this regard, there are still some challenges. First of all, there's the question of what is the
definition of an agent. Since this is a constant source of debate, Anthropic decided to go with a
definition that is, as they put it, conceptually grounded and operationalizable. An agent is an AI
system equipped with tools that allow it to take actions. As they point out, studying the tools
that agents use tells us a great deal about what they are doing in the wild. In terms of sources,
they pulled from the public API as well as Claude Code. And going back to this idea of tools, for
the public API data, they say, rather than attempting to infer our customers' agents' architectures,
we instead perform our analysis at the level of individual tool calls.
They write, this simplifying assumption allows us to make grounded, consistent observations
about real-world agents even as the contexts in which those agents are deployed vary significantly.
The limitation they note is that they have to analyze actions in isolation
rather than understand how those individual actions combine into a larger whole.
The second source of data is Claude Code.
And what makes Claude Code super valuable for this study is that because it is their own product,
they can understand an entire agent workflow from start to finish.
The challenge, of course, is that it doesn't have the same diversity of use cases necessarily
as their API traffic.
Now, one last note on the methodology: when trying to figure out how long agents actually run
for without human involvement in Claude Code, they're using turn duration,
basically, how much time elapses between when Claude starts working and when it stops.
One note they make is that when it comes to the average, most Claude Code turns are very short.
The median turn lasts around 45 seconds, and that's been fairly consistent
over the past several months. Instead, then, they look at the signal at the very end of the long tail,
basically the 99.9th percentile turn duration, with the argument being that these are the most
advanced users, or at least the most advanced use cases, and in that way are more likely to reveal
what the end duration of the capability set really is. So looking at that 99.9th percentile turn
duration, there are two really interesting phenomena over the past few months. In the period between
October and January, basically from when Sonnet 4.5 launched through when Opus 4.5 launched in
November, turn duration at that percentile jumped from 25 minutes to 45 minutes.
Interestingly, they note that the increase is smooth across model releases, suggesting that
autonomy is not purely a function of model capability. And indeed, I think that's one of the
big themes of this research, is that when we try to understand agent autonomy, we have to think
beyond just model, to the entire context in which a model operates, including the human interaction.
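To make the median versus 99.9th percentile distinction concrete, here is a minimal Python sketch. The turn durations are made up for illustration (they are not Anthropic's data), and percentile_nearest_rank is a hypothetical helper using the simple nearest-rank definition, which may differ from whatever method the study used.

```python
import math
import statistics


def percentile_nearest_rank(values, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least pct% of the data at or below it.
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical turn durations in seconds (illustrative only):
# 9,980 short turns of 45 seconds and 20 long turns of 45 minutes.
turn_durations = [45.0] * 9980 + [2700.0] * 20

median_seconds = statistics.median(turn_durations)
tail_seconds = percentile_nearest_rank(turn_durations, 99.9)

print(f"median turn: {median_seconds:.0f} s")
print(f"99.9th percentile turn: {tail_seconds / 60:.0f} min")
```

In this toy data the median stays at 45 seconds no matter how long the rare turns run, which is why the tail percentile is the more informative signal for what the longest real sessions look like.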
The second really interesting period in this chart is the period over the last six weeks
or so when there was actually a bit of a dip backwards from the peak of over 45 minutes
down to something that's closer to 40.
They identify two theories for why what was a previously pretty smooth curve has now
leveled out and in fact gone down a little bit.
The first is a shift in what projects people were using Claude Code for.
The argument is basically that over the holidays, people had sort of broader range exploratory
things that they were doing for their own gratification or hobbies, whereas when they
came back they had, as they put it, more tightly circumscribed work tasks. The second piece,
however, is that between January and mid-February, the Claude Code user base doubled,
which is obviously a phenomenon that we've been tracking closely here. A doubling like that
is naturally going to bring with it a more diverse user base that's going to reshape the distribution
a little bit. And indeed, maybe the most interesting thing about this study to me is not just
the raw measure of capability, but the human interaction measures. A lot of this story is the
difference between new users and power users. One of the interesting findings
is that users at the beginning of their Claude Code journey use full auto-approval
less than more experienced users. New users use full auto-approval roughly 20% of the time,
which roughly doubles to 40% for more experienced users. Claude Code's default settings
require users to manually approve each action, and so Anthropic suspects that what we're seeing
is a steady accumulation of trust. At the beginning, you approve things each time,
and then as you dial in your settings and you start to learn to trust the model, you give it
that auto-approval more frequently. At the same time,
approving actions isn't the only way that people supervise Claude Code.
Users can also interrupt Claude while it's working to reorient it or give it feedback,
and that kind of follows the opposite pattern.
Newer users interrupt Claude around 5% of the time,
while more experienced users interrupt it around 9% of the time, almost double.
Now, one part of this might just be a shift in where people put the burden of oversight.
If new users are approving each action before it's taken,
maybe they don't need to interrupt Claude as much,
whereas when those experienced users use auto approval more liberally,
there's more of a context for them to step in. However, there also might be a sort of learned experience
here as well. They write, the higher interrupt rate may also reflect active monitoring by users
who have more honed instincts for when their intervention is needed, with the idea being that
the new users simply don't know when to intervene as much. I think one comparison here is that if you
view AI as sort of a junior employee, it earns trust over time. That's the shift from the 20% to 40%
auto approval rate. But as you get more comfortable with it, you also intervene more, checking in
on the work as it's happening and reorienting to make sure you get the most out of things rather
than just waiting to see the end product to judge its success in that way. Now, although these
measures are about the human intervention, this is not a static number across models. In other words,
model capability does impact this. Anthropic writes that from August to December of last year,
as Claude's success rate on internal users' most challenging tasks doubled, the average number
of human interventions per session decreased from 5.4 to 3.3. Basically, as the models get better,
users grant Claude more autonomy and achieve better outcomes while needing to intervene less.
Now, when it comes to autonomy, we're talking about an interaction set in a conversation between
the model and harness, Claude Code, and the humans using that model. Human intervention is only
one of the directions in which autonomy can unfold in practice. Claude, as they write, is an active
participant too, stopping to ask for clarification when it's unsure how to proceed. Anthropic found
that as task complexity increased, Claude Code would ask for clarification more often, and more frequently
than humans actually chose to interrupt it. For example, for turns where there was high goal complexity,
humans interrupted Claude 7.1% of the time, while Claude asked for clarification more than double
that, 16.4% of the time. That compares to minimal goal complexity, where humans interrupted 5.5% of the time,
with Claude asking for clarification 6.6% of the time. In other words, the gap between how much
humans intervene and how much Claude asks for clarification increases alongside the complexity of the task.
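As a quick check on that widening gap, here is a small Python sketch using only the four percentages quoted above. The dictionary layout and the gap_summary helper are illustrative assumptions, not anything taken from the study.

```python
# Rates quoted in the episode, as a percent of turns, by goal complexity.
rates = {
    "minimal": {"human_interrupts": 5.5, "claude_asks": 6.6},
    "high": {"human_interrupts": 7.1, "claude_asks": 16.4},
}


def gap_summary(row):
    """Return (percentage-point gap, ratio) of Claude's asks to human interrupts."""
    diff = round(row["claude_asks"] - row["human_interrupts"], 1)
    ratio = round(row["claude_asks"] / row["human_interrupts"], 2)
    return diff, ratio


summary = {level: gap_summary(row) for level, row in rates.items()}

for level, (diff, ratio) in summary.items():
    print(f"{level}: gap of {diff} points, Claude asks {ratio}x as often as humans interrupt")
```

On these numbers the gap grows from about 1.1 percentage points (a ratio of 1.2) at minimal complexity to about 9.3 points (a ratio of roughly 2.3) at high complexity.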
However, these aren't exactly direct measures, as humans interrupt Claude and Claude interrupts itself
for different reasons. The number one reason that humans interrupt Claude is to provide missing
context or corrections, that's 32% of the time, about a third. 17% of the time it was because
Claude was slow or hanging, with every other reason being much less frequent. In terms of when
Claude stops itself, the most common reason, at a little above a third, at 35%, is to present
the user with a choice between different approaches, which is interesting because that's not really a knock
on its own autonomy, in the sense that it doesn't necessarily need that information to proceed,
as it could theoretically just make the decision for itself, but a way to better align with humans
up front. Now, the one other really interesting chart is the chart of which domains
agents are deployed in. As you might expect, especially given that this is anchored by Claude
Code, software engineering represents around half of the tool calls overall. And although the other
categories are all below 10%, they kind of read like a map of where agentic automation is likely
to come next. Back office automation is at number two at 9.1%, followed by marketing and
copywriting at 4.4, sales and CRM at 4.3, finance and accounting at 4.0. It is notable that even at this
early stage, with coding and engineering tasks being the clear breakout, you're still already
seeing more than 50% of tool calls, in other words, more than 50% of agentic use cases being
outside of that software engineering domain. This is a pretty simple study overall, but a really
valuable complement in my estimation to the METR study as it moves away from the realm of the
theoretical and into the realm of what people are actually using agents for and how they're actually
interacting with them. There are a few interesting implications that people picked up on. David
Hendrickson wrote, "What's most surprising from the paper is that real-world AI agents are
currently given much less autonomy than they could technically handle." In other words, we had to go
to the 99.9th percentile to really see what Claude could do, despite the fact that the average
turn is just 45 seconds. We've talked a lot on the show about a capability overhang, and it looks
like this is another example of that in practice, even with some of the most advanced tools in the
space. Another interesting takeaway is about a shift in our thinking of autonomy, from purely based
on model capability to this more complex view of model capability plus human interactive state.
Yang Risu writes, "Autonomy is not just steps taken, it is permission scope and ability to change state."
The other thing that people are exploring is based on all this, what they actually want the
interactive mode to look like in the future. Richie on X, for example, writes,
"Need a Claude Code mode that isn't exactly dangerously skip permissions, but can skip
pointless 'do you want to proceed' questions. And at the same time, doesn't nuke my entire database
and family tree." Lorenzo responds, "What you want is competent autonomy. Claude can skip
pointless prompts while respecting blast radius boundaries, so devs stay sane and prod stays intact."
Now, one thing to watch for is how much the emphasis in the next set of developments is
on improved interactions, or on a totally different paradigm of long-duration autonomy.
In a recent podcast with Lenny, OpenAI's Sherwin Wu argued, as the AI Future Brief put it,
that the next leap in AI isn't just smarter models but long-duration autonomy.
While today's tools are optimized for short bursts, tomorrow's tools will be agents
you dispatch for six plus hours of independent work.
Right now, as Anthropic shows, that certainly isn't how people are using these tools,
but it does appear that things are evolving fast.
Overall, a very valuable study and a great way to see what's happening in practice.
For now, that is going to do it for today's AI Daily Brief.
Appreciate your listening or watching,
and until next time, peace.
