The AI Daily Brief: Artificial Intelligence News and Analysis - How People Actually Use AI Agents

Starting point is 00:00:00 Today on the AI Daily Brief, a new study about agent autonomy in practice from Anthropic. And before that in the headlines, Google Gemini now allows you to create music. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, AssemblyAI, Robots and Pencils, and Blitzy. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. To learn about sponsoring the show, send us a note at sponsors at AIdailybrief. Lastly, a reminder once again about our latest ecosystem projects.

Starting point is 00:00:43 Claw Camp, the free self-directed program where you can learn how to build agents and agent teams using OpenClaw, is kicking off its first sprint right now. So if you want to learn to be an agent boss with nearly 3,000 friends, come join us. You can find that at campclaw.ai or from the AI Daily Brief website. If you are a company trying to figure out OpenClaw and other agent strategies, check out enterpriseclaw.ai. And more broadly, if you are just interested in keeping track of these types of educational programs we're doing, free and premium, you can find information about that at AIDBTraining.com. One thing happening there, we actually have a premium program survey as we are trying to figure

Starting point is 00:01:18 out exactly which premium programs to launch first. If you are an enterprise or premium buyer who is interested in that, again, you can check it out at AIDBTraining.com. Now with that barrage of URLs out of the way, let's talk about Gemini and music. Today we kick off with Google's continuing quest to have AI products in every single multimodal category. The latest news is that the company has launched an AI music generator called Lyria 3. It's the latest version of DeepMind's music generation model and allows users to generate music clips based on text, images or video inputs, which is pretty unique compared to something like Suno, which is, of course, just text-based input.

Starting point is 00:01:55 Lyrics can be generated in eight different languages, including German, French, Spanish, and Hindi. The feature can be accessed directly in the general Gemini app by switching to a musical output. It's also being added to YouTube's Dream Track tool to allow creators to quickly generate soundtracks for YouTube Shorts. Each track is accompanied by custom cover art generated by Nano Banana. Now, previous versions of Lyria have only been available through Google Cloud's Vertex program, so this is a big expansion in access. However, there is a pretty significant limitation, which is that these are 30-second clips. The model itself isn't really capable of building on top of the initial generation, so this feature won't be useful

Starting point is 00:02:28 to generate entire songs. However, and it's pretty clear that this is the use case they're imagining initially, this could be extremely useful for generating background music for YouTube Shorts or fun, interactive, personal types of song messages. Indeed, this appears to be what Google had in mind, with them writing, the goal of these tracks isn't to create a musical masterpiece, but rather to give you a fun, unique way to express yourself. And really, while it would be tempting to compare this to Suno, this is actually more of a social feature than anything else. We've talked in the past about how one of the really interesting things about Suno is the extent to which it is used not for any sort of professional or work music generation,

Starting point is 00:03:03 but just as a fun interactive mode, and Lyria really seems to be doubling down on that. Google has also embedded their SynthID audio watermarks into the music so the tracks are easily flagged as AI generated. A lot of the discourse around the first tries is that this is indeed not Suno, and that Suno's generations feel much more polished and musically complex. On the flip side, others point out how Google just keeps adding new arrows to its multimodal quiver.

Starting point is 00:03:27 Aaron Upright comments, people talking about OpenAI versus Anthropic and Gemini just over here quietly getting more powerful. People underestimate the importance of an easily accessible multimodal platform when it comes to adoption. Chai and Zhao sees the future, writing, video-to-audio alignment is the real flex here.

Starting point is 00:03:43 Generating lyrics and vocals that actually sync with visual cues in real time is a massive multimodal serving challenge. Lyria 3 probably relies on some crazy high-throughput infra to keep the latency low enough for creative workflows. Ultimately, I think we are just scratching the surface on what role generated music is going to play, and Google is now firmly in that game as well. Next up, a bit of a controversy that ended up being less of a controversy than it seemed,

Starting point is 00:04:06 but still taught us some interesting things around the state of competition. A change in Anthropic's terms of service triggered a tinderbox of complaints from those using Claude to power their OpenClaw agents. This week, Anthropic changed their policies, now stating, using OAuth tokens obtained through Claude Free, Pro, or Max accounts in any other product, tool, or service, including the Agent SDK, is not permitted. Now to clarify, OAuth tokens are kind of like API keys for regular Anthropic subscriptions, allowing users to access AI models through third-party apps. And of course, a lot of the attention is around the people who have been using their Claude Max accounts to power

Starting point is 00:04:38 their OpenClaws. Indeed, Alex Finn writes, this is going to piss off a lot of OpenClaw users paying $200 a month. Tweets like this one were too numerous to count. Hubert Lepicki writes, Anthropic is in an active self-destruction mode now. First, they went after tokens you already paid for, blocking use in non-Claude Code apps, then they send their lawyers after developers for supposed branding infringement, and now this. OpenCode, Gemini CLI, Codex CLI are all legitimate coding agents with comparable features and abilities, but Anthropic are behaving like they're still the only player on the block. Now, all of this caused Anthropic's Thariq Shihipar to comment, writing, apologies, this was a docs cleanup we rolled out that's caused some confusion. Nothing is

Starting point is 00:05:15 changing about how you can use the Agent SDK and Max subscriptions. He added that the intention isn't to block personal tinkering, but rather to force third-party businesses to pay for usage through the API. Unfortunately, the confusion only continued with that unclear clarification. Podcaster Felix Javan wrote, Brother, can you just tell us whether we can use OpenClaw or not? And it seems like if you're using it to build your own personal agents, the answer is yes. But the incident raised a ton of discussion about how long the big AI labs will continue

Starting point is 00:05:42 to support these modular AI use cases. Some tried switching providers only to find that Google had already banned OAuth for the use case. Richard Holcomb wrote, I feel like getting banned by Google for using Antigravity OAuth with OpenClaw is a rite of passage. I was already not impressed with Gemini, preferring Anthropic and OpenAI, but now I really have a bad taste in my mouth. Colin Darling, however, pointed out, everyone upset about Anthropic's update to their terms would be wise to read the OpenAI and Google Gemini terms while they're at it. I'm bummed out too, but Anthropic is late to this party, not leading it.

Starting point is 00:06:11 In any case, the controversy quickly faded, but there is a lingering question about walled gardens and what they're going to mean for AI going forward. Next up, more news in the AI wearables category. Meta has revived plans to release a smartwatch as part of their AI device lineup. Rumors of a Meta smartwatch started circulating in late 2021, complete with leaked photos of a prototype. The device was given the internal code name Malibu and featured two cameras, one in the dial for video conferencing and another on the underside of the watch. The idea was that users could quickly remove the watch to take a photo. Another big part of the design brief was the ability to read nerve signals in the wrist,

Starting point is 00:06:45 allowing the device to be used as a controller. This concept has since gone on to feature in Meta's haptic control wristbands for their Orion smart glasses prototype, which was unveiled in late 2024. That said, by the summer of 2022, Project Malibu was killed off and Meta shifted focus to smart glasses as their big wearable play. Now The Information reports that Meta has revived the smartwatch under the code name Malibu 2. The watch is said to include health tracking features and a built-in Meta AI assistant. Sources said the revival effort came out of a project strategy meeting late last year. Executives are reportedly concerned about a bloated product lineup for augmented reality

Starting point is 00:07:17 so have delayed some products to focus on a limited number of concentrated bets. Among them is a new version of the Ray-Ban Display glasses, which is expected later this year, as well as a pair of AR glasses which could arrive in 2027. The smartwatch is planned for release this year, putting Meta in direct competition with Apple and Google in the category. Now one thing to watch for will be how far each company goes in making the smartwatch an integral part of their wearable AI stack. Earlier this week we covered rumors that Apple was working on a trio of new AI-enabled devices,

Starting point is 00:07:45 namely smart glasses, a pendant, and a camera-equipped version of the AirPods. That report mentioned that a camera-equipped version of the Apple Watch had been passed over as an AI device, with testers reportedly finding the prototype impractical due to clothing sleeves obscuring the camera. Ultimately, we don't know how Meta is thinking about the Malibu 2, but they are very clearly focused on this wearable category as a place for their AI strategy. Next up, another follow-up in the Grok 4.20 public beta. xAI has announced a new version of Grok Heavy, and this one goes to 16. The big innovation with Grok 4.20 was the inclusion of four sub-agents to debate responses before providing a final answer.

Starting point is 00:08:20 Opinions were a little mixed on whether this was actually a useful feature, but it's an interesting experiment if nothing else. Grok Heavy turns the sub-agent count all the way up to 16 in a bid to either get better answers or at least burn through a ton of tokens getting an output. xAI community promoter Ted Suo shared an output from the query, how does chaos birth cosmic order? The agents debated the response for a little over a minute and then delivered a 700-word report using almost 900 references. It's difficult to judge accuracy or usefulness based on such a strange, subjective question, but the output certainly has a ton of detail and is an interesting read. If nothing else, these continue to be interesting experiments and worth watching for that reason alone. Lastly today, Chinese models. Lindy founder Flo Crivello recently shared a thread about the

Starting point is 00:09:03 difference between Chinese models on benchmarks and Chinese models in the real world. He wrote, By far our biggest cost at Lindy is inference, so believe me when I say we've looked at these models very closely and continue doing so. Them actually delivering on the claims would make a material difference to the business. But every time we've evaluated them, we've found the same thing, that their real-life performance for agentic behavior and outside of coding use cases falls extremely short of what they show on the evals. I think the industry consensus is right, he continues. These Chinese labs are one, distilling frontier models, duh, which leads to a more shallow intelligence. Two, training for evals. Three, potentially stealing weights. Not saying these

Starting point is 00:09:39 models will always be bad or that these labs are completely incompetent. They're doing a fine job, but it's delusional to think they're actually at Sonnet and Opus level. They're still at least one generation behind. Take the evals with a huge grain of salt. That, I think, is a lesson that is relevant not just for Chinese labs, but also whenever you see a new Western model that has high benchmarks. Ultimately, you've got to just dive in and test these things out for yourself. And with that, we will end today's headlines. Next up, the main episode. Sure, there's hype about AI, but KPMG is turning AI potential into business value. They've embedded AI and agents across their entire enterprise to boost efficiency,

Starting point is 00:10:19 improve quality, and create better experiences for clients and employees. KPMG has done it themselves. Now they can help you do the same. Discover how their journey can accelerate yours at www.kpmg.us slash agents. That's www.kpmg.us slash agents. If you're building anything with voice AI, you need to know about AssemblyAI. They've built the best speech-to-text and speech-understanding models in the industry, the quiet infrastructure behind products like Granola, Dovetail, Ashby, and Cluely. Now, as I've said before, voice is one of the most important modalities of AI. It's the most natural human interface, and I think it's a key part of

Starting point is 00:10:58 where the next wave of innovation is going to happen. AssemblyAI's models lead the field in accuracy and quality so you can actually trust the data your product is built on. And their speech understanding models help you go beyond transcription, uncovering insights, identifying speakers, and surfacing key moments automatically. It's developer first, no contracts, pay only for what you use, and scales effortlessly. Go to assemblyai.com slash brief, grab $50 in free credits, and start building your voice AI product today. Today's episode is brought to you by Robots and Pencils, a company that is growing fast. Their work as a high-growth AWS and Databricks partner means that they're looking for elite

Starting point is 00:11:36 talent ready to create real impact at velocity. Their teams are made up of AI-native engineers, strategists, and designers who love solving hard problems and pushing how AI shows up in real products. They move quickly using RoboWorks, their agentic acceleration platform, so teams can deliver meaningful outcomes in weeks, not months. They don't build big teams. They build high-impact, nimble ones. The people there are wicked smart, with patents, published research, and work that's helped shape entire categories. They work in Velocity Pods and studios that stay focused and move with intent. If you're ready for career-defining work with peers who challenge you and have your back, Robots and Pencils is the place. Explore open roles at robotsandpencils.com

Starting point is 00:12:16 slash careers. That's robotsandpencils.com slash careers. Want to accelerate enterprise software development velocity by 5x? You need Blitzy, the only autonomous software development platform built for enterprise codebases. Your engineers define the project, a new feature, refactor, or greenfield build. Blitzy agents first ingest and map your entire codebase, then the platform generates a bespoke agent action plan for your team to review and approve. Once approved, Blitzy gets to work autonomously generating hundreds of thousands of lines of validated, end-to-end tested code. More than 80% of the work completed in a single run. Blitzy is not generating code, it's developing software at the speed of compute. Your engineers review, refine,

Starting point is 00:12:53 and ship. This is how Fortune 500 companies are compressing multi-month projects into a single sprint, accelerating engineering velocity by 5x. Experience Blitzy firsthand at blitzy.com. That's BLITZY.com. Today we're discussing a new study from Anthropic that, while nominally about agent autonomy, is actually much more about how people are using AI agents in practice. Welcome back to the AI Daily Brief. Today we are looking at a new Anthropic study on agent autonomy. It's called Measuring AI Agent Autonomy in Practice.

Starting point is 00:13:29 And in many ways, it ends up actually being a case study in how agent behavior is changing. After reading it, I couldn't help but feel like it was a profile of a changing market where more and more of the tasks are moving outside of coding or engineering, and more and more of the agentic work is being done by people who are not themselves engineers. Now, to set this up, I think it's useful to have as a comparison the most frequently discussed study on agent autonomy. That is, of course, the METR study, the chart of which I'm sure you've seen before, that measures AI's ability to complete long tasks. The metric that they created is basically a measurement of the duration of a task

Starting point is 00:14:02 that AI can complete at a certain level of success. It is not, and this is something that people frequently get wrong, a direct measure of how long an AI agent can work for. Instead, it is a measure of the duration of tasks as it would take a human. So when, for example, GPT-5.2 High comes in at five hours, that's not that GPT-5.2 High took five hours to complete a task. It's how long that task would have taken a human. What's more, METR has two success metrics, 50% success and 80% success, neither of which would be sufficient performance for a real-world context. In other words, you're not going to keep an employee around who completes tasks at a 50% success rate. Still, I've always thought that this METR metric was really valuable. In my estimation, it doesn't matter so much

Starting point is 00:14:42 whether 50% or 80% success is the core number. It's that it's consistent and applied consistently over time to different models. So ultimately, what is this trying to get at? Well, it's trying to measure agent autonomy. And so why does autonomy matter? Autonomy matters as it shapes what agents can do. The more autonomous an agent is, the greater the capability it has to complete long-duration tasks with high success rates, the wider and more complex the array of use cases that it can be valuable for. That matters on an individual level in terms of what work you can outsource to an agent, on an org level in terms of which sets of tasks or which entire functions can be agentified, and on a societal level as it has big impact when it comes to the job disruption conversation.

Starting point is 00:15:22 METR is a very valuable and oft-cited metric. Indeed, last year during the height of the bubble times, people joked that this chart was carrying the entire industry on its back, as it was the one thing that suggested that there was no plateau on progress, which was maybe the chief piece of evidence that the bubbleists were looking for. And yet there are, of course, limitations of their methodology. As Anthropic puts it, the METR evaluation captures what a model is capable of in an idealized setting, with no human interaction and no real-world consequences. And that, of course, is not how people actually use agents in practice. To understand how people use agents in practice, one of the best places to look is Claude Code. For all intents and

Starting point is 00:16:01 purposes, I think one can argue that Claude Code is the first agent with product-market fit. In fact, many people have noted that Claude Code is better thought of not as a coding tool per se, but instead as a code-enabled general purpose agent. And that brings us to Anthropic's study, Measuring AI Agent Autonomy in Practice. Now, although Anthropic has access to pretty unique data in this regard, there are still some challenges. First of all, there's the question of what is the definition of an agent. Since this is a constant source of debate, Anthropic decided to go with a definition that is, as they put it, conceptually grounded and operationalizable. An agent is an AI system equipped with tools that allow it to take actions. As they point out, studying the tools

Starting point is 00:16:39 that agents use tells us a great deal about what they are doing in the wild. In terms of sources, they pulled from the public API as well as Claude Code. And going back to this idea of tools, for the public API data, they say, rather than attempting to infer our customers' agents' architectures, we instead perform our analysis at the level of individual tool calls. They write, this simplifying assumption allows us to make grounded, consistent observations about real-world agents even as the contexts in which those agents are deployed vary significantly. The limitation they note is that they have to analyze actions in isolation rather than understand how those individual actions combine into a larger whole.
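As a rough illustration of what analysis at the level of individual tool calls can look like, here is a minimal Python sketch. The record layout, tool names, and domain labels are hypothetical and are not Anthropic's actual schema or data; the point is only that each call is counted on its own, with no attempt to reconstruct the agent around it.

```python
from collections import Counter

# Hypothetical tool-call records; field names and values are illustrative only.
tool_calls = [
    {"tool": "run_tests", "domain": "software engineering"},
    {"tool": "edit_file", "domain": "software engineering"},
    {"tool": "update_invoice", "domain": "back-office automation"},
    {"tool": "draft_email", "domain": "marketing and copywriting"},
]


def domain_shares(calls):
    """Return each domain's share of tool calls, treating every call independently."""
    counts = Counter(call["domain"] for call in calls)
    total = sum(counts.values())
    return {domain: count / total for domain, count in counts.items()}


shares = domain_shares(tool_calls)
# Print domains from largest to smallest share.
for domain, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {share:.0%}")
```

Because every record is tallied in isolation, a sketch like this says nothing about how calls chain together into a workflow, which mirrors the limitation noted above.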

Starting point is 00:17:13 The second source of data is Claude Code. And what makes Claude Code super valuable for this study is that because it is their own product, they can understand an entire agent workflow from start to finish. The challenge, of course, is that it doesn't have the same diversity of use cases necessarily as their API traffic. Now, one last note on the methodology. When trying to figure out how long agents actually run for without human involvement, in Claude Code, they're using turn duration. Basically, how much time elapses between when Claude starts working and when it stops.

Starting point is 00:17:41 One note they make is that when it comes to the average, most Claude Code turns are very short. The median turn lasts around 45 seconds, and that's been fairly consistent over the past several months. Instead, then, they look at the signal at the very end of the long tail, basically the 99.9th percentile turn duration, with the argument being that these are the most advanced users, or at least the most advanced use cases, and in that way, are more likely to reveal what the end duration of the capability set really is. So looking at that 99.9th percentile turn duration, there are two really interesting phenomena over the past few months. In the period between October and January, basically from when Sonnet 4.5 launched through when Opus 4.5 launched in

Starting point is 00:18:21 November, average turn duration at that percentile jumped from 25 minutes to 45 minutes. Interestingly, they note that the increase is smooth across model releases, suggesting that autonomy is not purely a function of model capability. And indeed, I think that's one of the big themes of this research, is that when we try to understand agent autonomy, we have to think beyond just model, to the entire context in which a model operates, including the human interaction. The second really interesting period in this chart is the period over the last six weeks or so when there was actually a bit of a dip backwards from the peak of over 45 minutes down to something that's closer to 40.
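To make the median-versus-tail distinction concrete, here is a small Python sketch. The turn durations are synthetic, chosen only so the numbers resemble the ones discussed here, and the nearest-rank percentile method is an assumption; nothing in this episode says how Anthropic actually computed its percentiles.

```python
import math
from fractions import Fraction


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Use exact rational arithmetic so 99.9% of 1000 is exactly 999, not 999.0000000000001.
    rank = max(1, math.ceil(Fraction(str(pct)) * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic turn durations in seconds: mostly short turns plus a thin long tail.
turn_seconds = [30] * 400 + [45] * 500 + [120] * 97 + [1500, 2400, 2700]

median = nearest_rank_percentile(turn_seconds, 50)
tail = nearest_rank_percentile(turn_seconds, 99.9)
print(f"median turn: {median} s; 99.9th percentile: {tail / 60:.0f} min")
```

In this made-up data set only three turns out of a thousand run for more than two minutes, yet they alone determine the tail figure, which is why the median can sit still while the 99.9th percentile moves.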

Starting point is 00:18:57 They identify two theories for why what was a previously pretty smooth curve has now leveled out and in fact gone down a little bit. The first is a shift in what projects people were using Claude Code for. The argument is basically that over the holidays, people had sort of broader-range exploratory things that they were doing for their own gratification or hobbies, whereas when they came back they had, as they put it, more tightly circumscribed work tasks. The second piece, however, is that between January and mid-February, the Claude Code user base doubled, which is obviously a phenomenon that we've been tracking closely here. A doubling like that

Starting point is 00:19:27 is naturally going to bring with it a more diverse user base that's going to reshape the distribution a little bit. And indeed, maybe the most interesting thing about this study to me is not just the raw measure of capability, but the human interaction measures. A lot of this story is the difference between new users and power users. One of the interesting findings is that users at the beginning of their Claude Code journey use full auto-approval less than more experienced users. New users use full auto-approval roughly 20% of the time, which roughly doubles to 40% for more experienced users. Claude Code's default settings require users to manually approve each action, and so Anthropic suspects that what we're seeing

Starting point is 00:20:04 is a steady accumulation of trust. At the beginning, you approve things each time, and then as you dial in your settings and you start to learn to trust the model, you give it that auto-approval more frequently. At the same time, approving actions isn't the only way that people supervise Claude Code. Users can also interrupt Claude while it's working to reorient it or give it feedback, and that kind of follows the opposite pattern. Newer users interrupt Claude around 5% of the time, while more experienced users interrupt it around 9% of the time, almost double.

Starting point is 00:20:32 Now, one part of this might just be a shift in where people put the burden of oversight. If new users are approving each action before it's taken, maybe they don't need to interrupt Claude as much, whereas when those experienced users use auto approval more liberally, there's more of a context for them to step in. However, there also might be a sort of learned experience here as well. They write, the higher interrupt rate may also reflect active monitoring by users who have more honed instincts for when their intervention is needed, with the idea being that the new users simply don't know when to intervene as much. I think one comparison here is that if you

Starting point is 00:21:03 view AI as sort of a junior employee, it earns trust over time. That's the shift from the 20% to 40% auto approval rate. But as you get more comfortable with it, you also intervene more, checking in on the work as it's happening and reorienting to make sure you get the most out of things rather than just waiting to see the end product to judge its success in that way. Now, although these measures are about the human intervention, this is not a static number across models. In other words, model capability does impact this. Anthropic writes that from August to December of last year, as Claude's success rate on internal users' most challenging tasks doubled, the average number of human interventions per session decreased from 5.4 to 3.3. Basically, as the models get better,

Starting point is 00:21:42 users grant Claude more autonomy and achieve better outcomes while needing to intervene less. Now, when it comes to autonomy, we're talking about an interaction set in a conversation between the model and harness, Claude Code, and the humans using that model. Human intervention is only one of the directions in which autonomy can unfold in practice. Claude, as they write, is an active participant too, stopping to ask for clarification when it's unsure how to proceed. Anthropic found that as task complexity increased, Claude Code would ask for clarification more often, and more frequently than humans actually chose to interrupt it. For example, for turns where there was high goal complexity, humans interrupted Claude 7.1% of the time, while Claude asked for clarification more than double

Starting point is 00:22:20 that 16.4% of the time. That compares to minimal goal complexity, where humans interrupted 5.5% of the time, with Claude asking for clarification 6.6% of the time. In other words, the gap between how much humans intervene and how much Claude asks for clarification increases alongside the complexity of the task. However, these aren't exactly direct measures, as humans interrupt Claude and Claude interrupts itself for different reasons. The number one reason that humans interrupt Claude is to provide missing context or corrections, that's 32% of the time about a third. 17% of the time it was because Claude was slow or hanging, with every other reason being much less frequent. In terms of when Claude stops itself, the most common reason, at a little above a third, at 35%, is to present

Starting point is 00:23:01 the user with a choice between different approaches, which is interesting because that's not really a knock on its own autonomy, in the sense that it doesn't necessarily need that information to proceed, as it could theoretically just make the decision for itself, but a way to better align with humans upfront. Now, the one other really interesting chart is the chart of which domains agents are deployed in. As you might expect, especially given that this is anchored by Claude Code, software engineering represents around half of the tool calls overall. And although the other categories are all below 10%, they kind of read like a map of where agentic automation is likely to come next. Back office automation is at number two at 9.1%, followed by marketing and

Starting point is 00:23:38 copywriting at 4.4, sales and CRM at 4.3, finance and accounting at 4.0. It is notable that even at this early stage, with coding and engineering tasks being the clear breakout, you're still already seeing more than 50% of tool calls, in other words, more than 50% of agentic use cases being outside of that software engineering domain. This is a pretty simple study overall, but a really valuable complement in my estimation to the meter study as it moves away from the realm of the theoretical and into the realm of what people are actually using agents for and how they're actually interacting with them. There are a few interesting implications that people picked up on. David Hendrickson wrote, what's most surprising from the paper is that real-world AI agents are

Starting point is 00:24:16 currently given much less autonomy than they could technically handle. In other words, we had to go to the 99.9th percentile to really see what Claude could do, despite the fact that the average turn is just 45 seconds. We've talked a lot on the show about a capability overhang, and it looks like this is another example of that in practice, even with some of the most advanced tools in the space. Another interesting takeaway is about a shift in our thinking of autonomy, from purely based on model capability to this more complex view of model capability plus human interactive state. Yang Risu writes, autonomy is not just steps taken, it is permission scope and ability to change state. The other thing that people are exploring is based on all this, what they actually want the

Starting point is 00:24:56 interactive mode to look like in the future. Richie on X, for example, writes, Need a Claude Code mode that isn't exactly dangerously skip permissions, but can skip pointless do you want to proceed questions. And at the same time, doesn't nuke my entire database and family tree. Lorenzo responds, what you want is competent autonomy. Claude can skip pointless prompts while respecting blast radius boundaries, so devs stay sane and prod stays intact. Now, one thing to watch for is how much the emphasis in the next set of developments is more improved interactions, or a totally different paradigm of long-duration autonomy. In a recent podcast with Lenny, OpenAI's Sherwin Wu argued, as the AI Future Brief put it,

Starting point is 00:25:31 that the next leap in AI isn't just smarter models but long-duration autonomy. While today's tools are optimized for short bursts, tomorrow's tools will be agents you dispatch for six plus hours of independent work. Right now, as Anthropic shows, that certainly isn't how people are using these tools, but it does appear that things are evolving fast. Overall, a very valuable study and a great way to see what's happening in practice. For now, that is going to do it for today's AI Daily Brief. Appreciate your listening or watching.

Starting point is 00:25:55 And until next time, peace.
