The AI Daily Brief: Artificial Intelligence News and Analysis - Agent Performance Is Accelerating...Fast
Episode Date: April 24, 2025
New research suggests that AI agents are improving at a rate far faster than expected, and their task complexity is doubling every four months. AI Digest confirms this by adding OpenAI’s O3 and O4 Mini to the curve, showing tasks that take humans 1.5–1.7 hours are now within reach.
Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The Automation Platform for AI Experts - https://useplumb.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Transcript
Today on the AI Daily Brief, Agent and AI capabilities are accelerating at an accelerated rate.
Before that in the headlines, the Oscars officially do not care if films use AI.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
To join the conversation, follow the Discord link in our show notes.
Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes.
Well, in a real sign of how things are evolving, the Oscars officially don't care if filmmakers use AI.
The Academy of Motion Picture Arts and Sciences recognized AI for the first time in rule changes published this week.
Their new official stance is that AI won't impact the chance of a nomination.
They wrote,
With regard to generative artificial intelligence and other digital tools used in the making of the film,
the tools neither help nor harm the chances of achieving a nomination.
The Academy and each branch will judge the achievement,
taking into account the degree to which a human was at the heart of the creative authorship
when choosing which movie to award.
Now, for people outside of AI, the biggest rule change that grabbed headlines was the fact
that Academy members are now required to watch all nominated films in each category to be eligible to vote for them.
Of course, the joke here is that people couldn't believe that that wasn't the case before, but there you have it.
But anyways, back over in our corner of the world, the changes around AI are a big relief for filmmakers who are experimenting with AI techniques,
as earlier this year, the Academy was reportedly weighing a disclosure requirement.
This year's Oscars featured controversy around the films Emilia Pérez and The Brutalist.
Both films worked with AI voice model company Respeecher to modify actors' speech.
Emilia Pérez used the technology to expand the vocal range of the title character,
and The Brutalist used AI to tweak the Hungarian accents of the lead actors to make them more authentic.
Editor Dávid Jancsó said that he had first attempted to use traditional methods of automated dialogue replacement
to switch in the voices of native Hungarian actors, but, quote,
that just didn't work, so we looked for other options of how to enhance it.
The Brutalist was critically acclaimed in part for its authenticity and received 10 nominations.
Adrien Brody won the Best Actor Award for his AI-enhanced performance in the film.
Skating completely under the radar during the controversy was the fact that Dune: Part Two,
which won for Best Visual Effects, made extensive use of AI in its VFX process.
Now, honestly, the Academy's stance is fairly bold, given the Hollywood strikes and the high-profile
consternation around AI in cinema. In a way, it's simply a recognition that AI is nothing more
than an extension and refinement of existing techniques. In some areas, there isn't really a bright
line distinction between the latest generation of AI tools and the existing processes that were already
industry standard. Cristóbal Valenzuela, the CEO of Runway, posted,
The Oscars being okay with the use of AI for filmmaking is not only a step in the right direction,
but also one that recognizes this technology as a tool that requires an artist to articulate
a meaningful way of using it. Nicolas Neubert, the creative director of Runway, added,
now there's nothing standing in the way of a movie using generative AI from winning an Oscar
except for its creator's imagination. And basically that seems to be what the Academy is saying:
that they don't care what kinds of tools are used to achieve the result. They only care
how well the filmmaker's creativity and craftsmanship are expressed in the final work of art.
There's definitely more and more of this discussion happening right now. A content creator who
has in the past worked with Superintelligent Mindo wrote a piece recently on LinkedIn called
Why Hollywood Should Lead the AI Renaissance and Not Fight It. It's a good articulation of where
at least some of this conversation is turning. Although, of course, I don't want to overstate how
broad this is. The Oscars' stance is extremely controversial, and there are lots of people in
Hollywood who are not happy about it. Moving on to a very different topic, right now testimony in the
Google antitrust lawsuit continues to make headlines, with the latest being OpenAI declaring that
they'd love to acquire the Chrome browser. On Monday, DOJ lawyers asked the judge to force the divestment
of Chrome, stating, we're at an inflection point. The court has an opportunity to remedy a monopoly that
has controlled the internet for today's generation and restore competition for decades to come.
In yesterday's testimony, Nick Turley, the head of ChatGPT, was asked if OpenAI would be
interested in buying the browser if the court orders it to be spun off. He responded, yes,
we would, as would many other parties. Turley stated that a native integration of ChatGPT
into Chrome could fundamentally change the internet, adding, you could offer a really
incredible experience. We would have the ability to introduce users to what an AI-first experience
looks like. Touching on the anti-competitive issues in the market, Turley discussed OpenAI's
issues with distribution. He said the company has made some inroads with integrating ChatGPT into the iPhone,
but has been unable to make any progress with Android-based manufacturers. Earlier in the trial,
as we discussed yesterday, it was disclosed that Google had paid Samsung a, quote, enormous sum in
January to integrate the Gemini assistant. Turley acknowledged that Google's deal wasn't exclusive,
but said that OpenAI struggled to make headway in negotiations with Samsung due to Google's
ability to outspend their efforts. He added, it was not for a lack of trying. We never got to a point
where we could discuss concrete terms.
Touching on the competition for search, Turley said that OpenAI's goal of building a super
assistant and achieving AGI won't succeed without search technology,
but Google has so far declined to partner with them.
He testified, search technology is a necessary component.
You can't have a super assistant that doesn't know the current facts or makes things up.
Turley didn't directly reference the company's partnership with Microsoft, referring to them
only as provider number one.
However, he said that OpenAI had, quote, significant quality issues with the search
information they provided.
It became clear over time that it was not viable to depend on.
It was, at best, a near-term solution.
One little piece of information was that it turns out that OpenAI has actually been working
on their own search index since early 2024, with Turley stating that the company aims to use
their own index 80% of the time by the end of the year.
He acknowledged that the task was frankly overly ambitious and estimated it would take
several more years to achieve.
Part of the issue has been websites limiting traffic to OpenAI's web crawler, which isn't
mutually beneficial like Google's is.
Turley said, Google can outspend us or offer more traffic
to these partners than we can. They have way more queries every single day. Now, this all may sound like,
well, yeah, of course, you'd like Google to make it easy for you to compete with Google. But the testimony
is significant, as one of the remedies suggested by the DOJ is forcing Google to share their search
index with rivals in order to break their monopoly over the market segment. I'll reinforce the same
point I made yesterday, that whatever happens, it's going to have a determinative impact on the shape of AI
in the near future. Lastly, today, more restructuring of Apple's Siri team as new management is
brought in to fix the beleaguered product. Last month, former Vision Pro lead Mike Rockwell was brought in
to take over the project, and Bloomberg sources suggest that Rockwell is now clearing house.
They said that he's replaced much of Siri's leadership with lieutenants from his Vision
Pro software group and is also restructuring teams related to speech, understanding, performance,
and user experience. Bloomberg reported new leads for engineering, user experience, and
underlying architecture, all joining from the Vision Pro project, with additional software engineering
talent being brought in from the core OS team, which handles iPhone software as well.
Bloomberg writes, the moves show that Rockwell is either demoting or replacing the prior managers
in charge of Siri engineering. Whether this all works remains to be seen, but at least it's
not nothing. For now, that is going to do it for today's AI Daily Brief Headlines edition.
Next up, the main episode. Today's episode is brought to you by KPMG. In today's
fiercely competitive market, unlocking AI's potential could help give you a competitive edge,
foster growth, and drive new value. But here's the key.
You don't need an AI strategy.
You need to embed AI into your overall business strategy to truly power it up.
KPMG can show you how to integrate AI and AI agents into your business strategy in a way that
truly works and is built on trusted AI principles and platforms.
Check out real stories from KPMG to hear how AI is driving success with its clients
at www.kpmg.com slash AI. Again, that's www.kpmg.com slash AI.
Today's episode is brought to you by Superintelligent, and I am very excited today to tell you about
our consultant partner program. The new Superintelligent is a platform that helps enterprises
figure out which agents to adopt, and then with our marketplace, go and find the partners
that can help them actually build, buy, customize, and deploy those agents.
At the key of that experience is what we call our agent readiness audits.
We deploy a set of voice agents which can interview people across your team to uncover
where agents are going to be most effective in driving real business value. From there, we make
a set of recommendations which can turn into RFPs on the marketplace or other sort of change
management activities that help get you ready for the new agent powered economy. We are finding a ton of
success right now with consultants bringing the agent readiness audits to their client as a way to
help them move down the funnel towards agent deployments with the consultant playing the role of helping
their client hone in on the right opportunities based on what we've recommended and helping manage
the partner selection process. Basically, the audits are dramatically reducing the time to discovery
for our consulting partners, and that's something we're really excited to see.
If you run a firm and have clients who might be a good fit for the agent readiness audit,
reach out to agent at besuper.ai with consultant in the title,
and we'll get right back to you with more on the consultant partner program.
Again, that's agent at besuper.ai and put the word consultant in the subject line.
Today's episode is brought to you by Vanta.
Vanta is a trust management platform that helps businesses automate security and compliance,
enabling them to demonstrate strong security practices and scale. In today's business landscape,
businesses can't just claim security, they have to prove it. Achieving compliance with a framework like
SOC 2, ISO 27001, HIPAA, GDPR, and more is how businesses can demonstrate strong security practices.
And we see how much this matters every time we connect enterprises with agent services providers
at Superintelligent. Many of these compliance frameworks are simply not negotiable for enterprises.
The problem is that navigating security and compliance is time-consuming and complicated.
It can take months of work and use up valuable time and resources.
Vanta makes it easy and faster by automating compliance across 35-plus frameworks.
It gets you audit-ready in weeks instead of months and saves you up to 85% of associated costs.
In fact, a recent IDC White Paper found that Vanta customers achieve $535,000 per year in benefits,
and the platform pays for itself in just three months.
The proof is in the numbers.
More than 10,000 global companies trust Vanta, including Atlassian, Quora, and more.
For a limited time, listeners get $1,000 off at vanta.com slash NLW.
That's VANTA.com slash NLW for $1,000 off.
A couple of months ago, we talked about research that showed that the performance of AI
was roughly doubling every seven months.
Now, specifically, this research was about the length of tasks that AI could do at a 50%
success rate.
So there's lots of framing here, lots of things to quibble with.
But the basic idea is that a good way to understand agentic capabilities is to understand
how complex the tasks they can do are.
And a good way to proxy complexity is how long a task takes.
And the group that did this research, METR, found that this length was roughly doubling
every seven months.
You can see this chart here going all the way back to GPT-2, running up to Claude 3.7 Sonnet.
This, of course, produced the sexy headline of a new Moore's Law for AI agents.
And one of the most intriguing parts of the research was that it found that recently the pace had seemed to start to increase and that it was no longer on that seven-month trajectory, but the doubling was happening more like every four months.
Well, now AI Digest has extended the research, adding O3 and O4 Mini agents to the graph, and there is a very clear steepening of the curve.
AI Digest writes, these new data points fit the 2024-2025 trend much better than the slower 2019 to 2025 trend.
It really looks like the time horizons of coding agents are doubling around every four months.
So basically, according to AI Digest's testing, O4 Mini can complete tasks that would take a human
around one and a half hours, while O3 can successfully carry out work that would take 1.7 hours.
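To make the gap between the two trends concrete, here is a minimal sketch that projects a time horizon forward under steady exponential doubling. The function name and the 12-month window are illustrative assumptions, not part of the METR or AI Digest analysis; only the 1.5-hour starting point and the seven- and four-month doubling times come from the discussion above.

```python
def projected_horizon(start_hours: float, months_ahead: float, doubling_months: float) -> float:
    """Time horizon after `months_ahead` months, assuming a constant doubling time."""
    return start_hours * 2 ** (months_ahead / doubling_months)


START_HOURS = 1.5   # the O4 Mini figure quoted above
MONTHS_AHEAD = 12   # ASSUMPTION: an illustrative one-year window

for doubling_months in (7, 4):
    hours = projected_horizon(START_HOURS, MONTHS_AHEAD, doubling_months)
    print(f"doubling every {doubling_months} months: {hours:.1f} hours after {MONTHS_AHEAD} months")
```

Under these assumptions, the four-month doubling gives three doublings in a year (an 8x increase), while the seven-month doubling gives fewer than two.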
Now, going back to that earlier study, the original researchers had noted that there had been
an inflection point at some point. Whereas GPT-2, GPT-3, and GPT-3.5 were woeful as agents,
barely capable of completing a task that would take a human a minute or two,
somewhere around the release of GPT-4o and Claude 3.5 Sonnet,
it seemed like there was a real pickup in terms of how fast performance was doubling.
It's this new curve that O3 and O4 Mini fit much more accurately.
Now, we are working from a very small handful of data points.
There are plenty of reasons to have questions around the exact methodology and fundamental approach.
But what the mapping of O4 Mini and O3 does is suggest that this speeding up,
with the doubling moving from around seven months to between three and four months, which we started
to see at Claude 3.7 Sonnet and O1, was not an outlier, and in fact, that is the new
trend that's confirmed by O4 Mini and O3. If the faster trend continues, agents might reach
month-long tasks in 2027. However, looking at just one year's data gives a less robust estimate.
The rate of progress might slow down. It might also speed up. Given that the trend has already
sped up, it could be on a growth trajectory that's faster than exponential. This fits intuitively. There might be
a bigger gap in required skills between one- and two-week tasks than between one- and two-year tasks.
Additionally, as AIs improve, they'll be increasingly useful for developing yet more
capable AIs. This could lead to super-exponential growth in AIs' time horizons.
Increasingly capable AI systems could trigger a flywheel of acceleration, agents speeding up
the creation of more capable agents, which speeds up the creation of more capable agents.
From here, agent capabilities might skyrocket beyond any human's ability in AI research,
and across many or all other domains. The effects would be transformative.
If automating AI research leads to progress this fast, the rapidly increasing time horizon of
AI systems might end up being one of the most important trends in human history.
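As a rough check on the month-long-task estimate, the sketch below solves for how many months of steady doubling it takes to get from the 1.7-hour O3 figure to a target task length. The 167-hour working month and the function name are assumptions made for illustration; the source does not say how a month-long task is defined, and the calculation ignores any speeding up or slowing down of the trend.

```python
import math


def months_to_reach(current_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months of steady doubling needed to grow from current_hours to target_hours."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


CURRENT_HOURS = 1.7      # the O3 figure quoted earlier
WORK_MONTH_HOURS = 167   # ASSUMPTION: a month-long task is about 167 working hours

for doubling_months in (4, 7):
    months = months_to_reach(CURRENT_HOURS, WORK_MONTH_HOURS, doubling_months)
    print(f"doubling every {doubling_months} months: about {months:.0f} months to a month-long task")
```

With these assumed inputs, the four-month doubling works out to a little over two years from the April 2025 data points, which lands in 2027, while the seven-month doubling takes close to four years.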
Crypto Journey 23 pointed out just how hard it is for humans to understand exponentials,
writing the human mind can't even comprehend what this looks like six months from now.
80,000 Hours founder Benjamin Todd was willing to actually put some predictions on the line,
saying, this faster trend is probably due to the new RL reasoning models and agents paradigm that started
in 2024. My money would be on the faster trend continuing at least through
next year, reaching agents that can do one-day or eight-hour software tasks in 2026. So again,
while this isn't hyper-scientific or anything, I do think it intuitively reflects what all of us who
are sitting here experiencing AI and agents are feeling. But are there actually some data points
that we can look at that suggest that yes, AI capabilities really are increasing as fast as they
seem? Well, let's head over to O3. When the new model was first teased in December,
one of the more notable things about it was an unprecedented result on the ARC-AGI test.
One challenge, however, was that the benchmark that was released in December was carried out using
a significant, maybe even staggering $3,000 in compute per task, making the entire benchmark run
at least a million-dollar effort. The release version of O3, understandably, isn't running using those
inference settings, and so there had been some questions over the last several weeks around what the
actual performance of O3 was going to be in practice. The ARC-AGI team consequently decided to run the test
again using the publicly released model. Well, that test was completed earlier this week, and the team
was pleasantly surprised. Mike Knoop, one of the co-founders of the ARC Prize, announced the results.
His big takeaway was that O3 Medium is the industry-leading AI reasoning system by a large margin,
twice the score and 1/20th the cost compared to the next leading chain of thought system as measured
by the ARC V1 semi-private set. Knoop continued,
When we tested O3 preview in December 2024, I said your intuition about AI capability will
need to get updated. So my key question for released O3, is it more like O1, slightly better than
pure LLM on novel tasks, or more like O3 Preview, qualitative new capability to solve problems
outside of training data. Our retest data suggests that O3 Medium, in other words, the release
version, has most of the qualitative new capability we saw from O3 preview at a dramatically
lower cost. While O3 medium accuracy is strictly lower than O3 preview, OpenAI did an extremely
good job optimizing accuracy and cost for O3. You cannot buy O3's level of AI reasoning capability
anywhere else today. Knoop also suggested that the performance implies some serious architectural
changes under the hood, adding, O3 Medium is so good on ARC V1 for the cost, it's hard to explain
as a pure autoregressive chain of thought system like O1. The data suggests something more is going
on. While O3 is definitely not doing massive, slow, expensive parallel sampling like O3 preview,
there is evidence O3's accuracy is more than just a function of model and thinking token count,
i.e. time spent thinking. There is an additional X factor, although neither Mike nor we know what that
X Factor is. Now, this is a fairly big result. Many expected that the release version without additional
inference wouldn't be all that big a jump from O1. And yet, the model is a step change in functionality,
at least as measured by the ARC-AGI test. Now, people quickly started to dig into the specifics.
Machine Learning Street Talk wrote, the experiments we've all been waiting for, for O3 and O4
Mini on ARC and ARC V2. They're way off the December '24 results, but still a great result.
They scored practically zero on ARC V2, and interestingly, were more likely to be correct if they gave an
answer in fewer tokens. Thinking longer does not equal a better answer. Reinforcing this, Smokeaway commented,
O3 and O4 Mini are indicators that thinking for longer doesn't always lead to the right answer. Sometimes
the shortest path is all you need. Getting a little philosophical, Dan Mack added, if you've spent
any amount of time introspecting on how your mind works, you know this is true. Flowers provided some
meta commentary about recent releases from OpenAI, writing, OpenAI drops the original non-optimized
big 4.5. Crowd says, too big, too expensive, greedy. OpenAI drops optimized,
cheaper O3 so more people can use it.
Crowd says, not the original, gatekeeping,
we want the 100x more expensive one.
Developer Daniel Sadoff writes,
what people don't understand is that you can achieve
the performance of the December O3
by simply doing extensive sampling.
For example, generating 64 outputs for a single question
and then selecting the best one using O3 itself.
That's essentially how they obtained these incredible numbers in December.
O3 Pro will basically be just that.
Which I bet if you are a regular listener you will know,
sounds an awful lot like the Dr. Strange theory.
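The sampling approach described in that quote is often called best-of-N selection. Here is a minimal sketch of the idea, with loud caveats: the `generate` and `score` callables are hypothetical placeholders, the toy versions below just use random numbers, and nothing here reflects how OpenAI actually produced its December results.

```python
import random
from typing import Callable, List


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 64,
) -> str:
    """Sample n candidate answers and return the one the scorer rates highest."""
    candidates: List[str] = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda answer: score(question, answer))


# Toy stand-ins (hypothetical): a real setup would call a model in both places.
_rng = random.Random(0)


def toy_generate(question: str) -> str:
    # Pretend each sample is a noisy guess at the answer.
    return str(_rng.randint(0, 20))


def toy_score(question: str, answer: str) -> float:
    # Pretend the judge prefers answers closer to 12.
    return -abs(int(answer) - 12)


print(best_of_n("What is 7 + 5?", toy_generate, toy_score, n=64))
```

The cost implication follows directly from the structure: every question pays for n generations plus n scoring passes, which is why this kind of setup is so much more expensive than a single call.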
Now, one more discussion for this show. Another clear way the AI trends are accelerating is in the
performance of open source models. We've seen a host of very high capability models over the past
few months trained on limited budgets, with DeepSeek, of course, being most emblematic of this
phenomenon. Well, now a two-person team out of South Korea may have upped the ante with their new
voice model. Yesterday, Nari Labs posted a tiny 1.6 billion parameter voice model called Dia.
Co-founder Toby Kim wrote, two undergrads, one still in the military, zero funding. One ridiculous
goal. Build a text-to-speech model that rivals NotebookLM Podcast, ElevenLabs Studio, and Sesame
CSM. No, we were not AI experts from the beginning. It all started when we fell in love
with NotebookLM's podcast feature when it was released last year. But we wanted more, more control
over the voices, more freedom in the script. We tried every TTS API on the market. None of them
sounded like real human conversation. Well, this is what they produced.
Dialog like this. You also get full control over scripts and voices.
Wow, amazing. Try it now on GitHub or Hugging Face.
up to ElevenLabs Studio and Sesame 1B.
Well, listen and decide for yourself.
Dia was built by a tiny team of two people with no funding.
Whoa, really?
Pretty crazy, huh?
Progress in open source AI is completely crazy.
Yeah.
Even this conversation was AI generated.
What?
The team used Google's TPU Research Cloud program to train their model for free,
and the result was a model that can be run on consumer hardware.
It can handle multiple voices, voice cloning, and nonverbal sounds like laughing, coughing, and sighing.
Basically, the model seems to have all of the naturalistic features of Sesame's
voice model, which was getting people so hyped up, but was developed using free resources by a pair
of amateur developers. VentureBeat was pretty impressed after testing the model, writing, even with rhythmically
complex content like rap lyrics, Dia generates fluid performance-style speech that maintains tempo.
This contrasts with more monotone or disjointed outputs from ElevenLabs and Sesame's 1B model.
Now, people are hyped about this. Menlo's Deedy Das wrote, we just solved text-to-speech AI.
This model can simulate perfect emotion, screaming, and show genuine alarm. Clearly beats ElevenLabs and Sesame.
It's only 1.6 billion parameters, streams real time on one GPU, and made by a one and a half
person team in Korea. Ethan Mollick writes, another one of those little shocking AI moments.
This sound clip was generated in 46 seconds on my home PC from the script below.
Just the text. Nari Labs Dia does some of the best expressive AI voice I've seen and it's
open weights and created by two undergrads with no funding.
Is that a dragon?
Oh my God! What do we do? What do we do?
Hold on. Let me check the manual.
It's breathing fire. Everyone run!
There is a banishing wand in the first aid kit. Grab it.
I think I took that home to deal with my fruit fly problem.
Then we better run!
The point of all of this and bringing you back to where we started
is that whether you're trying to understand it through benchmarks
or through new Moore's laws or patterns
or just the new models that get released
and change your ability to do things with AI,
every single thing is pointing in the same direction
and saying the same thing.
The capabilities of AI and agents are increasing
and the speed at which they're increasing is also increasing.
I'll leave you with that.
Appreciate you listening or watching as always.
And until next time, peace.
