The AI Daily Brief: Artificial Intelligence News and Analysis - Agent Performance Is Accelerating...Fast

Starting point is 00:00:00 Today on the AI Daily Brief, agent and AI capabilities are accelerating at an accelerated rate. Before that in the headlines, the Oscars officially do not care if films use AI. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Well, in a real sign of how things are evolving, the Oscars officially don't care if filmmakers use AI. The Academy of Motion Picture Arts and Sciences recognized AI for the first time in rule changes published this week. Their new official stance is that AI won't impact the chance of a nomination.

Starting point is 00:00:45 They wrote, With regard to generative artificial intelligence and other digital tools used in the making of the film, the tools neither help nor harm the chances of achieving a nomination. The Academy and each branch will judge the achievement, taking into account the degree to which a human was at the heart of the creative authorship when choosing which movie to award. Now, for people outside of AI, the biggest rule change that grabbed headlines was the fact that Academy members are now required to watch all nominated films in each category to be eligible to vote for them.

Starting point is 00:01:12 Of course, the joke here is that people couldn't believe that that wasn't the case before, but there you have it. But anyways, back over in our corner of the world, the changes around AI are a big relief for filmmakers who are experimenting with AI techniques, as earlier this year, the Academy was reportedly weighing a disclosure requirement. This year's Oscars featured controversy around the films Emilia Pérez and The Brutalist. Both films worked with AI voice model company Respeecher to modify actors' speech. Emilia Pérez used the technology to expand the vocal range of the title character, and The Brutalist used AI to tweak the Hungarian accents of the lead actors to make them more authentic. Editor Dávid Jancsó said that he had first attempted to use traditional methods of automated dialogue replacement

Starting point is 00:01:50 to switch in the voices of native Hungarian actors, but, quote, that just didn't work, so we looked for other options of how to enhance it. The Brutalist was critically acclaimed in part for its authenticity and received 10 nominations. Adrien Brody won the Best Actor Award for his AI-enhanced performance in the film. Skating completely under the radar during the controversy was the fact that Dune: Part Two, which won for Best Visual Effects, made extensive use of AI in its VFX process. Now, honestly, the Academy's stance is fairly bold, given the Hollywood strikes and the high-profile consternation around AI in cinema. In a way, it's simply a recognition that AI is nothing more

Starting point is 00:02:22 than an extension and refinement of existing techniques. In some areas, there isn't really a bright line distinction between the latest generation of AI tools and the existing processes that were already industry standard. Cristóbal Valenzuela, the CEO of Runway, posted, The Oscars being okay with the use of AI for filmmaking is not only a step in the right direction, but also one that recognizes this technology as a tool that requires an artist to articulate a meaningful way of using it. Nicolas Neubert, the creative director of Runway, added, now there's nothing standing in the way of a movie using generative AI from winning an Oscar except for its creator's imagination. And basically that seems to be what the Academy is saying.

Starting point is 00:02:56 That they don't care what kinds of tools are used to achieve the result. They only care how well the filmmaker's creativity and craftsmanship are expressed in the final work of art. There's definitely more and more of this discussion happening right now. A content creator who has in the past worked with Superintelligent Mindo wrote a piece recently on LinkedIn called Why Hollywood Should Lead the AI Renaissance and Not Fight It. It's a good articulation of where at least some of this conversation is turning. Although, of course, I don't want to overstate how broad this is. The Oscars' stance is extremely controversial, and there are lots of people in Hollywood who are not happy about it. Moving on to a very different topic, right now testimony in the

Starting point is 00:03:31 Google antitrust lawsuit continues to make headlines, with the latest being OpenAI declaring that they'd love to acquire the Chrome browser. On Monday, DOJ lawyers asked the judge to force the divestment of Chrome, stating, we're at an inflection point. The court has an opportunity to remedy a monopoly that has controlled the internet for today's generation and restore competition for decades to come. In yesterday's testimony, Nick Turley, the head of ChatGPT, was asked if OpenAI would be interested in buying the browser if the court orders it to be spun off. He responded, yes, we would, as would many other parties. Turley stated that a native integration of ChatGPT into Chrome could fundamentally change the internet, adding, you could offer a really

Starting point is 00:04:06 incredible experience. We would have the ability to introduce users to what an AI-first experience looks like. Touching on the anti-competitive issues in the market, Turley discussed OpenAI's issues with distribution. He said the company has made some inroads with integrating ChatGPT into the iPhone, but has been unable to make any progress with Android-based manufacturers. Earlier in the trial, as we discussed yesterday, it was disclosed that Google had paid Samsung a, quote, enormous sum in January to integrate the Gemini assistant. Turley acknowledged that Google's deal wasn't exclusive, but said that OpenAI struggled to make headway in negotiations with Samsung due to Google's ability to outspend their efforts. He added, it was not for a lack of trying. We never got to a point

Starting point is 00:04:43 where we could discuss concrete terms. Touching on the competition for search, Turley said that OpenAI's goal of building a super assistant and achieving AGI won't succeed without search technology. But Google has so far declined to partner with them. He testified, search technology is a necessary component. You can't have a super assistant that doesn't know the current facts or makes things up. Turley didn't directly reference the company's partnership with Microsoft, referring to them only as provider number one.

Starting point is 00:05:07 However, he said that OpenAI had, quote, significant quality issues with the search information they provided. It became clear over time that it was not viable to depend on. It was at best, a near-term solution. One little piece of information was that it turns out that OpenAI has actually been working on their own search index since early 2024, with Turley stating that the company aims to use their own index 80% of the time by the end of the year. He acknowledged that the task was frankly overly ambitious and estimated it would take

Starting point is 00:05:32 several more years to achieve. Part of the issue has been websites limiting traffic to OpenAI's web crawler, which isn't mutually beneficial like Google's is. Turley said Google can outspend us or offer more traffic to these partners than we can. They have way more queries every single day. Now, this all may sound like, well, yeah, of course, you'd like Google to make it easy for you to compete with Google. But the testimony is significant, as one of the remedies suggested by the DOJ is forcing Google to share their search index with rivals in order to break their monopoly over the market segment. I'll reinforce the same

Starting point is 00:06:01 point I made yesterday, that whatever happens, it's going to have a deterministic impact on the shape of AI in the near future. Lastly, today, more restructuring of Apple's Siri team as new management is brought in to fix the beleaguered product. Last month, former Vision Pro lead Mike Rockwell was brought in to take over the project, and Bloomberg sources suggest that Rockwell is now cleaning house. They said that he's replaced much of Siri's leadership with lieutenants from his Vision Pro software group and is also restructuring teams related to speech, understanding, performance, and user experience. Bloomberg reported new leads for engineering, user experience, and underlying architecture, all joining from the Vision Pro project, with additional software engineering

Starting point is 00:06:37 talent being brought in from the core OS team, which handles iPhone software as well. Bloomberg writes, the moves show that Rockwell is either demoting or replacing the prior managers in charge of Siri engineering. Whether this all works remains to be seen, but at least it's not nothing. For now, that is going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy.

Starting point is 00:07:09 You need to embed AI into your overall business strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI. Today's episode is brought to you by Superintelligent, and I am very excited today to tell you about our consultant partner program. The new Superintelligent is a platform that helps enterprises figure out which agents to adopt, and then with our marketplace, go and find the partners

Starting point is 00:07:51 that can help them actually build, buy, customize, and deploy those agents. At the key of that experience is what we call our agent readiness audits. We deploy a set of voice agents which can interview people across your team to uncover where agents are going to be most effective in driving real business value. From there, we make a set of recommendations which can turn into RFPs on the marketplace or other sort of change management activities that help get you ready for the new agent powered economy. We are finding a ton of success right now with consultants bringing the agent readiness audits to their client as a way to help them move down the funnel towards agent deployments with the consultant playing the role of helping

Starting point is 00:08:26 their client hone in on the right opportunities based on what we've recommended and helping manage the partner selection process. Basically, the audits are dramatically reducing the time to discovery for our consulting partners, and that's something we're really excited to see. If you run a firm and have clients who might be a good fit for the agent readiness audit, reach out to agent@besuper.ai with Consultant in the title, and we'll get right back to you with more on the consultant partner program. Again, that's agent@besuper.ai and put the word consultant in the subject line. Today's episode is brought to you by Vanta.

Starting point is 00:08:59 Vanta is a trust management platform that helps businesses automate security and compliance, enabling them to demonstrate strong security practices and scale. In today's business landscape, businesses can't just claim security, they have to prove it. Achieving compliance with a framework like SOC 2, ISO 27001, HIPAA, GDPR, and more is how businesses can demonstrate strong security practices. And we see how much this matters every time we connect enterprises with agent services providers at Superintelligent. Many of these compliance frameworks are simply not negotiable for enterprises. The problem is that navigating security and compliance is time-consuming and complicated. It can take months of work and use up valuable time and resources.

Starting point is 00:09:38 Vanta makes it easy and faster by automating compliance across 35-plus frameworks. It gets you audit-ready in weeks instead of months and saves you up to 85% of associated costs. In fact, a recent IDC white paper found that Vanta customers achieve $535,000 per year in benefits, and the platform pays for itself in just three months. The proof is in the numbers. More than 10,000 global companies trust Vanta, including Atlassian, Quora, and more. For a limited time, listeners get $1,000 off at vanta.com slash NLW. That's VANTA.com slash NLW for $1,000 off.

Starting point is 00:10:13 A couple of months ago, we talked about research that showed that the performance of AI was roughly doubling every seven months. Now, specifically, this research was about the length of tasks that AI could do at a 50% success rate. So there's lots of framing here, lots of things to quibble with. But the basic idea is that a good way to understand agentic capabilities is to understand how complex the tasks they can do are. And a good way to proxy complexity is how long a task takes.

Starting point is 00:10:40 And the group that did this research, METR, found that this length was roughly doubling every seven months. You can see this chart here going all the way back to GPT-2, running up to Claude 3.7 Sonnet. This, of course, produced the sexy headline of a new Moore's Law for AI agents. And one of the most intriguing parts of the research was that it found that recently the pace had seemed to start to increase and that it was no longer on that seven-month trajectory, but the doubling was happening more like every four months. Well, now AI Digest has extended the research, adding o3 and o4-mini agents to the graph, and there is a very clear steepening of the curve. AI Digest writes, these new data points fit the 2024-2025 trend much better than the slower 2019 to 2025 trend. It really looks like the time horizons of coding agents are doubling around every four months.

Starting point is 00:11:29 So basically, according to AI Digest's testing, o4-mini can complete tasks that would take a human around one and a half hours, while o3 can successfully carry out work that would take 1.7 hours. Now, going back to that earlier study, the original researchers had noted that there had been an inflection point at some point. Whereas GPT-2, GPT-3, and GPT-3.5 were woeful as agents, barely capable of completing a task that would take a human a minute or two, somewhere around the release of GPT-4o and Claude 3.5 Sonnet it seemed like there was a real pickup in terms of how fast performance was doubling. It's this new curve that o3 and o4-mini fit much more accurately.

Starting point is 00:12:06 Now, we are working from a very small handful of data points. There are plenty of reasons to have questions around the exact methodology and just mental approach. But what the mapping of o4-mini and o3 does is that it suggests that this speeding up and the doubling moving from around seven months to between three and four months that we started to see at Claude 3.7 Sonnet and o1 were not outliers, and in fact, that is the new trend that's confirmed by o4-mini and o3. If the faster trend continues, agents might reach month-long tasks in 2027. However, looking at just one year's data gives a less robust estimate. The rate of progress might slow down. It might also speed up. Given that the trend has already

Starting point is 00:12:44 sped up, it could be on a growth trajectory that's faster than exponential. This fits intuitively. There might be a bigger gap in required skills between one- and two-week tasks than between one- and two-year tasks. Additionally, as AIs improve, they'll be increasingly useful for developing yet more capable AIs. This could lead to super-exponential growth in AIs' time horizons. Increasingly capable AI systems could trigger a flywheel of acceleration, agents speeding up the creation of more capable agents, which speeds up the creation of more capable agents. From here, agent capabilities might skyrocket beyond any human's ability in AI research, and across many or all other domains. The effects would be transforming.
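To make the doubling arithmetic above concrete, here is a minimal sketch of the extrapolation under a plain exponential trend. The 1.7-hour starting horizon and the four- and seven-month doubling times come from the episode; the function name `months_to_reach` and the definitions of a "work day" (8 hours) and a "work month" (40 hours a week, 52 weeks, 12 months) are assumptions made only for illustration, not figures from METR or AI Digest.

```python
import math


def months_to_reach(target_hours, current_hours, doubling_months):
    """Months until the time horizon grows from current_hours to target_hours,
    assuming it doubles every doubling_months (pure exponential growth)."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


# Figure quoted in the episode: o3's time horizon of about 1.7 hours.
CURRENT_HOURS = 1.7

# Assumptions for illustration only (not from the source):
WORK_DAY_HOURS = 8                 # one working day
WORK_MONTH_HOURS = 40 * 52 / 12    # about 173 working hours per month

targets = [("one work day", WORK_DAY_HOURS), ("one work month", WORK_MONTH_HOURS)]

for label, target in targets:
    for doubling in (4, 7):
        months = months_to_reach(target, CURRENT_HOURS, doubling)
        print(f"{label}: ~{months:.0f} months at a {doubling}-month doubling time")
```

Under these assumptions, the four-month trend gives roughly 9 months to day-long tasks and roughly 27 months to month-long tasks, which lines up with the 2026 and 2027 figures mentioned in the episode. The seven-month trend stretches the same milestones to roughly 16 and 47 months. This sketch does not model the super-exponential case, where the doubling time itself keeps shrinking.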

Starting point is 00:13:19 If automating AI research leads to progress this fast, the rapidly increasing time horizon of AI systems might end up being one of the most important trends in human history. Crypto Journey 23 pointed out just how hard it is for humans to understand exponentials, writing the human mind can't even comprehend what this looks like six months from now. 80,000 Hours founder Benjamin Todd was willing to actually put some predictions on the line, saying this faster trend is probably due to the new RL reasoning models and agents paradigm that started in 2024. My money would be on the faster trend continuing at least through next year, reaching agents that can do one-day or eight-hour software tasks in 2026. So again,

Starting point is 00:13:55 while this isn't hyper-scientific or anything, I do think it intuitively reflects what all of us who are sitting here experiencing AI and agents are feeling. But are there actually some data points that we can look at that suggest that yes, AI capabilities really are increasing as fast as they seem? Well, let's head over to o3. When the new model was first teased in December, one of the more notable things about it was an unprecedented result on the ARC-AGI test. One challenge, however, was that the benchmark that was released in December was carried out using a significant, maybe even staggering $3,000 in compute per task, making the entire benchmark run at least a million-dollar effort. The release version of o3, understandably, isn't running using those

Starting point is 00:14:35 inference settings, and so there had been some questions over the last several weeks around what the actual performance of o3 was going to be in practice. The ARC-AGI team consequently decided to run the test again using the publicly released model. Well, that test was completed earlier this week, and the team was pleasantly surprised. Mike Knoop, one of the co-founders of the ARC Prize, announced the results. His big takeaway was that o3 medium is the industry-leading AI reasoning system by a large margin, twice the score and 1/20th the cost compared to the next leading chain of thought system as measured by the ARC v1 semi-private set. Knoop continued, When we tested o3 preview in December 2024, I said your intuition about AI capability will

Starting point is 00:15:12 need to get updated. So my key question for released o3, is it more like o1, slightly better than pure LLM on novel tasks, or more like o3 preview, qualitative new capability to solve problems outside of training data. Our retest data suggests that o3 medium, in other words, the release version, has most of the qualitative new capability we saw from o3 preview at a dramatically lower cost. While o3 medium accuracy is strictly lower than o3 preview, OpenAI did an extremely good job optimizing accuracy and cost for o3. You cannot buy o3's level of AI reasoning capability anywhere else today. Knoop also suggested that the performance implies some serious architectural changes under the hood, adding, o3 medium is so good on ARC v1 for the cost, it's hard to explain

Starting point is 00:15:54 as a pure auto-regressive chain of thought system like o1. The data suggests something more is going on. While o3 is definitely not doing massive, slow, expensive parallel sampling like o3 preview, there is evidence o3's accuracy is more than just a function of model and thinking token count, i.e. time spent thinking. There is an additional X factor, although neither Mike nor we know what that X factor is. Now, this is a fairly big result. Many expected that the release version without additional inference wouldn't be all that big a jump from o1. And yet, the model is a step change in functionality, at least as measured by the ARC-AGI test. Now, people quickly started to dig into the specifics. Machine Learning Street Talk wrote, the experiments we've all been waiting for for o3 and o4

Starting point is 00:16:33 mini on ARC and ARC v2. They're way off the December '24 results, but still a great result. They scored practically zero on ARC v2, and interestingly, were more likely to be correct if they gave an answer in fewer tokens. Thinking longer does not equal a better answer. Reinforcing this, Smokeaway commented, o3 and o4-mini are indicators that thinking for longer doesn't always lead to the right answer. Sometimes the shortest path is all you need. Getting a little philosophical, Dan Mack added, if you've spent any amount of time introspecting on how your mind works, you know this is true. Flowers provided some meta commentary about recent releases from OpenAI, writing, OpenAI drops the original non-optimized big 4.5. Crowd says, too big, too expensive, greedy. OpenAI drops optimized,

Starting point is 00:17:13 cheaper o3 so more people can use it. Crowd says, not the original, gatekeeping, we want the 100x more expensive one. Developer Daniel Sadoff writes, what people don't understand is that you can achieve the performance of the December o3 by simply doing extensive sampling. For example, generating 64 outputs for a single question

Starting point is 00:17:30 and then selecting the best one using o3 itself. That's essentially how they obtained these incredible numbers in December. o3-pro will basically be just that. Which, I bet if you are a regular listener you will know, sounds an awful lot like the Dr. Strange theory. Now, one more discussion for this show. Another clear way the AI trends are accelerating is in the performance of open source models. We've seen a host of very high capability models over the past few months trained on limited budgets, with DeepSeek, of course, being most emblematic of this

Starting point is 00:17:55 phenomenon. Well, now a two-person team out of South Korea may have upped the ante with their new voice model. Yesterday, Nari Labs posted a tiny 1.6 billion parameter voice model called Dia. Co-founder Toby Kim wrote, two undergrads, one still in the military, zero funding. One ridiculous goal. Build a text-to-speech model that rivals NotebookLM Podcast, ElevenLabs Studio, and Sesame CSM. No, we were not AI experts from the beginning. It all started when we fell in love with NotebookLM's podcast feature when it was released last year. But we wanted more, more control over the voices, more freedom in the script. We tried every TTS API on the market. None of them sounded like real human conversation. Well, this is what they produced.

Starting point is 00:18:33 Dialog like this. You also get full control over scripts and voices. Wow, amazing. Try it now on GitHub or Hugging Face. up to ElevenLabs Studio and Sesame 1B. Well, listen and decide for yourself. Dia was built by a tiny team of two people with no funding. Whoa, really? Pretty crazy, huh? Progress in open source AI is completely crazy.

Starting point is 00:18:53 Yeah. Even this conversation was AI generated. What? The team used Google's TPU Research Cloud program to train their model for free, and the result was a model that can be run on consumer hardware. It can handle multiple voices, voice cloning, and nonverbal sounds like laughing, coughing, and sighing. Basically, the model seems to have all of the naturalistic features of Sesame's voice model, which was getting people so hyped up, but was developed using free resources by a pair

Starting point is 00:19:14 of amateur developers. VentureBeat was pretty impressed after testing the model, writing, Even with rhythmically complex content like rap lyrics, Dia generates fluid performance-style speech that maintains tempo. This contrasts with more monotone or disjointed outputs from ElevenLabs and Sesame's 1B model. Now, people are hyped about this. Menlo's Deedy Das wrote, we just solved text-to-speech AI. This model can simulate perfect emotion, screaming, and show genuine alarm. Clearly beats ElevenLabs and Sesame. It's only 1.6 billion parameters, streams real time on one GPU, and made by a one and a half person team in Korea. Ethan Mollick writes, another one of those little shocking AI moments. This sound clip was generated in 46 seconds on my home PC from the script below.

Starting point is 00:19:54 Just the text. Nari Labs Dia does some of the best expressive AI voice I've seen and it's open weights and created by two undergrads with no funding. Is that a dragon? Oh my God! What do we do? What do we do? Hold on. Let me check the manual. It's breathing fire. Everyone run! There is a banishing wand in the first aid kit. Grab it. I think I took that home to deal with my fruit fly problem.

Starting point is 00:20:12 Then we better run! The point of all of this, and bringing you back to where we started, is that whether you're trying to understand it through benchmarks or through new Moore's laws or patterns or just the new models that get released and change your ability to do things with AI, every single thing is pointing in the same direction and saying the same thing.

Starting point is 00:20:28 The capabilities of AI and agents are increasing and the speed at which they're increasing is also increasing. I'll leave you with that. Appreciate you listening or watching as always. And until next time, peace.
