The AI Daily Brief: Artificial Intelligence News and Analysis - AI Agent Capabilities Are Doubling Every Three Months

Episode Date: March 21, 2025

New research reveals AI agent capabilities are doubling every three months, faster than Moore's Law. A recent study measured agent performance on actual tasks and found an exponential growth trend.... Before that in the Headlines, Meta Llama hits a billion downloads.

SPECIAL OFFER: To get your ready-to-go agent from https://www.lindy.ai/ email nlw@besuper.ai with the word "LINDY" in the title.

Brought to you by:
KPMG - Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Vanta - Simplify compliance - https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, in a kind of Moore's Law for AI agents, AI capability is doubling roughly every seven months, and that pace seems to be increasing. Before that in the headlines, Meta's Llama models have apparently been downloaded a billion times. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Mark Zuckerberg says that Meta's Llama models
Starting point is 00:00:32 have hit a billion downloads. That is a big increase from last December, when the company claimed 650 million downloads. Now, for some sense of comparison, TikTok has been downloaded over 5 billion times, and Roblox has over a billion downloads for mobile users. Then again, those are both popular consumer apps rather than open source AI models.
Starting point is 00:00:51 Still, to some, something doesn't quite add up. Professor Ethan Mollick writes, sort of confused on how Llama could have been downloaded a billion times? You know you can just keep using the model and don't have to download it each time if you want to use it. Cocktail Peanut reposted that and said, I think I contributed to like one million of those. Accelerate Harder, meanwhile, writes,
Starting point is 00:01:10 this has got to be by counting the Hugging Face download numbers, which are notoriously insane, right? What other possible explanation could there be for a billion Llama downloads? One copy each for one out of eight humans? How many GPUs even exist on Earth? I think from my perspective we can just give Zuck the win here and move on. There is clearly a lot of interest and a lot of deserved interest in the Llama family of models,
Starting point is 00:01:29 and if we zoom that forward, there seem to be some exciting things coming. Meta is currently gearing up for their inaugural LlamaCon taking place at the end of April. It's been widely rumored that the event will feature the release of the Llama 4 model family, which will be natively multimodal and optimized to power agents. Google is giving Gemini an upgrade, making it more feature-complete with a Canvas interface and a version of NotebookLM's Audio Overviews. The new interface option is similar to the identically named ChatGPT Canvas tool and Anthropic's Artifacts. As a total aside, we're starting to see some feature naming consistency.
Starting point is 00:02:03 Google, OpenAI, and Perplexity all have a deep research feature, with Grok calling it deep search. And I kind of think that that's better for users than everyone trying to pretend they're somehow fundamentally different. So maybe these are all supposed to just be called Canvas. My point being that multiple people calling it Canvas might not be a lack of creativity. It might actually be a user-friendly move. Anyways, what this offers is a new interactive space for collaboration with Gemini, allowing users to go back and forth with the AI on revisions to writing and coding projects. And indeed, whatever they call it, this interface style is starting to become a default feature for AI chatbots.
Starting point is 00:02:37 If you have experienced the before and after, it makes a huge difference to eliminate much of the copying and pasting and switching windows and manual updating. The interface also allows native execution of code for quick testing. Now, porting over the Audio Overviews feature of NotebookLM is an interesting choice from Google and one that makes a lot of sense. The tool went viral last year as users experimented with the ability to generate a podcast on any topic. It felt like a natural fit within NotebookLM's research focus, but it also has a much wider set of possibilities, and it's likely that those other types of use cases might come more to the fore now that it's embedded inside Gemini. It also means that you can now use Gemini to generate a deep research report
Starting point is 00:03:15 and immediately spin up a podcast to digest it. Google is definitely surging to bring all of these experiences natively into Gemini in a big way. Lastly, today, an update from the white-hot coding assistant sector. AI startup Graphite has announced $52 million in Series B funding and a doubling down on their coding tool. Graphite was founded back in 2020, a million trillion years ago, and started life as a mobile development tool company. They pivoted to code review shortly afterwards and have since built out AI tooling, largely based on their solution to internal pain points. Co-founder Merrill Lutsky said, Graphite started as an internal tool we built to solve our own pain around code review. We shared what we built with a few ex-Meta engineers who quickly shared it more broadly,
Starting point is 00:03:53 and soon the demand for Graphite became too loud to ignore. So how does this differ from more general assistants like Cursor? Well, Graphite is basically a little bit more focused. It can make code suggestions based on developer comments, compile code summaries, and generate fixes for code failures. Their new tool called Diamond will be focused on automating bug hunting and will be offered as a standalone product. Graphite's platform also allows customers to define their own codebase-specific patterns
Starting point is 00:04:17 and filter sensitive information. Whatever they're doing, it seems to be working, as Lutsky said that revenue grew 20x in 2024. So no signs of slowing down in this particular sector, and we haven't even gotten into the big $1 million no-code hackathon that's coming down the pipeline now. However, that is going to have to wait for another episode. For now, that is going to do it for today's headlines. Let's shift over to what is some unbelievably interesting research about the speed at which
Starting point is 00:04:42 agents are getting better in the main episode. Today's episode is brought to you by Superintelligent and more specifically Super's Agent Readiness Audits. If you've been listening for a while, you have probably heard me talk about this, but basically the idea of the agent readiness audit is that this is a system that we've created to help you benchmark and map opportunities in your organizations where agents could specifically help you solve your problems, create new opportunities in a way that, again, is completely customized to you. When you do one of these audits, what you're going to do is a
Starting point is 00:05:15 voice-based agent interview where we work with some number of your leadership and employees to map what's going on inside the organization and to figure out where you are in your agent journey. That's going to produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses, key findings, and of course, a set of very specific recommendations that then we have the ability to help you go find the right partners to actually fulfill. So if you are looking for a way to jumpstart your agent strategy, send us an email at agent@besuper.ai, and let's get you plugged into the agentic era. Today's episode is brought to you by Vanta.
Starting point is 00:05:55 Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. Centralize security workflows, complete questionnaires
Starting point is 00:06:21 up to 5X faster, and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time. For a limited time, this audience gets $1,000 off Vanta at Vanta.com slash NLW. That's V-A-N-T-A dot com slash N-L-W for $1,000 off. Today we have a super interesting conversation.
Starting point is 00:07:03 We're talking about this research that just came out that has a lot of chatter that's basically arguing for a Moore's law for AI agents, basically a way to think about how fast the capabilities of agents are improving. And the people behind the research not only have some interesting results, but also just a very interesting framing for the entire problem. Now, of course, why this matters is that right now, we are in the midst of this agentic transformation, one which I believe will basically lead to a huge portion
Starting point is 00:07:36 of today's knowledge work tasks being done by agents eventually. And what everyone is trying to figure out, especially the companies that are out there trying to buy and pilot their first agents, is just how capable are they? What specific types of things can they do? And based on that, how to integrate them into today's existing workflows. But lurking behind all of that is this knowledge that they're improving at such a
Starting point is 00:07:58 fast rate that everything that we do today to design new systems around them may be nullified in just a few months when they are more capable. And so not only are enterprises and companies trying to adapt to the agent capabilities of right now, they're also trying to plan for a future, which is on the one hand unknowable and at the same time totally inevitable. So that's the setup and the context for this. But before we talk about Moore's Law for AI agents, let's talk about Moore's Law. I asked Grok to explain it in a fun, easy-to-understand way, and its response was unbelievably, unfathomably cringe. It tried to compare it to a video game where your character's strength keeps leveling up without you, quote, grinding for extra coins. It compared it to a magical candy
Starting point is 00:08:41 store where every 18 months, the shopkeeper doubles the amount of candy you can get for the same price. But basically what this actually refers to is that Intel co-founder Gordon Moore noticed way, way back in the 60s that the number of transistors on a computer chip was roughly doubling at a pretty consistent pace. Basically every couple of years, the capabilities were doubling while the price was staying the same. And so now anytime that there is a consistent or seemingly consistent pace of change in technology, we of course have to compare it to Moore's Law. Anyways, let's talk about the specific paper. It comes from METR, a nonprofit organization based in Berkeley, that published a paper called Measuring AI Ability to Complete
Starting point is 00:09:18 Long Tasks. They created a set of 170 real-world tasks, including coding, cybersecurity, general reasoning, and machine learning, and from there established a human baseline by determining how long it would take an expert programmer to complete each task. They called this the, quote, task completion time horizon, and the logic was essentially that the time taken to complete a task by a human expert is a good proxy for how difficult the task is. A selection of models were given control of a coding agent and put through their paces on the task list. The idea was to test where each model would fall below a 50% success rate. Researchers tested models dating back to OpenAI's GPT-2,
Starting point is 00:09:54 up to Anthropic's Claude 3.7 Sonnet, so very contemporary. Their results show a remarkably consistent pace of advancement, and this is where the comparison comes from. They write, we find a kind of Moore's Law for AI agents. The length of tasks that AI can do is doubling about every seven months. To put some numbers around it, GPT-2, which was released in 2019, could complete a task that would take an expert programmer
Starting point is 00:10:18 around two seconds, but started failing at anything more complicated. By the time you get up to GPT-4, released in 2023, AI could nail tasks that a human programmer would spend four minutes on. Zooming ahead, researchers found that Claude 3.7 Sonnet could complete tasks that take around an hour with 50% accuracy. Now, if you're watching this video, you'll note that this exponential curve is plotted as a straight line with a logarithmic scale: one second, four seconds, 15 seconds, one minute. But if you look on a linear scale, you can see just how much more dramatic and exponential the growth curve is. The researchers actually also tested OpenAI's o3-mini and DeepSeek R1, but found that they were less performant than Sonnet 3.7, and so decided to drop them from the data.
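As a rough sanity check on the seven-month figure, the endpoints quoted above can be plugged into a two-point doubling-time calculation. This is only a sketch under stated assumptions: the horizons (2 seconds, 4 minutes, about 1 hour) and the 2019 and 2023 release years come from the episode, while the 2025 year for Claude 3.7 Sonnet and the helper name `implied_doubling_months` are assumptions made for illustration. The paper fits a trend across many models, so a two-point estimate like this is much cruder.

```python
import math

def implied_doubling_months(start_horizon_s, end_horizon_s, years_elapsed):
    """Doubling time (in months) implied by two time-horizon measurements."""
    # Number of times the horizon doubled between the two measurements.
    doublings = math.log2(end_horizon_s / start_horizon_s)
    return 12 * years_elapsed / doublings

# Horizons quoted in the episode, converted to seconds.
GPT2_HORIZON_S = 2            # GPT-2, released 2019
GPT4_HORIZON_S = 4 * 60       # GPT-4, released 2023
CLAUDE_HORIZON_S = 60 * 60    # Claude 3.7 Sonnet; release year 2025 is an assumption

gpt2_to_gpt4 = implied_doubling_months(GPT2_HORIZON_S, GPT4_HORIZON_S, 2023 - 2019)
gpt2_to_claude = implied_doubling_months(GPT2_HORIZON_S, CLAUDE_HORIZON_S, 2025 - 2019)

print(f"GPT-2 to GPT-4: one doubling every {gpt2_to_gpt4:.1f} months")
print(f"GPT-2 to Claude 3.7 Sonnet: one doubling every {gpt2_to_claude:.1f} months")
```

Both pairs land close to seven months, which is why the points sit near a straight line on the logarithmic plot. Using whole calendar years rather than exact release dates adds some slack, so the result should be read as a ballpark figure only.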
Starting point is 00:10:59 To verify the trend, the researchers ran a similar test using questions from the standard coding benchmark, SWE-bench. They found consistent results dating back to the release of GPT-4, with a doubling in capability every 70 days. The uncertainty level associated with these tasks is pretty large, but the researchers commented, even if the absolute measurements are off by a factor of 10, the trend predicts that in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
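The factor-of-10 remark is easier to see with a little arithmetic: under exponential growth, a 10x measurement error only delays a milestone by a fixed number of doublings. The sketch below assumes a current horizon of one hour and a seven-month doubling time (both figures from the episode), and treats a 40-hour work week and a 167-hour work month as illustrative stand-ins for "days or weeks". The function name `months_until` is hypothetical.

```python
import math

def months_until(target_hours, current_hours=1.0, doubling_months=7.0):
    """Months for the time horizon to grow from current_hours to target_hours."""
    return doubling_months * math.log2(target_hours / current_hours)

WORK_WEEK_HOURS = 40     # assumption: one full-time week
WORK_MONTH_HOURS = 167   # assumption: roughly one full-time month

for label, target in [("week-long", WORK_WEEK_HOURS), ("month-long", WORK_MONTH_HOURS)]:
    # Case 1: the measured one-hour horizon is accurate.
    on_target = months_until(target)
    # Case 2: the true horizon is 10x shorter than measured (six minutes).
    pessimistic = months_until(target, current_hours=0.1)
    print(f"{label} tasks: {on_target / 12:.1f} years, "
          f"or {pessimistic / 12:.1f} years if off by 10x")

# The delay caused by a 10x error is the same for every target.
delay_months = months_until(10)
print(f"A 10x error costs about {delay_months:.0f} months")
```

Under these assumptions a 10x overestimate pushes each milestone back by about two years, and every case still falls well inside a decade. That only holds if the seven-month doubling continues, which is the open question.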
Starting point is 00:11:28 Separating out just the more recent models, the researchers also found that the pace of improvement has increased. For models created since last year, the doublings in capability are occurring every three months. In a post summarizing their conclusions, the researchers wrote, we are fairly confident of the rough trend of one to four doublings in horizon length per year. That's fast. Measures like these help make the notion of degrees of autonomy more concrete
Starting point is 00:11:51 and let us quantify when AI abilities may rise above specific useful or dangerous thresholds. So as I said, this generated a ton of chatter. It's been seen four million times and has about a thousand people who have reposted it or commented on it. For many, this was the concrete data they needed to start feeling the AGI. Researcher Amy Deng wrote, I didn't believe in exponential AI progress before working on this paper, but I believed in statistics, our methodology, and a straight line on a log scale graph. Now I live and breathe the fact that day-long work will be automatable by end of 2027, and AGI is coming. Professor Ethan Mollick quibbled
Starting point is 00:12:25 with the methodology, but acknowledged the result is very significant, posting, a new paper shows that AI agents are improving rapidly at long tasks, but they aren't reliable yet. That being said, this feels significant. More than 80% of successful runs cost less than 10% of what it would cost for a human level four software engineer to perform the same task. Ethan's specific gripe is that the threshold for success was only a 50% completion rate, which is not going to stand up to enterprise use cases. The researchers actually addressed this in the paper, choosing a 50% success rate because it was the most useful for filtering out small variations in the data.
Starting point is 00:12:59 Co-author Lawrence Chan commented, if you pick very low or very high thresholds, removing or adding a single successful or single failed task respectively changes your estimates a lot. In further testing, the researchers found that increasing the reliability threshold from 50 to 80% reduces the average time horizon by a factor of five, but the pace of doubling and the trend remained very similar. Point being that the paper, ultimately, isn't really trying to pinpoint how good agents are at the moment.
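One way to read the factor-of-five result: on a log scale, dividing every horizon by five shifts the line down without changing its slope, so the 80% curve simply trails the 50% curve by a fixed amount of time. The sketch below works that out, assuming the seven-month and three-month doubling times quoted in the episode. The function name `threshold_lag_months` and the one-hour starting point are illustrative assumptions.

```python
import math

def threshold_lag_months(horizon_ratio, doubling_months):
    """Months by which a stricter-threshold horizon trails the 50% horizon."""
    # A ratio of 5 corresponds to log2(5) doublings of lag.
    return doubling_months * math.log2(horizon_ratio)

HORIZON_RATIO = 5  # 50% horizon divided by 80% horizon, as quoted in the episode

lag_at_7 = threshold_lag_months(HORIZON_RATIO, doubling_months=7)
lag_at_3 = threshold_lag_months(HORIZON_RATIO, doubling_months=3)
print(f"Lag at a 7-month doubling time: {lag_at_7:.1f} months")
print(f"Lag at a 3-month doubling time: {lag_at_3:.1f} months")

# The slope is untouched: scaling both horizons by 1/5 leaves their ratio,
# and therefore the number of doublings between them, unchanged.
h50_now, h50_later = 1.0, 2.0  # hours; illustrative values one doubling apart
h80_now, h80_later = h50_now / HORIZON_RATIO, h50_later / HORIZON_RATIO
same_growth = math.isclose(h80_later / h80_now, h50_later / h50_now)
print(f"Same growth rate at both thresholds: {same_growth}")
```

On these assumptions, the 80% reliability horizon reaches any given task length a bit over a year after the 50% horizon does at the seven-month pace, and sooner at the three-month pace.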
Starting point is 00:13:25 Instead, it's trying to measure the trend of improvement, and that's immediately what stood out to me. I don't think the specific finding of the time that agents can work is all that useful. I think what's useful here, especially from a very practical standpoint for companies that are trying to figure out what their agent strategy is going to be, is that we're seeing a doubling of that capability at the longest every seven months, and now it seems more like every three months. That means that by the time you next report quarterly results, the capabilities of the agents that you are not yet working with will have doubled. Two quarters from now,
Starting point is 00:13:58 the agents that you haven't hired yet will be four times more capable, and so on and so forth, if this, of course, holds up. Now, what about the concern that traditional coding benchmarks are basically saturated and useless at measuring further improvement from the current state of the art? The researchers actually commented that they, quote, think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people's day-to-day work. The best current models, such as Claude 3.7 Sonnet, are capable of some tasks that may take even expert humans hours, but can only reliably
Starting point is 00:14:33 complete tasks of up to a few minutes long. Joshua Gans, a management professor at the University of Toronto, who has written about the economics of AI, questioned whether it's correct to assume this trend will hold. He commented, extrapolations are tempting to do, but there is still so much we don't know about how AI will actually be used for these to be meaningful. The researchers themselves questioned how long the trend is likely to hold. Moore's law held for a doubling of the number of transistors on a leading computer chip for over four decades from the 1970s. However, the trend slowed in the early 2010s as chip designers ran up against physical limitations having to do with atomic structure.
Starting point is 00:15:05 This was coupled with the chipmaking industry focusing on power efficiency over raw power. The researchers made a comparison to the constraints on AI, namely the limits to compute, writing, it's unclear whether there is sufficient capacity to expand either training or inference compute by many more orders of magnitude in the next five years. Basically, the point being that the researchers here are going to pains to simply present the data they've found, not over-extrapolate what it might mean or how long it might continue. They, like us, are unsure about how this is going to play out. Then again, they also point out that advances in multi-agent systems, improvements in agentic training,
Starting point is 00:15:39 and more efficient training algorithms could all help bolster the trend. And while the normal temptation when we get new research like this, as you can see in all the people that Nature asked to comment on the piece, is to try to poke holes in it and caution about why it might be overly optimistic, it is also worth, I think, at this point, zooming out and thinking on the other side, what if the trend holds? Scientist Robin Hanson wrote, So around eight years till they can do year-long projects? The implied point, of course, is that even if we only get a fraction of that,
Starting point is 00:16:08 that is a civilization-changing trend. Next up, the researchers are going to explore how pairing an AI agent with a human worker compares to a human worker alone, which should be really interesting as well. For now, though, if you take nothing away from this, if you disbelieve in the long-term trend, if you question the efficacy of agents right now, it still appears pretty clear that the capabilities of which you are skeptical are improving at an extraordinary rate. Humans are historically unbelievably
Starting point is 00:16:36 bad at thinking in terms of exponentials. It is just very hard for us to actually mentally get ourselves to a place where we can zoom out and understand that pace of change. We live and grow and learn in linear timelines. We are not wired for the exponential. And yet it appears that exponential is what we have here. Not for nothing. If you have not started to figure out your AI agent strategy yet, Well, friends, the best time was yesterday, but the second best time is today. For now, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always. And until next time, peace.
