The AI Daily Brief: Artificial Intelligence News and Analysis - GPT-5 is 58% AGI
Episode Date: October 21, 2025

A new paper from the Center for AI Safety proposes a measurable definition of artificial general intelligence, and by their framework, GPT-5 is already 58% of the way there. NLW breaks down how researchers quantified AGI across ten cognitive domains, why memory remains the biggest bottleneck, and what this means for investors, labs, and the timeline to true general intelligence. Plus: Claude Code comes to the web, Replit projects $1B in revenue, and OpenEvidence raises at a $6B valuation.

Brought to you by:
KPMG - Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? nlw@aidailybrief.ai
Transcript
Today on the AI Daily Brief, a new definition of AGI that suggests that GPT-5 is 58% of the way there.
Before that in the headlines, Claude Code comes to the web.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, Superintelligent, KPMG, and Robots & Pencils.
To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can sign up on Apple Podcasts.
One note about Apple Podcasts because this has been coming up a bit.
Apple Podcasts is set up on a very different system to Patreon.
With Patreon, I upload a distinct file.
I can schedule it the same way as I can with my normal episodes.
And so it's always set to come out at the same time as the main ad version.
With Apple Podcasts, it's a little bit different.
I have to wait for the episode to post to Apple Podcasts, meaning that there's a short delay after I publish in general.
And then I have to go in and manually replace the file.
What this means is that if, for any reason, I can't immediately replace that file, you will still see the normal ad version on your feed. It only gets replaced when I add that manual file. This is normally fine. It just means about 15 minutes of waiting around after I press publish on the normal episode, but that lag creates more possibilities for problems.
For example, yesterday Apple's Podcasts Connect system was down for about 12 hours, which meant that pretty much overnight, even subscribers on Apple still saw only the ad version. I promise you that I will always try to get the ad-free version up on Apple as fast as I can, but sometimes it's going to be out of my control. I apologize. I really wish it was a better system, but that is just the way that it is.
Lastly, of course, for any information about the show, sponsorship, speaking, or job opportunities, go to aidailybrief.ai. And with that all out of the way, let's dive in.
Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around
five minutes. First up today, Anthropic is making Claude Code available through a web app and within
the Claude iOS app. Previously, the feature was only available through terminals and IDEs,
and the big unlock is being able to spin up background agents.
With Claude Code running in the cloud,
you can now run multiple tasks in parallel
across different repositories from a single interface
and ship faster with automatic PR creation and clear change summaries.
This asynchronous workflow is quickly becoming a powerful tool
for AI-enhanced coders.
Claude Code product manager Cat Wu said,
as we look forward,
one of our key focuses is making sure the command line interface product
is the most intelligent and customizable way
for you to use coding agents.
But we're continuing to put Claude Code everywhere, helping it meet developers where they are.
Web and mobile is a big step in this direction.
Certainly there is a lot of excitement about this.
Josh J.DJ.J. Kelly on Twitter wrote,
I can work with Claude Code while out on a walk.
Speaking of agentic coding, Replit is projecting massive growth to reach a billion dollars in revenue by the end of next year.
Speaking with Business Insider, CEO Amjad Masad said that the AI coding startup has reached $240 million in ARR and expects that to quadruple next year. The company's growth this year has been absolutely skyrocketing, gaining more than 10x from their $16 million in ARR at the end of 2024. The company now has over 150,000 paying customers and over 40 million free users. And while at this stage all those free users mean that the consumer segment is unprofitable, Masad boasted that enterprise margins are close to 80%. This follows the same profit model that other AI companies are currently pursuing. The consumer segment is a loss leader due to large volumes of free users, but building familiarity with consumers means they want access to the same tools at work, which is where the company makes its profit. Masad said that the surging revenue was largely due to adoption in mid-sized companies, including Duolingo and Zillow. He said,
Replit is kind of replacing a lot of the no-code, low-code tools which never really worked very well. They get initial productivity boosts, but a lot of times that ended up actually slowing down a lot of companies. Whatever the case, they are seeing enough growth that they are pushing forward their expectations. This article came about after Business Insider saw a leaked investor memo that gave the billion-dollar projection for 2027.
Speaking of growth, after a spike in downloads this month, could Meta's AI app actually be gaining traction?
According to Similarweb data, Meta's standalone AI app now has over 300,000 downloads per day, up from around 100,000 in mid-September.
In addition, the app now has 2.7 million daily active users, up from 775,000 last month.
Similarweb said they hadn't seen any meaningful correlation with either search or advertising volume.
However, they noted that Meta could be promoting the platform on Facebook or Instagram, which aren't included in Similarweb's data. The other possible explanation is that Meta's new Vibes feed has been more of a success than people gave it credit for. The AI-generated image and video feed was released September 25th and decried by many as the introduction of infinite feeds of AI slop. However, the spike in downloads and daily active users both do line up with the introduction of that feed. OpenAI's launch of the Sora app a week later could also be boosting Meta's platform as an alternative. Sora still requires an invite code, while Meta's platform is freely available. Now, obviously these numbers in aggregate are still quite low relative to the billions of users that mainstream social apps have, but the growth is notable nonetheless.
Next up some fundraising news.
OpenEvidence, the AI assistant for doctors, has raised $200 million at a $6 billion valuation.
This is the second large fundraising round for the company this year.
They raised $210 million at a $3.5 billion valuation back in July.
And with the level of growth they've displayed recently, it's not hard to see why the valuation has almost doubled.
OpenEvidence now supports around 15 million clinical consultations a month, up from 8.5 million in July. The product is free to use for registered medical professionals and monetized through advertising rather than subscription. That unconventional approach for a professional tool has allowed OpenEvidence to expand into 10,000 medical centers. OpenEvidence only began commercializing their app three months ago and is already halfway to their target of $100 million in advertising revenue for next year.
The assistant is trained on leading medical journals like the New England Journal of Medicine
and is designed to help doctors quickly access the literature for diagnosis and treatment options.
The system is also designed to reject low-confidence outputs reducing hallucination risk.
Alongside medical journals, the model is also being fine-tuned on the 100 million clinical
consultations assisted by the tool.
Co-founder Daniel Nadler said that this is one of the company's largest moats, adding,
no one else in the world has that data.
Speaking to adoption among doctors,
Sangeen Zeb of Google Ventures, the lead investor in the round, said it's reaching verb-like status.
Now, this data type of moat, where companies in verticals have access to actual real-world data
based on the usage of their tool is one of the most interesting themes and questions.
So far in the history of LLMs, we've seen that the bitter lesson applies.
In other words, that mass access to data beats out specialized data when it comes to pre-training.
However, where a lot of people are looking for the future is the data that the foundation model labs don't have: the data exhaust that comes from real-world usage,
which could in and of itself be extremely valuable.
That's certainly the argument that OpenEvidence is making, and we'll have to see how it plays out.
Staying on fundraising, music generation startup Suno is said to be in talks to raise $100 million at a $2 billion valuation.
Sources speaking with Bloomberg said the deal would quadruple the company's valuation since their last raise.
That last round closed in May of last year and brought in $125 million, although the valuation was not disclosed at the time.
Importantly, the startup is now generating $100 million in ARR, according to sources familiar with the numbers.
And what's more, Suno may be able to settle their legal disputes very shortly.
In June of last year, Universal and Warner Music filed a lawsuit for copyright infringement against Suno and competitor Udio, but this June, Bloomberg reported that the labels are in talks to settle the litigation and establish a licensing framework for generated music. The labels are also rumored to be looking to take an equity stake in both of those companies. Reinforcing the idea of a truce between the music industry and AI startups,
last week, Spotify announced plans to work with the record labels on AI-powered features.
Universal Music Group CEO Lucian Grainge is boosting a pro-AI message internally.
Last week, he sent a memo to staff reemphasizing his interest in partnering on AI products
as long as they respect artist copyrights and likenesses.
Now, for anyone who has watched the history of the record labels all the way going back to Napster, this should be no surprise at all.
There is no industry, frankly, more adept at figuring out how to monetize the new thing.
Lastly, today, the latest company to make some big AI pronouncement is Starbucks.
Starbucks CEO Brian Niccol said that they're all in on AI.
Appearing on a Yahoo Finance podcast recorded at the Dreamforce conference last week,
Niccol discussed a wide range of AI deployments at the company.
A major scaled use case is an in-store knowledge assistant referred to as Green Dot.
It helps store leaders manage daily operations, including troubleshooting equipment and providing
drink recipes.
Niccol also said that Starbucks has pilots for inventory, supply chain forecasting, and scheduling,
although none of those use cases are at scale.
Speaking to ROI, he commented,
we're still in the early days of this, but I believe there is definitely opportunity
here to help us get things done faster and more efficiently.
To what scale, that is to be determined.
We're definitely already seeing a big impact in our technology area.
The ability to get code done so much faster is real.
One thing he did reject is the idea of robot baristas anytime soon, commenting, we're not near that right now.
Some folks tried to dig into the specifics about what that would mean, while others just let it be vibes.
Sophie, @netcapgirl, writes, okay, yeah, whatever, eff it, Starbucks, AI.
And that is going to do it for today's headlines.
Next up, the main episode.
Today's episode is brought to you by my company, Superintelligent.
Look guys, buying or building agents without a plan is how you end up in pilot purgatory.
Superintelligent is the agent planning platform that saves you from stalling out on AI.
We interview teams at scale, translate real work into prioritized agent opportunities,
and deliver recommendations that you can execute on, what to build, what success looks like,
how fast you'll get results, and even what platforms and tools you should consider.
All customized for you.
Instead of shopping for hype, you get to deploy with confidence.
Visit besuper.ai and book your AI planning demo today.
AI isn't a one-off project. It's a partnership that has to evolve as the technology does. Robots & Pencils works side by side with clients to bring practical AI into every phase: automation, personalization, decision support, and optimization. They prove what works through applied experimentation and build systems that amplify human potential. As an AWS-certified partner with global delivery centers, Robots & Pencils combines reach with high-touch service. Where others hand off, they stay engaged, because partnership isn't a project plan. It's a commitment. As AI advances, so will their solutions. That's long-term value. Progress starts with the right partner. Start with Robots & Pencils at robotsandpencils.com slash AI Daily Brief.
What if AI wasn't just a buzzword, but a business imperative? On You Can with AI, we take you inside
the boardrooms and strategy sessions of the world's most forward-thinking enterprises.
Hosted by me, Nathaniel Whittemore, and powered by KPMG,
this seven-part series delivers real-world insights from leaders who are scaling AI with purpose,
from aligning culture and leadership to building trust, data readiness, and deploying AI agents.
Whether you're a C-suite executive, strategist, or innovator, this podcast is your front-row seat to the future of enterprise AI.
So go check it out at www.kpmg.us slash AI podcasts, or search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts.
Welcome back to the AI Daily Brief.
One of the things that I have said frequently on this show, including being effectively
the entire theme of yesterday's show, is that when it comes to the practical, lived, applied
experience of AI inside a work setting, I don't think that AGI matters.
In fact, I think it is one of the more useless terms when it comes to how you think about
applying AI in your daily life or your company thinks about applying it at work.
So why do definitions of AGI matter then?
And the short answer is it's the exact same reason that we had that entire conversation in the show yesterday, which is that all of a sudden progress towards AGI is going to be considered a meaningful factor when it comes to how markets should treat AI stocks.
Given how much AI stocks are at the core of the entire economy right now, these otherwise
nebulous definitions start to take on a greater importance. Now, of course, for those who haven't
listened to yesterday's episode, AGI timelines are back in the news this week, specifically because
OpenAI co-founder Andrej Karpathy said that he believes the technology is still a decade away,
as opposed to estimates that have it more like a year or two away.
Now, one critical point that came out of that conversation is that Andrej actually has an
extremely high bar for how he defines AGI.
He said, when people talk about AI in the original AGI and how we spoke about it when
OpenAI started, AGI was a system that you could go to that could do any economically valuable
task at human performance or better. That was the definition. He noted that since then, the definition
has been watered down to just covering knowledge work, certainly nothing like physical work.
Now, knowledge work is certainly a huge part of the global economy, but at 10 to 20% of all the
work in the world, at least as per his estimates, that leaves a lot off the table. Now, this is
far from the only definition floating around. Way back in February of 2023, OpenAI laid out
their framework for thinking about the approach of AGI. They gave a very basic definition.
AI systems that are generally smarter than humans. Since then, Sam Altman has updated his thoughts.
He acknowledged in February of this year that AGI is a, quote, weakly defined term, but generally
speaking, we mean it to be a system that can tackle increasingly complex problems at human
level in many fields. You might also hear Altman talking about AGI in reference to the five levels
of AI framework. Now, this built off of something that Google DeepMind scientists had introduced in a
November 2023 paper, but then OpenAI expanded into these five stages. Level 1, chatbots,
which were AI with conversational language. Level 2, which were reasoners with human level problem
solving. Level 3 was agents with systems that can take actions. Level 4 were innovators,
AI that can aid in invention. Level 5, organizations or AI that can do the work of an organization.
As we discussed a lot at this show, we are somewhere in the 3 to 4 range right now.
Beyond that, there are a range of other definitions you might come across. Stalwart Gartner
defines AGI as, quote, the intelligence of a machine that can accomplish any intellectual
task that a human can perform. Google leans into a different aspect, describing AGI as
hypothetical intelligence of a machine that possesses the ability to understand or learn any
intellectual task that a human being can. Amazon has another distinct focus, describing AGI
as software that is, quote, able to perform tasks that it is not necessarily trained or developed for.
Now, if these are one-off definitions for blog posts, one of the more prominent attempts to define and
test AGI capabilities is, of course, the ARC AGI prize. On their website, they write,
the consensus definition of AGI, a system that can automate the majority of economically
valuable work, while a useful goal, is an incorrect measure of intelligence. Measuring task-specific
skills is not a good proxy for intelligence. Skill is heavily influenced by prior knowledge
and experience. Unlimited priors and unlimited training data allow developers to buy levels of
skill for a system. This masks a system's own generalization power. Intelligence lies
in broader general purpose abilities. It is marked by skill acquisition and generalization
rather than skill itself. So they propose a better definition: AGI is a system that can
efficiently acquire new skills outside of its training data. The ARC AGI test then seeks to test
two elements of AGI contained in the definition. The ability to acquire new skills by
ensuring the tests have internal logic that can be learned, and the ability to complete tasks
outside of training data by ensuring the tasks are not generally available. So these are all the
things that are floating around. And you can see while they broadly get us in the right category,
there are a lot of different definitions, which lead to a lot of debates and a lot of AGI is in the
eye of the beholder kind of conversations, which, as I said, I don't think really matters for our day-to-day,
but does matter when it comes to whether giant funds are going to press the sell button because
they think things are overbought, because we're not making enough progress towards AGI, which means
all these contracts aren't going to play out the way that they want to. So this is the context into which
a group of researchers working with the Center for AI Safety have attempted to nail down a common
definition and a metric for assessing models as they progress. The group has produced a paper called
A Definition of AGI, which you can find at agidefinition.ai. In the abstract, they write,
the lack of a concrete definition for artificial general intelligence obscures the gap between
today's specialized AI and human-level cognition. This paper introduces a quantifiable framework
to address this, defining AGI as matching the cognitive versatility and proficiency
of a well-educated adult.
This group, then, has grounded their analysis
in Cattell-Horn-Carroll theory,
one of the more well-accepted models of human cognition.
Applying the theory, the researchers split AI performance
into 10 distinct categories.
Reading and writing, math, reasoning,
working memory, memory storage, memory retrieval,
visual, auditory, speed, and knowledge.
Now, you'll note that these categories
cover some of the general performance categories,
things like reading and writing or math,
but it also addresses the model's ability to learn
and apply its intelligence to topics outside of its training data.
Each of these categories has multiple subcategories that can be assessed individually.
In fact, assessment was one of the main focuses of this paper.
Researchers wrote,
Applications of this framework reveal a highly jagged cognitive profile in contemporary models.
While proficient in knowledge-intensive domains,
current AI systems have critical deficits in foundational cognitive machinery,
particularly long-term memory storage.
Each category was equally weighted and given a score out of 10,
and researchers measured GPT-4 and GPT-5 to demonstrate the framework.
GPT-4 scored 27% while GPT-5 achieved 58%.
You can see from the two sets of results mapped out on a chart
that while GPT-5 only made minor progress in knowledge,
it made significantly more progress in reading and writing as well as math.
What's more, GPT-5 scored in multiple categories
where GPT-4 was entirely deficient.
This included reasoning, working memory, memory retrieval, visual, and auditory.
And while those areas of intelligence are developing in the latest models, they're still very
nascent compared to, for example, math.
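To make the scoring arithmetic concrete, here is a minimal Python sketch. It assumes, as described in the episode, that each of the ten domains is scored from 0 to 10 with equal weight, so the overall figure is simply the sum expressed as a percentage. The domain labels, the function names, and the per-domain numbers for the hypothetical model are all invented for illustration and are not taken from the paper; only the 27% and 58% totals come from the episode.

```python
# Short labels for the ten equally weighted domains (labels are illustrative).
DOMAINS = (
    "knowledge",
    "reading_and_writing",
    "math",
    "reasoning",
    "working_memory",
    "memory_storage",
    "memory_retrieval",
    "visual",
    "auditory",
    "speed",
)
MAX_PER_DOMAIN = 10  # each domain is scored from 0 to 10


def agi_score(domain_scores):
    """Return the overall score as a percentage from 0 to 100."""
    missing = [name for name in DOMAINS if name not in domain_scores]
    if missing:
        raise ValueError(f"missing domain scores: {missing}")
    for name in DOMAINS:
        if not 0 <= domain_scores[name] <= MAX_PER_DOMAIN:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_DOMAIN}")
    total = sum(domain_scores[name] for name in DOMAINS)
    return 100 * total / (MAX_PER_DOMAIN * len(DOMAINS))


def weakest_domains(domain_scores, n=3):
    """Return the n lowest-scoring domains, i.e. where the gaps are."""
    return sorted(DOMAINS, key=lambda name: domain_scores[name])[:n]


# Placeholder profile for a hypothetical model. These are NOT measured values.
hypothetical_model = {
    "knowledge": 9,
    "reading_and_writing": 8,
    "math": 8,
    "reasoning": 5,
    "working_memory": 4,
    "memory_storage": 0,
    "memory_retrieval": 3,
    "visual": 3,
    "auditory": 4,
    "speed": 2,
}

print(agi_score(hypothetical_model))
print(weakest_domains(hypothetical_model))

# Totals quoted in the episode, used to track progress between releases.
GPT4_TOTAL, GPT5_TOTAL = 27, 58
print(GPT5_TOTAL - GPT4_TOTAL)  # percentage points gained between the two
print(100 - GPT5_TOTAL)         # percentage points remaining to 100
```

Because every domain is capped at 10 points, a model that is already near the ceiling in math or knowledge cannot raise its total much further there. The jagged profile means the remaining points have to come from the weakest domains.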
Dan Hendrycks, the director of the Center for AI Safety, commented, people who are bullish
about AGI timelines rightly point to rapid advancements like math.
The skeptics are correct to point out that AIs have many basic cognitive flaws,
hallucinations, limited inductive reasoning, limited world models, no continual learning.
There are many barriers to AGI, but they each seem tractable.
It seems like AGI won't arrive in a year, but it could easily arrive this decade.
Content creator Lewis Gleason wrote,
What's powerful here is that this framework lets us track AGI like a scorecard. For the first time,
we have a framework that turns AGI from a buzzword into a measurable spectrum. Instead of
arguing, are we close to AGI, we can now ask how much cognitive ground remains before
parity. Now, one of the interesting things about this framework is its focus on what's missing
rather than highlighting a model's frontier abilities. Over the summer, for example, GPT-5 and Gemini
2.5 Pro achieved gold medal performances in the International Mathematical Olympiad
and the International Collegiate Programming Contest.
The leading models then are already at a human level, a very advanced human level when
it comes to math or coding.
Importantly, though, while achieving that level was a huge milestone on the path to AGI,
based on the center's approach to an AGI definition, further progress in those areas
isn't going to make a big difference.
In contrast, audio and visual understanding is still very nascent and needs to improve dramatically
before AI models could be considered anywhere close to AGI.
Of course, those areas are arguably on the way.
Google has made incredible strides with their
multimodal models over the past year, and visual understanding seems to be developing quickly.
The Veo 3 set of models and Sora 2 are also able to add appropriate audio to generated videos,
implying strong auditory understanding.
The big area that is so clearly missing, the biggest hole, by a mile, is around memory.
The paper, in fact, describes this as perhaps the most significant bottleneck.
Now, of course, this is a huge area of focus for the labs.
Anthropic recently introduced their Skills feature, which introduces a more efficient way of storing
and accessing memory, but we're yet to see a model that can intelligently store and retrieve information
at anywhere close to a human level. In fact, when you hear people critique
how far ahead of the models' actual capabilities the hype may have gotten,
it tends to come around to this part of cognition, where models don't have memory and they
can't learn in the way that humans do. Commenting on the study's exploration of memory from the
paper, Rohan Paul noted, they show that today's systems often fake memory by stuffing huge context
windows and fake precise recall by leaning on retrieval from external tools, which hides real gaps
in storing new facts and recalling them without hallucinations. They emphasize that both GPT-4 and
GPT-5 fail to form lasting memories across sessions and still mix in wrong facts when retrieving,
which limits dependable learning and personalization over days or weeks. Anyone who has thought
that they had locked in core knowledge and context about themselves with an LLM, only to have it
feed you back a response that has none of that understanding built in will understand what a big
problem this actually is. Now, what's valuable about this paper is, as Gleason put it, having a
framework where there's an actual trackable numeric score that people can assess progress on. For example,
if all market actors accepted this framework, which of course won't happen, and then they went
and looked when GPT-6 came out, instead of the inevitable endless debates about whether we had hit a
wall again, theoretically, you could just look and see how much it had improved from GPT-4's 27% and GPT-5's 58%.
And yet at the same time, there is one highly problematic shortfall that could be very important.
Again, as Rohan Paul put it, the scope is cognitive ability, not motor control or economic output,
so a high score does not guarantee business value.
In fact, increasingly other AGI definitions have fallen back on economic value as the most important proxy for intelligence.
Sometimes that's because more complex notions like continuous learning or performing tasks outside of the training set are too difficult to define.
One prominent example came from OpenAI's contract dispute with Microsoft.
Their agreement originally had Microsoft losing access to OpenAI's technology once AGI was achieved.
The problem was, of course, that the definition of AGI from OpenAI was pretty vague.
It defined AGI as, quote, highly autonomous systems that outperform humans at most economically
valuable work.
The OpenAI board also had sole discretion to declare that AGI had been achieved.
This was viewed as an unfalsifiable claim that could cost Microsoft tens of billions of dollars.
The two companies ultimately settled on changing the definition of AGI to use a financial measurement
as a proxy. They decided that AGI would be deemed to have been achieved when OpenAI developed
software that could generate $100 billion in profits. Earlier this week, during the controversy
around the Andrej Karpathy interview, Elon Musk revealed that he has a similar definition. He posted on
X that AGI is, quote, capable of doing anything a human with a computer can do, but not smarter
than all humans and computers combined. He said it's probably three to five years away. He also
put forward his belief that Grok 5 has a 10% chance to meet this definition and the odds are rising.
Now, I think there are, of course, merits to both economic and functional definitions of
AGI. The functional definition, as laid out in the new paper, establishes the areas where
current models are lacking and the new capabilities they will need to achieve AGI.
In some ways, it functions almost as like a checklist, so we're all clear that incredibly
intelligent models that forget everything at the end of the context window aren't really AGI.
But at the same time, an incredibly powerful model, like Elon Musk is predicting Grok 5 will be,
whether it's AGI or not, could have a profound impact on the economy.
In fact, as I've said numerous times, I think that these models are having and will have
a profound impact on the economy exactly as they are right now.
Ultimately, I think this is an extremely useful contribution to the field.
I hope that more people dig in.
And if nothing else, it creates a useful heuristic for the future when inevitably,
we rage and scream and kick with every new model release about how some big wall has been hit.
For now, that's going to do it for today's AI Daily Brief.
Appreciate you listening or watching, as always, and until next time, peace.
