The AI Daily Brief: Artificial Intelligence News and Analysis - GPT-5 is 58% AGI

Starting point is 00:00:00 Today on the AI Daily Brief, a new definition of AGI that suggests that GPT-5 is 58% of the way there. Before that in the headlines, Claude Code comes to the web. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Superintelligent, KPMG, and Robots and Pencils. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can sign up on Apple Podcasts. One note about Apple Podcasts because this has been coming up a bit. Apple Podcasts is set up on a very different system to Patreon.

Starting point is 00:00:44 With Patreon, I upload a distinct file. I can schedule it the same way as I can with my normal episodes. And so it's always set to come out at the same time as the main ad version. With Apple Podcasts, it's a little bit different. I have to wait for the episode to post to Apple Podcasts, meaning that there's a short delay after I publish in general. And then I have to go in and manually replace the file. What this means is that if there's any reason I can't immediately replace

Starting point is 00:01:07 that file, you will still see the normal ad version on your feed. It only gets replaced when I add that manual file. This is normally fine. It just means about 15 minutes of waiting around after I press publish on the normal episode, but that lag creates more possibilities for problems. For example, yesterday Apple's Podcasts Connect system was down for about 12 hours, which meant that pretty much overnight, even subscribers on Apple still saw only the ad version. I promise you that I will always try to get the ad-free version up on Apple as fast as I can, but sometimes it's going to be out of my control. I apologize. I really wish it was a better system, but that is just the way that it is. Lastly, of course, for any information about the show, sponsorship, speaking, job opportunities,

Starting point is 00:01:44 go to aidailybrief.ai. And with that all out of the way, let's dive in. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. First up today, Anthropic is making Claude Code available through a web app and within the Claude iOS app. Previously, the feature was only available through terminals and IDEs, and the big unlock is being able to spin up background agents. With Claude Code running in the cloud, you can now run multiple tasks in parallel across different repositories from a single interface

Starting point is 00:02:12 and ship faster with automatic PR creation and clear change summaries. This asynchronous workflow is quickly becoming a powerful tool for AI-enhanced coders. Claude Code product manager Cat Wu said, as we look forward, one of our key focuses is making sure the command line interface product is the most intelligent and customizable way for you to use coding agents.

Starting point is 00:02:31 But we're continuing to put Claude Code everywhere, helping it meet developers where they are. Web and mobile is a big step in this direction. Certainly there is a lot of excitement about this. Josh J.DJ.J. Kelly on Twitter wrote, I can work with Claude Code while out on a walk. Speaking of agentic coding, Replit is projecting massive growth to reach a billion dollars in revenue by the end of next year. Speaking with Business Insider, CEO Amjad Masad said that the AI coding startup has reached 240 million in ARR and expects that to quadruple next year. The company's growth this year has been absolutely skyrocketing,

Starting point is 00:03:05 gaining more than 10x from their 16 million in ARR at the end of 2024. The company now has over 150,000 paying customers and over 40 million free users. And while at this stage, all those free users mean that the consumer segment is unprofitable, Masad boasted that enterprise margins are close to 80%. This follows the same profit model that other AI companies are currently pursuing. The consumer segment is a loss leader due to large volumes of free users, but building familiarity with consumers means they want access to the same tools at work, which is where the company can make a profit. Masad said that the surging revenue was largely due to adoption in mid-sized companies, including Duolingo and Zillow. He said,

Starting point is 00:03:39 Replit is kind of replacing a lot of the no-code, low-code tools which never really worked very well. They get initial productivity boosts, but a lot of times that ended up actually slowing down a lot of companies. Whatever the case, they are seeing enough growth that they are pushing forward their expectations. This article came about after Business Insider saw a leaked investor memo that gave the billion-dollar projection for 2027. Speaking of growth, after a spike in downloads this month, could Meta's AI app actually be gaining traction? According to Similarweb data, Meta's standalone AI app now has over 300,000 downloads per day, up from around 100,000 in mid-September. In addition, the app now has 2.7 million daily active users, up from 775,000 last month.

Starting point is 00:04:19 And while Similarweb said they hadn't seen any meaningful correlation with either search or advertising volume, they noted that Meta could be promoting the platform on Facebook or Instagram, which aren't included in Similarweb's data. The other possible explanation is that Meta's new Vibes feed has been more of a success than people gave it credit for. The AI-generated image and video feed was released September 25th and decried by many as the introduction of infinite feeds of AI slop. However, the spike in downloads and daily active users both do line up with the introduction of that feed. OpenAI's launch of the Sora app a week later could also be boosting Meta's platform as an alternative. Sora still requires an invite code while Meta's platform is freely available. Now, obviously these numbers in aggregate are still

Starting point is 00:04:58 quite low relative to the billions of users that mainstream social apps have, but the growth is notable nonetheless. Next up, some fundraising news. OpenEvidence, the AI assistant for doctors, has raised $200 million at a $6 billion valuation. This is the second large fundraising round for the company this year. They raised $210 million at a $3.5 billion valuation back in July. And with the level of growth they've displayed recently, it's not hard to see why the valuation has almost doubled.

Starting point is 00:05:24 OpenEvidence now supports around 15 million clinical consultations a month, up from 8.5 million in July. The product is free to use for registered medical professionals and monetized through advertising rather than subscription. That unconventional approach for a professional tool has allowed OpenEvidence to expand into 10,000 medical centers. OpenEvidence only began commercializing their app three months ago and is already halfway to their target of $100 million in advertising revenue for next year. The assistant is trained on leading medical journals like the New England Journal of Medicine and is designed to help doctors quickly access the literature for diagnosis and treatment options. The system is also designed to reject low-confidence outputs, reducing hallucination risk.

Starting point is 00:06:00 Alongside medical journals, the model is also being fine-tuned on the 100 million clinical consultations assisted by the tool. Co-founder Daniel Nadler said that this is one of the company's largest moats, adding, no one else in the world has that data. Speaking to adoption among doctors, Zankeen Zeb of Google Ventures, the lead investor in the round, said it's reaching verb-like status. Now, this type of data moat, where companies in verticals have access to actual real-world data based on the usage of their tool, is one of the most interesting themes and questions.

Starting point is 00:06:28 So far in the history of LLMs, we've seen that the bitter lesson applies. In other words, that mass access to data beats out specialized data when it comes to pre-training. However, where a lot of people are looking in the future is that the data that's left that the foundation model labs don't have is the data exhaust that comes from real-world usage, and that could in and of itself be extremely valuable. That's certainly the argument that OpenEvidence is making, and we'll have to see how it plays out. Staying on fundraising, music generation startup Suno is said to be in talks to raise $100 million at a $2 billion valuation. Sources speaking with Bloomberg said the deal would quadruple the company's valuation since their last raise.

Starting point is 00:07:03 That last round closed in May of last year and brought in $125 million, although the valuation was not disclosed at the time. Importantly, the startup is now generating $100 million in ARR, according to sources familiar with the numbers. And what's more, Suno may be able to settle their legal disputes very shortly. In June of last year, Universal and Warner Music filed a lawsuit for copyright infringement against Suno and competitor Udio, but this June, Bloomberg reported that the labels are in talks to settle the litigation and establish a licensing framework for generated music. The labels are also rumored to be looking to take an equity stake in both of those companies. Reinforcing the idea of a truce between the music industry and AI startups,

Starting point is 00:07:38 last week, Spotify announced plans to work with the record labels on AI-powered features. Universal Music Group CEO Lucian Grainge is boosting a pro-AI message internally. Last week, he sent a memo to staff reemphasizing his interest in partnering on AI products as long as they respect artist copyrights and likenesses. Now, for anyone who has watched the history of the record labels all the way going back to Napster, this should be no surprise at all. There is no industry, frankly, more adept at figuring out how to monetize the new thing. Lastly, today, the latest company to make some big AI pronouncement is Starbucks. Starbucks CEO Brian Niccol said that they're all in on AI.

Starting point is 00:08:11 Appearing on a Yahoo Finance podcast recorded at the Dreamforce conference last week, Niccol discussed a wide range of AI deployments at the company. A major scaled use case is an in-store knowledge assistant referred to as the Green Dot. It helps store leaders manage daily operations, including troubleshooting equipment and providing drink recipes. Niccol also said that Starbucks has pilots for inventory, supply chain forecasting, and scheduling, although none of those use cases are at scale. Speaking to ROI, he commented,

Starting point is 00:08:36 we're still in the early days of this, but I believe there is definitely opportunity here to help us get things done faster and more efficiently. To what scale, that is to be determined. We're definitely already seeing a big impact in our technology area. The ability to get code done so much faster is real. One thing he did reject is the idea of robot baristas anytime soon, commenting we're not near that right now. Some folks tried to dig into the specifics about what that would mean, while others just

Starting point is 00:08:59 let it be vibes. Sophie, @netcapgirl, writes, okay, yeah, whatever, eff it, Starbucks, AI. And that is going to do it for today's headlines. Next up, the main episode. Today's episode is brought to you by my company, Superintelligent. Look guys, buying or building agents without a plan is how you end up in pilot purgatory. Superintelligent is the agent planning platform that saves you from stalling out on AI. We interview teams at scale, translate real work into prioritized agent opportunities,

Starting point is 00:09:28 and deliver recommendations that you can execute on: what to build, what success looks like, how fast you'll get results, and even what platforms and tools you should consider. All customized for you. Instead of shopping for hype, you get to deploy with confidence. Visit besuper.ai and book your AI planning demo today. AI isn't a one-off project. It's a partnership that has to evolve as the technology does. Robots and Pencils works side by side with clients to bring practical AI into every phase: automation, personalization, decision support, and optimization. They prove what works through applied

Starting point is 00:10:02 experimentation and build systems that amplify human potential. As an AWS-certified partner with global delivery centers, Robots and Pencils combines reach with high-touch service. Where others hand off, they stay engaged, because partnership isn't a project plan. It's a commitment. As AI advances, so will their solutions. That's long-term value. Progress starts with the right partner. Start with Robots and Pencils at robotsandpencils.com slash AI Daily Brief. What if AI wasn't just a buzzword, but a business imperative? On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most forward-thinking enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG,

Starting point is 00:10:42 this seven-part series delivers real-world insights from leaders who are scaling AI with purpose, from aligning culture and leadership to building trust, data readiness, and deploying AI agents. Whether you're a C-suite executive, strategist, or innovator, this podcast is your front-row seat to the future of enterprise AI. So go check it out at www.kpmg.us slash AI podcasts, or search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts.

Starting point is 00:11:11 Welcome back to the AI Daily Brief. One of the things that I have said frequently on this show, including being effectively the entire theme of yesterday's show, is that when it comes to the practical, lived, applied experience of AI inside a work setting, I don't think that AGI matters. In fact, I think it is one of the more useless terms when it comes to how you think about applying AI in your daily life or your company thinks about applying it at work. So why do definitions of AGI matter, then? And the short answer is it's the exact same reason that we had that entire conversation

Starting point is 00:11:43 in the show yesterday, which is that all of a sudden progress towards AGI is going to be considered a meaningful factor when it comes to how markets should treat AI stocks. Given how much AI stocks are at the core of the entire economy right now, these otherwise nebulous definitions start to take on a greater importance. Now, of course, for those who haven't listened to yesterday's episode, AGI timelines are back in the news this week, specifically because OpenAI co-founder Andrej Karpathy said that he believes the technology is still a decade away, as opposed to estimates that have it more in a year or two. Now, one critical point that came out of that conversation is that Andrej actually has an

Starting point is 00:12:22 extremely high bar for how he defines AGI. He said, when people talk about AI in the original AGI and how we spoke about it when OpenAI started, AGI was a system that you could go to that could do any economically valuable task at human performance or better. That was the definition. He noted that since then, the definition has been watered down to just covering knowledge work, certainly nothing like physical work. Now, knowledge work is certainly a huge part of the global economy, but at 10 to 20% of all the work in the world, at least as per his estimates, that leaves a lot off the table. Now, this is far from the only definition floating around. Way back in February of 2023, OpenAI laid out

Starting point is 00:13:01 their framework for thinking about the approach of AGI. They gave a very basic definition: AI systems that are generally smarter than humans. Since then, Sam Altman has updated his thoughts. He acknowledged in February of this year that AGI is a, quote, weakly defined term, but generally speaking, we mean it to be a system that can tackle increasingly complex problems at human level in many fields. You might also hear Altman talking about AGI in reference to the five levels of AI framework. Now, this built off of something that Google DeepMind scientists had introduced in a November 2023 paper, but then OpenAI expanded into these five stages. Level 1, chatbots, which were AI with conversational language. Level 2, which were reasoners with human-level problem

Starting point is 00:13:41 solving. Level 3 was agents, systems that can take actions. Level 4 was innovators, AI that can aid in invention. Level 5, organizations, or AI that can do the work of an organization. As we've discussed a lot on this show, we are somewhere in the 3 to 4 range right now. Beyond that, there are a range of other definitions you might come across. Stalwart Gartner defines AGI as, quote, the intelligence of a machine that can accomplish any intellectual task that a human can perform. Google leans into a different aspect, describing AGI as hypothetical intelligence of a machine that possesses the ability to understand or learn any intellectual task that a human being can. Amazon has another distinct focus, describing AGI

Starting point is 00:14:20 as software that is, quote, able to perform tasks that it is not necessarily trained or developed for. Now, if these are one-off definitions for blog posts, one of the more prominent attempts to define and test AGI capabilities is, of course, the ARC AGI prize. On their website, they write, the consensus definition of AGI, a system that can automate the majority of economically valuable work, while a useful goal, is an incorrect measure of intelligence. Measuring task-specific skills is not a good proxy for intelligence. Skill is heavily influenced by prior knowledge and experience. Unlimited priors and unlimited training data allow developers to buy levels of skill for a system. This masks a system's own generalization power. Intelligence lies

Starting point is 00:15:00 in broader general purpose abilities. It is marked by skill acquisition and generalization rather than skill itself. So they propose a better definition: AGI is a system that can efficiently acquire new skills outside of its training data. The ARC AGI test then seeks to test two elements of AGI contained in the definition. The ability to acquire new skills by ensuring the tests have internal logic that can be learned, and the ability to complete tasks outside of training data by ensuring the tasks are not generally available. So these are all the things that are floating around. And you can see while they broadly get us in the right category, there are a lot of different definitions, which lead to a lot of debates and a lot of AGI is in the

Starting point is 00:15:39 eye of the beholder kind of conversations, which, as I said, I don't think really matters for our day-to-day, but does matter when it comes to whether giant funds are going to press the sell button because they think things are overbought, because we're not making enough progress towards AGI, which means all these contracts aren't going to play out the way that they want to. So this is the context into which a group of researchers working with the Center for AI Safety have attempted to nail down a common definition and a metric for assessing models as they progress. The group has produced a paper called A Definition of AGI, which you can find at agidefinition.ai. In the abstract, they write, the lack of a concrete definition for artificial general intelligence obscures the gap between

Starting point is 00:16:16 today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. This group, then, has grounded their analysis in Cattell-Horn-Carroll theory, one of the more well-accepted models of human cognition. Applying the theory, the researchers split AI performance into 10 distinct categories.

Starting point is 00:16:39 Reading and writing, math, reasoning, working memory, memory storage, memory retrieval, visual, auditory, speed, and knowledge. Now, you'll note that these categories cover some of the general performance categories, things like reading and writing or math, but they also address the model's ability to learn and apply its intelligence to topics outside of its training data.

Starting point is 00:16:58 Each of these categories has multiple subcategories that can be assessed individually. In fact, assessment was one of the main focuses of this paper. Researchers wrote, Applications of this framework reveal a highly jagged cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. Each category was equally weighted and given a score out of 10,

Starting point is 00:17:23 and researchers measured GPT-4 and GPT-5 to demonstrate the framework. GPT-4 scored 27% while GPT-5 achieved 58%. You can see from the two sets of results mapped out on a chart that while GPT-5 only made minor progress in knowledge, it made significantly more progress in reading and writing as well as math. What's more, GPT-5 scored in multiple categories where GPT-4 was entirely deficient. This included reasoning, working memory, memory retrieval, visual, and auditory.
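
To make the scoring arithmetic concrete, here is a minimal sketch of how ten equally weighted categories, each scored out of 10, roll up into a single percentage. The function names and the sample profile are assumptions made up for illustration; the per-category numbers are not the paper's measurements. Only the 27% and 58% totals come from the episode.

```python
# Sketch of the scoring arithmetic described above: ten equally weighted
# categories, each scored from 0 to 10, summed into a 0-100 percentage.
# Function names and the sample profile are hypothetical, not from the paper.

CATEGORIES = [
    "reading_writing", "math", "reasoning", "working_memory",
    "memory_storage", "memory_retrieval", "visual", "auditory",
    "speed", "knowledge",
]


def agi_score(profile):
    """Return the total score (0-100) for a dict of category -> score (0-10)."""
    missing = set(CATEGORIES) - set(profile)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in CATEGORIES:
        if not 0 <= profile[name] <= 10:
            raise ValueError(f"{name} must be between 0 and 10")
    # Equal weighting: ten categories at ten points each sum to 100.
    return sum(profile[name] for name in CATEGORIES)


def points_gained(old_total, new_total):
    """Percentage points gained between two model generations."""
    return new_total - old_total


def points_remaining(total):
    """Percentage points still missing before reaching 100%."""
    return 100 - total


# Made-up, jagged profile: strong on math, zero on long-term memory storage.
hypothetical_profile = {
    "reading_writing": 8, "math": 9, "reasoning": 5, "working_memory": 3,
    "memory_storage": 0, "memory_retrieval": 4, "visual": 3, "auditory": 2,
    "speed": 3, "knowledge": 8,
}

print(agi_score(hypothetical_profile))  # 45
print(points_gained(27, 58))            # 31 points between the two reported totals
print(points_remaining(58))             # 42 points still to go
```

Because every category is capped at 10 points, a model that is already near the ceiling in math cannot raise its total much further there; the remaining points have to come from the weaker categories.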

Starting point is 00:17:53 And while those areas of intelligence are developing in the latest models, they're still very nascent compared to, for example, math. Dan Hendrycks, the director of the Center for AI Safety, commented, people who are bullish about AGI timelines rightly point to rapid advancements like math. The skeptics are correct to point out that AIs have many basic cognitive flaws: hallucinations, limited inductive reasoning, limited world models, no continual learning. There are many barriers to AGI, but they each seem tractable. It seems like AGI won't arrive in a year, but it could easily arrive

Starting point is 00:18:23 this decade. Content creator Lewis Gleason wrote, What's powerful here is that this framework lets us track AGI like a scorecard. For the first time, we have a framework that turns AGI from a buzzword into a measurable spectrum. Instead of arguing, are we close to AGI, we can now ask how much cognitive ground remains before parity. Now, one of the interesting things about this framework is its focus on what's missing rather than on highlighting a model's frontier abilities. Over the summer, for example, GPT-5 and Gemini 2.5 Pro achieved gold medal performances in the International Mathematical Olympiad and the International Collegiate Programming Contest.

Starting point is 00:18:56 The leading models then are already at a human level, a very advanced human level when it comes to math or coding. Importantly, though, while achieving that level was a huge milestone on the path to AGI, based on the center's approach to an AGI definition, further progress in those areas isn't going to make a big difference. In contrast, audio and visual understanding is still very nascent and needs to improve dramatically before AI models could be considered anywhere close to AGI. Of course, those areas are arguably on the way.

Starting point is 00:19:21 Google has made incredible strides with their multimodal models over the past year, and visual understanding seems to be developing quickly. The Veo 3 set of models and Sora 2 are also able to add appropriate audio to generated videos, implying strong auditory understanding. The big area that is so clearly missing, the biggest hole, by a mile, is around memory. The paper, in fact, describes this as perhaps the most significant bottleneck. Now, of course, this is a huge area of focus for the labs. Anthropic recently introduced their Skills feature, which introduces a more efficient way of storing

Starting point is 00:19:52 and accessing memory, but we're yet to see a model that can intelligently store and retrieve information at anywhere close to a human level. In fact, when people critique how far the hype has gotten ahead of where the capabilities of models actually are, it tends to come around to this part of cognition, where models don't have memory and they can't learn in the way that humans do. Commenting on the paper's exploration of memory, Rohan Paul noted, they show that today's systems often fake memory by stuffing huge context windows and fake precise recall by leaning on retrieval from external tools, which hides real gaps in storing new facts and recalling them without hallucinations. They emphasize that both GPT-4 and

Starting point is 00:20:28 GPT-5 fail to form lasting memories across sessions and still mix in wrong facts when retrieving, which limits dependable learning and personalization over days or weeks. Anyone who has thought that they had locked in core knowledge and context about themselves with an LLM, only to have it feed back a response that has none of that understanding built in, will understand what a big problem this actually is. Now, what's valuable about this paper is, as Gleason put it, having a framework where there's an actual trackable numeric score that people can assess progress on. For example, if all market actors accepted this framework, which of course won't happen, and then GPT-6 came out, instead of the inevitable endless debates about whether we had hit a

Starting point is 00:21:07 wall again, theoretically, you could just look and see how much it had improved from GPT-4's 27% and GPT-5's 58%. And yet at the same time, there is one highly problematic shortfall that could be very important. Again, as Rohan Paul put it, the scope is cognitive ability, not motor control or economic output, so a high score does not guarantee business value. In fact, increasingly other AGI definitions have fallen back on economic value as the most important proxy for intelligence. Sometimes that's because more complex notions like continuous learning or performing tasks outside of the training set are too difficult to define. One prominent example came from OpenAI's contract dispute with Microsoft. Their agreement originally had Microsoft losing access to OpenAI's technology once AGI was achieved.

Starting point is 00:21:52 The problem was, of course, that the definition of AGI from OpenAI was pretty vague. It defined AGI as, quote, highly autonomous systems that outperform humans at most economically valuable work. The OpenAI board also had sole discretion to declare that AGI had been achieved. This was viewed as an unfalsifiable claim that could cost Microsoft tens of billions of dollars. The two companies ultimately settled on changing the definition of AGI to use a financial measurement as a proxy. They decided that AGI would be deemed to have been achieved when OpenAI developed software that could generate $100 billion in profits. Earlier this week, during the controversy

Starting point is 00:22:24 around the Andrej interview, Elon Musk revealed that he has a similar definition. He posted on X that AGI is, quote, capable of doing anything a human with a computer can do, but not smarter than all humans and computers combined. He said it's probably three to five years away. He also put forward his belief that Grok 5 has a 10% chance to meet this definition and the odds are rising. Now, I think there are, of course, merits to both economic and functional definitions of AGI. The functional definition laid out in the new paper establishes the areas where current models are lacking and the new capabilities they will need to achieve AGI. In some ways, it functions almost like a checklist, so we're all clear that incredibly

Starting point is 00:22:58 intelligent models that forget everything at the end of the context window aren't really AGI. But at the same time, an incredibly powerful model, like Elon Musk is predicting Grok 5 will be, whether it's AGI or not, could have a profound impact on the economy. In fact, as I've said numerous times, I think that these models are having and will have a profound impact on the economy exactly as they are right now. Ultimately, I think this is an extremely useful contribution to the field. I hope that more people dig in. And if nothing else, it creates a useful heuristic for the future when inevitably,

Starting point is 00:23:28 we rage and scream and kick with every new model release about how some big wall has been hit. For now, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always, and until next time, peace.

