The AI Daily Brief: Artificial Intelligence News and Analysis - Why AI Needs Better Benchmarks

Starting point is 00:00:00 Today on the AI Daily Brief, why AI needs better benchmarks. And before that in the headlines, is Apple planning on distilling Google's Gemini models? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Robots and Pencils, Blitzy, and Superintelligent.

Starting point is 00:00:31 To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, send us a note at sponsors at aidailybrief.ai. And while you're at aidailybrief.ai, check out everything going on in the ecosystem, including the return of our newsletter, which has all the links that I mention in the show. Apple's AI partnership with Google apparently goes much deeper than previously thought, including the ability to distill Gemini into smaller models.

Starting point is 00:00:59 The unveiling of the new AI Siri is a little over two months away, and we're starting to get a steady drip of information around what the product will look like. On Tuesday, Bloomberg's Apple insider Mark Gurman ran through what he knows about features in U.S. Apple has reportedly backed down on their view that Siri should remain voice-only, now building a standard chatbot interface with optional voice controls. Gurman also reported that Siri will be deeply integrated into iOS 27, allowing it to take actions and draw context from apps running on a user's device. It sounds as though Apple will try to launch Siri with full computer use,

Starting point is 00:01:30 delivering the features they advertised with the launch of Apple Intelligence two years ago. Now, we already knew that Siri would be driven by Google's Gemini models, but new reporting from The Information suggests that Apple has much more freedom in how they use Gemini than originally thought. Previous reports said that Apple would fine-tune a Gemini model for their purposes, and that the models would be hosted on Apple's servers to ensure user privacy. However, sources speaking with The Information said that Apple has full access to the Gemini models, meaning they're able to distill large versions of Gemini into their own smaller proprietary models. Model distillation is the process of using the reasoning traces from one model to train another,

Starting point is 00:02:04 essentially a cheat code to develop powerful models. Many of the Chinese labs have been accused of distilling models from Anthropic and OpenAI as a way to catch up quickly. The Information's sources said the process isn't straightforward, as Apple's vision for Siri is very different from the way Gemini works. Gemini is optimized for chatbots, enterprise tasks, and coding, while the source implied Apple is less interested in these functions. The source was skeptical the models would actually be that much use to Apple's foundation

Starting point is 00:02:27 models team for that reason. Maybe the main takeaway is that Apple hasn't entirely given up on training their own models and could use the Google partnership to bootstrap their approach. The most obvious target in most people's minds would be training small, capable models to run locally on an iPhone, which seems to be the core vision of where Apple wants to go with AI. Ethan Mollick sums up the wait-and-see attitude of most people on this news, tweeting, Huh, I'm not sure distilling Gemini models to run on phones is going to result in the generally capable agents that people will expect soon, but we shall see. Speaking of Google, the company has published a research paper describing a new compression algorithm that could

Starting point is 00:03:01 dramatically improve the performance of small models. Called TurboQuant, the process allows researchers to quantize model context with almost zero losses. During long conversations or long-horizon tasks, context can bloat to use even more memory than model weights. Functionally, quantization means context is stored with less fidelity. For example, 16-bit data might be compressed into 4-bit. Current quantization methods are quite lossy and noticeably reduce performance. Some believe, for example, that this is the reason Anthropic's models can seem a little off during demand spikes. Google researchers say their new process massively reduces the loss associated with quantization

Starting point is 00:03:35 and could make the technique far less of a trade-off. They claim their process results in a 6x reduction of the amount of memory a given model uses for storing context, while delivering an 8x speed boost compared to current methods. This could result in a 50% reduction in inference costs and help ease the bottleneck around memory chips. Giving a concrete demonstration of what the algorithm can do, Google's researchers tested it on Llama 3.1 8B and Mistral 7B. With TurboQuant implemented, both models achieved perfect scores on needle-in-a-haystack tests. Cloudflare CEO Matthew Prince tried to explain the gravity of this breakthrough, commenting,

Starting point is 00:04:07 This is Google's DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization. Others reached for a more relatable analogy, comparing this moment to when a scrappy startup cracked middle-out compression with a Weissman score of 5.2. Chevang writes, So basically, TurboQuant is Pied Piper. Now, Google isn't just shipping groundbreaking research, they also have a new music model with Lyria 3 Pro. The first version of Lyria 3 was to some folks underwhelming.

Starting point is 00:04:34 It wasn't that the model was bad, it just couldn't produce production-quality music like Suno and was limited to 30 seconds, making it seem like it was more for novelty purposes than professional use cases. Lyria 3 Pro definitely addresses some of those issues. It can now create full tracks up to three minutes long and seems to have a much better understanding of lyrics and song structure. Rohan Paul writes, The hard part in AI music is not making pleasant sound for 10 seconds. It's keeping a piece coherent as it moves from intro to verse to chorus without collapsing into a loop. Now, Rohan noted

Starting point is 00:05:03 that Google is also pushing it in Vertex AI, AI Studio, and the Gemini app, so the bigger story is probably less about the model, and more the fact that it is available via API, which could mean it finds its way into a lot more use cases. Over in the world of AI politics, Senator Bernie Sanders has unveiled his data center moratorium bill with an assist from AOC. The legislation would pause all data center construction nationwide until strong national safeguards, their words, are in place. The bill requires Congress to establish protections for workers and consumers, address environmental harms, and defend civil rights before lifting the moratorium. Sanders said, AI has received far too little serious discussion here in our nation's capital.

Starting point is 00:05:40 I fear that Congress is totally unprepared for the magnitude of the changes that are already taking place. Now, the presence of AOC as a co-sponsor seems fairly relevant. Until now, Sanders has been pushing the moratorium largely by himself with support from certain elements of the AI safety community. It hadn't found meaningful traction among elected progressives. AOC personally has been pretty much silent on the issue. Her X feed has zero mentions of data centers and only a single post about AI regarding the dangers of deepfakes. By supporting the bill, AOC is declaring a position for the broader progressive movement and could at least theoretically carry that position into a presidential run in 2028. Meanwhile, the bill seems to have very little support from mainstream Democrats. Senator Mark Warner,

Starting point is 00:06:18 for example, said the idea was, in his words, idiocy. He continued, A data center moratorium simply means China is going to move quicker. The idea that we're going to stuff this back into the bottle, that's a ridiculous premise. Now, despite thinking the moratorium is the wrong solution, Warner certainly still has strong views on AI policy. He's currently supporting a bill to codify Anthropic's red lines around using AI for domestic surveillance and autonomous weaponry. Referring to Secretary of War Pete Hegseth, he added, Those should be policy decisions not left to a single individual. Warner also raised alarm about AI job replacement, commenting, The recent college graduate unemployment is 9%.

Starting point is 00:06:51 I'll bet anyone in the room it goes to 30 or 35% before 2028. He said he now believes the scope of the economic disruption is going to be exponentially larger than he thought just a few months ago. Commenting on the Sanders-AOC policy, James Rosenberg writes, I see why it's called populism now, never liking the term. Every part of this is detrimentally performative. It's arbitrary AF. The ban on upgrades means no energy efficiency or sustainability improvements can happen. There is nothing progressive about it. On the other end of the spectrum, New York Times tech reporter Mike Isaac writes, People can certainly take issue with his positions and plan of action,

Starting point is 00:07:23 but Sanders seems to be one of the few members of Congress seriously reckoning with what the labor consequences of the coming AI age could be, now joined by AOC. One of the things that I'm not sure on is the extent to which a moratorium is A, something that Bernie and AOC actually think is good policy, B, something they think is good politics, given increasing American antipathy towards data centers in their communities, or C, a way to anchor the conversation on the far end of one extreme, so there's more room to

Starting point is 00:07:50 find compromise in the middle. One can certainly hope it's the third, but right now it's not at all clear. Now, speaking of the China boogeyman, in our final story, Manus co-founders have been banned from leaving China as the CCP cracks down. The Financial Times reports that Manus's CEO and chief scientist have both been barred from leaving the country while Meta's $2 billion acquisition is reviewed. We heard rumblings of this earlier in the month, as Manus and Meta executives were summoned to Beijing for a meeting with regulators. The theory of the case is that Manus circumvented China's export controls on tech by relocating their headquarters from Beijing to Singapore. CEO Xiao Hong and chief scientist Ji Yichao reportedly attended the meeting and were told after its conclusion that they

Starting point is 00:08:28 would be unable to leave China but were free to travel within the country. Sources said that no formal investigation has been opened and no charges have been brought, but Manus is said to be seeking legal representation to help resolve the issue. The entire situation is messy because it deals with the intersection of actual laws and the unspoken rules that govern doing business in China. China has strict laws to control foreign investment and export of technology. However, both Manus and Meta maintained their transaction was in full compliance. The relocation of the headquarters is an obvious gray area, made even more gray by the fact that Manus still maintains an offshore entity, which was used to develop early versions of the product. As for the unspoken rules,

Starting point is 00:09:04 Chinese officials have become increasingly concerned about losing AI talent and technology to the West. They've even adopted the euphemism of selling young crops to describe the poaching of human capital and strategic industries. Sources suggested that the extreme outcome would be a forced unwind of the Meta deal, but noted that that would be messy because the technology is already being integrated into Meta's platforms. What's more, this isn't the only sign that Beijing is tightening its grip on its domestic AI industry. AI researcher Tao Hu shared that the China Computer Federation has warned researchers not to participate in the NeurIPS conference. Chinese entrepreneur Lina Hua argued that this was all to be expected, writing,

Starting point is 00:09:38 They thought they were being clever for circumventing China's tech export controls, but you don't mess with the CCP like that. You will be made an example of so others don't get tempted to betray the motherland. So what's going to happen? China won't jail them because they don't want to look evil. Instead, they're going to freeze the founders' assets in China and give them a travel ban while the quote-unquote probe is ongoing. The probe will likely be deliberately prolonged to inflict psychological damage, create uncertainty for potential copycats and make the public forget about this case. And once the topic is out of the public's mind, CCP going to strike hard with a financial penalty that wipes out most of their gains and then soft blacklist them in China.

Starting point is 00:10:11 Bill Bishop, the host of Sinocism, writes, I didn't think the Manus top execs would be so naive as to go back to the PRC. Expect they will have to spit back out a lot of what they made. On the flip side, some Western observers thought the crackdown will probably backfire. Former White House advisor Dean Ball commented, If we were smart, we'd see this as a major self-own by China, as natsec-brained public policy so often is. The message the government is sending is, If you ever want to found a company, especially one that makes money on software, move to Singapore first.

Starting point is 00:10:39 Easier to get GPUs, too. Never a dull day in AI, but for now, that does it for the headlines. Next up, the main episode. All right, folks, quick pause. Here's the uncomfortable truth. If your enterprise AI strategy is we bought some tools, you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise, how work gets done, how teams collaborate, how decisions

Starting point is 00:11:06 move, not as a tech initiative, but as a total operating model shift. And here's the real unlock. That shift raised the ceiling on what people could do. Humans stayed firmly at the center, while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce. If you want to understand what that actually looks like in the real world, go to www.kpmg.us slash AI. That's www.kpmg.us slash AI. Today's episode is brought to you by Robots and Pencils, a company that is growing fast. Their work as a high-growth AWS and Databricks partner means that they're looking for elite talent ready to create real impact at velocity.

Starting point is 00:11:47 Their teams are made up of AI-native engineers, strategists, and designers who love solving hard problems and pushing how AI shows up in real products. They move quickly using RoboWorks, their agentic acceleration platform, so teams can deliver meaningful outcomes in weeks, not months. They don't build big teams, they build high-impact, nimble ones. The people there are wicked smart, with patents, published research, and work that's helped shape entire categories.

Starting point is 00:12:12 They work in Velocity Pods and studios that stay focused and move with intent. If you're ready for career-defining work with peers who challenge you and have your back, Robots and Pencils is the place. Explore open roles at robotsandpencils.com slash careers. That's robotsandpencils.com slash careers. Want to accelerate enterprise software development velocity by 5X? You need Blitzy, the only autonomous software development platform built for enterprise codebases. Your engineers define the project,

Starting point is 00:12:39 a new feature, refactor, or greenfield build. Blitzy agents first ingest and map your entire codebase, then the platform generates a bespoke agent action plan for your team to review and approve. Once approved, Blitzy gets to work autonomously generating hundreds of thousands of lines of validated end-to-end tested code. More than 80% of the work completed in a single run. Blitzy is not generating code, it's developing software at the speed of compute. Your engineers review, refine, and ship. This is how Fortune 500 companies are compressing multi-month projects into a single sprint, accelerating engineering velocity by 5X. Experience Blitzy firsthand at blitzy.com. That's B-L-I-T-Z-Y dot com.

Starting point is 00:13:15 It is a truth universally acknowledged that if your enterprise AI strategy is trying to buy the right AI tools, you don't have an enterprise AI strategy. Turns out that AI adoption is complex. It involves not only use cases, but systems integration, data foundations, outcome tracking, people and skills, and governance. My company, Superintelligent, provides voice agent-driven assessments that map your organizational maturity against industry benchmarks across all of these dimensions. If you want to find out more about how that works, go to besuper.ai. And when you fill out the get-started form, mention maturity maps. Again, that's besuper.ai. Welcome back to the AI Daily Brief. Today we are looking at the launch of ARC-AGI-3.

Starting point is 00:13:58 It's a new benchmark from ARC Prize that is specifically designed to test the interactive reasoning capability of AI agents. Now, it is the latest in a sequence of benchmarks that are meant to deal with some of the problems of benchmarks, but to better understand what they are trying to respond to, it's worth going back and actually understanding what the purpose of benchmarks is, what the problems with them are, and how people have tried to address those problems. Benchmarks are effectively two things. They're a way to compare AI's performance in various areas, as well as a way to see how models are progressing over time. Historically, there have been two major categories of benchmarks that you see included with every new model release. The two categories

Starting point is 00:14:35 are benchmarks around knowledge and benchmarks around function. Knowledge was the first big hill to climb, with benchmarks like MMLU for general knowledge, and GPQA, which measures scientific knowledge. Over time, more difficult benchmarks were introduced, like Humanity's Last Exam, which features obscure knowledge not typically found in the training data. As models developed, however, function became more important.

Starting point is 00:14:54 SWE-bench is one of the best known of the functional benchmarks, testing the knowledge required to solve typical coding problems from GitHub. As agentic coding has risen in importance in the AI space, Terminal-Bench has arguably overtaken SWE-bench as the most important coding benchmark. Terminal-Bench tests not only coding reasoning, but also the model's ability to use a terminal. And interestingly, a lot of benchmarks have followed this pattern, starting off as a test of knowledge and then implicitly or explicitly also adding an element of testing functional capacity. Humanity's Last Exam, for example, began as a pure test of pre-trained knowledge, but now it's

Starting point is 00:15:26 typically measured with web search tools enabled, making it a proxy for competency in tool use as well. Now, very early on in the modern post-ChatGPT era of AI, benchmark saturation became a problem. All the way back in May of 2024, with the release of GPT-4o, all major models were already above 80% on MMLU, with GPT-4o scoring 88.7%. Now, at the time, some other benchmarks were a little bit less saturated. 4o was a big breakout, for example, on GPQA, scoring 53.6%. But of course, with all of these benchmarks, it was only a matter of time. By last summer, the saturation problem had gotten much worse. At the time, o3 was OpenAI's daily driver. More difficult questions had been added to GPQA Diamond, and o3 still achieved 83.3% without using tools. By that stage,

Starting point is 00:16:10 most of the 2024 benchmarks had been abandoned or updated because of saturation. For example, the MATH benchmark was long gone, replaced by the AIME math test, which uses questions from a real-world math competition. o3 would score 88.9% on AIME math, foreshadowing that a specifically trained OpenAI model would achieve a gold medal performance at the International Math Olympiad a few months later. Fast forward to today, and once again, many of these benchmarks are getting saturated. GPT-5.4 is now up to 52.1% on Humanity's Last Exam with tools and 39.8% without, which is very close to Opus 4.6's 53 and 40% respectively. SWE-bench was once again upgraded, with GPT-5.4 scoring 57.7% on SWE-bench Pro. For Opus 4.6, Anthropic reported 81.4% on SWE-bench

Starting point is 00:16:56 Verified, but chose to highlight Terminal-Bench 2.0 more prominently, where they scored a 65.4%. Now, it's difficult to keep track of all these numbers, but this chart shows the example of how performance on SWE-bench Verified progressed over the past year. Models from Anthropic, Google, OpenAI, and MiniMax, who produced the chart, are basically all up and to the right. They each began at different points in the middle of 2025, ranging from 55 to 70%. However, they've all arrived now at up near 80%. Benchmark saturation then means that benchmarks no longer show particularly meaningful progress between each model generation. They also don't show meaningful differences between the models. And making this problem worse is the issue of benchmark maxing. Benchmark maxing refers to when

Starting point is 00:17:35 a lab trains the models specifically to beat the benchmark even if it has little relevance in the real world. This happens because the benchmarks are either completely known or semi-public, meaning model labs can train specifically for the test in order to have more impressive numbers when they come out. One common perception and critique of Chinese labs is benchmark maxing in the extreme, which frequently leaves their models with a huge gap between their benchmark scores and real-world performance. In February, a variant coding benchmark called SWE-rebench was released, containing a different set of problems, and most of the Chinese models dived in the rankings, suggesting they were specifically trained against the narrow set of SWE-bench Verified problems.
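For readers following along in text, here is a minimal sketch of the comparison being described: score the same models on a well-known problem set and on a fresh one, then look at how far each model drops and how its rank changes. The model names and scores below are hypothetical placeholders, not results from SWE-bench Verified or SWE-rebench, and `contamination_report` is a made-up helper name.

```python
# Hypothetical percent-solved scores; NOT real benchmark results.
public_scores = {"model_a": 80.0, "model_b": 78.0, "model_c": 76.0, "model_d": 74.0}
fresh_scores = {"model_a": 62.0, "model_b": 41.0, "model_c": 59.0, "model_d": 57.0}


def ranks(scores):
    """Map each model name to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def contamination_report(public, fresh):
    """For each model, report the score drop and rank change on the fresh set.

    A large drop combined with a positive rank change (a worse rank) is the
    pattern that suggests training against the public problem set.
    """
    public_rank = ranks(public)
    fresh_rank = ranks(fresh)
    return {
        name: {
            "drop": public[name] - fresh[name],
            "rank_change": fresh_rank[name] - public_rank[name],
        }
        for name in public
    }


if __name__ == "__main__":
    for name, row in contamination_report(public_scores, fresh_scores).items():
        print(name, row)
```

Every model loses some points on unseen problems, so the drop alone is not the signal. The signal is one model falling much further than its peers and sliding down the ranking.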

Starting point is 00:18:11 The Western models did drop as well, but not by nearly as much. Another example was Meta with the release of Llama 4 Maverick last April. Meta was accused of testing multiple model variants on LM Arena, which is a crowdsourced taste test platform for LLM performance. Platform users are presented with two samples and vote for the best one. Meta was accused of having tested models until they found the one that clicked most with users and launched as the second-ranked model on LM Arena. You will recall that when people got their hands on Llama 4,

Starting point is 00:18:37 they did not in almost any case think it was the second best model available. Between benchmark maxing and benchmark saturation, the net effect is the diminished significance of benchmarks as a tool for people to understand which models are good at what, and for what. Now, on top of all of that, there is just an inherent problem with traditional benchmarks. Most of these benchmarks tend to be narrowly focused on solving one particular type of task. Some are about recalling knowledge and some are about more complex skills, but they are focused on doing one thing within a very narrowly defined set. We've talked in some episodes this

Starting point is 00:19:07 week about the idea of task AGI, that at this point AI is really good at a huge array of knowledge work tasks, but where it struggles is in bringing tasks together. And in that light, it would be reasonable, I think, to say that while benchmarks might be good at demonstrating task AGI, they're not particularly useful in helping understand how AI does outside of that very narrow task. Math is a particularly good example of this, with last year's models basically solving the very narrow field of competition mathematics. This was demonstrated in the IMO gold medal performances from OpenAI and Google. That is, of course, a completely different skill set than real-world mathematics. To the extent the practical reality of deploying AI is

Starting point is 00:19:44 understanding and dealing with its jagged frontiers, most traditional benchmarks just aren't all that helpful with that. Now, everything I'm discussing today is a known, long-standing problem, and there have been a ton of attempts to fix benchmarks over the years. One of the brute force methods is simply making the questions harder. We've seen this with SWE-bench and GPQA, which remained relevant deep into 2025 by simply changing the difficulty level. This gave the benchmarks at least a little more life and kept them relevant for hill climbing performance, but it didn't really address the core underlying problem. There are also benchmarks that were switched out for more practical tests. A key example here is the transition from SWE-bench to Terminal-Bench as the major coding bench.

Starting point is 00:20:19 Terminal-Bench was intended to be a closer match to the way people actually use the models. It put models in a standard harness and tested their ability to use a terminal and other tools to solve coding problems. On some level, it was an improvement, but it still is dealing with saturation issues, and it also adds more complex variables. Particularly early on, for example, good coding models would fail tasks because they couldn't execute the tool calls properly. Another approach has been trying to simulate real-world tasks.

Starting point is 00:20:45 An early version of this idea was the SWE-Lancer benchmark, developed by OpenAI last February. It tested coding ability against real-world tasks from Upwork that paid an aggregate of a million dollars. This allowed OpenAI to express their models' coding ability in dollar terms. The spiritual successor was GDPval, released by OpenAI last September. It extended the real-world problem set beyond coding to encompass various types of white-collar work, like making spreadsheets and slide decks. One of the interesting quirks of GDPval was that it required the agent to build and deliver a polished work product. It quickly became clear that models were failing tasks not always because they couldn't do them, but because the tool calls were failing. Now, GDPval also has other

Starting point is 00:21:22 challenges. For example, OpenAI went out and actually worked with experienced professionals to do a combination of AI and human review. Other evaluators like Artificial Analysis have gone and modified GDPval to be a strictly automated AI-only version, and it remains one of the benchmarks that I think people are most interested in relative to all the others. Now, another major approach was looking at continuous agent performance, with METR's long-task benchmark being the most well known. This is the chart that we joked, during a lot of last year as the bubble talk was increasing, was effectively holding up the entire global market. The way that this test works involves giving models a set of coding problems that human coders could complete in a set interval of time,

Starting point is 00:22:00 ranging from a few minutes to several hours. The resulting chart has become one of the clearest demonstrations of model improvement. In the space of two years, we went from agents that could only complete tasks that take humans five minutes, in the case of GPT-4o, to agents that can complete tasks that take humans 10 hours, in the case of Opus 4.6. Now, the big problem with METR's test, and one that they've fully admitted, is that they're running out of tasks to test against. Their original task set included very few tasks that take more than a few hours. Now that agents can complete complex tasks that take 10 hours, METR is struggling to find a useful test set. Realistically, tasks that take human developers

Starting point is 00:22:35 10 hours aren't really tasks anymore. They're full-on software builds that introduce far more complexity into the test. In other words, METR can't really extend their benchmark without turning it into something fundamentally different, meaning that even this test is effectively saturated. Which brings us to ARC-AGI. It began as the ARC Prize in the summer of 2024, based on former Google computer scientist François Chollet's approach to measuring machine intelligence. Introducing the prize, ARC wrote at the time, AGI progress has stalled. New ideas are needed. Modern LLMs have shown to be great memorization engines. They are able to memorize high-dimensional patterns in their training data and apply those patterns into adjacent contexts.

Starting point is 00:23:12 This is also how their apparent reasoning capability works. LLMs are not actually reasoning. Instead, they memorize reasoning patterns and apply those reasoning patterns into adjacent contexts, but they cannot generate new reasoning based on novel situations. More training data lets you buy performance on memorization-based benchmarks, but memorization alone is not general intelligence. General intelligence is the ability to efficiently acquire new skills. ARC Prize's answer to this is a test that contains

Starting point is 00:23:39 a series of abstract visual logic puzzles. The tasks are presented as a series of colored squares on a grid, with squares added or removed according to a particular pattern. Two clues are given to teach the pattern, then the task is to apply that pattern to a problem square. For example, the problem might require a yellow square to be placed next to a line of blue squares in various orientations. These are problems that are relatively easy for humans to solve but proved to be difficult for LLMs. The tasks were also kept hidden so the logic couldn't be trained into the models. Instead, the test was trying to measure an LLM's ability to learn new logic within context and apply it to a novel problem. Basically, it set out to be a pure test of reasoning ability,

Starting point is 00:24:16 rather than memorization of how to reason. Early results were pretty compelling that this was a solid approach. At the time that ARC-AGI-1 was released, no models had come within 50% of human performance. Subsequent releases improved on this score, but the models seemed to be making genuine progress through reasoning. Then in December of 2024, OpenAI announced that a preview version of their o3 model had achieved a 76% score on low inference settings, exceeding the human score for the first time. On high settings, the score was 88%. The o3 model had been trained on the public dataset, but tested on a private dataset to achieve this score, so there was no risk the logic was trained into the model. ARC wrote at the time,

Starting point is 00:24:52 This is a surprising and important step function increase in AI capabilities, showing novel task adaptation ability never before seen in the GPT family models. At the same time, ARC announced that they would be updating their benchmark for 2025 with ARC-AGI-2. The new benchmark looked superficially similar to the first. It contained the same colored squares and was once again designed to be easy for humans and harder for LLMs. The key change was made to counteract the innovation that allowed the o-series models to outperform,

Starting point is 00:25:17 which is test-time compute. It kind of seems quaint now, but at the time, the idea of making a model reason for longer was a paradigm-shifting innovation. With o3, OpenAI had extended test-time compute enough to maintain context between problems and learn iteratively throughout the test. In order to pressure test this approach, ARC added a new twist to the problems. Rather than simply adding a square according to the pattern,

Starting point is 00:25:38 there were now three new styles of tests. The first was symbolic interpretation, where the LLM was tasked with interpreting more meaning within the symbols, for example, tasks where shapes needed to be colored differently according to how many holes they have. A second new set of tasks required applying multiple rules within the same problem set, which they called compositional reasoning. And a final new set of tasks added context to the problem, where logic was no longer universally applied, but depended on context. For example, shapes with a red border need to be shifted to the right, while shapes with a blue border need to be shifted to the left. Again, all of these problems remained fairly simple for humans, but the additional complexities were designed to overload

Starting point is 00:26:12 LLM context and test pure reasoning ability. The test held up well for most of 2025. Most model releases scored below 30%. At the very end of the year, and as this year got underway, things escalated dramatically. Gemini 3.1 Pro scored 77.1% at 96 cents per task in February. In March, Opus 4.6 achieved a 68.8% score. GPT-5.4 Pro achieved 83.3%. And Gemini 3 Deep Think is the current leader at 84.6% and $13.62 per task. Basically, once again, as the benchmark got saturated, we needed something new. Which gets us to ARC-AGI-3. In an X post introducing the test on Wednesday, ARC writes, Announcing ARC-AGI-3, the only unsaturated agentic general intelligence benchmark in the world. Humans score 100%, AI less than 1%. This human-AI gap demonstrates we do not yet have AGI.

Starting point is 00:27:04 Most benchmarks test what models already know. ARC-AGI-3 tests how they learn. Now, the test is a complete rethink of the ARC-AGI formula. The static grids of colored squares are gone. In their place, ARC has designed a series of 135 simple graphical games that require the LLM to manipulate the grid in real time. They have no instructions, so the model needs to explore the environment, figure out how it works, execute a plan, and adapt on the fly to what it sees. In their early testing, ARC observed models failing by mistaking one game for another, carrying over theories between games, and failing to forecast cause and effect.

Starting point is 00:27:35 ARC-AGI-3 gives us a formal measure to compare human and AI skill acquisition efficiency. Humans don't brute force, they build mental models, test ideas, and refine quickly. How close is AI to that? Spoiler, not close. And unlike ARC-AGI-2, we are starting at ground zero. None of the frontier models can complete this test with any level of competency, each scoring less than 1%. Google DeepMind's Xiao Ma shared one of Gemini's playbacks, which are all publicly available in the replay section of the ARC website. She wrote,

Starting point is 00:28:03 Poor Gemini straight thought it was playing Activision tennis. Now, not everyone is a fan of how this is set up. Lisan al Gaib, scaling01, writes, The scoring of ARC-AGI-3 doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans, actually using squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps, then the model gets a score of 1%. The implication, they write, is that scores are not comparable to the first two ARC tests. On the other end of the spectrum, AI researcher Brandon Hancock commented on the elegance of the benchmark.
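To make the squared-efficiency arithmetic above concrete, here is a minimal Python sketch. It assumes the score is simply the human-to-model step ratio, squared; the function name, the cap at 100%, and the input validation are illustrative assumptions and are not taken from ARC's published scoring rules.

```python
def squared_efficiency_score(human_steps: int, model_steps: int) -> float:
    """Hypothetical helper: square the ratio of human steps to model steps.

    Assumption: a model that needs fewer steps than the human is capped
    at 1.0 (100%). The real benchmark may handle this case differently.
    """
    if human_steps <= 0 or model_steps <= 0:
        raise ValueError("step counts must be positive")
    efficiency = min(1.0, human_steps / model_steps)
    return efficiency ** 2


# The example from the post: human needs 10 steps, model needs 100.
# Efficiency is 10 / 100 = 10%, and squaring it gives 1%.
print(f"{squared_efficiency_score(10, 100):.0%}")  # prints 1%
```

Under this reading, a model that takes twice as many steps as a human would score 25% rather than 50%, which is why the scores cannot be read as a simple count of levels completed.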

Starting point is 00:28:35 He writes, An alien species with zero knowledge of human language could ace ARC-AGI-3 on day one, and I think that's beautiful. At a time when AI is dominated by language models, it's refreshing to have a frontier benchmark, the only one that I'm aware of, that requires zero language ability or cultural knowledge to solve. Intelligent does not mean speaks English or speaks Python. I'm reminded of classic first-encounter sci-fi storylines where intelligent species are able to communicate well before they hash out a common spoken or written language, simply based on universal math, science, and reasoning concepts. AI has gotten complex enough that it behaves much more like an alien species than a next

Starting point is 00:29:08 token predictor at this point. François Chollet, one of the creators of ARC-AGI, warned that this won't be the one benchmark to rule them all, commenting, ARC-AGI is not a final exam that you pass to claim AGI. The benchmarks target the residual gap between what's hard for AI and what's easy for humans. It's meant to be a tool to measure AGI progress and to drive researchers towards the most important open problems on the way to AGI. So it's a moving target designed to track the frontier. As AI evolves, the benchmark evolves to spotlight the exact problems we haven't solved yet. And I think maybe that's the big takeaway. When it comes to the idea of trying to quote-unquote solve benchmark saturation, probably the simplest answer

Starting point is 00:29:46 is not assuming that benchmarks are going to last all that long. Just as we need innovation in the way that we build these models, we're going to need innovation in the way that we measure them. It'll be interesting to see how fast we have models that actually jump from under 1% to some meaningful percent on ARC-AGI-3, but of course before long, we'll need some other new thing to measure some other new capability. For now, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always, and until next time, peace.
