The AI Daily Brief: Artificial Intelligence News and Analysis - Why AI Needs Better Benchmarks
Episode Date: March 26, 2026
AI benchmarks are breaking: saturated, gamed, and increasingly disconnected from real-world performance. This episode explores why that’s happening and how new tests like ARC AGI 3 aim to measure a...ctual learning and reasoning instead of memorization. In the headlines: Apple’s deeper Gemini plans, a major efficiency breakthrough from Google, and rising political tension around AI infrastructure.
Brought to you by:
KPMG - Agentic AI is powering a potential $3 trillion productivity shift, and KPMG’s new paper, Agentic AI Untangled, gives leaders a clear framework to decide whether to build, buy, or borrow. Download it at www.kpmg.us/Navigate
Mercury - Modern banking for business and now personal accounts. Learn more at https://mercury.com/personal-banking
Recall - The API for meeting recording. Get started today with $100 in free credits at https://www.recall.ai/aidb
AIUC-1 - Get your agents certified to communicate trust to enterprise buyers - https://www.aiuc-1.com/
Blitzy - Want to accelerate enterprise software development velocity by 5x? https://blitzy.com/
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Our Newsletter is BACK: https://aidailybrief.beehiiv.com/
Interested in sponsoring the show? sponsors@aidailybrief.ai
Transcript
Today on the AI Daily Brief, why AI needs better benchmarks.
And before that in the headlines,
is Apple planning on distilling Google's Gemini models?
The AI Daily Brief is a daily podcast and video
about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors,
KPMG, Robots & Pencils, Blitzy, and Superintelligent.
To get an ad-free version of the show,
go to patreon.com slash AI Daily Brief,
or you can subscribe on Apple Podcasts.
If you are interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai.
And while you're at aidailybrief.ai, check out everything going on in the ecosystem,
including the return of our newsletter, which has all the links that I mentioned in the show.
Apple's AI partnership with Google apparently goes much deeper than previously thought,
including the ability to distill Gemini into smaller models.
The unveiling of the new AI Siri is a little over two months away,
and we're starting to get a steady drip of information around what the product will look like.
On Tuesday, Bloomberg's Apple insider Mark Gurman ran through what he knows about features and UX.
Apple has reportedly backed down on their view that Siri should remain voice-only,
now building a standard chatbot interface with optional voice controls.
Gurman also reported that Siri will be deeply integrated into iOS 27,
allowing it to take actions and draw context from apps running on a user's device.
It sounds as though Apple will try to launch Siri with full computer use,
delivering the features they advertised with the launch of Apple Intelligence two years ago.
Now, we already knew that Siri would be driven by Google's Gemini models,
but new reporting from The Information suggests that Apple has much more freedom in how they use Gemini than originally thought.
Previous reports said that Apple would fine-tune a Gemini model for their purposes,
and that the models would be hosted on Apple's servers to ensure user privacy.
However, sources speaking with The Information said that Apple has full access to the Gemini models,
meaning they're able to distill large versions of Gemini into their own smaller proprietary models.
Model distillation is the process of using the reasoning traces from one model to train another,
essentially a cheat code to develop powerful models.
Many of the Chinese labs have been accused of distilling models from Anthropic and OpenAI
as a way to catch up quickly.
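To make the idea a little more concrete, here is a minimal sketch of one classic form of distillation, in which a student model is trained to match a teacher's softened output probabilities. This is the textbook soft-label variant rather than the reasoning-trace approach described above, and nothing here reflects what Apple or Google has confirmed doing. The function names, the example logits, and the temperature value are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token scores over a three-word vocabulary.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.0, 1.0]
print(distillation_loss(teacher, student))  # positive: student disagrees with teacher
print(distillation_loss(teacher, teacher))  # zero: nothing left to learn
```

Training the student means nudging its weights to push this loss toward zero across many examples.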
The Information's sources said the process isn't straightforward, as Apple's vision for Siri is very
different from the way Gemini works.
Gemini is optimized for chatbots, enterprise tasks, and coding, while the source implied
Apple is less interested in these functions.
The source was skeptical the models would actually be that much use to Apple's foundation
models team for that reason. Maybe the main takeaway is that Apple hasn't entirely given up on training
their own models and could use the Google partnership to bootstrap their approach. The most obvious
target in most people's minds would be training small, capable models to run locally on an iPhone,
which seems to be the core vision of where Apple wants to go with AI. Ethan Mollick sums up the
wait-and-see attitude of most people on this news, tweeting, huh, I'm not sure distilling Gemini
models to run on phones is going to result in the generally capable agents that people will
expect soon, but we shall see. Speaking of Google,
the company has published a research paper describing a new compression algorithm that could
dramatically improve the performance of small models.
Called TurboQuant, the process allows researchers to quantize model context with almost zero losses.
During long conversations or long horizon tasks, context can bloat to use even more memory than model weights.
Functionally, quantization means context is stored with less fidelity.
For example, 16-bit data might be compressed into 4-bit.
Current quantization methods are quite lossy and noticeably reduce performance.
Some believe, for example, that this is the reason Anthropic's models can seem a little off during demand spikes.
Google researchers say their new process massively reduces the loss associated with quantization
and could make the technique far less of a trade-off.
They claim their process results in a 6x reduction of the amount of memory a given model uses for storing context,
while delivering an 8x speed boost compared to current methods.
This could result in a 50% reduction in inference costs and help ease the bottleneck around memory chips.
Giving a concrete demonstration of what the algorithm can do,
Google's researchers tested it on Llama 3.1 8B and Mistral 7B. With TurboQuant implemented,
both models achieved perfect scores on needle-in-a-haystack tests.
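For a sense of what "stored with less fidelity" means, here is a minimal sketch of naive round-to-nearest quantization down to 4 bits. To be clear, this is not TurboQuant, whose details are not covered in this episode. It is the simple baseline idea, and the function names and sample values are made up for illustration.

```python
def quantize_4bit(values):
    """Map floats to integer codes 0..15 using one shared min/max range."""
    lo, hi = min(values), max(values)
    # 4 bits give 16 levels, so 15 steps span the range; fall back to 1.0
    # when all values are identical to avoid dividing by zero.
    scale = (hi - lo) / 15 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


# Made-up values standing in for a slice of stored context.
context = [0.0, 0.1, 0.5, 1.0]
codes, lo, scale = quantize_4bit(context)
restored = dequantize(codes, lo, scale)

# The rounding error per value is bounded by about half a quantization step.
worst_error = max(abs(a - b) for a, b in zip(context, restored))
print(codes, worst_error)

# Going from 16 bits to 4 bits per value is a 4x cut in raw storage
# (ignoring the small overhead of keeping lo and scale around).
print(16 / 4)
```

The lossiness discussed above is that rounding error: with only 16 levels, many distinct values collapse onto the same code.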
Cloudflare CEO Matthew Prince tried to explain the gravity of this breakthrough, commenting,
this is Google's DeepSeek. So much more room to optimize AI inference for speed, memory
usage, power consumption, and multi-tenant utilization.
Others reached for a more relatable analogy, comparing this moment to when a scrappy startup
cracked middle-out compression with a Weissman score of 5.2.
Chevang writes, so basically, TurboQuant is Pied Piper.
Now, Google isn't just shipping groundbreaking research, they also have a new music model
with Lyria 3 Pro.
The first version of Lyria 3 was to some folks underwhelming.
It wasn't that the model was bad, it just couldn't produce production quality music like
Suno and was limited to 30 seconds, making it seem like it was more for novelty purposes
than professional use cases.
Lyria 3 Pro definitely addresses some of those issues.
It can now create full tracks up to three minutes long and seems to have a much better
understanding of lyrics and song structure. Rohan Paul writes,
The hard part in AI music is not making pleasant sound for 10 seconds. It's keeping a piece
coherent as it moves from intro to verse to chorus without collapsing into a loop. Now, Rohan noted
that Google is also pushing it in Vertex AI, AI Studio, and the Gemini app, so the bigger story
is probably less about the model, and more the fact that it is available via API,
which could mean it finds its way into a lot more use cases. Over in the world of AI
politics, Senator Bernie Sanders has unveiled his data center moratorium bill with an assist
from AOC. The legislation would pause all data center construction nationwide until strong national
safeguards, their words, are in place. The bill requires Congress to establish protections for workers
and consumers, address environmental harms, and defend civil rights before lifting the moratorium.
Sanders said, AI has received far too little serious discussion here in our nation's capital.
I fear that Congress is totally unprepared for the magnitude of the changes that are already
taking place. Now, the presence of AOC as a co-sponsor seems fairly relevant. Until now, Sanders has been
pushing the moratorium largely by himself with support from certain elements of the AI safety community.
It hadn't found meaningful traction among elected progressives. AOC personally has been pretty much
silent on the issue. Her X feed has zero mentions of data centers and only a single post about
AI regarding the dangers of deepfakes. By supporting the bill, AOC is declaring a position for the
broader progressive movement and could at least theoretically carry that position into a presidential run
in 2028. Meanwhile, the bill seems to have very little support from mainstream Democrats. Senator Mark Warner,
for example, said the idea was, in his words, idiocy. He continued,
A data center moratorium simply means China is going to move quicker. The idea that we're going
to stuff this back into the bottle, that's a ridiculous premise. Now, despite thinking the
moratorium is the wrong solution, Warner certainly still has strong views on AI policy.
He's currently supporting a bill to codify Anthropic's red lines around using AI for domestic
surveillance and autonomous weaponry. Referring to Secretary of War Pete Hegseth, he added,
those should be policy decisions not left to a single individual. Warner also raised alarm
about AI job replacement, commenting, the recent college graduate unemployment is 9%.
I'll bet anyone in the room it goes to 30 or 35% before 2028. He said he now believes the scope
of the economic disruption is going to be exponentially larger than he thought just a few months ago.
Commenting on the Sanders AOC policy, James Rosenberg writes,
I see why it's called populism now, never liking the term. Every part of this is detrimentally
performative. It's arbitrary AF. The ban on upgrades means no energy efficiency or sustainability
improvements can happen. There is nothing progressive about it.
On the other end of the spectrum, New York Times tech reporter Mike Isaac writes,
people can certainly take issue with his positions and plan of action,
but Sanders seems to be one of the few members of Congress seriously reckoning
with what the labor consequences of the coming AI age could be,
now joined by AOC.
One of the things that I'm not sure about is the extent to which a moratorium is A,
something that Bernie and AOC actually think is good policy,
B, something they think is good politics,
given increasing American antipathy towards data centers in their communities,
or C, a way to anchor the conversation on the far end of one extreme, so there's more room to
find compromise in the middle. One can certainly hope it's C, but right now it's not at all
clear. Now, speaking of the China boogeyman, in our final story, Manus co-founders have been banned
from leaving China as the CCP cracks down. The Financial Times reports that Manus's CEO and chief
scientist have both been barred from leaving the country, while Meta's $2 billion acquisition is reviewed.
We heard rumblings of this earlier in the month, as Manus and Meta executives were summoned to Beijing for
a meeting with regulators. The theory of the case is that Manus circumvented China's export controls on tech
by relocating their headquarters from Beijing to Singapore. CEO Xiao Hong and chief scientist
Ji Yichao reportedly attended the meeting and were told after its conclusion that they
would be unable to leave China but were free to travel within the country. Sources said that no
formal investigation has been opened and no charges have been brought, but Manus is said to be seeking
legal representation to help resolve the issue. The entire situation is messy because it deals with
the intersection of actual laws and the unspoken rules that
govern doing business in China. China has strict laws to control foreign investment and export of technology.
However, both Manus and Meta maintained their transaction was in full compliance. The relocation of the
headquarters is an obvious gray area, made even more gray by the fact that Manus still maintains an
offshore entity, which was used to develop early versions of the product. As for the unspoken rules,
Chinese officials have become increasingly concerned about losing AI talent and technology to the
West. They've even adopted the euphemism of selling young crops to describe the poaching of human
capital and strategic industries. Sources suggested that the extreme outcome would be a forced
unwind of the Meta deal, but noted that that would be messy because the technology is already
being integrated into Meta's platforms. What's more, this isn't the only sign that Beijing
is tightening its grip on its domestic AI industry. AI researcher Tao Hu shared that the China
Computer Federation has warned researchers not to participate in the NeurIPS conference.
Chinese entrepreneur Lina Hua argued that this was all to be expected, writing,
they thought they were being clever for circumventing China's tech export controls, but you don't mess with
the CCP like that. You will be made an example of so others don't get tempted to betray the
motherland. So what's going to happen? China won't jail them because they don't want to look evil.
Instead, they're going to freeze the founders' assets in China and give them a travel ban while the
quote-unquote probe is ongoing. The probe will likely be deliberately prolonged to inflict psychological
damage, create uncertainty for potential copycats and make the public forget about this case.
And once the topic is out of the public's mind, CCP going to strike hard with a financial penalty
that wipes out most of their gains and then soft blacklist them in China.
Bill Bishop, the host of Sinocism, writes,
I didn't think the Manus top execs would be so naive as to go back to the PRC.
Expect they will have to spit back out a lot of what they made.
On the flip side, some Western observers think the crackdown will probably backfire.
Former White House advisor Dean Ball commented,
If we were smart, we'd see this as a major self-own by China, as natsec-brained public policy so often is.
The message the government is sending is,
If you ever want to found a company, especially one that makes money on software, move to Singapore first.
Easier to get GPUs, too.
Never a dull day in AI, but for now, that does it for the headlines.
Next up, the main episode.
All right, folks, quick pause.
Here's the uncomfortable truth.
If your enterprise AI strategy is we bought some tools, you don't actually have a strategy.
KPMG took the harder route and became their own client zero.
They embedded AI and agents across the enterprise, how work gets done, how teams collaborate, how decisions
move, not as a tech initiative, but as a total operating model shift. And here's the real unlock.
That shift raised the ceiling on what people could do. Humans stayed firmly at the center,
while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more
capable, more empowered workforce. If you want to understand what that actually looks like in the
real world, go to www.kpmg.us slash AI. That's www.kpmg.us slash AI.
Today's episode is brought to you by Robots & Pencils, a company that is growing fast.
Their work as a high-growth AWS and Databricks partner means that they're looking for elite
talent ready to create real impact at velocity.
Their teams are made up of AI-native engineers, strategists, and designers who love solving
hard problems and pushing how AI shows up in real products.
They move quickly using RoboWorks, their agentic acceleration platform, so teams can deliver
meaningful outcomes in weeks, not months.
They don't build big teams.
They build high-impact, nimble ones.
The people there are wicked smart, with patents, published research, and work that's helped
shape entire categories.
They work in Velocity Pods and studios that stay focused and move with intent.
If you're ready for career-defining work with peers who challenge you and have your back,
Robots & Pencils is the place.
Explore open roles at robotsandpencils.com slash careers.
That's robotsandpencils.com slash careers.
Want to accelerate enterprise software development velocity by 5X?
You need Blitzy, the only autonomous software
development platform built for enterprise codebases. Your engineers define the project,
a new feature, refactor, or greenfield build. Blitzy agents first ingest and map your entire
codebase, then the platform generates a bespoke agent action plan for your team to review
and approve. Once approved, Blitzy gets to work autonomously generating hundreds of thousands of
lines of validated end-to-end tested code. More than 80% of the work completed in a single run.
Blitzy is not generating code, it's developing software at the speed of compute. Your
engineers review, refine, and ship. This is how Fortune 500 companies are compressing
multi-month projects into a single sprint, accelerating engineering velocity by 5X.
Experience Blitzy firsthand at blitzy.com. That's B-L-I-T-Z-Y dot com.
It is a truth universally acknowledged that if your enterprise AI strategy is trying to buy
the right AI tools, you don't have an enterprise AI strategy. Turns out that AI adoption is
complex. It involves not only use cases, but systems integration, data foundations,
outcome tracking, people and skills, and governance. My company, Superintelligent, provides
voice agent-driven assessments that map your organizational maturity against industry benchmarks across
all of these dimensions. If you want to find out more about how that works, go to besuper.ai.
And when you fill out the get-started form, mention maturity maps. Again, that's besuper.ai.
Welcome back to the AI Daily Brief. Today we are looking at the launch of ARC-AGI-3.
It's a new benchmark from ARC Prize that is specifically designed to test the interactive reasoning
capability of AI agents. Now, it is the latest in a sequence of benchmarks that are meant to deal
with some of the problems of benchmarks, but to better understand what they are trying to respond to,
it's worth going back and actually understanding what the purpose of benchmarks is, what the problems
with them are, and how people have tried to address those problems.
Benchmarks are effectively two things. They're a way to compare AI's performance in various
areas, as well as a way to see how models are progressing over time. Historically, there have been
two major categories of benchmarks that you see included with every new model release. The two categories
are benchmarks around knowledge and benchmarks around function.
Knowledge was the first big hill to climb,
with benchmarks like MMLU for general knowledge,
and GPQA, which measures scientific knowledge.
Over time, more difficult benchmarks were introduced,
like Humanity's Last Exam,
which features obscure knowledge not typically found in the training data.
As models developed, however, function became more important.
SWE-bench is one of the best known of the functional benchmarks,
testing the knowledge required to solve typical coding problems from GitHub.
As agentic coding has risen in importance in the AI space,
Terminal-Bench has arguably overtaken SWE-bench as the most important coding benchmark.
Terminal-Bench tests not only coding reasoning, but also the model's ability to use a terminal.
And interestingly, a lot of benchmarks have followed this pattern, starting off as a test
of knowledge and then implicitly or explicitly also adding an element of testing functional capacity.
Humanity's Last Exam, for example, began as a pure test of pre-trained knowledge, but now it's
typically measured with web search tools enabled, making it a proxy for competency and tool use as well.
Now, very early on, in the modern post-ChatGPT era of AI, benchmark saturation became a problem.
All the way back in May of 2024 with the release of GPT-4o, all major models were already above
80% on MMLU, with GPT-4o scoring 88.7%. Now, at the time, some other benchmarks were a little
bit less saturated. 4o was a big breakout, for example, in GPQA, scoring 53.6%. But of course,
with all of these benchmarks, it was only a matter of time. By last summer, the saturation problem
had gotten much worse. At the time, o3 was OpenAI's daily driver. More difficult questions had
been added to GPQA Diamond, and o3 still achieved 83.3% without using tools. By that stage,
most of the 2024 benchmarks had been abandoned or updated because of saturation. For example,
the MATH benchmark was long gone, replaced by the AIME math test, which uses questions
from a real-world math competition. o3 would score 88.9% on AIME math, foreshadowing that a specifically
trained OpenAI model would achieve a gold medal performance at the International Math Olympiad a few months
later. Fast forward to today, and once again, many of these benchmarks are getting saturated.
GPT-5.4 is now up to 52.1% on Humanity's Last Exam with tools and 39.8% without,
which is very close to Opus 4.6's 53% and 40%, respectively. SWE-bench was once again upgraded, with
GPT-5.4 scoring 57.7% on SWE-bench Pro. For Opus 4.6, Anthropic reported 81.4% on SWE-bench
Verified, but chose to highlight Terminal-Bench 2.0 more prominently, where they scored 65.4%.
Now, it's difficult to keep track of all these numbers, but this chart shows an example of how
performance on SWE-bench Verified progressed over the past year. Models from Anthropic, Google,
OpenAI, and MiniMax, who produced the chart, are basically all up and to the right.
They each began at different points in the middle of 2025, ranging from 55 to 70%. However, they've all
arrived now near 80%. Benchmark saturation then means that benchmarks no longer show particularly
meaningful progress between each model generation. They also don't show meaningful differences between
the models. And making this problem worse is the issue of benchmark maxing. Benchmark maxing refers to when
a lab trains the models specifically to beat the benchmark even if it has little relevance in the real
world. This happens because the benchmarks are either completely known or semi-public, meaning model labs can train
specifically for the test in order to have more impressive numbers when they come out. One common
perception and critique of Chinese labs is benchmark maxing in the extreme, which frequently leaves their models
with a huge gap between their benchmark scores and real-world performance.
In February, a variant coding benchmark called SWE-rebench was released, containing a different
set of problems, and most of the Chinese models dived in the rankings, suggesting they were
specifically trained against the narrow set of SWE-bench Verified problems.
The Western models did drop as well, but not by nearly as much.
Another example was Meta with the release of Llama 4 Maverick last April.
Meta was accused of testing multiple model variants on LM Arena, which is a crowdsourced
taste test platform for LLM performance.
Platform users are presented with two samples and vote for the best one.
Meta was accused of having tested models until they found the one that clicked most with users
and launched as the second-ranked model on LM Arena.
You will recall that when people got their hands on Llama 4,
they did not in almost any case think it was the second best model available.
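As a rough illustration of how pairwise votes like these can be turned into a ranking, here is a minimal sketch of an Elo-style rating update. This is a generic technique, offered as an assumption about the general shape of such leaderboards rather than a description of LM Arena's actual scoring method. The starting ratings and the K-factor of 32 are arbitrary choices.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, a_won, k=32):
    """Shift both ratings after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; a user prefers model A's sample.
model_a, model_b = update(1000, 1000, a_won=True)
print(model_a, model_b)
```

A scheme like this rewards whatever wins votes, which is why tuning a variant to what users click on can lift a ranking without making the model more useful.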
Between benchmark maxing and benchmark saturation,
the net effect is the diminished significance of benchmarks
as a tool for people to understand what models are good for and good at.
Now, on top of all of that, there is just an inherent problem with
traditional benchmarks. Most of these benchmarks tend to be narrowly focused on solving one particular
type of task. Some are about recalling knowledge and some are about more complex skills, but they are
focused on doing one thing within a very narrowly defined set. We've talked in some episodes this
week about the idea of task AGI, that at this point AI is really good at a huge array of
knowledge work tasks, but where it struggles is in bringing tasks together. And in that light,
it would be reasonable, I think, to say that while benchmarks might be good at demonstrating
task AGI, they're not particularly useful in helping understand how AI does outside of that
very narrow task. Math is a particularly good example of this, with last year's models
basically solving the very narrow field of competition mathematics. This was demonstrated in the
IMO gold medal performances from OpenAI and Google. That is, of course, a completely different
skill set than real-world mathematics. To the extent the practical reality of deploying AI is
understanding and dealing with its jagged frontiers, most traditional benchmarks just aren't all that
helpful with that. Now, everything I'm discussing today are known long-standing problems, and there have been
a ton of attempts to fix benchmarks over the years. One of the brute force methods is simply making the
questions harder. We've seen this with SWE-bench and GPQA, which remained relevant deep into 2025
by simply changing the difficulty level. This gave the benchmarks at least a little more life
and kept them relevant for hill climbing performance, but it didn't really address the core
underlying problem. There are also benchmarks that were switched out for more practical tests.
A key example here is the transition from SWE-bench to Terminal-Bench as the major coding bench.
Terminal-Bench was intended to be a closer match to the way people actually use the models.
It put models in a standard harness and tested their ability to use a terminal and other tools
to solve coding problems.
On some level, it was an improvement, but it still is dealing with saturation issues, and
it also adds more complex variables.
Particularly early on, for example, good coding models would fail tasks because they couldn't
execute the tool calls properly.
Another approach has been trying to simulate real-world tasks.
An early version of this idea was the SWE-Lancer benchmark, developed by OpenAI last
February. It tested coding ability against real-world tasks from Upwork that paid an aggregate of a
million dollars. This allowed OpenAI to express their models' coding ability in dollar terms.
The spiritual successor was GDPval, released by OpenAI last September. It extended the real-world
problem set beyond coding to encompass various types of white-collar work, like making spreadsheets
and slide decks. One of the interesting quirks of GDPval was that it required the agent to build
and deliver a polished work product. It quickly became clear that models were failing tasks not always
because they couldn't do them, but because the tool calls were failing. Now, GDPval also has other
challenges. For example, OpenAI went out and actually worked with experienced professionals
to do a combination of AI and human review. Other evaluators like Artificial Analysis have gone
and modified GDPval to be a strictly automated AI-only version, and it remains one of the
benchmarks that I think people are most interested in relative to all the others. Now, another major
approach was looking at continuous agent performance, with METR's long-task benchmark being the most
well known. This is the chart that we joked, during a lot of last year as the bubble talk was increasing,
was effectively holding up the entire global market. The way that this test works involves giving
models a set of coding problems that human coders could complete in a set interval of time,
ranging from a few minutes to several hours. The resulting chart has become one of the clearest
demonstrations of model improvement. In the space of two years, we went from agents that could only
complete tasks that take humans five minutes, in the case of GPT-4o, to agents that can complete tasks
that take humans 10 hours, in the case of Opus 4.6. Now, the big problem with METR's test,
and one that they've fully admitted, is that they're running out of tasks to test against.
Their original task set included very few tasks that take more than a few hours.
Now that agents can complete complex tasks that take 10 hours,
METR is struggling to find a useful test set. Realistically, tasks that take human developers
10 hours aren't really tasks anymore. They're full-on software builds that introduce far more
complexity into the test. In other words, METR can't really extend their benchmark without turning
it into something fundamentally different, meaning that even this test is effectively saturated.
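As a back-of-the-envelope check on how steep that curve is, here is a small sketch that takes the round figures quoted above at face value (five-minute tasks to ten-hour tasks over roughly two years) and works out the implied doubling time. METR's own published methodology is more involved than this, so treat the function name and the inputs as illustrative assumptions rather than their calculation.

```python
import math


def doubling_time_months(start_minutes, end_minutes, elapsed_months):
    """Months per doubling of task length, assuming steady exponential growth."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


# 5 minutes -> 10 hours (600 minutes) over about 24 months.
months = doubling_time_months(5, 600, 24)
print(round(months, 2))  # roughly 3.47 months per doubling
```

Going from 5 to 600 minutes is a 120x increase, or just under seven doublings, which is why the task set runs out so quickly.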
Which brings us to ARC-AGI. It began as the ARC Prize in the summer of 2024, based on former
Google computer scientist François Chollet's approach to measuring machine intelligence. Introducing the
prize, ARC wrote at the time, AGI progress has stalled. New ideas are needed. Modern LLMs
have shown to be great memorization engines. They are able to memorize high-dimensional patterns
in their training data and apply those patterns into adjacent contexts.
This is also how their apparent reasoning capability works.
LLMs are not actually reasoning.
Instead, they memorize reasoning patterns and apply those reasoning patterns into adjacent contexts,
but they cannot generate new reasoning based on novel situations.
More training data lets you buy performance on memorization-based benchmarks,
but memorization alone is not general intelligence.
General intelligence is the ability to efficiently acquire new skills.
ARC Prize's answer to this is a test that contains
a series of abstract visual logic puzzles. The tasks are presented as a series of colored
squares on a grid, with squares added or removed according to a particular pattern. Two clues
are given to teach the pattern, then the task is to apply that pattern to a problem square. For example,
the problem might require a yellow square to be placed next to a line of blue squares in various
orientations. These are problems that are relatively easy for humans to solve but proved to be
difficult for LLMs. The tasks were also kept hidden so the logic couldn't be trained into the models.
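To give a feel for the puzzle format just described, here is a toy sketch: two demonstration pairs teach a rule, candidate rules are checked against them, and the surviving rule is applied to a new grid. The grids, the color codes, and both candidate rules are invented for illustration and are not taken from the actual ARC-AGI task set.

```python
EMPTY, BLUE, YELLOW = 0, 1, 4  # arbitrary integer codes for cell colors


def add_yellow_after_blue(grid):
    """Hypothetical rule: place a yellow cell just right of each blue run."""
    out = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c in range(len(row) - 1):
            if row[c] == BLUE and row[c + 1] == EMPTY:
                out[r][c + 1] = YELLOW
    return out


def mirror(grid):
    """A decoy rule: flip every row left to right."""
    return [row[::-1] for row in grid]


def fits(rule, demos):
    """A rule is consistent if it reproduces every demonstration output."""
    return all(rule(inp) == out for inp, out in demos)


# Two demonstrations that teach the pattern.
demos = [
    ([[1, 1, 0, 0]], [[1, 1, 4, 0]]),
    ([[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [1, 4, 0]]),
]
candidates = {"mirror": mirror, "add_yellow_after_blue": add_yellow_after_blue}
consistent = [name for name, rule in candidates.items() if fits(rule, demos)]

# Apply the surviving rule to an unseen test grid.
test_input = [[0, 1, 1, 1, 0]]
prediction = candidates[consistent[0]](test_input)
print(consistent, prediction)
```

The hard part for a model is that the candidate rules are not handed over; it has to infer the rule itself from the two examples.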
Instead, the test was trying to measure an LLM's ability to learn new logic within context and
apply it to a novel problem. Basically, it set out to be a pure test of reasoning ability,
rather than memorization of how to reason. Early results were pretty compelling that this
was a solid approach. At the time that ARC-AGI-1 was released, no models had come within
50% of human performance. Subsequent releases improved on this score, but the models seemed
to be making genuine progress through reasoning. Then in December of 2024, OpenAI announced that
a preview version of their o3 model had achieved a 76% score on low inference settings,
exceeding the human score for the first time. On high settings, the score was 88%. The o3 model
had been trained on the public dataset, but tested on a private dataset to achieve this score,
so there was no risk the logic was trained into the model. ARC wrote at the time,
this is a surprising and important step function increase in AI capabilities, showing novel
task adaptation ability never before seen in the GPT family models. At the same time, ARC announced
that they would be updating their benchmark for 2025 with ARC-AGI-2.
The new benchmark looked superficially similar to the first.
It contained the same colored squares
and was once again designed to be easy for humans and harder for LLMs.
The key change was made to counteract the innovation
that allowed the o-series models to outperform,
which is test-time compute.
It kind of seems quaint now, but at the time,
the idea of making a model reason for longer
was a paradigm-shifting innovation.
With o3, OpenAI had extended test-time compute enough
to maintain context between problems
and learn iteratively throughout the test. In order to pressure test this approach,
ARC added a new twist to the problems. Rather than simply adding a square according to the pattern,
there were now three new styles of tests. The first was symbolic interpretation, where the LLM was tasked
with interpreting more meaning within the symbols, for example, tasks where shapes needed
to be colored differently according to how many holes they have. A second new set of tasks
required applying multiple rules within the same problem set, which they called compositional reasoning.
And a final new set of tasks added context to the problem, where logic was no longer universally
applied, but depended on context. For example, shapes with the red border need to be shifted to the
right, while shapes with the blue border need to be shifted to the left. Again, all of these problems
remained fairly simple for humans, but the additional complexities were designed to overload
LLM context and test pure reasoning ability. The test held up well for most of 2025. Most model
releases scored below 30%. At the very end of the year, and as this year got underway,
things escalated dramatically. Gemini 3.1 Pro scored 77.1% at $0.96
per task in February. In March, Opus 4.6 achieved a 68.8% score. GPT-5.4 Pro achieved 83.3%. And Gemini 3 Deep Think is the
current leader at 84.6% and $13.62 per task. Basically, once again, as the benchmark got saturated,
we needed something new. Which gets us to ARC-AGI-3. In an X post introducing the test on Wednesday,
ARC writes, announcing ARC-AGI-3, the only unsaturated agentic general intelligence benchmark in the world.
Humans score 100%, AI less than 1%. This human-AI gap demonstrates we do not yet have AGI.
Most benchmarks test what models already know. ARC-AGI-3 tests how they learn.
Now, the test is a complete rethink of the ARC-AGI formula. The static grids of colored squares
are gone. In their place, ARC has designed a series of 135 simple graphical games that require
the LLM to manipulate the grid in real time. They have no instructions,
so the model needs to explore the environment, figure out how it works, execute a plan,
and adapt on the fly to what it sees.
In their early testing, ARC observed models failing by mistaking one game for another, carrying
over theories between games and failing to forecast cause and effect.
ARC-AGI-3 gives us a formal measure to compare human and AI skill acquisition efficiency.
Humans don't brute force, they build mental models, test ideas, and refine quickly.
How close AI is to that?
Spoiler, not close.
And unlike ARC-AGI-2, we are starting at ground zero.
None of the frontier models can complete this test with any level of competency, each scoring less than 1%.
Google DeepMind's Xiaomah shared one of Gemini's playbacks, which are all publicly available in the replay section of the ARC website.
She wrote,
Poor Gemini straight thought it was playing Activision tennis.
Now, not everyone is a fan of how this is set up.
Lisan al Gaib, scaling01, writes,
The scoring of ARC-AGI-3 doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans,
actually using squared efficiency,
meaning if a human took 10 steps to solve it and the model 100 steps, then the model gets a score of 1%.
The implication, they write, is that scores are not comparable to the first two ARC tests.
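To make the arithmetic in that example concrete, here is a minimal Python sketch of squared efficiency as the post describes it: the ratio of human steps to model steps, squared. The function name, the cap at 100%, and the input check are assumptions made for illustration. This is not ARC's published scoring code, and the official formula may differ in detail.

```python
def squared_efficiency_score(human_steps: int, model_steps: int) -> float:
    """Score one level as (human_steps / model_steps) squared.

    Hypothetical helper based only on the description quoted above.
    Assumption: a model that needs fewer steps than the human baseline
    is capped at 1.0 rather than scoring above 100%.
    """
    if human_steps <= 0 or model_steps <= 0:
        raise ValueError("step counts must be positive")
    efficiency = min(1.0, human_steps / model_steps)
    return efficiency ** 2


# The example from the post: human takes 10 steps, model takes 100.
score = squared_efficiency_score(human_steps=10, model_steps=100)
print(f"{score:.0%}")  # prints "1%"
```

Squaring is what makes the penalty steep: a model that is ten times less efficient than a human is scored at 1%, not 10%, even though it did complete the level.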
On the other end of the spectrum, AI researcher Brandon Hancock commented on the elegance of the benchmark.
He writes,
An alien species with zero knowledge of human language could ace ARC-AGI-3 on day one, and I think that's beautiful.
At a time when AI is dominated by language models, it's refreshing to have a frontier benchmark,
the only one that I'm aware of, that requires zero language ability or cultural knowledge to solve.
Intelligent does not mean speaks English or speaks Python. I'm reminded of classic first-encounter
sci-fi storylines where intelligent species are able to communicate well before they hash out a
common spoken or written language, simply based on universal math, science, and reasoning concepts.
AI has gotten complex enough that it behaves much more like an alien species than a next
token predictor at this point. François Chollet, one of the creators of ARC-AGI, warned that this
won't be the one benchmark to rule them all, commenting,
Reminder, ARC-AGI is not a final exam that you pass to claim AGI. The benchmarks target the residual
gap between what's hard for AI and what's easy for humans. It's meant to be a tool to measure
AGI progress and to drive researchers towards the most important open problems on the way to AGI.
So it's a moving target designed to track the frontier. As AI evolves, the benchmark evolves
to spotlight the exact problems we haven't solved yet. And I think maybe that's the big takeaway.
The idea of trying to quote unquote solve benchmark saturation is probably as simple
as not assuming that benchmarks are going to last all that long. Just as we need innovation in the way
that we build these models, we're going to need innovation in the way that we measure them. It'll be
interesting to see how fast we have models that actually jump from one to some meaningful percent
on ARC-AGI-3, but of course before long, we'll need some other new thing to measure some other new capability.
For now, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching,
as always, and until next time, peace.
