The AI Daily Brief: Artificial Intelligence News and Analysis - AI Just Beat the World's Best Coders
Episode Date: September 19, 2025
AI just scored a historic win in the International Collegiate Programming Contest, with OpenAI’s GPT-5 and Google’s DeepMind outperforming nearly every human team. The discussion focuses on whether this marks a real inflection point for AI, shifting from competition success to the frontier of scientific discovery. Key themes include public perception, the pace of progress, and what these results signal for the future of the field.
Brought to you by:
Is your enterprise ready for the future of agentic AI? Visit AGNTCY.org
Visit Outshift Internet of Agents
KPMG - Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
Vanta - Simplify compliance - https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? nlw@aidailybrief.ai
Transcript
Today on the AI Daily Brief, does a big victory in a coding competition mean the end of an era for LLMs?
Before that in the headlines, how to reduce AI scheming.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, Robots & Pencils, AGNTCY.org, and Superintelligent.
To get an ad-free version of the show, go to patreon.com slash AI Daily Brief.
And speaking of Superintelligent, a quick job announcement. TLDR, we are absolutely drowning in
agent planning customers. Turns out lots and lots of enterprises are trying to figure out what their
AI and agent strategy should be and which use cases they should be pursuing. And because of that,
we are hiring customer success and account managers. This is a role that requires you to be
extremely organized, great with people, and forward thinking about how you use AI. And so if you
are interested in this, shoot us an email at jobs at besuper.ai. We don't want resumes, and we
don't want cover letters. We are only going to look at emails where people send a Loom or video
of them showing off some automation or AI workflow they've created for themselves or in the
context of a job. If this sounds interesting to you, again, it's jobs at besuper.ai. And with that,
let's get into the show. Welcome back to the AI Daily Brief Headlines edition, all the daily
AI news you need in around five minutes. We kick off today with some interesting new safety
research around detecting and reducing scheming in AI models. Now, scheming in this
context refers to a model appearing to behave in one way while hiding its true intentions.
In their post about the research, OpenAI writes,
scheming is an expected emergent issue, resulting from AIs being trained to have to trade off
between competing objectives. The easiest way to understand scheming is through a human
analogy. Imagine a stock trader whose goal it is to maximize earnings. In a highly regulated field,
such as stock trading, it's often possible to earn more by breaking the law than by following it.
If the trader lacks integrity, they might try to earn more by breaking the law and covering their tracks to avoid detection, rather than earning less while following the law.
From the outside, a stock trader who is very good at covering their tracks appears as lawful as, and more effective than one who is genuinely following the law.
Now, they point out that in the context of today's model deployment, models don't have a lot of chance to scheme in ways that are particularly problematic.
They write that the most common failures involve simple forms of deception, such as pretending to have completed a task without actually doing so.
They also don't think that there's any evidence that this is about to become some huge issue.
They write, we have no evidence that today's deployed frontier models could suddenly flip a switch and begin engaging in significant harmful scheming.
This is a future risk category that we're proactively preparing for, not an imminent behavior in our currently deployed systems.
So how to deal with this? OpenAI, in collaboration with Apollo Research, wrote,
We can significantly reduce scheming by training models to reason explicitly using an extension to the model spec that prohibits scheming.
That method is called deliberative alignment.
They found that using this technique, they were able to reduce covert actions by about 30x.
However, they wrote, situational awareness complicates results.
Researchers explained that frontier models can recognize when they're being tested,
and the tendency to scheme is based on that situational awareness.
When a model detects that it is being put through an alignment test, it is less likely to scheme.
Researchers noted that this isn't necessarily a good thing as it means the latent tendency to scheme can't be properly tested.
They concluded,
Our results depend on reading models' reasoning, or chain of thought, and we believe the field isn't prepared
for eval-aware models with opaque reasoning. Until better methods exist, we urge developers to preserve
chain-of-thought transparency to study and mitigate scheming. Miles Brundage, an independent AI
policy researcher, thought that it was a useful piece that advanced the field. He commented,
it's important to simultaneously bear in mind that chain of thought is both extremely important
to maintain and use as a tool for AI oversight, and also not the whole picture of what's going on.
Among other things, sometimes the model is clearly working through a problem at the pace of the text,
hence concision, but other times it's clearly thinking something through below the surface
and just using tokens to keep things going and it seems to be wasting compute.
This is why I think it's essential that third parties outside of companies have access to chains
of thought for doing safety research, but eventually we'll need to go beyond that to model
internals.
Now, speaking of what's going on inside models, one interesting thing that's been happening
over the past few weeks is that alongside people getting excited about GPT-5 and OpenAI's Codex
CLI, part of the reason for that excitement wasn't just about a change in perception of OpenAI
model quality, but a frustration with what seemed to be problems with Claude. Indeed, some thought
that Anthropic was intentionally throttling Claude and just not telling people. The company
has now published a post-mortem of three infrastructure issues that dragged down performance in August
and early September. Now, they were very clear. From the outset they pushed back on the prevailing
notion, once again arguing, we never reduced model quality due to demand, time of day,
or server load. The problems our users reported were due to infrastructure bugs alone. The first
bug began in August and caused short context queries to be routed to a server configured to process
using a million token context window. This caused degraded responses and impacted around 30%
of customers at least once. The second bug showed up in late August and caused low-probability
tokens to show up in responses more frequently than they should. Anthropic gave the example of
Chinese or Thai characters showing up in the middle of an English language response.
This bug was short-lived and didn't seem as widespread.
The third bug was a compiler issue which caused some highly probable tokens to be excluded
from the distribution during text generation.
The bug only impacted requests to the Claude Haiku 3.5 model, so also wasn't as likely
to be a large cause of concern.
Anthropic pledged to make a number of changes to the way they eval models and monitor
infrastructure to more easily detect issues in the future.
And while some developers have already shifted behavior, by and large, the response to
this transparency was quite positive. Moving over to the geopolitical side of the house,
China has officially banned tech companies from buying Nvidia's AI chips. The Financial Times reports
that China's internet regulator has instructed companies, including Alibaba and ByteDance, to cancel
orders for Nvidia's RTX Pro 6000D. Before the command, several Chinese companies indicated that they
would order tens of thousands of the products. This ban follows instructions to avoid using
Nvidia's H20 chips that were issued during the summer. The RTX Pro 6000D is the Blackwell-based
chip designed specifically for the Chinese market to get around export controls and was to be
the successor to the H20. Nvidia CEO Jensen Huang said, we probably contributed more to the
Chinese market than most companies have and I'm disappointed with what I see. But they have larger
agendas to work out between China and the United States. We can only be in service of a market
if the country wants us to be. Nvidia has guided market analysts to assume zero sales in China moving
forward, but the ban still dashes hopes that Nvidia would return to what was once their second
largest market. Beijing reportedly believes that their homegrown chips are now sufficiently
advanced that they can step in to replace the H20 and forthcoming RTX Pro 6000D. Still, by all
accounts, the infrastructure required for mass production is still being constructed. At the same
time, the ban on Nvidia products means that Chinese chipmakers will have a huge backlog of
orders to validate the cost of expanding supply. Vey-Sern Ling, managing director at Union
Bancaire Privée, said, China clearly prefers to develop AI at its own pace,
on a domestic tech stack. Better to bite the bullet now than to rely on U.S. tech that can be restricted
upon a whim. A complete ban, if true, would show China's confidence in its local supply chain somewhat,
but it's still likely it's a bargaining chip in the trade negotiations.
Speaking of chips, chipmaking startup Groq, completely unrelated, by the way, to xAI's chatbot,
has raised a ton of cash in a gigantic new fundraising round. The company announced on Wednesday that
they'd raised $750 million at a $6.9 billion valuation. Rumors from July had suggested the round would
draw in $600 million at $6 billion, so this is meaningfully bigger than was previously expected.
Groq last raised in August of last year at $2.8 billion, making this roughly a 2.5x jump.
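For anyone who wants to check the arithmetic behind that step-up, here is a minimal sketch using only the figures quoted in this segment. The variable names are made up for illustration, and the implied stake assumes the $6.9 billion figure is a post-money valuation, which the episode does not state.

```python
# Figures quoted in the episode, in billions of US dollars.
previous_valuation = 2.8   # valuation at the raise in August of last year
new_valuation = 6.9        # valuation announced for this round
rumored_valuation = 6.0    # valuation rumored in July
raised = 0.75              # $750 million raised in this round

# Step-up between the two rounds (about 2.46x, which rounds to the 2.5x quoted).
step_up = new_valuation / previous_valuation

# How far the final valuation landed above the July rumor.
vs_rumor = new_valuation / rumored_valuation - 1

# Implied stake sold, ASSUMING 6.9 is post-money (not confirmed in the episode).
stake_sold = raised / new_valuation

print(f"step-up: {step_up:.2f}x")
print(f"above rumored valuation: {vs_rumor:.0%}")
print(f"implied stake sold: {stake_sold:.1%}")
```

The same three lines of division work for any pair of rounds, so it is an easy sanity check whenever a headline quotes a multiple.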
Now, another oversubscribed AI venture round isn't all that newsworthy by itself, but Groq seeing
a ton of demand does tell us something about competition in chipmaking.
Groq designs chips that are purpose-built for AI inference, as opposed to Nvidia's more
general-purpose GPUs. Google is working along a similar path with their Trillium TPUs.
In fact, Groq's founder Jonathan Ross worked on the TPU project at Google prior to starting
his own company around the idea of efficient inference chips.
While we're not there yet, the future of AI chipmaking could start to diverge substantially
from where it's been.
The market is beginning to fragment into different types of chips optimized for different parts
of the AI stack.
AI training will continue to benefit from the highest performance chips, which for the
moment are still in Nvidia's range of GPUs, but inference, which represents a vastly larger
portion of AI chip demand ultimately, increasingly looks like it could go to companies that
can build the fastest or most energy efficient chips.
Quick one for my enterprise users out there.
AI avatars are about to be unleashed on Zoom meetings around the world.
Zoom announced on Wednesday that the third generation of their AI avatars will be coming
in December.
This will be the first generation capable of appearing in live meetings rather than just
delivering pre-recorded messages.
The avatars won't be able to attend meetings by themselves.
Instead, they will function as an overlay on live video tracking the user's movement.
Zoom said that they are photorealistic and designed for moments that require a polished
presence without needing the user to be camera ready, i.e. roll out of bed right before the meeting
and still look great. A series of guardrails are being rolled out alongside the technology.
Zoom says they can verify that the person sitting in front of the camera matches the AI
avatar. They will also display clear warning signals that you're looking at an AI avatar
rather than a real person. Already live deepfakes have become an issue in corporate security.
So the optimistic take here would be that normalizing the use of avatars will raise awareness
that you can't always believe what you see in front of you on the screen. Alongside the new
avatar features, Zoom is also rolling out their built-in translation and AI note-taking functions.
Lastly, today, a quick discussion of Meta's new smart glasses.
The company has released a pair of new smart glasses that they hope will become the native
AI device of the next decade.
The products were rolled out at Meta Connect on Wednesday and are called the Meta Ray-Ban
Display.
As you'll no doubt have guessed from that name, the big new feature is a display built into
the glasses.
It is a 600 by 600 pixel display projected onto the right lens of the glasses,
making it basically invisible for anyone looking from the outside.
The other big advance is the introduction of Meta's Neural Band controller.
The device detects electrical signals given off by nerves in the wrist,
allowing the glasses to be controlled by subtle hand movements.
Now, a lot of the chatter about this
was this feature breaking for Mark Zuckerberg while live on stage,
and many people were ragging on Meta for the demo not working.
Andreas Klinger had this right, though, when he wrote,
marketing people think this is bad,
but this builds an insane pile of trust among engineers.
You know the stuff in the show is real.
You know they dare to actually ship, and they believe in their product enough to take the risk of something breaking live.
I'd build for that ecosystem rather than some fancy after effects vaporware marketing demo.
And I could not agree more.
A real live demo like this, even one that has big problems and failures, is a thousand times more brand-accretive,
especially for early adopters and builders,
than the sort of polished pre-recorded junk that we've increasingly gotten from the big tech companies over the last few years.
And look, people's first impressions are pretty positive.
It seems like The Verge's reviewer even tried to dislike them and just couldn't get herself there.
She posted, I regret to inform you, Meta's new smart glasses are the best I've ever tried.
We are of course all still waiting on the device that Sam Altman and Jony Ive are cooking up,
but right now when it comes to AI wearables, it is Meta's and Meta's alone.
But that is going to do it for the headlines.
Next up, the main episode.
AI isn't a one-off project.
It's a partnership that has to evolve as the technology does.
Robots & Pencils works side by side with clients to bring practical AI into every phase:
automation, personalization, decision support, and optimization.
They prove what works through applied experimentation and build systems that amplify human potential.
As an AWS-certified partner with global delivery centers, Robots & Pencils combines reach with high-touch service.
Where others hand off, they stay engaged, because partnership isn't a project plan, it's a commitment.
As AI advances, so will their solutions. That's long-term value.
Progress starts with the right partner.
Start with Robots & Pencils at robotsandpencils.com slash AI Daily Brief.
Shape the future of enterprise AI with AGNTCY.
Now an open source Linux Foundation project,
AGNTCY is leading the way in establishing trusted identity and access management for the Internet of Agents,
the collaboration layer that ensures AI agents can securely discover, connect, and work across any framework.
With AGNTCY, your organization gains open, standardized tools and seamless integration,
including robust identity management to be able to identify, authenticate, and interact across
any platform, empowering you to deploy multi-agent systems with confidence. Join industry leaders
like Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and 75-plus supporting companies
to set the standard for secure, scalable AI infrastructure. Is your enterprise ready for the future
of agentic AI? Visit agntcy.org to explore use cases now. That's A-G-N-T-C-Y dot org.
Today's episode is brought to you by Superintelligent.
Now, one thing that we are having a lot of conversations with folks about is the fact that for
some of you, your fiscal year is coming to an end, and that means two things.
One, it means planning and thinking about what you're going to do in the next year.
And two, it means using up the last of those budgets so you don't lose them.
If you are an enterprise that happens to find yourself in that situation, Superintelligent
would love to help on both fronts.
We are moving increasingly towards an annual AI planning model, where we
help you create an action map of your organization's agent opportunities that
represents an executable backlog of AI and agent use cases that you can deliver on over the course
of the next year. Additionally, for those end-of-year budgets, we have worked out deals with a number
of partners where we can pre-lock in general implementation packages even before you figured out
exactly what use cases are going to require them. If you'd like to learn more about Superintelligent's
agent readiness audits and this new end of fiscal year plan, visit us at besuper.ai, click get started,
and make sure to use the word fiscal somewhere in the description.
Welcome back to the AI Daily Brief.
Today we are talking about both Google and OpenAI's very impressive performance in the ICPC coding competition,
what it says about our perceptions of new models, and exploring what it means for what's next.
But to set us up, I just have to go back a few weeks.
It was only one month ago when we were inundated with articles like this one from the Financial Times.
Is AI hitting a wall? OpenAI's underwhelming new GPT-5 models suggest progress is slowing and
competition is changing. A more kind interpretation came from the likes of the New York Times,
who ran an opinion post, AI may just be kind of ordinary. Relatedly, at the beginning of this month,
just two weeks ago, The Economist published What If Artificial Intelligence Is Just a Normal Technology.
Now, of course, I have talked extensively about how much I think was going into these perceptions,
and specifically how much it had to do with factors very much outside the AI space.
And yet, it's not unfair to say that many people, outside at least this little AI sphere that we all live in,
had started to kind of accept this idea as the conventional wisdom.
Yes, it would be a disruptive technology, but maybe not some crazy, fast-moving thing like we had all been thinking.
Of course, over the last several weeks, things have changed.
There has been a major vibe shift around GPT-5 that has done nothing but increase with the introduction of GPT-5 Codex.
On a more macro level, we had that huge Oracle projection which got markets all excited again,
plus the Fed delivered the rate cut that markets had been hoping for.
And now, on top of that general vibe shift, we've received news from OpenAI and Google about
another big win in a competitive setting.
The victory was achieved at the International Collegiate Programming Contest or ICPC.
The contest brings together teams from colleges around the world to compete to answer complex
algorithmic questions.
The contest is about solving mathematical puzzles using a combination of logic and programming
skill. Google DeepMind's Gemini 2.5 Deep Think and OpenAI's GPT-5 both participated in the contest and were subjected
to the same five-hour time limit as the human teams. The results were significant. Gemini managed to answer
10 of the 12 questions, which would have been good enough for second place overall and a gold medal.
Two human teams solved 10 problems, while only a single team from St. Petersburg State University
solved 11 problems. By the way, for those who are interested in the global distribution here,
the top five teams were St. Petersburg State University, the University of Tokyo,
Beijing Jiaotong University, Tsinghua University, and Peking University.
Russia, Japan, China, China, China.
Harvard came in sixth. The University of Zagreb from Croatia came in seventh, and MIT came in eighth.
GPT-5, meanwhile, managed a perfect score, which, as you just heard, none of the human teams achieved.
Mostafa Rohaninejad, one of OpenAI's scientists who observed the model's performance, wrote a thread on the event.
He said,
we received the problems in the exact same PDF form, and the reasoning system selected which
answers to submit with no bespoke test time harness whatsoever. For 11 of the 12 problems, the system's
first answer was correct. For the hardest problem, it succeeded on the 9th submission. Now,
you might remember that back in July, both OpenAI and Google DeepMind claimed gold medal performances
at the International Math Olympiad. The IMO is a similar style of event to the ICPC, putting
genius-level students to the test. However, neither of the models that competed at the IMO were
generally available at the time. Part of what is capturing people's attention about this new performance
is that this week's result was largely about mostly normal models that anyone has access to
demonstrating superhuman performance. Mostafa wrote, we competed with an ensemble of general purpose
reasoning models. We did not train any model specifically for the ICPC. We had both GPT-5 and an
experimental reasoning model generating solutions, and the experimental reasoning model selecting
which solutions to submit. GPT-5 answered 11 correctly, and the last and most difficult problem
was solved by the experimental reasoning model. Mostafa confirmed that this was the same pair of
models that competed at the IMO, but of course at the time, GPT-5 was still unreleased. He writes,
the result is a great capstone to our streak of results showcasing the impressive pace of improvement
of our reasoning systems. Borys Minaiev, a reasoning specialist at OpenAI,
filled in some details on how impressive the result was in the specific context of the history of
the event. He posted, in 2015, I won the ICPC World Finals as a member of the ITMO University
team. It was the only time in finals history when a team solved all the problems before the
contest ended. Reflecting on the rapid advancement, Minaiev said, progress is very fast. A year ago,
AI struggled with even easy contest problems. Now it performs better than the best human teams.
If this trend continues, next year we may see real scientific discoveries made by AI.
Put a pin in that point because we are going to come back to it.
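To make the generate-and-select setup from the OpenAI thread above a bit more concrete, here is a toy sketch. Everything in it is hypothetical: the two generator functions stand in for models proposing candidate solutions, and the selector is a plain check against sample cases, which is an assumption for illustration only. The thread says a reasoning model did the selecting and does not describe how.

```python
from typing import Callable, List, Optional, Tuple

Solver = Callable[[int], int]  # a candidate solution: maps an input to an output


def generator_a() -> List[Solver]:
    """Hypothetical stand-in for one model proposing candidate solutions."""
    return [lambda n: n * n, lambda n: n * 2]


def generator_b() -> List[Solver]:
    """Hypothetical stand-in for a second model proposing candidates."""
    return [lambda n: n + n, lambda n: n ** 2 + 1]


# Toy sample cases, as if taken from a problem statement (the rule is n squared).
SAMPLE_CASES: List[Tuple[int, int]] = [(2, 4), (3, 9), (5, 25)]


def select(candidates: List[Solver], samples: List[Tuple[int, int]]) -> Optional[Solver]:
    """Return the first candidate that reproduces every sample case, else None."""
    for candidate in candidates:
        if all(candidate(x) == y for x, y in samples):
            return candidate
    return None


# Pool candidates from both generators, then let the selector decide what to submit.
candidates = generator_a() + generator_b()
chosen = select(candidates, SAMPLE_CASES)
print("submit" if chosen is not None else "no candidate passed")
```

The point of the split is that generating many candidates is cheap relative to a wrong submission, so a separate selection step decides what actually gets sent to the judge.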
Now, like I said, one of the obvious takeaways here is just how off our first impressions were of GPT-5,
or at least how wrong it was to translate our preference interpretations and perhaps wild expectations
onto a model that was actually good.
What I mean by that is that it perhaps was not unreasonable for people to be upset that a model
whose character and personality they had come to really appreciate was gone.
But then translating that to thinking that GPT-5 was in some way a bad model
was a mistranslation of that real and legitimate sentiment.
There are also questions of how OpenAI introduced it,
whether they kind of shot themselves in the foot by not just calling o3 GPT-5.
So whatever the case, it's very clear between Codex and now this performance
that even if you have a preference for some other models for some other use cases,
which is completely reasonable,
it's very hard to say that GPT-5 is in some way a bad model.
And of course, like I said, Google was right there getting gold medal performance as well.
This is not just a ChatGPT or OpenAI story.
What's more, it's not just a Google and OpenAI story either.
The actual increase in performance that's happening right now is also not limited to Google
and OpenAI.
Another set of impressive results this week happened on the ARC-AGI leaderboard.
On Wednesday, Jeremy Berman of Reflection AI posted,
I'm back at the top of ARC-AGI with my new program. I use Grok 4 and multi-agent collaboration
with evolutionary test time compute. The result was a new state-of-the-art performance on ARC-AGI
at around 80% on the first test and 30% on the second test. Berman spent $8.42 per task for the first
test and around $30 per task on the second. As a point of comparison, the ARC-AGI run of o3 from last
December that had everyone speculating around whether OpenAI had actually created AGI was behind that.
They achieved a 76% result at a cost of around $13 per task, and a super-expensive run that cost
several thousand dollars per task achieved an 88%.
Berman explained the architecture, writing,
The system works by having Grok 4 generate natural language instructions for solving each task.
Grok 4 sub-agents test these instructions against training examples, scoring their accuracy.
The best-performing instructions spawned new generations of refined solutions.
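The loop Berman describes (generate, score against training examples, keep the best, refine) can be sketched in a few lines. This is a toy under loud assumptions, not his code: the "instructions" here are just a pair of numbers, the model calls are replaced with random proposals and random tweaks, and the score is a graded error rather than a strict accuracy so the toy has something to climb. All function names are invented for illustration.

```python
import random

# Toy stand-in for a task: a hidden rule (y = 3x + 2) shown via training examples.
TRAIN_EXAMPLES = [(x, 3 * x + 2) for x in range(6)]


def propose(rng):
    """Hypothetical stand-in for a model writing a fresh instruction."""
    return (rng.randint(-5, 5), rng.randint(-5, 5))


def refine(candidate, rng):
    """Hypothetical stand-in for a model revising a promising instruction."""
    a, b = candidate
    return (a + rng.choice([-1, 0, 1]), b + rng.choice([-1, 0, 1]))


def score(candidate):
    """Negative total error on the training examples; 0 means a perfect match."""
    a, b = candidate
    return -sum(abs(a * x + b - y) for x, y in TRAIN_EXAMPLES)


def evolve(generations=30, population=8, survivors=3, seed=0):
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    pool = [propose(rng) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=score, reverse=True)
        best = pool[:survivors]            # keep the top performers unchanged
        if score(best[0]) == 0:            # stop early on a perfect candidate
            break
        children = [refine(rng.choice(best), rng)
                    for _ in range(population - survivors)]
        pool = best + children             # next generation
    return max(pool, key=score)


best = evolve()
print(best, score(best))
```

Because the top candidates are carried forward untouched, the best score can never get worse from one generation to the next, which is the property that makes spending more test-time compute worthwhile in this kind of setup.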
Now, nothing in the design of this seems particularly complicated.
It uses this sub-agent structure, but that is, of course, a hallmark
of Grok 4. In fact, when Greg Kamradt of ARC Prize announced the submission, Elon Musk retweeted
and said, that's just Grok 4. He later added, I now think xAI has a chance of reaching AGI with
Grok 5. Never thought that before. And just to reinforce one of the key points here, these
results are being achieved by the standard models available to the public. Compare that to
o3's release when OpenAI needed to develop a fine-tuned version and pile on a ton of inference
to achieve a result above 80%. Now Grok 4, when well structured, does it just as standard.
Berman released all of his materials as open source so anyone with the technical skills and
$100 for API costs can duplicate the run.
Now, one of the things that makes these types of results difficult to communicate to a wider
audience is that most of us just don't have any real frame of reference for what these results
mean.
If you're not a part of this competition, getting gold medal performance at the PQXY competition
doesn't necessarily mean anything.
Trying to put some context around it, swyx writes, as impressive as this is, I feel like
OpenAI is underselling it still. This is the first measure in which GPT-5 has achieved superhuman
coding ability, as in, it is literally and measurably better than every other collegiate human
programmer on Earth. Through all the prior IMO, IOI, and AtCoder competitions, AI was roughly
as good as the best humans, maybe a little under. OpenAI's Noam Brown agreed, saying, that's a good
point. I wouldn't call it superhuman coding ability because there's more to coding than what the
ICPC tests, but I think this is the first major coding competition where AI did better than any
human competitor. And indeed, that's why it's worth wondering if this actually represents a meaningful
inflection point in some way. OpenAI's Jakub Pachocki put some context around what this means,
at least for them, inside the company. He writes, I believe these results, coming from a family
of general reasoning models rooted in our main research program, are perhaps the clearest
benchmark of progress this year. These competitions are self-contained, time-boxed tests for the ability
to discover new ideas. Even before our models were proficient at simple arithmetic, we looked
towards these contests as milestones of progress towards transformative artificial intelligence.
Our models now rank among the top humans in these domains when posed with well-specified questions
and restricted to around five hours. The challenge now is moving towards more open-ended problems
and much longer time horizons. This level of reasoning ability applied over months and years
to problems that really matter is what we're after, automating scientific discovery.
And that is really the place that the discussion has resolved.
That this is not about practical coding.
It's about whether we are on the frontier of actually being able to make novel discoveries.
Jerry Tworek writes,
ICPC probably marks the end of our run on competitions
and an end of a certain era for LLM systems.
But what's the next frontier is even more exciting.
Roon, I think, put it even more poetically.
He posted,
Essentially all fixed-time competitions at the edge of human skill have been grandmastered by machines,
so labs must pivot to the only true challenge of unraveling the unsolved mysteries.
It is very clear that unraveling those unsolved mysteries is where OpenAI's head is at.
CPO Kevin Weil tweets,
OpenAI models are getting quite good at solving really hard problems.
The next stage is accelerating scientific discovery,
and we're beginning to see strong early signs.
This is also the constant theme of Google DeepMind CEO Demis Hassabis.
In basically every interview, it's pretty clear that his view of what makes for really advanced
AI is making novel scientific discoveries that do things like, as he said in April, give us a real
crack at solving all disease. So like Jerry said, maybe the ICPC represents the end of an era and
the beginning of something new. If so, it is a very exciting time to be in this field, and I'm glad to have
all of you here with me as we dive all the way in. That's going to do it for today's AI Daily Brief.
Thanks as always for listening or watching, and until next time, peace.
