The AI Daily Brief: Artificial Intelligence News and Analysis - Is o3 Functionally AGI?
Episode Date: April 20, 2025
From the launch of o3 and o4-Mini to the US considering a DeepSeek ban, these are the biggest stories from the last week in artificial intelligence.
Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The Automation Platform for AI Experts - https://useplumb.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Transcript
Today on the AI Daily Brief, all the most important stories in AI from this past week while I was traveling.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
To join the conversation, follow the Discord link on our show notes.
Hello, friends, quick note before we dive in today, I'm of course coming off some travel.
And so this week, instead of a long reads episode, I decided to do a bit of a catch-up on some of the most important news.
There were some interesting things that went down, so let's dive right in.
Welcome back to the AI Daily Brief Headlines edition,
all the daily AI news you need in around five minutes.
As you know, I have been out all week, so we have a lot to catch up on.
And we are kicking off with some geopolitics where the Trump administration is reportedly
considering a DeepSeek ban.
According to the New York Times, further restrictions include banning the startup from
purchasing U.S. technology and barring Americans from accessing DeepSeek's models.
Now, the report didn't get into how exactly a government could ban open source models,
but functionally simply banning cloud providers from offering them is probably close enough.
Congress also apparently has DeepSeek in its sights.
The House Select Committee on China called the AI startup a, quote, profound threat to U.S.
national security by harvesting American users' data and sending it back to China.
Their report states,
although it presents itself as just another AI chatbot,
offering users a way to generate text and answer questions,
closer inspection reveals the app siphons data back to the People's Republic of China,
creates security vulnerabilities for its users,
and relies on a model that covertly censors and manipulates information pursuant to Chinese law.
Now, whether or not this comes to pass tells you a lot about the state of the conversation as it relates to China and AI right now.
Speaking of, another company in trouble in that area is NVIDIA.
On Wednesday, the Trump administration extended export controls to cover NVIDIA's H20 chips,
the down-rated version of the H100 that is designed to comply with controls introduced in the Biden era.
The administration said that the enhanced rules address concerns that, quote,
the covered products may be used in or diverted to a supercomputer in China.
Nvidia, for their part, warned that they would report $5.5 billion in writedowns associated with
inventory and commitments for the chips, which have essentially zero demand outside of China.
In terms of how much this would impact the development of Chinese AI,
Biden Commerce Department staff said that the bans would make it around 3 to 6% more costly
to develop an AI model in China.
Since then, we've had constant reports of Chinese researchers doing more with less, so that
figure is very much up for debate.
Nvidia had been outspoken about existing export controls and lobbied against them going further.
In January, as Biden's outgoing team imposed the last round of tightening, the company said that
export controls, quote, will only harm the U.S. economy, set America back and play into the hands of
U.S. adversaries. Now, earlier this month, CEO Jensen Huang made a trip to Mar-a-Lago to petition
the president directly. Following his attendance at a million-dollar-a-head dinner,
NPR reported that Trump had reversed course on new controls on the H20s. The report stated that
bans had been set to go into effect as soon as last week. The quid pro quo had been a dramatic ramp-up
of local investment from Nvidia. NPR sources made vague reference to investments in AI data
centers, but this week we've seen a wave of reports of Nvidia's commitments to U.S.
manufacturing. On Monday, the company announced that they had begun production of Blackwell
chips in TSMC's Arizona facility. They also committed to producing AI supercomputers at a pair
of facilities in Texas. In total, Nvidia claimed they would produce a half a trillion dollars
worth of AI infrastructure in the U.S. over the next four years. They said their manufacturing was,
quote, expected to create hundreds of thousands of jobs and drive trillions of dollars in economic
security over the coming decades. In the press release, Huang said, the engines of the world's
AI infrastructure are being built in the United States for the first time. Adding American
manufacturing helps us better meet the incredible and growing demand for AI chips and supercomputers,
strengthens our supply chain and boosts our resiliency. Alas, the high-profile announcement
doesn't seem to have been enough, with the export controls still going into force two days
later. The Financial Times reports the announcement came as a complete surprise to
Nvidia, saying that earlier this month, the company had assured Chinese tech giants that
supply of H20s would not be interrupted. And so as of Thursday, Huang is visiting Beijing
to meet with political and tech leaders. According to state broadcaster CCTV, the CEO said
that China was a, quote, very important market for Nvidia, and that his company would, quote,
make a significant effort to optimize our products that are compliant with the regulators and
continue to serve the Chinese market. Speaking of DeepSeek, sources said that his itinerary included
a meeting with DeepSeek founder Liang Wenfeng to discuss a new chip design to meet regulatory
requirements set by Washington and Beijing. A public meeting with the China Council for the Promotion
of International Trade was televised, and Huang also reportedly met separately with Chinese
Vice Premier He Lifeng. The press for their part is reading a lot into the deference shown by
Huang, who discarded his trademark leather jacket for a suit and tie to attend high-level meetings
in Beijing. Speaking to the whirlwind China visit, President Trump told the press,
quote, Jensen's an amazing guy. He's become a friend of mine. He's a person that's
very proud of our country. He loves our country. I'm not worried about Jensen at all.
Back home, big fundraising continues. Ilya Sutskever's Safe Superintelligence has closed a new round of
funding that values the company at a whopping $32 billion. The former OpenAI chief scientist
founded the startup less than a year ago. In September, SSI raised a billion dollars at a $5 billion
valuation, a price tag that already seemed a little rich to some for a company with no product
and little more than a big-name founder and a resonant mission statement. But obviously to many
people that would be dismissing what SSI actually has. This round, which brought in an additional
$2 billion, has marked up the valuation 6x. To put it in perspective, Anthropic was valued at $61.5 billion
during last month's funding round, meaning that SSI has already achieved half that valuation.
There are a couple of potential reads on the situation. The first take is that for those who are
in this game, venture firms are simply not that price sensitive when it comes to getting into the
very small handful of companies that actually have a chance to reach AGI first.
Reports from February said that SSI was in talks to raise at $20 billion then, so there's been a
pretty significant jump in valuation during the negotiations, meaning there's a lot of demand.
The second read is that SSI might have made actual progress over the past few months.
Certainly everyone is wondering about what the product will look like. James Cham, a partner
at venture firm Bloomberg Beta, said, everyone is curious about exactly what Ilya's pushing and
exactly what the insight is. It's super high risk, and if it works out, maybe you have the
potential to be part of someone who is changing the world.
A couple more before we wrap up, Anthropic is preparing to release their long-awaited voice mode.
Bloomberg reports that the feature could be released as soon as this month.
Sources said the rollout will feature three different voices for Claude,
identified as airy, mellow, and a British-accented version referred to as buttery.
CEO Dario Amodei first teased voice mode during a January interview with the Wall Street Journal.
One of the reasons he gave for the long delay was a desire to ensure that Claude's voice
was comfortable and natural enough for long interactions.
The rollout will also be the first big test of Anthropic's new premium $200 per month subscription.
In Microsoft Land, the company has enabled a new computer use feature for Copilot Studio.
This feature is similar to offerings from OpenAI and Anthropic and allows Copilot to take over
the computer to interact with websites and apps.
Charles Lamanna, the VP of Copilot, said, this allows agents to handle tasks even when there is
no API available to connect to the system directly.
If a person can use the app, the agent can too.
Apple, meanwhile, has revealed a convoluted plan to improve their AI without compromising
privacy.
In a technical blog post, the company laid out a system that
can check their synthetic data against tokenized user data without revealing information.
The idea is that the synthetic data that most closely matches real user data can be used
as the training set for Apple's next generation of models.
This means that the company can technically state that their models aren't trained on user data.
Apple says this method can be used to improve the performance of writing assistance,
photo editing models, and their generative emoji feature.
Kind of an overcomplicated way to catch up with features that people were excited about
back in 2024, but there you are.
Separately, the New York Times reports that AI-enhanced Siri would finally arrive this year.
Their sources said that current plans are to release the updated assistant in the fall.
Explaining the features, they gave the example of being able to edit a photo and send it to a friend.
A fall rollout would be well ahead of previous estimates, though,
with Bloomberg tech editor Mark Gurman previously stating that he thought that Siri,
quote, won't be ready until 2027 at best.
The Information writes,
Apple's AI/ML group has been dubbed AIMLess internally,
while employees are said to refer to Siri as a hot potato that is continually
passed between different teams with no significant improvements. So I guess at the end of a week away,
the more things change, the more they stay the same. That's going to do it for today's AI Daily
Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by Vanta.
Vanta is a trust management platform that helps businesses automate security and compliance,
enabling them to demonstrate strong security practices and scale. In today's business landscape,
businesses can't just claim security, they have to prove it. Achieving compliance
with a framework like SOC 2, ISO 27001, HIPAA, GDPR, and more is how businesses can demonstrate
strong security practices.
And we see how much this matters every time we connect enterprises with agent services providers
at Superintelligent.
Many of these compliance frameworks are simply not negotiable for enterprises.
The problem is that navigating security and compliance is time-consuming and complicated.
It can take months of work and use up valuable time and resources.
Vanta makes it easy and faster by automating compliance across 35-plus frameworks.
It gets you audit ready in weeks instead of months and saves you up to 85% of associated costs.
In fact, a recent IDC White Paper found that Vanta customers achieved $535,000 per year in benefits,
and the platform pays for itself in just three months.
The proof is in the numbers.
More than 10,000 global companies trust Vanta, including Atlassian, Quora, and more.
For a limited time, listeners get $1,000 off at vanta.com slash nlw.
That's VANTA.com for $1,000 off.
Today's episode is brought to you by Superintelligent and more specifically Super's Agent
Readiness Audits. If you've been listening for a while, you have probably heard me talk about
this, but basically the idea of the agent readiness audit is that this is a system that we've
created to help you benchmark and map opportunities in your organizations where agents could
specifically help you solve your problems, create new opportunities in a way that, again,
is completely customized to you. When you do one of these audits, what you're going to do is a
voice-based agent interview where we work with some number of your leadership and employees
to map what's going on inside the organization and to figure out where you are in your agent
journey. That's going to produce an agent readiness score that comes with a deep set of explanations,
strengths, weaknesses, key findings, and of course a set of very specific recommendations that
then we have the ability to help you go find the right partners to actually fulfill.
So if you are looking for a way to jumpstart your agent strategy, send us an email
at agent@besuper.ai.
And let's get you plugged into the agentic era.
Welcome back to the AI Daily Brief.
It's pretty clear that the big news this week in AI
was the introduction by OpenAI of a set of new reasoning models.
On Wednesday, OpenAI released O3 and O4 Mini.
O3 is their most advanced reasoning model to date,
while O4 Mini is being pitched as a competitive tradeoff
between price, speed, and performance.
There's also a high resource version of O4 Mini
called O4 Mini-high. So the trend of OpenAI having completely clear names continues.
The new batch of reasoning models introduces some new features to the O-Series family.
First, the models can integrate images into their reasoning process. We've seen something along these
lines show up as an emergent property of multimodal models like Google's Gemini, but this will be
the first time that OpenAI has pushed the limits on what the reasoning modality can do.
OpenAI told VentureBeat, these models don't just see an image, they think with it. It
unlocks a new class of problem-solving that blends visual and textual reasoning.
The other big improvement is tool use, with the new models natively trained on common tools.
The company wrote, we've trained them to use tools through reinforcement learning,
teaching them not just how to use tools, but to reason about when to use them.
President Greg Brockman commented,
they actually use these tools in their chain of thought as they're trying to solve a hard problem.
For example, we've seen O3 use like 600 tool calls in a row trying to solve a really hard task.
Now, this could represent a big jump in agentic capabilities.
For agents, being able to figure out the right tools to use for any given situation is going to be one of their biggest unlocks and is pretty key to enabling ultimately fully autonomous agents.
Right now, one of the most common failure states for agents is either failing to recognize when to use a tool or failing to access the tool properly.
Now, it wouldn't be a new model release without a whole bunch of benchmarks that you're not exactly sure what they mean or how much to care about,
and, in fact, the tool use appears to be showing up here.
O4 Mini, for example, managed to score 99.5% on the AIME 2025 mathematics competition,
but only when given access to a Python interpreter.
More broadly, OpenAI is claiming that O3 benchmarks as state-of-the-art across standard coding,
science, and agentic tasks.
However, as you all have heard me say before, I think that given the challenges of benchmarks,
it's much more relevant to see what people are actually doing with these tools.
Kelsey Piper of Vox's Future Perfect said that O4 Mini High is the first model to pass her own,
quote, personal secret benchmark for hallucinations and complex reasoning.
Her test involves presenting the inputs of a complex midgame chess board and the prompt,
mate in one.
The catch is that there is no checkmate in one move.
AI models are trained on extensive chess puzzles of this kind, but their training set doesn't
necessarily include this kind of trick question.
Piper said that her prior testing showed that models reasoned through thousands of possibilities
before hallucinating a solution.
This generally involves adding extra pieces to the board or illegal moves.
The models will then add lengthy justifications for why their
hallucinated solution is correct. She had run this test on every Claude model to date, as well as Gemini
2.5 Pro, O3 Mini High, and Grok 3, with none figuring out that the solution is impossible.
Why is this a big deal? I invented this problem because I think it gets at the core of AI's potential
and limitations. An AI that can't question its premises will always be limited. An AI that doubles
down on its own wrong answers will too. She noted that the reasoning trace was eight minutes long,
much longer than any other query she ran, saying, that's a lot of places to potentially make mistakes,
and hallucinate a solution. Its expectation that there was a solution was very strong, but it overcame it.
She added in conclusion, however, that said, its explanation of why there was no checkmate,
in fact, still contains some chess inaccuracies, which I know it knows better than. So certainly
don't trust these things, but know they're continually getting better. An even more vociferous
endorsement came from economist Tyler Cowen, who wrote, I think it's AGI, seriously. Try asking it
lots of questions and then ask yourself, just how much smarter was I expecting AGI to be?
I've argued in the past AGI, however you define it, is not much of a social event per se.
It will still take us a long time to use it properly.
Benchmarks, benchmarks, blah, blah, blah.
Maybe AGI is like porn.
I know it when I see it, and I've seen it.
Now, I haven't had as many reps as I normally would have this week with O3, given the travel,
but I am absolutely 100% in the Tyler Cowen camp here.
Not necessarily that O3 is AGI, but that it doesn't matter.
These models have so far, to me, been an absolute step-change improvement,
relative to O1 and what we were using in the past.
I've been testing them as a business thought partner,
and the reasoning is so much more thorough,
so much more interesting, and just generally better.
In fact, I've implored, by which I mean basically demanded
that everyone inside Superintelligent
start playing around with O3 as a brainstorming partner
for pretty much everything.
I genuinely think it's that good.
Now, I think it'll still take some time for us to figure out
exactly what the best use cases for these models are,
although if enough people like me demand that all their colleagues use it for every business interaction from here on out, I'm sure we'll figure it out more quickly.
Still, one use case that people jumped on very fast was that O3 appears to be disturbingly good at geoguessing.
Given basically any photo of a landscape or a building, the model can pinpoint its location on a map.
Henrion X wrote, 10 years ago, the CIA would have gotten on their knees for this.
Every single human has just been handed an intelligent super weapon.
It's only getting stranger.
I would implore you, if you haven't had a chance yet, go play with
this model. Even if you don't have something specific that you're trying to do, try asking it
whatever business question you're thinking through at the moment. Use it as a thought and collaboration
partner and just see how different it feels as opposed to past models. It is, of course, totally
possible that I'm in the first-few-days glow of a new toy and that it's actually not all that
different, but I kind of don't think so. Now, completely overshadowed by the O3 and O4 Mini releases,
OpenAI also rolled out a new update to their non-reasoning model family earlier in the week on Monday.
GPT-4.1 will be the successor to GPT-4o and is now available to developers through the API.
The GPT-4.1 family includes three different sizes with a mini and nano variant available alongside
the full-size model. OpenAI says that the nano version will be their smallest, fastest, and cheapest
model yet. Another big update, the models have a million token context window matching Google's
recently released Gemini 2.5 Pro. As we've discussed before, ultra-long context windows are especially
important for coding assistants and agents, allowing users to dump entire codebases into the model
or run longer agentic workflows.
And it seems that GPT-4.1 is explicitly aimed at coding use cases.
An OpenAI spokesperson said,
We've optimized GPT-4.1 for real-world use based on direct feedback
to improve in areas that developers care most about.
Front-end coding, making fewer extraneous edits,
following formats reliably,
adhering to response structure and ordering,
consistent tool usage, and more.
These improvements enable developers to build agents
that are considerably better at real-world software engineering tasks.
If nothing else, this is definitely OpenAI competing on price
in a very aggressive way. Michelle Pokrass, the post-training research lead at OpenAI, said,
not all tasks need the most intelligence or top capabilities. Nano is going to be a workhorse
model for cases like autocomplete, classification, data extraction, or anything else where speed
is the top concern. Entrepreneur Paul Gauthier noted that this week's releases are more than the
sum of their parts, posting, using O3 High as architect and GPT-4.1 as editor produced a new
state-of-the-art of 83% on the Aider polyglot coding benchmark. It also substantially reduced costs
compared to O3 High alone.
Now, speaking of coding, something we've talked a lot about on this show is how for some time,
Anthropic's Claude has been the go-to choice for developers.
Well, OpenAI is definitely not giving up that fight, because alongside these new models,
they also rolled out a new coding agent.
Sam Altman posted, O3 and O4 Mini are super good at coding, so we're releasing a new product
Codex CLI to make them easier to use.
This is a coding agent that runs on your computer.
It's fully open source and available today.
We expect it to rapidly improve.
Now, because it's open source, of course, there are already forks that enable models
from outside the OpenAI ecosystem as well. First reactions seem decent. Gooby said, used Codex
CLI with O3, used like 150 in tokens in like an hour, switching to O4 Mini now, LMAO. That
being said, O3 was cooking, fixed a couple of long-standing bugs. Roshab Shravastara wrote,
vibes for Codex CLI so far: been a bit meh, Claude Code's still much better. Codex with
O4 Mini has been fantastic for one-shot single-file edits, extremely good at fixing subtle
bugs when specifically prompted. Meh at iteration and retaining content and at multi-file edits.
Terrible at creating documentation and explaining a codebase. So for now, maybe Claude can breathe
a sigh of relief, but it's pretty clear that OpenAI wants to compete in that space, which is also
validated by the fact that on Wednesday, Bloomberg reported that the company is looking to acquire
Windsurf. Windsurf is probably the best-known Cursor competitor and was valued at
$1.25 billion back in August and was reportedly in talks to raise at a $3 billion
valuation earlier this year. The report states that OpenAI is looking to make the
acquisition at $3 billion, but sources say the deal hasn't been finalized and could still fall
apart. Now, if you're wondering, why not just buy Cursor instead? Sam Altman apparently
thought of that as well and made two separate attempts to buy the leading agentic coding platform.
One was late last year and another early this year. In fact, CNBC's sources say that OpenAI
has actually met with 20 companies in the AI coding domain before reportedly finding a deal with
Windsurf. All in all, it was an extremely busy week in OpenAI land, and I'm ignoring about a
half a dozen stories that otherwise might have merited attention. For now, what I will leave you with
is my strong instinct to please go try O3, play around with O4 Mini as well. These really do feel like
a different quality of model and a different quality of experience, and I think are going to open up
some different types of use cases. For now, though, that's going to do it for today's AI Daily Brief.
Until next time, peace.
