The AI Daily Brief: Artificial Intelligence News and Analysis - 3 Major New AI Model Releases: GPT-OSS, Claude Opus 4.1, Genie 3
Episode Date: August 6, 2025
Three major AI model drops in one day signal a shifting landscape at the frontier. We break down OpenAI’s unexpected open-source GPT-OSS, Anthropic’s powerful Claude Opus 4.1, and Google’s ambitious Genie 3 world model, exploring what each reveals about the strategies, capabilities, and directions of the top AI labs.
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
Today on the AI Daily Brief, three big new model releases from OpenAI, Anthropic, and Google.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Hello, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, KPMG, Blitzy, and Superintelligent.
To get an ad-free version of the show, go to patreon.com slash AI Daily Brief.
And if you are interested in sponsoring the show, I actually have a new place to go now.
For fun this past weekend, I vibe-coded up aidailybrief.ai.
If you're watching, you can see it's a cool, old kind of terminal-style website.
And so you can now email sponsors@aidailybrief.ai.
That is sponsors, with an S, at aidailybrief.ai.
You can also still email me, NLW, at basically anything you can imagine and it'll get to me.
Check out the website, send me an email about sponsorship, and I will look forward to sending you more information.
The last thing I want to note is that, unsurprisingly, given how much was going on,
today's episode is just one big, long main episode about these three big model releases. As
makes sense this week, we will be back to our normal format or we'll just continue to focus on
whatever crazy big new thing gets launched. For now, though, we've got an open source model,
a world model, and a new coding champion to talk about, so let's dive in. Welcome back to the AI Daily
Brief. It's been clear for a while that this was poised to be a very big week. Insiders suggested that
in addition to the much fabled GPT-5, we were potentially going to get a new version of Claude
from Anthropic and even some treats and something interesting from Google, to say nothing of a bunch
of other smaller models from the companies just one level back from the foundation model giants.
And today, indeed, we got drops from OpenAI, Anthropic, and Google.
So we're going to do our best to go through all of this.
This is all very fresh out of the gate with people's very first takes.
And of course, we'll be spending a lot more time on this throughout the week.
We kick off not with GPT-5, that's still forthcoming at this point, and odds-on bets are that it comes on Thursday,
but instead with a release that to some is as or more significant.
And that is OpenAI's two new open-weight models.
The two models are GPT-OSS-120B and GPT-OSS-20B.
And the big blinking banner headline is that these are not just some second-tier models
to give developers something nice to play around with.
These are actually very, very close to state-of-the-art. Sam Altman tweeted,
GPT-OSS is out. We made an open model that performs at the level of o4-mini and runs on a high-end
laptop, WTF, and a smaller one that runs on a phone. Super proud of the team; big triumph of
technology. Now, before we get into how OpenAI describes this model, it's worth putting this
into context. It has been a very long time since OpenAI was actually in the business of
open models. The last one that they released as an open model was all
the way back at GPT-2, which, by the way, remains at this point the most downloaded text generation
model of all time on Hugging Face, and by a fairly significant amount. And over the years, as OpenAI
had increasingly turned away from open source, there were still people who were asking them to
reconsider. Back in December of 2023, AI entrepreneur Varun Mather wrote, open source GPT-4 and do whatever
R&D needed to ensure it can run on consumer laptops and desktops. You got into AI to change the world.
Now, in January of this year, we got the idea that OpenAI might be shifting their stance on this.
In a Reddit conversation, CEO Sam Altman said that he thought that perhaps OpenAI had been
on the wrong side of history to use his phrase when it came to open source.
Now, not coincidentally, this was at the peak of DeepSeek mania, when Chinese open source
models in general had not only taken a major jump ahead of Western open source models,
but were creeping up right on the heels of the closed models from the biggest foundation model
companies. And so with all that in mind, developers in the AI community at large have been excitedly
waiting to see what this open source model release would bring. So let's talk about how OpenAI
presented these two models. GPT-OSS-20B, they frame as a medium-sized open model that can run on most
desktops and laptops, while its bigger sibling 120B is a large open model designed to run in data
centers and on high-end desktops and laptops. One of the first things people wanted to check is what the
license for this would actually be. And OpenAI led with that as
one of their four main bullets, calling it a permissive license. They write,
These models are supported by the Apache 2.0 license. Build freely without worrying about
copyleft restrictions or patent risk, whether you're experimenting, customizing, or deploying
commercially. Simon Willison was enthusiastic, tweeting excitedly that both had been released
under a, quote, proper open source Apache 2.0 license. But were these just underperforming
models designed for tinkerers and hobbyists? The short answer is absolutely not. They shared
comparative benchmarks across four tests, including two competitive math tests, the AIME
2024 and 2025, as well as the MMLU, GPQA Diamond, and Humanity's Last Exam, which were all
reasoning and knowledge tests. On the competition math, both 20B and 120B actually beat OpenAI o3,
and were just behind o4-mini, which had been specifically designed for math. On the reasoning tests,
both models were behind o3 and o4-mini, but not by a ton. For example, the 120B model scored
a 90 on the MMLU as compared to o3's 93.4, and an 80.1 on the GPQA Diamond as compared to
o3's 83.3. And there was a slightly bigger gap on Humanity's Last Exam, scoring a 19 as opposed to a
24.9. Still, that 19 was better than o4-mini, which got a 17.7. The point is that these models
are distinctly not second tier. Matt Shumer writes, are you kidding me? OpenAI's new open source
model is o3 level. This is going to disrupt the market in a big way. In addition to
the permissive license and the benchmarks, OpenAI called out three other aspects of the models.
Full chain of thought: giving developers the ability to access the full chain of thought for easier
debugging and higher trust in model outputs. Deeply customizable: OpenAI writes,
adjust the reasoning effort to low, medium, or high, plus customize the models to adapt to your
use case with full parameter fine-tuning. And the last one, and one that caught my attention, was
designed for agentic tasks. Leverage powerful instruction following and tool use within the chain
of thought, including web search and Python code execution. Matthew Berman honed in
on this point, writing, excellent instruction following, function calling, web search, Python tool
use, adjustable reasoning effort, and structured outputs. And it's very clear, as we'll get into in a minute,
that, again, far from being some hobbyist project, I think they have some real specific use cases in mind,
and a big part of it is building agents. Sam Altman also wrote a longer post about GPT-OSS. He said,
GPT-OSS is a big deal. It is a state-of-the-art open weights reasoning model with strong real-world performance
comparable to o4-mini, that you can run locally on your own computer or phone with the smaller size.
We believe this is the best and most usable open model in the world. We're excited to make this
model, the result of billions of dollars of research, available to the world to get AI into the hands
of the most people possible. We believe far more good than bad will come from it. For example,
GPT-OSS-120B performs about as well as o3 on challenging health issues. We have worked hard to mitigate
the most serious safety issues, especially around biosecurity. GPT-OSS models perform
comparably to our frontier models on internal safety benchmarks.
We believe in individual empowerment.
Although we believe most people will want to use a convenient service like ChatGPT,
people should be able to directly control and modify their own AI when they need to.
And the privacy benefits are obvious.
As part of this, we're quite hopeful that this release will enable new kinds of research
and the creation of new kinds of products.
We expect a meaningful uptick in the rate of innovation in our field,
and for many more people to do important work than were able to before.
OpenAI's mission is to ensure AGI that benefits all of humanity.
To that end, we are excited for the world to be building on an open AI stack created in the United States,
based on democratic values, available for free to all and for wide benefit.
Now, there is a lot to unpack in there.
If you listen to my episode about the White House's AI Action Plan,
you will have heard me talking about the fact that one of the things that made that document so interesting
is that while a lot of foreign policy in the United States right now involves withdrawing from
our traditional role in the world, the AI Action Plan basically made it a prerogative of the
U.S. government to use open source AI as a sort of soft power. That sort of theme isn't new for Sam
Altman, but with this model and the narrative around its release, they're definitely leaning all the
way into that. When it comes to these safety and risk issues, Kai from OpenAI added a few more
details. They tweeted, open weights can't go back in the box. Before permanently releasing GPT-OSS
models into the ecosystem, we estimated marginal frontier risk by deliberately eliciting bio
and cyber capabilities via malicious fine-tuning. While MFT improves performance,
GPT-OSS stays below the high threshold in OpenAI's preparedness framework. These findings contributed
towards the decision to release these models for the world to use. Alongside the release,
they shared a paper called Estimating Worst-Case Frontier Risks of Open-Weight LLMs.
But what about what these models are actually useful for? Why is it important if you have
access to the closed models that OpenAI has now made these near state-of-the-art models available
in this open way? The team at Every spent a little time with the models before their release,
and here's what they had to say about them. They write, I can think of a few ways right off the bat
where we're going to start experimenting with these models internally. First, Every has a
consulting practice with hedge funds and private equity firms that are bound by significant
privacy and security regulations. I imagine this is going to be immediately interesting
for these businesses because they can now run the models themselves in the security of their
own data centers. Previously, they may have wanted to use open-weight models like Kimi K2 or Qwen,
but for firms with strict regulatory or compliance requirements, there's no better way to make your
head of IT break out in hives than to try to install Chinese-developed AI in your secure private cloud.
OpenAI's OSS series models boast the same open weight flexibility and security benefits without the geopolitical
drama.
Every also points out that even beyond big strategic risk, there are a lot of companies that just
don't want to send their information to anyone's cloud.
They write, Every has a suite of Mac apps that could use these models to offer AI features with a better security and privacy profile.
Our AI file organizer sends your file names to OpenAI's cloud, a no-go for security-conscious users.
Similarly, Monologue, our AI dictation app currently in beta, sends your dictation transcript
to the cloud for processing.
We've experimented with on-device processing for both of these apps, but speed and reliability
haven't been good enough to release it.
And I think the point here is that in addition to just the principle of the thing, and
giving companies and users the choice of how they want to toggle their security and privacy
settings, there are actually very specific use cases in industries where this might matter significantly.
Basically, the more regulated an industry is, and the more complex it is, the more likely they are to want
custom solutions. We see this all the time at Superintelligent when it comes
to how they're thinking about agents. I don't think it's an accident that OpenAI is really
honing in on how good at agentic capabilities like tool-calling these models are, because I think
that they anticipate that a lot of the usage is going to be in the form of privacy and security-conscious
enterprises using these models to build their own custom agents. Harrison Chase and the team at
LangChain already started writing about how to use GPT-OSS for this type of agentic use case. Chase wrote,
Deep Agents require good tool-calling capabilities, something that OpenAI's new open source model is
pretty good at. And for companies that are doing a lot of that usage, there is another part of this
story, which is also equally compelling, which is the cost profile. Matt Shumer again writes
that the pricing when delivered through Groq is 91% cheaper than o3's pricing.
He continues, because of this, you can likely use this model in much more comprehensive chains
in agents to eke out more performance than o3, dollar for dollar.
For example, running five OpenAI OSS model generations in parallel and then selecting the best
is still way cheaper than just one o3 query.
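As a rough check on that claim, here is a minimal sketch of the arithmetic. It assumes, and this is not stated in the quote, that every query uses about the same number of tokens and that picking the best output costs nothing. The function names are hypothetical, and costs are expressed as a percentage of one o3 query rather than as real prices.

```python
# Back-of-the-envelope check of the "five generations vs. one o3 query" claim.
# Assumptions (not from the source): equal token usage per query, and a free
# "select the best" step. Costs are percentages of one o3 query, not dollars.

DISCOUNT_PERCENT = 91  # the "91% cheaper" figure quoted in the episode


def best_of_n_cost_percent(n: int, discount_percent: int = DISCOUNT_PERCENT) -> int:
    """Cost of n parallel generations, as a percent of one o3 query."""
    per_generation = 100 - discount_percent  # one generation costs 9% of o3
    return n * per_generation


def max_generations_under_one_o3_query(discount_percent: int = DISCOUNT_PERCENT) -> int:
    """Largest n whose total cost stays strictly below one o3 query."""
    per_generation = 100 - discount_percent
    return (100 - 1) // per_generation


if __name__ == "__main__":
    print(best_of_n_cost_percent(5))             # 45
    print(max_generations_under_one_o3_query())  # 11
```

Under these assumptions, five parallel generations come to about 45% of the cost of a single o3 query, and the budget would stretch to eleven generations before matching it.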
Groq's head of partnerships, Jacob Lowenstein, wrote,
Intern accidentally priced OpenAI GPT-OSS at 1/100th of Anthropic.
Bad intern, but what's done is done. Now go build.
Cerebras writes, OpenAI GPT-OSS-120B is live on Cerebras.
3,000 tokens per second, the fastest OpenAI model on record.
The point being that with these models, you have something that is not only available
with a privacy and security profile that was not previously possible, but at near state of
the art and for a fraction of the price and at significant speed.
I'm sure we'll be spending a lot more time on these open source models and how people
are using them in the weeks to come, but this is just one of three announcements that I wanted
to speak to in today's show.
Steven Heidel, who works on the API at OpenAI, wrote,
Also, congrats to my friends at Anthropic on the release of Opus 4.1 today.
Today's episode is brought to you by KPMG.
In today's fiercely competitive market, unlocking AI's potential could help give you a competitive
edge, foster growth, and drive new value.
But here's the key.
You don't need an AI strategy.
You need to embed AI into your overall business strategy to truly power it up.
KPMG can show you how to integrate AI and AI agents into your business strategy
in a way that truly works and is built on trusted AI principles and platforms.
Check out real stories from KPMG to hear how AI is driving success with its clients
at www.kpmg.com slash AI. Again, that's www.kpmg.com slash AI.
This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform
with Infinite Code Context. Blitzy uses thousands of specialized AI agents that think for hours
to understand enterprise-scale code bases with millions of lines of code.
Enterprise engineering leaders start every development sprint with the Blitzy platform,
bringing in their development requirements.
The Blitzy platform provides a plan, then generates and pre-compiles code for each task.
Blitzy delivers 80% plus of the development work autonomously
while providing a guide for the final 20% of human development work required to complete the sprint.
Public companies are achieving a 5x engineering velocity increase
when incorporating Blitzy as their pre-IDE development tool,
pairing it with their coding co-pilot of choice to bring an AI-native SDLC into their org.
Blitzy is providing a limited time, 30-day free proof of concept for qualifying enterprises.
The team will provide a 5x velocity increase on a real development project in your org.
Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native.
That's BLITZY.com.
If you are a regular listener, you will have heard about Superintelligent's agent
readiness audits at this point. But I wanted to tell you today about the full suite of agent readiness
products that go beyond just the initial readiness report. Over the last six months, Superintelligent
has built out an entire agent planning suite. We help you move from discovery to planning to implementation.
After you've completed your agent readiness audits, we help you double click on your most important
use cases with what we call our use case planning reports. These reports are going to help you understand
what sort of technical preparation you need to do to be ready for a use case, what
challenges you might face in implementation, and whether you should be thinking about building,
buying, partnering, or some combination. After that, you can even get a spec document in what we call
our technical blueprint that gives either your developers or the developers of the partner you work
with what they need to build exactly the agent that you're looking for. If you want to learn more
about Superintelligent's agent planning suite, we've built a custom GPT to answer your questions.
Just go to bit.ly slash super agent. That's bit.ly slash super agent, all one word.
And if you have any questions, the agent can even help you book an appointment with our team.
Just before the OpenAI announcement, Anthropic tweeted,
today we're releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks,
real-world coding, and reasoning.
And the big story here is software engineering.
On the SWE-bench Verified test, Opus 4.1 scored 74.5%, as compared to Opus 4's 72.5%
and Sonnet 3.7's 62.3%.
The quote that everyone has been grabbing from their announcement post is this one.
Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their
junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7
to Sonnet 4. Menlo's Deedy Das writes,
Anthropic just dropped Opus 4.1, the best coding model in the world.
State of the art, 74.5% on SWE-bench, surpassing both Gemini 2.5 Pro and OpenAI o3.
This is the product right now to use.
Chubby points out, however, that maybe the thing to really focus on is the comment that follows
the announcement post, we plan to release substantially larger improvements to our models in the coming
weeks. That's echoed, by the way, in their announcement blog post. Now, for some, it feels like
that was the main point. New Form AI's Alec Velikhanoff said, is this it? Opus 4.1 feels like a rushed
release to get ahead of GPT-5. Look at how it struggles with making a UI that Horizon, assumed GPT-5,
nailed in a single shot. Now, this has been out
for less than two hours at the time of recording, so we haven't really gotten a chance to see a ton of
people playing around with it yet. But there's certainly no doubt that the timing is not a coincidence,
and that part of this is the competitive pressure to always be at the state of the art.
We haven't gotten GPT-5 yet, but we have heard tons of rumors that it is leagues better at coding
than any previous OpenAI model, and people have had tons of success with the assumed tester
models like the codenamed Horizon, which could theoretically threaten Anthropic's crown at the top of
the coding heap. Much respected Simon Willison says,
My favorite thing about this model is the version number.
Treating this as a 0.1 version increment looks like it's an accurate depiction of the model's capabilities.
Basically, he's saying that, yes, it is a modest improvement, but an improvement nonetheless
and one that is appropriately pitched.
Now, the other new model that I'm really excited about that was announced today is Google's
Genie 3.
This they're calling the most advanced world simulator ever, and there's really some amazing stuff
going on here.
So the basic idea of Genie 3 is that rather than prompting a video with a few words,
you're prompting an actual dynamic world that you can interact with.
They write, given a text prompt, Genie 3 can generate dynamic worlds that you can navigate
in real time at 24 frames per second, retaining consistency for a few minutes at a resolution
of 720p.
Let's actually just watch part of the video to get a sense of how they explain things.
Each one of these is an interactive environment generated by Genie 3, a new frontier for world
models.
With Genie 3, you can use natural language to generate a variety of worlds and explore them interactively,
all with a single text prompt.
Let's see what it's like to spend some time in a world.
Genie 3 has real-time interactivity, meaning that the environment reacts to your movements and actions.
You're not walking through a pre-built simulation.
Everything you see here is being generated live as you explore it.
And Genie 3 has world memory.
That's why environments like this one stay consistent.
World memory even carries over into your actions.
For example, when I'm painting on this wall, my actions persist.
I can look away and generate other parts of the world.
But when I look back, the actions I took are still there.
And Genie 3 enables promptable events, so you can add new events into your world on the fly.
Something like another person, or transportation, or even something totally unexpected.
You can use Genie to explore real-world physics and movement and all kinds of unique environments.
So the real-world physics are a really important part of this.
We talk about AGI like something that is a straight-line path from where we are to there.
And certainly as more models become more capable, that feels accurate.
But the reality is, there are different beliefs about the different paths that it's going to take to get us there.
And if you've ever listened to our coverage of world models before,
you'll know that this is a different path that some people are convinced is a
better path towards AGI than the sort of text-based LLMs that have become common at the core of
our experience with generative AI. Google writes that world models are AI systems that can use
their understanding of the world to simulate aspects of it, enabling agents to predict both how an
environment will evolve and how their actions will affect it. World models they continue are a key
stepping stone on the path to AGI, since they make it possible to train AI agents in an unlimited
curriculum of rich simulation environments. So part of what Genie has that's different is that it can
model physical properties of the world. It can retain environmental consistency like we saw with that
painting example, and it can be prompted to actually change the generated world as well.
Now, environmental consistency was one of the things that people latched onto as a really big update here.
Andrew Curran writes, they solved environmental consistency with Genie 3, and this was an
emergent capability. You can see the trees remain the same even after being out of line of sight.
Visual memory extends back one minute now. Google is on a steady path to a real world simulator.
Chubby writes,
The breakthrough, minutes of consistency at 720p using only text descriptions,
no pre-built 3D models required.
They continue,
the technical breakthrough lies in autoregressive generation.
Each new frame must take into account the entire trajectory up to that point.
At 24 frames per second, this means 24 complex calculations per second
that must access minutes of context.
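To put rough numbers on that, here is a small sketch of the arithmetic, using only the 24 frames per second figure from the announcement. The function names are hypothetical, and it assumes one generation step per frame, which is a simplification for illustration rather than a description of how Genie 3 is actually implemented.

```python
# Rough frame-budget arithmetic for a real-time autoregressive world model.
# Only the 24 fps figure comes from the announcement; everything else here
# is an illustrative assumption (one generation step per displayed frame).

FPS = 24


def frame_budget_ms(fps: int = FPS) -> float:
    """Time available to generate each frame, in milliseconds."""
    return 1000.0 / fps


def frames_in_context(seconds: int, fps: int = FPS) -> int:
    """Number of past frames covered by a context window of this length."""
    return seconds * fps


if __name__ == "__main__":
    print(round(frame_budget_ms(), 1))  # 41.7
    print(frames_in_context(60))        # 1440
```

In other words, each frame would have roughly 42 milliseconds to be generated, and a single minute of history already amounts to 1,440 earlier frames that the next frame has to stay consistent with.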
Google DeepMind researcher Jack Parker-Holder writes,
Genie 3 feels like a watershed moment for world models.
We can now generate multi-minute, real-time interactive simulations
of any imaginable world.
This could be the key missing piece for embodied AGI.
Theoretically Media writes,
Google's Genie 3 shows the beginning of the endgame for Veo 3,
full world simulation.
Just look at this and tell me you don't see the start of the holodeck.
And that word holodeck comes up a lot.
There are a few different things that I'm seeing people excited about
when it comes to the announcement.
One of them is robotics, or to use the term that we just heard, embodied AI.
Robert Scoble writes,
having the ability to create a real-looking world
that a user can move around in and interact with
will speed up robotics. What Google just showed off this morning is a simulator. It simulates the
real world. Perfect for training robots of the future in. Robots need to learn to navigate the
real world. Stairs, for instance. Now developers will be able to develop thin stairs, thick stairs,
spiral stairs, muddy stairs, icy stairs, wet stairs, stairs with kids, stairs with dogs, etc.
And run virtual robots up and down those stairs billions of times to train new robot models.
Rowan Cheung also points out the implications for training. He writes,
We're entering the era of infinite AI training environments,
where world models like Genie 3 enable AI agents and robots to learn from their own experiences
in simulated real-world environments.
This is the convergence of world simulation, AI training, and creative expression all happening in real time.
Still, as you might imagine, a lot of the implications that people are jumping to are those for games.
Bilawal Sidhu writes,
Genie 3 just achieved what AAA game engines do but without any 3D models.
Wild how this model figured out complex effects like exposure shifts,
volumetric god rays, and phenomena we need to code explicitly in 3D engines.
If you squint, he writes, you can see where this goes.
VR and AR, half robots, robotics, jungle gyms for robots,
Sims made from real world data to train robots to then operate in the real world,
real-time virtual production, motion tracked iPhone and you, quote-unquote, directing actor agents.
Jim Fan from Nvidia writes, this is Game Engine 2.0.
Someday, all of the complexity of Unreal Engine 5 will be absorbed by a data-driven blob of
attention weights. Those weights take as input game controller commands and directly animate a
spacetime chunk of pixels. Andrew Curran joked, given the current rate of world model advancement,
the end result of Bethesda taking 15 years to release the next Elder Scrolls is that fans may
end up making it first, to which Elon Musk responded, for sure. Common Sense Machines CEO
Tejas Kulkarni had early access and wrote, I spent the whole day playing with the system and when
it works, it's truly mind-blowing. It's the first neural game engine and world model I've tried that
generalizes so well and has long-term world consistency. On where it shines, he writes things like
that it's truly general purpose and has a quick startup time. Works exceptionally well for gaming
environments, but also generalizes to other industrial and real-world scenarios. It learns physics.
He wrote, although there are systematic failures even for rigid body physics, it was clear to me
that it can learn game engine and non-rigid physics without an underlying engine. He also said it
works exceptionally well for stylized environments, with characters walking around, that it's way more
fun than video models, which to him indicates that there are, quote, high retention consumer
experiences waiting to be built with this. And he also wrote a few other areas where it works really
well, like photorealistic walkthroughs and drone shots, as well as global illumination and lighting.
But what are the problems? He said physics is still hard and there are obviously failure
cases when I try the classic intuitive physics experiments from psychology. Social and
multi-agent interactions are tricky to handle. For example, 1V1 combat games do not work.
Long instruction following and simple combinatorial game logic fails, i.e. collect some
points, keys, etc., go to the door, unlock, and so on. And so basically all in total,
he writes, it's far from being a real game engine and has a long way to go, but he concludes
this is a clear glimpse into the future. Indeed, he says it's impressive enough for me to
have strong conviction that this is going to disrupt the gaming industry. It's super early days
and there are a lot of failures, but the writing is on the wall. Lots of challenging scientific
engineering and scaling problems to be solved, but it's going to happen in the next five years.
I'm going to be spending some time over the next couple weeks, cataloging non-gaming use cases for
this, and I'm sure I'll be back to share those with you at some point when it makes sense.
It is just Tuesday. We haven't gotten to GPT-5. I also haven't been able to share with you guys yet
Lindy 3.0 or ElevenLabs' new music model. It's like we decided to accelerate all at once, all in one week.
So I will say, as always, thank you for listening or watching, and stick around because this week
is going to be a big one. That's going to do it for today's AI Daily Brief. Until next time,
peace.
