The AI Daily Brief: Artificial Intelligence News and Analysis - Is GPT-OSS Actually Any Good?
Episode Date: August 7, 2025
A day after OpenAI's surprise open source release, we dig into how the model is performing in the wild. Early reactions are mixed: while some praise its speed and efficiency, others describe strange behavior, safety-maxed responses, and limited general knowledge. Is it optimized for coding and STEM? We also cover Eleven Labs’ entry into AI music, Lindy’s new agent-building tools, and Google’s powerful Genie 3 world model.
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
Today on the AI Daily Brief, I look at the day one reactions to the major model releases that
happened yesterday.
Before that in the headlines, you guessed it, more model releases.
The AI Daily Brief is a daily podcast and video about the most important news and discussions
in AI.
All right, friends, quick announcements before we dive in today.
First of all, thank you to today's sponsors, Blitzy, Vanta, and Plumb.
To get an ad-free version of the show, go to patreon.com.
Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around
five minutes. It is quite blurry right now between the headlines and the main episode, as everything this
week is just all new model releases. Yesterday, as OpenAI was getting in the open source game again
for the first time since like 2019 and Google was resetting our expectations of world simulation
models, Eleven Labs, who some of you know as the company that I have frequently used to do a version
of my own voice for reading articles on this show, launched an AI music service. Now, there are a bunch
of things that are interesting about this.
First of all, this is Eleven Labs' first expansion beyond generative speech, where they are at this
point one of, if not the, default market leaders in voice cloning, text to speech, and translation.
This new music release, called Eleven Music, is a feature-complete music generation suite with
instrumentals and lyrics, meaning it competes directly with services like Suno and Udio.
You can create an entire song based on a prompt.
The example you just heard is from the prompt, powerful choir, orchestral, large crescendo,
ethereal, spacious reverb, expansive, cinematic, classical music.
Or this one, roller, liquid drum and bass, fast technical percussion, thick bass, influences
from IDM, soft female vocals, heavily processed.
Eleven Labs says the model is capable of generating a tune within minutes. And yet, as impressive
as it is, and the quality at first glance is incredibly impressive to me,
the biggest selling point over others in this space is that the Eleven Music model is claimed to be
legal for broad commercial use. Both Suno and Udio are currently facing lawsuits from the major record
labels for using their recordings in training data. Now, it's very unclear how music generated
using those models is going to be treated under copyright law. Those are big battles to be fought.
But Eleven Labs is just trying to short circuit that entirely, abstaining from including music
from the major labels in their training sets. Instead, they signed agreements with independent rights
management firms Kobalt Music and Merlin Network, with Kobalt saying that their artists had been given
the choice to opt into their music being used in the training
data or not. Ed Newton-Rex, who is the CEO of Fairly Trained, an AI copyright advocacy nonprofit,
posted positively, saying, co-founder of Eleven Labs confirms that their new AI music model is trained
only on songs they've licensed. This is really good to see. When a handful of AI companies
try to tell you generative AI can only be built with scraped copyrighted work, remember that the
majority of AI music models license their training data, including now Eleven Labs' model. Certainly first
impressions just from a quality standpoint are very good. Chubby writes, Eleven Labs music was not on my bingo
card. Holy moly does this sound good. We're now entering the era of real AI music. We are witnessing
history. Now, even though Eleven Labs says this is cleared for commercial use, when you look into the
terms, there's still some questions in here. There are restrictions around, for example, whether you
can distribute these things to music streaming services and exactly where you can use them,
but it's pretty clear where this has a lot of disruptive potential is around commercial
uses of music like game development, advertising, startup launch videos. Basically
anything that would have had you digging around before for the perfect sound clip or for the
perfect licensed song, you can now just generate on the fly. Creator Tyler Ganges says,
it's so simple, so easy, and will save me so much time. Teodor Mitu connected the dots
between Genie 3, which we'll talk about later in the main episode, and said between this,
the GPT onslaught, and the new Eleven Labs music tool, the entire legacy creative industry is cooked.
For my part, I'm going to be super interested to see, and will be watching very closely, how much Eleven
Music actually gets used for this commercial use case, given that theoretically that's been possible
with things like Suno before, but hasn't really happened. Maybe the way that they train this model
will open up that use case, but we will have to wait and see. Now, another model launch, which actually
happened all the way back in the olden times of Monday, was Lindy 3.0. Lindy 3.0, they're pitching
as their biggest step ever towards their vision of the AI employee. Now, if you haven't tried
Lindy, they offer an AI agent platform that aims to make the agent building process simpler.
Their last big release was Agent Swarms, which allowed use cases like sales management agents to carry out dozens of tasks in parallel.
Lindy 3 has three big new features that bring the platform up to speed and push the state-of-the-art forward.
CEO Flo Crivello wrote, our vision has always been the AI employee.
As capable as humans, can do anything on a computer, and as easy to use, just ask.
3.0, he says, takes three giant steps in this direction: Agent Builder, Autopilot, and Team Collaboration.
Agent Builder is exactly what it sounds like and is, I think, very much
the future of how agent models are going to work. Flo writes, just type what you want and it builds
it for you in minutes. Now, this is a huge UX improvement for non-technical people. If you've ever
fiddled with n8n or Lindy and immediately closed it because you were confused about nodes and
about these charts and drawing lines from one function to another, this is a lot closer to what you
probably are looking for, which is more akin to vibe coding for agents. A vibe coding for agents
platform would be one where you didn't need to know the details of agent architecture. You just had to
describe what you wanted to automate. Now, what is a little bit interesting about the UX for Lindy's
agent builder is that it still exposes the chain of steps, even though it's creating them for you,
which gives you more granular fine-grained control to modify it to the extent that the agent builder
gets it wrong. That could end up being the right combination, especially for power users who
want a little bit more control. Autopilot is Lindy's version of computer use. Flo again writes,
autopilot takes us closer than ever to can do anything. Lindy agents can now work with their own
computers in the cloud. He also continued, when we say anything, we mean it. We found out after
building autopilot that now Lindy could also build fully functional websites, deploy them, and even QA
them using its browser. We accidentally built Lovable. In dogfooding the product, Lindy found that
it was useful for the repetitive work that could be automated in the background. Flo told TBPN,
one of the most insane things it's automated for us is replacing the QA engineer. Every hour a Lindy agent
wakes up, tests the entire core flow, and if anything goes wrong, it pings the on-call engineer.
Now lastly, team collaboration does what you'd expect, allowing your team members to natively share
and iterate on their agents. I think what's most interesting to me in some ways, aside from just
the utility of this, is the move to the vibe coding for agents kind of UX. We've seen some of this
already with Emergence recently releasing their agents creating agents platform, but it's hard for me
to imagine that this doesn't just become totally standard in that regard very soon.
Now lastly, one more little model announcement before we get to our main episode.
Alongside Genie 3, Google also announced Storybook. Their Gemini app account writes,
It's Storytime reimagined. Now you can create personalized illustrated storybooks about anything,
complete with read aloud narration. And this is kind of one of those things where the capabilities
aren't new, but it's just an interface designed to specifically get at a particular use case.
Now, one of the things that's very notable to me is that when parents experiment with AI for the first time,
a huge default use case is around creating stories for their kids,
illustrating their kids' visions with image generation tools,
basically bringing the magical things that happen in kids' brains to life with technology.
It's the convergence of the magic of childhood with the magic of technology.
And that definitely seems to be what was going on with Storybook.
Joel, who's a PM at DeepMind wrote,
As a new father, I've been thinking about how to communicate with my son
in a way that truly resonates with him,
and in a way that many of our own parents used to communicate with us, reading.
With storybook, you can now describe any story you can imagine,
and Gemini generates a unique 10-page book with custom art and audio.
We're excited to help people find creative ways to break communication gaps
when you don't quite have the words for it.
Now, of course, in a week of crazy open source releases and world simulation models,
maybe this seems a little bit small.
But I wouldn't be surprised if for a lot of you parents out there,
the coolest new thing that got released in the short term might just be Google's Gemini
storybook.
With that, though, we are going to close out the headlines.
Next up, the main episode.
This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with Infinite Code Context.
Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale code bases with millions of lines of code.
Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development requirements.
The Blitzy platform provides a plan, then generates and pre-compiles code for each task.
Blitzy delivers 80% plus of the development work autonomously while providing a guide for the final 20% of
human development work required to complete the sprint.
Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy
as their pre-IDE development tool, pairing it with their coding co-pilot of choice to bring
an AI-native SDLC into their org.
Blitzy is providing a limited-time, 30-day free proof of concept for qualifying enterprises.
The team will provide a 5x velocity increase on a real development project in your org.
Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from AI-assisted
to AI-native.
That's B-L-I-T-Z-Y.com.
As a founder, you're moving fast towards product market fit, your next round, or your first
big enterprise deal.
But with AI accelerating how quickly startups build and ship, security expectations are higher
earlier than ever.
Getting security and compliance right can unlock growth or stall it if you wait too long.
With deep integrations and automated workflows built for fast-moving teams, Vanta gets you
audit-ready fast and keeps you secure with continuous monitoring as your models, infra, and
customers evolve. Fast-growing customers like LangChain, Writer, and Cursor trusted Vanta to build
a scalable foundation from the start. And look, as someone who lives in the world of enterprise procurement,
I love how Vanta makes it easy to get compliance right. The last thing you need when you're trying to win
that big deal is to have it scuttled by something that Vanta has solved for over 10,000 companies.
Go to vanta.com slash NLW to save $1,000 today through the Vanta for Startups Program and join over
10,000 ambitious companies already scaling with Vanta. That's V-A-N-T-A dot com slash NLW to save $1,000 for a limited time.
Today's episode is brought to you by Plumb. Are you building with AI? Plumb noticed that every technical
creator tends to hit the same wall. You've got AI workflows people want, but monetizing them
feels impossible because client work doesn't scale. Selling copies gives away your IP. And building
your own platform? That's becoming a software company.
It's a hard gap to bridge, and that's why they built Plumb.
Plumb helps creators build an audience of paid subscribers for their AI workflows, all on a single
platform.
Think Substack for automations.
There's no need to build extra infrastructure just to get paid for your expertise.
Plumb handles that so creators can do what they do best, solving problems with AI.
Ready to turn your expertise into passive income? Visit useplumb.com, that's Plumb with a B.
Welcome back to the AI Daily Brief.
Yesterday, there were many breathless posts about the incredible progression and performance of
OpenAI's new open source model. Twitter was absolutely flooded with posts like this one from
Dharmesh, the founder of HubSpot, screaming excitedly about how they were running this model on their
MacBook Pros. And of course, as you heard yesterday, there was a ton of excitement about the
reported benchmarks. But inevitably, what happens is in the first 24 hours or so after a model
gets released, people actually dig in.
They start putting it through its paces, running their own benchmarks, and generally trying to get a vibe off the thing to understand how good it is in practice.
Sometimes that leads to a model that outperforms its benchmarks.
Other times, in fact, I would say probably more often, the shine gets a little bit worn off.
Now, in the context of this model, where it's open weights and there's more information to play with, there's even more hacking to be done.
So the question we're exploring on today's show is how people are responding and what vibe they're getting from it now that it's out in the
wild. Initially, a lot of the responses I saw were somewhat similar to this one from Victor
Taelin. They wrote, my initial impression of OpenAI's OSS model is aligned with what they advertised.
It does feel closer to o3 than to other open models, except it is much faster and cheaper.
It's definitely smarter than Kimi K2, R1, and Qwen 3. I tested all models for a bit and got
very decisive results in favor of OpenAI OSS 120B. The independent benchmarkers didn't find
quite as high performance as OpenAI reported, but it was still pretty good. Artificial Analysis
writes, independent benchmarks of OpenAI's GPT-OSS models: GPT-OSS-120B is the most intelligent
American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence, but offers
efficiency benefits. You can see here that across the Artificial Analysis Intelligence Index,
the 120B model scores a 58, as opposed to R1's 59 and Qwen3's 64. Now, o3 is up at 67,
so there's a fairly big jump from 120B to o3.
This question of how much efficiency matters is something we'll come back to, however.
And yet, if some of the initial impressions were positive, it didn't take long before you started
seeing posts like this one. Emad Mostaque, formerly of Stability AI, writes,
is GPT-OSS entirely mid-trained or something? It's good for its size on MacBook, but kind of feels
fried and odd at times. Mark Kretschman responded with a theme that we'll see over and over again:
very jagged model, feels uneven. Celeste writes, it feels like
synthetic data. It's been safety maxed. Dwayne writes, I was excited for it until I really started to use it,
and it's weird. There's something off about it, almost like they trained it on safe responses from o3,
or maybe early checkpoint of GPT-5. Evie writes, yeah, it has strange vibes, reminds of Microsoft crappy
models like Phi-4. OpenCode's Dax wrote, everyone legit I know is having a not good time with GPT-OSS
so far. It's useful because now when I see popular accounts say it's so good, wow, I know they are
BSing. Now, when people tried to dig in with Dax, it sounded like part of the issue was specifically
with tool calling, which obviously really matters, because as we talked about yesterday, to the extent that
this is going to be useful as an alternative for agent builders who want a higher level of security,
privacy, and controllability, tool calling functionality is going to be extremely important.
And pretty soon this became the popular narrative. Nikil Chandak wrote, what it is: state of the art
at its size. What it is not: o4-mini, as many people have been claiming. Sam Paech writes,
the GPT-OSS models have disappointing results on EQ-Bench and creative writing.
It may be a function of the low active parameters,
although the high performance in other evals suggest the priorities were elsewhere.
And indeed, this is one of the biggest themes that's going around AI Twitter right now,
is that this model seems to have been really focused on some particular use cases and not on others.
Björn Pluster wrote a long thread about how GPT-OSS-120B is, quote,
very blatantly incapable of producing linguistically correct German text.
He writes, I see this as an exceptional release, highlighting
OpenAI's willingness to contribute to the open model space and showing how strong they actually
are on model training, but it is also very clearly a model not up to their usual standards with
regard to multilinguality or output quality. Ambilicus writes, I hate to be the bringer of bad news,
but this new OpenAI model is brutally locked down. Danny Aziz from Every writes simply, OSS-120B is
not o3 level in my opinion. Kyle Corbitt writes, GPT-OSS may have been trained primarily on synthetic
data, similar to Microsoft's Phi models. As a result, it's extremely spiky, great at the tasks
trained on, really bad at everything else. This was almost certainly to avoid copyright lawsuits,
sadly. Lisan al Gaib writes, GPT-OSS models seem to be slot-maxed on math, coding, and reasoning.
They're great at that, but they completely lack taste and common sense. At least that's my vibe so
far. Phil 111 on Hugging Face left a comment titled, this model is unbelievably ignorant.
He said this model has about an order of magnitude less broad knowledge than comparably sized models
like Gemma 3 27B and Mistral Small 24B. This model, including its larger brethren, are absurdly
ignorant of wildly popular information across most popular domains of knowledge for their respective sizes.
What's really confusing is all of OpenAI's proprietary models, including their tiny mini versions,
have vastly more general and popular knowledge than these open models.
So they deliberately strip the corpus of broad knowledge to create OS models that can only
possibly function in a handful of select domains, mainly coding, math, and STEM, that over 95%
of the general population doesn't give a rat's ass about, conveniently making it unusable to the
general population and in so doing protecting their paid ChatGPT service from competition.
Now, let's say for a moment that there were specific decisions that went into this that
made it optimized for those fields.
Phil, in the comments here, is suggesting that that's them trying to protect the integrity
of their business, which, as an aside, would be a reasonable thing to do anyways, but I'm
not exactly sure that the analysis is correct even if the interpretation of what's going on
under the surface is.
As we discussed yesterday, the most likely users of this are people who have perceived
data security and privacy needs that are significant enough that they want to use a less
convenient, not fully state-of-the-art open-source model over something that has any sort of
control or interaction from a third party. And if that is the case, those folks almost certainly
aren't using this for writing poetry or for getting access to general domain knowledge.
They're using it, in other words, for coding, math, and STEM. Now, we have no confirmation yet
from OpenAI that it is optimized for those fields, but even if it were, it would kind of make
sense to me just on that use case level alone. But what about this question of really how it compares
to the Chinese models. Just a day before this was released, a16z's Martin Casado wrote,
it's just remarkable how many U.S. startups are being built on Chinese OSS models. I'd say
the majority that are building custom models via post-training. The U.S. needs to step up,
make it a national priority and back with a huge investment. Which made it all the more interesting
that the White House's Michael Kratsios popped in yesterday, retweeting Deedy Das from Menlo,
who was talking about these three big model releases, and saying people ask what we mean by
unquestioned and unchallenged global technological dominance. Simon Willison made this comparison directly
in his tests. He wrote, I've been writing a lot about the flurry of excellent open weight models
released by Chinese AI labs over the past few months. All of them very capable and most of them under
Apache 2 or MIT licenses. Just last week, I said, something that has become more undeniable this month
is that the best available open weight models now come from the Chinese AI labs. I continue to have
a lot of love for Mistral, Gemma, and Llama, but my feeling is that Qwen, Moonshot, and Z.ai have
positively smoked them over the course of July. I can't help but wonder if part of the reason for the delay
in release of OpenAI's open weights model comes from a desire to be notably better than this truly
impressive lineup of Chinese models. Simon continued, with the release of the GPT-OSS models, that statement
no longer holds true. I'm waiting for the dust to settle and the independent benchmarks that are more
credible than my ridiculous tests to roll out, but I think it's likely that OpenAI now offer the best
available open weights model. Ram Jod didn't find that, though. He writes, I tested OpenAI's brand-new
open source model against Kimi K2 and Qwen3 Coder. It seems to perform worse than the Chinese models
in one-shot tasks. Somewhat disappointing to see, but I think over the coming days, people will find good uses for it
within coding. Dax again writes, for coding they're not anywhere close to the Chinese models in the last few
months. We'll give it time for dust to settle, but simple evaluations aren't enough.
Lisan al Gaib writes, there is no Western open-source model that beats or ties the best Chinese
open-source models. And adding even more questions to this is Ethan Mollick, who retweeted Simon
sharing the Chinese competition section of his post, with Ethan writing,
the US now likely has the leading open weights model or close to it, but the real question is
whether this is a one-off situation from OpenAI, in which case the lead will evaporate quickly
as others catch up, unclear what their incentives are to keep updating. And this, of course,
is a challenge. Let's say that the model isn't quite as good. Even this week, this is not the main
announcement from OpenAI. This is just the amuse-bouche for GPT-5, which we think is coming tomorrow.
So if we really want to compete when it comes to open weights models, it's not clear that OpenAI alone is going to do that.
Now, for a little bit more of an optimistic take, which is really just realism as optimism,
Nathan Lambert wrote,
Seems like while this launch had the vibes right and OpenAI can jazz up a crowd,
they're still going to go through a lot of the pains of why open models are hard.
Just so many weird little failures I'm seeing people mention.
May take a few weeks to get the best out of GPT-OSS.
Matrix Memories responded,
pain and beauty of open source. You can't run away from the edge cases or control for them.
So of course, the optimistic take there is that part of what makes these open models valuable
is what the community can do with them. And so we may not want to judge it based on these
first blush impressions. Nathan actually wrote a very long post on his Interconnects.ai blog,
where he argued that OpenAI validates the open ecosystem. And that, quote, open models from the
US labs were in such a dire spot that we need any step back in the right direction. But, however,
when the question becomes, is OpenAI the new Open Champion? His answer is not quite. Nathan writes,
it's a phenomenal step for the open ecosystem, especially for the West and its allies,
that the most known brand in the AI space has returned to openly releasing models. This is
momentum and could be the start of the turning point of adoption and impact of open models relative to
China. But he writes, there's a lot of uncertainty in the incentives for open models. Some of the
best China analysts I know share how China is sensing that releasing open models is a successful
strategy for them and they are doubling down. OpenAI's release is a step
in the right direction, but it's still a precarious position. Many people are making noise about
creating open models from the AI Action Plan to venture capitalists and academics. What all of these
parties have in common is that it is not their number one goal. Now, Nathan does have a new
initiative, which does have that as its number one goal, called the ATOM Project, and which we'll
likely get into later in the week, but this sort of validates the point that Ethan Mollick was making
as well. There is also, lastly, this question of speed, efficiency, and cost. One of the less
talked about aspects of this thing so far is that these models are extraordinarily fast and
extraordinarily cheap. Most of the research has suggested that at this stage, everyone's just using
the best model whatever it costs, but that won't necessarily be the case forever, and it
won't necessarily be the case as we get more complex workloads that just consume a boatload
of tokens. So all in all, I would say that the shine is slightly diminished from where it was yesterday,
but that still most people are very excited to have OpenAI back in the open game, and are going to continue
to dig in and see how much can be wrenched out of these models before they're abandoned in any sort of
Hugging Face model junk heap. As you know, however, GPT-OSS was not the only model launch yesterday. There was
also Genie 3, the new Google world simulation model, and the reviews for that one could not be more rave.
I posted a poll on Twitter asking which launch is a bigger deal, and the sample size wasn't huge,
just 75 people voted, but it was almost a dead heat, with 50.7% saying GPT-OSS and
49.3% saying Genie 3. Given how much hype OpenAI usually has around their model launches,
and given the fact that Genie 3 wasn't even in Google's main line of Gemini models,
I think that result is hugely telling and certainly is validated by the type of conversation
that we're seeing across AI Twitter about this new model. A lot of the conversation is just
people excitedly sharing the most impressive versions of this. One clip that you'll see a lot,
and by the way, this is definitely an episode that you're going to want to watch, same with
yesterday's. Matt McGill from DeepMind tweets,
One nice thing you can do with an interactive world model, look down and see your footwear
and if the model understands what puddles are. Obviously with him posting it,
Genie 3 does this well. Justine Moore from A16Z writes blown away by the outputs from
Genie 3. This is a huge moment for world models. We now have real-time, playable simulations
that you can generate from a text prompt. Boris Minardis writes,
generating cool-looking worlds is one thing, but the physics simulation? Just wow, absolutely
amazing. Adana Singh writes, this is the most magical thing I've seen since LLMs. Ali Eslami, who admittedly
is working at Google, says Genie 3 is the most impressive AI demo I've seen since ChatGPT. And if you think
that's just bias, Stephen Heidel from OpenAI said this is absolutely incredible, a real feel-the-AGI moment.
Congrats to the Google DeepMind team. Machine Learning Street Talk called this the most mind-blowing
technology they've seen since starting their podcast. And Professor Derya Unutmaz says,
I strongly believe Genie 3 is the AGI moment for AI video.
This is a mind-blowing advance.
I didn't anticipate that such interactive playable AI-generated environments would be possible this year.
AI is advancing faster than even we super-optimists predict.
There were also a number of comments like this one from Christoph, Thulean futurist,
who writes,
Google DeepMind is the best AI lab on planet Earth right now.
The AI for Success account doubled down on that,
saying Google DeepMind is destined to win the AGI race.
Now, notably Elon Musk disagrees, responding to Ashutosh saying,
This race will keep going for a long time.
Others joked about how much meming there had been recently,
about this sort of 8-bit voxelized dark fantasy worlds,
which have been all over TikTok and Twitter for the last couple of weeks.
Dreaming Tulpo writes,
All the disbelievers last week who said AI will never be able to achieve the OMW style
and that it's just a five-second slop video,
and now DeepMind dropped a world model on a regular Tuesday that just does it.
Spiraticalia writes,
You're telling me we've been pining over that viral AI-generated fantasy pixel video game for a few weeks,
and already Google is just like, cool, here you go.
Now, one comment that I thought was really interesting came from sincere Mickey.
Genie 3 is honestly mind-blowing, and what I love most about it is that it's not copying human work.
What the DeepMind team has created is something brand new that could not have existed without AI technology.
And I think that's a really salient point.
So much of the work stage that we're in with LLMs still is just getting them to do stuff we already do,
but better, faster, and cheaper. Now, that's starting to change, especially as we get into the
agent swarm era. Right? One of the things that we're figuring out is that in a lot of cases,
spinning up five agents to compete, doing all the same work, and then figuring out who did it best,
is a better strategy than just having one tool do the work. That starts to bridge us away from
how we've done things in the past, because it's kind of where a shift in scale and the availability
of intelligence actually is so significant that it becomes a shift in kind. What Mickey is
pointing out, however, with Genie 3, is that this is not proximate to something that we had before,
other than our real lived experience in the real world. Now, obviously, when it comes to these
world simulation models, they're still very nascent in terms of their use cases. Even as impressive
as these updates are, they're not yet in a state where, for example, they have the memory to actually
create entire game environments. What people are excited about here is the trajectory and the new
possibilities that are being opened up in their imaginations. Lastly, today, I did want to come back to
the Opus 4.1 question. It was very clearly Anthropic making sure they had an announcement in
this crazy week of announcements, which I think is a completely reasonable strategy just from a
press and communication standpoint. But as I shared on the first day, there were some who felt
like it was maybe a little rushed just to get ahead. So far, I've seen comments on both sides.
I've seen some who said that they're not sure that Opus 4.1 is all that much better than
Opus 4. Others have said that 4.1 has a much better sense of design than other models; one shared a single HTML
page that it made up of some design firm. Mostly people are still hung up on the pricing and token
availability of Claude. Gosu Kota writes, Opus 4.1 is so expensive, I'm so curious who can
afford to run this as their daily driver. And when Cursor posted, Claude Opus 4.1 is available
in Cursor, let us know what you think, a huge number of the responses were some snarky thing like this
one from August Landmesser: enjoy your one request per month, guys. I think when it comes to coding
models, since this was an extend your lead release, not a recapture your lead release, it's going
to go a little under the radar until it can be compared to GPT-5, which of course we anticipate
to be just around the corner. Anyways, guys, those are the first reactions, the sort of 24-hour reactions
to these models. Let me know what you think in the comments. Appreciate you guys listening or watching
as always. And until next time, peace.
