The AI Daily Brief: Artificial Intelligence News and Analysis - GPT-5: Everything You Need to Know
Episode Date: August 7, 2025
NLW covers the big announcement from OpenAI, and explores why the big use case that they're clearly driving at is coding. Sharing the first impressions from early testers, we cover the good and bad of the AI model that will become the default for 700 million people.
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
After literally years now of waiting, GPT-5 is here.
Is it AGI?
Is it a disappointment?
We are going to get into all of that on this special episode of the AI Daily Brief.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, back with a very big episode of the AI Daily Brief here.
Before we get into that, quick announcements as always.
First, thank you to the sponsors of today's show, KPMG, Blitzy, and Superintelligent.
To get an ad-free version of the show, go to patreon.com slash AI Daily Brief.
And if you are interested in sponsoring the show, you can reach out to sponsors at
aidailybrief.ai. But with that, let's dive right into the meat of this, because it was a big day.
It is not hyperbole to say that this has been years in the making.
GPT-4 was the dominant model for so long. I mean, almost throughout basically all of
'23, it was the undisputed champion. And throughout most of the first three quarters of 2024,
it was really just everyone catching up and putting out their GPT-4-class model. Now, things started
to change a little bit, of course, in the fall of '24 when we got o1 and the advent of reasoning
models. This year, we've seen continued progress on reasoning models. For many tasks, OpenAI's
o3 model is the main model they use, but then we've also got the insurgency of Chinese open
models, DeepSeek's R1, for example. And in the realm
of coding, which is very germane to our story today, we've had the incredible dominance of Anthropic.
Ever since Claude 3.5 Sonnet through 3.7 Sonnet, Opus 4 and now Opus 4.1, Anthropic has had the
go-to coding model for a huge number of people. Now, Gemini 2.5 has also been up there in
contention, particularly because Google has done a really good job competing on cost as well as
performance. And while some people used o3 for coding, it really has been the Anthropic models that
have been completely dominant. From what we heard coming into today's announcement, one of the big
leaps for GPT-5 was going to be around coding. And certainly that emphasis was borne out in the presentation.
In fact, I would go so far as to say that if you were just watching this presentation,
to get up to speed on where the big labs thought the AI competition was, I'm not sure that
you would believe that there is any use case that OpenAI cared about except coding.
We're going to talk a lot more about that, but first let's go quickly through what they announced,
get some first impressions, talk about some other non-coding use cases, look at the benchmarks,
and then we'll get into the meat, which is surprise, surprise, all about the advent of vibe
coding, the democratization of code-based creation, and the GPT-5 future that OpenAI is laying out
for us. As you would expect, there are actually a few different models that were announced,
GPT-5, GPT-5 Mini, and GPT-5 Nano. They all have 400K context lengths and really competitive costs,
which is something we'll get into in a little bit as well. From an availability standpoint,
these models are rolling out right now to top end users and will be available to education
and business users next week. The presentation was about an hour and 20 minutes long. And to some,
the energy was a little underwhelming. Signal writes, this launch felt like attending a funeral
hosted by minimalists. They're unveiling tech that should feel magical, real breakthroughs,
but the whole vibe was grayscale grief. Even the storytelling arc, chart styles, then eulogy
tributes, then closing on someone's health battles. What exactly are we as the audience
mourning? Potentially great product, sure, but the emotional tone was so DOA. Incredibly strange all
around. Now, of course, Elon waded right into that conversation, adding "underwhelming" to it.
And while I might not have felt the same as Signal, it definitely felt less energetic than it should
have given the magnitude of what was being launched. Still, if there was one thing that the denizens of
Twitter slash X really latched onto, it was this particular chart showing off their performance
on SWE-bench. I don't think I've ever seen a chart from an AI launch be shared so many
times with so much gasping disbelief. For those of you who are just listening, not watching,
it's a chart of GPT-5's performance on SWE-bench Verified compared to o3 and 4o. And the chart
has 4o and o3 at exactly the same height in the chart, but with one scoring 30.8% accuracy and
one scoring 69.1% accuracy. And then for GPT-5, it has a with-thinking and without-thinking score,
52.8% without thinking, 74.9% with thinking. But for some reason, the bar for 52.8%
is above the 69.1% of o3. It's just a real aneurysm-inducing chart.
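For anyone reading rather than watching, here is a minimal sketch of what a consistent version of that chart would look like, assuming bar heights are meant to scale linearly with the score. The scores are the ones read out above; the `bar_heights` helper and the 100-unit scale are illustrative assumptions, not anything from OpenAI's materials.

```python
# Scores as read out in the episode (percent accuracy on SWE-bench Verified).
scores = {
    "GPT-5 (with thinking)": 74.9,
    "GPT-5 (without thinking)": 52.8,
    "o3": 69.1,
    "GPT-4o": 30.8,
}


def bar_heights(values, max_height=100.0):
    """Scale each score linearly so the largest score gets max_height."""
    top = max(values.values())
    return {name: round(v / top * max_height, 1) for name, v in values.items()}


heights = bar_heights(scores)

# Print the bars from tallest to shortest.
for name, height in sorted(heights.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {height:5.1f}")
```

With linear scaling, the o3 bar has to be taller than the GPT-5 without-thinking bar, and the 4o bar cannot match the o3 bar, which is exactly what the shared chart got wrong.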
Now, that said, we're sharing these things for the sake of completeness of the reaction. We're in
the moment. Relatively speaking, if you got a couple people who say that the mood was more dreary
than it might have been, and a lot of people sharing a single chart they have a problem with,
and that's the biggest critique that you're getting, you're in pretty good shape. Now, as you know,
when it comes to understanding how good a model is, I tend not to care all that much about the
benchmarks, especially the ones that are pretty saturated up near the top, but for the sake of
completeness, we should talk about them at least a little bit. Now, one really interesting note is that on
their research website and in the presentations, OpenAI chose not to compare their performance to
any other models except OpenAI models. Whereas every other company is comparing themselves to
Grok and to Gemini and to OpenAI and to Anthropic, OpenAI is only comparing GPT-5 to past
OpenAI models, which I get as a statement to keep the focus on OpenAI, but it does sort of
have the feel of not wanting people to see how it compared to other models, which doesn't
really make sense because, as expected, GPT-5 crushed a lot of these benchmarks. For example,
on Humanity's Last Exam, GPT-5 with no tools got 24.8% compared to o3's 14.7% with no tools,
while GPT-5 Pro with full access to tools got 42%.
On coding, you heard about SWE-bench Verified.
On instruction following and agentic tool use, it scored really well,
which is something that's going to come up in a little bit.
And overall, across the stats that they share,
GPT-5 just performed really, really well.
You remember I've talked a little bit about their internal benchmark
for economically important tasks.
The really interesting stat that they had shared previously
is that ChatGPT agent with full access to tools
did comparable or better to humans in roughly half the use cases, and GPT-5 was right around that
as well. In fact, GPT-5 did even a little bit better than ChatGPT agent. But what about independent
benchmarks? One of the more comprehensive independent benchmarks comes from Artificial Analysis,
and the TLDR is that their high-end version of GPT-5 is now the highest-performing model across their
tests. You can see here that GPT-5 high and GPT-5 medium are at 69 and 68, both very slightly above Grok 4.
For those of you who are interested in going deeper, you can go on Artificial Analysis's website
to see where GPT-5 fit across all their eight different measures.
They were right at the top of many of them, including the MMLU, Humanity's Last Exam,
the AIME, and the AA-LCR or Long Context Reasoning, which is a benchmark directly from
Artificial Analysis themselves.
In fact, long-context reasoning stands out as one of the areas of the biggest gains for
GPT-5, which matters because that test is really all about what a model can do in an agentic
context. Speaking of that, METR, whose chart I've shared numerous times on how long a model can
successfully complete tasks for, now has GPT-5 at the top of the pack. Now, this is the metric that
maps how long of a task a model can be successful at with a 50% success rate, and GPT-5 pushes that
up to about two hours and 15 minutes. On LMArena, GPT-5 debuts as number one. This comes from its
testing period when it was under the codename Summit, and it now is above Gemini 2.5 Pro, o3,
Grok 4, Qwen 3, and a number of other models.
One area where it kind of seemed to underperform is on the ARC-AGI test.
On both ARC-AGI-1 and especially on ARC-AGI-2, GPT-5 did well, but it was meaningfully behind
Grok 4.
At the same time, when you start to factor in efficiency, GPT-5 Mini does really well.
In fact, Greg Kamradt, the president at ARC Prize, says that it appears to break the current
ARC-AGI frontier.
A couple of areas where OpenAI also clearly tried to put emphasis
were hallucinations and sycophancy. On hallucinations, at least according to their own reporting,
GPT-5 is way, way lower than o3 and 4o. Simon Willison called out that when it comes to sycophancy,
they've tried to build solves into the core model. He quoted their research post, which said,
for GPT-5, we post-trained our models to reduce sycophancy. Using conversations representative of
production data, we evaluated model responses, then assigned a score reflecting the level of sycophancy,
which was used as a reward signal in training.
So a bunch of sort of day-to-day improvements as well.
But what about when it comes to the big use cases?
There were three that OpenAI mentioned in their presentation,
and really, as we'll discuss, only one that felt like it really mattered.
So the three use cases that OpenAI mentioned were health, writing, and coding.
On the health front, they talked about a number of benchmarks,
but they actually humanized this one and made it about much more than benchmarks
and much more instead about a bigger change.
Roe want to count on Twitter?
OpenAI has brought out a cancer survivor to share her story
that she had to use ChatGPT to help her advocate for herself
and fight for her cancer and be able to challenge the doctor's opinions
to make a better decision.
This is bound to ruffle the medical industry and doctors' feathers.
We foresee a battle incoming here between medicine and AI.
Doctors were already annoyed by patients Googling their symptoms before they came in.
Imagine now when patients say,
but my AI said and don't trust their doctor's judgment.
There's going to be a tension.
And yes, of course, doctors can be wrong and AI can be useful and save lives.
This isn't a value judgment. It's an observation of an incoming tension.
Now, Elon waded into that one as well and said AI is already better than most doctors.
That's the honest truth. And it will become far better. Same for all jobs, to be honest, including mine.
Now, my strong perspective on this one is that while, yes, undercutting and undermining every decision that a doctor makes
because what an AI said isn't the right approach, in general, a more informed patient base,
while something that doctors will have to adapt to is going to be a net positive.
People should have more information, they should have access to more resources.
And so often when it comes to medical decisions, all doctors can do is give you the best set of
information and leave the decision to you.
That was the situation in this particular case, where the patient wasn't sure what choice
to make about a particular course of treatment that had a bunch of severe consequences and
potential side effects.
Now, we're not going to dwell too much on this one on this show, but it was notable that this is a
use case that they are really emphasizing as something that a lot of people are turning to
ChatGPT for. The next two use cases bring it back, of course, to the business world. And the first we're
going to talk about is writing. Dan Shipper and the crew of Every had a chance to test GPT-5 for a few weeks
and made this handy-dandy chart where they gave a simple yes or no for a handful of key tasks,
including day-to-day tasks, that got a yes. Pair programming, that also got a yes. Agentic
engineering, that got a no. We'll come back to that. And writing was almost a split decision.
For writing, it got a yes, but for editing it got a no.
So on the positive side, when it came to writing, they said,
GPT-5 has a good voice, nuanced and expressive.
It's less likely to output obvious AI idioms, so it's the first thing we turn to when we have
a sentence we need to polish or a paragraph we need to draft.
We sometimes return to GPT-4.5 for questions that require more thought.
Interestingly, however, they say when it comes to editing, it's a no-go.
For editing, they write, GPT-5 cannot determine whether writing is good.
We have benchmarks to test AI's ability to judge writing, and GPT-5 consistently fails on tasks that
Opus 4 passes.
Latent Space also published a first look that we'll come back to, and they actually had a much more
negative opinion.
They wrote, while GPT-5 continues to work its way up the software engineering ladder, it's not
really a great writer.
GPT-4.5 and DeepSeek R1 are still much better.
They shared a couple of examples of using GPT-5 to rewrite some LinkedIn posts, and they
felt like the GPT-5 answer was sloppier than the GPT-4.5 answer.
In my life, I have a pretty strict dividing line between types of writing that I'll use AI for
and types of writing that I won't. And pretty much in the column I won't is everything other
than podcast descriptions and really basic stuff like that. So as I get my hands on GPT-5 in the
coming weeks, I will certainly be testing to see whether this breaks out of that, but at least
for right now, I'm not holding my breath. And yet if that all feels like prelude, you're not wrong.
During the presentation, I tweeted, with GPT-5, OpenAI is making an extremely loud argument that
there is one singular AI use case that matters. Can you guess which one? Yes. When push comes to shove,
this presentation was entirely about coding. OpenAI certainly tried to make the argument
that GPT-5 was now the undisputed king for coding. They had, for example, Michael Truell,
the CEO of Cursor, come up during the presentation and say in no uncertain terms that GPT-5
was the smartest coding model they've tried. But coding, like writing, is about so much more than
benchmarks. It's about vibes and experience. And as we've seen so many times over the last few
months, there will be a model that technically has better benchmarks than one of the Anthropic models,
and it just doesn't displace them. AI agents are the buzzword that everyone's talking about,
but do you truly understand their significance? KPMG's agent framework demystifies the concept,
offering practical steps to unlock AI agents' immense potential. Think of it as your GPS for
AI strategy. KPMG partners with clients to harness the benefits of AI agents, guiding you from
strategy to execution with a secure architecture and a plan for workforce evolution. Check out their
comprehensive insights on scaling agent power within your enterprise. This isn't just about tech. It's a
leadership imperative. Go to www.kpmg.us slash agents to learn more. That's www.kpmg.us
slash agents.
This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform
with infinite code context.
Blitzy uses thousands of specialized AI agents that think for hours to understand
enterprise-scale code bases with millions of lines of code.
Enterprise engineering leaders start every development sprint with the Blitzy platform,
bringing in their development requirements.
The Blitzy platform provides a plan, then generates and pre-compiles code for each task.
Blitzy delivers 80% plus of the development work autonomously while providing a
guide for the final 20% of human development work required to complete the sprint.
Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy
as their pre-IDE development tool, pairing it with their coding co-pilot of choice to bring
an AI-native SDLC into their org.
Blitzy is providing a limited time, 30-day free proof of concept for qualifying enterprises.
The team will provide a 5x velocity increase on a real development project in your org.
Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from
AI-assisted to AI-native. That's BLITZY.com.
If you are a regular listener, you will have heard about Superintelligent's Agent Readiness
Audits at this point. But I wanted to tell you today about the full suite of Agent
Readiness products that go beyond just the initial readiness report. Over the last six
months, Superintelligent has built out an entire Agent Planning Suite. We help you move from
discovery to planning to implementation. After you've completed your Agent Readiness Audits,
We help you double-click on your most important use cases with what we call our use case planning reports.
These reports are going to help you understand what sort of technical preparation you need to do to be ready for a use case,
what challenges you might face in implementation, and whether you should be thinking about building, buying, partnering, or some combination.
After that, you can even get a spec document in what we call our technical blueprint
that gives either your developers or the developers of the partner you work with what they need to build exactly the agent that you're looking for.
If you want to learn more about Superintelligent's agent planning suite, we built a custom GPT to answer your questions.
Just go to bit.ly slash super agent. That's bit.ly slash super agent, all one word. And if you have any
questions, the agent can even help you book an appointment with our team. Now it's early, but a lot of the
early reviews are about coding. So let's go see what people had to say. Turning back to Every one
more time, they again had a mixed review.
They basically said that GPT-5 was a great pair programmer. Dan Shipper wrote,
Don't get me wrong, GPT-5 is a very good programmer. It's incredibly useful as a pair programmer,
especially in AI-powered integrated development environments (IDEs) like Cursor. It's great for engineers
from traditional backgrounds who want an AI to help collaborate on code, and it excels at research
and debugging complex issues. What Dan argues, however, is that, quote,
the discipline of programming has fundamentally changed this summer. The benchmarks don't show it, but if you
know how to YOLO four agents at once in Claude Code, GPT-5 feels like a step backward. That's
partially because of the model's current personality. It's more cautious than Opus 4.1 and isn't
as comfortable working independently for long periods in our testing. But it's also due to the
app you use to interact with it. Both Cursor and OpenAI's command line interface tool Codex
CLI are not on the same level as Claude Code. Both were built for programmer-AI pair
programming, not true delegation. I bet this will change. The model is extremely smart,
just not yet built for this use case. But for now, OpenAI seems to have missed the paradigm
shift in programming caused by Claude Code over the last two months. Now, I wanted to present that
one first, because so far at least, they're kind of the only ones I've seen saying that.
Most of the other reviewers are incredibly impressed. Like the team at Cursor, the team at Lovable
was incredibly excited about it. YouTuber and AI educator Matthew Berman wrote,
GPT-5 is here, and spoiler, it's a coding master. I've had the privilege of testing GPT-5 for about a
week now, and I gave it the most difficult tests I could come up with. He gave it a Rubik's
cube test that only previously Gemini 2.5 Pro had cracked. He asked it to make an Excel clone
and a Microsoft Word clone. He asked for a more complex version of the classic 90s phone game
snake, asked it to solve some physics, gave it the now famous hexagon with balls bouncing inside
test, and a number of others, and was basically blown away by the results. Now, one thing Berman
noted was that GPT-5 is really good at front ends, and again, this is something that I've seen
over and over again, is that while there is a general sort of feel and a flavor to a lot of the
UI design that you see among AI and agentic coding, in a similar way that you see common patterns
across AI-based writing, it seems at least from first glance that GPT-5 often breaks out of those
patterns. Pietro Schirano shared a compilation of different experiences that he had made with
GPT-5 in one shot, coming away extremely enthusiastic. He said the poem camera app is particularly
impressive because the model came up with all the details, like the way the photos stack in the gallery,
the photo developing animation, etc. He actually made a bold pronouncement. In another tweet, he said,
I had early access to GPT-5. It will do for coding what GPT-4 did for LLM adoption. It's fast, really smart,
has great taste and aesthetic sensibility. This is electricity arriving in every home, a before and after
moment in how we build. GPT-5 is the best coding model ever, but what really impressed me is how good
of a collaborator it is. It tackles large implementations much better than before. At times,
it was able to refactor thousands of lines of code at once, and also debugs large repos way faster and
with precision. It's also, he writes, a really, really good agent. I was able to run long agentic flows
with no issues. It's also better at explaining why to run a certain tool versus another,
and it has the best results on running parallel tools I've ever seen. This is particularly
important for flows where, for instance, you need to create multiple files at the same time,
and consistency becomes a problem. GPT-5 has no issue with that.
Matt Shumer had a really interesting take.
He said,
For those who will be testing and using GPT-5,
you won't see much improvement
if you're asking it to do things Sonnet or o3 can already do.
To see how good it truly is,
you have to ask it to do things that other AI simply can't.
In his longer review, he expanded this.
His TLDR included: GPT-5 is clearly a big leap from previous models,
but you have to push it hard to get the most out of it.
The ceiling for what can be vibe-coded is now much higher
than it was with previous models. Matt wrote, I was granted access to GPT-5 on July 21st, and honestly,
when I started testing it, I wasn't blown away. In fact, I felt quite let down, especially given all the
hype and expectations around it. The model felt like GPT-4.2 at best: faster, definitely sharper than
4.1, but not some huge leap. I tried to use it for my day-to-day work, which in my opinion is the best way
to evaluate any new model, and while it handled the tasks I was giving it very well, I wasn't noticing
anything dramatically better than GPT-4.1, Claude 4 Opus, or any of the other models I've been using.
I caught myself thinking, is this really it? I settled into a routine of using GPT-5 for pretty much everything
I would use existing LLMs for, and this went on for about a week. Was it better than Claude 4 Opus,
my previous daily driver? Yes, undoubtedly, but only marginally. It felt like a small incremental
improvement. But then things took an unexpected turn. Josh, my lead engineer at HyperWrite,
and I had spent an afternoon discussing a complex new product idea. One we'd estimate would take weeks,
months of dedicated engineering work to even get a proof of concept together. The idea was intricate,
involving a sophisticated front end with tightly integrated components, and a complex back-end
infrastructure for managing GPUs, auto-scaling resources, and lifecycle management. This wasn't the
kind of thing you just vibe code, even with the help of AI. It required deliberate human
oversight at every step. Or so we thought. Josh and I already decided we need at least a full month
of discovery just to figure out if a buildout was worth attempting. That night, purely out of curiosity,
I fed GPT-5 a product spec, fully expecting it to stumble immediately.
An hour later, I sent Josh a fully working prototype.
His immediate reply, WTF.
That moment completely flipped how I thought about GPT-5.
We literally skipped a month of upfront customer discovery and planning.
We could just immediately go test with real users.
From there, things got interesting fast.
I started probing deeper, trying more ambitious tasks that I never even bothered asking
previous models.
The more I did, the clearer it became that GPT-5 wasn't incremental.
Now, Matt also talked about this whole front-end piece.
He said if you've used AI for front-end before, you probably know what I mean when I say
it usually feels made by AI.
The designs are typically a bit clumsy, predictable, obviously machine-generated.
With GPT-5, though, the UIs felt way closer to convincingly human, 80% indistinguishable
at a glance.
On back-end and infrastructure, GPT-5 was just as good, maybe even more impressive.
The deeper I went, the more clearly I saw just how different GPT-5 was.
Now, he did find that there were specific tasks that he still prefers other models for.
He said, for example, for explicit search tasks, I still prefer o3.
GPT-5 stops digging sooner.
For example, I was trying to have GPT-5 find the hometown of a public figure.
It only found the city and stopped there.
He also said on emotional or sensitive tasks like crafting difficult emails,
I still strongly prefer GPT-4.5.
Ultimately, he says GPT-5 is a true leap.
Bottom line, GPT-5 isn't just going to improve vibe coding.
It will fundamentally change the kinds of projects I consider doable
without serious human intervention and steering.
This past week, it turned what I confidently
thought was a multi-month engineering challenge into a casual one-hour sprint.
This is serious, real autonomous software engineering.
He summed it up on Twitter as well, saying,
you can now vibe code real software, not just simple SaaS apps, but real technical software.
And by the way, the vibe coders seem to agree.
Felix from Lovable wrote,
GPT-4 was "Build me a to-do app."
GPT-5 is "Build me a SaaS with user auth, payments, admin dashboard, and email automation."
We're not improving code generation.
We're eliminating the need
to code. Now, one of the most interesting reviews came from Ben Hylak and the team at Latent
Space. He called it "GPT-5 Hands-On: Welcome to the Stone Age." Ben writes, TLDR, I think GPT-5 is the
closest to AGI we've ever been. It's truly exceptional at software engineering, from one-shotting
complex apps to solving really gnarly issues around a massive code base. Now, I wish the story was that
simple. I wish I could tell you that it's just better at everything and anything, but that
wouldn't be true. It's actually worse at writing than GPT-4.5 and I think even 4o. In most ways,
it won't immediately strike you as some sort of super genius. Because of those flaws, not despite
them, it has fundamentally changed how I see the march towards AGI. And then from there, Ben goes on
to explain his new theory. The Stone Age, he writes, marked the dawn of human intelligence.
But what exactly made it so significant? Did humans win a critical chess battle? Perhaps we proved
a fundamental theorem? Recited more digits of Pi? No, the beginning of the Stone Age is clearly
demarcated by one thing and one thing only. Humans learned how to use tools. We shaped tools and
our tools shaped us. And they really did shape us. As humans, we manifest our intelligence through tools.
Tools extend our capabilities. We trade internal capabilities for external capabilities.
It's the defining characteristic of our intelligence. GPT-5 marks the beginning of the stone age for
agents and LLMs. GPT-5 doesn't just use tools. It thinks with them. It builds with them.
Now, interestingly, he makes the comparison to other things that we've seen from OpenAI.
He basically suggested that what made Deep Research better was that OpenAI had taught o3 how to
conduct research on the internet, that it was about tools.
Like some of the others you've heard from, Ben says that GPT-5 is really good at using tools
in parallel.
Other models, he writes, were technically capable of parallel tool calling, but (a) rarely
did it in practice and (b) rarely did it correctly.
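As a rough illustration of what "parallel tool calling" means in practice, here is a minimal sketch using Python's standard asyncio library. The `fake_tool` coroutine and the tool names are hypothetical stand-ins for real tool calls such as file writes or searches; none of this is taken from OpenAI's API or from Ben's post.

```python
import asyncio
import time


async def fake_tool(name: str, delay: float) -> str:
    """Hypothetical stand-in for a tool call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def run_sequential(calls):
    # One call at a time: total time is roughly the sum of the delays.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    # All calls in flight at once: total time is roughly the longest delay.
    # asyncio.gather returns results in the same order as its arguments.
    return list(await asyncio.gather(*(fake_tool(n, d) for n, d in calls)))


calls = [("write_a", 0.05), ("write_b", 0.05), ("write_c", 0.05)]

start = time.perf_counter()
sequential_results = asyncio.run(run_sequential(calls))
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel_results = asyncio.run(run_parallel(calls))
parallel_seconds = time.perf_counter() - start

print(sequential_results == parallel_results)
print(f"sequential {sequential_seconds:.2f}s, parallel {parallel_seconds:.2f}s")
```

The results are the same either way. The difference is latency, plus the harder part the reviewers are pointing at: the model has to decide which calls are independent enough to issue together.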
He talked about a few areas of coding challenges, where other models had had huge problems,
and where GPT-5 just got it right.
He wrote,
We were dealing with gnarly nested dependency conflicts
adding Vercel's AI SDK v5 and Zod 4
to our codebase.
o3 and Cursor couldn't figure it out.
Claude Code and Opus 4 couldn't figure it out.
GPT-5 one-shotted it.
It was honestly beautiful to watch
and instantly made the model click for me.
Claude Opus thought for a while,
came up with a guess,
and then ran some tool calls
to edit files and rerun installation.
Some failed, some succeeded.
It ended the response with,
here are some things to try,
aka giving up.
With GPT5,
I felt like I was watching Deep Research, but using the yarn why command.
It went into a bunch of folders, ran yarn why, taking notes in between.
When it found something that didn't quite add up, it stopped and thought about it.
When it was done thinking, it perfectly edited the necessary lines across multiple folders.
It was able to iterate its way to success by identifying and reasoning about what doesn't work, making changes, and testing.
Swyx added, I also had a related experience during the GPT-5 demo video shoot with OpenAI,
where GPT-5 was successfully able to debug three layers of nested abstractions to turn an old codebase
using an old AI SDK version into one supporting GPT-5.
An AI modifying a codebase to support more inferences of itself
was definitely a feel-the-AGI moment for me.
Now, Ben also writes,
GPT-5 one-shots things like no model I've ever seen.
I needed to create a complex ClickHouse query to export some data,
and similarly, while o3 struggled, GPT-5 just one-shotted it.
I used GPT-5 and Cursor to make a website I've wanted for a while,
and with the same prompt o3 and Cursor just gave me a plan.
Once I followed up to tell it to implement its plan,
it created the app scaffolding but not the actual project.
We're already on follow-up number three.
I've spent 10x more time than with GPT-5, and there's no app.
Now, what about when it comes to Claude Opus,
as that's the real comparison point for coders?
Ben writes,
Claude Opus 4 is as good as ever at coding
and got to work immediately,
quickly taking action to create the project and scaffolding.
Opus 4 gave me a more fun and gamified UI,
but unlike GPT-5,
which used existing frameworks like Create Next App
and included a SQLite database,
Opus 4 decided to do everything from scratch and didn't include a database.
This makes for a good one-shot prototype, but what GPT-5 one-shotted was much closer to production-ready.
Freshly released Claude Opus 4.1 was clearly a step more ambitious than Opus 4,
also attempting the full-stack app complete with a SQLite database just like GPT-5.
However, it really struggled putting all the pieces together.
While GPT-5 ran perfectly in one shot, 4.1 encountered build errors, which took multiple back-and-forths to resolve.
Ben concludes, I think GPT-5 is unequivocally the best coding model in the world.
We were probably around 65% of the way through automating software engineering, and now we
might be around 72%. To me, it's the biggest leap since 3.5 Sonnet.
So what does this all add up to? Like I said, I think it's immensely clear from this presentation
that OpenAI believes that right now, the most important use case for LLMs is coding.
I don't think that they believe that that's the only use case. I think that they obviously see
AI impacting basically every domain of knowledge work, along with so many other aspects of our lives
that have nothing to do with work. But when it comes to where the important use of this technology is
right now, it is so clearly about coding. Now, maybe there is an overemphasis here on that because
they are trying to catch up in this very key area that they have historically been slightly behind in,
but I don't think it's just that. What you just heard over and over again from all of those early
interactors that I just shared is that this allows for people to build even bigger, more ambitious
things. And it allows them to do it in one shot. I think the implications of that are not just
that coders got a new, better tool. It's that once again, the parabola of who gets to create
with code has expanded. Remember, part of the goal of GPT-5 is to be a single default model that can
actually select which sub-model is going to work best for a particular task. Instead of having
to select between 4o or o3 based on what you're trying to do, GPT-5 just figures that out for you.
For the vast majority of ChatGPT users, this will undoubtedly be the most powerful, performant
model they've ever used. It won't even be close. It seems pretty clear to me that OpenAI
thinks that this trend towards vibe coding that we've seen emerge so immensely over the course of
2025 is not only not going anywhere, but is only likely to expand. I think that they believe that a lot of
people's next big experience with ChatGPT, enabled by GPT-5, is going to be some version of vibe coding.
Look, there was like more than half an hour of code-specific content in this presentation,
all of which happened before minute 72, which was the first time that the enterprise was explicitly
mentioned. They casually shared that five million businesses are now using ChatGPT, but it
was such an afterthought. To me, the tea leaves read like they think that this is the big competition,
this whole coding space, both for existing developers and for the next generation of proto-vibe
developers that are going to come online. And one additional note is that they are not just
competing on performance, they are competing on price. Both the input and output costs for all of
these GPT-5 models match Gemini 2.5, which had also been competing on price, and just
absolutely blow Anthropic out of the water. I lost the tweet somewhere along
the way, but someone basically said, even if you think Claude Opus 4.1 is better, if it's 10 times as much,
are you really going to justify that difference? That's a question we'll have to wait and see on,
but I think it's probably pertinent in this competition. The next step from here, of course,
is to get in there and start testing. The three focuses I'm going to have are strategy,
which is what I use o3 for day in and day out, writing, to see if it actually opens up any use cases that have been
off the table for me so far, and vibe coding. I really want to know how it
compares, and I'm going to listen to folks like Matt Shumer and try to err more on the ambitious
side and see what can come out. I will, of course, report back on those experiments, as well as
sharing what are inevitably going to be just a ton of takes over the next few days. For now, though,
that's going to do it for today's AI Daily Brief. Appreciate you listening or watching as always,
and go have fun with GPT-5. Until next time, peace.
