The AI Daily Brief: Artificial Intelligence News and Analysis - GPT-5: Everything You Need to Know

Episode Date: August 7, 2025

NLW covers the big announcement from OpenAI, and explores why the big use case that they're clearly driving at is coding. Sharing the first impressions from early testers, we cover the good and bad of the AI model that will become the default for 700 million people.

Brought to you by:
KPMG - Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months.
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants - https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network

Transcript
Starting point is 00:00:00 After literally years now of waiting, GPT-5 is here. Is it AGI? Is it a disappointment? We are going to get into all of that on this special episode of the AI Daily Brief. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, back with a very big episode of the AI Daily Brief here. Before we get into that, quick announcements as always. First, thank you to the sponsors of today's show, KPMG, Blitzy, and Superintelligent.
Starting point is 00:00:35 To get an ad-free version of the show, go to patreon.com slash AI Daily Brief. And if you are interested in sponsoring the show, you can reach out to sponsors at aidailybrief.ai. But with that, let's dive right into the meat of this, because it was a big day. It is not hyperbole to say that this has been years in the making. GPT-4 was the dominant model for so long. I mean, throughout basically all of '23, it was the undisputed champion. And throughout most of the first three quarters of 2024, it was really just everyone catching up and putting out their GPT-4-class model. Now, things started to change a little bit, of course, in the fall of '24 when we got o1 and the advent of reasoning
Starting point is 00:01:16 models. This year, we've seen continued progress on reasoning models. For many tasks, OpenAI's o3 model is the main model people use, but then we've also got the insurgency of Chinese open models, DeepSeek's R1, for example. And in the realm of coding, which is very germane to our story today, we've had the incredible dominance of Anthropic. Ever since Claude 3.5 Sonnet through 3.7 Sonnet, Opus 4 and now Opus 4.1, Anthropic has had the go-to coding model for a huge number of people. Now, Gemini 2.5 has also been up there in contention, particularly because Google has done a really good job competing on cost as well as performance. And while some people used o3 for coding, it really has been the Anthropic models that
Starting point is 00:01:59 have been completely dominant. From what we heard coming into today's announcement, one of the big leaps for GPT-5 was going to be around coding. And certainly that emphasis was borne out in the presentation. In fact, I would go so far as to say that if you were just watching this presentation to get up to speed on where the big labs thought the AI competition was, I'm not sure that you would believe that there is any use case that OpenAI cared about except coding. We're going to talk a lot more about that, but first let's go quickly through what they announced, get some first impressions, talk about some other non-coding use cases, look at the benchmarks, and then we'll get into the meat, which is, surprise, surprise, all about the advent of vibe
Starting point is 00:02:39 coding, the democratization of code-based creation, and the GPT-5 future that OpenAI is laying out for us. As you would expect, there are actually a few different models that were announced: GPT-5, GPT-5 mini, and GPT-5 nano. They all have 400K context lengths and really competitive costs, which is something we'll get into in a little bit as well. From an availability standpoint, these models are rolling out right now to top end users and will be available to education and business users next week. The presentation was about an hour and 20 minutes long. And to some, the energy was a little underwhelming. Signal writes, this launch felt like attending a funeral hosted by minimalists. They're unveiling tech that should feel magical, real breakthroughs,
Starting point is 00:03:17 but the whole vibe was grayscale grief. Even the storytelling arc, chart styles, then eulogy tributes, then closing on someone's health battles, what exactly are we as the audience mourning? Potentially great product, sure, but the emotional tone was so DOA. Incredibly strange all around. Now, of course, Elon waded right into that conversation, adding "underwhelming" to it. And while I might not have felt the same as Signal, it definitely felt less energetic than it should have given the magnitude of what was being launched. Still, if there was one thing that the denizens of Twitter slash X really latched onto, it was this particular chart showing off their performance on SWE-bench. I don't think I've ever seen a chart from an AI launch be shared so many
Starting point is 00:03:55 times with so much gasping disbelief. For those of you who are just listening, not watching, it's a chart of GPT-5's performance on SWE-bench Verified compared to o3 and 4o. And the chart has 4o and o3 at exactly the same height in the chart, but with one scoring 30.8% accuracy and one scoring 69.1% accuracy. And then for GPT-5, it has a with-thinking and without-thinking score, 52.8% without thinking, 74.9% with thinking. But for some reason, the 52.8% is above the 69.1% of o3. It's just a real aneurysm-inducing chart. Now, that said, we're sharing these things for the sake of completeness of the reaction in the moment. Relatively speaking, if you've got a couple people who say that the mood was more dreary
Starting point is 00:04:41 than it might have been, and a lot of people sharing a single chart they have a problem with, and that's the biggest critique that you're getting, you're in pretty good shape. Now, as you know, when it comes to understanding how good a model is, I tend not to care all that much about the benchmarks, especially the ones that are pretty saturated up near the top, but for the sake of completeness, we should talk about them at least a little bit. Now, one really interesting note is that on their research website and in the presentations, OpenAI chose not to compare their performance to any other models except OpenAI models. Whereas every other company is comparing themselves to Grok and to Gemini and to OpenAI and to Anthropic, OpenAI is only comparing GPT-5 to past
Starting point is 00:05:23 OpenAI models, which I get as a statement to keep the focus on OpenAI, but it does sort of have the feel of not wanting people to see how it compared to other models, which doesn't really make sense because, as expected, GPT-5 crushed a lot of these benchmarks. For example, on Humanity's Last Exam, GPT-5 with no tools got 24.8% compared to o3's 14.7% with no tools, while GPT-5 Pro with full access to tools got 42%. On coding, you heard about SWE-bench Verified. On instruction following and agentic tool use, it scored really well, which is something that's going to come up in a little bit.
Starting point is 00:06:00 And overall, across the stats that they share, GPT-5 just performed really, really well. You remember I've talked a little bit about their internal benchmark for economically important tasks. The really interesting stat that they had shared previously is that ChatGPT agent with full access to tools did comparable or better to humans in roughly half the use cases, and GPT-5 was right around that as well. In fact, GPT-5 did even a little bit better than ChatGPT agent. But what about independent
Starting point is 00:06:27 benchmarks? One of the more comprehensive independent benchmarks comes from Artificial Analysis, and the TLDR is that their high-end version of GPT-5 is now the highest-performing model across their tests. You can see here that GPT-5 high and GPT-5 medium are at 69 and 68, both very slightly above Grok 4. For those of you who are interested in going deeper, you can go on the Artificial Analysis website to see where GPT-5 fit across all their eight different measures. They were right at the top of many of them, including the MMLU, Humanity's Last Exam, the AIME, and the AA-LCR, or Long Context Reasoning, which is a benchmark directly from Artificial Analysis themselves.
Starting point is 00:07:05 In fact, long-context reasoning stands out as one of the areas of the biggest gains for GPT-5, which matters because that test is really all about what a model can do in an agentic context. Speaking of that, there's METR, whose chart I've shared numerous times on how long a model can successfully complete tasks for, and GPT-5 is now at the top of the pack. Now, this is the metric that maps how long of a task a model can be successful at with a 50% success rate, and GPT-5 pushes that up to about two hours and 15 minutes. On LM Arena, GPT-5 debuts as number one. This comes from its testing period when it was under the codename Summit, and it now is above Gemini 2.5 Pro 3, Grok 4, Qwen 3, and a number of other models.
Starting point is 00:07:47 One area where it kind of seemed to underperform is on the ARC-AGI test. On both ARC-AGI-1 and especially on ARC-AGI-2, GPT-5 did well, but it was meaningfully behind Grok 4. At the same time, when you start to factor in efficiency, GPT-5 mini does really well. In fact, Greg Kamradt, the president of ARC Prize, says that it appears to break the current ARC-AGI frontier. A couple of areas where OpenAI also clearly tried to put emphasis were hallucinations and sycophancy. On hallucinations, at least according to their own reporting,
Starting point is 00:08:19 GPT-5 is way, way lower than o3 and 4o. Simon Willison called out that when it comes to sycophancy, they've tried to build solves into the core model. He quoted their research post, which said, for GPT-5, we post-trained our models to reduce sycophancy. Using conversations representative of production data, we evaluated model responses, then assigned a score reflecting the level of sycophancy, which was used as a reward signal in training. So a bunch of sort of day-to-day improvements as well. But what about when it comes to the big use cases? There were three that OpenAI mentioned in their presentation,
Starting point is 00:08:52 and really, as we'll discuss, only one that felt like it really mattered. So the three use cases that OpenAI mentioned were health, writing, and coding. On the health front, they talked about a number of benchmarks, but they actually humanized this one and made it about much more than benchmarks and much more instead about a bigger change. Roe want to count on Twitter? OpenAI has brought out a cancer survivor to share her story that she had to use ChatGPT to help her advocate for herself
Starting point is 00:09:18 and fight for her cancer and be able to challenge the doctor's opinions to make a better decision. This is bound to ruffle the medical industry and doctors' feathers. We foresee a battle coming here between medicine and AI. Doctors were already annoyed by patients Googling their symptoms before they came in. Imagine now when patients say, but my AI said, and don't trust their doctor's judgment. There's going to be a tension.
Starting point is 00:09:37 And yes, of course, doctors can be wrong and AI can be useful and save lives. This isn't a value judgment. It's an observation of an incoming tension. Now, Elon waded into that one as well and said AI is already better than most doctors. That's the honest truth. And it will become far better. Same for all jobs, to be honest, including mine. Now, my strong perspective on this one is that while, yes, undercutting and undermining every decision that a doctor makes because of what an AI said isn't the right approach, in general, a more informed patient base, while something that doctors will have to adapt to, is going to be a net positive. People should have more information, they should have access to more resources.
Starting point is 00:10:14 And so often when it comes to medical decisions, all doctors can do is give you the best set of information and leave the decision to you. That was the situation in this particular case, where the patient wasn't sure what choice to make about a particular course of treatment that had a bunch of severe consequences and potential side effects. Now, we're not going to dwell on this too much on this show, but it was notable that this is a use case that they are really emphasizing as something that a lot of people are turning to ChatGPT for. The next two use cases bring it back, of course, to the business world. And the first we're
Starting point is 00:10:45 going to talk about is writing. Dan Shipper and the crew at Every had a chance to test GPT-5 for a few weeks and made this handy-dandy chart where they gave a simple yes or no for a handful of key tasks, including day-to-day tasks, that got a yes. Pair programming, that also got a yes. Agentic engineering, that got a no. We'll come back to that. And writing was almost a split decision. For writing, it got a yes, but for editing it got a no. So on the positive side, when it came to writing, they said, GPT-5 has a good voice, nuanced and expressive. It's less likely to output obvious AI idioms, so it's the first thing we turn to when we have
Starting point is 00:11:18 a sentence we need to polish or a paragraph we need to draft. We sometimes return to GPT-4.5 for questions that require more thought. Interestingly, however, they say when it comes to editing, it's a no-go. For editing, they write, GPT-5 cannot determine whether writing is good. We have benchmarks to test AI's ability to judge writing, and GPT-5 consistently fails on tasks that Opus 4 passes. Latent Space also published a first look that we'll come back to, and they actually had a much more negative opinion.
Starting point is 00:11:44 They wrote, while GPT-5 continues to work its way up the software engineering ladder, it's not really a great writer. GPT-4.5 and DeepSeek R1 are still much better. They shared a couple of examples of using GPT-5 to rewrite some LinkedIn posts, and they felt like the GPT-5 answer was more sloppy than the GPT-4.5 answer. In my life, I have a pretty strict dividing line between types of writing that I'll use AI for and types of writing that I won't. And pretty much in the column I won't is everything other than podcast descriptions and really basic stuff like that. So as I get my hands on GPT-5 in the
Starting point is 00:12:16 coming weeks, I will certainly be testing to see whether this breaks out of that, but at least for right now, I'm not holding my breath. And yet if that all feels like prelude, you're not wrong. During the presentation, I tweeted, with GPT-5, OpenAI is making an extremely loud argument that there is one singular AI use case that matters. Can you guess which one? Yes. When push comes to shove, this presentation was entirely about coding. OpenAI certainly tried to make the argument that GPT-5 was now the undisputed king for coding. They had, for example, Michael Truell, the CEO of Cursor, come up during the presentation and say in no uncertain terms that GPT-5 was the smartest coding model they've tried. But coding, like writing, is about so much more than
Starting point is 00:13:01 benchmarks. It's about vibes and experience. And as we've seen so many times over the last few months, there will be a model that technically has better benchmarks than one of the Anthropic models, and it just doesn't displace them. AI agents are the buzzword that everyone's talking about, but do you truly understand their significance? KPMG's agent framework demystifies the concept, offering practical steps to unlock AI agents' immense potential. Think of it as your GPS for AI strategy. KPMG partners with clients to harness the benefits of AI agents, guiding you from strategy to execution with a secure architecture and a plan for workforce evolution. Check out their comprehensive insights on scaling agent power within your enterprise. This isn't just about tech. It's a
Starting point is 00:13:48 leadership imperative. Go to www.kpmg.us slash agents to learn more. That's www.kpmg.us slash agents. This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with infinite code context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale code bases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development requirements.
Starting point is 00:14:19 The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding co-pilot of choice to bring an AI-native SDLC into their org. Blitzy is providing a limited-time, 30-day free proof of concept for qualifying enterprises. The team will provide a 5x velocity increase on a real development project in your org.
Starting point is 00:14:52 Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. That's BLITZY.com. If you are a regular listener, you will have heard about Superintelligent's Agent Readiness Audits at this point. But I wanted to tell you today about the full suite of Agent Readiness products that go beyond just the initial readiness report. Over the last six months, Superintelligent has built out an entire Agent Planning Suite. We help you move from discovery to planning to implementation. After you've completed your Agent Readiness Audit, we help you double-click on your most important use cases with what we call our use case planning reports.
Starting point is 00:15:31 These reports are going to help you understand what sort of technical preparation you need to do to be ready for a use case, what challenges you might face in implementation, and whether you should be thinking about building, buying, partnering, or some combination. After that, you can even get a spec document in what we call our technical blueprint that gives either your developers or the developers of the partner you work with what they need to build exactly the agent that you're looking for. If you want to learn more about Superintelligent's Agent Planning Suite, we built a custom GPT to answer your questions. Just go to bit.ly slash super agent. That's bit.ly slash super agent, all one word. And if you have any questions, the agent can even help you book an appointment with our team. Now it's early, but a lot of the early reviews are about coding. So let's go see what people had to say. Turning back to Every one
Starting point is 00:16:20 more time, they again had a mixed review. They've been a bit of a very few. They basically said that GPT-5 was a great pair programmer. Dan Shipper wrote, don't get me wrong, GPT-5 is a very good programmer. It's incredibly useful as a pair programmer, especially in AI-powered integrated development environments (IDEs) like Cursor. It's great for engineers from traditional backgrounds who want an AI to help collaborate on code, and it excels at research and debugging complex issues. What Dan argues, however, is that, quote, the discipline of programming has fundamentally changed this summer. The benchmarks don't show it, but if you know how to YOLO four agents at once in Claude Code, GPT-5 feels like a step backward. That's
Starting point is 00:16:57 partially because of the model's current personality. It's more cautious than Opus 4.1 and isn't as comfortable working independently for long periods in our testing. But it's also due to the app you use to interact with it. Both Cursor and OpenAI's command line interface tool, Codex CLI, are not on the same level as Claude Code. Both were built for programmer-AI pair programming, not true delegation. I bet this will change. The model is extremely smart, just not yet built for this use case. But for now, OpenAI seems to have missed the paradigm shift in programming caused by Claude Code over the last two months. Now, I wanted to present that one first, because so far at least, they're kind of the only ones I've seen saying that.
Starting point is 00:17:35 Most of the other results are incredibly impressed. Like the team at Cursor, the team at Lovable was incredibly excited about it. YouTuber and AI educator Matthew Berman wrote, GPT-5 is here, and spoiler, it's a coding master. I've had the privilege of testing GPT-5 for about a week now, and I gave it the most difficult tests I could come up with. He gave it a Rubik's cube test that only previously Gemini 2.5 Pro had cracked. He asked it to make an Excel clone and a Microsoft Word clone. He asked for a more complex version of the classic 90s phone game Snake, asked it to solve some physics, gave it the now famous hexagon with balls bouncing inside test, and a number of others, and was basically blown away by the results. Now, one thing Berman
Starting point is 00:18:14 noted was that GPT-5 is really good at front ends, and again, this is something that I've seen over and over again. While there is a general sort of feel and a flavor to a lot of the UI design that you see among AI and agentic coding, in a similar way that you see common patterns across AI-based writing, it seems at least from first glance that GPT-5 often breaks out of those patterns. Pietro Schirano shared a compilation of different experiences that he had made with GPT-5 in one shot, coming away extremely enthusiastic. He said the poem camera app is particularly impressive because the model came up with all the details, like the way the photos stack in the gallery, the photo developing animation, etc. He actually made a bold pronouncement. In another tweet, he said,
Starting point is 00:18:54 I had early access to GPT-5. It will do for coding what GPT-4 did for LLM adoption. It's fast, really smart, has great taste and aesthetic sensibility. This is electricity arriving in every home, a before and after moment in how we build. GPT-5 is the best coding model ever, but what really impressed me is how good of a collaborator it is. It tackles large implementations much better than before. At times, it was able to refactor thousands of lines of code at once, and also debugs large repos way faster and with precision. It's also, he writes, a really, really good agent. I was able to run long agentic flows with no issues. It's also better at explaining why to run a certain tool versus another, and it has the best results on running parallel tools I've ever seen. This is particularly
Starting point is 00:19:34 important for flows where, for instance, you need to create multiple files at the same time, and consistency becomes a problem. GPT-5 has no issue with that. Matt Shumer had a really interesting take. He said, for those who will be testing and using GPT-5, you won't see much improvement if you're asking it to do things Sonnet or o3 can already do. To see how good it truly is,
Starting point is 00:19:55 you have to ask it to do things that other AI simply can't. In his longer review, he expanded on this. His TLDR included: GPT-5 is clearly a big leap from previous models, but you have to push it hard to get the most out of it. The ceiling for what can be vibe-coded is now much higher than it was with previous models. Matt wrote, I was granted access to GPT-5 on July 21st, and honestly, when I started testing it, I wasn't blown away. In fact, I felt quite let down, especially given all the hype and expectations around it. The model felt like GPT-4.2 at best, faster, definitely sharper than
Starting point is 00:20:28 4.1, but not some huge leap. I tried to use it for my day-to-day work, which in my opinion is the best way to evaluate any new model, and while it handled the tasks I was giving it very well, I wasn't noticing anything dramatically better than GPT-4.1, Claude 4 Opus, or any of the other models I've been using. I caught myself thinking, is this really it? I settled into a routine of using GPT-5 for pretty much everything I would use existing LLMs for, and this went on for about a week. Was it better than Claude 4 Opus, my previous daily driver? Yes, undoubtedly, but only marginally. It felt like a small incremental improvement. But then things took an unexpected turn. Josh, my lead engineer at HyperWrite, and I had spent an afternoon discussing a complex new product idea, one we'd estimate would take weeks,
Starting point is 00:21:08 months of dedicated engineering work to even get a proof of concept together. The idea was intricate, involving a sophisticated front end with tightly integrated components, and a complex back-end infrastructure for managing GPUs, auto-scaling resources, and lifecycle management. This wasn't the kind of thing you just vibe code, even with the help of AI. It required deliberate human oversight at every step. Or so we thought. Josh and I had already decided we'd need at least a full month of discovery just to figure out if a buildout was worth attempting. That night, purely out of curiosity, I fed GPT-5 a product spec, fully expecting it to stumble immediately. An hour later, I sent Josh a fully working prototype.
Starting point is 00:21:43 His immediate reply: WTF. That moment completely flipped how I thought about GPT-5. We literally skipped a month of upfront customer discovery and planning. We could just immediately go test with real users. From there, things got interesting fast. I started probing deeper, trying more ambitious tasks that I never even bothered asking previous models. The more I did, the clearer it became that GPT-5 wasn't incremental.
Starting point is 00:22:05 Now, Matt also talked about this whole front-end piece. He said if you've used AI for front-end before, you probably know what I mean when I say it usually feels made by AI. The designs are typically a bit clumsy, predictable, obviously machine-generated. With GPT-5, though, the UIs felt way closer to convincingly human, 80% indistinguishable at a glance. On back-end and infrastructure, GPT-5 was just as good, maybe even more impressive. The deeper I went, the more clearly I saw just how different GPT-5 was.
Starting point is 00:22:32 Now, he did find that there were specific tasks that he still prefers other models for. He said, for example, for explicit search tasks, I still prefer o3. GPT-5 stops digging sooner. For example, I was trying to have GPT-5 find the hometown of a public figure. It only found the city and stopped there. He also said, on emotional or sensitive tasks like crafting difficult emails, I still strongly prefer GPT-4.5. Ultimately, he says GPT-5 is a true leap.
Starting point is 00:22:55 Bottom line, GPT-5 isn't just going to improve vibe coding. It will fundamentally change the kinds of projects I consider doable without serious human intervention and steering. This past week, it turned what I confidently thought was a multi-month engineering challenge into a casual one-hour sprint. This is serious, real autonomous software engineering. He summed it up on Twitter as well, saying, you can now vibe code real software, not just simple SaaS apps, but real technical software.
Starting point is 00:23:21 And by the way, the vibe coders seem to agree. Felix from Lovable wrote, GPT-4 was "build me a to-do app." GPT-5 is "build me a SaaS with user auth, payments, admin dashboard, and email automation." We're not improving code generation. We're eliminating the need to code. Now, one of the most interesting reviews came from Ben Hylak and the team at Latent Space. He called it GPT-5 Hands-On: Welcome to the Stone Age. Ben writes, TLDR, I think GPT-5 is the
Starting point is 00:23:48 closest to AGI we've ever been. It's truly exceptional at software engineering, from one-shotting complex apps to solving really gnarly issues around a massive code base. Now, I wish the story was that simple. I wish I could tell you that it's just better at everything and anything, but that wouldn't be true. It's actually worse at writing than GPT-4.5 and I think even 4o. In most ways, it won't immediately strike you as some sort of super genius. Because of those flaws, not despite them, it has fundamentally changed how I see the march towards AGI. And then from there, Ben goes on to explain his new theory. The Stone Age, he writes, marked the dawn of human intelligence. But what exactly made it so significant? Did humans win a critical chess battle? Perhaps we proved
Starting point is 00:24:25 a fundamental theorem? Recited more digits of pi? No, the beginning of the Stone Age is clearly demarcated by one thing and one thing only. Humans learned how to use tools. We shaped tools and our tools shaped us. And they really did shape us. As humans, we manifest our intelligence through tools. Tools extend our capabilities. We trade internal capabilities for external capabilities. It's the defining characteristic of our intelligence. GPT-5 marks the beginning of the Stone Age for agents and LLMs. GPT-5 doesn't just use tools. It thinks with them. It builds with them. Now, interestingly, he makes the comparison to other things that we've seen from OpenAI. He basically suggested that what made deep research better was that OpenAI had taught o3 how to
Starting point is 00:25:07 conduct research on the internet, that it was about tools. Like some of the others you've heard from, Ben says that GPT-5 is really good at using tools in parallel. Other models, he writes, were technically capable of parallel tool calling, but (a) rarely did it in practice and (b) rarely did it correctly. He talked about a few areas of coding challenges where other models had had huge problems, and where GPT-5 just got it right. He wrote,
Starting point is 00:25:30 We were dealing with gnarly nested dependency conflicts adding Vercel's AI SDK v5 and Zod 4 to our codebase. o3 and Cursor couldn't figure it out. Claude Code and Opus 4 couldn't figure it out. GPT-5 one-shotted it. It was honestly beautiful to watch and instantly made the model click for me.
Starting point is 00:25:45 Claude Opus thought for a while, came up with a guess, and then ran some tool calls to edit files and rerun installation. Some failed, some succeeded. It ended the response with, here are some things to try, aka giving up.
Starting point is 00:25:56 With GPT-5, I felt like I was watching deep research, but using the yarn why command. It went into a bunch of folders, ran yarn why, taking notes in between. When it found something that didn't quite add up, it stopped and thought about it. When it was done thinking, it perfectly edited the necessary lines across multiple folders. It was able to iterate its way to success by identifying and reasoning about what doesn't work, making changes, and testing. swyx added, I also had a related experience during the GPT-5 demo video shoot with OpenAI, where GPT-5 was successfully able to debug three layers of nested abstractions to turn an old codebase
Starting point is 00:26:27 using an old AI SDK version into supporting GPT-5. An AI modifying a codebase to support more inferences of itself was definitely a feel-the-AGI moment for me. Now, Ben also writes, GPT-5 one-shots things like no model I've ever seen. I needed to create a complex ClickHouse query to export some data, and similarly, while o3 struggled, GPT-5 just one-shotted it. I used GPT-5 and Cursor to make a website I've wanted for a while,
Starting point is 00:26:50 and with the same prompt o3 and Cursor just gave me a plan. Once I followed up to tell it to implement its plan, it created the app scaffolding but not the actual project. We're already on follow-up number three. I've spent 10x more time than with GPT-5, and there's no app. Now, what about when it comes to Claude Opus, as that's the real comparison point for coders? Ben writes,
Starting point is 00:27:09 Claude Opus 4 is as good as ever at coding and got to work immediately, quickly taking action to create the project and scaffolding. Opus 4 gave me a more fun and gamified UI, but unlike GPT-5, which used existing frameworks like Create Next App and included a SQLite database, Opus 4 decided to do everything from scratch and didn't include a database.
Starting point is 00:27:27 This makes for a good one-shot prototype, but what GPT-5 one-shotted was much closer to production-ready. Freshly released Claude Opus 4.1 was clearly a step more ambitious than Opus 4, also attempting the full-stack app complete with a SQLite database just like GPT-5. However, it really struggled putting all the pieces together. While GPT-5 ran perfectly in one shot, 4.1 encountered build errors, which took multiple back-and-forths to resolve. Ben concludes, I think GPT-5 is unequivocally the best coding model in the world. We were probably around 65% of the way through automating software engineering, and now we might be around 72%. To me, it's the biggest leap since 3.5 Sonnet.
Starting point is 00:28:05 So what does this all add up to? Like I said, I think it's immensely clear from this presentation that OpenAI believes that right now, the most important use case for LLMs is coding. I don't think that they believe that that's the only use case. I think that they obviously see AI impacting basically every domain of knowledge work, along with so many other aspects of our lives that have nothing to do with work. But when it comes to where the important use of this technology is right now, it is so clearly about coding. Now, maybe there is an overemphasis here on that because they are trying to catch up in this very key area that they have historically been slightly behind in, but I don't think it's just that. What you just heard over and over again from all of those early
Starting point is 00:28:45 interactors that I just shared is that this allows for people to build even bigger, more ambitious things. And it allows them to do it in one shot. I think the implications of that are not just that coders got a new, better tool. It's that once again, the parabola of who gets to create with code has expanded. Remember, part of the goal of GPT-5 is to be a single default model that can actually select which sub-model is going to work best for a particular task. Instead of having to select between 4o or o3 based on what you're trying to do, GPT-5 just figures that out for you. For the vast majority of ChatGPT users, this will undoubtedly be the most powerful, performant model they've ever used. It won't even be close. It seems pretty clear to me that OpenAI
Starting point is 00:29:37 thinks that this trend towards vibe coding that we've seen emerge so immensely over the course of 2025 is not only not going anywhere, but is only likely to expand. I think that they believe that a lot of people's next big experience with ChatGPT, enabled by GPT-5, is going to be some version of vibe coding. Look, there was like more than half an hour of code-specific content in this presentation, all of which happened before minute 72, which was the first time that the enterprise was explicitly mentioned. They casually shared that five million businesses are now using ChatGPT, but it was such an afterthought. To me, the tea leaves read like they think that this is the big competition, this whole coding space, both for existing developers and for the next generation of proto-vibe
Starting point is 00:30:23 developers that are going to come online. And one additional note is that they are not just competing on performance, they are competing on price. Both the input and output costs for all of these GPT-5 models match Gemini 2.5, which had also been competing on price, and just absolutely blow Anthropic out of the water. I lost the tweet somewhere along the way, but someone basically said, even if you think Claude Opus 4.1 is better, if it's 10 times as much, are you really going to justify that difference? That's a question we'll have to wait and see on, but I think it's probably pertinent in this competition. The next step from here, of course, is to get in there and start testing. The three focuses I'm going to have are strategy,
Starting point is 00:31:03 what I use o3 for day in and day out, writing, to see if it actually opens up any use cases that have been off the table for me so far, and vibe coding. I really want to know how it compares, and I'm going to listen to folks like Matt Shumer and try to err more on the ambitious side and see what can come out. I will, of course, report back on those experiments, as well as sharing what are inevitably going to be just a ton of takes over the next few days. For now, though, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching as always, and go have fun with GPT-5. Until next time, peace.
