The AI Daily Brief: Artificial Intelligence News and Analysis - GPT 5.4 First Test Results

Episode Date: March 6, 2026

GPT 5.4 just dropped and the early consensus is clear: this is the most substantial OpenAI release in recent memory, with massive jumps in computer use, professional work tasks, and coding efficiency. NLW goes hands-on building a real project with 5.4 and Codex to see where the hype holds up and where it breaks down.

Brought to you by:
KPMG - Agentic AI is powering a potential $3 trillion productivity shift, and KPMG's new paper, Agentic AI Untangled, gives leaders a clear framework to decide whether to build, buy, or borrow. Download it at www.kpmg.us/Navigate
Mercury - Modern banking for business and now personal accounts. Learn more at https://mercury.com/personal-banking
AIUC-1 - Get your agents certified to communicate trust to enterprise buyers - https://www.aiuc-1.com/
Rackspace Technology - Build, test and scale intelligent workloads faster with Rackspace AI Launchpad - http://rackspace.com/ailaunchpad
Blitzy - Want to accelerate enterprise software development velocity by 5x? https://blitzy.com/
Optimizely Agents in Action - Join the virtual event (with me!) free March 4 - https://www.optimizely.com/insights/agents-in-action/
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
LandfallIP - AI to Navigate the Patent Process - https://landfallip.com/
Robots & Pencils - Cloud-native AI solutions that power results - https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Our Newsletter is BACK: https://aidailybrief.beehiiv.com/
Interested in sponsoring the show? sponsors@aidailybrief.ai

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, GPT 5.4 is here, and these are both the first impressions from the broader world as well as my first test results. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, AIUC, Blitzy, and PromptQL. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, send us a note at sponsors at aidailybrief.ai. Lastly, as often happens when we have a big new model release, no headlines today. We are just going to spend all of our time on this exciting new model. So without any further ado,
Starting point is 00:00:46 let's dive in. After a couple of weeks where mostly we've been talking about big macro issues like the Pentagon and Anthropic and all of that sort of thing, we finally have the cool, fresh breeze of a new exciting model to test. And this one, indeed, is pretty exciting. Ethan Mollick tweeted, I think we've been through enough release cycles for models at this point to say that the latest model from OpenAI or Anthropic or Google
Starting point is 00:01:10 is generally going to be the best model in the world upon release, with some jagged edges until the next release by one of the big three. Now, with that background, a different way to look at where we've been is that it's simply been OpenAI's turn. However, the expectations coming into GPT 5.4 were a little bit higher than they might have been for some of OpenAI's more recent releases. Ever since the release of GPT-5, all the big model providers got the memo that trying to promise
Starting point is 00:01:35 too much in each update, rather than just being very incremental, was a pretty scary proposition. That's what got us the 5.1, 5.2, 5.3, now 5.4 kind of paradigm. But of course, it's not just OpenAI doing that. Google and Anthropic are both on that same plan as well. And yet, even with that, 5.4 has had a little bit more hype and anticipation around it than some of the previous iterative models that we've gotten more recently. This was theoretically supposed to be the big outcome of OpenAI's Code Red, which was launched back in December. And what's more, the buzz for the last week or week and a half or so has been that this one was really meaningful.
Starting point is 00:02:09 Enough so that it almost felt to me like some of the more recent leaks to publications like The Information were almost trying to tamp down on expectations. To take one example, rumors had been flying that there was a 2 million token context window, whereas The Information's reporting from the last couple of days suggested it was just 1 million. It seemed to me to be a little bit of expectation setting through leaks. In any case, on Thursday afternoon, we actually got the model. And the initial buzz was strong. Ben Hylak wrote,
Starting point is 00:02:37 I've been using GPT 5.4 for the past few weeks. In a sea of endless model drops and benchmark maxing, this model is the first in a long time to be worth your time to try. Honestly, didn't expect OpenAI to pull this off. So let's talk first about how OpenAI frames things, look at some of the early reactions in the community, and then we'll walk through a more comprehensive case study with a project that I recently did
Starting point is 00:02:58 to put the new capabilities through the wringer. Now, one of the interesting things about this week is that this was not the first new OpenAI model we got. Just a couple of days ago, we got GPT-5.3 Instant, although OpenAI started promising almost immediately that 5.4 was coming. 5.3 Instant was, as we've talked about, a speed and personality play.
Starting point is 00:03:16 The announcement tweet called it more accurate, less cringe. This actually was part of the inspiration for our episode about what's going to actually matter in consumer AI, as this was so clearly aimed at that default sort of experience that the average ChatGPT user is going to have because they're not optimizing the model selector for what they think they want.
Starting point is 00:03:34 GPT 5.4, although still having a lot of offshoots, does feel like they're trying to bring together their models under a coherent banner. They write, GPT 5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates industry-leading coding capabilities of GPT-5.3 Codex,
Starting point is 00:03:53 while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. The result is a model that gets complex real work done accurately, effectively, and efficiently, delivering what you ask for with less back and forth. In other words, if 5.3 Instant was for the personal use cases, GPT 5.4 was, as the subheader says, designed for professional work. As The Information had it in that leak, we do indeed get a 1 million token context window, which, among other things, improves 5.4's ability on
Starting point is 00:04:23 tasks that require longer thinking. The quotes from early testers that they included were raving, even more than the ones that they usually include. Brendan Foody, the CEO at Mercor, writes, GPT 5.4 is the best model we've ever tried. It's now top of the leaderboard on our Apex Agents benchmark, which measures model performance for professional services work. It excels at creating long horizon deliverables such as slide decks, financial models and legal analysis, delivering top performance while running faster and at a lower cost than competitive frontier models. Indeed, these professional tasks are some of the big focus for OpenAI's announcement blog. They compare GPT 5.4's outputs to GPT 5.2's on spreadsheets, documents, and presentations,
Starting point is 00:05:01 all with significant upgrades. And if overall knowledge work is what OpenAI chose to focus on, there was also a big emphasis on computer use capabilities. We are, of course, living in an OpenClaw world, and so it makes sense that this is becoming increasingly important. Computer use is actually listed above coding, perhaps because the jump from 5.3 Codex to 5.4 isn't all that big. Indeed, they basically said that 5.4 is the integration of 5.3 Codex with other improved aspects of the model, meaning it seems to me that their latest round of big coding innovations was embedded in 5.3 Codex itself.
Starting point is 00:05:34 There's a sub-theme running throughout the announcement about efficiency. In the first part of the announcement post, they write that GPT 5.4 is our most token-efficient reasoning model, using significantly fewer tokens to solve problems when compared to GPT 5.2, translating to reduced token usage and faster speeds. In the coding section, they also talk about fast mode in Codex, which they say delivers up to 1.5x faster token velocity. They write, it's the same model and same intelligence, just faster. That means users can move through coding tasks, iteration, and debugging while staying in flow.
Starting point is 00:06:05 The new efficiency gains also show up in tool search. They write, previously, when a model was given tools, all tool definitions were included in the prompt up front. For systems with many tools, this could add thousands or even tens of thousands of tokens to every request, increasing cost, slowing responses, and crowding the context with information the model might never use. With tool search, GPT 5.4 instead receives a lightweight list of available tools, along with a tool search capability. When the model needs to use a tool, it can look up that tool's definition and append it to the conversation at that moment. This approach, they say,
Starting point is 00:06:36 dramatically reduces the number of tokens required. They evaluated 250 tasks from Scale's MCP Atlas and found that this new configuration had the same accuracy but reduced total token usage by 47%. That, combined with improved accuracy on tool calling, obviously makes it really appealing for agentic use cases. There are a few other parts of the announcement, including improved web search and improved steerability, but those are the big hits. And as people dug in, there were a few things that stood out. First of all, that efficiency is showing up in people's tests. Greg Kamradt, the president of ARC Prize, said that on ARC-AGI-2, they were seeing a consistent 20 percentage point lift versus 5.2 at the same price. Still, the three benchmarks that people were most discussing
Starting point is 00:07:16 were around coding, computer use, and GDPval. On coding, people flagged that GPT 5.4 was only nominally better than 5.3 Codex on benchmarks like SWE-Bench Pro, but most people understood that that wasn't the core value proposition of this updated model. The computer use improvement got more of a discussion. This is maybe the most concerned I've ever seen people with computer use, and I think that's the difference between the pre- and post-OpenClaw world. Now that people have got all these Mac minis running around with their OpenClaw agents having
Starting point is 00:07:44 pretty much unfettered access to them, how good the models are at using the computer becomes much more relevant in a day-to-day way, not just a theoretical way. Rahul Agrawal writes, GPT 5.4 is here and it can use a computer better than a human? OpenAI shipped GPT 5.4 on March 5th. The headline isn't the reasoning improvements. It's that this is their first general-purpose model with native state-of-the-art computer use. It can operate websites and software autonomously, issue keyboard and mouse commands, write and execute code, and navigate full desktop environments. On OSWorld-Verified, it hits 75%, which is above human-level performance at 72.4%, and a massive jump from GPT 5.2's 47.3%. That's not incremental, that's a step change.
Starting point is 00:08:28 When agents can reliably navigate desktops, the bottleneck on automation shifts from, can the model do it, to do you trust it enough to let it? That's the question nobody has a good answer to yet. Jamie Cuffe from Pace wrote an X article about this specifically. He called it, we stress-tested GPT 5.4 on the hardest UI on the internet, and writes, people don't realize how good AI computer use has actually gotten until they see it tackle the hardest UIs in existence, legacy insurance portals. At Pace, we build AI agents for insurance workflows like submission intake and first notice of loss. Because we operate in this space, we use legacy enterprise software as our ultimate benchmark.
Starting point is 00:09:06 If AI can reliably navigate a 20-year-old hyperdense insurance portal without hallucinating a click, it can navigate anything. For a long time, the technology just wasn't there. But after spending the last few months working closely with OpenAI to stress test their new GPT 5.4 model inside these environments, it became clear the paradigm has shifted. A couple of things they point to as much better. The first is click accuracy. Historically, Jamie writes, the biggest failure point was simply clicking the right thing.
Starting point is 00:09:33 Enterprise software is incredibly dense. Layouts are cluttered, buttons are tiny, and the systems were designed decades ago. Earlier models would frequently miss targets. GPT 5.4 is vastly better at grounding itself visually, clicking precisely where it needs to, even on a crowded screen. They also point out improvements in long-trajectory reasoning, speed and time to iteration, as well as memory. Now, there was also a lot of chatter around GDPval, basically around the measure of the model's performance on professional work tasks. GDPval, remember, tests against knowledge work spanning 44 occupations from the top nine
Starting point is 00:10:04 industries that contribute to US GDP. The win rate versus industry professionals was 49.8% for GPT 5.2, 60% for 5.2 Pro, and between 69.2% and 70.8% for the GPT 5.4 family. That's just wins. When it comes to wins or ties, that number rises to 82 to 83%. Ethan Mollick pointed out what this means in terms of time savings. He wrote, given the GDPval benchmark for GPT 5.4, the new model ties or beats humans as judged by other experts at professional tasks 82% of the time. If you give a seven-hour task to AI, even with failure rates and the need to check results, you'd save 4 hours and 38 minutes on average. And in addition to the general performance increases across the GDPval set, it's very clear that OpenAI is aggressively going after certain industries even more.
Starting point is 00:10:51 COO Brad Lightcap, for example, tweeted, the team worked extremely hard to make GPT 5.4 great for finance. It's much improved for financial modeling and analysis, integrates directly into Excel, and connects to Factivia, DeLupa, S&P Global, and many more. It does feel like a Codex moment is coming here. But what about the overall impressions outside of the individual benchmarks? Latent Space wrote, We've learned to take for granted that OpenAI is the smartest kid in the room, always reporting state-of-the-art evals, but this set of updates feels much more substantial and confident
Starting point is 00:11:22 than any OpenAI launch in recent history. The Every vibe check summed it up, OpenAI is back. Three months ago, writes the team at Every, OpenAI was losing the agentic coding race. Claude had captured developers' hearts, and Opus 4.5 was shipping at a level other models couldn't touch. Meanwhile, OpenAI's coding agent, Codex, felt like it was built for an older era of coding with AI.
Starting point is 00:11:43 It was precise but rigid, powerful but personalityless, and not good with tools or able to run for long periods of time autonomously. OpenAI's latest model release, GPT 5.4, along with their other recent releases, GPT 5.3 Codex, GPT 5.3 Codex Spark, and the Codex desktop app, shifts the competitive balance back towards OpenAI on the coding front. The new model produces plans that are thorough and technically precise and have a user focus and human feel that has been missing from OpenAI's previous coding models. In our testing, GPT 5.4 reviews code with more depth than GPT
Starting point is 00:12:13 5.3 Codex and has a noticeably more conversational voice. With a few tweaks, it became our preferred model to use in our OpenClaws, especially given that it is half the price of Opus 4.6. Even Kieran Klaassen, our diehard Claude Code devotee, is now reaching for GPT 5.4 daily since we started testing it a week ago. As ever, they say, there are tradeoffs. GPT 5.4 has a tendency to expand the task well beyond what you ask for, and to call tasks done before they're finished. It sometimes completed tasks in obviously wrong ways, then lied about it. The bigger story here is OpenAI's trajectory. From the Codex desktop app to GPT 5.3 Codex and to GPT 5.4, the company is iterating fast, and many members of the team now use its tools and models daily for coding, a significant change
Starting point is 00:12:58 from a few months ago. A couple of things that they said they liked about it. It did proactive research without being asked, it had a more human voice than previous Codexes, and it was roughly twice as fast as Opus. What they didn't like included scope creep on multi-step conversations, misreading completion reports in OpenClaw, and over-engineering. One of their team members called it too eager, adding things that aren't wrong but aren't necessary. Ultimately, they summed it up with their title, OpenAI is back. Other people who reported their own experience also had positive things to say. Dr. Derya Unutmaz wrote, I asked GPT 5.4 Pro to write a 10,000-word literary article on my beloved T-cells. As I read through it,
Starting point is 00:13:38 I am both mesmerized and deeply emotional. It's just so beautifully written. When I was tweeting about what I found to be 5.4's over-verbosity, Klick Health's Simon Smith wrote, Counterpoint, it's the best writing model from OpenAI I've seen, and probably better than the best Claude models now at writing, and only needs a bit of nudging to write extraordinarily well. I've done a ton of writing tests with 5.4 now, and it's capable of writing in every imaginable way with great empathy, creativity, wit, and concision. And it has personality again. How I AI's Claire Vo wrote, What I love: speaks more like a human than 5.2 and 5.3, a million token context window, tool use is chef's kiss, first model where I've experienced the true go-investigate
Starting point is 00:14:16 and fix experience that feels robust. What needs work: basic front-end and UX taste, still loves a bullet point, latency, and stability. The next day she updated it and said she's, quote, starting to have loving feelings towards the model. Basically, it did some really difficult tech and data work really well without a lot of support. Matt Shumer, who you might remember from Something Big Is Happening, calls GPT 5.4, in short, the best model in the world by far. On coding, he writes, coding capabilities are ridiculous.
Starting point is 00:14:45 It's essentially flawless. Inside Codex, it's insanely reliable. Coding is essentially solved. There's not much more to say on this. It's just that good. Now, he did find weaknesses, including front-end taste, which is something that I'm going to get into in just a moment as well. Mark Tenenholtz from Perplexity also pointed out
Starting point is 00:15:01 that while the model itself is great, the updates to the actual Codex CLI experience are really good as well. He called them the real hero. So much less friction than the previous approval system, he says. Agentic AI is powering a $3 trillion productivity revolution, and leaders are hitting a real decision point. Do you build your own AI agents, buy off the shelf, or borrow by partnering to scale faster?
Starting point is 00:15:28 KPMG's latest thought leadership paper, Agentic AI Untangled, Navigating the Build, Buy or Borrow Decision, does a great job cutting through the noise with a practical framework to help you choose based on value, risk, and readiness, and how to scale agents with the right trust, governance, and orchestration foundation. Don't lock in the wrong model. You can download the paper right now at www.kpmg.us slash navigate. Again, that's www.kpmg.us slash navigate. There's a new standard that I think is going to matter a lot for the Enterprise AI agent space. It's called AIUC-1, and it bills itself as the world's first AI agent standard.
Starting point is 00:16:04 It's designed to cover all the core enterprise risks, things like data and privacy, security, safety, reliability, accountability, and societal impact, all verified by a trusted third party. One of the reasons it's on my radar is that ElevenLabs, who you've heard me talk about before and is just an absolute juggernaut right now, just became the first voice agent to be certified against AIUC-1 and is launching a first-of-its-kind insurable AI agent. What that means in practice is real-time guardrails that block unsafe responses and protect against manipulation, plus a full safety stack. This is the kind of thing that unlocks enterprise adoption. When a company building on ElevenLabs can point to a third-party certification and say our agents are secure, safe and verified, that changes the conversation.
Starting point is 00:16:44 Go to AIUC-1.com to learn about the world's first standard for AI agents. That's AIUC-1.com. Blitzy is driving over 5x engineering velocity for large-scale enterprises. A publicly traded insurance provider leveraged Blitzy to build a bespoke payments processing application, an estimated 13-month project, and with Blitzy, the application was completed and live in production in six weeks. A publicly traded vertical SaaS provider used Blitzy to extract services from a 500,000 line monolith, without disrupting production, 21 times faster than their pre-Blitzy estimates. These aren't experiments. This is how the world's most innovative enterprises
Starting point is 00:17:18 are shipping software in '26. You can hear directly about Blitzy from other Fortune 500 CTOs on the Modern CTO or CIO Classified podcasts. To learn more about how Blitzy can impact your SDLC, book a meeting with an AI Solutions consultant at blitzy.com. That's BLITZY.com. If you're an operator, your day is a non-stop stream of decisions. And most of them require you to look at the data. You don't need another dashboard.
Starting point is 00:17:45 You need answers you can trust, fast. But the bottleneck is always the same. The data isn't ready. It's scattered. It's messy. Definitions aren't clear. You're waiting on your data team or waiting on domain experts for clarification and confirmation. That's the bottleneck today's
Starting point is 00:17:58 sponsor, PromptQL, is built to break. PromptQL is a trusted AI analyst for high-frequency decision-making. It connects across warehouses, databases, SaaS, and internal APIs. No massive data prep or centralization required. It's built for multiplayer input. Teammates can jump into a thread, correct assumptions, add nuance, flag edge cases. PromptQL turns everyday conversations into a shared context. And if something is ambiguous, it doesn't guess. It escalates to the right expert, captures the correct logic and gets it right next time. That's how it delivers trust and accuracy. Over time, PromptQL specializes to your business,
Starting point is 00:18:31 like that veteran employee who just knows things. From simple what-is questions to complex what-if scenarios, you can model impact and stress test decisions before you commit, all through a simple natural language prompt. PromptQL, the trusted AI analyst for teams with shared context and messy data. So as we round the corner, let's actually shift into the test that I did. I wanted to do something that was actually kind of difficult, and that was also real. I've been building a bunch of agents recently with Claude Code, and the specific thing that I was
Starting point is 00:19:02 interested in building with Codex and 5.4 was some type of experience to help people show off their agent building and agent orchestration skills. Right now, you know that we have a couple different self-directed experiences for people to learn how to use tools like OpenClaw. And one of the things that I've been recognizing around that is that this new skill set of agent building and agent orchestration is going to be more and more in demand from all sorts of different types of organizations. However, it is really a new type of skill set. It's not easy to describe necessarily. It's not easy to show necessarily. And the people who are going to be hiring or
Starting point is 00:19:37 contracting for it aren't necessarily going to be really easily able to describe exactly even what they're looking for. They're just going to want people who can help them agentify things. And so I wanted to build some type of experience that could help bridge that gap, that could help builders show off what they built in ways that were understandable and accessible to other people outside of them, and that could help people who were potentially interested in working with builders actually engage in a way where they weren't just totally blind in terms of finding the right people to help them with whatever agentification needs they had. So this is not an easy feat, and it also seemed to be a great way to test both 5.4 as a model
Starting point is 00:20:13 inside ChatGPT as well as the updated Codex. Now before we get into how it did, I will say that I also tested 5.3 Instant as a partner in helping me get Codex set up on a new machine. Overall, I think 5.3 Instant is a big upgrade. It's less annoying than 5.2, which I could get into more, but honestly, 5.2 was just really annoying. It wouldn't answer things,
Starting point is 00:20:33 and then it would be cloying in its responses. It's just a very unpleasant model to interact with. 5.3 is way better. It's much faster and much easier to use for these not-so intellectually heavyweight types of tasks. Two things that I have noticed with 5.3 Instant, however, one is that it puts too much into each response. When I said, I need help setting up Codex and figuring out how to get the most out of it,
Starting point is 00:20:54 it wrote this like 18-page tome with all sorts of steps, basically giving me a full guide to Codex instead of working our way step-by-step. In fact, one of the things that I would recommend if you are using it in this way is to ask it to go step by step. This over-verbosity, as you will see, is one thing that I think is consistent with this new set of GPT models. I also noticed 5.3 Instant doing this very weird clickbaity thing, where when it has something to add at the end of a conversation,
Starting point is 00:21:19 instead of just saying it, it gives you this weird, almost clickbait-type final sentence. For example, if you want, I can also show you the five Codex workflows that technical founders are using to build products insanely fast right now. Later on, it says, once that works, I'll show you something important, the 30-second Codex workflow that makes it feel like you suddenly have a junior engineer working for you. It's like they studied at the school of clickbait LinkedIn and X article headlines and brought that into ChatGPT in a way that is just really, really weird.
Starting point is 00:21:47 Still, overall, those complaints aside, 5.3 Instant is a big upgrade in a lot of ways, especially for more simple types of needs. Now, moving over into the planning stage, I used 5.4 Thinking, admittedly just on the standard setting, and perhaps the results would have been very different if I had used more extended thinking or even Pro. Even though this isn't the instant model, it is extremely fast compared to previous thinking iterations. In fact, it's so fast that it actually took some harnessing to slow it down and have
Starting point is 00:22:15 it be actually useful. I described the problem that we were trying to solve in the same way that I described it to you, and it immediately dove in trying to spec the thing out before we had even discussed it. The problem was that this wasn't just over-eager, it was that it fell into some default patterns based on its training data, rather than really considering what I was trying to do differently. Specifically, in this case, it jumped right into assuming that if we were talking about matchmaking between a company and an agent orchestrator, we really needed to focus on technical skills. even though, of course, the whole point is that we're dealing with a world where technical skills are being brought by the tools, and so it's a different type of skill set that we're actually trying to understand.
Starting point is 00:22:53 In addition to that, almost immediately I started to notice another issue, which is that 5.4 Thinking was extremely over-verbose, to the point of repeating itself. Even when it's not required, like when it's agreeing with you about something that you already said, it uses a million lists, bullets, lettered lists, numbered lists, all in the same response. And maybe this sounds like I'm complaining about it doing too much, which is maybe a better problem to have, but the way that it operates honestly puts a huge cognitive burden on the prompter.
Starting point is 00:23:23 Now, like I said, to be fair, I could have changed it from thinking standard to thinking extended to see if that slowed it down a little bit and improved things, but this was a set of issues that I was dealing with. Even as I asked it to focus on our back-and-forth conversation, it was still trying to update its build spec at the same time. When it came to those biases that it was bringing into this conversation, that speed made it harder to undo. I didn't get it to slow down until I said, kind of annoyed, did I at any point ask you to immediately start building a spec? Finally, we got it to engage in the way that I wanted,
Starting point is 00:23:53 and at that point I started running into a new set of challenges. One, it was still going incredibly long, and I found myself skipping over huge amounts of words that it was saying because there was just so much to read. The next thing that I noticed was that it was really eager to stay in planning mode. Whereas way earlier in the conversation Claude would have started building artifacts, GPT 5.4 just wanted to go deeper and deeper and deeper in planning in ways that I think were wildly over-optimizing relative to just starting to build the thing. After one single-line response where I told it what direction I wanted to go, it agreed and then wrote hundreds of lines with multiple bulleted lists, just confirming what we had already
Starting point is 00:24:29 talked about. Now, I will say that once we honed in on the core idea, it was pretty smart and was able to stay focused on the things that made what we were trying to build different, but I was still having real problems trying to get it to go into do-the-actual-work mode. Even when I tried to shift it into visualization mode, it kept trying to plan the visualization rather than just visualize it. Again, I finally had to say, why aren't you just designing it? Claude would have shown me five versions by now. It responded, fair hit, I stayed too long in abstraction when I should have started showing. And then you would think that at that point it would start doing the showing, but instead
Starting point is 00:25:01 it said, here are five concrete interface directions for builder rep all built around blah, blah, blah, blah, blah, blah, blah. I had to literally stop it and say, no, I'm not saying describe it. I'm saying go build the clickable prototype. Which it finally did, but that then had its own problem. It was just awful visually. And this is something that lots of other folks pointed out. Ben Davis, who works with Theo on his YouTube channel about AI, said, it is hilariously bad at UI stuff. Matt Shumer said front-end taste is far behind Opus 4.6 and Gemini 3.1 Pro. Why is this so hard to fix? I'm not trying to rip on it too much here, but it is just honestly staggering how bad and tasteless the UI design is in my little experience. I had to bring it back over to Claude and do the front-end design there.
Starting point is 00:25:46 Claude pulled no punches in critiquing what we had gotten from 5.4. The card backgrounds are muddy gradient blobs. The colors are dull and washed out. The typography has no hierarchy. The tags look cheap. The cards have no breathing room. The whole thing looks like a dark mode template from 2023. Brutal, but all very true.
Starting point is 00:26:03 Still, after a while, we used Claude to get the design under control, and we were off to the races inside the Codex CLI. And this is where things start to turn around, and where you start to see why many folks are so excited about this new model. There were some not-perfect things about the experience with Codex. It had a couple weird foibles, like it pushed me to use GPT 4.1, but I found similar weirdness like that with Claude Code as well. I confirmed that it wasn't just 5.4 in ChatGPT, but also in Codex that was not good at design.
Starting point is 00:26:31 But there were some parts of the experience that were great, including compared to Claude Code. The thing that Mark Tenenholtz was talking about, with much less friction in the approval system, was absolutely true. There are so many fewer confirmations with Codex right now than with Claude Code, in ways that make the experience just massively, massively better. What's more, Codex did a much better job of letting you know what was going on as it was building.
Starting point is 00:26:53 It actually has not just reasoning traces, but almost interstitial updates around what it's doing when it's doing a long-running task. This means that when it's doing something that takes five or six or eight or ten minutes, it's not just a total black box, you actually have a sense of where it is in its process as it's happening. Now, it's not that Claude Code does none of that, but the way that Codex uses full sentences to describe where it is, I think, is really appealing. Still, the big thing was when we got to the point where we were actually ready to deploy this thing, there were zero errors. It just worked. It worked immediately. It worked right out of
Starting point is 00:27:25 the box, in a way that basically nothing I've ever built with Claude Code has. Where that leaves me is in a place that's pretty similar to what Every found, where I'm not going to abandon one of these models for the other, but where I can absolutely see and probably will find 5.4 and Codex deeply integrated into my process going forward. I also think it's important to point out, now that I've ripped on a bunch of things that I didn't like, that this was my first set of tests, and I have not yet taken the time to go try to change, for example, the core instructions that make GPT 5.4 overly verbose. It's entirely possible that a lot of my critiques are actually relatively superficial and fixable and just about dealing with the out-of-the-box settings.
Starting point is 00:28:05 Overall, based on the average of what people are experiencing, you would be doing yourself a disservice if you didn't go try GPT 5.4. I will certainly get more reps in with it, probably in some other contexts like OpenClaw as well, and I will report an update when I do. For now, though, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always. Have fun playing with 5.4, and until next time, peace.
