The AI Daily Brief: Artificial Intelligence News and Analysis - GPT-5-Codex and the Year of Agentic Coding

Episode Date: September 17, 2025

Today on the AI Daily Brief, OpenAI launches GPT 5 Codex, a model designed for real-world software engineering with dynamic reasoning, long-task persistence, and powerful code review capabilities. We break down why this release cements 2025 as the year of agentic coding and what it signals for the future of autonomous dev agents. In the headlines: Google Gemini tops the App Store thanks to Nano Banana, ElevenLabs rolls out a human-in-the-loop service, and Fiverr announces major layoffs amid AI disruption.

Brought to you by:
Is your enterprise ready for the future of agentic AI? Visit AGNTCY.org. Visit Outshift Internet of Agents.
KPMG - Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
Vanta - Simplify compliance - https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? nlw@aidailybrief.ai

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, OpenAI introduces GPT5 Codex and we discuss why 2025 is so clearly the year of agentic coding. And before then in the headlines, thanks to Nano Banana, Gemini has topped the App Store. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Blitzy, Robots & Pencils, KPMG, and Superintelligent.
Starting point is 00:00:31 To get an ad-free version of the show, starting at just $3 a month, head on over to patreon.com/aidailybrief. I'll also be setting up subscriptions on Apple and Spotify soon. And if you're interested in sponsoring the show, we are more or less sold out for 2025. There are a couple little slots here and there, but we are getting deep into 2026 now. The audience is large, growing, awesome, and very powerful in their own organizations. So if you want to get access to them, shoot us a note at sponsors@aidailybrief.ai. With that, let's dive in. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five
Starting point is 00:01:04 minutes. We kick off today with a big update in the AI competition, which tells us something, I think, about consumer usage. TL;DR is that Gemini has hit number one in the App Store, overtaking ChatGPT. ChatGPT has basically had a stranglehold on the number one spot since its release in 2023, with just a very few times where its dominance was broken. So this is a huge milestone for Google's flagship AI app. For many analysts, this is the validation they've been looking for that Google is starting to take the lead on AI. Justin Patterson of KeyBanc Capital wrote, If Gemini can remain on the top of the App Store charts, we believe more investors will start to view Gemini as a strong core offering with incremental use cases that complement as opposed to cannibalize
Starting point is 00:01:43 the core search experience. Now, of course, for those of you in the know, which is all of you AI Daily Brief listeners, the proximate cause of Gemini's surge is growing interest in the Nano Banana image model. Josh Woodward, the VP of the Gemini app, said that they added 13 million first-time users in just four days last week. Now, as has happened with other AI image models in the past, we start to see these trends around a specific type of generation. The one that's happening right now is turning base images into 3D figures. Here's one of an image of Cristiano Ronaldo that turns into a 3D figure sitting on a desk and in a packaging box.
Starting point is 00:02:19 Woodward noted that the trend had caught on in India last week to drive a wave of new downloads. Now, of course, we saw this happen before with the massive uptick in ChatGPT usage after their latest image generation model came out earlier this year, specifically around the Studio Ghibli trend that turned everything into a Ghibli-looking animation. Han Nuyen of Fractal MCP commented, Gemini topped the App Store charts thanks to Nano Banana. It doesn't surprise me one bit.
Starting point is 00:02:45 Humans are visual creatures. She continued, What does surprise me is how Meta, with all its data, hasn't released a decent image or video generation model. It's even behind ByteDance. Now, of course, Meta has recently announced partnerships with Black Forest Labs and with Midjourney, so clearly they're aware of the problem,
Starting point is 00:03:02 and the same could also be said for Apple, now serving as the product surface for Google's most successful product in some time. The market is certainly paying attention. Alongside the rise of Gemini, Google's valuation has reached $3 trillion for the first time. And the turnaround has been significant. Google is up more than 70% since the April lows. 18 months ago, Google AI was all about glue on pizza and ridiculous ahistorical images, but since then the company's valuation has doubled. Citigroup analyst Ron Josey boosted his price target for the stock, citing an accelerated product development cycle that is beginning to emerge with greater Gemini adoption
Starting point is 00:03:35 across both its ads and cloud businesses. He added, we believe Google is executing better across its halo of products, experiencing greater demand, and delivering improved profitability. Now, that doesn't mean everything is rosy. Google appears to be going through some of the same sort of restructuring we talked about at xAI last week, with Wired reporting that they have fired over 200 contractors working on AI products. Wired said that these contractors were working on AI ratings work, including evaluating, editing, and rewriting the Gemini chatbot's responses to make them sound more human and intelligent. Basically, it sounds like a data annotation team of some kind.
Starting point is 00:04:07 Now, it is not exactly clear what the proximate cause of this was. As we discussed yesterday, there is a shift going on more broadly, from generalist AI data annotators to more specialists. So maybe this is part of that, or maybe it's something else. Whatever the case, there are definitely some shifts going on behind the scenes that I think are worth keeping an eye on. Lastly, today on the Google front, one of the company's top executives has defended the company's use of AI overviews in search as promoting customer choice. Yesterday, after the publisher
Starting point is 00:04:34 of Rolling Stone sued Google, the first comments from the company were very brief and generic. However, VP of Government Affairs and Public Policy Markham Erickson has now given a much more vigorous defense of AI overviews at the Wired AI Power Summit in New York on Monday. Erickson said, I don't want to speak about the specifics of the lawsuit, but I can speak to our philosophy here, which is we want a healthy ecosystem. The 10 blue links serve the ecosystem very well, and it was a simple value proposition. We provided links that directed users free of charge to billions of publications around the world. We're not going to abandon that model. We think there's use for that model. It's still an important part of the ecosystem. However, he continued, user preferences and what users want is also
Starting point is 00:05:10 changing. So instead of factual answers and 10 blue links, they're increasingly wanting contextual answers and summaries. We want to be able to provide that too while at the same time driving people back to valuable content on the internet. So it's a dynamic space. Ultimately, our goal is to ensure that we have an overall healthy ecosystem. Now, this topic is going to get more contentious. Absolutely, there are going to be more legal battles around it. However, I think that the core idea that there are certain types of searches for which the summary is a much better value proposition for users is absolutely undeniable.
Starting point is 00:05:41 The question is going to be what the right type of business model around it is and how the relationship between the summarizers and the publishers evolves to match the new reality. Next up, a small but pretty interesting story out of ElevenLabs. The company is introducing a new managed service called ElevenLabs Productions. The service offers AI-generated dubbing, captions, transcripts, and audiobooks, with a human producer to verify and polish the final product. The goal is to deliver a production-ready result, and more importantly, to bring everything under one roof for an end-to-end customer experience. Co-founder Mati Staniszewski said that ElevenLabs has already partnered with 500 producers,
Starting point is 00:06:16 more than the company's full-time employee count. Pricing starts at $2 a minute. Some folks on Twitter had the take represented by BT Norris: Wait, ElevenLabs is offering human services. Is it over? But for anyone who has interacted with enterprises or businesses of any kind when it comes to AI, it is so clear that there is a massive gap and opportunity for last mile services that provide just a little bit more assurance on top of the core gains that come from AI generation. I don't think it's at all surprising to see a company that builds the software also offering the service in some native way. I think even as the ecosystem of next-generation service providers develops, you are going to see the platforms themselves
Starting point is 00:06:55 offer some version of this kind of managed service around their core products in many, if not all cases. Lastly today, one that I'm just going to touch on briefly because I think it's likely that we come back to it in more detail and maybe even in a main episode later this week. The CEO of Fiverr has announced 250 layoffs, which represent about 30% of the company, alongside an announcement that the company is moving back into startup mode. Back in April, Fiverr CEO Micha Kaufman was one of the CEOs that came out and shared an internal note about the potential disruption of AI.
Starting point is 00:07:25 Now, what was interesting about his note was, one, the candor. It featured the line, here's the unpleasant truth. AI is coming for your jobs. Heck, it's coming for my job too. What was also interesting about this is that although we had similar notes from the CEOs of Shopify and Duolingo around the same time, Micha's original note did not have any firings or restructurings. It was basically a warning slash wake-up call that change was likely to come and an invitation
Starting point is 00:07:49 to come and talk about it. Now, six months later, it kind of feels like this is part two of that same idea. And underlying the message is basically the big reminder that the way that people build and release things is simply changing. Now, to many, it seems like Fiverr is directly in the firing line of AI. So many of the things that people used to hire freelancers for are exactly the type of things that they're now using generative AI for. And so it's not surprising that a marketplace for hiring those freelancers is feeling
Starting point is 00:08:15 the pinch. At the same time, in their most recent reporting for the second quarter, Fiverr actually saw a 15% year-on-year increase in revenue. The company stated AI-related services are booming, with surging demand, especially around AI agents, workflow automation, and vibe coding. Anyway, the point is that we are in the midst of some big changes. For now, though, that is going to do it for today's headlines. Next up, the main episode.
Starting point is 00:08:37 AI changes fast. You need a partner built for the long game. Robots & Pencils works side by side with organizations to turn AI ambition into real human impact. As an AWS-certified partner, they modernize infrastructure, design cloud-native systems, and apply AI to create business value, and their partnerships don't end at launch. As AI changes, Robots & Pencils stays by your side, so you keep pace. The difference is close partnership that builds value and compounds over time. Plus, with delivery
Starting point is 00:09:04 centers across the U.S., Canada, Europe, and Latin America, clients get local expertise and global scale. For AI that delivers progress, not promises, visit robotsandpencils.com/aidailybrief. What if AI wasn't just a buzzword, but a business imperative? On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most forward-thinking enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG, this seven-part series delivers real-world insights from leaders who are scaling AI with purpose. From aligning culture and leadership to building trust, data readiness, and deploying AI agents. Whether you're a C-suite executive, strategist, or innovator, this podcast is your front-row
Starting point is 00:09:46 seat to the future of Enterprise AI. So go check it out at www.kpmg.us/AIpodcasts or search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts. Today's episode is brought to you by Superintelligent. Now, one thing that we are having a lot of conversations with folks about is the fact that for some of you, your fiscal year is coming to an end. And that means two things. One, it means planning and thinking about what you're going to do in the next year.
Starting point is 00:10:14 And two, it means using up the last of those budgets so you don't lose them. If you are an enterprise that happens to find yourself in that situation, Superintelligent would love to help on both fronts. We are moving increasingly towards an annual AI planning model where we map out how you can create an action map of your organization's agent opportunities that represents an executable backlog of AI and agent use cases that you can deliver on over the course of the next year. Additionally, for those end-of-year budgets, we have worked out deals with a number of partners where we can pre-lock in general implementation packages, even before you've figured out exactly what use cases are going to require them.
Starting point is 00:10:51 If you'd like to learn more about Superintelligent's agent readiness audits and this new end of fiscal year plan, visit us at besuper.ai, click get started, and make sure to use the word fiscal somewhere in the description. Welcome back to the AI Daily Brief. Today we are talking about the launch of GPT5 Codex. It is, as it sounds, a new version of GPT5 that is specifically optimized for agentic coding. Sam Altman tweeted yesterday that it is faster, smarter, and has new capabilities. The team, he said, has been absolutely cooking. So today we're going to talk about GPT5 Codex, what it changes, how it works, and what the initial responses are. But before we do, I want to contextualize it in a theme which is becoming abundantly clear.
Starting point is 00:11:31 And to do that, we actually have to go back to last year. At the end of the summer of 2024, heading into the fall of 2024, there was a lot of discussion around whether AI was hitting a wall. Now, the nuanced version of this discussion was about whether pre-training was hitting a wall. All the way back then, people were expecting some new model from OpenAI. We had heard about something called Orion and had all these rumors. And yet there was no indication that they were going to release a major update, with all the leakage suggesting that they just weren't as happy with the progress that they had made. And yet, that was a very narrow slice of what was actually going on. Because, of course, while we didn't get GPT5 back then, what we did get was the new paradigm of
Starting point is 00:12:11 reasoning models. o1 was first previewed in September and then over the course of the fall started to come to production, and as it did so, it clearly opened up some new agentic possibilities that hadn't been there before. Then, of course, there was Claude 3.5 Sonnet. It was initially released back in June of 24, and in the subsequent months became the standard for people using LLMs to code. And around the same time that we're talking about o1 and all these questions of hitting a wall, we also got an updated version of Claude 3.5 Sonnet, which had major improvements in this field of software engineering.
Starting point is 00:12:42 Between the initial release version of 3.5 Sonnet and the October version, its performance on SWE-bench Verified jumped from 33.4% to 49%. It was also around this time that we started to see some user behavior shift. It wasn't just that people were using Claude to code, but that dedicated applications for AI coding started to become successes. Bolt launched around this time, October 2024, quickly grew to $5 million, and in the first two months grew to $20 million in ARR. And yet, of course, from a narrative perspective,
Starting point is 00:13:11 as we came into 2025, the story was all about agents in general. Everyone thought that this was going to be the year when we started to see the first embers of digital employees working alongside human employees. And certainly we have seen some major developments in particularly enterprise adoption of agents. Between Q4 of last year and Q1 of this year, agent pilots jumped massively. And then between Q1 of this year and Q2 of this year, enterprise agent deployments actually tripled. And yet I would argue that part of the malaise that we've been experiencing for the last
Starting point is 00:13:41 month or so, is that when push comes to shove, at least from a maturity perspective, 2025 hasn't been the year of agents, at least not in general. What it has been is the year of one very specific type of agent, which is, of course, the coding agent. In February, we had Andrej Karpathy coin the term vibe coding. He wrote, there's a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs are getting too good. When I get error messages, I just copy paste them in with no comment. Usually that fixes it. The code grows beyond my usual comprehension. I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug,
Starting point is 00:14:20 so I just work around it or ask for random changes until it goes away. I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works. It wasn't long after that, in addition to Bolt, we started to see other platforms like Lovable and Replit just absolutely shoot to the moon in terms of their ARR. And of course, as this was happening, we were getting updates to the tooling as well. A big one happened back in February when Anthropic announced Claude Code, which was an agentic coding tool that allowed people to work directly within their terminal rather than using some other platform like Cursor. We started to hear from CEOs that AI tools were doing a meaningful percentage of the code that was getting pushed. And we got a very bold prediction from Dario Amodei,
Starting point is 00:14:57 who back in March said that within three to six months, AI could be writing 90% of the code that software developers were deploying. Now, obviously, there's been a lot of debate around the accuracy of that prediction, which I've covered in previous shows, but certainly when it comes to just about any chart you can show about adoption or revenue around coding, the proof is in the pudding. Anthropic in particular has been an enormous beneficiary. After taking a couple of years to get up to a billion dollars in revenue around the December-January time frame, Anthropic rocketed from that $1 billion in ARR all the way up to $5 billion in ARR by the end of the summer. Indeed, as much as some other tools like Gemini
Starting point is 00:15:32 2.5 and 2.5 Flash were occasionally used for certain types of coding projects. Anthropic has been largely dominant when it comes to AI coding. And of course, given how significant this use case is, this was certainly not something that OpenAI could abide. Which brings us to the release of GPT5. Now, we've covered the release of GPT5 endlessly, but the important reminder for our purposes here is that it was so clear right out of the gate that the use case that they cared about was coding. I even tweeted back then, OpenAI is betting about 700 million new people are about to become coders. It was in fact the first bullet they talked about in describing the new features in the announcement blog post, ahead even of creative writing. And while the first couple weeks after the release
Starting point is 00:16:17 of GPT5 were spent in a combination of the dustups around the deprecation and reintroduction of 4o and a more general resurgence of the talk of performance walls, it wasn't long before people started to come around. In the middle of August, Shawn Wang tweeted, watching the timeline flip on GPT5's sentiment from negative to positive is pretty funny. He also pointed out that the same thing happened to o1 on a pretty similar time scale. And more than anything else, the thing driving the vibe shift was OpenAI's coding tool, Codex. At the beginning of September, Sam Altman tweeted,
Starting point is 00:16:47 Really cool to see how much people are loving Codex. Usage is up 10x in the past two weeks. Lots more improvements to come, but the momentum is so impressive. On the same day, Jan Palag tweeted, Codex CLI hype is real. GPT5 high in Codex is great, stays on track much longer than Opus, never gives up on your task, much longer context window, etc. And as recently as yesterday, we talked about how some discussion had shifted to how significant the development of coding agents was going to be in the quest for AGI. And as I was pressing send on that piece, we got this
Starting point is 00:17:17 update from OpenAI, the release of a new version of GPT5, GPT5 Codex. OpenAI writes, GPT5 Codex was trained with a focus on real-world software engineering. It's equally proficient at quick interactive sessions and at independently powering through long complex tasks. Its code review capability can catch critical bugs before they ship. Now, one of the major differences between the GPT5 model that runs ChatGPT and GPT5 Codex is the way it implements reasoning. Unlike GPT5, the Codex version doesn't have a model router. Instead, it's able to adjust its reasoning effort in real time as it works on a task.
Starting point is 00:17:52 So the idea here is that whereas the router in ChatGPT has to decide at the outset how much computational power and time to use on a particular problem, GPT5 Codex can dig into a problem and then decide a few minutes in whether it needs to add more or less power to solving it. Basically, it uses a version of dynamic thinking so it can modulate its reasoning effort based on the complexity of the task. OpenAI also claims a big jump in prompt adherence, saying that you should be able to, quote, just tell it what you need
Starting point is 00:18:20 without writing long instructions on style or code cleanliness. Now, it wouldn't be a model release without the benchmarks, and GPT5 Codex does have a modest bump on the SWE-bench Verified test, jumping from 72.8% to 74.5%. And yet, as we're seeing with basically all new releases right now, the most common benchmarks just really aren't telling the full story about model improvement anymore. To that end, OpenAI devised a custom code refactoring eval based on large established repositories.
Starting point is 00:18:46 While GPT5 High got 33.9% on those code refactoring tasks, GPT5 Codex jumped all the way up to 51.3%. So when it comes to the significance, there are a few things here. Variable thinking like we were just talking about is clearly one of the big parts of this. Swyx posted the token usage distribution chart during internal testing and commented, This is the most important chart on the new GPT5 Codex model. We're just beginning to exploit the potential of good routing and variable thinking. Easy responses are now greater than 15 times faster, but for the hard stuff,
Starting point is 00:19:17 5 Codex now thinks 102% more than 5. Same model, same paradigm, but bending the curve to fit the non-linearity of coding problems in LLM use cases. And this idea of token efficiency is really starting to come to the fore. For a long time, it's sort of been like everyone just uses the most state-of-the-art thing, especially on the development side, but as use cases get more complex, and that complexity is expressed in token consumption, and token consumption is expressed in cost,
Starting point is 00:19:43 and cost is starting to be expressed in shifting business models that are based on usage rather than bulk rates, the efficiency of models is starting to matter alongside just raw capabilities. Swyx again writes, developers are going to prefer the model that sips or spends tokens according to task difficulty. Why spend $200 on a chonky plan if you can get $20 on a competitor who takes efficiency seriously? This is perhaps the most important hyperparameter of inference time compute. AI engineer Daniel Mack actually thinks that the variable thinking and the ability to adjust effort while in the midst of solving the problem shows, as he puts it, a spark of metacognition.
Starting point is 00:20:17 He says this represents AI models beginning to think about their own thinking process, the exclusive domain of human minds until now. More simply from a business standpoint, Theo writes, GPT5 Codex is as far as I know the first time a lab has bragged about using fewer tokens. Hope this becomes a trend. Now, the other big part of this announcement has to do with autonomy. You heard in that sentence at the top of the description in their blog post that GPT5 Codex was trained to focus on independently powering through long complex tasks.
Starting point is 00:20:46 But when they say long complex tasks, they mean really long complex tasks. Alongside the announcement, OpenAI released a conversational podcast, and one of the team members said, one of the things that this model exhibits is an ability to go on for much longer and to really have that grit that you need on these complex refactoring tasks. But at the same time, for simple tasks, it actually comes way faster at you and is able to reply without much thinking. And so it's this great collaborator where you can ask questions about your code, find where this piece of code is that you need to change or better understand, plan. But at the same time, once you let it go on to something, it will work for a very, very long period of time. We've seen it work internally up to
Starting point is 00:21:20 seven hours for very complex refactorings. We haven't seen other models do that before. VraserX writes, GPT5 Codex just shattered a frontier in autonomous AI work. A few weeks ago, the record was around 200 minutes of continuous independent coding. That came from Replit, by the way, and even that was crushing the previous high. VraserX continues, now OpenAI claims Codex can push through seven plus hours nonstop on complex tasks, iterating, fixing test failures, and delivering working implementations. That's not just a speed boost, it's a shift towards agents that don't just chat but persist, staying on a problem until it's solved.
Starting point is 00:21:55 We may be watching the early shape of true autonomous dev agents emerging. What happens when this stretches to days or weeks? Now, again, that 200 minutes that they're referring to was the claim from Replit on the release of Agent 3 last week. Before that, the best data we had on long-horizon tasks was from METR. Replit CEO Amjad Masad said at the time, the METR paper that says that the length of tasks AI can do is doubling every seven months radically undersells the scaling that we're seeing at Replit. It might be true if you're measuring one long trajectory for a single model class, but this is where an agent research lab's alpha is. We build multi-agent
Starting point is 00:22:25 architecture and use different models from various providers to tap into their latent abilities across various tasks. The point being that the Agent 3 system is capable of these 200 minutes. It seems like what OpenAI is saying is that GPT5 Codex on its own can do that even without this type of Replit multi-agent architecture. Now, another really important part of this is that it appears to be starting to try to address some of the shortcomings of agentic coding as well. In that same conversation, OpenAI's Greg Brockman said, We started to notice that the big bottleneck for us was with increased amounts of code needing to be reviewed.
Starting point is 00:22:55 We decided to really focus on a very high signal codex model where it's able to review a PR and really think deeply about the contract and the intention that you were meaning to implement and then look at the code and validate whether that intention is matched or found in the code. It's able to go layers deep, look at the dependencies, think about the contract,
Starting point is 00:23:11 and really raise things that some of our best reviewers wouldn't have been able to find unless they were spending hours really deeply thinking about that PR. We released this internally first at OpenAI. It was quite successful and people were upset when it broke because they felt like they were losing that safety net, and it accelerated teams, including the Codex team tremendously. Now, this has been a big conversation recently, that yeah, vibe coding is great, but it
Starting point is 00:23:31 really just means that you're shifting to fixing all the problems with vibe coding. If this new model is natively good at code review, that could obviously change that dynamic. And unsurprisingly, after all you've heard, the early reviews are quite positive. Developer Nick Dobos writes, I had early access to the new GPT5 Codex and it's very good. Feels much more like a context-driven reader. Hums away, looking through your codebase and then one-shots it, versus other models that prefer immediately making a change, making a mess, and then iterating over and over. When someone asked, how does it feel compared to Claude Code on Opus 4.1? Nick wrote,
Starting point is 00:24:03 Opus and Sonnet feel like a workhorse that just does whatever you say, even if it's stupid. GPT5 Codex feels like thinking mode where it first checks everything. Musician and developer Michael Wall wrote, I had about four days and worked within three very different active code bases. Within hours, I built things I never thought I could. My first impressions, lightning fast, natural language coding capabilities, produces functional code on the first attempt. Even when not perfectly matching intent, code remains executable rather than broken. Maintains accuracy, avoids false confirmations, or persistent hallucinations common in other coding models. Transparent reasoning
Starting point is 00:24:36 process through clear formatting, particularly valuable for learners, clean, structured outputs that make managing multiple projects simultaneously feel organized rather than overwhelming. He added a bunch more, but he sums up, creates a genuinely enjoyable coding experience. Dan Shipper and the team over at Every also had access, and Dan said it is wild. They noticed that it dynamically chooses thinking time, so this wasn't just in the lab, but in practice that it, quote, works for long periods on hard questions and returns instant answers for easy ones. When it came to autonomy, they didn't get those seven hours, but Dan did write, it ran autonomously for up to 35 minutes in our testing on a production code base, a noticeable upgrade from GPT-5, which tended to be too cautious.
Starting point is 00:25:14 Overall, they found it to be a, quote, really good upgrade that makes Codex CLI a legitimate alternative to Claude Code, although they did say that it requires prompting to get the right behavior. They also wrote, quote, sometimes it's lazy. It can underthink for some tasks and refuse to do tasks if it thinks they're too large. Overall, though, it was a very positive review. Now, there's one other part of the story that's worth noting, which is why when it comes to the overall environment of AI in practice, OpenAI's new releases really matter
Starting point is 00:25:41 compared to every other lab. A PhD student going by Zofon wrote, reminder that Twitter is in a bubble. Claude is minuscule compared to Codex. And while there were a bunch of comments quibbling with methodology, the point that they were making is that while all of us who are either producing or listening to this show are hyper attuned to the developments of the particular labs, for more passive consumers of AI developments, there is an absolute premium placed on what OpenAI does, given its stature in the field. Now, while the overall conversation has obviously been wildly positive, you are starting to see the emergence of some discussion of whether the labs are actually over-focusing on this area.
Starting point is 00:26:17 Professor Ethan Mollick tweeted, The problem with the fact that the AI labs are run by coders who think code is the most vital thing in the world is that the labs keep developing super cool specialized tools for coding, but every other form of work is stuck with generic chatbots. He continues, yes, every other company on the planet is rushing to release AI tools for other forms of work, but if you don't own a frontier LLM and you can't train specialized models to go with your specialized AI for X interface, you are limited in what you can accomplish.
Starting point is 00:26:45 Now, Roon from OpenAI commented, this is good and optimal, seeing as autonomous coding will create the beginning of a takeoff that encompasses all those other things, basically encapsulating what we were talking about yesterday with regard to coding's role in achieving AGI. Ethan playfully quipped, coder says what? Joking, yes, I get that there is a plausible reason why coding is elevated the way it is in labs, but it still leaves almost all work and workers and students out of the really interesting part of rapid AI development that only programmers get to see right now. Roon reflected on this and actually pointed out something really interesting. He writes, right now is the time where the takeoff looks the most rapid to insiders,
Starting point is 00:27:22 i.e. we don't program anymore, we just yell at Codex agents, but may look slow to everyone else as the general chatbot medium saturates. I think that this is an extremely, extremely salient point. My biggest contra take to the arguments of AI stagnation has been to remind people to look at everything that isn't just how much better the latest ChatGPT responds compared to the last one. Nano Banana unlocking new types of photo editing that allow for a ton of production use cases that weren't possible with image generation tools before. Veo 3 adding audio into the generation and taking over social media.
Starting point is 00:27:57 And of course, the coding tools totally shifting how everything in the world gets built. So I think that the gap that Roon is identifying here is true. Although the lucky thing for all of us is that because of tools like Lovable and Claude Code and Codex CLI and Cursor and Bolt and Replit and all of the others out there, one of the big points is that in practice the gap has never been smaller. So if you haven't yet, go out and try the new tools, report back on how they work. For now, that's going to do it for today's AI Daily Brief. Appreciate you guys listening or watching as always. And until next time, peace.
