The AI Daily Brief: Artificial Intelligence News and Analysis - Opus 4.6 and ChatGPT 5.3-Codex Are Here and the Labs Are at War

Episode Date: February 6, 2026

Anthropic dropped Claude Opus 4.6 and OpenAI responded with GPT 5.3 Codex just 20 minutes later, the most intense head-to-head model release we've ever seen. Here's what each model brings, how they compare, and what the first reactions are telling us. In the headlines: Google and Amazon share their capex plans, and we're about to spend 2.5 moon landings on AI.

Brought to you by:
KPMG - Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Rackspace Technology - Build, test and scale intelligent workloads faster with Rackspace AI Launchpad - http://rackspace.com/ailaunchpad
Zencoder - From vibe coding to AI-first engineering - http://zencoder.ai/zenflow
Optimizely Agents in Action - Join the virtual event (with me!) free March 4 - https://www.optimizely.com/insights/agents-in-action/
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
Section - Build an AI workforce at scale - https://www.sectionai.com/
LandfallIP - AI to Navigate the Patent Process - https://landfallip.com/
Robots & Pencils - Cloud-native AI solutions that power results - https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? sponsors@aidailybrief.ai

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, we've got not one but two new models that show exactly where the leading model labs' priorities lie. And before that, in the headlines, looks like we're going to spend a cool two-thirds of a trillion dollars on AI infrastructure this year. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. Firstly, thank you to today's sponsors, KPMG, Scrunch, Superintelligent, and Blitzy. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show or really want to learn anything else about the show, you can head on over to aidailybrief.ai.
Starting point is 00:00:44 Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. We kick off today with Google and Amazon rounding out big tech earnings with a very unified message: AI CAPEX is accelerating faster than ever. Both companies lifted their CAPEX forecasts significantly. Google guided AI spending between 175 and 185 billion for this year, vastly outstripping estimates of 115 billion. This level would double Google's already high 91 billion in CAPEX for 2025. Amazon, though, came in over the top the following evening, guiding 200 billion in CAPEX for 2026 for a 60% jump. With Google, Amazon, Microsoft, and Meta all lifting expectations, we now have 650 billion in projected AI CAPEX for 2026
Starting point is 00:01:31 from just these four. That's more than the inflation-adjusted cost of the multi-decade U.S. Interstate Highway project, now anticipated to be spent in a single year. It's about two and a half Apollo Moon missions or four and a half International Space Stations. Now, on the actual earnings, there was a slight divergence in performance. Google reported annual revenue of $400 billion for the first time. They saw an 18% increase in overall revenue year over year, and a 48% jump for their cloud division. Still, Google Cloud was a $17.7 billion business for the quarter, which puts them firmly in third place behind Microsoft Azure and AWS. At the same time, they recorded by far the fastest growth rate
Starting point is 00:02:07 and were the only hyperscaler that increased their pace of growth. Amazon's story was slightly less positive. Net profit was $21.2 billion right in line with expectations. Topline revenue growth was 13.6% for a slight beat, reaching $213.4 billion for the quarter. AWS revenue growth was 24%, their fastest growth rate in three years, bringing division revenue to $35.6 billion for the quarter.
Starting point is 00:02:31 While the numbers were fine, they didn't necessarily speak to massive monetization of AI bets, and CEO Andy Jassy spent much of the earnings call justifying the massive ramp-up in CAPEX. He told investors, I think this is an extraordinarily unusual opportunity to forever change the size of AWS and Amazon as a whole. We see this as an unusual opportunity and we're going to invest aggressively to be the leader. Later in the Q&A section, he pushed back against an analyst who questioned the conviction. Jassy commented, this isn't some sort of quixotic top line grab.
Starting point is 00:02:58 We have confidence that these investments will yield strong returns on invested capital. We've done that with our core AWS business. I think that will very much be true here as well. Similar to Microsoft, both Amazon and Google said they were capacity constrained in their cloud businesses. They claim that stronger growth would have been possible if they had more GPUs on racks in 2025. Still, both companies saw a big drop in share price following their earnings calls, with Google falling 6% on Wednesday night and Amazon losing 11% on Thursday night. Now, one interpretation of this is investors being uncomfortable with spending at these levels regardless of AI-derived revenue. But there is also something more going on here that
Starting point is 00:03:34 is worth exploring. For decades, hyperscalers have been doing hundreds of billions of dollars in stock buybacks each year, peaking at over a trillion dollars in 2023. Most analysts expect the hyperscalers to reduce or even end buybacks this year. CAPEX plans also seem likely to require debt funding across the board for the first time. Steve Goldstein, the Europe bureau chief of MarketWatch, wrote, it's funny that we had a decade of no, stock buybacks are evil, and now that companies are actually ramping up CAPEX, it's, no, not like that. Quantian summed it up, investors have officially remembered that doing CAPEX means you can't spend money on buybacks and decided they don't like it anymore. Architect pointed out that Stanley Druckenmiller has argued in the past that high
Starting point is 00:04:12 corporate CAPEX, such as spending on factories, inventory, and equipment, acts as a drag on financial assets because it drains liquidity from the financial system. Now, this is a super important point. We tend to think of these investors as sending signals about what they find to be a reasonable amount to spend on AI, but what they really might be saying is not that they necessarily think it's wrong to spend that on AI. They just don't like that it's not there to spend on them. I think we're going to see a lot more of this debate play out, but keep that in mind as you try to interpret markets' reactions to the hyperscalers over time. Speaking of Amazon, the company is considering a deep partnership with OpenAI, including using their models to power Alexa. As previously
Starting point is 00:04:50 reported, Amazon is in talks to take part in OpenAI's latest funding round. They are in fact rumored to be considering an investment as large as $50 billion, which would be about half the money that OpenAI is seeking to raise. The Information now reports that Amazon isn't just interested in an equity stake or a compute partnership, but is looking to get privileged access to OpenAI's tech. Sources said that OpenAI's models could bolster Amazon's AI products, including the Alexa voice assistant, under a proposed deal. The process would require post-training OpenAI models to tune them for Amazon's use cases and would also require OpenAI to supply dedicated researchers and engineers to the process. That could be the hitch, as that would obviously divert resources to some extent
Starting point is 00:05:25 away from their own ambitions. At this point, I'm not sure how much to make of it, with a spokesperson for OpenAI saying we are focused on our strong existing compute partnership with Amazon. One other little nugget from the earnings report, Gemini has, according to Google, hit 750 million monthly active users. In December, Google said that Gemini's user base had surged from 450 million to 650 million in the final quarter, making this another substantial jump for January. The latest figure we have for OpenAI came from Sensor Tower, with their data showing that ChatGPT had 110 million monthly active users as of November. Now, there is a little bit of a question mark around how some of these companies are measuring
Starting point is 00:06:01 user numbers. Meta, for example, claims 500 million monthly users for Meta AI, but that's presumably including quite a few people who stumble across the assistant in Instagram or WhatsApp. Google, however, was clear that these numbers are only counted using the Gemini app. CEO Sundar Pichai said in a statement, the launch of Gemini 3 was a major milestone and we have great momentum. One quick nugget of fundraising news,
Starting point is 00:06:23 ElevenLabs has secured a half billion dollars in new funding at an $11 billion valuation. The round triples ElevenLabs' previous valuation from their last funding round, which closed in January of last year. In terms of what comes next, it sounds like they're interested in moving into video. Co-founder Mati Staniszewski said, The intersection of models and products is critical
Starting point is 00:06:41 and our team has proven time and again how to translate research into real-world experiences. We plan to expand our creative offering, helping creators combine our best-in-class audio with video and agents, enabling businesses to build agents that can type and take action. Finally, speaking of taking action, a story that deserves way more time than it gets in these headlines, but which I'm sure we will come back to: in addition to the news that will be the subject of our main episode today, OpenAI also announced a new platform called Frontier. The goal is basically to help businesses deploy AI co-workers. OpenAI writes that it is a new
Starting point is 00:07:13 platform that helps businesses build, deploy, and manage AI agents that can do real work. Frontier gives agents the same skills people need to succeed at work: shared context, onboarding, hands-on learning with feedback, and clear permissions and boundaries. That's how teams move beyond isolated use cases to AI co-workers that work across the business. Basically, Frontier is a combined orchestration, governance, and optimization platform for OpenAI's agents. It allows users to manage the skills that each agent has access to, share context between agents, and set permissions and boundaries. OpenAI noted that AI leaders across every industry are rapidly rolling out agentic deployments, adding that
Starting point is 00:07:47 what's slowing them down isn't model intelligence but how agents are built and run in their organizations. They noted that the capability gap between leading performance and live deployments is actually growing due to increased complexity around agent governance. Frontier is designed to give a unified platform to control all the things around the AI model that go into a successful agentic deployment: context, data access, skills management. Now, this is something that we are going to necessarily talk a lot more about, but I just wanted to flag a little bit of commentary around this chart that was flying around Twitter and in particular financial circles. For those of you who are listening, not watching, it's a chart that shows at the bottom,
Starting point is 00:08:21 your enterprise system of record, and then five layers above it. Business context, agent execution, and evaluation and optimization right above it, then agents above that and interfaces above that. Investor Gokul Rajaram writes, check out where systems of records sit in this diagram from OpenAI Frontier. At least three, if not four, layers of context and intelligence sit between them and the end business application. It's one of the clearest representations of how AI companies plan to build next-gen systems of action
Starting point is 00:08:47 on top of existing systems of record, and why the markets are so worried about the future of software companies. Bucco Capital put it even more simply, quite a visual from OpenAI. Your system of record is a dumb pipe and we will layer five rows of value on top of it to steal the relationship and all the economics along with it. No wonder, SaaS is in the gutter. Like I said, there is so much more to explore about Frontier, but we have not one but two model releases to talk about, so we are going to close the headlines there and move on into our main episode. Sure, there's hype about AI, but KPMG is turning AI potential into business value. They've embedded AI and agents across their entire enterprise to boost efficiency,
Starting point is 00:09:28 improve quality, and create better experiences for clients and employees. KPMG has done it themselves. Now they can help you do the same. Discover how their journey can accelerate yours at www.kpmg.us slash agents. That's www.kpmg.us slash agents. Quick question. When was the last time you actually visited a website to research something? If you're like me, AI pretty much does that work for you now. That of course raises a new question for brands. If AI is doing the discovering, researching, and deciding, who or what is your website
Starting point is 00:10:01 really for? That shift in user behavior, the rise of AI bots becoming your most important new visitors, is what my sponsor Scrunch is taking head on. Scrunch is the AI customer experience platform that helps marketing teams understand how AI agents experience their site, where they show up in AI answers, where they don't, and what's preventing them from being retrieved, trusted, or recommended. And it's not just visibility. Scrunch shows you the content gaps, citation gaps, and technical blockers that matter, and helps you fix them so your brand is found and chosen in AI answers. Now, for our listeners, Scrunch is providing a free website audit that uncovers how AI sees your site, where there's gaps, and how you're showing up in AI versus the competition.
Starting point is 00:10:38 Run your site through it at scrunch.com slash AI. Today's episode is brought to you by Superintelligent. Superintelligent is a platform that very simply put is all about helping your company figure out how to use AI better. We deploy voice agents to interview people across your company, combine that with proprietary intelligence about what's working for other companies, and give you a set of recommendations around use cases and change management initiatives that add up to an AI roadmap that can help you get value out of AI for your company.
Starting point is 00:11:04 But now we want to empower the folks inside your team who are responsible for that transformation with an even more direct platform. Our forthcoming AI Strategy Compass tool is ready to start to be tested. This is a power tool for anyone who is responsible for AI adoption or AI transformation inside their companies. It's going to allow you to do a lot of the things that we do at Superintelligent, but in a much more automated, self-managed way and with a totally different cost structure. If you are interested in checking it out, go to aidailybrief.ai slash compass, fill out the form
Starting point is 00:11:32 and we will be in touch soon. Blitzy is driving over 5x engineering velocity for large-scale enterprises. A publicly traded insurance provider leveraged Blitzy to build a bespoke payments processing application, an estimated 13-month project, and with Blitzy, the application was completed and live in production in six weeks. A publicly traded vertical SaaS provider used Blitzy to extract services from a 500,000 line monolith, without disrupting production, 21 times faster than their pre-Blitzy estimates. These aren't experiments. This is how the world's most innovative enterprises are shipping software in 2026. You can hear directly about Blitzy from other Fortune 500 CTOs on
Starting point is 00:12:07 the modern CTO or CIO-classified podcasts. To learn more about how Blitzy can impact your SDLC, book a meeting with an AI Solutions consultant at blitzy.com. That's BLITZY.com. Oh my goodness. Look, it is a good day around here when we get news about upcoming models. It's a great day when we actually get a new model that we get to play with. And it is just something else entirely when within 20 minutes of each other,
Starting point is 00:12:37 two of the leading frontier labs drop dueling frontier models, in this case with an incredibly clear focus. The models are, of course, Claude Opus 4.6 from Anthropic and GPT 5.3 Codex from OpenAI. And for sure, the first thing that people noticed is the sequence. Now, none of the Frontier Labs have ever been above trying to get out in front of one of their peers when it comes to a big announcement, but we've never seen anything like this. Versailles Alley writes, Anthropic versus OpenAI is like Kendrick versus Drake, but for nerds.
Starting point is 00:13:05 This indeed was a common refrain. Ayush wrote, Anthropic drops a model and OpenAI responds 15 minutes later. This is basically a rap beef now. Peter Yang writes, how did Opus 4.6 and GPT-5.3 drop back to back in like 20 minutes? Is this a new Cold War or something where there are spies in each company? The answer, Peter, is almost certainly yes. But obviously way more important than the popcorn ball watching of the competition is what the models actually have for us. On the Opus side, Anthropic says there are key coding improvements including better code review and debugging skills to, quote, catch its own mistakes. Interestingly, it
Starting point is 00:13:39 does also feel like Anthropic, because Claude is so associated with coding, used the chance to build on the themes that they've been exploring with Claude Cowork to talk about how Opus 4.6 was better for everyday tasks as well. Still with the focus, of course, on white-collar work, but things like running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. In fact, in their presentation of the benchmarks, they actually put knowledge work right up front. On the benchmark side, they claim the leading score on the agentic coding benchmark, Terminal Bench 2.0, and also claim the top spot on the leaderboard for Humanity's Last Exam.
Starting point is 00:14:13 Now, Humanity's Last Exam began as a general knowledge test, but is increasingly a measure of reasoning and tool use capability. In terms of new features, Opus 4.6 now supports million token context windows, and for folks who feel like they constantly run against those limits, that has got to be welcome news. They've also introduced something that I'm really excited about called Agent Teams. Now, Anthropic seemingly got the memo that Swarms was sort of an intimidating name, because they are explicitly trying to move away from the naming of agent swarms and calling it teams instead. The feature basically allows users to set a whole team of Claudes to work on a particular problem, including a coordination layer to ensure they're working on separate tasks
Starting point is 00:14:51 that all contribute to the whole. They write that agent teams are most effective for tasks where parallel exploration adds real value. Some of the examples they give are cross-layer coordination, where you've got coding changes that span front-end, back-end, and tests, but they also point to research and review, where, for example, multiple teammates can investigate different aspects of a problem simultaneously, then share and challenge each other's findings. The difference between subagents and agent teams basically comes down to the extent to which you need your agents to communicate with one another. Anthropic sums up, use subagents when you need quick, focused workers that report back, use agent teams when teammates need to share findings, challenge each other and coordinate on their own.
Starting point is 00:15:28 Anthropic has also added a feature called adaptive thinking. It allows the model to pick up on context clues to determine how much reasoning effort to expend on a particular task. Users also have manual controls available to dial up or down the amount of reasoning effort being deployed. Now, to demonstrate the power of their new agent swarm slash agent teams, they tested a fully autonomous coding task. Anthropic writes, we tasked Opus 4.6 using agent teams to build a C compiler and then mostly walked away.
Starting point is 00:15:54 The task made use of the Ralph Loop for Continuous Work, which we talked about in a show recently, but basically means a continuous loop where it's always checking in so it doesn't get stuck and can continue to take on the same challenge even if it doesn't solve it in one session. Overall, the process consumed around 2 billion tokens, generating over 140 million output tokens, and costing around $20,000 using standard API pricing. The test was performed without internet access and only used the Rust standard library. Anthropic also noted that this version of Claude was created by Claude, with the AI model now the key driver of all coding within Anthropic. Now, there are a bunch of additional features that they talked about. Opus 4.6 shows significant improvement on long-context
Starting point is 00:16:32 retrieval and long-context reasoning, which should lead to a much better experience in long-horizon tasks as well as coding work that requires the model to access large codebases. Overall, they said, with Opus 4.6, we found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions. So that was 4.6, and then literally 15 minutes later, OpenAI dropped GPT Codex 5.3. Now, the first thing you might notice is that we don't have a non-Codex 5.3 yet, and I think it's quite clear that the choice to release the coding-tuned version of GPT-5.3
Starting point is 00:17:09 as a standalone before the release of the regular GPT 5.3 tells you everything you need to know about how these labs see the importance of various use cases. OpenAI says that this model advances both the coding performance of GPT Codex 5.2 and the reasoning and knowledge abilities of regular GPT 5.2 in a single package. Similar to Anthropic, OpenAI's models are now the core of the development team, with OpenAI writing, GPT-5.3 Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training, manage its own deployment, and diagnose test results and evaluations.
Starting point is 00:17:45 Our team was blown away by how much Codex was able to accelerate its own development. Max Stoiber from the ChatGPT team shared that a recently announced feature, where ChatGPT had full support for MCP apps, was built entirely with GPT-5.3 Codex. Max writes, zero lines of code written by hand. Most times, the Codex CLI worked autonomously for hours and implemented parts of this first try. Now, on the benchmarks, self-reported, of course, Codex 5.3 is a significant jump on SWE-Bench Pro compared to the previous version. And maybe even more notable than the performance itself, it was achieved
Starting point is 00:18:17 with far fewer tokens, demonstrating OpenAI's work on token efficiency. The model claims a new absolute state-of-the-art score of 77.3% on Terminal Bench 2.0, which beats Codex 5.2 at 64%, and is, if true, much higher than Opus 4.6 at 65.4%. For their autonomous coding demo, OpenAI showed off a racing game. Codex 5.3 used the web game skill and was fed generic prompts throughout the process like fix the bug and improve the game. The result demonstrated that Codex 5.3 can churn through a task using millions of tokens without human intervention. Like Anthropic, OpenAI also highlighted various non-coding work that Codex 5.3 excels at. While for developers, the model is trained to be able to
Starting point is 00:18:55 debug, deploy, write PRDs and tests, and a lot more, basically supporting the entire development lifecycle, for non-programmers, they showed off tasks like a set of financial advice slides, a retail training document, and a fashion presentation deck. Now, interestingly, on GDPval, Codex 5.3 was very similar to previous models, but on the OSWorld benchmark, which measures computer use in real-world tasks, Codex 5.3 scored 64.7%, which almost doubles the performance of GPT 5.2. Summing up, OpenAI wrote, with GPT-5.3 Codex, Codex is moving beyond writing code to using it as a tool to operate a computer and complete work end-to-end. By pushing the frontier of what a coding agent can do, we're also unlocking a broader class of knowledge work, from building
Starting point is 00:19:37 and deploying software to researching, analyzing, and executing complex tasks. What started as a focus on being the best coding agent has become the foundation for a more general collaborator on the computer, expanding both who can build and what's possible with codex. Basically, summing up, you got two labs which are both reaffirming everything we talked about on the Code AGI as functional AGI episode, that by expanding the capability set around coding, it unlocks use cases that are far beyond coding and are core to economically valuable knowledge work. So what were the first impressions? The companies that had early access to these models?
Starting point is 00:20:11 Big surprise, like them. AJ Orbach from Triple Whale said, we've had early access to Claude Opus 4.6 and have been testing it over the past week. It's the best model in the world for front-end report design, ad anatomy and copywriting, and orchestrating other tools to actually build creatives. What's kind of crazy is how good these models are getting at long-running agentic work. Box's Aaron Levie said that overall Opus 4.6 represented a 10% jump over Opus 4.5 on their hardest knowledge work tasks. In terms of the features that people were most excited about, on the Opus side, many, many focused on this 1 million token context window. Now, as Menlo Ventures' Deedy Das
Starting point is 00:20:45 pointed out, part of why people were excited about this is that it came with, as he put it, insane state-of-the-art performance on all the long context benchmarks. In other words, this isn't one of those claims of a long token context window where functionally it actually doesn't work. Others honed in quickly on the swarm slash team mode. Mckay Wrigley shared a test between Opus 4.6 with the teams mode and 4.6 without it, and found that the teams mode was 2.5 times faster and did better. He also pointed out, Reminder that Swarms is available in the Claude Agent SDK as well. You can build Swarms into any product literally right now.
Starting point is 00:21:18 Kieran Klaassen from Every thinks the implications are even bigger. He wrote, been running Agent Swarms for a few weeks now. I think this is the future, but I'm re-learning what feature development even means. On the 5.3 Codex side, the things that people were taking note of were things like the new token efficiency. Andy Henney writes, The biggest 5.3 Codex news is that it's roughly three times more token efficient. 5.3 high is smarter than 5.2 high but uses one-third the tokens, making it faster and letting your weekly limit last about three times longer. Good job.
Starting point is 00:21:48 But of course, the real question is, which of these models is better? Perennial early tester Simon Willison kind of shrugged his shoulders. In his blog post about them, he said, I don't have much interesting to say about these models yet, to be honest. They're both incremental improvements on their predecessors and very capable. Maybe my favorite tweet, which maybe you'll get right away but did take me a second, came from ThePrimeagen, who wrote, I cannot believe how much better 5.3 is than 4.6. After some internal testing, results show it's 15.2% better. Now, rather than explain, I'll let you try to work that one out for yourselves.
Starting point is 00:22:21 Some people did try to actually make a comparison, though, and many did it with their own benchmarks. Neil Chudley writes, TLDR, Opus 4.6, 1 million context, enterprise and knowledge work, agent teams, and Claude Code, not benching as high as Codex 5.3. Although he does point out that he doesn't really care about self-reported benchmarks. On 5.3 Codex, he writes, Wins Code benchmarks, faster, mid-task steering, but less than half the context window of Opus.
Starting point is 00:22:47 His conclusion, gonna have to try Codex, I guess. With things still pretty fresh on the performance side, Latent Space focused on the bloodsport that model releases have become. They write, if you think the simultaneous release of Claude Opus 4.6 and GPT-5.3 Codex is sheer coincidence, you're not sufficiently appreciating the intensity of the competition between the leading two coding model labs in the world right now.
Starting point is 00:23:08 Now, ultimately, they gave this round to Anthropic from a developer attention perspective, for getting out ahead of OpenAI and offering a huge range of new features to try out, and a $50 credit for tinkering away this weekend. At the same time, they noted that OpenAI won across most benchmarks and delivered a model that was 25% faster than their previous generation and has higher token efficiency. Their head-to-head comparison, nascent though it is, showed OpenAI was stronger in coding and speed, while Opus had the edge on agent orchestration and long context tasks. However, and I think this is extremely important if you were trying to have a definitive answer right now,
Starting point is 00:23:40 they noted that the pair of models are so close that it's likely that, quote, all first-day third-party reactions are either biased or superficial. This idea of Opus for orchestration, Codex for coding is something that I've been seeing even for the last couple of weeks, though, and it'll be interesting to see if, with the release of these new models, that framework or narrative reinforces itself among developers. When VibeCode app's Riley Brown asked what the vibe check is so far and how people were feeling about 5.3 and 4.6, a lot of the comments were basically about how OpenAI's Codex models had gotten really good and that people who hadn't tried them because they were so hooked on Claude Code owed it to themselves to give them a try. Still, when developer Ryan Carson asked which model are you going to code with this week,
Starting point is 00:24:22 with more than 700 voters, 53.3% said Opus 4.6 compared to 24.9% for Codex 5.3, suggesting that Anthropic still has the developer devotion. The team at Every had early access to both models, with Dan Shipper's big thesis being that the models are converging. He writes, Opus 4.6 has all the things we love about 4.5, but with the thorough, precise style that made Codex the go-to for hard coding tasks. And Codex 5.3 is still a powerful workhorse,
Starting point is 00:24:48 but it finally picked up some of Opus's warmth, speed, and willingness to just do things without asking permission. From this, we can only conclude that both labs are moving steadily towards a sort of über coding model, one that's wicked smart, highly technical, and fast, creative, and pleasant to work with. Why the convergence? Because, Dan writes, a great coding agent turns out to be the basis for a great general-purpose work agent. The behaviors that make AI useful for software development, parallel execution, tool use, planning before asking, knowing when to dig deep versus when to ship,
Starting point is 00:25:17 are the same behaviors that make AI useful for any knowledge work. And that is the holy grail of AI. Now, one of the things that I will be watching closely is not the day-one or even the week-one reactions. Remember, after we got Opus 4.5 and GPT-5.2 Codex, it took more than a month and everyone going home for the holidays for people to really fully appreciate just how much the capability set had shifted. Some are reporting an early continuation of that having to relearn how to do things. Anthropic's Alex Albert writes, The jump in autonomy is real. The biggest shift for me personally has been learning to let it run, give it the context, step away, and come back to something pretty amazing. The way we
Starting point is 00:25:55 work alongside models is starting to completely change. OpenAI is in fact throwing down a gauntlet, trying to encapsulate this shift. OpenAI President Greg Brockman writes, Software development is undergoing a renaissance in front of our eyes. If you haven't used the tools recently, you're likely underestimating what you're missing. Since December, there's been a step-function improvement in what tools like Codex can do. Some great engineers at OpenAI yesterday told me that their job has fundamentally changed since December. Prior to then, they could use Codex for unit tests.
Starting point is 00:26:24 Now it writes essentially all the code and does a great deal of their operations and debugging. Not everyone yet has made that leap, but it's usually because of factors besides the capability of the model. Every company, Greg continues, faces the same opportunity now, and navigating it well requires careful thought. And so the rest of the post then, he says, is about how OpenAI is retooling their teams towards agentic software development. As a first step, he writes, by March 31st, we're aiming that, one, for any technical task,
Starting point is 00:26:51 the tool of first resort for humans is interacting with an agent rather than using an editor or terminal. Two, the default way humans utilize agents is explicitly evaluated as safe, but also productive enough that most workflows do not need additional permissions. Greg Brockman, then, is saying that in less than two months, agent-first is the way that technical teams will build at OpenAI. If that doesn't put a starting gun on how we all should be thinking about changing our own workflows, I don't know what will.
Starting point is 00:27:19 Ultimately, it is a very exciting moment, one which, as we'll see on our next episode, even has people rethinking the whole idea of an AI bubble. For now, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching as always, and until next time, peace.
