The AI Daily Brief: Artificial Intelligence News and Analysis - Why Opus 4.5 Changes Vibe Coding

Episode Date: November 26, 2025

Today's episode digs into why Anthropic’s surprise launch of Claude Opus 4.5 is landing like a true step-function moment for coding, agentic workflows, and the emerging paradigm of vibe-based software creation, with new benchmarks, early user tests, and developer reactions all pointing to a shift in how real work gets done; plus a quick look at the latest headlines including the White House’s Genesis Mission and Amazon’s massive new government-focused AI expansion.

Brought to you by:
KPMG - Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Rovo - Unleash the potential of your team with AI-powered Search, Chat and Agents - https://rovo.com/
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
LandfallIP - AI to Navigate the Patent Process - https://landfallip.com/
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils - Cloud-native AI solutions that power results - https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? sponsors@aidailybrief.ai

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, the incredible string of model releases continues with Anthropic dropping Claude Opus 4.5. Before that in the headlines, the White House launches the AI Genesis Mission. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors: Superintelligent, Robots & Pencils, Blitzy, and Rovo. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts.
Starting point is 00:00:37 And if you are interested in sponsoring the show, we're doing a bunch of wrapping up Q1 right now. Send us a note at sponsors@aidailybrief.ai and I can give you all of the info. And with that, let's dive in. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Yesterday you heard about how one AI executive order
Starting point is 00:00:57 from the White House had been squashed. Basically, there was a big dust up with congressional Republicans around the White House's plan to create a task force to go after states who put AI regulations on the books, but as it turns out, that was not the only executive order they have planned. President Trump has now officially signed an executive order to launch a national AI science program known as the Genesis Mission. The text of the order argues that the race for global technology dominance in the development of AI requires a historic national effort comparable in urgency and ambition to the Manhattan Project. This order launches
Starting point is 00:01:30 the Genesis Mission as a dedicated, coordinated national effort to unleash a new age of AI accelerated innovation and discovery that can solve the most challenging problems of the century. Michael Kratsios, the director of the White House Office of Science and Technology Policy, continued that tone during the Monday announcement. He described the Genesis Mission as the largest marshalling of federal scientific resources since the Apollo program. Now, stripping away the superlatives, the Genesis Mission is, at its core, an initiative to collate scientific knowledge from across the government to enable new AI-driven discoveries.
Starting point is 00:02:00 Datasets will be gathered from the National Science Foundation, the National Institute of Standards and Technology, and the National Institutes of Health. The datasets, some of which stretch all the way back to the 1940s, will be cleaned and transformed into machine-readable formats to make them accessible to AI models. The order lays out a two-fold goal: to train scientific foundation models, and to create AI agents to test new hypotheses, automate research workflows, and accelerate scientific breakthroughs. To that end, the Department of Energy and their network of 17 national labs will make their data and compute resources available to research institutions and private sector companies. The order instructs the DOE to, quote, create a closed-loop AI experimentation platform
Starting point is 00:02:39 that integrates our nation's world-class supercomputers and unique data assets to generate scientific foundation models and power robotic laboratories. Essentially, this is a major effort to organize the scientific data that's scattered across government agencies and marshal resources in order to drive AI-accelerated scientific discovery. Kratsios again said, since the 1990s, America's scientific edge has faced growing challenges. He cited declining numbers of drug approvals and research outputs despite soaring scientific budgets. The Genesis Mission seeks to reverse that trend by,
Starting point is 00:03:08 in his words, unifying agencies' scientific efforts and integrating AI as a scientific tool to revolutionize the way science and research are conducted. Datasets and compute infrastructure will be centralized into the American Science and Security Platform to be established by the DOE, who said that once complete the platform will be, quote, the world's most complex and powerful scientific instrument ever built. It will draw upon the expertise of roughly 40,000 DOE scientists, engineers, and technical staff, alongside private sector innovators to ensure that the United States leads and builds the technologies that will define the future. The DOE is also tasked with formulating a list of 20 science and technology challenges of national
Starting point is 00:03:42 importance to form the initial focus of the Genesis Mission. This potentially includes domains like advanced manufacturing, biotechnology, critical materials, nuclear fission and fusion energy, quantum information science, and semiconductors. The initiative builds on the existing National Artificial Intelligence Research Resource, or NAIRR, which was established in 2020 and brought together federal agencies, including the Department of Defense, NASA, and the National Institutes of Health, with private companies like OpenAI, Google, and Palantir to form a nationwide research community. Lynne Parker, who co-chaired NAIRR during the Biden admin, said, government support for AI research builds the foundations for new breakthroughs and helps keep
Starting point is 00:04:16 innovation aligned with the public interest. We take for granted that new products appear regularly, but seldom consider the decades of research that made them possible. Without long-term investment, we risk ceding leadership in the technologies that will define our economy, our security, and our daily lives. Now, speaking of the connection between public and private, Amazon announced on Monday that they will spend up to $50 billion to expand their AI and supercomputing facilities for U.S. government customers. The expansion will begin next year and is expected to add a total of 1.3 gigawatts of AI capacity to the AWS regions that service government demand. The expansion will increase capacity for both unclassified and top secret AWS servers. Said AWS CEO Matt Garman
Starting point is 00:04:54 in a press release, our investment in purpose-built government AI and cloud infrastructure will fundamentally transform how federal agencies leverage supercomputing. We're giving agencies expanded access to advanced AI capabilities that will enable them to accelerate critical missions, from cybersecurity to drug discovery. This investment removes the technology barriers that have held government back and further positions America to lead in the AI era. Staying on the chip theme, Meta appears to be preparing to use Google's TPUs in their own data centers. The Information reports that Google has begun pitching large cloud customers, including Meta and large financial institutions, on installing TPUs at their own facilities.
Starting point is 00:05:30 Google has made their custom AI chips available through Google Cloud for years, but they've yet to sell TPUs directly to outside customers. Part of the pitch is that customers can operate the chips under higher security and compliance standards than are possible with cloud use. According to sources speaking with The Information, Meta is in talks to order billions of dollars worth of TPUs to install in their data centers in 2027. If you've been listening over the last week,
Starting point is 00:05:52 what's clear is that while Google has been making TPUs for over a decade, the release of Gemini 3 put the chips firmly on people's radar. The new model was trained exclusively on TPUs, leading many to question whether Google's chips could be a viable alternative to NVIDIA's GPUs. The news seems to have moved the stock market, with Bloomberg reporting a 2.7% bump for Google and a 2.7% drop for Nvidia in overnight markets.
Starting point is 00:06:15 Bloomberg analysts wrote, Meta's likely use of Google's TPUs, which are already used by Anthropic, shows third-party providers of large language models are likely to leverage Google as a secondary supplier of accelerator chips for inferencing in the near term. Now, while Google is clearly ramping up to compete, the analysis is still probably getting a little bit ahead of itself. That said, the new report contained a few more crumbs of information on how Google is looking to address the market for AI chips. One of Nvidia's biggest moats is the CUDA developer ecosystem. As part of The Information report, they write that Google has developed
Starting point is 00:06:45 a new software suite called TPU Command Center that's designed to make TPU compatibility easier to navigate. Ultimately, while it could take Google a number of years to carve out a meaningful share of the AI chip market, Nvidia is already taking the threat seriously. According to The Information, Nvidia is following the deal-making closely and has enticed Anthropic and OpenAI to make large commitments to Nvidia GPUs. They also wrote that it's possible that Nvidia will seek to preempt a deal between Google and Meta. Futurum Equities' chief market strategist Shay Boloor writes, I know the first instinct is to frame Meta exploring Google TPUs as the start of Nvidia's pricing power erosion, but that's not what it is. The real story is the velocity of Meta's AI workload curve,
Starting point is 00:07:23 as Llama training cycles, video understanding systems, and tens of billions of daily inference calls all smash into the same compute ceiling. Meta is already on pace to spend $100 billion on Nvidia hardware, and they're still capacity constrained. Adding TPUs doesn't replace the spend, it just sits on top of it. Even if Nvidia doubled output, Meta would still be short on compute. That's how steep the structural AI capacity shortage actually is. Lastly today, in an interview at the Emerson Collective's Demo Day, which is the venture and philanthropy fund of Steve Jobs' widow Laurene Powell Jobs, Sam Altman and Jony Ive said that they've nailed the design of their AI device. In possibly the strangest ever description of a consumer
Starting point is 00:07:59 device, Altman said, there was an earlier prototype that we were quite excited about, but I did not have any feeling of, I want to pick up that thing and take a bite out of it. And then finally, we got there all of a sudden. Altman said this was Ive's test for knowing when a design is dialed in, when you want to lick it or take a bite out of it or something like that. The pair stayed silent on features, but Altman was excited to describe the vibes of the product. He compared the experience of modern devices to walking through Times Square, flashing lights, noises, and the dopamine drip, constantly just dealing with all the little indignities. By comparison, he wants using the OpenAI device to feel more like,
Starting point is 00:08:34 sitting in the most beautiful cabin by a lake, and in the mountains, and just sort of enjoying the peace and calm. Ive added his vibe, commenting, I love solutions that teeter on appearing almost naive in their simplicity, and I also love incredibly intelligent, sophisticated products that you want to touch, and you feel no intimidation that you want to use almost carelessly. Altman commented, I hope that when people see it, they say, that's it. The interview added no information on what the device will actually do, but for Altman, the key feature continues to be total contextual awareness. He said, it is so simple, but then AI can do so much for you that so much can fall away. And the degree to which Jony has chipped away
Starting point is 00:09:07 at every little thing that this doesn't need to do or doesn't need to be in there is remarkable. If you feel more rather than less confused, don't worry about it. Substantively, the biggest news was a timeline, with Ive stating the device could be available within two years. But with that, we close today's headlines. Next up, the main episode. Today's episode is brought to you by Superintelligent. Now, for those of you who don't know, who are new here, maybe, Superintelligent is actually my company. We started it because every single company we talk to, all the enterprises out there, are trying to figure out what AI can do for them, but most of the advice is super generic, not specific to your company. So what we do is we map your AI and agent opportunities by
Starting point is 00:09:52 deploying voice agents to interview your teams about how work works now and how your people would like it to work in the future. The result is an AI action map with high potential ROI use cases and specific change management needs, basically everything you need to go actually deliver AI value. Go to besuper.ai to learn more. AI isn't a one-off project. It's a partnership that has to evolve as the technology does. Robots & Pencils works side by side with clients to bring practical AI into every phase: automation, personalization, decision support, and optimization. They prove what works through applied experimentation and build systems that amplify human potential. As an AWS-certified partner
Starting point is 00:10:33 with global delivery centers, Robots & Pencils combines reach with high-touch service. Where others hand off, they stay engaged, because partnership isn't a project plan. It's a commitment. As AI advances, so will their solutions. That's long-term value. Progress starts with the right partner. Start with Robots & Pencils at robotsandpencils.com slash AI Daily Brief. This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with infinite code context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale code bases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform,
Starting point is 00:11:11 bringing in their development requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously, while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding copilot of choice to bring an AI-native SDLC into their org. Visit blitzy.com and press get a demo to learn how Blitzy transforms your SDLC from AI assist to AI native.
Starting point is 00:11:43 Meet Rovo, your AI-powered teammate. Rovo unleashes the potential of your team with AI-powered search, chat, and agents, or build your own agent with Studio. Rovo is powered by your organization's knowledge and lives on Atlassian's trusted and secure platform, so it's always working in the context of your work. Connect Rovo to your favorite SaaS app so no knowledge gets left behind. Rovo runs on the Teamwork Graph, Atlassian's intelligence layer that unifies data across all of your apps and delivers personalized AI insights from day one. Rovo is already built into
Starting point is 00:12:16 Jira, Confluence, and Jira Service Management standard, premium, and enterprise subscriptions. Know the feeling when AI turns from tool to teammate. If you Rovo, you know. Discover Rovo, your new AI teammate powered by Atlassian. Get started at rovo.com. Welcome back to the AI Daily Brief. The Thanksgiving 2025 parade of models has continued into a new week, this time with the launch of Claude Opus 4.5 from Anthropic. Now, people have been assuming for some time that we were going to get an Opus 4.5. We've obviously had Sonnet 4.5 for a while now, and so people figured that this was in the offing, but there had been a lot less conversation leading up to this around when it was going to come.
Starting point is 00:13:02 The big model, of course, that people have been anticipating is Gemini 3, and in many ways this was a wildly understated announcement. And yet, the response has been, in a word, significant. While they may not have hype posted, Anthropic minces no words in their launch post. Our newest model, Claude Opus 4.5 is available today. It's intelligent, efficient, and the best model in the world for coding, agents, and computer use. It's also meaningfully better at everyday tasks like deep research and working with slides and spreadsheets. Opus 4.5 is a step forward in what AI systems can do and a preview of larger changes to how work gets done. So let's talk first about the benchmarks. And it is no accident that the one they choose to put right at the top is
Starting point is 00:13:46 SWE-bench Verified. Now, you might remember that in our discussions about Gemini 3, the only major benchmark that they didn't win or at least match was this one. While Sonnet 4.5 was at 77.2%, Gemini 3 Pro was at 76.2%, not like it was super far behind, but still not technically state of the art. GPT-5.1 was also a little tiny bit ahead of Gemini 3 Pro at 76.3%, and OpenAI extended that lead to 77.9% when they released GPT-5.1-Codex-Max in the days following Gemini 3. For a very short time, GPT-5.1-Codex-Max was at the top of the SWE-bench Verified chart, but Opus 4.5, at least by the benchmarks, blows it out of the water. 80.9%.
Starting point is 00:14:29 writes Morgan, a 3% lead has never looked so large. And it wasn't just SWE-bench Verified. On the Terminal-Bench 2.0 agentic terminal coding benchmark, Opus 4.5 was meaningfully ahead of all the others as well. On agentic tool use, scaled tool use, and computer use, Opus 4.5 sets a new standard. Now, there were some tests where Opus 4.5 meaningfully lagged behind Gemini 3, such as Humanity's Last Exam, where it was significantly behind both without search and with search. And yet, what everyone was talking about, of course, was the coding results.
Starting point is 00:15:04 If you are a regular listener of this show, you will know that the ascendancy of Anthropic this year, and the speed with which they are catching up to OpenAI, has much to do with them being the preferred AI coding model for developers. That started with 3.5, and has basically continued unchallenged, although after the release of GPT-5, there have at least been credible competitors. Anthropic seems very clearly to agree with swyx on the relative importance of coding as compared to all other use cases. A couple times I've referenced Shawn's post about what made him decide to go work with Cognition, where he basically took coding as the high-value short timeline activity. The line which I've shared a couple of times, code AGI will be achieved in 20% of the time
Starting point is 00:15:44 of full AGI and capture 80% of the value of AGI. Whether or not that's true, Anthropic has certainly behaved as such. Now, outside just the standard SWE-bench, there were a couple of other things that people noticed. Igor Kotenkov points out that while there are ways to overfit towards the SWE-bench Verified benchmark, the more recent SWE-bench Pro is a lot more difficult and connected to the real world, and Opus blows previous models out of the water. Opus gets a 55. where Sonnet 4.5 got 43.6, and GPT-5 got just 36%. On ARC-AGI, Opus 4.5 set a new standard ahead of GPT-5.1 and Gemini 3, and on ARC-AGI-2, they got 37.64% at $2.40 a task. Already just hours after the release, the people who had early access were also independently verifying some of these results. Bindu Reddy
Starting point is 00:16:32 writes, Opus 4.5 tops LiveBench AI and is the world's best agentic model. We can confirm this after testing this over the past few days. Now, interestingly, one of the things that we've seen a lot from labs recently is the people inside the labs really talking up the specifics about what they like about the models. We got a spate of that from Anthropic team members, such as Jake Eaton, who writes, Opus 4.5 is very good at a lot of things, and you should read the benchmarks, the model card, etc. But my favorite thing about working with it these past two weeks is that in conversation, it is somehow more fine-grained.
Starting point is 00:17:02 It has a depth and texture that for me was immediately noticeable. It also feels interestingly much more self-contained. Sasha de Marigny says, the internal response to Opus 4.5 has been a mix of excitement, awe, and surprise, particularly around how good it is at coding. Thariq writes, Opus 4.5 is special, a world record in SWE-bench and OSWorld benchmarks, the best model we've ever had at vision.
Starting point is 00:17:23 On Claude Code, I've completely stopped writing code in the IDE. I think there's so much to discover about Opus 4.5. And indeed, some of the most interesting responses from Anthropic's members come from their engineering team. Sholto Douglas writes, I am so excited about this model. First off, the most important eval. Everyone at Anthropic has been posting stories of crazy bugs that Opus found
Starting point is 00:17:43 or incredible PRs that it nearly soloed. A couple of our best engineers are hitting the interventions-only phase of coding. Adam Wolff writes, this new model is something else. Since Sonnet 4.5, I've been tracking how long I can get the agent to work autonomously. With Opus 4.5, this is starting to routinely stretch to 20 or 30 minutes. When I come back, the task is often done, simply and idiomatically. They talked about how Claude Opus compared on a notoriously difficult candidate exam. In their announcement post, they wrote, we give prospective performance engineering
Starting point is 00:18:14 candidates a notoriously difficult take-home exam. We also test new models on this exam as an internal benchmark. Within our prescribed two-hour time limit, Claude Opus 4.5 scored higher than any human candidate ever. They continue, the take-home test is designed to assess technical ability and judgment under time pressure. It doesn't test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over years, but this result, where an AI model outperforms strong candidates on important technical skills, raises questions about how AI will change engineering as a profession. Now, they also talked to staff members to estimate the impact of using Opus 4.5 in Claude Code. 50%, 9 of the 18 they surveyed, reported a productivity improvement of at least
Starting point is 00:18:57 100%. The mean self-estimated productivity improvement was 220%. They also popped open the hood a little bit on how they're making Claude even better when it comes to agentics. In short, they have a huge emphasis on tools. Indeed, they write, the future of AI agents is one where models work seamlessly across hundreds or thousands of tools, an IDE assistant that integrates Git operations, file manipulation, package managers, testing frameworks, and deployment pipelines, an operations coordinator that connects Slack, GitHub, Google Drive, Jira, company databases, and dozens of MCP servers simultaneously. To build effective agents, they need to work with unlimited tool libraries without stuffing every definition into context up front. Agents also need to be able to call tools from code.
Starting point is 00:19:39 Agents also need to learn correct tool usage from examples. Following that, they share that they were releasing three features to make all of that possible. A tool search tool, which allows Claude to use search tools to access thousands of tools without consuming its context window, programmatic tool calling, which allows Claude to invoke tools in a code execution environment, reducing the impact on the model's context window, and tool use examples, which provide a universal standard for demonstrating how to effectively use a given tool. So again, all of this is telling a very consistent story, which is that Claude is for coding and pushing the frontier of what agents can do.
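As a rough illustration of the tool-search idea described above, here is a minimal sketch in Python. It is not Anthropic's API: the `Tool` and `ToolRegistry` names, the keyword-overlap ranking, and the three sample tools are all hypothetical, chosen only to show how a large catalog can stay outside the model's context while a search step surfaces the few definitions relevant to the current task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool definition: a name plus a short description."""
    name: str
    description: str


class ToolRegistry:
    """Hypothetical catalog that keeps tool definitions out of the prompt."""

    def __init__(self, tools):
        self._tools = list(tools)

    def search(self, query, limit=3):
        """Return up to `limit` tools ranked by naive keyword overlap."""
        words = set(query.lower().split())
        scored = []
        for tool in self._tools:
            text = f"{tool.name} {tool.description}".lower()
            tokens = set(text.replace("_", " ").split())
            score = len(words & tokens)
            if score > 0:
                scored.append((score, tool))
        # Stable sort: tools with equal scores keep their registry order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [tool for _, tool in scored[:limit]]


# Sample catalog (illustrative only); a real one could hold thousands of entries.
registry = ToolRegistry([
    Tool("git_commit", "create a git commit"),
    Tool("slack_post", "post a message to a slack channel"),
    Tool("jira_create", "create a jira ticket"),
])

matches = registry.search("post slack message")
print([tool.name for tool in matches])
```

Only the definitions returned by `search` would be handed to the model, which is the context-saving point the announcement makes; a production system would presumably use a much stronger retrieval method than keyword overlap.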
Starting point is 00:20:13 So outside of interacting with the benchmarks, what were people's first impressions? Some were excited and appreciated that there was less hype around this. Nico Christie writes, have to respect Anthropic's commitment to not vague posting all weekend. This is the most exciting model release since Sonnet 3.5. Leo at synthwaived writes, Be Anthropic, pretend Gemini 3 does not exist. Know you're ready to cook it for code anyways. Wait, zero hype posting.
Starting point is 00:20:37 Drop new Opus, state-of-the-art for code, state-of-the-art in ARC-AGI, better than expected, costs less than old Opus. Be more like Anthropic. On the flip side, Ethan Mollick basically asked why they were burying the lede. I'm not sure why Anthropic keeps doing very low-key launches for fairly major releases and materially important improvements to their services. I kind of think it has to do with the assessment and the specificity of their audience in and among developers. Basically, it's a group of people that they think is going to respond more
Starting point is 00:21:03 to having their peers and colleagues tell them about an update rather than getting maximum social distribution because of being loud and hypey. But what about people's early tests? Victor Taelin writes, to my surprise, Opus 4.5 one-shotted my hardest calculus problem, tying with Gemini 3. In terms of first hour impressions, couldn't be more promising, I guess. Ethan Mollick writes, I had early access to Opus 4.5, and it's a very impressive model that seems to be right at the frontier. Big gains in ability to do practical work, like make a PowerPoint from an Excel. Nico again writes, Opus 4.5 is a step function improvement for spreadsheet work. Extremely hard became doable, doable tasks became easy, and easy tasks are now
Starting point is 00:21:41 solved. And yet, if there were a few examples of people trying non-coding things, coding is very much where the main excitement lies. Guillermo Rauch, the CEO of Vercel, writes, Opus is on a different level. It's unreasonably good at Next.js and the best model we've tried on v0 to date. Menlo Ventures' Deedy Das writes, Anthropic just dropped the best coding model, Opus 4.5. The coolest thing he points out is it does better at SWE-bench Verified without thinking than with 64K reasoning tokens, in other words, a super token efficient model. Matt Shumer, who didn't have early access, said first test of Claude Opus 4.5 and I'm already impressed. I asked it for a Colab competitor UI, and it quickly pulled together this screen. Definitely better than my similar
Starting point is 00:22:22 test with GPT-5.1 and, shockingly, Gemini 3. More testing to go, but this is a good start. He followed it up. Okay, wow, I'm kind of blown away. In one shot, Opus 4.5 made the UI actually functional, with Python running in the browser. Some, like Superdario, pointed out that this may not even be the best model that Anthropic has behind the scenes. They write, good time to remind everyone, Anthropic has a long-standing policy of not significantly pushing the frontier to prevent an arms race. Dario can hit SWE-bench scores at will. Now, whether or not that's true, the fact that there is a lot of chatter like that, I think, is a good reflection of the sentiment in the community. Maybe the most vocally excited about this is Dan Shipper and the team at Every.
Starting point is 00:23:02 He writes, Breaking News. Anthropic just dropped Claude Opus 4.5. It is by far the best coding model I've ever used. And here's how Dan describes it. It extends the horizon of what you can vibe code. Explaining, he writes, the current generation of new models, Anthropic's Sonnet 4.5, Google's Gemini 3, or OpenAI's Codex Max 5.1,
Starting point is 00:23:22 can all competently build a minimum viable product in one shot, or fix a highly technical bug autonomously. But eventually, if you keep pushing them to vibe code more, they'd start to trip over their own feet. The code would be convoluted and contradictory, and you'd get stuck in endless bugs. We have not found that limit yet with Opus 4.5. It seems to be able to vibe code forever.
Starting point is 00:23:43 Two more observations. Opus 4.5, he says, takes working in parallel to a whole new level. Because it's far better at planning and coding, it can work with more autonomy, meaning you can do more in parallel without breaking anything. One of his teammates worked on 11 different projects in six hours and had good results on all of them. Lastly, he points out it's great at design iteration. Opus 4.5, Dan writes, is incredibly skilled at iterating through a design autonomously using an MCP like Playwright.
Starting point is 00:24:07 Previous models would lose the thread after a few cycles or say a design was done when it wasn't. Opus 4.5 is incredible at autonomously iterating until a design is pixel perfect. Indeed, Dan's team at Every were equally vocal in their love of this model. Kieran Klaassen writes, 2023 was GPT-4, 2024 was Sonnet 3.5. 2025 is Opus 4.5. This is the coding model launch I've been waiting for. First time I genuinely believe I can vibe code an entire app end-to-end without touching the implementation details. We haven't found the limit yet. Previous models would eventually trip over their own feet. Convoluted code, contradictory logic, endless bugs.
Starting point is 00:24:44 Opus 4.5 just keeps going. If you write code with AI, you need to try this. And I think that this idea is the thing to watch for to see whether Kieran and Dan's first impressions here and some of the impressions of the Anthropic team really play out. That this is, as Kieran puts it, the first time we can vibe code an entire app end-to-end without touching the implementation details.
Starting point is 00:25:06 It strikes me that if that is the case, that could be the most massive implication of this model. Adam Wolff from Anthropic again wrote, I believe this new model in Claude Code is a glimpse of the future we're hurtling towards, maybe as soon as the first half of next year. Software engineering is done. Soon, we won't bother to check generated code
Starting point is 00:25:24 for the same reasons we don't check compiler output. I love programming, and it's a little scary to think it might not be a big part of my job, but coding was always the easy part. The hard part is requirements, goals, feedback, figuring out what to build and whether it's working. There's still so much left to do, and plenty the models aren't close to yet.
Starting point is 00:25:40 Architecture, systems design, understanding users, coordinating across teams. It's going to continue being fun and very interesting for the foreseeable future. But still, it's not hard to see that that's a fairly big pronouncement. Now, moving back to the realm of the non-speculative, the other thing that captured people's attention about this is that Opus 4.5 is significantly cheaper than Opus 4.1,
Starting point is 00:26:00 the cost dropped from $15 to $5 per million input tokens and from $75 to $25 per million output tokens. Indeed, Jeremy from Anthropic points out, one fact people won't realize immediately about Opus 4.5, it's remarkably token efficient. All in, it's often cheaper than Sonnet 4.5 and other models for cost per task success. Simon Willison points out why we probably need to be looking not just at cost per input and output token but also at token efficiency, when he writes,
Starting point is 00:26:29 this is notable. Opus 4.5 is around 60% more expensive than Sonnet, $25 per million output compared to $15 per million output, but if it can use 76% fewer output reasoning tokens for the same complex task, it may end up cheaper. Now, that 76% came from Alex Albert of Claude Relations, who said that on SWE-bench Verified at medium effort, Opus 4.5 beats Sonnet 4.5 while using 76% fewer output tokens. Look, it's early days, but the first impressions are big. Dan Shipper again sums up, every six to 12 months a model drops that truly shifts the paradigm.
Starting point is 00:27:03 Opus 4.5 launched today, and that's what it is. Best coding model I've ever used and it's not close. We're never going back. Brian Atwood points out, I said a month or two ago that Anthropic is a vertical AI company and this is what I meant. They rightly identified that coding is the number one use case for LLMs right now and are overwhelmingly focused on it.
Starting point is 00:27:21 Meanwhile, others are throwing darts in every conceivable direction, spreading themselves thin. Interestingly, just a couple days ago, Sam Altman posted, It has been amazing to watch the progress of the Codex team. They are beasts. The product and model is already so good and will get much better. I believe they will create the best and most important product in the space and enable so much downstream work. It has been pretty clear for some time now that OpenAI has come around to a similar
Starting point is 00:27:43 view of the importance of coding and are very much not content to cede that ground. Summing up, Ethan Mollick writes, the main lesson of the past few weeks is that the big four US labs all seem to have figured out a path forward in continuing the exponential pace of LLM improvement, at least in the near future. More simply put, Andrew Curran writes, AI winter is canceled. Try again next year, Grinch Squad. There will, I'm sure, be lots more to discuss around Opus 4.5 as people get deeper into it. But for now, like I said, the Thanksgiving model explosion continues unabated. That's going to do it for today's episode. Appreciate you listening, as always.
Starting point is 00:28:17 Until next time, peace.
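
Editor's note on the pricing comparison quoted around 00:26: the "may end up cheaper" argument can be checked with a few lines of arithmetic. The sketch below is only illustrative. It uses the two output prices quoted in the episode ($25 and $15 per million output tokens) and the 76% token reduction, which was reported for SWE-bench Verified at medium effort and may not hold for other tasks. The 100,000-token task size and all function and variable names are hypothetical, and input-token costs are ignored. On these list prices the premium works out to roughly 67%, a little above the "around 60%" in the quote.

```python
# Output prices in USD per million tokens, as quoted in the episode.
OPUS_45_OUTPUT_PRICE = 25.0
SONNET_45_OUTPUT_PRICE = 15.0

# Reported reduction in output tokens (SWE-bench Verified, medium effort).
REPORTED_TOKEN_REDUCTION = 0.76


def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of the output tokens alone; input tokens are ignored here."""
    return output_tokens / 1_000_000 * price_per_million_usd


def break_even_reduction(cheaper_price: float, pricier_price: float) -> float:
    """Fraction of output tokens the pricier model must save to cost the same."""
    return 1 - cheaper_price / pricier_price


# Relative price premium of Opus 4.5 over Sonnet 4.5 on output tokens.
price_premium = OPUS_45_OUTPUT_PRICE / SONNET_45_OUTPUT_PRICE - 1

# Token savings needed before Opus 4.5 becomes the cheaper choice.
break_even = break_even_reduction(SONNET_45_OUTPUT_PRICE, OPUS_45_OUTPUT_PRICE)

# Hypothetical task: assume Sonnet 4.5 needs 100,000 output tokens.
sonnet_tokens = 100_000
opus_tokens = round(sonnet_tokens * (1 - REPORTED_TOKEN_REDUCTION))

sonnet_cost = output_cost_usd(sonnet_tokens, SONNET_45_OUTPUT_PRICE)
opus_cost = output_cost_usd(opus_tokens, OPUS_45_OUTPUT_PRICE)

print(f"Price premium on output tokens: {price_premium:.0%}")
print(f"Break-even token reduction:     {break_even:.0%}")
print(f"Sonnet 4.5 output cost:         ${sonnet_cost:.2f}")
print(f"Opus 4.5 output cost:           ${opus_cost:.2f}")
```

Under these assumptions, any reduction in output tokens above 40% makes Opus 4.5 the cheaper of the two on output cost, so a 76% reduction clears the bar comfortably. A real comparison would also need input-token costs and per-task success rates, which is the "cost per task success" framing Jeremy mentions.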
