The AI Daily Brief: Artificial Intelligence News and Analysis - 3 Major New AI Model Releases: GPT-OSS, Claude Opus 4.1, Genie 3

Starting point is 00:00:00 Today on the AI Daily Brief, three big new model releases from OpenAI, Anthropic, and Google. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Hello, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Blitzy, and Superintelligent. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief. And if you are interested in sponsoring the show, I actually have a new place to go now. For fun this past weekend, I vibe-coded up AIDailyBrief.ai. If you're watching, you can see it's a cool, old kind of terminal-style website.

Starting point is 00:00:42 And so you can now email sponsors at AIDailyBrief.ai. That is sponsors, with an S, at AIDailyBrief.ai. You can also still email me, NLW, at basically anything you can imagine and it'll get to me. Check out the website, send me an email about sponsorship, and I will look forward to sending you more information. The last thing I want to note is that, unsurprisingly, given how much is going on, today's episode is just one big, long main episode about these three big model releases. As makes sense this week, we will be back to our normal format, or we'll just continue to focus on whatever crazy big new thing gets launched. For now, though, we've got an open source model,

Starting point is 00:01:17 a world model, and a new coding champion to talk about, so let's dive in. Welcome back to the AI Daily Brief. It's been clear for a while that this was poised to be a very big week. Insiders suggested that in addition to the much-fabled GPT-5, we were potentially going to get a new version of Claude from Anthropic and even some treats and something interesting from Google, to say nothing of a bunch of other smaller models from the companies just one level back from the foundation model giants. And today, indeed, we got drops from OpenAI, Anthropic, and Google. So we're going to do our best to go through all of this. This is all very fresh out of the gate with people's very first takes.

Starting point is 00:01:54 And of course, we'll be spending a lot more time on this throughout the week. We kick off not with GPT-5 (that's still forthcoming at this point; odds-on bets are that it comes on Thursday), but instead with a release that to some is as or more significant. And that is OpenAI's two new open-weight models. The two models are GPT-OSS-120B and GPT-OSS-20B. And the big blinking banner headline is that these are not just some second-tier models to give developers something nice to play around with. These are actually very, very close to state-of-the-art. Sam Altman tweeted,

Starting point is 00:02:29 GPT-OSS is out. We made an open model that performs at the level of o4-mini and runs on a high-end laptop, WTF, and a smaller one that runs on a phone. Super proud of the team; big triumph of technology. Now, before we get into how OpenAI describes this model, it's worth putting this into context. It has been a very long time since OpenAI was actually in the business of open models. The last one that they released as an open model was all the way back at GPT-2, which, by the way, remains at this point the most downloaded text generation model of all time on Hugging Face, and by a fairly significant amount. And over the years, as OpenAI had increasingly turned away from open source, there were still people who were asking them to

Starting point is 00:03:10 reconsider. Back in December of 2023, AI entrepreneur Varun Mather wrote, open source GPT-4 and do whatever R&D needed to ensure it can run on consumer laptops and desktops. You got into AI to change the world. Now, in January of this year, we got the idea that OpenAI might be shifting their stance on this. In a Reddit conversation, CEO Sam Altman said that he thought that perhaps OpenAI had been on the wrong side of history, to use his phrase, when it came to open source. Now, not coincidentally, this was at the peak of DeepSeek mania, when Chinese open source models in general had not only taken a major jump ahead of Western open source models, but were creeping up right on the heels of the closed models from the biggest foundation model

Starting point is 00:03:49 companies. And so with all that in mind, developers and the AI community at large have been excitedly waiting to see what this open source model release would bring. So let's talk about how OpenAI presented these two models. GPT-OSS-20B they frame as a medium-sized open model that can run on most desktops and laptops, while its bigger sibling, 120B, is a large open model designed to run in data centers and on high-end desktops and laptops. One of the first things people wanted to check is what the license on this would actually be. And OpenAI led with that as one of their four main bullets, calling it a permissive license. They write, These models are supported by the Apache 2.0 license. Build freely without worrying about

Starting point is 00:04:28 copyleft restrictions or patent risk, whether you're experimenting, customizing, or deploying commercially. Simon Willison was enthusiastic, tweeting excitedly that both had been released under a, quote, proper open source Apache 2.0 license. But were these just underperforming models designed for tinkerers and hobbyists? The short answer is absolutely not. They shared comparative benchmarks across four tests, including two competitive math tests, the AIME 2024 and 2025, as well as the MMLU, GPQA Diamond, and Humanity's Last Exam, which were all reasoning and knowledge tests. On the competition math, both 20B and 120B actually beat OpenAI o3, and were just behind o4-mini, which had been specifically designed for math. On the reasoning tests,

Starting point is 00:05:11 both models were behind o3 and o4-mini, but not by a ton. For example, the 120B model scored a 90 on the MMLU as compared to o3's 93.4, and an 80.1 on the GPQA Diamond as compared to o3's 83.3. And there was a slightly bigger gap on Humanity's Last Exam, scoring a 19 as opposed to a 24.9. Still, that 19 was better than o4-mini, which got a 17.7. The point is that these models are distinctly not second tier. Matt Shumer writes, are you kidding me? OpenAI's new open source model is o3 level. This is going to disrupt the market in a big way. In addition to the permissive license and the benchmarks, OpenAI called out three other aspects of the models. Full chain of thought. Giving developers the ability to access the full chain of thought for easier

Starting point is 00:05:56 debugging and higher trust in model outputs. Deeply customizable. OpenAI writes, adjust the reasoning effort to low, medium, or high, plus customize the models to adapt to your use case with full parameter fine-tuning. And the last one, and one that caught my attention, was designed for agentic tasks. Leverage powerful instruction following and tool use within the chain of thought, including web search and Python code execution. Matthew Berman honed in on this point, writing, excellent instruction following, function calling, web search or Python tool use, adjustable reasoning effort, and structured outputs. And it's very clear, as we'll get into in a minute, that, again, far from being some hobbyist project, I think they have some real specific use cases in mind,

Starting point is 00:06:32 and a big part of it is building agents. Sam Altman also wrote a longer post about OSS. He said, GPT-OSS is a big deal. It is a state-of-the-art open weights reasoning model with strong real-world performance comparable to o4-mini, that you can run locally on your own computer or phone with the smaller size. We believe this is the best and most usable open model in the world. We're excited to make this model, the result of billions of dollars of research, available to the world to get AI into the hands of the most people possible. We believe far more good than bad will come from it. For example, GPT-OSS-120B performs about as well as o3 on challenging health issues. We have worked hard to mitigate the most serious safety issues, especially around biosecurity. GPT-OSS models perform

Starting point is 00:07:14 comparably to our frontier models on internal safety benchmarks. We believe in individual empowerment. Although we believe most people will want to use a convenient service like ChatGPT, people should be able to directly control and modify their own AI when they need to. And the privacy benefits are obvious. As part of this, we're quite hopeful that this release will enable new kinds of research and the creation of new kinds of products. We expect a meaningful uptick in the rate of innovation in our field,

Starting point is 00:07:38 and for many more people to do important work than were able to before. OpenAI's mission is to ensure AGI that benefits all of humanity. To that end, we are excited for the world to be building on an open AI stack created in the United States, based on democratic values, available for free to all and for wide benefit. Now, there is a lot to unpack in there. If you listened to my episode about the White House's AI Action Plan, you will have heard me talking about the fact that one of the things that made that document so interesting is that while a lot of foreign policy in the United States right now involves withdrawing from

Starting point is 00:08:07 our traditional role in the world, the AI Action Plan basically made it a prerogative of the U.S. government to use open source AI as a sort of soft power. That sort of theme isn't new for Sam Altman, but with this model and the narrative around its release, they're definitely leaning all the way into that. When it comes to these safety and risk issues, Kai from OpenAI added a few more details. They tweeted, open weights can't go back in the box. Before permanently releasing GPT-OSS models into the ecosystem, we estimated marginal frontier risk by deliberately eliciting bio and cyber capabilities via malicious fine-tuning. While MFT improves performance, GPT-OSS stays below the high threshold in OpenAI's preparedness framework. These findings contributed

Starting point is 00:08:47 towards the decision to release these models for the world to use. Alongside the release, they shared a paper called Estimating Worst-Case Frontier Risks of Open-Weight LLMs. But what about what these models are actually useful for? Why is it important, if you have access to the closed models, that OpenAI has now made these near state-of-the-art models available in this open way? The team at Every spent a little time with the models before their release, and here's what they had to say about them. They write, I can think of a few ways right off the bat where we're going to start experimenting with these models internally. First, Every has a consulting practice with hedge funds and private equity firms that are bound by significant

Starting point is 00:09:21 privacy and security regulations. I imagine this is going to be immediately interesting for these businesses because they can now run the models themselves in the security of their own data centers. Previously, they may have wanted to use open-weight models like Kimi K2 or Qwen, but for firms with strict regulatory or compliance requirements, there's no better way to make your head of IT break out in hives than to try to install Chinese-developed AI in your secure private cloud. OpenAI's OSS series models boast the same open-weight flexibility and security benefits without the geopolitical drama. Every also points out that even beyond big strategic risk, there are a lot of companies that just

Starting point is 00:09:54 don't want to send their information to anyone's cloud. They write, Every has a suite of Mac apps that could use these models to offer AI features with a better security and privacy profile. Our AI file organizer sends your file names to OpenAI's cloud, a no-go for security-conscious users. Similarly, Monologue, our AI dictation app currently in beta, sends your dictation transcript to the cloud for processing. We've experimented with on-device processing for both of these apps, but speed and reliability haven't been good enough to release it. And I think the point here is that in addition to just the principle of the thing, and

Starting point is 00:10:27 giving companies and users the choice of how they want to toggle their security and privacy settings, there are actually very specific use cases in industries where this might matter significantly. Basically, the more regulated an industry is, and the more complex it is, the more likely they are to want custom solutions. We see this all the time at Superintelligent when it comes to how they're thinking about agents. I don't think it's an accident that OpenAI is really honing in on how good these models are at agentic capabilities like tool calling, because I think that they anticipate that a lot of the usage is going to be in the form of privacy- and security-conscious enterprises using these models to build their own custom agents. Harrison Chase and the team at

Starting point is 00:11:04 LangChain already started writing about how to use GPT-OSS for this type of agentic use case. Chase wrote, Deep Agents require good tool-calling capabilities, something that OpenAI's new open source model is pretty good at. And for companies that are doing a lot of that usage, there is another part of this story, which is equally compelling, which is the cost profile. Matt Shumer again writes that the pricing when delivered through Groq is 91% cheaper than o3's pricing. He continues, because of this, you can likely use this model in much more comprehensive chains and agents to eke out more performance than o3 dollar for dollar. Example, running five OpenAI OSS model generations in parallel and then selecting the best

Starting point is 00:11:41 is still way cheaper than just one o3 query. Groq's head of partnerships, Jacob Lowenstein, wrote, Intern accidentally priced OpenAI GPT-OSS at 1/100th of Anthropic. Bad intern, but what's done is done. Now go build. Cerebras writes, OpenAI GPT-OSS-120B is live on Cerebras. 3,000 tokens per second, the fastest OpenAI model on record. The point being that with these models, you have something that is not only available with a privacy and security profile that was not previously possible, but at near state of

Starting point is 00:12:12 the art and for a fraction of the price and at significant speed. I'm sure we'll be spending a lot more time on these open source models and how people are using them in the weeks to come, but this is just one of three announcements that I wanted to speak to in today's show. Steven Heidel, who works on the API at OpenAI, wrote, Also, congrats to my friends at Anthropic on the release of Opus 4.1 today. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive

Starting point is 00:12:40 edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI.

Starting point is 00:13:14 This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with Infinite Code Context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale code bases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously while providing a guide for the final 20% of human development work required to complete the sprint.

Starting point is 00:13:45 Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding co-pilot of choice to bring an AI-native SDLC into their org. Blitzy is providing a limited-time, 30-day free proof of concept for qualifying enterprises. The team will provide a 5x velocity increase on a real development project in your org. Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. That's BLITZY.com. If you are a regular listener, you will have heard about Superintelligent's agent

Starting point is 00:14:21 readiness audits at this point. But I wanted to tell you today about the full suite of agent readiness products that go beyond just the initial readiness report. Over the last six months, Superintelligent has built out an entire agent planning suite. We help you move from discovery to planning to implementation. After you've completed your agent readiness audits, we help you double click on your most important use cases with what we call our use case planning reports. These reports are going to help you understand what sort of technical preparation you need to do to be ready for a use case, what challenges you might face in implementation, and whether you should be thinking about building, buying, partnering, or some combination. After that, you can even get a spec document in what we call

Starting point is 00:15:00 our technical blueprint that gives either your developers or the developers of the partner you work with what they need to build exactly the agent that you're looking for. If you want to learn more about Superintelligent's agent planning suite, we've built a custom GPT to answer your questions. Just go to bit.ly slash super agent. That's bit.ly slash super agent, all one word. And if you have any questions, the agent can even help you book an appointment with our team. Just before the OpenAI announcement, Anthropic tweeted, today we're releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning.

Starting point is 00:15:37 And the big story here is software engineering. On the SWE-bench Verified test, Opus 4.1 scored a 74.5% as compared to Opus 4's 72.5% and Sonnet 3.7's 62.3%. The quote that everyone has been grabbing from their announcement post is this one. Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4. Menlo's Deedy Das writes, Anthropic just dropped Opus 4.1, the best coding model in the world.

Starting point is 00:16:10 State of the art, 74.5% on SWE-bench, surpassing both Gemini 2.5 Pro and OpenAI o3. This is the product right now to use. Chubby points out, however, that maybe the thing to really focus on is the comment that follows the announcement post: we plan to release substantially larger improvements to our models in the coming weeks. That's echoed, by the way, in their announcement blog post. Now, for some, it feels like that was the main point. New Form AI's Alec Velikhanoff said, is this it? Opus 4.1 feels like a rushed release to get ahead of GPT-5. Look at how it struggles with making a UI that Horizon, assumed GPT-5, nailed in a single shot. Now, this has been out

Starting point is 00:16:48 for less than two hours at the time of recording, so we haven't really gotten a chance to see a ton of people playing around with it yet. But there's certainly no doubt that the timing is not a coincidence, and that part of this is the competitive pressure to always be at the state of the art. We haven't gotten GPT-5 yet, but we have heard tons of rumors that it is leagues better at coding than any previous OpenAI model, and people have had tons of success with the assumed tester models like the codenamed Horizon, which could theoretically threaten Anthropic's crown at the top of the coding heap. Much-respected Simon Willison says, My favorite thing about this model is the version number.

Starting point is 00:17:20 Treating this as a 0.1 version increment looks like it's an accurate depiction of the model's capabilities. Basically, he's saying that, yes, it is a modest improvement, but an improvement nonetheless and one that is appropriately pitched. Now, the other new model that I'm really excited about that was announced today is Google's Genie 3. This they're calling the most advanced world simulator ever, and there's really some amazing stuff going on here. So the basic idea of Genie 3 is that rather than prompting a video, with a few words,

Starting point is 00:17:51 you're prompting an actual dynamic world that you can interact with. They write, given a text prompt, Genie 3 can generate dynamic worlds that you can navigate in real time at 24 frames per second, retaining consistency for a few minutes at a resolution of 720p. Let's actually just watch part of the video to get a sense of how they explain things. Each one of these is an interactive environment generated by Genie 3, a new frontier for world models. With Genie 3, you can use natural language to generate a variety of worlds and explore them interactively,

Starting point is 00:18:22 all with a single text prompt. Let's see what it's like to spend some time in a world. Genie 3 has real-time interactivity, meaning that the environment reacts to your movements and actions. You're not walking through a pre-built simulation. Everything you see here is being generated live as you explore it. And Genie 3 has world memory. That's why environments like this one stay consistent. World memory even carries over into your actions.

Starting point is 00:18:59 For example, when I'm painting on this wall, my actions persist. I can look away and generate other parts of the world. But when I look back, the actions I took are still there. And Genie 3 enables promptable events, so you can add new events into your world on the fly. Something like another person, or transportation, or even something totally unexpected. You can use Genie to explore real-world physics and movement and all kinds of unique environments. So the real-world physics are a really important part of this. We talk about AGI like something that is a straight-line path from where we are to there.

Starting point is 00:19:44 And certainly as more models become more capable, that feels accurate. But the reality is, there are different beliefs about the different paths that it's going to take to get us there. And if you've ever listened to our coverage of world models before, you'll know that this is a different path that some people are convinced is a better path towards AGI than the sort of text-based LLMs that have become common at the core of our experience with generative AI. Google writes that world models are AI systems that can use their understanding of the world to simulate aspects of it, enabling agents to predict both how an environment will evolve and how their actions will affect it. World models, they continue, are a key

Starting point is 00:20:18 stepping stone on the path to AGI, since they make it possible to train AI agents in an unlimited curriculum of rich simulation environments. So part of what Genie has that's different is that it can model physical properties of the world. It can retain environmental consistency, like we saw with that painting example, and it can be prompted to actually change the generated world as well. Now, environmental consistency was one of the things that people latched onto as a really big update here. Andrew Curran writes, they solved environmental consistency with Genie 3, and this was an emergent capability. You can see the trees remain the same even after being out of line of sight. Visual memory extends back one minute now. Google is on a steady path to a real world simulator.

Starting point is 00:20:58 Chubby writes, The breakthrough, minutes of consistency at 720p using only text descriptions, no pre-built 3D models required. They continue, the technical breakthrough lies in autoregressive generation. Each new frame must take into account the entire trajectory up to that point. At 24 frames per second, this means 24 complex calculations per second that must access minutes of context.
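To make the scale of that claim concrete, here is a minimal Python sketch of an autoregressive rollout. It is only an illustration under stated assumptions, not Genie 3's actual architecture: `rollout`, `toy_step`, and `frames_in_context` are hypothetical names, and the "frame" is just an integer position standing in for an image. The only figures taken from the announcement are 24 frames per second and roughly a minute of visual memory.

```python
from collections import deque

FPS = 24                      # frame rate quoted in the announcement
FRAME_BUDGET_MS = 1000 / FPS  # wall-clock time available per frame


def frames_in_context(minutes, fps=FPS):
    """How many past frames fit in `minutes` of memory at `fps`."""
    return int(minutes * 60 * fps)


def rollout(first_frame, actions, step, context_frames):
    """Yield one frame per action, each conditioned on the history so far.

    `step` is a hypothetical stand-in for a learned model: it maps
    (history, action) to the next frame. History is capped at
    `context_frames`, so the oldest frames eventually drop out.
    """
    history = deque([first_frame], maxlen=context_frames)
    for action in actions:
        frame = step(tuple(history), action)
        history.append(frame)
        yield frame


def toy_step(history, action):
    # Toy "world": the next position is the last position plus the action.
    return history[-1] + action


context = frames_in_context(1)  # one minute of memory
frames = list(rollout(0, [1, 1, -1, 2], toy_step, context))
print(context, round(FRAME_BUDGET_MS, 1), frames)
```

Under these assumptions, one minute of memory is 1,440 frames, and each new frame has to be produced in under about 42 milliseconds while still looking back across that history.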

Starting point is 00:21:19 Google DeepMind researcher Jack Parker-Holder writes, Genie 3 feels like a watershed moment for world models. We can now generate multi-minute, real-time interactive simulations of any imaginable world. This could be the key missing piece for embodied AGI. Theoretically Media writes, Google's Genie 3 shows the beginning of the endgame for Veo 3, full world simulation.

Starting point is 00:21:39 Just look at this and tell me you don't see the start of the holodeck. And that word holodeck comes up a lot. There are a few different things that I'm seeing people excited about when it comes to the announcement. One of them is robotics, or, to use the term that we just heard, embodied AI. Robert Scoble writes, having the ability to create a real-looking world that a user can move around in and interact with

Starting point is 00:21:57 will speed up robotics. What Google just showed off this morning is a simulator. It simulates the real world. Perfect for training robots of the future in. Robots need to learn to navigate the real world. Stairs, for instance. Now developers will be able to develop thin stairs, thick stairs, spiral stairs, muddy stairs, icy stairs, wet stairs, stairs with kids, stairs with dogs, etc. And run virtual robots up and down those stairs billions of times to train new robot models. Rowan Cheung also points out the implications for training. He writes, We're entering the era of infinite AI training environments, where world models like Genie 3 enable AI agents and robots to learn from their own experiences

Starting point is 00:22:31 in simulated real-world environments. This is the convergence of world simulation, AI training, and creative expression all happening in real time. Still, as you might imagine, a lot of the implications that people are jumping to are those for games. Bilawal Sidhu writes, Genie 3 just achieved what AAA game engines do but without any 3D models. Wild how this model figured out complex effects like exposure shifts, volumetric god rays, and phenomena we need to code explicitly in 3D engines. If you squint, he writes, you can see where this goes.

Starting point is 00:23:00 VR and AR, half robots, robotics, jungle gyms for robots, Sims made from real world data to train robots to then operate in the real world, real-time virtual production, motion tracked iPhone and you, quote-unquote, directing actor agents. Jim Fan from Nvidia writes, this is Game Engine 2.0. Someday, all of the complexity of Unreal Engine 5 will be absorbed by a data-driven blob of attention weights. Those weights take as input game controller commands and directly animate a spacetime chunk of pixels. Andrew Curran joked, given the current rate of world model advancement, the end result of Bethesda taking 15 years to release the next Elder Scrolls is that fans may

Starting point is 00:23:33 end up making it first, to which Elon Musk responded, for sure. Common Sense Machines CEO Tejas Kulkarni had early access and wrote, I spent the whole day playing with the system and when it works, it's truly mind-blowing. It's the first neural game engine and world model I've tried that generalizes so well and has long-term world consistency. On where it shines, he writes things like that it's truly general purpose and has a quick startup time. Works exceptionally well for gaming environments, but also generalizes to other industrial and real-world scenarios. It learns physics. He wrote, although there are systematic failures even for rigid body physics, it was clear to me that it can learn game engine and non-rigid physics without an underlying engine. He also said it

Starting point is 00:24:11 works exceptionally well for stylized environments, with characters walking around, and that it's way more fun than video models, which to him indicates that there are, quote, high retention consumer experiences waiting to be built with this. And he also listed a few other areas where it works really well, like photorealistic walkthroughs and drone shots, as well as global illumination and lighting. But what are the problems? He said physics is still hard and there are obviously failure cases when I try the classic intuitive physics experiments from psychology. Social and multi-agent interactions are tricky to handle. For example, 1v1 combat games do not work. Long instruction following and simple combinatorial game logic fail, i.e. collect some

Starting point is 00:24:45 points, keys, etc., go to the door, unlock, and so on. And so basically all in total, he writes, it's far from being a real game engine and has a long way to go, but he concludes this is a clear glimpse into the future. Indeed, he says it's impressive enough for me to have strong conviction that this is going to disrupt the gaming industry. It's super early days and there are a lot of failures, but the writing is on the wall. Lots of challenging scientific engineering and scaling problems to be solved, but it's going to happen in the next five years. I'm going to be spending some time over the next couple weeks, cataloging non-gaming use cases for this, and I'm sure I'll be back to share those with you at some point when it makes sense.

Starting point is 00:25:17 It is just Tuesday. We haven't gotten to GPT-5. I also haven't been able to share with you guys yet Lindy 3.0 or ElevenLabs' new music model. It's like we decided to accelerate all at once, all in one week. So I will say, as always, thank you for listening or watching, and stick around because this week is going to be a big one. That's going to do it for today's AI Daily Brief. Until next time, peace.
