The AI Daily Brief: Artificial Intelligence News and Analysis - Is GPT-OSS Actually Any Good?

Episode Date: August 7, 2025

A day after OpenAI's surprise open source release, we dig into how the model is performing in the wild. Early reactions are mixed: while some praise its speed and efficiency, others describe strange behavior, safety-maxed responses, and limited general knowledge. Is it optimized for coding and STEM? We also cover Eleven Labs’ entry into AI music, Lindy’s new agent-building tools, and Google’s powerful Genie 3 world model.

Brought to you by:
KPMG - Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months.
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants - https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, I look at the day one reactions to the major model releases that happened yesterday. Before that in the headlines, you guessed it, more model releases. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in today. First of all, thank you to today's sponsors, Blitzy, Vanta, and Plumb. To get an ad-free version of the show, go to patreon.com.
Starting point is 00:00:27 Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. It is quite blurry right now between the headlines and the main, as everything this week is just all new model releases. Yesterday, as OpenAI was getting in the open source game again for the first time since like 2019 and Google was resetting our expectations of world simulation models, Eleven Labs, who some of you know as the company that I have frequently used to do a version of my own voice for reading articles on this show, launched an AI music service. Now, there are a bunch of things that are interesting about this. First of all, this is Eleven Labs' first expansion beyond generative speech, where they are at this
Starting point is 00:01:07 point one of, if not the default market leader in voice cloning, text to speech, and translation. This new music release called Eleven Music is a feature-complete music generation suite with instrumentals and lyrics, meaning it competes directly with services like Suno and Udio. You can create an entire song based on a prompt. The example you just heard is from the prompt, powerful choir, orchestral, large crescendo, ethereal, spacious reverb, expansive, cinematic, classical music. Or this one, roller, liquid drum and bass, fast technical percussion, thick bass, influences from IDM, soft female vocals, heavily processed.
Starting point is 00:02:03 Eleven Labs says the model is capable of generating a tune within minutes, and yet as impressive as it is, and the quality at first glance is incredibly impressive to me, the biggest selling point over others in this space is that the Eleven Music model is claimed to be legal for broad commercial use. Both Suno and Udio are currently facing lawsuits from the major record labels for using their recordings in training data. Now, it's very unclear how music generated using those models is going to be treated under copyright law. Those are big battles to be fought. But Eleven Labs is just trying to short circuit that entirely, abstaining from including music from the major labels in their training sets. Instead, they signed agreements with independent rights
Starting point is 00:02:49 management firms Kobalt Music and Merlin Network, with Kobalt saying that their artists had been given the choice to opt into their music being used in the training data or not. Ed Newton-Rex, who is the CEO of Fairly Trained, an AI copyright advocacy nonprofit, posted positively, saying, co-founder of Eleven Labs confirms that their new AI music model is trained only on songs they've licensed. This is really good to see. When a handful of AI companies try to tell you generative AI can only be built with scraped copyrighted work, remember that the majority of AI music models license their training data, including now Eleven Labs' model. Certainly first impressions just from a quality standpoint are very good. Chubby writes, Eleven Labs music was not on my bingo
Starting point is 00:03:24 card. Holy moly does this sound good. We're now entering the era of real AI music. We are witnessing history. Now, even though Eleven Labs says this is cleared for commercial use, when you look into the terms, there are still some questions in here. There are restrictions around, for example, whether you can distribute these things to music streaming services and exactly where you can use them, but it's pretty clear that where this has a lot of disruptive potential is around commercial uses of music like game development, advertising, startup launch videos. Basically anything that would have had you digging around before for the perfect sound clip or for the perfect licensed song, you can now just generate on the fly. Creator Tyler Ganges says,
Starting point is 00:04:03 it's so simple, so easy, and will save me so much time. Teodor Mitu connected the dots between Genie 3, which we'll talk about later in the main episode, and said between this, the GPT onslaught, and the new Eleven Labs music tool, the entire legacy creative industry is cooked. For my part, I'm going to be super interested to see, and will be watching very closely, how much Eleven Music actually gets used for this commercial use case, given that theoretically that's been possible with things like Suno before, but hasn't really happened. Maybe the way that they train this model will open up that use case, but we will have to wait and see. Now, another model launch, which actually happened all the way back in the olden times of Monday, was Lindy 3.0. Lindy 3.0, they're pitching
Starting point is 00:04:42 as their biggest step ever towards their vision of the AI employee. Now, if you haven't tried Lindy, they offer an AI agent platform that aims to make the agent building process simpler. Their last big release was Agent Swarms, which allowed use cases like sales management agents to carry out dozens of tasks in parallel. Lindy 3 has three big new features that bring the platform up to speed and push the state-of-the-art forward. CEO Flo Crivello wrote, Our vision has always been the AI employee. As capable as humans, can do anything on a computer, and as easy to use, just ask. 3.0, he says, takes three giant steps in this direction: Agent Builder, Autopilot, and team collaboration. Agent Builder is exactly what it sounds like and is, I think, very much
Starting point is 00:05:21 the future of how agent models are going to work. Flo writes, just type what you want and it builds it for you in minutes. Now, this is a huge UX improvement for non-technical people. If you've ever fiddled with n8n or Lindy and immediately closed it because you were confused about nodes and about these charts and drawing lines from one function to another, this is a lot closer to what you probably are looking for, which is more akin to vibe coding for agents. A vibe coding for agents platform would be one where you didn't need to know the details of agent architecture. You just had to describe what you wanted to automate. Now, what is a little bit interesting about the UX for Lindy's Agent Builder is that it still exposes the chain of steps, even though it's creating them for you,
Starting point is 00:06:01 which gives you more granular, fine-grained control to modify it to the extent that the Agent Builder gets it wrong. That could end up being the right combination, especially for power users who want a little bit more control. Autopilot is Lindy's version of computer use. Flo again writes, Autopilot takes us closer than ever to can do anything. Lindy agents can now work with their own computers in the cloud. He also continued, when we say anything, we mean it. We found out after building Autopilot that now Lindy could also build fully functional websites, deploy them, and even QA them using its browser. We accidentally built Lovable. In dogfooding the product, Lindy found that it was useful for the repetitive work that could be automated in the background. Flo told TBPN,
Starting point is 00:06:39 one of the most insane things it's automated for us is replacing the QA engineer. Every hour a Lindy agent wakes up, tests the entire core flow, and if anything goes wrong, it pings the on-call engineer. Now lastly, team collaboration does what you'd expect, allowing your team members to natively share and iterate on their agents. I think what's most interesting to me in some ways, aside from just the utility of this, is the move to the vibe coding for agents kind of UX. We've seen some of this already with Emergence recently releasing their agents creating agents platform, but it's hard for me to imagine that this doesn't just become totally standard in that regard very soon. Now lastly, one more little model announcement before we get to our main episode.
Starting point is 00:07:14 Alongside Genie 3, Google also announced Storybook. Their Gemini app account writes, It's Storytime reimagined. Now you can create personalized illustrated storybooks about anything, complete with read aloud narration. And this is kind of one of those things where the capabilities aren't new, but it's just an interface designed to specifically get at a particular use case. Now, one of the things that's very notable to me is that when parents experiment with AI for the first time, a huge default use case is around creating stories for their kids, illustrating their kids' visions with image generation tools, basically bringing the magical things that happen in kids' brains to life with technology.
Starting point is 00:07:52 It's the convergence of the magic of childhood with the magic of technology. And that definitely seems to be what was going on with the storybook. Joel, who's a PM at DeepMind wrote, As a new father, I've been thinking about how to communicate with my son in a way that truly resonates with him, and in a way that many of our own parents used to communicate with us, reading. With storybook, you can now describe any story you can imagine, and Gemini generates a unique 10-page book with custom art and audio.
Starting point is 00:08:15 We're excited to help people find creative ways to break communication gaps when you don't quite have the words for it. Now, of course, in a week of crazy open source releases and world simulation models, maybe this seems a little bit small. But I wouldn't be surprised if for a lot of you parents out there, the coolest new thing that got released in the short term might just be Google's Gemini storybook. With that, though, we are going to close out the headlines.
Starting point is 00:08:37 Next up, the main episode. This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with Infinite Code Context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale code bases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy
Starting point is 00:09:15 as their pre-IDE development tool, pairing it with their coding co-pilot of choice to bring an AI-native SDLC into their org. Blitzy is providing a limited time, 30-day free proof of concept for qualifying enterprises. The team will provide a 5x velocity increase on a real development project in your org. Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. That's B-L-I-T-Z-Y dot com. As a founder, you're moving fast towards product market fit, your next round, or your first
Starting point is 00:09:46 big enterprise deal. But with AI accelerating how quickly startups build and ship, security expectations are higher earlier than ever. Getting security and compliance right can unlock growth or stall it if you wait too long. With deep integrations and automated workflows built for fast-moving teams, Vanta gets you audit-ready fast and keeps you secure with continuous monitoring as your models, infra, and customers evolve. Fast-growing customers like LangChain, Writer, and Cursor trusted Vanta to build a scalable foundation from the start. And look, as someone who lives in the world of enterprise procurement,
Starting point is 00:10:18 I love how Vanta makes it easy to get compliance right. The last thing you need when you're trying to win that big deal is to have it scuttled by something that Vanta has solved for over 10,000 companies. Go to vanta.com slash NLW to save $1,000 today through the Vanta for Startups Program and join over 10,000 ambitious companies already scaling with Vanta. That's V-A-N-T-A dot com slash NLW to save $1,000 for a limited time. Today's episode is brought to you by Plumb. Are you building with AI? Plumb noticed that every technical creator tends to hit the same wall. You've got AI workflows people want, but monetizing them feels impossible because client work doesn't scale. Selling copies gives away your IP. And building your own platform, that's becoming a software company.
Starting point is 00:11:03 It's a hard gap to bridge, and that's why they built Plumb. Plumb helps creators build an audience of paid subscribers for their AI workflows, all on a single platform. Think Substack for automations. There's no need to build extra infrastructure just to get paid for your expertise. Plumb handles that so creators can do what they do best, solving problems with AI. Ready to turn your expertise into passive income? Visit useplumb.com, that's Plumb with a B. Welcome back to the AI Daily Brief.
Starting point is 00:11:32 Yesterday, there were many breathless posts about the incredible progression and performance of OpenAI's new open source model. Twitter was absolutely flooded with posts like this one from Dharmesh, the founder of HubSpot, screaming excitedly about how they were running this model on their MacBook Pros. And of course, as you heard yesterday, there was a ton of excitement about the reported benchmarks. But inevitably, what happens is in the first 24 hours or so after a model gets released, people actually dig in. They start putting it through its paces, running their own benchmarks, and generally trying to get a vibe off the thing to understand how good it is in practice. Sometimes that leads to a model that outperforms its benchmarks.
Starting point is 00:12:14 Other times, in fact, I would say probably more often, the shine gets a little bit worn off. Now, in the context of this model, where it's open weights and there's more information to play with, there's even more hacking to be done. So the question we're exploring on today's show is how people are responding and what vibe they're getting from it now that it's out in the wild. Initially, a lot of the responses I saw were somewhat similar to this one from Victor Taelin. They wrote, my initial impression of OpenAI's OSS model is aligned with what they advertised. It does feel closer to o3 than to other open models, except it is much faster and cheaper. It's definitely smarter than Kimi K2, R1, and Qwen 3. I tested all models for a bit and got very decisive results in favor of OpenAI OSS 120B. The independent benchmarkers didn't find
Starting point is 00:12:56 quite as high performance as OpenAI reported, but it was still pretty good. Artificial Analysis writes, independent benchmarks of OpenAI's GPT-OSS models: GPT-OSS-120B is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3-235B in intelligence, but offers efficiency benefits. You can see here that across the Artificial Analysis Intelligence Index, the 120B model scores 58, as opposed to R1's 59 and Qwen3's 64. Now, o3 is up at 67, so there's a fairly big jump from 120B to o3. This question of how much efficiency matters is something we'll come back to, however. And yet, if some of the initial impressions were positive, it didn't take long before you started
Starting point is 00:13:40 seeing posts like this one. Emad Mostaque, formerly of Stability AI, writes, is GPT-OSS entirely mid-trained or something? It's good for its size in MacBook, but kind of feels fried and odd at times. Mark Kretschman responded with a theme that we'll see over and over again: very jagged model, feels uneven. Celeste writes, it feels like since synthetic data. It's been safety maxed. Dwayne writes, I was excited for it until I really started to use it, and it's weird. There's something off about it, almost like they trained it on safe responses from o3, or maybe an early checkpoint of GPT-5. Evie writes, yeah, it has strange vibes, reminds of Microsoft crappy models like Phi-4. OpenCode's Dax wrote, everyone legit I know is having a not good time with GPT-OSS
Starting point is 00:14:22 so far. It's useful because now when I see popular accounts say it's so good, wow, I know they are BSing. Now, when people tried to dig in with Dax, it sounded like part of the issue was specifically with tool calling, which obviously really matters, because as we talked about yesterday, to the extent that this is going to be useful as an alternative for agent builders who want a higher level of security, privacy, and controllability, tool calling functionality is going to be extremely important. And pretty soon this became the popular narrative. Nikil Chandak wrote, what it is, state of the art at its size. What it is not, o4-mini as many people have been claiming. Sam Paech writes, the GPT-OSS models have disappointing results on EQ-Bench and creative writing.
Starting point is 00:15:01 It may be a function of the low active parameters, although the high performance in other evals suggests the priorities were elsewhere. And indeed, this is one of the biggest themes that's going around AI Twitter right now: that this model seems to have been really focused on some particular use cases and not on others. Björn Plüster wrote a long thread about how GPT-OSS-120B is, quote, very blatantly incapable of producing linguistically correct German text. He writes, I see this as an exceptional release, highlighting OpenAI's willingness to contribute to the open model space and showing how strong they actually
Starting point is 00:15:31 are on model training, but it is also very clearly a model not up to their usual standards with regard to multilinguality or output quality. Ambilicus writes, I hate to be the bringer of bad news, but this new OpenAI model is brutally locked down. Danny Aziz from Every writes simply, OSS-120B is not o3 level in my opinion. Kyle Corbitt writes, GPT-OSS may have been trained primarily on synthetic data, similar to Microsoft's Phi models. As a result, it's extremely spiky, great at the tasks trained on, really bad at everything else. This was almost certainly to avoid copyright lawsuits, sadly. Lisan al-Gaib writes, GPT-OSS models seem to be slot-maxed on math, coding, and reasoning. They're great at that, but they completely lack taste and common sense. At least that's my vibe so
Starting point is 00:16:11 far. Phil 111 on Hugging Face left a comment titled, this model is unbelievably ignorant. He said this model has about an order of magnitude less broad knowledge than comparably sized models like Gemma 3 27B and Mistral Small 24B. This model, including its larger brethren, are absurdly ignorant of wildly popular information across most popular domains of knowledge for their respective sizes. What's really confusing is all of OpenAI's proprietary models, including their tiny mini versions, have vastly more general and popular knowledge than these open models. So they deliberately strip the corpus of broad knowledge to create OS models that can only possibly function in a handful of select domains, mainly coding, math, and STEM, that over 95%
Starting point is 00:16:48 of the general population doesn't give a rat's ass about, conveniently making it unusable to the general population and in so doing protecting their paid ChatGPT service from competition. Now, let's say for a moment that there were specific decisions that went into this that made it optimized for those fields. Phil in the comments here is suggesting that that's them trying to protect the integrity of their business, which, as an aside, would be a reasonable thing to do anyways, but I'm not exactly sure that the analysis is correct even if the interpretation of what's going on under the surface is.
Starting point is 00:17:16 As we discussed yesterday, the most likely users of this are people who have perceived data security and privacy needs that are significant enough that they want to use a less convenient, not fully state-of-the-art open-source model over something that has any sort of controller interaction from a third party. And if that is the case, those folks almost certainly aren't using this for writing poetry or for getting access to general domain knowledge. They're using it, in other words, for coding, math, and STEM. Now, we have no confirmation yet from OpenAI that it is optimized for those fields, but even if it were, it would kind of make sense to me just on that use case level alone. But what about this question of really how it compares
Starting point is 00:17:54 to the Chinese models? Just a day before this was released, a16z's Martin Casado wrote, It's just remarkable how many U.S. startups are being built on Chinese OSS models. I'd say the majority that are building custom models via post-training. The U.S. needs to step up, make it a national priority and back with a huge investment. Which made it all the more interesting that the White House's Michael Kratsios popped in yesterday, retweeting Deedy Das from Menlo, who was talking about these three big model releases, and saying people ask what we mean by unquestioned and unchallenged global technological dominance. Simon Willison made this comparison directly in his tests. He wrote, I've been writing a lot about the flurry of excellent open weight models
Starting point is 00:18:32 released by Chinese AI labs over the past few months. All of them very capable and most of them under Apache 2 or MIT licenses. Just last week, I said, something that has become more undeniable this month is that the best available open weight models now come from the Chinese AI labs. I continue to have a lot of love for Mistral, Gemma, and Llama, but my feeling is that Qwen, Moonshot, and Z.ai have positively smoked them over the course of July. I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models. Simon continued, with the release of the GPT-OSS models, that statement no longer holds true. I'm waiting for the dust to settle and the independent benchmarks that are more
Starting point is 00:19:09 credible than my ridiculous tests to roll out, but I think it's likely that OpenAI now offer the best available open weights model. Ram Jod didn't find that, though. He writes, I tested OpenAI's brand-new open source model against Kimi K2 and Qwen3 Coder. It seems to perform worse than the Chinese models in one-shot tasks. Somewhat disappointing to see, but I think over the coming days, people will find good uses for it within coding. Dax again writes, for coding they're not anywhere close to the Chinese models in the last few months. We'll give it time for dust to settle, but simple evaluations aren't enough. Lisan al-Gaib writes, there is no Western open-source model that beats or ties the best Chinese open-source models. And adding even more questions to this is Ethan Mollick, who retweeted Simon
Starting point is 00:19:49 sharing the Chinese competition section of his post, with Ethan writing, the US now likely has the leading open weights model or close to it, but the real question is whether this is a one-off situation from OpenAI, in which case the lead will evaporate quickly as others catch up, unclear what their incentives are to keep updating. And this, of course, is a challenge. Let's say that the model isn't quite as good. Even this week, this is not the main announcement from OpenAI. This is just the amuse-bouche for GPT-5, which we think is coming tomorrow. So if we really want to compete when it comes to open weights models, it's not clear that OpenAI alone is going to do that. Now, for a little bit more of an optimistic take, which is really just realism as optimism,
Starting point is 00:20:30 Nathan Lambert wrote, Seems like while this launch had the vibes right and OpenAI can jazz up a crowd, they're still going to go through a lot of the pains of why open models are hard. Just so many weird little failures I'm seeing people mention. May take a few weeks to get the best out of GPT-OSS. Matrix Memories responded, pain and beauty of open source. You can't run away from the edge cases or control for them. So of course, the optimistic take there is that part of what makes these open models valuable
Starting point is 00:20:56 is what the community can do with them. And so we may not want to judge it based on these first blush impressions. Nathan actually wrote a very long post on his Interconnects.ai blog, where he argued that OpenAI validates the open ecosystem. And that, quote, open models from the US labs were in such a dire spot that we need any step back in the right direction. However, when the question becomes, is OpenAI the new open champion? His answer is not quite. Nathan writes, it's a phenomenal step for the open ecosystem, especially for the West and its allies, that the most known brand in the AI space has returned to openly releasing models. This is momentum and could be the start of the turning point of adoption and impact of open models relative to
Starting point is 00:21:33 China. But he writes, there's a lot of uncertainty in the incentives for open models. Some of the best China analysts I know share how China is sensing that releasing open models is a successful strategy for them and they are doubling down. OpenAI's release is a step in the right direction, but it's still a precarious position. Many people are making noise about creating open models, from the AI Action Plan to venture capitalists and academics. What all of these parties have in common is that it is not their number one goal. Now, Nathan does have a new initiative that does have that as its number one goal, called the ATOM Project, which we'll likely get into later in the week, but this sort of validates the point that Ethan Mollick was making
Starting point is 00:22:07 as well. There is also, lastly, this question of speed, efficiency, and cost. One of the less talked about aspects of this thing so far is that these models are extraordinarily fast and extraordinarily cheap. Most of the research has suggested that at this stage, everyone's just using the best model whatever it costs, but that won't necessarily be the case forever, and it won't necessarily be the case as we get more complex workloads that just consume a boatload of tokens. So all in all, I would say that the shine is slightly diminished from where it was yesterday, but that still most people are very excited to have OpenAI back in the open game, and are going to continue to dig in and see how much can be wrenched out of these models before they're abandoned in any sort of
Starting point is 00:22:45 Hugging Face model junk heap. As you know, however, GPT-OSS was not the only model launch yesterday. There was also Genie 3, the new Google world simulation model, and the reviews for that one could not be more rave. I posted a poll on Twitter asking which launch is a bigger deal, and the sample size wasn't huge, just 75 people voted, but it was almost a dead heat, with 50.7% saying GPT-OSS and 49.3% saying Genie 3. Given how much hype OpenAI usually has around their model launches, and given the fact that Genie 3 wasn't even in Google's main line of Gemini models, I think that result is hugely telling and certainly is validated by the type of conversation that we're seeing across AI Twitter about this new model. A lot of the conversation is just
Starting point is 00:23:30 people excitedly sharing the most impressive versions of this. One clip that you'll see a lot, and by the way, this is definitely an episode that you're going to want to watch, same with yesterday's. Matt McGill from DeepMind tweets, One nice thing you can do with an interactive world model, look down and see your footwear and if the model understands what puddles are. Obviously with him posting it, Genie 3 does this well. Justine Moore from a16z writes, blown away by the outputs from Genie 3. This is a huge moment for world models. We now have real-time, playable simulations that you can generate from a text prompt. Boris Minardis writes,
Starting point is 00:24:03 generating cool-looking worlds is one thing, but the physics simulation? Just wow, absolutely amazing. Adana Singh writes, this is the most magical thing I've seen since LLMs. Ali Aslami, who admittedly is working at Google, says Genie 3 is the most impressive AI demo I've seen since ChatGPT. And if you think that's just bias, Stephen Heidel from OpenAI said this is absolutely incredible, a real see-the-AGI moment. Congrats to the Google DeepMind team. Machine Learning Street Talk called this the most mind-blowing technology they've seen since starting their podcast. And Professor Derya Unutmaz says, I strongly believe Genie 3 is the AGI moment for AI video. This is a mind-blowing advance.
Starting point is 00:24:41 I didn't anticipate that such interactive playable AI-generated environments would be possible this year. AI is advancing faster than even we super-optimists predict. There were also a number of comments like this one from Christoph, Thulean futurist, who writes, Google DeepMind is the best AI lab on planet Earth right now. The AI for Success account doubled down on that, saying Google DeepMind is destined to win the AGI race. Now, notably Elon Musk disagrees, responding to Ashutosh saying,
Starting point is 00:25:07 This race will keep going for a long time. Others joked about how much meming there had been recently about these sorts of 8-bit voxelized dark fantasy worlds, which have been all over TikTok and Twitter for the last couple of weeks. Dreaming Tulpa writes, All the disbelievers last week who said AI will never be able to achieve the OMW style and that it's just a five-second slop video, and now DeepMind dropped a world model on a regular Tuesday that just does it.
Starting point is 00:25:31 Spiraticalia writes, You're telling me we've been pining over that viral AI-generated fantasy pixel video game for a few weeks, and already Google is just like, cool, here you go. Now, one comment that I thought was really interesting came from sincere Mickey. Genie 3 is honestly mind-blowing, and what I love most about it is that it's not copying human work. What the DeepMind team has created is something brand new that could not have existed without AI technology. And I think that's a really salient point. So much of the work stage that we're in with LLMs still is just getting them to do stuff we already do,
Starting point is 00:26:01 but better, faster, and cheaper. Now, that's starting to change, especially as we get into the agent swarm era. Right? One of the things that we're figuring out is that in a lot of cases, spinning up five agents to compete, doing all the same work, and then figuring out who did it best, is a better strategy than just having one tool do the work. That starts to bridge us away from how we've done things in the past, because it's kind of where a shift in scale and the availability of intelligence actually is so significant that it becomes a shift in kind. What Mickey is pointing out, however, with Genie 3, is that this is not proximate to something that we had before, other than our real lived experience in the real world. Now, obviously, when it comes to these
Starting point is 00:26:39 world simulation models, they're still very nascent in terms of their use cases. Even as impressive as these updates are, they're not yet in a state where, for example, they have the memory to actually create entire game environments. What people are excited about here is the trajectory and the new possibilities that are being opened up in their imaginations. Lastly, today, I did want to come back to the Opus 4.1 question. It was very clearly Anthropic making sure they had an announcement in this crazy week of announcements, which I think is a completely reasonable strategy just from a press and communication standpoint. But as I shared on the first day, there were some who felt like it was maybe a little rushed just to get ahead. So far, I've seen comments on both sides.
Starting point is 00:27:16 I've seen some who said that they're not sure that Opus 4.1 is all that much better than Opus 4. Others say 4.1 has a much better sense of design than other models; one shared a single HTML page that it made up for some design firm. Mostly people are still hung up on the pricing and token availability of Claude. Gosu Kota writes, Opus 4.1 is so expensive, I'm so curious who can afford to run this as their daily driver. And when Cursor posted, Claude Opus 4.1 is available in Cursor. Let us know what you think. A huge number of the responses are some snarky thing like this one from August Landmesser. Enjoy your one request per month, guys. I think when it comes to coding models, since this was an extend-your-lead release, not a recapture-your-lead release, it's going
Starting point is 00:27:57 to go a little under the radar until it can be compared to GPT-5, which of course we anticipate to be just around the corner. Anyways, guys, those are the first reactions, the sort of 24-hour reactions to these models. Let me know what you think in the comments. Appreciate you guys listening or watching as always. And until next time, peace.
