The AI Daily Brief: Artificial Intelligence News and Analysis - The Open Source AI Model Beating GPT-5 on Agents

Starting point is 00:00:00 Today on the AI Daily Brief, meet the open source model that is outperforming GPT-5 and basically everyone else when it comes to agentic performance. Before that in the headlines, maybe vibe coding isn't dead after all. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Superintelligent, Robots and Pencils, Blitzy, and KPMG. To get an ad-free version of the show, go to patreon.com, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, and especially if you are hoping to get any Q1

Starting point is 00:00:42 placements, now is a really good time. Things are filling up fast, and I'm trying to map everything out. So if you are interested or thinking about sponsoring the show and you just want to learn about the opportunities we have, send us a note at sponsors@aidailybrief.ai. Like I said, if you are hoping to get Q1 placement, now is a good time to reach out. Lastly, as I mentioned yesterday, we are now up over a thousand use cases contributed to the AI ROI benchmarking study. I am so appreciative of all of your help so far, and if you want your use cases

Starting point is 00:01:09 included, as well as to get access to the full readout of all of this incredible AI ROI information, go to ROISurvey.com. It'll be live for about another week and a half. With that, let's dive in. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Apparently, rumors of vibe coding's demise have been greatly exaggerated. Speaking with TechCrunch on Monday, Lovable CEO Anton Osika said that the company is closing in on 8 million users, dramatic growth from their 2.3 million active users back in July. Osika claimed the company is now seeing 100,000 new products built on Lovable every single day. We didn't get a new revenue number, but Lovable crossed the $100 million ARR milestone back in June, and there are currently

Starting point is 00:01:52 rumors of new funding being raised at a $5 billion valuation, which would almost triple their valuation from fundraising over the summer. Now, part of the interview addressed a report from Barclays in September, which showed that traffic to Lovable had dropped by 40% since a peak in August. Osika said that retention was still strong, with more than 100% net dollar retention, meaning the average user spends more over time. Now, of the major vibe coding startups, Lovable might be the one that's most focused on empowering non-coders. The platform not only enables easy prototyping, but is increasingly being used to deploy full products. If you've ever been on aidailybrief.ai, for example, that is

Starting point is 00:02:26 built, maintained, and hosted all with help from Lovable. Now, when it comes to where the company is focused, it follows from that same specialization. Osika said, the part of the engineering organization that we're moving the quickest on hiring is security engineers. He said that the goal is to make building with Lovable more secure than building with just human-written code. Now, in terms of the battle for the vibe coding space and increased competition from OpenAI and Anthropic, Osika said that he thinks it's not winner take all. He said, if we can unlock more human creativity and human agency, and just driving the change so that anyone can create if they have good

Starting point is 00:02:56 ideas, that should be celebrated regardless of whoever does that. Next up, Meta has returned to open source with a new speech recognition model. Called Omnilingual ASR, the model's big selling point is support for a huge range of underserved languages. Out of the box, the model can recognize over 1,600 languages. In contrast, OpenAI's open source Whisper model supports 99 languages. Developers can also extend this support with a feature called zero-shot in-context learning. The model can learn new languages at inference time using just a few paired

Starting point is 00:03:26 examples of speech and text, with no retraining required. Meta said the feature can allow the model to support as many as 5,400 languages, which is pretty close to every language in use globally. Functionally, then, Meta is claiming to have created something like an AI Rosetta Stone for universal speech recognition. Reported benchmarks are also very strong, with the model more than quadrupling the performance of OpenAI's Whisper Large model. Meta claims a character error rate of less than 10% for 95% of high and medium resource languages, as well as 36% of low resource languages with less than 10 hours of audio in their datasets. Now, while the model itself is very cool,

Starting point is 00:03:59 the reason that most people are taking notice is that the release suggests that Meta might not be completely done with open source models. When Mark Zuckerberg started spending billions of dollars to build out the superintelligence team, there was a suspicion that the days of leading open source models coming out of Meta were numbered. Does this suggest that those concerns were overblown?

Starting point is 00:04:16 Only time will tell, but it's certainly a positive sign. Next up, some interesting comments from a DeepSeek researcher who has warned that AI could replace most jobs within a decade. Senior researcher Chen Deli made a rare public appearance at the World Internet Conference in China late last week alongside executives from five other AI and robotics companies. He warned that over the next 10 to 20 years, quote, societal structures will also be greatly challenged. Tech companies should play the role of guardians of humanity, at the very least protecting human

Starting point is 00:04:43 safety, then helping to reshape societal order. Chen said that we're currently in the honeymoon phase where AI cannot work independently to complete economically useful tasks, and people can harness AI to boost their own productivity. However, he predicted that the next five to 10 years will see a rapid transition that leads to massive job cuts. Chen suggested, quote, during this period, tech companies should serve as whistleblowers warning society of potential risks. Now, this view certainly isn't rare in the West. What makes it interesting is to see it emerge from one of the leading Chinese companies. AI optimism among the U.S. population is among the lowest in the world at 39%. But in contrast,

Starting point is 00:05:18 Chinese sentiment is among the highest at 83%. The AI transformation has become a core part of the Chinese government's economic and social strategy. In that context, the comments from Chen seem extremely non-consensus and frankly, even potentially a little risky. Moving over to markets, CoreWeave more than doubled their revenue last quarter, but delays in data center construction have lowered revenue forecasts. The AI data center operator reported earnings on Monday, with revenue doubling year-over-year to come in at $1.36 billion, outperforming analyst estimates. CoreWeave also trimmed their loss to 22 cents per share, coming in way under the 57 cents per share projected by analysts and an 85% reduction compared to a

Starting point is 00:05:56 year ago. Still, the big story from CoreWeave's earnings was a delay to a major project that's limiting forward revenue. CEO Michael Intrator disclosed that a third-party developer is causing temporary delays. Fourth quarter earnings will be impacted, but the client agreed to an adjusted timeline, so CoreWeave will maintain the full value of the contract. Intrator said, everybody is frustrated, the data center provider is frustrated, we're frustrated, the client is frustrated, people who are waiting on the next iteration of AI are frustrated. Now, the mystery client could be OpenAI or Meta, who each have over $10 billion in contracts with CoreWeave.

Starting point is 00:06:27 CoreWeave lowered its full-year revenue forecast to $5.05 billion from $5.15 billion due to the delays. Now, one really positive signal, however, from that call: it seems that installed GPUs are holding their value for longer than expected. CoreWeave has been criticized in the past for assuming a six-year depreciation schedule on Nvidia H100s, which is longer than the more common four- or five-year schedule. During earnings, however, CoreWeave announced that their first H100 contract was reaching expiry and was re-signed within 5% of the original price. In other words, at the moment at least, it looks like the scarcity of compute is trumping

Starting point is 00:07:00 all other factors in the current market. Now, checking in on AI stock themes overall, it does seem like many of the jitters last week were perhaps broader macro factors and not AI alone. As we came into the week, with a deal to end the government shutdown on the horizon, there was a major Wall Street rebound with AI stocks leading the way. The S&P 500 was up 1.3%, winning back around 75% of its drop from last week. The NASDAQ regained around two-thirds of last week's loss, and Nvidia led the way with a 4.8% rally. Now, I certainly do not think that this means that all of the concern that we saw last week

Starting point is 00:07:33 was just based on bigger macro factors, but it is a good reminder that right now, AI is both the chief beneficiary and biggest victim of any shift in market sentiment, good, bad, or otherwise. That, however, is going to do it for today's headlines. Next up, the main episode. Today's episode is brought to you by my company, Superintelligent. You've got 100 what-if ideas, but which one becomes an agent? Superintelligent maps every AI use case across your company

Starting point is 00:08:05 and helps you create an agent plan that you can actually execute. We match opportunities to your tech stack, your data profile, and your team. No more guesswork, just a clear path from pilot to production. If you want agents that deliver business outcomes, start with planning. Go to besuper.ai and sign up for a demo. Small, nimble teams beat bloated consulting every time. Robots and Pencils partners with organizations on intelligent, cloud-native systems powered by AI. They uncover human needs, design AI solutions, and cut through complexity to deliver meaningful impact without the layers of bureaucracy.

Starting point is 00:08:39 As an AWS-certified partner, Robots and Pencils combines the reach of a large firm with the focus of a trusted partner. With teams across the U.S., Canada, Europe, and Latin America, clients gain local expertise and global scale. As AI evolves, they ensure you keep pace with change, and that means faster results, measurable outcomes, and a partnership built to last. The right partner makes progress inevitable. Partner with Robots and Pencils at robotsandpencils.com/aidailybrief. This episode is brought to you by Blitzy, the enterprise autonomous software development platform with infinite code context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale codebases with millions of lines of code. Enterprise engineering

Starting point is 00:09:21 leaders start every development sprint with the Blitzy platform, bringing in their development requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously, while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding copilot of choice to bring an AI-native SDLC into their org. Visit blitzy.com and press get a demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. What if AI wasn't just a buzzword, but a business imperative?

Starting point is 00:10:00 On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most forward-thinking enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG, the seven-part series delivers real-world insights from leaders who are scaling AI with purpose, from aligning culture and leadership to building trust, data readiness, and deploying AI agents. Whether you're a C-suite executive, strategist, or innovator, this podcast is your front-row seat to the future of enterprise AI. So go check it out at www.kpmg.us/AIpodcasts or search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts.

Starting point is 00:10:38 Welcome back to the AI Daily Brief. Today we are once again talking about another Chinese open source model that is really changing people's sense of what is possible in the field of AI today. Now, to put this model release in some proper context, we have to go back to January. It is now coming up towards the end of the year. And of course, this is the time when I start to plan out my end of year coverage, which is a big time for reflecting on the year that has passed and what's to come. And any end of year big story recap is inevitably going to kick off with the big story from January,

Starting point is 00:11:18 which was, of course, the release of DeepSeek. When Chinese lab DeepSeek dropped their reasoning model, it caused an absolute tizzy in the AI industry that even sent stocks reeling. Now, there were three big reasons that DeepSeek was such a big deal. The first was that it totally changed people's perception of how far behind us China really was. Up until that point, people were working on the assumption that when it came to model development, China was meaningfully behind the U.S., and DeepSeek seemed to suggest that wasn't true. The second big reason for concern, and the one behind the big stock wobble, was that at the time it appeared that they had achieved those results at significantly lower cost than big U.S. training runs.

Starting point is 00:11:58 This made everyone question the incredible amount of resources being spent on the data center buildout. The third reason DeepSeek was such a big deal was more on the consumer side. When they released their R1 reasoning model, the chatbot app that housed it actually dethroned ChatGPT to become the number one downloaded free app on Apple's App Store for iPhone. Now, what was interesting about this was that DeepSeek was not the first company to release a reasoning model. At that point, OpenAI's o1 had been available for a number of months. The difference was that DeepSeek made it available for free, meaning that for most people, it was their first experience with a reasoning model, which, of course, if you've ever experienced the jump

Starting point is 00:12:37 from a non-reasoning to a reasoning model, is just a fundamentally different LLM experience. So this is what kicked off the year, and set the tone for a number of different conversations that we'd be having throughout the year. Now, more recently, the whole China element of this story has heated back up in a big way. Nvidia CEO Jensen Huang recently said in very stark terms that he believed that China would win the AI race because of their disposition towards it. And even though, by the way, all these outlets are reporting that he backtracked, for my money, the backtrack was kind of more just a reaffirmation of what he was saying

Starting point is 00:13:09 while trying to present a slightly more positive spin, like the U.S. still had a chance. Along with the rise in AI skepticism among market investors, there has also been a surge in the idea that China isn't building as many data centers and that perhaps the U.S. is overbuilding them. Investor Gordon Johnson went viral with a tweet that said, Question for the AI bulls. The U.S. currently has around 5,426 data centers and is investing billions to build more. China has around 449 data centers and is not adding.

Starting point is 00:13:35 If AI is real, why isn't China building thousands of data centers every month, which they could clearly do? SemiAnalysis's Dylan Patel responded, Where did you get the idea that they aren't adding? Not as much as the U.S., but China has thousands of data centers and are building many more. Your data source sucks. Now, the substance here is less important than the narrative and the fact that once again, China's actions have become the big foil for the U.S. And this is the setup into which the new Kimi K2 Thinking model was released. The new model was released by Moonshot last Thursday with claims of outperformance on major benchmarks.

Starting point is 00:14:07 The model purportedly leads both GPT-5 and Claude Sonnet 4.5 on Humanity's Last Exam, which is a general knowledge test, on BrowseComp, which is a test of agentic search, and on Seal-0, which is a test of the ability to collect real-world data. The model lags slightly on major coding benchmarks like SWE-bench Verified, but not by much. Deedy Das of Menlo Ventures wrote, Today is a turning point in AI. A Chinese open source model is number one. Kimi K2 Thinking scored 51% on Humanity's Last Exam,

Starting point is 00:14:35 higher than GPT-5 and every other model. 60 cents per million input tokens and $2.50 per million output tokens. The best at writing and does 15 tokens per second on two Mac M3 Ultras. Seminal moment in AI. In other words, the point that Deedy is making here is that in addition to performing well, it's doing so cheaply and in a way that's efficient enough that people could run it on their own hardware. Now, in addition to scorching the benchmarks, Moonshot claimed the model is capable of 200 to 300 sequential tool calls without human interference. If that's true, it would make it incredibly capable for agentic workflows,

Starting point is 00:15:10 frankly, head and shoulders above many of the Western frontier models. Indeed, according to independent testing from Artificial Analysis, Kimi is now ranked ahead of GPT-5, Claude 4.5 Sonnet, and Grok 4 on agentic tool use, and there's a fairly significant gap. Some, like Dan Mack, suggested that this might be enough to delay the release of the next generation of models as the frontier labs go back to the drawing board. Referencing that same recent quote that we were just talking about from Jensen Huang, the one where he said that Chinese AI is nanoseconds behind America, Dan wrote, Jensen is right, look at Kimi K2 Thinking.

Starting point is 00:15:45 Watch for delayed releases of Gemini 3, Opus 4.5 and GPT-5.1. Delays signal they are not clearly better or cheaper than Kimi K2 Thinking. That is evidence that the USA is indeed falling behind in the race. Said Machina, Kimi K2 beating Gemini 3 would be, well, humiliating doesn't even cover it. Think about what Google has. Decades of data, the best talent money can buy, infrastructure that runs the Internet. And they're sweating a smaller team's model? That's not supposed to happen in tech.

Starting point is 00:16:14 The big guy wins always. Maybe not this time, though. Now, part of what has people excited is that the model is open source, so people were running their own tests over the weekend. Pietro Schirano, the CEO at MagicPath AI, wrote, Kimi K2 Thinking is incredible. So I built an agent to test it out, Kimi Writer. It can generate a full novel from one prompt, running up to 300 tool requests per session.

Starting point is 00:16:36 Here it is creating an entire book, a collection of 15 short sci-fi stories. LXE gave the model the task of balancing nine eggs, a book, a laptop, an empty plastic bottle, and a nail to try out its reasoning. The model came up with a counterintuitive solution of arranging the eggs to support the book as the starting point, then adding the book, laptop, bottle, and the nail in turn.

Starting point is 00:16:56 LXE remarked, Kimi K2 Thinking is the only modern reasoning model in recent memory that provided a human solution to this on the first try. Now, another big shift here is that Chinese models are now right there with the U.S. models on coding. AI coding has been the breakout killer use case for this year, and frankly, that's probably been something of a comfort for the Western companies, as this is one area where they've continued to maintain something of a lead. At the beginning of the year, Claude 3.5 Sonnet was the premier model with no close competitor. Since then, later versions of Claude, GPT-5, Gemini

Starting point is 00:17:27 2.5 Pro, Grok 4, all have vied for the top of the leaderboards and API credits from developers. Increasingly, though, Chinese models are catching up, if not to the absolute state of the art, at least presenting a very compelling cost-to-value trade-off. Kimi K2 Thinking is clearly better at coding than Claude 3.5 Sonnet, the model that everyone was using just a few months ago, and it's being served at a fraction of the cost. In a recent article, The Information suggested that that competition is a huge problem for Anthropic in particular, given how much of their revenue is derived from API use for coding. They also point out that looking abroad is an imperative for the Chinese startups, writing,

Starting point is 00:18:00 it is critical they find customers outside China who pay to access the AI models through APIs no matter how low the prices are. That's because it's difficult for AI companies in China to generate revenue from domestic customers, where price competition is fierce, and business customers are reluctant to pay for subscriptions. The article continues, as the overall AI coding market grows rapidly, the Chinese companies are betting that there will be sufficient demand for cheaper and good enough options. And in fact, this is one way that the release of Kimi K2 could end up being different to the DeepSeek moment. If the release of DeepSeek R1 was all about giving consumers their first glimpse of reasoning models that were hidden behind the paywall at OpenAI,

Starting point is 00:18:37 Kimi K2 Thinking could end up being more about providing a near state-of-the-art model that could perform in the enterprise at a fraction of the cost. Another interesting shift is that models like Kimi K2 Thinking are opening the door to self-hosted LLMs in a way that wasn't really feasible last year. Up until recently, there has been a stark trade-off when a developer chose to run models locally. Previously, you could use open-source models to underpin products that didn't need state-of-the-art AI or you could tinker around with them. But for serious advanced production use cases, there needed to be a very significant reason

Starting point is 00:19:08 to want the privacy or security of a local model to make up for the reduced performance. Kimi K2 Thinking is one of a crop of Chinese models that have reduced that gap. One of the reasons for that is an innovation in quantization. You can think of quantization as kind of like compression for AI models. While the process reduces performance, it also lowers the memory requirements substantially to allow models to fit on consumer hardware. Kimi K2 Thinking, for example, can be quantized down to run on a pair of Mac M3 Ultras, which is certainly not a cheap consumer setup, but it is a realistic rig for a professional

Starting point is 00:19:39 programmer or a company. Some are starting to wonder if local LLMs will be a growing trend. I'm not really sure that I'm convinced at this point, but it is possible that we will see certain types of industrial use cases where the balance of value that you get from running locally does shift things, and that will be an important trend to keep an eye on. And while we haven't seen a lot of U.S. enterprises all of a sudden adopting Chinese models, there are growing reports that the startup ecosystem has already made the switch. Bloomberg Opinion columnist Catherine Thorbecke wrote,

Starting point is 00:20:10 In recent weeks, a subtle shift has become increasingly apparent. Speculation has been stirring for months that low-cost, open-source Chinese AI models could lure global users away from U.S. offerings. But now it appears they are also quietly winning over Silicon Valley. She referenced Chamath Palihapitiya commenting that one of his portfolio companies has already moved major workflows to Kimi K2, which he said is, quote, frankly, just a ton cheaper than OpenAI and Anthropic. That same week, Airbnb CEO Brian Chesky said that they hadn't integrated with OpenAI because the connections aren't quite ready.

Starting point is 00:20:42 Instead, Airbnb's new service agent is, quote, relying a lot on Alibaba's Qwen3 model, which Chesky said is very good and also fast and cheap. Mira Murati's Thinking Machines Lab is also building on Qwen3. Cursor's new in-house coding agent, Composer 1, is rumored to be built on top of a Chinese model, and Hugging Face downloads for Qwen have recently overtaken downloads of Meta's Llama models, suggesting a shift in user patterns for open source AI. Referencing that same Jensen Huang quote, Thorbecke wrote, it's premature for Huang to declare a winner. The U.S. still has clear advantages when it comes to access to cutting-edge chips and computing power, but Beijing's low-cost and open-source push is undoubtedly attracting developers, the backbone of AI innovation.

Starting point is 00:21:22 If Washington truly wants to come out on top in the long run, it should start by asking why Silicon Valley is already switching sides. So what's the net of all of this? Cashat Patel writes, Kimi K2 Thinking is more important than o3, not because the model is better, but because of what it signals about the future of AI development. For him, there are a few different elements of this. First, that the open-source lag is now measured in months, not years, that basically we've seen the closed model advantage window collapse from more than 18 months to three to four months,

Starting point is 00:21:51 that China is treating AI like they treated electric vehicle manufacturing, in other words, not trying to match the West but trying to lap it on price and accessibility and competing on economics. And then this observation, the real race isn't to AGI, it's to democratization. He writes, Who cares if you build AGI if only a thousand companies can afford it? Kimi K2 provides frontier performance at commodity prices. That's the game. Dean Sackaransky thinks that the agentic capabilities update is the real deal here. He writes,

Starting point is 00:22:21 In July 2025, models could not effectively call tools, three to five tool calls max. Then Kimi K2 released, and every subsequent model has been post-trained for tool calling. Now we have agents that can run for an hour and 30 minutes. This is the quietest and most significant advancement in recent memory. Bindu Reddy writes, In spite of all the closed-source drama,

Starting point is 00:22:40 the biggest story of 2025 has been open-source agentic models. Three new models dominate the cheap mass market agent space. GLM, Kimi K2, and Qwen Coder are all amazing, with trillions of tokens being used every day. That leads to a prediction from Bindu. 2026 will be the year of open weights. We will see at least two U.S. labs enter the arena. Kimi and GLM will push to close the gap in agentic coding. DeepSeek will finally release R2. We will have state-of-the-art image and video generation models. LLM developer community will explode. Now look, obviously one of the subtexts for a lot of this show is around the geopolitics

Starting point is 00:23:15 of this, but when it comes to consumer choice, it's hard to see all of these advances as anything but incredibly valuable. New frontiers of performance and cost are being pushed, bringing the efficiency and affordability of everything down, and that's going to mean all of us being able to do even more with these models than what was previously possible. Pretty interesting stuff, obviously a lot to keep track of. For now, it's going to do it for today's AI Daily Brief.

Starting point is 00:23:38 Appreciate you listening or watching as always, and until next time, peace.
