The AI Daily Brief: Artificial Intelligence News and Analysis - The "Wave of Crazy New AI Stuff" Coming Next Month

Episode Date: May 17, 2025

A flood of major AI updates is right around the corner. New models from Anthropic, OpenAI’s autonomous coding agent Codex, Windsurf's SWE-1 for end-to-end software engineering, and changes at Salesforce and Walmart all point to a massive shift.

Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief

Brought to you by:
KPMG - Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months.
Vertice Labs - Check out http://verticelabs.io/ - the AI-native digital consulting firm specializing in product development and AI agents for small to medium-sized businesses.
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network

Transcript
Starting point is 00:00:00 A wave of crazy new AI stuff seems to be right on the horizon, and we're actually starting to see some of it as early as today. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Thanks to today's sponsors, KPMG, Blitzy.com, and Superintelligent, and to get an ad-free version of the show, go to patreon.com/AIDailyBrief. All right, friends, quick note before we dive in. As I promised in yesterday's show, because we have had a couple of long main-only episodes this week, today we are doing an extended headlines. We're catching up on just a ton of news. It's jam-packed,
Starting point is 00:00:37 so let's dive in. Welcome back to the AI Daily Brief. Earlier this week, Y Combinator Managing Partner Dalton Caldwell wrote, A wave of crazy new AI-related stuff is coming next month. Betting on the models getting smarter reminds me of the 1990s bet that network bandwidth would only keep growing. It was a good one. And that is really the meta-theme of today's show. One of the labs that people are eagerly awaiting a next drop from is Anthropic. Lucky for us, The Information reports that new versions of Claude Sonnet and Claude Opus are coming over the next few weeks. Citing model testers, they wrote, what makes these models different from existing reasoning AI is their ability to go back and forth between thinking or exploring different ways to
Starting point is 00:01:17 solve a problem and tool use, the ability to use external tools, applications, and databases to find an answer. The Information gave examples of business development, where the models are able to alternate between web-based research and reasoning through data to come up with a suggestion. On the coding side, models can automatically test their own code and then reason about bug fixes. One of the implications of this is that these models might be able to function based on much higher-level instructions, further reducing the need for exact prompt engineering. For example, they write, the new Anthropic models are supposed to handle more complex tasks with less input and corrections from their human customers.
Starting point is 00:01:53 The example they give is, in something like software engineering, you might just tell the model to make this app faster and let it figure out how to do that. Now, there's an open question until we see these things around just how different they are from OpenAI's o3 or o4-mini, which integrate tool use into the reasoning process. And as we'll see, those are not the only models that OpenAI has now in this vein. And it's also not a sure thing that people will embrace a new model. For example, as The Information points out, reactions to Claude 3.7 Sonnet, a previously released Anthropic model that combined reasoning and traditional large language models in a single AI, have been mixed. Some people have complained the model is more likely to lie and
Starting point is 00:02:29 ignore user commands. Others have said when they don't give specific enough instructions to the model, it's more likely than other AI to get too ambitious and go out of scope for what it's supposed to do. Tony Ennis of Scout AI noted, Claude 3.5 Sonnet was released around a year ago, and despite being followed by 3.5 Haiku and 3.7 Sonnet, is still the recommended model for half of Cursor tasks. And when it comes to coding, Anthropic's models appear to have some competition now. Coding assistant startup Windsurf has announced the launch of their first family of proprietary models. The family will be known as SWE-1 and includes a full-size model alongside lite and mini versions. The company said that the models will be optimized for the entire software
Starting point is 00:03:08 engineering process, not just coding. They claim that the flagship model SWE-1 will have, quote, approximately Claude 3.5 Sonnet levels of tool-call reasoning while being cheaper to serve. Windsurf will be offering the model for free during a promotional period. The smaller lite version will be delivered with unlimited use to all users, including free-tier customers. The offering seems pretty squarely aimed at undercutting the dominant pairing of Cursor and 3.5 Sonnet. The primary complaints that users have with Cursor using Anthropic's models are around cost and rate limits. Windsurf clearly sees an opportunity to deliver an experience on par with 3.5 Sonnet at a fraction of the cost and potentially win market share
Starting point is 00:03:43 because of it. Still, there's another part of this announcement that's really important as well, which is the idea of expanding coding assistance beyond just churning out lines of code. Windsurf is attempting to deliver a model more capable at drawing on knowledge bases, testing code, and understanding user feedback. They also noted that coding assistants have been great at zoomed-in tactical work, but generally struggle to consider the full scope of software engineering problems. This is particularly true when it comes to switching between terminals, IDEs, and internet-based resources. They write, at some point, just getting better at coding will not make you or a model better at software engineering, and we ultimately want to help accelerate everything a software engineer
Starting point is 00:04:17 can do. So we've known for quite a while that we're going to need software engineering models, SWE models for short. Across a number of benchmarks, Windsurf is claiming that SWE-1 is in the same ballpark as 3.5 Sonnet, but not quite as powerful as 3.7. They also tested the new model on real-world usage by running a blind experiment on users and found that SWE-1 had significantly more lines of code accepted by the user than 3.5, but not quite as many as 3.7. The release is also interesting in the context of the reported OpenAI acquisition of the company. Many assume that OpenAI just wanted to showcase their own models on Windsurf's platform, but these new models imply that Windsurf is more than just an interface for the latest and greatest from OpenAI.
Starting point is 00:04:55 And that's all the more interesting today, because on the morning that I was recording this, Friday, May 16th, as I was prepping the show, OpenAI announced that they were going to have a live stream in just a couple of hours. What they launched was their version of a vibe coding tool, sort of, called Codex. Here's how Dan Shipper from Every summed it up. OpenAI just launched Codex, a brand new autonomous coding agent that can build features and fix bugs on its own. We've been using it at Every for a few days, and I'm impressed. Codex is designed to be used by senior engineers. It performs coding tasks like adding features or fixing bugs autonomously. It's built to allow you to start many sessions at once so you can have multiple agents working in parallel.
Starting point is 00:05:32 Codex is built to have taste. OpenAI trained Codex to have the taste of a senior software engineer. It knows how big code bases work, how to write a good PR, and uses clean, minimal code. Codex is designed to allow users to delegate many tasks at once without getting caught up in the details. This lets you point an abundance of agents at a specific task, like a difficult bug, making it worth it even if only one of them succeeds. Finally, Dan and Every suggest that OpenAI's vision for the future of programming is that in the future, developers will probably spend less time writing routine code and more time guiding agents, reviewing their work and making strategic decisions. Programming will become more social,
Starting point is 00:06:06 letting teams easily delegate multiple tasks at once, allowing people to focus on ideas and collaboration instead of routine coding. Like I said, this thing was literally just launched hours ago, so I haven't had a chance to play around with it yet, but it certainly suggests just how essential this category is, and is further evidence of the point that started this show, which is that there is a lot of stuff coming down the pipeline right now. Another small update from OpenAI. The company has brought GPT-4.1 to ChatGPT, and even made it the new default model. GPT-4.1 was released last month and marketed as a coding-focused model that might not be of all that much interest for other use cases. It was OpenAI's first
Starting point is 00:06:41 release that was only available through the API, suggesting that the company was fairly confident it would only be used by or useful to developers. Earlier this week, however, OpenAI announced that by popular request, GPT-4.1 will be available directly in ChatGPT. Chief Product Officer Kevin Weil added, We built it for developers, so it's very good at coding and instruction following. Early response is positive. Melvin Vivas writes, GPT-4.1, huge difference just at the start of a conversation. 4o feels like talking to a robot, 4.1 feels like talking to a human.
Starting point is 00:07:12 Instruction following is also pretty good. V-Racer X also wrote, 4.1 is a lot funnier than 4o. If you're into creative writing, I'd prefer 4.1. Not every company, however, is pushing out models. The Wall Street Journal reports that Meta's flagship Llama 4 model is being delayed after failing to live up to expectations. Sources told the Journal that engineers have been unable to improve the capabilities of Llama 4
Starting point is 00:07:35 Behemoth, leading staff to question whether it's a meaningful enough upgrade to justify public release. Behemoth is, of course, the ultra-large model in the Llama 4 family. It uses a mixture-of-experts architecture that engages a subset of parameters for each query, similar to DeepSeek-V3 and Grok 3. It clocks in at 288 billion active parameters across 16 experts for a total of 2 trillion parameters, similar to the size of Grok 3 but far larger than any other open source model currently available.
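To make the "subset of parameters for each query" idea concrete, here is a minimal sketch of top-k expert routing, one common way mixture-of-experts layers choose which experts to run. It is not Meta's implementation: the toy experts, the random gate scores, and the choice of k are all assumptions for illustration. Only the 16 experts and the 288 billion active versus roughly 2 trillion total parameter figures come from the episode.

```python
import math
import random

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over only the selected experts so their weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=1):
    """Combine the outputs of only the selected experts; the rest are never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# 16 toy "experts" (hypothetical): each one just scales its input by a different factor.
NUM_EXPERTS = 16
experts = [lambda x, s=s: x * s for s in range(1, NUM_EXPERTS + 1)]

# Stand-in gate scores; in a real model a learned router produces these per token.
rng = random.Random(0)
gate_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(gate_logits, k=2)  # k=2 is an assumption, not a reported figure
output = moe_forward(3.0, experts, gate_logits, k=2)

# Figures quoted in the episode: 288B active out of roughly 2T total parameters.
active_fraction = 288e9 / 2e12
print(f"experts used: {[i for i, _ in routes]}, active fraction: {active_fraction:.1%}")
```

On the quoted numbers, only about 14% of the model's parameters would be engaged for any one query, which is why a model this large can still be practical to serve.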
Starting point is 00:08:01 And yet, it appears that all that size hasn't really yielded results. The Journal reports Behemoth was originally slated to be released in April alongside the two smaller models in the Llama 4 family. Internal targets were then pushed to June and are now delayed until the fall or even later. Last month at the inaugural LlamaCon, Mark Zuckerberg said that Behemoth would be the, quote, highest performing base model in the world, and so they really can't release a model that doesn't live up to that. The reporting also highlighted growing tension at Meta surrounding the rollout of Llama 4. The Journal wrote,
Starting point is 00:08:30 Senior executives at the company are frustrated at the performance of the team that built the Llama 4 models and blame them for the failure to make progress on Behemoth. Meta is contemplating significant management changes to its AI product group as a result. Now, there have already been a lot of changes around Meta's AI leadership strategies over the last year, but the stakes are obviously very, very high for Zuckerberg and for Meta as a whole. Moving a bit down the stack from the foundation model companies, Cohere seems to be pulling off their pivot to the app layer, but to some, their strong performance still represents a fall from grace. In 2023, Cohere was well and truly in the mix to compete
Starting point is 00:09:05 as a foundation model company alongside Anthropic, OpenAI, and Mistral. However, as training runs got larger and more expensive, they just couldn't keep up. At the end of last year, the company announced a pivot to niche enterprise AI deployments rather than competing for the whole stack, which, by the way, is almost a silly way to describe it given how absolutely massive this quote-unquote niche of enterprise AI deployments is going to be. But basically, the company abandoned plans to train frontier models to instead focus on smaller models for on-premise deployment. Co-founder Nick Frosst said at the time, what we're hearing from customers is that they just don't need bigger models to be good at everything. They need models that are actually built
Starting point is 00:09:40 for their specific use cases. Since then, the business seems to be thriving. Sources said the company has now reached $100 million in annualized revenue, doubling their pace from the beginning of last year. 85% of that revenue comes from long-term enterprise contracts, with the company stating that they've managed to reach 80% margins. The reporting states that they're testing a document summarization model with large clients, including the Royal Bank of Canada and LG. But even this incredibly impressive feat shows just how big a gap there is between the foundation model companies and everyone else. Back in 2023, as ChatGPT was sweeping the world, Cohere gave investors projections of hitting $600 million in annualized revenue from selling access to their models.
Starting point is 00:10:18 Still, I think the company should be very proud of having pivoted and figured out a viable and exciting model for the app layer. Jenny Zhao writes, most foundation model companies will fail. The brutal reality is that it's extremely hard to outcompete open source models. If you can't cross that line, you're basically worth zero. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up. KPMG can show you how to integrate
Starting point is 00:10:53 AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us/AI. Again, that's www.kpmg.us/AI. Today's episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with Infinite Code Context, which, if you don't know exactly what that means yet, do not worry, we're going to explain, and it's awesome. So Blitzy is used alongside your favorite coding copilot as your batch software development
Starting point is 00:11:30 platform for the Enterprise, and it's meant for those who are seeking dramatic development acceleration on large-scale code bases. Traditional copilots help developers with line-by-line completions and snippets, but Blitzy works ahead of the IDE, first documenting your entire code base, then deploying more than 3,000 coordinated AI agents working in parallel to batch build millions of lines of high-quality code for large-scale software projects. So then whether it's codebase refactors, modernizations, or bulk development of your product roadmap, the whole idea of Blitzy is to provide enterprises dramatic velocity improvement. To put it in simpler terms, for every line of code
Starting point is 00:12:04 eventually provided to the human engineering team, Blitzy will have written it hundreds of times, validating the output with different agents to get the highest quality code to the enterprise in batch. Projects then that would normally require dozens of developers working for months can now be completed with a fraction of the team in weeks, empowering organizations to dramatically shorten development cycles and bring products to market faster than ever. If your enterprise is looking to accelerate software development, whether it's large-scale modernization, refactoring, or just increasing the rate of your SDLC, contact Blitzy at blitzy.com, that's B-L-I-T-Z-Y dot com, to book a custom demo, or just press Get Started,
Starting point is 00:12:38 and start using the product right away. Today's episode is brought to you by Superintelligent. Now, you have heard me talk about agent readiness audits probably numerous times at this point. This is our system that uses voice agents and a hybrid human AI analysis process to benchmark your agent readiness and map your agent opportunities and give you some really pointed, actionable next steps to move further down the path in your agentic journey. But we're coming up on the slow time of the year, and if you want to use this time to get out ahead of peers and competitors, we're excited to announce something we're calling Agent Summer.
Starting point is 00:13:10 The idea here isn't that complicated. It's basically just an accelerated program to get you agentified and fast. First of all, it's going to include an Agent Readiness Audit, figuring out where your biggest agent opportunities are. Next, we're going to support both your internal change management process, helping you figure out AI policy, data readiness, things like that, as well as doing action planning around the agent opportunities that are most relevant for you. And finally, we're going to connect you to the right vendors to actually go and deliver this. Now, for this, we want to work with a very small handful of companies that really want to move. We're going to be bundling more than $50,000 of services for something that starts closer to $30,000.
Starting point is 00:13:44 And so if you want to use this summer to jump ahead on your company's agent journey, email agent@besuper.ai with summer in the subject line, claim one of these limited spots, and let's go have an agent summer. Another recent theme we've been exploring is pricing, and Salesforce is apparently taking another look at their pricing models as agents become a bigger and bigger part of their business. Customers will now pay 10 cents per action when using Salesforce agents. Last year, the company was one of the first to experiment with per-use pricing, rather than following traditional SaaS models of charging per seat.
Starting point is 00:14:16 The agents were priced at $2 per conversation, with the presumption that they would be used primarily for outbound sales. The company says that this new pricing structure is intended to be a more attractive way to pay for non-conversational and internal uses like scanning through emails to look for leads. Salesforce will also now allow existing customers to reallocate spending from software subscriptions into their AI agent offerings. Executive VP Bill Patterson said, for companies who are looking at the future of their workforce,
Starting point is 00:14:42 whether it scales up or scales down, what the flex agreement gives us is this ability to move spending between human labor and digital labor. Now, I did an entire show a couple of weeks ago about agent pricing and the implications it has, and Salesforce is a live-action case study in that. Effectively, their last price experiment
Starting point is 00:14:59 was imagining one type of use, but then when they saw another type of use that didn't work for that pricing, they had to adapt. I think that this flex agreement idea is really smart and creates a lot of space for them to potentially be even more nimble with this pricing. But overall, this is just one more sign that nobody exactly knows how this is going to play out or how they even should think about pricing. Another company thinking about agents is Walmart. The retailer is preparing for big changes in the way their consumers shop, or rather how their agents shop. Walmart is apparently starting
Starting point is 00:15:27 to think about how to market their products to the AI agents that they believe will soon take over the shopping experience. Walmart's CTO Hari Vasudev said, it will be different. Advertising will have to evolve. So far, most of the shopping agents we've seen follow a very simple rubric. They either choose the top blue link in a search or have instructions to look for certain brands in particular categories. But it's highly likely that as these agents proliferate, we could see an entirely new SEO game evolve, with companies focused on figuring out how to appeal to these new robotic shoppers. Robert Hetu, the VP analyst for retail and market research at consulting firm Gartner, also suggested that brands could
Starting point is 00:16:03 lose their direct relationship with customers. And it's difficult to imagine an AI agent developing a ton of brand loyalty. So Walmart, for their part, is developing their own shopping agent, but also preparing for most consumers to start using third-party agents. Vasudev says he also foresees the establishment of an industry protocol, which enables third-party agents to communicate with a retailer's proprietary agent to serve product recommendations. And by the way, if you are an entrepreneur out there thinking about what your next opportunity might be, that is a great example of just how much new infrastructure is going to be built, retail industry agent-to-agent protocol,
Starting point is 00:16:36 feels niche in a million dollar a year business. In any case, Hetu believes that we could see a situation where latency plays a larger role, with retailers modifying pricing in a split second to win the business of third-party agents. Now, Walmart isn't thinking this is going to happen overnight. The company still does 80% of its business in physical locations, but very clearly they're getting out ahead of the changes.
Starting point is 00:16:56 Now, it wasn't exactly about the same thing, but I did also notice this tweet from Perplexity CEO Aravind Srinivas, who wrote, Hotel bookings natively on Perplexity are quietly growing. It's one of the under-the-radar features we have right now that has a massive potential to disrupt the ad industry. Google's second biggest AdWords category, I think. Now, interestingly, I was just experimenting with Perplexity and Manus last night on a bunch of my own travel searching, although I'm still more on the research rather than booking front, but I think it's another indicator of how quickly these experiences are going to converge. Speaking of Perplexity, another report on their next
Starting point is 00:17:28 funding round. The Wall Street Journal reports that the company is in advanced talks to raise a $500 million round at a $14 billion valuation led by Accel. Now, when it comes to AI venture, Perplexity's fundraising story is one of the more intriguing to watch right now. On the one hand, that $14 billion valuation is a huge jump from the $9 billion valuation from their last funding round in November, which itself was like 300% of their previous valuation just a few months before that. At the same time, it looks like the valuation was negotiated down, with reports from March stating that the company was aiming to raise a billion dollars at an $18 billion valuation. There also seems to be a rotating cast of VCs. The last round was led by Institutional Venture Partners,
Starting point is 00:18:09 but Accel is reportedly taking over for this round. That's very different from recent fundraising from OpenAI and xAI, which saw existing investors double down as hard as possible. What makes Perplexity so interesting to watch is that it is by far the most successful quote-unquote wrapper company, a company that's building a product rather than a model, but that butts uncomfortably up against something that the model companies do themselves as well. It actually doesn't surprise me to see a little bit of volatility in investor conviction just because of how many different opinions there are around whether that's a viable concern in the long run. In another area of financing, we have some M&A news with Databricks making another big purchase, paying a billion dollars to acquire database
Starting point is 00:18:48 startup Neon. This will be Databricks' third billion dollar acquisition over the past two years as they seek to build their AI-first data analytics platform. Neon's tools allow developers to clone databases and preview changes before they go into production, alongside offering scaling hosting solutions. Now, the interesting part of this is that Neon has seen an explosion of AI agents using their platform rather than human developers. Databricks said that recent telemetry data shows that 80% of the databases provisioned on Neon were created automatically by AI agents rather than humans. Essentially, Databricks is not just looking to offer agents, but heading downstream to capture value from the tooling an agentic workforce will require.
Starting point is 00:19:26 Lastly today, a set of rather weird stories surrounding AI safety issues. xAI's Grok was briefly obsessed with race relations in South Africa this week. On Wednesday, the chatbot started discussing the claimed white genocide in completely unrelated topics on X. In one of hundreds of examples, a user asked how many times HBO had changed its name. Grok gave the answer that HBO had rebranded twice before launching into a discussion of attacks on white farmers in South Africa as a complete non-sequitur. In another example, Grok pivoted hard from discussing baseball statistics to discussing South Africa for no obvious reason.
Starting point is 00:20:03 New York Times investigative journalist Aric Toler posted, I can't stop reading the Grok reply page. It's going schizo and it can't stop talking about white genocide in South Africa. Post "Grok, is this true" on any post and it'll start talking about Kill the Boer and white genocide. Now, if ChatGPT's recent issues with sycophancy were a high-profile example of AI misalignment, Elon's chatbot seems to be saying, hold my beer. I am absolutely not going to get into the political dynamics of this. It's an extremely hot-button issue. The U.S. brought in 59 white South Africans under a very specifically targeted refugee program this week,
Starting point is 00:20:34 which generated a ton of controversy. Elon Musk himself is, of course, a white South African immigrant. But for our purposes, what this demonstrates is how easily chatbots can go haywire when system prompts are edited. On Thursday, xAI addressed the controversy, tweeting, on May 14th at approximately 3:15 a.m., an unauthorized modification was made to the Grok response bot's prompt on X. This change, which directed Grok to provide a specific response on a political topic,
Starting point is 00:20:58 violated xAI's internal policies and core values. We've conducted a thorough investigation and are implementing measures to enhance Grok's transparency and reliability. Moving forward, the company said that they would begin publishing their system prompts on GitHub. YC founder Paul Graham pointed out the problem, saying, Grok randomly blurting out opinions about white genocide in South Africa smells to me like the sort of buggy behavior you get from a recently applied patch. I sure hope it isn't.
Starting point is 00:21:21 It would be really bad if widely used AIs got editorialized on the fly by those who controlled them. One upshot of the whole debacle is that we now have the first commitment from a major AI lab to transparently publish their system prompts. The recent incident with the sycophantic version of GPT-4o was also caused by a modification to the system prompt, but we haven't seen a similar commitment from them. Although in separate but somewhat related news, they did announce a new safety evaluations hub, which they describe as a resource to explore safety results for our models. Basically, they say they're going to communicate about safety more proactively. In any case,
Starting point is 00:21:53 jailbreaker extraordinaire, Pliny the Liberator, has been pushing for this sort of commitment that we got from xAI as a very bare minimum accountability and transparency measure and tweeted, sweet, sweet victory, we did it, chat. And that concludes another fascinating week in the world of AI. Appreciate you listening or watching, as always. And until next time, peace.
