The AI Daily Brief: Artificial Intelligence News and Analysis - The "Wave of Crazy New AI Stuff" Coming Next Month
Episode Date: May 17, 2025
A flood of major AI updates is right around the corner. New models from Anthropic, OpenAI’s autonomous coding agent Codex, Windsurf's SWE-1 for end-to-end software engineering, and changes at Salesforce and Walmart all point to a massive shift.
Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Vertice Labs - Check out http://verticelabs.io/ - the AI-native digital consulting firm specializing in product development and AI agents for small to medium-sized businesses.
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
A wave of crazy new AI stuff seems to be right on the horizon,
and we're actually starting to see some of it as early as today.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Thanks to today's sponsors, KPMG, Blitzy.com, and Superintelligent,
and to get an ad-free version of the show, go to patreon.com/AIDailyBrief.
All right, friends, quick note before we dive in. As I promised in yesterday's show,
because we have had a couple of long main-only episodes this week,
today we are doing an extended headlines. We're catching up on just a ton of news. It's jam-packed,
so let's dive in. Welcome back to the AI Daily Brief. Earlier this week, Y Combinator Managing
Partner Dalton Caldwell wrote, A wave of crazy new AI-related stuff is coming next month.
Betting on the models getting smarter reminds me of the 1990s bet that network bandwidth would only
keep growing. It was a good one. And that is really the meta-theme of today's show. One of the labs that
people are eagerly awaiting a next drop from is Anthropic. Lucky for us, The Information reports
that new versions of Claude Sonnet and Claude Opus are coming over the next few weeks.
Citing model testers, they wrote, what makes these models different from existing reasoning
AI is their ability to go back and forth between thinking or exploring different ways to
solve a problem and tool use, the ability to use external tools, applications, and databases
to find an answer. The Information gave examples of business development, where the models are
able to alternate between web-based research and reasoning through data to come up with a suggestion.
On the coding side, models can automatically test their own code and then reason about bug fixes.
One of the implications of this is that these models might be able to function based on much
higher-level instructions, further relegating the need for exact prompt engineering.
For example, they write, the new Anthropic models are supposed to handle more complex tasks
with less input and corrections from their human customers.
The example they give is, in something like software engineering, you might just want to
make this app faster, and let it figure out how to do that. Now, there's an open question
until we see these things around just how different they are than OpenAI's o3 or o4-mini,
which integrate tool use into the reasoning process. And as we'll see, those are not the only
models that OpenAI has now in this vein. And it's also not a sure thing that people will embrace
a new model. For example, as The Information points out, reactions to Claude 3.7 Sonnet,
a previously released Anthropic model that combined reasoning and traditional large language
models in a single AI, have been mixed. Some people have complained the model is more likely to lie and
ignore user commands. Others have said when they don't give specific enough instructions to the model,
it's more likely than other AI to get too ambitious and go out of scope for what it's supposed to do.
Tony Ennis of Scout AI noted, Claude 3.5 Sonnet was released around a year ago, and despite
being followed by 3.5 Haiku and 3.7 Sonnet, is still the recommended model for half of Cursor tasks.
And when it comes to coding, Anthropic's models appear to have some competition now.
Coding assistant startup Windsurf has announced the launch of their first family of proprietary models.
The family will be known as SWE-1 and includes a full-size model alongside
lite and mini versions. The company said that the models will be optimized for the entire software
engineering process, not just coding. They claim that the flagship model, SWE-1, will have,
quote, approximately Claude 3.5 Sonnet levels of tool-call reasoning while being cheaper to serve.
Windsurf will be offering the model for free during a promotional period.
The smaller lite version will be delivered with unlimited
use to all users, including free-tier customers. The offering seems pretty squarely aimed at undercutting
the dominant pairing of Cursor and 3.5 Sonnet. The primary complaint that users have with Cursor using
Anthropic's models is around cost and rate limits. Windsurf clearly sees an opportunity to deliver
an experience on par with 3.5 Sonnet at a fraction of the cost and potentially win market share
because of it. Still, there's another part of this announcement that's really important as well,
which is the idea of expanding coding assistance beyond just churning out lines of code. Windsurf is attempting to
deliver a model more capable at drawing on knowledge bases, testing code, and understanding user
feedback. They also noted that coding assistants have been great at zoomed-in tactical work,
but generally struggle to consider the full scope of software engineering problems.
This is particularly true when it comes to switching between terminals, IDEs, and internet-based
resources. They write, at some point, just getting better at coding will not make you or a model
better at software engineering, and we ultimately want to help accelerate everything a software engineer
can do. So we've known for quite a while that we're going to need software engineering models,
SWE models for short. Across a number of benchmarks, Windsurf is claiming that SWE-1 is in the same
ballpark as 3.5 Sonnet, but not quite as powerful as 3.7. They also tested the new model on real-world
usage by running a blind experiment on users and found that SWE-1 had significantly more lines of
code accepted by the user than 3.5, but not quite as many as 3.7. The release is also
interesting in the context of the reported OpenAI acquisition of the company. Many assume that
OpenAI just wanted to showcase their own models on Windsurf's platform, but these new models
imply that Windsurf is more than just an interface for the latest and greatest from OpenAI.
And that's all the more interesting today, because in the morning that I was recording this Friday,
May 16th, as I was prepping the show, OpenAI announced that they were going to have a live
stream in just a couple of hours. What they launched was their version of a vibe coding tool,
sort of, called Codex. Here's how Dan Shipper from Every summed it up. OpenAI just launched Codex,
a brand new autonomous coding agent that can build features and fix bugs on its own. We've been
using it at Every for a few days, and I'm impressed. Codex is designed to be used by senior engineers.
It performs coding tasks like adding features or fixing bugs autonomously. It's built to allow you
to start many sessions at once so you can have multiple agents working in parallel.
Codex is built to have taste. OpenAI trained Codex to have the taste of a senior software
engineer. It knows how big code bases work, how to write a good PR, and uses clean, minimal code.
Codex is designed to allow users to delegate many tasks at once without getting caught up in
the details. This lets you point an abundance of agents at a specific task,
like a difficult bug, making it worth it even if only one of them succeeds. Finally, Dan and
Every suggest that OpenAI's vision for the future of programming is that in the future,
developers will probably spend less time writing routine code and more time guiding agents,
reviewing their work and making strategic decisions. Programming will become more social,
letting teams easily delegate multiple tasks at once, allowing people to focus on ideas and
collaboration instead of routine coding. Like I said, this thing was literally just launched
hours ago, so I haven't had a chance to play around with it yet, but it certainly suggests
just how essential this category is, and is further evidence of the point that started this show,
which is that there is a lot of stuff coming down the pipeline right now.
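The idea of pointing many agents at one difficult bug can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes each agent succeeds independently with the same probability, and the 30% figure is a made-up placeholder, not a number from OpenAI or Every. The function name is hypothetical.

```python
def chance_any_succeeds(p_single: float, n_agents: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Assumes every agent has the same success rate and that attempts are
    independent, which real agents working on the same bug may not be.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_agents < 0:
        raise ValueError("n_agents must not be negative")
    # The only way to fail overall is for every single agent to fail.
    return 1.0 - (1.0 - p_single) ** n_agents


if __name__ == "__main__":
    # 0.3 is a hypothetical per-agent success rate chosen for illustration.
    for n in (1, 3, 5, 10):
        print(f"{n:>2} agents -> {chance_any_succeeds(0.3, n):.3f}")
```

Under these assumptions the chance of getting at least one working fix climbs quickly with the number of parallel sessions, which is the intuition behind launching several at once.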
Another small update from OpenAI. The company has brought GPT-4.1 to ChatGPT, and even made it
the new default model. GPT-4.1 was released last month and marketed as a coding-focused model
that might not be of all that much interest for other use cases. It was OpenAI's first
release that was only available through the API, suggesting that the company was fairly
confident it would only be used by, or useful to, developers. Earlier this week,
however, OpenAI announced that by popular request, GPT-4.1 will be available directly in
ChatGPT. Chief Product Officer Kevin Weil added,
We built it for developers, so it's very good at coding and instruction following.
Early response is positive. Melvin Vivas writes,
GPT-4.1, huge difference just at the start of a conversation.
4o feels like talking to a robot, 4.1 feels like talking to a human.
Instruction following is also pretty good.
V-Racer X also wrote,
4.1 is a lot funnier than 4o.
If you're into creative writing, I'd prefer 4.1.
Not every company, however, is pushing out models.
The Wall Street Journal reports that Meta's flagship Llama 4 model is being delayed after
failing to live up to expectations.
Sources told the Journal that engineers have been unable to improve the capabilities of Llama 4
Behemoth, leading staff to question whether it's a meaningful enough upgrade to justify
public release.
Behemoth is, of course, the ultra-large model in the Llama 4 family.
It uses a mixture-of-experts architecture that engages a subset of parameters for each query,
similar to DeepSeek-V3 and Grok 3.
It clocks in at 288 billion active parameters across 16 experts for a total of 2 trillion
parameters, similar to the size of Grok 3 but far larger than any other open source model
currently available.
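To put those mixture-of-experts figures in proportion, here is a small worked calculation. It uses only the two numbers quoted above (288 billion active, 2 trillion total); the function and constant names are made up for this sketch, and it says nothing about how Meta actually routes queries between experts.

```python
# Figures quoted in the episode for Llama 4 Behemoth.
ACTIVE_PARAMS = 288e9  # parameters engaged for a single query
TOTAL_PARAMS = 2e12    # total parameters across all experts


def active_fraction(active: float, total: float) -> float:
    """Return the share of the model's parameters used for one query."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"Active share per query: {share:.1%}")
```

On these numbers, roughly one seventh of the model does the work for any given query, which is why a mixture-of-experts model can be far cheaper to run than its total size suggests.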
And yet, it appears that all that size hasn't really yielded results.
The Journal reports Behemoth was originally slated to be released in April alongside
the two smaller models in the Llama 4 family.
Internal targets were then pushed to June and are now delayed until the fall or even later.
Last month at the inaugural LlamaCon, Mark Zuckerberg said that Behemoth would be the, quote,
highest performing base model in the world, and so they really can't release a model that doesn't live up to that.
The reporting also highlighted growing tension at Meta surrounding the rollout of Llama 4.
The Journal wrote,
Senior executives at the company are frustrated at the performance of the team that built the Llama 4 models
and blame them for the failure to make progress on Behemoth.
Meta is contemplating significant management changes to its AI product group as a result.
Now, there have already been a lot of changes around Meta's AI leadership strategies
over the last year, but the stakes are obviously very, very high for Zuckerberg and for Meta as a whole.
Moving a bit down the stack from the foundation model companies,
Cohere seems to be pulling off their pivot to the app layer, but to some, their strong performance
still represents a fall from grace. In 2023, Cohere was well and truly in the mix to compete
as a foundation model company alongside Anthropic, OpenAI, and Mistral. However, as training runs
got larger and more expensive, they just couldn't keep up. At the end of last year, the company
announced a pivot to niche enterprise AI deployments rather than competing for the whole stack,
which, by the way, is almost a silly way to describe it given how absolutely massive this quote-unquote
niche of enterprise AI deployments is going to be. But basically, the company abandoned plans to train
frontier models to instead focus on smaller models for on-premise deployment.
Co-founder Nick Frosst said at the time, what we're hearing from customers is that they
just don't need bigger models to be good at everything. They need models that are actually built
for their specific use cases. Since then, the business seems to be thriving.
Sources said the company has now reached $100 million in annualized revenue, doubling their pace
from the beginning of last year. 85% of that revenue comes from long-term enterprise contracts,
with the company stating that they've managed to reach 80% margins. The reporting states that they're
testing a document summarization model with large clients, including the Royal Bank of Canada and LG.
But even this incredibly impressive feat shows just how big a gap there is between the foundation
model companies and everyone else. Back in 2023, as ChatGPT was sweeping the world,
Cohere gave investors projections of hitting $600 million in annualized revenue from selling access to their models.
Still, I think the company should be very proud of having pivoted and figured out a viable and exciting model for the app layer.
Jenny Zhao writes, most foundation model companies will fail.
The brutal reality is that it's extremely hard to outcompete open source models.
If you can't cross that line, you're basically worth zero.
Today's episode is brought to you by KPMG.
In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge,
foster growth and drive new value. But here's the key. You don't need an AI strategy. You need to embed
AI into your overall business strategy to truly power it up. KPMG can show you how to integrate
AI and AI agents into your business strategy in a way that truly works and is built on trusted
AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its
clients at www.kpmg.us/AI. Again, that's www.kpmg.us/AI.
Today's episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform
with Infinite Code Context, which, if you don't know exactly what that means yet, do not worry
we're going to explain, and it's awesome.
So Blitzy is used alongside your favorite coding copilot as your batch software development
platform for the Enterprise, and it's meant for those who are seeking dramatic development
acceleration on large-scale code bases.
Traditional co-pilots help developers with line-by-line completions and snippets, but Blitzy
works ahead of the IDE, first documenting your entire code base, then deploying more than
3,000 coordinated AI agents working in parallel to batch build millions of lines of high-quality
code for large-scale software projects. So then whether it's codebase refactors, modernizations,
or bulk development of your product roadmap, the whole idea of Blitzy is to provide
enterprises dramatic velocity improvement. To put it in simpler terms, for every line of code
eventually provided to the human engineering team, Blitzy will have written it hundreds of times,
validating the output with different agents to get the highest quality code to the enterprise in batch.
Projects then that would normally require dozens of developers working for months can now be
completed with a fraction of the team in weeks, empowering organizations to dramatically
shorten development cycles and bring products to market faster than ever.
If your enterprise is looking to accelerate software development, whether it's large-scale
modernization, refactoring, or just increasing the rate of your SDLC, contact Blitzy at
blitzy.com, that's B-L-I-T-Z-Y dot com, to book a custom demo, or just press Get Started
and start using the product right away.
Today's episode is brought to you by Superintelligent.
Now, you have heard me talk about agent readiness audits probably numerous times at this point.
This is our system that uses voice agents and a hybrid human AI analysis process
to benchmark your agent readiness and map your agent opportunities and give you some really
pointed, actionable next steps to move further down the path in your agentic journey.
But we're coming up on the slow time of the year, and if you want to use this time to
get out ahead of peers and competitors, we're excited to announce something we're calling Agent Summer.
The idea here isn't that complicated. It's basically just an accelerated program to get you
agentified and fast. First of all, it's going to include an Agent Readiness Audit, figuring out
where your biggest agent opportunities are. Next, we're going to support both your internal
change management process, helping you figure out AI policy, data readiness, things like that,
as well as doing action planning around the agent opportunities that are most relevant for you.
And finally, we're going to connect you to the right vendors to actually go and deliver this.
Now, for this, we want to work with a very small handful of companies that really want to move.
We're going to be bundling more than $50,000 of services for something that starts closer to $30,000.
And so if you want to use this summer to jump ahead on your company's agent journey,
email agent at besuper.ai with summer in the subject line, claim one of these limited spots,
and let's go have an agent summer.
Another recent theme we've been exploring is pricing, and Salesforce is apparently taking another
look at their pricing models as agents become a bigger and bigger part of their business.
Customers will now pay 10 cents per action when using Salesforce agents.
Last year, the company was one of the first to experiment with per-use pricing,
rather than following traditional SaaS models of charging per seat.
The agents were priced at $2 per conversation,
with the presumption that they would be used primarily for outbound sales.
The company says that this new pricing structure is intended to be a more attractive way
to pay for non-conversational and internal uses like scanning through emails to look for leads.
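The difference between the two pricing schemes is easy to see with a quick calculation. The sketch below uses only the two prices mentioned above (10 cents per action, $2 per conversation). The episode does not say how many actions a typical job involves, so the action counts are hypothetical, as are the function names.

```python
# Prices quoted in the episode, kept in whole cents to avoid float rounding.
CENTS_PER_ACTION = 10         # new per-action price
CENTS_PER_CONVERSATION = 200  # earlier per-conversation price


def per_action_cost_cents(actions: int) -> int:
    """Cost of a job billed per action, in cents."""
    if actions < 0:
        raise ValueError("actions must not be negative")
    return actions * CENTS_PER_ACTION


def break_even_actions() -> int:
    """Number of actions at which per-action billing equals one conversation."""
    return CENTS_PER_CONVERSATION // CENTS_PER_ACTION


if __name__ == "__main__":
    # Hypothetical job sizes, for illustration only.
    for actions in (1, 5, 20, 50):
        dollars = per_action_cost_cents(actions) / 100
        print(f"{actions:>2} actions -> ${dollars:.2f}")
    print(f"Break-even with one $2 conversation: {break_even_actions()} actions")
```

A short internal job, such as a handful of email scans, comes out far below the old $2 flat rate, which fits the company's stated aim of making non-conversational uses more attractive.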
Salesforce will also now allow existing customers to reallocate spending from software subscriptions
into their AI agent offerings.
Executive VP Bill Patterson said,
for companies who are looking at the future of their workforce,
whether it scales up or scales down,
what the flex agreement gives us
is this ability to move spending
between human labor and digital labor.
Now, I did an entire show a couple of weeks ago
about agent pricing and the implications it has,
and Salesforce is a live-action case study in that.
Effectively, their last price experiment
was imagining one type of use,
but then when they saw another type of use
that didn't work for that pricing, they have to adapt.
I think that this flex agreement idea is really smart and creates a lot of space for them to potentially
be even more nimble with this pricing. But overall, this is just one more sign that nobody exactly
knows how this is going to play out or how they even should think about pricing.
Another company thinking about agents is Walmart. The retailer is preparing for big changes
in the way their consumers shop, or rather how their agents shop. Walmart is apparently starting
to think about how to market their products to the AI agents that they believe will soon take over
the shopping experience. Walmart's
CTO Hari Vasudev said, it will be different. Advertising will have to evolve. So far, most of the
shopping agents we've seen follow a very simple rubric. They either choose the top blue link in a search
or have instructions to look for certain brands in particular categories. But it's highly likely
that as these agents proliferate, we could see an entirely new SEO game evolve, with companies
focused on figuring out how to appeal to these new robotic shoppers. Robert Hetu, the VP
analyst for Retail and Market Research at consulting firm Gartner, also suggested that brands could
lose their direct relationship with customers. And it's difficult to imagine an AI agent developing
a ton of brand loyalty. So Walmart, for their part, is developing their own shopping agent,
but also preparing for most consumers to start using third-party agents. Vasudev says he also
foresees the establishment of an industry protocol, which enables third-party agents to communicate
with a retailer's proprietary agent to serve product recommendations. And by the way, if you are
an entrepreneur out there thinking about what your next opportunity might be, that is a great example
of just how much new infrastructure is going to be built.
A retail industry agent-to-agent protocol
feels niche, and it's a million-dollar-a-year business.
In any case, Hetu believes that we could see a situation
where latency plays a larger role,
with retailers modifying pricing in a split second
to win the business of third-party agents.
Now, Walmart isn't thinking this is going to happen overnight.
The company still does 80% of its business in physical locations,
but very clearly they're getting out ahead of the changes.
Now, it wasn't exactly about the same thing,
but I did also notice this tweet from Perplexity CEO Aravind Srinivas, who wrote,
Hotel bookings natively on Perplexity are quietly growing. It's one of the under-the-radar features
we have right now that has a massive potential to disrupt the ad industry. Google's second
biggest AdWords category, I think. Now, interestingly, I was just experimenting with Perplexity
biggest ad word category, I think. Now, interestingly, I was just experimenting with Perplexity
and Manus last night on a bunch of my own travel searching, although I'm still more on the research
rather than booking front, but I think it's another indicator of how quickly these experiences are going
to converge. Speaking of Perplexity, another report on their next
funding round. The Wall Street Journal reports that the company is in advanced talks to raise a $500 million
round at a $14 billion valuation led by Accel. Now, when it comes to AI venture,
Perplexity's fundraising story is one of the more intriguing to watch right now. On the one hand,
that $14 billion valuation is a huge jump from the $9 billion valuation from their last
funding round in November, which itself was like 300% of their previous valuation just a few months
before that. At the same time, it looks like the valuation was negotiated down, with reports
from March stating that the company was aiming to raise a billion dollars at an $18 billion valuation.
There also seems to be a rotating cast of VCs. The last round was led by Institutional Venture Partners,
but Accel is reportedly taking over for this round. That's very different to recent fundraising
from OpenAI and xAI, which saw existing investors double down as hard as possible.
What makes Perplexity so interesting to watch is that it is by far the most successful quote-unquote
wrapper company, a company that's building a product rather than a model, but that does rub uncomfortably
up against something that the model companies do themselves as well. It actually doesn't surprise me
to see a little bit of volatility in investor conviction just because of how many different opinions there are
around whether that's a viable concern in the long run. In another area of financing, we have some
M&A news with Databricks making another big purchase, paying a billion dollars to acquire database
startup Neon. This will be Databricks' third billion dollar acquisition over the past two years
as they seek to build their AI-first data analytics platform. Neon's tools allow developers to clone
databases and preview changes before they go into production, alongside offering scaling hosting
solutions. Now, the interesting part of this is that Neon has seen an explosion of AI agents
using their platform rather than human developers. Databricks said that recent telemetry data
shows that 80% of the databases provisioned on Neon were created automatically by AI agents
rather than humans. Essentially, Databricks is not just looking to offer agents, but heading downstream
to capture value from the tooling an agentic workforce will require.
Lastly today, a set of rather weird stories surrounding AI safety issues.
xAI's Grok was briefly obsessed with race relations in South Africa this week.
On Wednesday, the chatbot started discussing the claimed white genocide in completely
unrelated topics on X. In one of hundreds of examples, a user asked how many times HBO
had changed its name. Grok gave the answer that HBO had rebranded twice before launching into
a discussion of attacks on white farmers in South Africa as a complete non-sequitur.
In another example, Grok pivoted hard from discussing baseball statistics to discussing South Africa
for no obvious reason.
New York Times investigative journalist Aric Toler posted, I can't stop reading the Grok reply page.
It's going schizo and it can't stop talking about white genocide in South Africa.
Post "@grok is this true" on any post and it'll start talking about "Kill the Boer" and white genocide.
Now, if ChatGPT's recent issues with sycophancy were a high-profile example of AI misalignment,
Elon's chatbot seems to be saying, hold my beer.
I am absolutely not going to get into the political dynamics of this.
It's an extremely hot-button issue.
The U.S. brought in 59 white South Africans under a very specifically targeted refugee program this week,
which generated a ton of controversy.
Elon Musk himself is, of course, a white South African immigrant.
But for our purposes, what this demonstrates is how easily chatbots can go haywire when
system prompts are edited.
On Thursday, xAI addressed the controversy, tweeting,
on May 14th at approximately 3:15 a.m., an unauthorized modification was made to the
Grok response bot's prompt on X.
This change, which directed Grok to provide a specific response on a political topic,
violated xAI's internal policies and core values.
We've conducted a thorough investigation and are implementing measures to enhance Grok's transparency
and reliability.
Moving forward, the company said that they would begin publishing their system prompts on GitHub.
YC founder Paul Graham pointed out the problem, saying,
Grok randomly blurting out opinions about white genocide in South Africa smells to me like
the sort of buggy behavior you get from a recently applied patch.
I sure hope it isn't.
It would be really bad if widely used AIs got editorialized
on the fly by those who controlled them. One upshot of the whole debacle is that we now have
the first commitment from a major AI lab to transparently publish their system prompts. The recent
incident with the sycophantic version of GPT-4o was also caused by a modification to the system
prompt, but we haven't seen a similar commitment from them. Although in separate but somewhat related
news, they did announce a new safety evaluations hub, which they describe as a resource to explore
safety results for our models. Basically, they say they're going to communicate about safety
more proactively. In any case,
jailbreaker extraordinaire, Pliny the Liberator,
has been pushing for this sort of commitment that we got from xAI
as a very bare minimum accountability and transparency measure
and tweeted, sweet, sweet victory, we did it, chat.
And that concludes another fascinating week in the world of AI.
Appreciate you listening or watching, as always.
And until next time, peace.
