The AI Daily Brief: Artificial Intelligence News and Analysis - First Reactions: Claude 3.7 Sonnet and Claude Code
Episode Date: February 26, 2025
Claude 3.7 Sonnet has launched to much fanfare. Along with it comes Claude Code, reinforcing just how much Anthropic has found Claude's core use case in coding. NLW shares the first reactions.
Brought to you by:
KPMG – Go to www.kpmg.us/ai to learn more about how KPMG can help you drive value with our AI solutions.
Vanta - Simplify compliance - https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Transcript
Today on the AI Daily Brief, Anthropic has just launched Claude 3.7 Sonnet.
Before that in the headlines, ChatGPT has hit 400 million weekly active users.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
To join the conversation, follow the Discord link in our show notes.
Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes.
Quick note for the next couple of episodes, we will be audio only.
And this week, we'll be back to our normal video format as well.
We kick off today with an announcement from OpenAI at the end of last week that ChatGPT has hit
400 million weekly active users, surging a full 33% since December.
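As a rough sanity check on those two figures, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the projection assumes the same 33% rate simply repeats each period, which is an assumption for the sake of the example and not something OpenAI has claimed.

```python
import math


def implied_baseline(current: float, growth: float) -> float:
    """Back out the earlier figure from the current one and the growth rate."""
    return current / (1 + growth)


def periods_to_reach(current: float, target: float, growth: float) -> float:
    """Number of equal-length periods needed to reach target if the
    same compound growth rate repeated (a hypothetical assumption)."""
    return math.log(target / current) / math.log(1 + growth)


# 400 million now, up 33% since December (figures from the episode).
december_users = implied_baseline(400e6, 0.33)

# Hypothetical: how many more periods of the same length to hit 1 billion?
periods_to_billion = periods_to_reach(400e6, 1e9, 0.33)

print(f"Implied December figure: {december_users / 1e6:.0f} million")
print(f"Periods to 1 billion at the same rate: {periods_to_billion:.1f}")
```

In other words, a 33% jump to 400 million implies a December figure of roughly 300 million, and a billion users would take a little over three more periods of the same length if that pace held.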
OpenAI hasn't previously disclosed these figures, which show the service is still growing
at a rapid rate.
Chief Operating Officer Brad Lightcap posted, ChatGPT recently crossed 400 million weekly active users.
We feel very fortunate to serve 5% of the world every week.
2 million-plus business users now use ChatGPT at work, and reasoning model API use is up
5x since the o3-mini launch in January. That last number, I think, is hugely significant.
o3-mini has kicked up API reasoning model use 5x.
Lightcap added that GPT-4.5 and GPT-5 are coming soon, with plans to offer unlimited use of
GPT-5 to free users on low inference settings. In comments to CNBC, Lightcap discussed
the gulf between hundreds of millions of free users and relatively slow business adoption,
stating there's a buying cycle there and a learning process that goes into scaling an enterprise
business. AI is going to be like cloud services. It's going to be something where you can't run a business
that ultimately is not really running on these powerful models underneath the surface. However, the
implication, which is completely true from our experience at Superintelligent, is that it just takes time.
Even the most obvious things in the world come up against human and organizational inertia
that has to be pushed through. Turning to other topics, Lightcap discussed the DeepSeek moment
as validation that AI has entered the zeitgeist rather than as a negative for OpenAI. He commented,
DeepSeek is a testament to how much AI has entered the public consciousness in the mainstream.
It would have been unfathomable two years ago. It's a moment that shows how powerful these models are
and how much people really care. Many people pointed out when they saw these numbers
that if this sort of rate of growth continues apace, we're going to see a billion ChatGPT
users in extremely short order. Speaking of GPT-4.5, some companies are getting ready.
The Verge is reporting that GPT-4.5 could be released as soon as this week.
And according to sources familiar with Microsoft's plans, the company is already readying server capacity
for GPT-4.5 and GPT-5.
They expect GPT-4.5 to be released imminently.
GPT-5, on the other hand, is expected to launch in late May, aligning with Microsoft's
Build Developer Conference.
This could represent a much closer working relationship between Microsoft and OpenAI
for this year's releases.
Microsoft was reportedly blindsided by the release of GPT-4o last May.
It offered voice and translation services as well as a big speed boost, all at a cheaper
price than Microsoft services built on GPT-4 Turbo. It took Microsoft until October to overhaul their
services to catch up with OpenAI, who is of course supposed to be their biggest partner here.
Now, there have been obviously lots and lots of rumors about the potential breakup of Microsoft
and OpenAI. But it appears that in this case at least, Microsoft has been given the heads up
this time around, and so presumably we could expect Copilot updates ready to go shortly after
OpenAI's releases. Sam Altman, meanwhile, has been hyping it up, posting last week that, quote,
trying GPT-4.5 has been much more of a feel-the-AGI moment among high-taste testers than I expected.
GPT-5, meanwhile, will, remember, be a much larger rethink of the company's product line.
It'll be the first model to integrate reasoning and non-reasoning into a single model.
OpenAI have also suggested that they will devise a way to apply the right amount of inference to each query,
doing away with the need for the model selector.
Already the rumors are starting to build.
Lisan Al-Gaib suggested that OpenAI could already be testing GPT-4.5 in public,
routing some o3-mini queries to the new model.
Meanwhile, OpenAI rumormonger Riley Coyote passed on whispers that Wednesday will be the release day.
Now, speaking of new models, there is a little bit of controversy swirling around Grok 3's benchmarks,
with some doubting the new model from xAI is really a match for OpenAI's o3-mini.
The controversy deals specifically with the AIME benchmark, a set of competitive math problems.
xAI tested their model using a method known as cons@64, or best of 64.
This involves generating 64 responses and selecting the one that appeared most frequently.
Best of 64 is a well-accepted benchmark standard, so there's no issue with using it per se.
The problem was that xAI compared their result against o3-mini's benchmark, which used a one-shot solution method referred to as pass@1.
OpenAI had presented this one-shot benchmark to demonstrate that o3-mini was better than o1, even when the older model made 64 attempts.
In other words, xAI wasn't making an apples-to-apples comparison.
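To make the difference between the two scoring methods concrete, here is a minimal sketch for a single problem. The sampled answers are invented for illustration, and the function names are not taken from either lab's evaluation code. It scores pass@1 as the fraction of individual samples that are correct, and cons@64 as whether the most frequent answer among the 64 samples is correct, which is the majority-vote procedure described above.

```python
from collections import Counter


def pass_at_1(samples: list[str], correct: str) -> float:
    """Expected accuracy of a single attempt: the fraction of
    individual samples that match the correct answer."""
    return sum(answer == correct for answer in samples) / len(samples)


def cons_at_k(samples: list[str], correct: str) -> float:
    """Consensus score: 1.0 if the most frequent answer across all
    samples is the correct one, otherwise 0.0."""
    most_common_answer, _count = Counter(samples).most_common(1)[0]
    return 1.0 if most_common_answer == correct else 0.0


# Hypothetical data: 64 sampled answers to one math problem.
# The right answer ("204") shows up most often, but in under half the samples.
samples = ["204"] * 28 + ["210"] * 20 + ["96"] * 16
correct = "204"

single_shot = pass_at_1(samples, correct)
consensus = cons_at_k(samples, correct)

print(f"pass@1:  {single_shot:.4f}")
print(f"cons@64: {consensus:.1f}")
```

On this made-up problem the model gets full credit under consensus scoring while being right on fewer than half of its individual attempts, which is why putting one model's cons@64 number next to another model's pass@1 number flatters the first.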
It appeared particularly galling to the OpenAI team as xAI was promoting Grok 3
as the world's smartest AI. Boris Power, the head of applied research at OpenAI, posted,
disappointing to see the incentives for the Grok team to cheat and deceive in evals. TLDR, o3-mini
is better in every eval compared to Grok 3. Grok 3 is genuinely a decent model, but no need
to oversell. Tony Wu, the co-founder of xAI, commented,
obsession with the metric pass@1 is just stupid. To compare fairly, you have to fix the test
compute budget, and without disclosing what test-time compute method is used behind o3-mini,
we cannot really compare. At the end of the day, it's just about which one is a better product.
Also, depending on the product, e.g. consumer product versus API, you may have different requirements in terms of latency or total FLOPs for test-time compute.
Try Grok 3 and tell me if you think it's better or worse than o3-mini.
Now, this discussion, which on first glance one could be forgiven for viewing as just the inherent competitiveness of two teams,
did spill over into the rest of the AI research community, who discussed how to deal with benchmarks moving forward.
Teortaxes compiled all of the available benchmarks in a single chart with both one-shot and best-of-64 variants, commenting,
I actually believe Grok looks good there, and OpenAI's test-time compute chicanery behind
o3-mini-high pass@1 deserves more scrutiny.
Nathan Lambert wrote, I think it's safe to say that xAI and OpenAI both have committed
minor chart crimes with thinking models.
Frankly, there are no industry norms to lean on.
Just expect noise.
It's fine.
May the best models win.
Do your own evals anyway.
AIME is practically useless to 99% of people.
And this, I think, is for sure the key point.
Every model lab still pummels us over the head with these benchmarks
as soon as they release their newest thing, saying,
look, we've improved blah, blah, blah, blah, blah.
And it fundamentally doesn't matter.
I'm sorry, but at this point,
I am fully on the train that these benchmarks are totally cooked.
There's almost no relevant signal in that,
that all of the models now are at the very high end of these things,
and that they just tell you almost nothing.
I hope we get some more good work on thinking about new types of evaluation
because we desperately need it.
But at this stage, I think that there's no other reasonable answer
if you're willing to take the time and the resources to do it,
than to just try every type of query and every type of prompt
and every type of challenge against all of the state of the art
and see which one does best.
That, or alternatively, just pick one,
assume that it's going to be close to as good as the state of the art
and will be as good as the state of the art in a couple of weeks
when they ship the latest update.
Speaking of which, I think that leads perfectly to our main episode topic,
which is Anthropic's launch of Claude 3.7 Sonnet.
Today's episode is brought to you by Vanta.
Trust isn't just earned, it's demanded.
Whether you're a startup founder navigating your first audit
or a seasoned security professional scaling your GRC program,
proving your commitment to security has never been more critical or more complex.
That's where Vanta comes in.
Businesses use Vanta to establish trust by automating compliance needs
across over 35 frameworks like SOC 2 and ISO 27001.
Centralize security workflows, complete questionnaires
up to 5x faster, and proactively manage vendor risk. Vanta can help you start or scale up your
security program by connecting you with auditors and experts to conduct your audit and set up your
security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you
time back so you can focus on building your company. Join over 9,000 global companies like
Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time.
For a limited time, this audience gets $1,000 off Vanta at Vanta.com slash NLW.
That's V-A-N-T-A dot com slash N-L-W for $1,000 off.
If there is one thing that's clear about AI in 2025, it's that the agents are coming.
Vertical agents by industry, horizontal agent platforms, agents per function.
If you are running a large enterprise, you will be experimenting with agents next year.
And given how new this is, all of us are going to be back in pilot mode.
That's why Superintelligent is offering a new product for the beginning of this year.
It's an agent readiness and opportunity audit.
Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready,
and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business.
If you are interested in the agent readiness and opportunity audit, reach out directly to me, NLW at besuper.ai, put the word agent in the
subject line so I know what you're talking about, and let's have you be a leader in the most
dynamic part of the AI market. Welcome back to the AI Daily Brief. Anthropic has just launched
Claude 3.7 Sonnet, what they call their most intelligent model to date. Similar to how OpenAI
appears to be describing what GPT-5 is supposed to be, Anthropic calls this a hybrid reasoning model
that, quote, produces near-instant responses or extended step-by-step thinking, one model, two
ways to think. Now, holding aside whether it actually does that well, it is extremely telling,
I think, that this is the new norm going forward. No more of this separation between reasoning and
non-reasoning models. It's just one model to rule them all that can navigate between the two.
Now, of course, as you would expect, Anthropic announced a bunch of benchmarks to demonstrate
how Claude 3.7 Sonnet is a big improvement over its predecessor. They showed an increase in
performance on everything from GPQA Diamond, the graduate level reasoning, to the AIME.
I've just been on my rant about evaluation benchmarks, so I won't repeat that again.
Ultimately, I think what you can say is that even based on their own sharing, in most of these
cases, it is a nudge forward rather than a leap forward.
The one exception to that, which we'll come back to, is around coding, where the SWE-bench
Verified tests saw a huge improvement, from 49% with Claude 3.5 Sonnet, all the way up to
62.3 to 70.3% with Claude 3.7 Sonnet.
Agentic tool use was also way up, showing a meaningful
increase in performance over Claude 3.5 Sonnet, as well as OpenAI's o1. Indeed, this is what
led Anthropic to say that Claude 3.7 is a state-of-the-art model for both coding and agentic tool
use. They write, in developing it, we've optimized somewhat less for math and computer science
competition problems and instead shifted focus towards real-world tasks that better reflect
the needs of our users. So at least someone is hearing these rants about benchmarks and what we
should be thinking about. Now, it's very clear that coding is the whole ballgame right now for
Anthropic, so we're going to come back to that in a moment. But before that, let's get some
first reactions. Rowan Cheung from The Rundown writes, Anthropic just dropped Claude 3.7
Sonnet, the best coding AI model in the world. I was an early tester and it blew my mind.
It created this Minecraft clone in one prompt and made it instantly playable in artifacts.
Professor Ethan Mollick writes, it is very, very good. Its vibe coding from language is impressive.
Here's a one-shot prompted video game based on the Melville story, Bartleby, the Scrivener.
Box's Aaron Levie writes, Box has been doing evals on it with enterprise docs, and it's
extremely strong at hard math, logic, content generation, and complex reasoning use cases.
Box AI will support Claude 3.7 Sonnet later today in the Box AI studio.
Adonna Singh writes,
Dude, what? I just asked how many R's it has.
Claude Sonnet 3.7 spun up an interactive learning platform for me to learn it myself.
And indeed, while the general impressions were favorable, it's because a lot of those
impressions were about coding.
CJZZZZE writes, Claude Sonnet 3.7 is built for coders.
Don't evaluate it on web search and multimodality evals.
Claude is doubling down on what they know the best, AI coding.
Matt Shumer shared the SWE-bench Verified benchmarks and said this seems to be a huge step up.
Flower Slop writes,
Claude 3.7 seems to be way ahead in coding compared to o1, o3-mini-high, R1, and Grok 3,
according to my first vibe test.
The test I like is whether a model can build a fully functional Doodle Jump clone from scratch.
It's right at the edge of what SOTA models almost get right, but not quite.
Until now.
o1 tried, but the window closed instantly with a console error.
o3-mini-high made a basic version, but platforms were too
far apart to reach. R1 had no starting platform, so you'd just fail instantly. Grok 3, even with
extra thinking, also crashed instantly. Claude 3.7 nailed it. First try, one prompt, fully working
with the prettiest design and even a funny little doodler. It simply just did it without any flaws
or bugs. And indeed, this is perhaps why that was not the only part of the announcement.
Head of Claude Relations Alex Albert writes, we're opening limited access to a research
preview of a new agentic coding tool we're building: Claude Code. You'll get Claude-powered code
assistance, file operations, and task execution directly from your terminal. After installing
Claude Code, simply run the claude command from any directory to get started. Ask questions about
your code base, let Claude edit files and fix errors, or even have it run bash commands and create
Git commits. Alex continues, within Anthropic, Claude Code is quickly becoming another tool we can't
do without. Engineers and researchers across the company use it for everything from major code refactors
to squashing commits to generally handling the toil of coding. He shared a message from Slack that
said, I just want to say Claude Code is very quickly taking over my life and becoming my go-to
tool. Truly think there's something very special here. Pietro Schirano explains it a little bit further.
Claude Code is a command line tool that lets developers delegate substantial engineering tasks
to Claude directly from their terminal. In early testing, Claude Code completed tasks in one pass
that would normally take 45 minutes of manual work. Not Adam Paul writes,
Claude Code is an in-terminal coding agent and it's objectively the coolest thing a frontier
company has shipped since GPT-4. Here I get it to read my project specs and tell me what's left to
implement against the code base. Haven't even started coding with it yet, and I'm hooked.
Now, to the extent that anyone had any concern, it was around price. Harrison Kinsey writes,
Claude Code is really nice. The UI is so wonderful. I like the action type rules. Well done.
Prepare to spend up to $5 an hour running it, potentially more. Deja Vu Coda responded, more like
$5 per 20 minutes. Others, like Anthropic's Catherine Olsson, jumped in to talk about where it
wasn't perfect. She writes, Claude Code is very useful, but it can still get confused.
A few quick tips from my experience coding with it at Anthropic.
One, work with a clean commit so it's easy to reset all the changes.
Two, sometimes I work on two dev boxes at the same time.
One for me, one for Claude Code.
We're both trying ideas in parallel.
And so on and so forth.
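The first of those tips, starting from a clean commit, is easy to check before handing a repository to any coding agent. Here is a minimal sketch, assuming git is installed. The helper names are made up for illustration and are not part of Claude Code. It relies on git status --porcelain printing nothing when there are no uncommitted changes.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """`git status --porcelain` prints one line per changed or untracked
    file, so empty output means the working tree is clean."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir: str = ".") -> bool:
    """Run git in repo_dir and report whether there are uncommitted changes.
    Raises CalledProcessError if repo_dir is not a git repository."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return is_clean(result.stdout)
```

If the check passes before the agent starts, anything it changes can be inspected with git diff or thrown away with git reset --hard, which is what makes the reset easy.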
And I actually think that this is a super valuable category of information.
Not only does sharing this stuff build trust with your users,
it also guides them to use your tools more effectively.
Overall, I tend to agree with Benjamin De Kraker, who writes,
I have a hunch that Claude Code, the Terminal Coder,
is a bigger deal than many people realize.
Certainly there is a sense that, combined with the other updates,
we are in the middle of another big shift.
Professor Ethan Mollick again just published a new piece
on his One Useful Thing blog called A New Generation of AIs:
Claude 3.7 and Grok 3.
Yes, AI suddenly got better again.
For tomorrow's episode, I'm going to be doing a look
at what's evolving faster and what's evolving slower in AI
than people might have imagined,
and so we'll definitely be coming back to some of this assessment.
For now, though, I'm excited to go dive into Claude 3.7 Sonnet myself,
And I hope that when you test it out, you come back and tell us what you found as well.
For now, that is going to do it for today's episode of the AI Daily Brief.
Appreciate you listening, as always.
And until next time, peace.
