The AI Daily Brief: Artificial Intelligence News and Analysis - First Reactions: Claude 3.7 Sonnet and Claude Code

Starting point is 00:00:00 Today on the AI Daily Brief, Anthropic has just launched Claude 3.7 Sonnet. Before that in the headlines, ChatGPT has hit 400 million weekly active users. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Quick note: for the next couple of episodes, we will be audio only. And this week, we'll be back to our normal video format as well. We kick off today with an announcement from OpenAI at the end of last week that ChatGPT has hit

Starting point is 00:00:39 400 million weekly active users, surging a full 33% since December. OpenAI hasn't previously disclosed these figures, which show the service is still growing at a rapid rate. Chief Operating Officer Brad Lightcap posted, ChatGPT recently crossed 400 million weekly active users. We feel very fortunate to serve 5% of the world every week. 2 million-plus business users now use ChatGPT at work, and reasoning model API use is up 5x since the o3-mini launch in January. That last number, I think, is hugely significant. o3-mini has kicked up API reasoning model use 5x.

Starting point is 00:01:14 Lightcap added that GPT-4.5 and GPT-5 are coming soon, with plans to offer unlimited use of GPT-5 to free users on low inference settings. In comments to CNBC, Lightcap discussed the gulf between hundreds of millions of free users and relatively slow business adoption, stating there's a buying cycle there and a learning process that goes into scaling an enterprise business. AI is going to be like cloud services. It's going to be something where you can't run a business that ultimately is not really running on these powerful models underneath the surface. However, the implication, which is completely true from our experience at Superintelligent, is that it just takes time. Even the most obvious things in the world come up against human and organizational inertia

Starting point is 00:01:50 that has to be pushed through. Turning to other topics, Lightcap discussed the DeepSeek moment as validation that AI has entered the zeitgeist rather than as a negative for OpenAI. He commented, DeepSeek is a testament to how much AI has entered the public consciousness in the mainstream. It would have been unfathomable two years ago. It's a moment that shows how powerful these models are and how much people really care. Many people pointed out when they saw these numbers that if this sort of rate of growth continues apace, we're going to see a billion ChatGPT users in extremely short order. Speaking of GPT-4.5, some companies are getting ready. The Verge is reporting that GPT-4.5 could be released as soon as this week.

Starting point is 00:02:30 And according to sources familiar with Microsoft's plans, the company is already readying server capacity for GPT-4.5 and GPT-5. They expect GPT-4.5 to be released imminently. GPT-5, on the other hand, is expected to launch in late May, aligning with Microsoft's Build developer conference. This could represent a much closer working relationship between Microsoft and OpenAI for this year's releases. Microsoft was reportedly blindsided by the release of GPT-4o last May.

Starting point is 00:02:55 It offered voice and translation services as well as a big speed boost, all at a cheaper price than Microsoft services built on GPT-4 Turbo. It took Microsoft until October to overhaul their services to catch up with OpenAI, who is of course supposed to be their biggest partner here. Now, there have obviously been lots and lots of rumors about the potential breakup of Microsoft and OpenAI. But it appears that in this case at least, Microsoft has been given the heads up, and so presumably we could expect Copilot updates ready to go shortly after OpenAI's releases. Sam Altman, meanwhile, has been hyping it up, posting last week that, quote, trying GPT-4.5 has been much more of a feel-the-AGI moment among high-taste testers than I expected.

Starting point is 00:03:32 GPT-5, meanwhile, will, remember, be a much larger rethink of the company's product line. It'll be the first model to integrate reasoning and non-reasoning into a single model. OpenAI has also suggested that they will devise a way to apply the right amount of inference to each query, doing away with the need for the model selector. Already the rumors are starting to build. Lisan Al-Gaib suggested that OpenAI could already be testing GPT-4.5 in public, routing some o3-mini queries to the new model. Meanwhile, OpenAI rumormonger Riley Coyote passed on whispers that Wednesday will be the release day.

Starting point is 00:04:04 Now, speaking of new models, there is a little bit of controversy swirling around Grok 3's benchmarks, with some doubting the new model from xAI is really a match for OpenAI's o3-mini. The controversy deals specifically with the AIME benchmark, a set of competitive math problems. xAI tested their model using a method known as cons@64, or best of 64. This involves generating 64 responses and selecting the one that appeared most frequently. Best of 64 is a well-accepted benchmark standard, so there's no issue with using it per se. The problem was that xAI compared their result against o3-mini's benchmark, which used a one-shot solution method referred to as pass@1. OpenAI had presented this one-shot benchmark to demonstrate that o3-mini was better than o1, even when the older model made 64 attempts.
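
To make the difference between the two scoring methods concrete, here is a minimal Python sketch. It is an illustration only, not either lab's evaluation code: the function names `pass_at_1` and `cons_at_k` and the sample answers are hypothetical, and it assumes each model response reduces to a single final answer string.

```python
from collections import Counter
from typing import Sequence


def pass_at_1(samples: Sequence[str], correct: str) -> float:
    """Estimate pass@1: the fraction of single attempts that are correct."""
    return sum(s == correct for s in samples) / len(samples)


def cons_at_k(samples: Sequence[str], correct: str) -> bool:
    """Consensus (majority vote): is the most frequent answer the correct one?"""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == correct


# Hypothetical data: 64 sampled answers to one problem whose answer is "204".
samples = ["204"] * 28 + ["198"] * 20 + ["210"] * 16

print(pass_at_1(samples, "204"))  # 28 of 64 single attempts are right
print(cons_at_k(samples, "204"))  # the majority answer is still "204"
```

In this made-up case the model is right on fewer than half of its individual attempts, yet the majority vote still lands on the correct answer, which is why a consensus score over 64 samples and a single-attempt score are not directly comparable.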

Starting point is 00:04:48 In other words, xAI wasn't making an apples-to-apples comparison. It appeared particularly galling to the OpenAI team as xAI was promoting Grok 3 as the world's smartest AI. Boris Power, the head of applied research at OpenAI, posted, disappointing to see the incentives for the Grok team to cheat and deceive in evals. TL;DR, o3-mini is better in every eval compared to Grok 3. Grok 3 is genuinely a decent model, but no need to oversell. Tony Wu, the co-founder of xAI, commented, obsession with metric pass@1 is just stupid. To compare fairly, you have to fix the test compute budget, and without disclosing what test time compute method is used behind o3-mini,

Starting point is 00:05:20 we cannot really compare. At the end of the day, it's just about which one is a better product. Also, depending on the product, e.g. consumer product vs. API, you may have different requirements in terms of latency or total FLOPs for test time compute. Try Grok 3 and tell me if you think it's better or worse than o3-mini. Now, this discussion, which at first glance one could be forgiven for viewing as just the inherent competitiveness of two teams, did spill over into the rest of the AI research community, who discussed how to deal with benchmarks moving forward. Tehratak compiled all of the available benchmarks in a single chart with both one-shot and best of 64 variants, commenting, I actually believe Grok looks good there, and OpenAI's test-time compute chicanery behind o3-mini-high pass@1 deserves more scrutiny.

Starting point is 00:05:57 Nathan Lambert wrote, I think it's safe to say that xAI and OpenAI both have committed minor chart crimes with thinking models. Frankly, there are no industry norms to lean on. Just expect noise. It's fine. May the best models win. Do your own evals anyway. AIME is practically useless to 99% of people.

Starting point is 00:06:12 And this, I think, is for sure the key point. Every model lab still pummels us over the head with these benchmarks as soon as they release their newest thing, saying, look, we've improved blah, blah, blah, blah, blah. And it fundamentally doesn't matter. I'm sorry, but at this point, I am fully on the train that these benchmarks are totally soaked. There's almost no relevant signal in that,

Starting point is 00:06:34 that all of the models now are at the very high end of these things, and that they just tell you almost nothing. I hope we get some more good work on thinking about new types of evaluation because we desperately need it. But at this stage, I think that there's no other reasonable answer, if you're willing to take the time and the resources to do it, than to just try every type of query and every type of prompt and every type of challenge against all of the state of the art

Starting point is 00:06:56 and see which one does best. That, or alternatively, just pick one, assume that it's going to be close to as good as the state of the art and will be as good as the state of the art in a couple of weeks when they ship the latest update. Speaking of which, I think that leads perfectly to our main episode topic, which is Anthropic's launch of Claude 3.7 Sonnet. Today's episode is brought to you by Vanta.

Starting point is 00:07:20 Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. Centralize security workflows, complete questionnaires

Starting point is 00:07:46 up to 5x faster and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time. For a limited time, this audience gets $1,000 off Vanta at vanta.com/nlw. That's V-A-N-T-A dot com slash N-L-W for $1,000 off. If there is one thing that's clear about AI in 2025, it's that the agents are coming.

Starting point is 00:08:29 Vertical agents by industry, horizontal agent platforms, agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode. That's why Superintelligent is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business. If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw@besuper.ai, put the word agent in the

Starting point is 00:09:14 subject line so I know what you're talking about, and let's have you be a leader in the most dynamic part of the AI market. Welcome back to the AI Daily Brief. Anthropic has just launched Claude 3.7 Sonnet, what they call their most intelligent model to date. Similar to how OpenAI appears to be describing what GPT-5 is supposed to be, Anthropic calls this a hybrid reasoning model that, quote, produces near-instant responses or extended step-by-step thinking, one model, two ways to think. Now, holding aside whether it actually does that well, it is extremely telling, I think, that this is the new norm going forward. No more of this separation between reasoning and non-reasoning models. It's just one model to rule them all that can navigate between the two.

Starting point is 00:09:57 Now, of course, as you would expect, Anthropic announced a bunch of benchmarks to demonstrate how Claude 3.7 Sonnet is a big improvement over its predecessor. They showed an increase in performance on everything from GPQA Diamond, the graduate-level reasoning benchmark, to the AIME. I've just been on my rant about evaluation benchmarks, so I won't repeat that again. Ultimately, I think what you can say is that even based on their own sharing, in most of these cases, it is a nudge forward rather than a leap forward. The one exception to that, which we'll come back to, is around coding, where the SWE-bench Verified test saw a huge improvement, from 49% with Claude 3.5 Sonnet, all the way up to

Starting point is 00:10:34 62.3% to 70.3% with Claude 3.7 Sonnet. Agentic tool use was also way up, showing a meaningful increase in performance over Claude 3.5 Sonnet, as well as OpenAI's o1. Indeed, this is what led Anthropic to say that Claude 3.7 is a state-of-the-art model for both coding and agentic tool use. They write, in developing it, we've optimized somewhat less for math and computer science competition problems and instead shifted focus towards real-world tasks that better reflect the needs of our users. So at least someone is hearing these rants about benchmarks and what we should be thinking about. Now, it's very clear that coding is the whole ballgame right now for

Starting point is 00:11:09 Anthropic, so we're going to come back to that in a moment. But before that, let's get some first reactions. Rowan Cheung from The Rundown writes, Anthropic just dropped Claude 3.7 Sonnet, the best coding AI model in the world. I was an early tester and it blew my mind. It created this Minecraft clone in one prompt and made it instantly playable in Artifacts. Professor Ethan Mollick writes, it is very, very good. Its vibe coding from language is impressive. Here's a one-shot prompted video game based on the Melville story, Bartleby, the Scrivener. Box's Aaron Levie writes, Box has been doing evals on it with enterprise docs, and it's extremely strong at hard math, logic, content generation, and complex reasoning use cases.

Starting point is 00:11:43 Box AI will support Claude 3.7 Sonnet later today in the Box AI Studio. Adonna Singh writes, Dude, what? I just asked how many R's it has. Claude Sonnet 3.7 spun up an interactive learning platform for me to learn it myself. And indeed, while the general impressions were favorable, it's because a lot of those impressions were about coding. CJZZZZE writes, Claude Sonnet 3.7 is built for coders. Don't evaluate it on web search and multimodality evals.

Starting point is 00:12:08 Claude is doubling down on what they know the best, AI coding. Matt Shumer shared the SWE-bench Verified benchmarks and said this seems to be a huge step up. Flower Slop writes, Claude 3.7 seems to be way ahead in coding compared to o1, o3-mini-high, R1, and Grok 3, according to my first vibe test. The test I like is whether a model can build a fully functional Doodle Jump clone from scratch. It's right at the edge of what SOTA models almost get right, but not quite. Until now.

Starting point is 00:12:33 o1 tried, but the window closed instantly with a console error. o3-mini-high made a basic version, but platforms were too far apart to reach. R1 had no starting platform, so you'd just fail instantly. Grok 3, even with extra thinking, also crashed instantly. Claude 3.7 nailed it. First try, one prompt, fully working with the prettiest design and even a funny little doodler. It simply just did it without any flaws or bugs. And indeed, this is perhaps why that was not the only part of the announcement. Head of Claude Relations, Alex Albert, writes, we're opening limited access to a research preview of a new agentic coding tool we're building: Claude Code. You'll get Claude-powered code

Starting point is 00:13:08 assistance, file operations, and task execution directly from your terminal. After installing Claude Code, simply run the claude command from any directory to get started. Ask questions about your code base, let Claude edit files and fix errors, or even have it run bash commands and create Git commits. Alex continues, within Anthropic, Claude Code is quickly becoming another tool we can't do without. Engineers and researchers across the company use it for everything from major code refactors to squashing commits to generally handling the toil of coding. He shared a message from Slack that said, I just want to say Claude Code is very quickly taking over my life and becoming my go-to tool. Truly think there's something very special here. Pietro Schirano explains it a little bit further.

Starting point is 00:13:43 Claude Code is a command line tool that lets developers delegate substantial engineering tasks to Claude directly from their terminal. In early testing, Claude Code completed tasks in one pass that would normally take 45 minutes of manual work. Not Adam Paul writes, Claude Code is an in-terminal coding agent and it's objectively the coolest thing a frontier company has shipped since GPT-4. Here I get it to read my project specs and tell me what's left to implement against the code base. Haven't even started coding with it yet, and I'm hooked. Now, to the extent that anyone had any concern, it was around price. Harrison Kinsley writes, Claude Code is really nice. The UI is so wonderful. I like the action type rules. Well done.

Starting point is 00:14:16 Prepare to spend up to $5 an hour running it, potentially more. Deja Vu Coda responded, more like 5 USD per 20 minutes. Others, like Anthropic's Catherine Olsson, jumped in to talk about where it wasn't perfect. She writes, Claude Code is very useful, but it can still get confused. A few quick tips from my experience coding with it at Anthropic. One, work with a clean commit so it's easy to reset all the changes. Two, sometimes I work on two dev boxes at the same time. One for me, one for Claude Code. We're both trying ideas in parallel.

Starting point is 00:14:43 And so on and so forth. And I actually think that this is a super valuable category of information. Not only does sharing this stuff build trust with your users, it also guides them to use your tools more effectively. Overall, I tend to agree with Benjamin De Kraker, who writes, I have a hunch that Claude Code, the Terminal Coder, is a bigger deal than many people realize. Certainly there is a sense that combined with the other updates,

Starting point is 00:15:04 we are in the middle of another big shift. Professor Ethan Mollick again just published a new piece on his One Useful Thing blog called A New Generation of AIs: Claude 3.7 and Grok 3. Yes, AI suddenly got better again. For tomorrow's episode, I'm going to be doing a look at what's evolving faster and what's evolving slower in AI than people might have imagined,

Starting point is 00:15:22 and so we'll definitely be coming back to some of this assessment. For now, though, I'm excited to go dive into Claude 3.7 Sonnet myself, and I hope that when you test it out, you come back and tell us what you found as well. For now, that is going to do it for today's episode of the AI Daily Brief. Appreciate you listening, as always. And until next time, peace.
