Everyday AI Podcast – An AI and ChatGPT Podcast - EP 469: Claude 3.7 Sonnet - World’s first hybrid AI model. How it works and when to use it
Episode Date: February 25, 2025The world's first hybrid LLM is here. We've been waiting since June for Anthropic's next heavyweight model. With Claude Sonnet 3.7, not only is that wait over, but we also have the worl...d's first hybrid model. What's it mean? And how should you use it? Newsletter: Sign up for our free daily newsletterMore on this Episode: Episode PageJoin the discussion: Ask Jordan questions on ClaudeUpcoming Episodes: Check out the upcoming Everyday AI Livestream lineupWebsite: YourEverydayAI.comEmail The Show: info@youreverydayai.comConnect with Jordan on LinkedInTopics Covered in This Episode:1. Overview of Claude 3.7 Sonnet2. Performance Benchmarks3. Potential Use Cases4. Discussion on Hybrid AI ModelsTimestamps:00:00 Apple's $500B AI & Manufacturing Investment10:40 Claude SONNET Summarized Chain of Thought16:03 Claude's Restrictive Usage Limits22:31 AI Control: Balancing Innovation and Cost25:18 "Front End User Model Options"28:54 Anthropic's Claude Challenges Market Norms38:33 Explaining Local Code Interaction41:22 Live Coding: GitHub and Replit Demo49:11 "Episode Release Day Impact Analysis"51:29 Tuesday Episodes Analysis and Challenges59:01 Painter and Match Jokes01:02:58 "Deep Research in Generative AI"01:06:28 Concerns Over Costly AI Hybrid ModelsKeywords:large language model, Anthropic, Claude 3.7 SONNET, hybrid AI model, generative AI, GPT-4, OpenAI, Claude Code, API users, Google Gemini, code assist, Apple, AI infrastructure, Microsoft, meta, generative AI advancements, LLM updates, AI news, hybrid model system, transformer models, extended thinking, software development, reasoning models, Gemini code assist, Google Gemini two point o model, Visual Code Studio, software engineering, MMLU benchmarks, coding assistance, generative AI space, artificial general intelligence, chain of thought, agentic toolsSend Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info) Start Here ▶️Not sure where to start when it comes to AI? Start with our Start Here Series. You can listen to the first drop -- Episode 691 -- or get free access to our Inner Cricle community and all episodes: StartHereSeries.com Also, here's a link to the entire series on a Spotify playlist.
Transcript
Discussion (0)
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Meet Firefly AI Assistant, now live in Adobe Firefly, the All In One Creative AI Studio.
Just describe what you want to create and the assistant handles the rest,
orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome.
The assistant accelerates execution.
Another week, another state of the art, large language model release.
But this one from Anthropic is a little different.
It's actually the first of its kind.
Because when Anthropic just released its Claude 3.7 sonnet, they became the first company to release a hybrid large language model.
All right.
So we're going to be talking today about what that is, what it means, how it works,
and when you should actually use this new model from Anthropic.
I hope you're excited for this show.
I am welcome if you're new here to Everyday AI.
What's going on, y'all?
My name is Jordan Wilson, and this is Everyday AI.
This thing is for you.
This is your daily live stream podcast and free daily newsletter helping us all,
not just keep up with Gena AI advancements and LLM updates,
but how we can use it to get ahead.
I want you to be the smartest person in AI at your company.
And this is your cheat code.
So if you haven't already go to your everyday AI.com,
that is where you can sign it for our free daily newsletter.
Yeah, maybe you're listening to this podcast for the first time.
If so, thank you.
Make sure to check out the show notes.
There's going to be a lot of other information.
But probably the most important is our website,
because each and every day in our newsletter, we recap exclusive insights from this exact podcast,
as well as giving you every other piece of news and update that you need to stay ahead
in the generative AI space, as well as you can go listen to like 500 episodes on our website,
all sort of a category.
So make sure you go check that out.
All right.
So I am extremely excited to talk about the new Claude 3.7 Sonnet.
I think it's going to change how a lot of people are using large language models, both for the good and for the bat.
But before we get into that, let's first start out as we do most days by recapping the biggest AI news.
So first, Google has launched a free version of its AI-powered coding assistance, Gemini Code Assist,
aimed at solo developers, students, freelancer, startups, and hobbyists.
So the new free public preview offers up to 180,000 monthly code completions,
significantly exceeding the 2,000 completions for free offered by competitors like GitHub
co-Pilots free tier.
So it is powered by the Google Gemini 2.0 model, and it can generate entire codebox,
auto-complete code, debug, and assist developers via a chatbot interface.
So users can instruct the assistant in natural language, such as asking
it to create specific code snippets or modify existing applications.
So Gemini Code Assist supports 38 programming languages and integrates with popular developer
environments like Visual Code Studio, GitHub, and JetBraids.
All right.
Our next piece of AI news, Apple, a couple of years, maybe too late.
I don't know, but they're making a splash with a reported $500 billion investment over the next
for years into AI infrastructure, signaling a major push into not just AI, but American manufacturing
and technology.
So according to Apple CEO, Tim Cook, this commitment reflects confidence in the future of American
innovation and aims to strengthen the company's role in AI and advanced manufacturing.
So a key part of this investment, $500 billion with a B, includes the development of a key new
manufacturing facility in Houston, providing thousands of jobs to produce servers designed for
AI cloud computing. These servers will feature the Apple Silicon and offer cutting edge
security and performance capabilities. So the integration of Apple Intelligence, the company's
AI platform, could further transform healthcare specifically by leveraging its global network
of over 2 billion active devices to provide innovative health tracking and data insights.
So Apple investments come amid a broader AI spending race with competitors like META, spending $65 billion, Amazon spending $100 billion, and Project Stargate, which is $500 billion over five years, also ramping up their AI infrastructure and innovation budgets.
Speaking of, that's our last piece of AI news, Microsoft, just on the same day that we get reports that Apple is going all all in with a $500 billion investment.
reportedly Microsoft is canceling leases for a couple of 100 megawatts of U.S. data center capacity
equivalent to about two full data centers, and that's according to a report from T.D. Cohen.
So this move raises concerns about whether Microsoft, obviously a global leader in AI investment,
may be securing more AI computing capacity than it needs in the long term.
So the cancellations involve agreements with private operators and a slowdown,
converting statements of qualification, which are typically precursors to former formal leases.
So TD Cohen speculates that Open AI, which is backed heavily by Microsoft, may be shifting
some of its workloads to Oracle as part of a new partnership, which may be causing
Microsoft to cancel or change some of its longer term investments.
So Microsoft, which owns and operates many of its own data centers, is also reallocating billions of
dollars in infrastructure investments, potentially shifting focus back to the U.S. from international
projects. So despite these adjustments, Microsoft reiterated its $80 billion spending target for an AI
data center infrastructure for the fiscal year ending in June. So analysts suggest that Microsoft
could be in an oversupply position, meaning it may have overestimated the immediate demand for
AI computing power. So it'll be interesting to see how those stories play out, especially
happening at the same time. I mean, Apple, you know, making a huge splash with a $500 billion investment
where we get reports that Microsoft may be slightly scaling back. All right. Enough. If you want more
AI news, make sure you can go get it at our website. Sign up for the free daily newsletter,
your everyday AI.com. All right, let's get into it. Let's talk the world's first large language
model hybrid.
All right.
And that is with Claude 3.7 sonnets.
All right.
So it is the world's first publicly available hybrid AI model.
So what that means, and we're going to get more into this, right?
I've been talking about this on the show now for, I don't know, at least six months since
Open AI kicked off this reasoner race.
Right.
So you essentially think of it when it comes to generative AI in large language models.
I know I hate to use terms like old school.
when technically, you know, this space is only like, I don't know, six years old,
you know, at least commercially available, you know, the GPT three technology, right?
I would say it would be the first large language model that was popularized commercially
a couple of years before the chat GPT release.
So you have your kind of quote unquote old school transformer models,
and then you have your quote unquote new school reasoning models that kind of use this advanced
thinking, all right?
And right now, those are two very separate things.
So as an example, Open AI, the leader in large language models, you know, they have their GPT4O,
still an industry leading model, even though it's technically older, but that is its kind of quote unquote old school transformer model.
And then they have their newer kind of reasoning models that use logic kind of under the hood.
And that is, you know, 01, 01, 03 mini, oh, three mini, hi.
Yeah, these names suck, right?
But you know, you essentially have these two very different types of models that excel at two very different kinds of tasks.
So now with Anthropic, they are essentially merging this together in Claude 3.5, sorry, Claude 3.7 sonnet, and it kind of does both.
And this new hybrid system will kind of decide on its own when it should use more of this advanced thinking versus when it should just straight out spit an answer to you without really thinking.
it through. All right. So let's go over a little bit. And it's not just the 3.7 sonnet.
They also announced Claude Code, which I think is an extremely big move from Anthropic and kind of
tips its cap into where it's actually competing in. So more on that in a second. But hey,
live stream audience. Thanks for joining. Appreciate y'all tuning in. We've got an international audience
today. So thanks to our, you know, YouTube audience. We got Big Bogey Face and Sandra and Sam,
Michelle, thanks for joining on the LinkedIn crew. We have Dr. Harvey Castro, Christopher,
Woozy, LinkedIn, user, Marie, Denny, Douglas, Cecilia, Jean, thank you all. Jamie Carina.
You know, we got the UK and Italy in the house. Love to see it. Mack. Max holding it down from
Chicago just like me. But I'm curious, live stream audience, do you care about these, like
this new hybrid approach because it's something that Open AI is also going to be adapting to as well.
I think it's there's actually some downsides and we're going to talk about this.
But, you know, live stream audience, I'm curious.
Number one, do you care about this hybrid approach?
Do you think it's going to be good or bad?
And have you used Cloud 3-7 sonnet yet?
I know it's only been out for a couple of hours.
If you do have any questions, get them in.
Now I'll try to tackle them at the end of the show.
All right.
So let's get into an overview, and this is from Anthropic.
So they're saying today we're announcing Claude 37 Sonnet, our most intelligent model to date
and the first hybrid reasoning model on the market.
Claude 3 Sonnet can produce near instant responses or extended step-by-step thinking that is made visible to the user.
Yeah, so that part's important.
You can kind of see like you can.
It is a summarized chain of thought.
So chain of thought is actually a prompting technique that was popularized.
you know, over the last couple of years using transformer models.
So this kind of chain of thought or, you know,
how a person would think about a problem.
So now a hybrid model does that.
And it shows a summarized version of the chain of thoughts.
You can kind of see how this new 37 sonnet is kind of thinking about your prompt
if it is using the advanced thinking.
So now back to the release API users also have fine-grained control over how long
the model can think for.
So Claude 37 Sonnet shows particularly strong improvements in coding and front-end web development.
Along with the model, we're also introducing a command-line tool for agentic coding called Claude Code.
Cloud Code is available as a limited research preview and enables developers to delegate substantial engineering tasks to Claude directly from their terminal.
All right.
So a lot to unwrap there.
So you don't have to read.
I think there's like three separate releases that Anthropical.
put out. I'll just give you the high level. So like we said, this is the first hybrid reasoning
model with visible thinking process. And the extended thinking is on paid plans only. So if you are a
free user to Anthropic Claude, you will see the 3.7 Sonnet model available, but you do not get
kind of this advanced thinking available on the free plan. All right, a couple other high level
kind of points here.
It scored a 70.3 on the SWE bench or SWE bench verified.
Best in class by a lot for coding.
Like we talked about, the Claude Code program for agenic development.
And then it has a 15 times longer output token capacity.
So a 128,000 tokens that it can output versus previously.
Claude could only output 8.5,000 tokens.
So that's just the amount, right?
So if you ask Claude to do something before, it would spit things out.
Sometimes if you ask for a lot, it would spit things out in little chunks.
So now, at least according to Anthropic, that is a 1208,000 token output capacity.
Personally, I'm not seeing that yet.
We're going to do a live test here, y'all.
We'll see if we actually see that.
I was still getting it, breaking it out in small chunks.
They did say that is in beta.
So not sure if that's fully rolled out yet or if that'll be coming out in the coming days or weeks.
But I don't know.
I'm not seeing it.
Also, which is important, this is available across all platforms.
So a lot of what we're going to be talking about is using Claude on the front end as a front end user.
So going to Claude A.I and using your free account, your paid account, maybe you have a team's account.
right, but obviously Claude is available on the back end, and it is a very popular model on the API side,
mainly due to its proficiencies in coding and software development.
It is historically been the most used model, at least when you're looking at open router statistics.
It is generally the most used model on the API side, at least those that are using open router, right?
OpenRouter is one of the more popular services where you can essentially sign up for one service,
connect all your different API keys.
So they have good data, but that's not every single model.
That's just those using OpenRouter.
All right.
So let's talk a little bit about Claude's thinking because this is the big chain of thought,
reasoning model.
So let's go over some of the highlights on how this actually works.
So it uses deeper reasoning for complex tasks.
So what that means is it has an extended thinking mode that lets Claude spend more time and compute effort, solving, challenging problems, or answering tougher questions.
Okay.
It has user-controlled thinking budget on the back end.
So developers can set a thinking budget to determine how much effort Claude should apply for a task.
I think that's where things get a little trick.
Ricky. We'll talk about that here in a second. It is the same model more effort. So the extended
thinking doesn't rely on a different model. It is still the Claude 37 sonnet, right? So hybrid model,
whereas Open AI as an example has their 01, 03, and then they still have their workhorse do
everything model, GPT40. Not like that with Claude. It is just Claude 37 sonnet, right? It's not
37 sonnet thinking. It's not oh, three seven sonnet alphabet soup. It's just three seven sonnet. It's the same thing.
One model does it all.
I think there's pros and there's cons.
The extended thinking, like I said, doesn't rely on a different model.
Visible thought process.
That's a big new feature, at least for Claude, right?
So users can see the, it says the raw reasoning steps.
I don't know.
We'll have to see if that's the raw reasoning when I'm looking at it.
It still looks like a summarized chain of thought.
I could be wrong.
We're going to look at it live.
The other thing, Claude is historically terrible.
for limits. So, you know, I was able to test this a lot last night.
And I wanted to do a ton more testing this morning before this live show.
But, you know, even though I'm on a paid plan, Claude's limits have historically been
the worst in the industry. And it's not even close. Right. So I wish Claude would give paid
users a little more leeway in order to test these things. So a lot of these things, I've
already done them a couple of times. But normally, I would like to play with an
LLM for at least six to eight hours before doing an even simple show.
Not always an option, at least using Claude on the front end because those limits are
terrible.
All right.
Also, improved accuracy over time.
So Anthropics says that extended thinking boost performance on tasks like Mac problems or
complex evaluations by allowing Claude to refine answers iteratively.
Adobe just introduced an entirely new way to call.
create, bringing the power and precision of its creative suite into one conversational experience.
Meet Firefly AI Assistant, now live in the Adobe Firefly app, the all-in-one creative
AI studio.
Powered by Adobe's creative agent, Firefly AI Assistant lets you start with your vision, just
describe what you want, and shape the outcome as it takes form with the Assistant.
The Assistant orchestrates multi-step workflows, drawing on 60-plus pro-grade tools across Adobe
Creative Cloud apps, including folks.
Photoshop, Illustrator, Premiere, Lightroom Express, and more to help bring your ideas to life.
You can also get started with creative skills, a growing library of pre-built workflows for common creative tasks,
like batch editing photos, creating mood boards, portrait retouching, and creating social variations.
Every step the assistant takes is visible so you can refine, redirect, or take over at any time.
You stay in the driver's seat as the creative director.
Adobe Firefly AI assistant now in public beta.
See it today at firefly.addbore.com.
All right.
So let's talk about the Claude Update timelines.
Because if you're wondering, wait, has it been a minute since we've heard from Anthropic?
Yeah, kind of, right?
When now the leaders, Google and Open AI, seemingly are announcing new models every month.
It has been like light years and then some since we've had an actual step
improvement from Anthropic.
So the original 3.5 Sonnet was back in June, 24.
All right.
Then they had this upgraded 3.5 Sonnet, which was confusing because they just called it 3.5
Sonnet new.
They didn't use 3.6, even though a lot of people online, myself included, said, this is
dumb.
Why are you calling it 3.5 Sonnet new?
And then they obviously skipped 3.5.
which lends me to believe that, yeah, that 3.5 sonnet new, which really didn't bring anything
terribly new. It was more of an under the hood update, the type of updates that, you know,
Google and Open AI do almost on a biweekly basis. It didn't seem like anything major,
but we saw the, that Claude 35 sonnet new in October. Then in November, we saw Claude 35
Haiku, right? So essentially, Anthropic has historic.
historically had three model sizes, small, medium, and large for small tasks, medium tasks,
and large tasks.
So Claude Haiku is the small, Sonnet is the medium, and Opus is the large.
So you'll see here now, finally, February 24th, we got the Quad 37 Sonnet.
So I will say the 35, you know, new update in October, I don't know.
That wasn't much.
I used it plenty.
I use Claude 3-5s on it every day.
I didn't see anything new, anything noticeable, at least for my daily use case,
which I know is different than a lot of people's, right?
But so I'll say this, for the most part, it's been since June.
It's been a good eight months since we saw a top class model, real update from Anthropics.
So it's been a hot minute.
So let's also talk about what's next, because Anthropic did release this little,
I guess you could call it a timeline, but looks very much in step with open AIs,
kind of five faces to AGI, right?
So you have your, you know, your reasoners, your agents, et cetera,
from Open AI, Claude takes a little different approach here.
So they said, 2024 was Claude Assists.
Then they said 2025 now is Claude collaborates.
And then they said in 2027, Claude will pioneer.
So is this kind of their AGI artificial general intelligence timeline?
I'm not sure.
It kind of looks like it, right?
They're saying it looks like Claude is just going to be a collaborator.
It goes from assist to collaborates from 2024 to 2025.
And it is going to be a pioneer in 2027.
So I don't know what that means.
But it's Tuesday, y'all.
Should I come in with some hot takes?
Let me know how spicy.
I got to get a sip a coffee here for live stream audience.
But how hot should I make these hot takes, y'all?
And yeah, if you do listen on the podcast, this is a live stream.
We do it every single day.
It's unedited, unscripted, the realest thing, and artificial intelligence.
7.30 a.m.
I know it's a little early.
That's why sometimes I take a little second to sip on the coffee.
But yeah, live stream audience, should I be nice?
Should I bring some heat here with my hot take takeaways?
It is Tuesday after all.
So, all right.
Let's get to some of my takeaways here.
And then we're going to get back to the facts, the figures, the stats.
We're going to do a live walkthrough as well.
So let's talk about this concept of hybrid models.
Big Bogey Face said sweat emoji.
All right.
Allison says, just spicy.
I'll keep it just spicy.
Maybe I won't go, you know, five alarm, hot chili, hurts in the toilet, spicy.
All right.
Not a fan of hybrid models right now.
I'm not.
But I'm also a power user.
So I have to understand.
Most people are not.
I ultimately think these hybrid models are just going to be a way for companies to make more money, right?
Which I get and I understand.
I've said all along, whether you're talking about $20 a month for Claude, you know, paid plan, $20 a month for chat GPT plus, $200 a month for chat GPT pro.
Same thing with Gemini, whatever.
Companies for the most part are losing money.
So I get it.
I get you got to make money.
You got to be profitable.
But on the API side, if I'm a developer and I've been using, you know,
or maybe looking at switching from Open AI to Claude,
I am not incentivized to do so.
Because when you have this new Claude 37 sonnet,
yes, you have this kind of slider control over how much thinking you can apply to certain situations.
But when there's companies out there that literally their business model is essentially
creating a helpful wrapper around an AI model for their customers for a certain niche,
you need a little more control over a simple slider, you know, over saying like,
ah, you know, let's apply this much thinking unilaterally across the board.
I don't think there was anything wrong from a back end API perspective, right?
So I'm hoping that Anthropic and others will not get rid of, you know, as an example,
35 sonnet and will still allow companies to have, you know, 3.5 haiku, 3.5 sonnet.
And the reason why is because the API prices for 37 Sonnet are ridiculously high.
Ridiculously high.
All right.
And if you don't have an option to have like, you know, 3.7 Sonnet regular and 3.7 Sonnet think.
I mean, there's a reason why right now OpenAI is winning the AI race.
I mean, number one, they were the first with chat.
Number two, even though it's confusing for front end users to stare at eight different
model selections. It's extremely important for back-end developers, companies that are essentially
running their business off this technology to use the right model for the right time,
for the right purpose, and the costs that are associated with it. So Claude 3.7 sonnet is
extremely expensive. So for certain use cases, no-brainer. Coding, software development,
etc. You're going to pay it because right now, Claude 3.5,
Claude 3-7 sonnet is the best in those areas.
It is.
It's a great model.
I'm not a huge fan of it.
And I probably won't be a huge fan of it when OpenAI does that as well.
So CEO, OpenAI, CEO, Sam, Sam Haltman said that Open AI is shifting once GPT5 comes out.
GPT5 is going to be more of a system.
And it will also use this hybrid approach.
And it will say, you know, hey, here's, you know, here's when you should use a reasoner.
versus when you should use a transformer model.
So I just think this is just a way for these companies to make more money
if they eventually take away the option to use older models that are not hybrid.
That's all I'm saying.
And as a front end power user, I hope I always have the option as well.
Right.
I got some sun shining in my face.
All right.
So I hope as a front end user,
I'll still have the option in the future to say, oh, I don't want to use a reasoning model for this.
Or I need to use a reasoning model and only a reasoning model, right?
You might have to over prompt engineer if you're giving, you know, Claude 37 sonnet,
you know, something on the front end and you want it to use reasoning and it's not.
Then you just have to go and take that extra step, you know, do a little extra prompt engineering to get it to use
this logic, right? So there's huge downsides that I don't think people are talking about, right?
Everyone wants to wrap it in a bow and say, oh, it's the world's most powerful. It's hybrid.
It's all in one. Okay. There's times and use cases that all in one is great. But I don't think
this is one of them. Again, I'm a power user. So maybe my viewpoint is skewed. I personally like
going into chat, GBT, and seeing eight different models, right? Because I'm a power user. I'm
I'm using probably five of them for very specific use cases.
I don't want one, right?
I don't.
I could be wrong on that.
All right.
Next, poor opus.
Poor opus.
Opus hasn't been updated in like a trillion years.
So it looks like, I don't know, Anthropic may have just abandoned their big boy model
Claude Opus.
Maybe they're waiting until they're kind of clawed 4.0 models to bring
back opus, I'm not sure, but at least for now, poor opus is bye-bye. Also, I don't know. So I saw a comment here.
Let's see, who said this? There we go. Douglas from LinkedIn said,
Curious how this will improve cursor into windsurf, right? I think now Anthropic is competing with them, right?
even though these IDEs, right, so that's an integrated development environment.
So, you know, like we talked about at the top of the show with the news, you know, Gemini,
Code Assist, GitHub co-pilot, cursor, windsurf, lovable, bolt, right?
There's all these kind of IDEs or essentially now AI coders, right, where you can literally,
you can talk to it, you can type to it.
Think how we have these large language models, right?
we have chat GPT, right, the GPT models, Gemini, Claude, right?
And then we have now this newer breed.
They use a model.
So you choose which large language model, but it is an AI-powered IDE or integrated
development environment like cursor, right?
Cursor by default uses Claude.
But it looks like with Claude code, which, you know, audience, let me know if you
want us to go into that.
not today at a later time.
We'd have to have a show or two dedicated.
It is more technical.
But I think Claude code is really cool.
But it looks like Claude wants to compete more with those IDs than it looks like they want
to compete in the strictly large language model space.
And I think that makes sense.
I think it makes sense because it looks like over the years, Claude has kind of carved,
Claude has kind of carved out its niche.
And I'm not saying they're abandoning, you know, general business use cases.
They're not.
But it looks like especially with Claude code, especially with the MCP protocol that they put out,
you know, computer use, even though it's clunky, it did get updated with now this 37 sonnet.
So we'll have to see if it's any better.
But it looks like Anthropic is maybe just wanting to compete more in that space,
especially by making Claude code a free beta preview.
Also, another hot take since you wanted to spicy.
I don't think most companies are going to end up using Claude.
A lot of people were waiting for this release because they assumed that
Anthropic would be cutting their API prices because that has been the trend across the
industry, right?
Open AI has cut their API.
prices by more than 90% over the last 18 months when it looks at their top state of the art model.
Google, same thing.
Just ridiculous API pricing cuts, right?
Anthropic, not so much.
They didn't change their pricing at all, right?
Yeah, it's a more powerful model, but you're paying the same price.
But I think for the most part, businesses are not going to use clawed general use cases.
They won't.
maybe Anthrop I'm sure Anthropic knows this,
but I will say 90% of businesses that are looking for a large,
a general use case,
large language model to use on the API backend,
whether that's for customer success,
whether it's for sales,
whether it's for an internal knowledge base.
I'd say non-coding,
non-software development.
90% of companies will not look at Claude,
and I don't blame them.
The prices are more ludicrous,
than the early 2000s wrapper.
It's there,
they're insanely not practical for everyday use cases.
They're not.
All right.
So let's look at those API prices.
So Claude 3.7, 3, and this is per million tokens.
It is a $3.4 million tokens input and $15 for output.
All right.
So yeah, it's a hybrid model, sure, but I'm still going to go.
If I'm a business leader, GPT40 Mini is great because you can chunk, right?
You can chunk different tasks to different models.
And that's why I'm going to this whole like this API, you know, and developers using it.
No one's going to use 3.7 unless you specifically need software development, coding, right?
Unless you're in one of those categories, maybe some.
some stem areas, right?
But otherwise, who's going to touch it?
When you look at GPT40 Mini is 15 cents versus the $3 input and then 60 cents versus $15 on the
output side, right?
And when you can chunk it and when you can say, hey, for these type of questions for
customer success for sales, et cetera, we're going to use GPT40 Mini because we don't need a hybrid
3.7 model to do 90% of what we would use it for.
I mean, the cost savings there, it's like, I don't know, 10, 6, like 30 times as expensive.
Like, absolutely not.
Or 25 times as expensive.
I'm doing math live on the fly.
I don't know.
From an API perspective, this does not make sense.
I was really expecting Anthropic to.
slashed their prices, but it looks like they're not necessarily concerned with competing for
everyday business use cases.
They're like, yo, if you want to use agentic tools, if you want to use software development,
uh, coding, et cetera, maybe some engineering, like I said, some STEM use cases, but for everyone
else, nah, we're good because the combination as an example of GPT4O mini at 15 cents and 60
cents and 03 mini at $1.10, 440, duh, right?
And then the same thing with Gemini.
Gemini 2.0 Pro, $1.25 and $5.
And then they have their flash.
And then they also have flash thinking.
I probably should have put that up on the chart.
But it just doesn't make sense.
It doesn't make sense.
Their pricing on the back end does not make sense.
And I think as the other models get essentially better at coding and software development,
because right now, yes, Anthropic Claude and with their 3.7, they have a huge lead there.
So let's look at that.
So some of these benchmarks here, we're looking at the SWE bench, SWE bench verified,
and looking at some of the different benchmarks.
And you have the version here without, you know, they're calling it custom scaffolding
or without that extra thinking.
Even without the extra thinking, Claude 3.7 saw it on SWE bench, 62%.
where their last version, 3.5 saw it, was 49.
OpenAIs 01 is 48.9.
03, 3, 3, 3, deep seek 492.
But with the extra thinking, Claude is a 70%.
Right.
So that's what I'm saying.
If you're doing any type of software engineering,
nothing else right now comes close.
Same thing with agenic tool use.
So the towel bench, I think it's pretty,
pronounce tau bench, but TAU bench.
Same thing.
This is when you essentially have a model.
You give it access to tools and you have it go complete some technical tasks.
Same thing.
Claude 3.7 saw it here with an 81% on the towel bench retail and then open AI 73%.
So not close.
Generally with a lot of these benchmarks, you know, especially some of the non-technical,
non-software engineering ones, one point difference can be huge.
Right.
So in this use cases, Claude 3.7 sonnet is light years ahead.
Interestingly enough, when we look at the regular benchmarking kind of marks here,
between Claude 37 Sonnet, Claude 35, OpenAI, OpenAI, OpenAI, O3 Mini,
DeepSeek R1 in Grock 3 beta, this is from Anthropics website.
Interesting, they didn't include on this main one.
anything from Gemini.
And these are definitely cherry-picked.
But something that I found interesting when Anthropic was putting out its own benchmarks on its website,
is they didn't use the same benchmarks as they had previously when they announced Claude 3.5.
Saw it.
When they announced the Claude 3 family of models, specifically they're keeping out these benchmarks comparisons like MMLU.
and then the ML, the multimedia version one, right?
There's essentially kind of, I think it's MMLU and MMLU Pro.
Now I'm blanking on it, but it's kind of the standard.
It's been this golden benchmark, but I mean, you can see it here.
Anthropic is just kind of like, no, we're good.
We're just going to stick to these more technical benchmarks, right?
Visual.
Oh, there we go.
M, MU, you know, it's with nine.
extended thinking at a 71% it's not better than Open AI, right? Open AI on the MMMLU, which is the multimedia
version of the MMLU, which I would say is the standard or has been the standard benchmark.
Open AI is better than it, right? So it's interesting to see here. It doesn't look like Anthropic
is trying to overfit for certain benchmarks, right? And I would like to see once there is the MMLU.
and not the multimedia version of it,
where Anthropics' new Quad 3.7 stands,
because I'm guessing it is going to be not in first.
I'm guessing it might not even be in the top five,
but I don't think that Anthropic necessarily cares.
Because like I said,
it looks like they're just trying to compete
and they're trying to be more of a just a coding assistant, right?
So maybe their biggest competitors might also be some of their customers,
like Cursor, like WinSurf, like Lovable,
like Bolt, right?
Or maybe some of their competitors might be GitHub co-pilot.
All right.
Let's talk a little bit about Claude code.
So, yeah, live stream audience, let me know.
Should we tackle this at a later point?
I think it's pretty cool.
But you have to have a little bit of tech know-how.
So here's how it works.
So Claude code, essentially you go to GitHub, you kind of install this GitHub repo,
and then it can work with a code base on your computer.
So, you know, if you're on a Mac and you open Mac Terminal, essentially,
you can have Claude Code.
So this is a new, essentially, research preview that's free.
That's even for free users can use, which I think is great.
So you can work with an entire code base.
Okay.
So let's say you have a full.
folder. All right. So non-technical people, bear with me. And I'm probably going to get some of the
technical details wrong here. So, you know, if you are a coder, bear with me as I explained it to a
non-technical audience. But let's say you have a code base. So you build an app or something and you
have a folder and there's, you know, seven different files in there. You know, maybe there's a
JavaScript file. Maybe there's an HTML file, a CSS, et cetera, right? So the cool thing about
Claude, well, number one is it works locally on your machine, right?
So you don't have to go into a third party environment.
You're just working in the terminal, which I know might be intimidating for some.
But then you essentially just talk to Claude like you would as if you were inside
Claude.
Claude.
And then it can code and it can update your entire code base.
So it will search, edit, test and push code from within the terminal.
So it's not going to say, oh, here, here's the new code for the HTML.
Here's the new CSS code.
Here's the new JavaScript code.
Go copy and paste this, right?
It just does it all for you.
It works with and updates your entire code base.
It has GitHub integration.
It's pretty good at debugging.
So Claude Code was part of this, you know, 3-7 Sonnet release.
And I think it may end up being more impactful than the most.
model itself because I think this signals,
Anthropics shift to really want to compete more in that space.
And you might be wondering why.
And I actually don't hate it because if you listen to our 2025 AI predictions in
Roadmap series, one of those things is non-technical people are going to be spinning
up apps for themselves to use.
And now Claude could be the easiest way to do that.
Yes, you can use cursor.
you can use WinServe.
Some of these other tools,
I think the learning curve might actually be a little higher.
But Claude code can allow everyday people to just go create apps,
talk with it.
You can even be like,
yo,
I have no clue what this means.
Explain it to me.
Or hey,
make it prettier.
Make it shinier.
Make it more useful, right?
You know,
make a data visualization,
right?
You can just dump all your data,
give it to Claude via this Claude code,
create a program that runs locally on your computer that helps you solve something, right?
I do think enterprise software, if I'm being honest, it doesn't have the same future that it has
today.
I do think everyday non-technical people are going to be using AI and large language models to
spin up their own software for very niche use cases.
And I think Claude might be that first big step toward bringing that to everyday people, right?
Yeah, you might have to get used to, you know, here's what a GitHub repo is.
You know, but it does it all working with your entire code base where, yes, I love using, you know, O3 Mini or O1 Pro or something like that.
But then you still have to copy and paste, you know, all of those, all of those different files.
You might have to use something like Replit to run it.
So CloudCode, pretty cool.
It does it all kind of for you.
All right.
Let's look live, shall we?
what could go wrong?
What could go wrong doing a live test of a brand new model that has terrible, terrible limits?
Let's try anyways.
You guys say you like these live tests, so let's go ahead and do them.
Live stream audience, let me know if you can see my screen here.
All right, so here's a couple of things to keep in mind.
When you are choosing Claude, make sure you are using Claude 3.7 Sonnet.
Also, you'll see this new thinking mode.
So it's kind of ironic.
You still have to have this extended and you're only going to see this on the paid plan.
You want to make sure you have that extended box checked.
So you can choose a normal thinking mode and this is as a front end user or you can use
the extended thinking mode, and this is best for math and coding challenges.
All right.
I'm going to go ahead.
I'm going to put a giant prompt in here.
All right.
Okay, thank you.
Marie is always the first to say, yes, I can see your screen.
Thank you, Marie.
I always appreciate that because I never know.
All right.
So I have a giant prompt I'm going to put in, and we're going to use this extended thinking
in Claude 37.
All right.
So here's what it is.
I've done this on the show before.
This is what I did when I first tested O1 Pro.
All right.
So essentially, I'm saying these are my podcast stats.
So I'm using the same exact prompt.
So I say today's date is January 16th.
This is when I did the O1 Pro show.
I want to have a consistent comparison across quote unquote reasoning or hybrid models because
that's what we're trying to do here.
We're trying to say like, okay, how is this, how is this model?
Right.
And you'll see here at live stream audience, it's working.
I'm going to try to keep my eye on the model here.
I'll actually just let you watch it and I'll read the prompt that I put in.
All right.
So I say these are my podcast stacks.
Keep in mind, today's date is January 16th, 2025 for all questions.
Always exclude the top 2% and bottom percent of episodes unless otherwise noted.
So then essentially I have, I give it a series of 11 questions.
And these questions are extremely specific.
Then I copy and paste.
I believe I give it, let me count here, about data from 150 podcast episodes.
Okay.
So this has the name of the episode, the episode number.
Then it has the number of downloads in the first or sorry, the last seven days,
the last 30 days, the last 90 days.
in all-time downloads.
And then over the course of these 12 different questions, I am asking, in this case,
you know, Claude 3.7 sonnet with the extra thinking, right, with the extended thinking,
I'm asking it some very advanced questions, all right?
13 of them.
So as an example, you know, question number two, I say, give me the complete list.
list of all episodes with a new performance percentage of over or under the adjusted average.
Because I'm saying take away the top 2% and the bottom 2% of episodes because sometimes
there's anomalies, right? And I don't really care about those. And so I'm saying,
hey, find trends. And then I'm saying question three, give me top 10 and bottom 10 episodes
in their respective percentage that they're either over or under the adjusted average. So what I'm
trying to do is, you know, sometimes there's episodes that kind of go viral. Sometimes there's
episodes that for whatever reason don't get like any downloads.
And I'm like, okay, there must have been a problem retrieving data.
So I want to find kind of that median or mean.
And then I want to find types of episodes that get more downloads than that kind of
adjusted average.
And then I want to, over the course of all of these questions, I'm asking it to find
different trends and patterns so I can create better episodes for you all, right?
It might be something as simple as how do I name these episodes better, right?
and having it spot different things.
It could be, I'm asking some questions about days of the week, right?
So as an example for question four, I'm saying for the top 10 episodes above the adjusted
average, please suggest three slightly adjusted title names for each if I were to rerun them,
right?
Yeah, probably a couple times, couple times a month.
I'll rerun shows, you know, I might get sick or, you know, a guest might have to bail
at the last minute and I might need an episode to rerun.
So I'm saying, hey, give me a new title.
All right.
So it looks like, let's see.
Okay, so it looks like it says it thought for 23 seconds.
It looks like we're, so it looks like it's done thinking.
Let me see.
Okay.
So when I go through and I look at this thinking, this does not look like the raw chain
of thought.
Okay.
So it says, I need to analyze podcast stats from the provided data.
The first step is to extract the data and organize it in a way that makes it easier to analyze.
Let me go through the instructions carefully and understand the tasks, right?
And then it breaks it down into six different subsections.
And then it says, let me first extract the data.
So it's going through, it's kind of showing it's step by step.
But I'm looking at this.
If this only thought for 23 seconds, let's see.
Okay. I'm trying to see if there's more, there's no more chain of thought. Okay. So it did get through
this fairly quickly. It is still answering the questions. All right. So I'll have to go through and
give this a good scrub, but I'm kind of surprised that it only thought for 23 seconds. And you know what?
hey, if you share this episode, I'll share the complete stats and prompt that I sent.
I will share the exact output that we got here from Clawed because it's still going.
And I will share the exact output that we got from 01 Pro as well.
So if you really want to dive into the details, I'm not going to have time.
It would take another half an hour to read all of this.
I'm going to go ahead offline once this is done and look at the comparison.
But I will say this.
Overall, it looks like it did a decent job,
although I did look at the responses I got from 01 Pro this morning.
The responses from 01 Pro were exponentially more impressive.
They were, right?
The findings here, let me just go ahead, see if I can read
let's see if I can read one or two of these answers that maybe we can have a little more nuance,
right?
So let's say number seven.
Okay.
So the question for number seven was how does release day impact episode performance?
Please exclude Mondays, as that is usually our AI News That Matters, days.
And we don't usually run any other type of shows on those days.
And then I say, you know, here's here's,
what today is so you don't get confused.
So it says impact of release day on episode performance.
So it says Saturday, which I don't know why they're Saturday, because we, as far as I
know, have never released an episode on Saturday.
So that's a little weird.
So then it says Wednesday is 6% above adjusted average.
Friday is minus 3%.
Tuesday is minus 4%.
And Thursday is minus 1%.
5% below adjusted average.
I'm not sure if that is true, right?
Because we didn't release any episodes on Saturday.
So unless there was some weird thing in the formatting, Saturday shows should not be there.
And if so, maybe it was a Friday, like one Friday show that got posted super late.
I don't know.
But this, I mean, in short, this data does not look correct.
It does not look correct that three,
of our weekdays are below the average and only one of them is above the average.
Doesn't make sense.
And then it says key findings.
Wednesday episodes perform particularly well for technical tool guides and platform-specific
content.
And here it says Saturday episodes, which again, we haven't done, perform well, likely
due to less competition and more listener leisure time.
Thursday is consistently the worst performing episode day, especially.
for industry-specific content, which I know is not true because I look through daily downloads
every single time. Thursday is not a bad day. Thursday is usually our second best day.
It says Tuesday episodes underperform, especially for news or recaps. Also false. So, you know,
I'm going to have to go through and look a little bit, but not great responses. And I'm wondering,
this should have taken, I think, many, many,
minutes, many minutes.
And at least it says here that it thought for 23 seconds, which number one doesn't seem
right.
But it doesn't look like I got great results, if I'm being honest, right?
It looks like it did go through and answer all the questions, which is good because in
my first testing of this last night, it actually stopped.
It only answered the first three questions.
And then it essentially said that it went through the context window, right?
which didn't make sense because I'm like,
yo, this is supposed to be, you know,
a hundred,
you know,
a hundred some thousand,
uh,
context window.
So here's this one.
Here's the one that I did previously.
And at the bottom,
it says,
Claude hit,
uh,
I know it's kind of small there.
It says,
Claude hit the max length for a message and has paused its response.
You can write to continue to keep the chat going.
And you'll see in this use case here,
let me go to the top.
In this one,
It thought for three minutes and eight seconds.
Okay.
Why did in one use case, it thought for three minutes and eight seconds and could not give me the entire output.
Yet in the second one, it says it only thought for 23 seconds and it gave me the entire output.
And I will have to do a little bit more offline comparison, but, you know, mixed bag.
Mixed bag so far.
All right.
So let's do this.
I have a very very very.
short rubric that I normally do for reasoning models. I'm going to go through this one quickly.
I'm going to make sure that I have the extended thinking on this. All right. So let's go ahead,
go through some of the questions that we would normally run. All right. So this one, I'm saying,
I just woke up with six apples and three bananas. Some of these I made up. Some of them are,
you know, kind of widely used across the internet. Some are just modified from,
pretty popular ones.
So this, I, like, I know I need to make an actual, like, reasoning rubric, but these are just
some that I generally use.
So I said, I just woke up today with six apples and three bananas.
Yesterday, I ate a banana and two apples.
This morning, I will eat one apple and no bananas.
However, I don't really like apples and one banana may turn brown tomorrow.
Assuming nothing else changes, how many apples and bananas will I have tonight?
So, let's see here.
It says it thought for five seconds.
And I can go through and look at the chain of thought.
Again, I don't know if this is raw.
Maybe this is the raw chain of thought, not the summarized chain of thought.
So let's see.
It says, let's work this through step by step.
Starting point, six apples, three banana.
Yesterday, the person ate.
Then it says this morning they will eat.
The question asks how many apples and bananas they'll have tonight after eating what they
described.
So then it's going, let's calculate.
And then it says, wait, I need to double check the wording of the problem.
The person says, I just woke up.
today with six apples and three bananas.
So these are their current quantities after whatever happened yesterday.
So yeah, Claude gets a lot of the information that I put in here is just meant to throw a model off.
Most of these models, including Claude 35, saw it, do not get these questions correct.
I'm assuming that Claude 35 saw it with thinking, got it correct.
Yes, it did.
The correct answer right there.
Got it correct.
It is five apples and three bananas.
All right, let's do a couple more.
We're going to go through these quick, y'all.
All right, same thing.
We have extended thinking on.
All right.
So this one, a man and his dog are standing on one side of the river.
There's a boat with enough room for one human and one animal.
How can a man get across with his dog in the fewest number of trips?
There we go.
The man and the dog can cross the river.
together in just one trip.
So even some of the original, you know, very powerful state-of-the-art models would always get
this wrong.
It's very simple, right?
These questions are simple.
Any human knows right away, oh, that's one trip for whatever reason.
A lot of large language models, including, you know, GPT-40 when it first came out,
Claude Sondit 3-5 would get that wrong.
All right.
Here's another super easy one.
Let's go ahead and ask this one.
I'm saying if it takes three hours to dry 10 t-shirts and the sun,
how long will it take to dry 30 t-shirts in the sun?
All right.
So here we go.
We got a little bit of chain of thought here, not a ton, three seconds.
Correct.
It's saying it will still take three hours to dry 30 t-shirts in the sun because the process is parallel.
Right.
So as long as you have room, you can dry as many as you want.
All right.
Our next question, let's see if we can trip it up here.
So I'm saying if you have a single match and you walk into the room with an oil lamp, a candle and a fireplace, which do you light first?
The answer, which Claude got correct here, you would light the match first.
All right.
I hate how jumpy this sidebar is.
All right.
Next question.
I don't think any large language model has ever got this one wrong.
What color is an airplane's black box?
All right.
It is actually bright orange, correct.
All right.
Here's one that a clawed model has never gotten correct.
And let's, okay, I'm asking for jokes.
Also, a model has never really been good at humor.
All right.
So I'm saying, please give me seven jokes that end in the word blue.
Two should be about animals.
Three should be about some other topic in the body of this chat, and you can make up the other two.
All right.
So again, seven jokes.
They should end in the word blue.
Two are about animals.
Three should be about something else in this chat, and it can make up two.
So most models can at least now follow the basics of the directions.
Some models still struggle, but let's see if they're actually funny and if it follows directions.
So only thought for three seconds, and it says here are the seven jokes.
So let's see, the two about animals.
Why was the fish such a great musician?
Because it knew how to play the scales and always stayed in the deep blue.
Is that a joke?
Not really.
Is it following the directions?
Yes.
What did the sad elephant say at the waterhole?
Even my memory feels blue.
Not funny, but it's there.
All right.
And then we are going to have three about the context from the context of this chat.
So why did the banana go to therapy?
because it couldn't stop feeling peeled and blue.
All right.
Why did the boat stay on the river after failing to cross?
You're making me feel blue.
Why did the t-shirt refuse to dry in the sun?
It was feeling too blue.
So it's using the same, you know, oh, it feels blue as the ending of the, I guess,
the punchline.
It's not even really a punchline.
All right.
And then two of the ones where it made it up themselves.
Why did the painter refuse to use any other colors?
He was going through a phase where he felt only.
only blue. And then what did the match say after lighting the candle? So interestingly enough,
it used another one from the context of this chat, even though I told it to only use three.
It ended up using four. So from an instruction following standpoint, I guess I said you can choose.
So it just chose to use another one from the context of the chat. It said, what did the match say after
lighting the candle? That was my moment to shine. Now I'm just blue. So are they funny? No. Did it follow the
directions. Yes, it did. All right. I think this is going to be the last one we do here just due to time.
So a box is locked with a three-digit numerical code. All we know is that the digits are different. The sum of
all digits is nine, and the digit in the middle is the highest. What is the code? All right. So now we can look at
the chain of thought. It's going through. It's saying let's work through this step by step. It's assigning numerals.
let's call the digits A, B, and C.
A does not equal B does not equal C.
All right.
And then it says B is greater than A.
B is greater than C.
All right.
So it's breaking it down the way that a human probably would.
It's going through some potential use cases.
It's throwing some numbers out to see if they work.
Let's scroll to the bottom here.
So it's going pretty fast.
So that's good.
Even though the model's new,
probably a lot of people are hitting it right now.
it's going pretty quickly.
All right.
Let's see if it actually gets it correct.
Here we go.
So it's saying, actually, wait, let's reconsider the constraint.
All digits are different.
Does this include zero?
Yes.
So many models skip over zero for whatever reason.
When they look at numbers that would be on a padlock, they only think one through nine,
but zero would be there.
All right.
So now it's saying I'm actually sure zero is allowed,
since the valid digits for a code are typically zero through nine.
But zero does result in non-uniqueness.
There we go.
All right, let's scroll to the bottom.
So this is actually, aside from the first version of the podcast one,
the podcast stats that I did, I can't even remember.
That was either late last night or early this morning that caused it to think for three minutes.
This is the one where it's taking the longest to think.
And the chain of thought is pretty impressive.
It's doing seemingly a pretty good job here.
And luckily, I haven't hit my rate limits yet.
What a miracle.
But I intentionally did not use it a lot just so I wouldn't hit my rate limits.
All right, live stream audience, we're going to give this one a second to finish up.
But what are your thoughts here?
What are your thoughts here as we get ready to wrap?
Big Bogey is saying, I have Gemini 2,
tell me a joke every day.
It's not getting better.
Yeah, these large language models are definitely not good,
although Denny is saying that AI makes dad jokes look good.
All right, let me see.
I want to make sure if you did have any questions,
let me just double check.
I want to make sure I don't see any specific questions,
but I do have a couple dozen.
Okay, here we go.
Woozy saying, and for our podcast audience,
Sonnet 3-7 is still thinking through this combination problem.
So Woozy is asking, is there anything that you see and the thinking that makes you adjust
specific things in your original prompt question?
Woozy, thank you for that question.
That's an amazing question.
Yes, 100%.
And I've mentioned this multiple times.
And I called this out, I think twice when I did deep research, when I did my deep research
comparison show.
You should be doing this all the time.
Because when you're using deep research as an example, it shows you what it does.
I think deep research is one of the best use cases for generative AI, FYI.
But I always go back and I always say if you're going to use a model like a deep research
or a reasoning model that takes its time to think, you might as well squeeze the juice out of it
and get a good return on your time invested and go ahead and do it again.
In these cases, when there's a finite answer, right, the number of apples and bananas,
I'm not going to do it, right?
I'm not going to run that a second time.
There's a yes or no answer.
When it's more of putting it on a task that there's not, it's not as finite,
100%.
Thank you for that question, Moosey.
You should always, always, always look at the chain of thought.
This is the biggest, one of the biggest advantages to having both chain of thought using
these reasoning models as well as being able to see exactly how these deep research
tools research is you go through, you read it, you see, oh, this was a good decision that it made.
Oh, this was not a good decision. I take notes on the side and then I adjust my original prompt.
You need to be doing that all the time. Woozy, you just added a ton of value to our audience here.
Let's see. Douglas asking says, I think hybrid is interesting. Pure Transformer is missing the
benefit of the thinking. The thinking is slow compared to the transformer. Hybrid could be a good
bridge, but what is being sacrificed for the capability? Yeah. Yeah, so Claude 37 Sonnet is the first hybrid model.
Well, what's being sacrificed here is fine-tuned control, right? Especially for front-end users.
Yeah, you have a little slider on the back end if you're using the API, but like I said,
I personally don't like this, right? But I'm a power user, right? When I'm a power user, right? When I
I love, which I know I'm in the minority,
I love logging into chat, GPT and seeing like eight different models.
And there are sometimes I know I'm going to 01 Pro.
There's times that I know I'm going to 03 Mini High Plus Web.
There's times I know I'm going to deep research.
There's times I know I'm going to GPT 4-0, right?
And I'm going there with intent because I know the pros and the cons.
Does the average user need seven models?
Probably not.
Will the average user benefit from a hybrid model? Probably.
But it does, I don't care what anyone says.
For front end users, a hybrid approach lowers the ceiling, right?
Because there's going to be times that it uses thinking, it uses extra compute when you
don't want it to.
There's going to be times when conversely, right, when it does the opposite.
it. So I think what this ultimately means for power users, you're going to lose some flexibility.
And for everyone, I don't care what you say. I think it lowers the ceiling just a little bit.
The floor goes up. The floor goes up. The ceiling comes down.
So like I said, for the average everyday user, I think hybrid models are great for power users that are using it on the front end.
I don't like it. I don't like it. And on the back end, the companies are going to make much more money.
I think you're going to be paying more if you're using a hybrid model.
And like I said, I hope, hope that all these companies, even five years down the road
are still going to have these non-hybrid options.
Because if there's only hybrid options, regardless for that very reason, especially when
you're using the API, if you're a software company, let's say, right, or you're just using,
you know, Open AI or Claude for customer, you know, customer service, right, to get to support
tickets faster.
You use some rag.
You put in your company documentation and someone's chatting with an AI chatbot with your
company's information, but using one of these models, right?
If there's only a hybrid model and those hybrid model costs are much higher and you don't,
in the future, you don't have an option to use like a GPT40 mini or a Claude 35 haiku.
and you only have the hybrid model, your costs go up.
Your costs go up exponentially.
Like I said, I think we've been getting a steal for the last couple of years.
All right.
Let's go ahead and look at this result and wrap this show up.
So let's see.
How long did this one think?
It did finish here a minute ago when I was on a side tangent answering Woosie's question.
So this one, interestingly enough, thought for three minutes and 15 seconds.
So let's see if it got it right.
So I'm going to go past the chain of thought.
All right.
It says, to solve this problem, I need to find the three digit code where all digits are different.
The sum is nine and the middle digit is the highest of the three.
So it says 180, 270, 360, 450, 162, 153, 153, and 243.
Okay.
And it says, since the problem asks for the code, implying a single answer.
So yeah, I did say what is the code, but it should know, as most of the reasoning models do,
that there's actually many answers.
So I would not give this a pass on this necessarily, because even though it gave me other
options that worked, right?
1-80 works, 270 works, 360 works.
It ultimately chose a single answer, which I don't know.
It says 153 is the smallest valid code that satisfies all the conditions.
No, all these two.
But there's tons of others.
There's tons of other codes that work.
So as an example, 180 works.
243 works.
What about 342?
Right?
So it didn't do a good, it didn't do a good job.
It thought for a long time.
I would say that it did not pass this.
So it is kind of a trick question.
Even though I ask for the code, there's many codes.
And it thought that and it found that out with its reasoning, but still decided to almost
overthink this issue and say kind of just the wrong answer.
All right.
I know this was a long one.
I hope it was helpful.
Like I said, if this was helpful, go ahead, share this.
I'll share that complete prompt, the one going to.
over the podcast stats and I'll share O-1's complete answer as well as Claude
37 Sonnet's complete answer.
So yeah, if you are interested in just the kind of the raw thinking capabilities,
go ahead, share this.
I hope this was helpful, but I'll tell you, I'll tell you my quick takeaways.
Claude 37 Sonnet, amazing.
I think it is going to dominate and continue to dominate for any companies that need
software engineering, coding, etc.
I think even the 3-5 new model in many of those use cases was already better.
So it was already making kind of one of the best models in the world exponentially better.
So for 3-7 sonnet, coding software development, et cetera, through the roof.
Even what this means for artifacts, huge, right?
I'll probably do a dedicated episode just on 3-7 sonnet artifacts and what that means even for non-technical people.
But for everyone else, for API prices, I think it's a big loss.
I think overall hybrid model, like I said, brings the floor up, brings the ceiling down.
I don't think hybrid models, at least right now, are great for power users who are using this on the front end.
I actually think I will probably be using Claude 3.7 Sonet maybe a little less.
And I'll probably just be using Claude 35 Sonet a little more, right?
I know that seems backwards, but that's probably the reality.
So I think that there's some super promising aspects of this new Claude 3.7 sonnet,
some highs, some lows, but hopefully this episode was helpful.
All right.
So like I said, if this was, please share this.
Also, if you haven't already, please go to your everyday AI.com.
Sign up for the free daily newsletter.
We're going to be recapping this episode.
Maybe you miss something.
You know, I know this was a longer one.
trying to explain things live is always a little time consuming,
but that's what a lot of people that I hear from like.
So speaking of that, go subscribe to the newsletter,
reply to today's newsletter if you're still listening.
And tell me what you want to see more of.
This show actually was because of you.
I put a poll out on my LinkedIn.
It was actually,
it was only decided by one vote.
So we're going to have the other winner,
which is how to prompt O models,
01, and O3 models.
We'll probably do that show tomorrow or maybe next week.
We'll see if time allows for it.
So thank you for tuning in.
I hope to see back tomorrow and every day for more everyday AI.
Thanks y'all.
Meet Firefly AI Assistant.
Now live in Adobe Firefly, the Allman One Creative AI Studio.
Just describe what you want to create in your own words and the assistant handles the rest,
orchestrating multi-step workflows across Adobe Creative Cloud apps,
including Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome while the assistant accelerates execution.
band control with the ability to step in and refine at any time.
See it today at firefly.adobie.com.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit Your EverydayAI.com and sign up to our daily newsletter
so you don't get left behind.
Go break some barriers and we'll see you next time.
