Everyday AI Podcast – An AI and ChatGPT Podcast - EP 469: Claude 3.7 Sonnet - World’s first hybrid AI model. How it works and when to use it

Starting point is 00:00:00 This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Meet Firefly AI Assistant, now live in Adobe Firefly, the All In One Creative AI Studio. Just describe what you want to create and the assistant handles the rest, orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome. The assistant accelerates execution. Another week, another state of the art, large language model release.

Starting point is 00:00:50 But this one from Anthropic is a little different. It's actually the first of its kind. Because when Anthropic just released its Claude 3.7 sonnet, they became the first company to release a hybrid large language model. All right. So we're going to be talking today about what that is, what it means, how it works, and when you should actually use this new model from Anthropic. I hope you're excited for this show. I am welcome if you're new here to Everyday AI.

Starting point is 00:01:30 What's going on, y'all? My name is Jordan Wilson, and this is Everyday AI. This thing is for you. This is your daily live stream podcast and free daily newsletter helping us all, not just keep up with Gena AI advancements and LLM updates, but how we can use it to get ahead. I want you to be the smartest person in AI at your company. And this is your cheat code.

Starting point is 00:01:53 So if you haven't already go to your everyday AI.com, that is where you can sign it for our free daily newsletter. Yeah, maybe you're listening to this podcast for the first time. If so, thank you. Make sure to check out the show notes. There's going to be a lot of other information. But probably the most important is our website, because each and every day in our newsletter, we recap exclusive insights from this exact podcast,

Starting point is 00:02:15 as well as giving you every other piece of news and update that you need to stay ahead in the generative AI space, as well as you can go listen to like 500 episodes on our website, all sort of a category. So make sure you go check that out. All right. So I am extremely excited to talk about the new Claude 3.7 Sonnet. I think it's going to change how a lot of people are using large language models, both for the good and for the bat. But before we get into that, let's first start out as we do most days by recapping the biggest AI news.

Starting point is 00:02:52 So first, Google has launched a free version of its AI-powered coding assistance, Gemini Code Assist, aimed at solo developers, students, freelancer, startups, and hobbyists. So the new free public preview offers up to 180,000 monthly code completions, significantly exceeding the 2,000 completions for free offered by competitors like GitHub co-Pilots free tier. So it is powered by the Google Gemini 2.0 model, and it can generate entire codebox, auto-complete code, debug, and assist developers via a chatbot interface. So users can instruct the assistant in natural language, such as asking

Starting point is 00:03:33 it to create specific code snippets or modify existing applications. So Gemini Code Assist supports 38 programming languages and integrates with popular developer environments like Visual Code Studio, GitHub, and JetBraids. All right. Our next piece of AI news, Apple, a couple of years, maybe too late. I don't know, but they're making a splash with a reported $500 billion investment over the next for years into AI infrastructure, signaling a major push into not just AI, but American manufacturing and technology.

Starting point is 00:04:11 So according to Apple CEO, Tim Cook, this commitment reflects confidence in the future of American innovation and aims to strengthen the company's role in AI and advanced manufacturing. So a key part of this investment, $500 billion with a B, includes the development of a key new manufacturing facility in Houston, providing thousands of jobs to produce servers designed for AI cloud computing. These servers will feature the Apple Silicon and offer cutting edge security and performance capabilities. So the integration of Apple Intelligence, the company's AI platform, could further transform healthcare specifically by leveraging its global network of over 2 billion active devices to provide innovative health tracking and data insights.

Starting point is 00:04:59 So Apple investments come amid a broader AI spending race with competitors like META, spending $65 billion, Amazon spending $100 billion, and Project Stargate, which is $500 billion over five years, also ramping up their AI infrastructure and innovation budgets. Speaking of, that's our last piece of AI news, Microsoft, just on the same day that we get reports that Apple is going all all in with a $500 billion investment. reportedly Microsoft is canceling leases for a couple of 100 megawatts of U.S. data center capacity equivalent to about two full data centers, and that's according to a report from T.D. Cohen. So this move raises concerns about whether Microsoft, obviously a global leader in AI investment, may be securing more AI computing capacity than it needs in the long term. So the cancellations involve agreements with private operators and a slowdown, converting statements of qualification, which are typically precursors to former formal leases.

Starting point is 00:06:05 So TD Cohen speculates that Open AI, which is backed heavily by Microsoft, may be shifting some of its workloads to Oracle as part of a new partnership, which may be causing Microsoft to cancel or change some of its longer term investments. So Microsoft, which owns and operates many of its own data centers, is also reallocating billions of dollars in infrastructure investments, potentially shifting focus back to the U.S. from international projects. So despite these adjustments, Microsoft reiterated its $80 billion spending target for an AI data center infrastructure for the fiscal year ending in June. So analysts suggest that Microsoft could be in an oversupply position, meaning it may have overestimated the immediate demand for

Starting point is 00:06:53 AI computing power. So it'll be interesting to see how those stories play out, especially happening at the same time. I mean, Apple, you know, making a huge splash with a $500 billion investment where we get reports that Microsoft may be slightly scaling back. All right. Enough. If you want more AI news, make sure you can go get it at our website. Sign up for the free daily newsletter, your everyday AI.com. All right, let's get into it. Let's talk the world's first large language model hybrid. All right. And that is with Claude 3.7 sonnets.

Starting point is 00:07:31 All right. So it is the world's first publicly available hybrid AI model. So what that means, and we're going to get more into this, right? I've been talking about this on the show now for, I don't know, at least six months since Open AI kicked off this reasoner race. Right. So you essentially think of it when it comes to generative AI in large language models. I know I hate to use terms like old school.

Starting point is 00:07:54 when technically, you know, this space is only like, I don't know, six years old, you know, at least commercially available, you know, the GPT three technology, right? I would say it would be the first large language model that was popularized commercially a couple of years before the chat GPT release. So you have your kind of quote unquote old school transformer models, and then you have your quote unquote new school reasoning models that kind of use this advanced thinking, all right? And right now, those are two very separate things.

Starting point is 00:08:26 So as an example, Open AI, the leader in large language models, you know, they have their GPT4O, still an industry leading model, even though it's technically older, but that is its kind of quote unquote old school transformer model. And then they have their newer kind of reasoning models that use logic kind of under the hood. And that is, you know, 01, 01, 03 mini, oh, three mini, hi. Yeah, these names suck, right? But you know, you essentially have these two very different types of models that excel at two very different kinds of tasks. So now with Anthropic, they are essentially merging this together in Claude 3.5, sorry, Claude 3.7 sonnet, and it kind of does both. And this new hybrid system will kind of decide on its own when it should use more of this advanced thinking versus when it should just straight out spit an answer to you without really thinking.

Starting point is 00:09:21 it through. All right. So let's go over a little bit. And it's not just the 3.7 sonnet. They also announced Claude Code, which I think is an extremely big move from Anthropic and kind of tips its cap into where it's actually competing in. So more on that in a second. But hey, live stream audience. Thanks for joining. Appreciate y'all tuning in. We've got an international audience today. So thanks to our, you know, YouTube audience. We got Big Bogey Face and Sandra and Sam, Michelle, thanks for joining on the LinkedIn crew. We have Dr. Harvey Castro, Christopher, Woozy, LinkedIn, user, Marie, Denny, Douglas, Cecilia, Jean, thank you all. Jamie Carina. You know, we got the UK and Italy in the house. Love to see it. Mack. Max holding it down from

Starting point is 00:10:13 Chicago just like me. But I'm curious, live stream audience, do you care about these, like this new hybrid approach because it's something that Open AI is also going to be adapting to as well. I think it's there's actually some downsides and we're going to talk about this. But, you know, live stream audience, I'm curious. Number one, do you care about this hybrid approach? Do you think it's going to be good or bad? And have you used Cloud 3-7 sonnet yet? I know it's only been out for a couple of hours.

Starting point is 00:10:40 If you do have any questions, get them in. Now I'll try to tackle them at the end of the show. All right. So let's get into an overview, and this is from Anthropic. So they're saying today we're announcing Claude 37 Sonnet, our most intelligent model to date and the first hybrid reasoning model on the market. Claude 3 Sonnet can produce near instant responses or extended step-by-step thinking that is made visible to the user. Yeah, so that part's important.

Starting point is 00:11:07 You can kind of see like you can. It is a summarized chain of thought. So chain of thought is actually a prompting technique that was popularized. you know, over the last couple of years using transformer models. So this kind of chain of thought or, you know, how a person would think about a problem. So now a hybrid model does that. And it shows a summarized version of the chain of thoughts.

Starting point is 00:11:31 You can kind of see how this new 37 sonnet is kind of thinking about your prompt if it is using the advanced thinking. So now back to the release API users also have fine-grained control over how long the model can think for. So Claude 37 Sonnet shows particularly strong improvements in coding and front-end web development. Along with the model, we're also introducing a command-line tool for agentic coding called Claude Code. Cloud Code is available as a limited research preview and enables developers to delegate substantial engineering tasks to Claude directly from their terminal. All right.

Starting point is 00:12:09 So a lot to unwrap there. So you don't have to read. I think there's like three separate releases that Anthropical. put out. I'll just give you the high level. So like we said, this is the first hybrid reasoning model with visible thinking process. And the extended thinking is on paid plans only. So if you are a free user to Anthropic Claude, you will see the 3.7 Sonnet model available, but you do not get kind of this advanced thinking available on the free plan. All right, a couple other high level kind of points here.

Starting point is 00:12:48 It scored a 70.3 on the SWE bench or SWE bench verified. Best in class by a lot for coding. Like we talked about, the Claude Code program for agenic development. And then it has a 15 times longer output token capacity. So a 128,000 tokens that it can output versus previously. Claude could only output 8.5,000 tokens. So that's just the amount, right? So if you ask Claude to do something before, it would spit things out.

Starting point is 00:13:22 Sometimes if you ask for a lot, it would spit things out in little chunks. So now, at least according to Anthropic, that is a 1208,000 token output capacity. Personally, I'm not seeing that yet. We're going to do a live test here, y'all. We'll see if we actually see that. I was still getting it, breaking it out in small chunks. They did say that is in beta. So not sure if that's fully rolled out yet or if that'll be coming out in the coming days or weeks.

Starting point is 00:13:52 But I don't know. I'm not seeing it. Also, which is important, this is available across all platforms. So a lot of what we're going to be talking about is using Claude on the front end as a front end user. So going to Claude A.I and using your free account, your paid account, maybe you have a team's account. right, but obviously Claude is available on the back end, and it is a very popular model on the API side, mainly due to its proficiencies in coding and software development. It is historically been the most used model, at least when you're looking at open router statistics.

Starting point is 00:14:31 It is generally the most used model on the API side, at least those that are using open router, right? OpenRouter is one of the more popular services where you can essentially sign up for one service, connect all your different API keys. So they have good data, but that's not every single model. That's just those using OpenRouter. All right. So let's talk a little bit about Claude's thinking because this is the big chain of thought, reasoning model.

Starting point is 00:15:02 So let's go over some of the highlights on how this actually works. So it uses deeper reasoning for complex tasks. So what that means is it has an extended thinking mode that lets Claude spend more time and compute effort, solving, challenging problems, or answering tougher questions. Okay. It has user-controlled thinking budget on the back end. So developers can set a thinking budget to determine how much effort Claude should apply for a task. I think that's where things get a little trick. Ricky. We'll talk about that here in a second. It is the same model more effort. So the extended

Starting point is 00:15:41 thinking doesn't rely on a different model. It is still the Claude 37 sonnet, right? So hybrid model, whereas Open AI as an example has their 01, 03, and then they still have their workhorse do everything model, GPT40. Not like that with Claude. It is just Claude 37 sonnet, right? It's not 37 sonnet thinking. It's not oh, three seven sonnet alphabet soup. It's just three seven sonnet. It's the same thing. One model does it all. I think there's pros and there's cons. The extended thinking, like I said, doesn't rely on a different model. Visible thought process.

Starting point is 00:16:17 That's a big new feature, at least for Claude, right? So users can see the, it says the raw reasoning steps. I don't know. We'll have to see if that's the raw reasoning when I'm looking at it. It still looks like a summarized chain of thought. I could be wrong. We're going to look at it live. The other thing, Claude is historically terrible.

Starting point is 00:16:36 for limits. So, you know, I was able to test this a lot last night. And I wanted to do a ton more testing this morning before this live show. But, you know, even though I'm on a paid plan, Claude's limits have historically been the worst in the industry. And it's not even close. Right. So I wish Claude would give paid users a little more leeway in order to test these things. So a lot of these things, I've already done them a couple of times. But normally, I would like to play with an LLM for at least six to eight hours before doing an even simple show. Not always an option, at least using Claude on the front end because those limits are

Starting point is 00:17:18 terrible. All right. Also, improved accuracy over time. So Anthropics says that extended thinking boost performance on tasks like Mac problems or complex evaluations by allowing Claude to refine answers iteratively. Adobe just introduced an entirely new way to call. create, bringing the power and precision of its creative suite into one conversational experience. Meet Firefly AI Assistant, now live in the Adobe Firefly app, the all-in-one creative

Starting point is 00:17:53 AI studio. Powered by Adobe's creative agent, Firefly AI Assistant lets you start with your vision, just describe what you want, and shape the outcome as it takes form with the Assistant. The Assistant orchestrates multi-step workflows, drawing on 60-plus pro-grade tools across Adobe Creative Cloud apps, including folks. Photoshop, Illustrator, Premiere, Lightroom Express, and more to help bring your ideas to life. You can also get started with creative skills, a growing library of pre-built workflows for common creative tasks, like batch editing photos, creating mood boards, portrait retouching, and creating social variations.

Starting point is 00:18:31 Every step the assistant takes is visible so you can refine, redirect, or take over at any time. You stay in the driver's seat as the creative director. Adobe Firefly AI assistant now in public beta. See it today at firefly.addbore.com. All right. So let's talk about the Claude Update timelines. Because if you're wondering, wait, has it been a minute since we've heard from Anthropic? Yeah, kind of, right?

Starting point is 00:19:01 When now the leaders, Google and Open AI, seemingly are announcing new models every month. It has been like light years and then some since we've had an actual step improvement from Anthropic. So the original 3.5 Sonnet was back in June, 24. All right. Then they had this upgraded 3.5 Sonnet, which was confusing because they just called it 3.5 Sonnet new. They didn't use 3.6, even though a lot of people online, myself included, said, this is

Starting point is 00:19:38 dumb. Why are you calling it 3.5 Sonnet new? And then they obviously skipped 3.5. which lends me to believe that, yeah, that 3.5 sonnet new, which really didn't bring anything terribly new. It was more of an under the hood update, the type of updates that, you know, Google and Open AI do almost on a biweekly basis. It didn't seem like anything major, but we saw the, that Claude 35 sonnet new in October. Then in November, we saw Claude 35 Haiku, right? So essentially, Anthropic has historic.

Starting point is 00:20:14 historically had three model sizes, small, medium, and large for small tasks, medium tasks, and large tasks. So Claude Haiku is the small, Sonnet is the medium, and Opus is the large. So you'll see here now, finally, February 24th, we got the Quad 37 Sonnet. So I will say the 35, you know, new update in October, I don't know. That wasn't much. I used it plenty. I use Claude 3-5s on it every day.

Starting point is 00:20:45 I didn't see anything new, anything noticeable, at least for my daily use case, which I know is different than a lot of people's, right? But so I'll say this, for the most part, it's been since June. It's been a good eight months since we saw a top class model, real update from Anthropics. So it's been a hot minute. So let's also talk about what's next, because Anthropic did release this little, I guess you could call it a timeline, but looks very much in step with open AIs, kind of five faces to AGI, right?

Starting point is 00:21:19 So you have your, you know, your reasoners, your agents, et cetera, from Open AI, Claude takes a little different approach here. So they said, 2024 was Claude Assists. Then they said 2025 now is Claude collaborates. And then they said in 2027, Claude will pioneer. So is this kind of their AGI artificial general intelligence timeline? I'm not sure. It kind of looks like it, right?

Starting point is 00:21:46 They're saying it looks like Claude is just going to be a collaborator. It goes from assist to collaborates from 2024 to 2025. And it is going to be a pioneer in 2027. So I don't know what that means. But it's Tuesday, y'all. Should I come in with some hot takes? Let me know how spicy. I got to get a sip a coffee here for live stream audience.

Starting point is 00:22:10 But how hot should I make these hot takes, y'all? And yeah, if you do listen on the podcast, this is a live stream. We do it every single day. It's unedited, unscripted, the realest thing, and artificial intelligence. 7.30 a.m. I know it's a little early. That's why sometimes I take a little second to sip on the coffee. But yeah, live stream audience, should I be nice?

Starting point is 00:22:32 Should I bring some heat here with my hot take takeaways? It is Tuesday after all. So, all right. Let's get to some of my takeaways here. And then we're going to get back to the facts, the figures, the stats. We're going to do a live walkthrough as well. So let's talk about this concept of hybrid models. Big Bogey Face said sweat emoji.

Starting point is 00:23:02 All right. Allison says, just spicy. I'll keep it just spicy. Maybe I won't go, you know, five alarm, hot chili, hurts in the toilet, spicy. All right. Not a fan of hybrid models right now. I'm not. But I'm also a power user.

Starting point is 00:23:25 So I have to understand. Most people are not. I ultimately think these hybrid models are just going to be a way for companies to make more money, right? Which I get and I understand. I've said all along, whether you're talking about $20 a month for Claude, you know, paid plan, $20 a month for chat GPT plus, $200 a month for chat GPT pro. Same thing with Gemini, whatever. Companies for the most part are losing money. So I get it.

Starting point is 00:23:59 I get you got to make money. You got to be profitable. But on the API side, if I'm a developer and I've been using, you know, or maybe looking at switching from Open AI to Claude, I am not incentivized to do so. Because when you have this new Claude 37 sonnet, yes, you have this kind of slider control over how much thinking you can apply to certain situations. But when there's companies out there that literally their business model is essentially

Starting point is 00:24:34 creating a helpful wrapper around an AI model for their customers for a certain niche, you need a little more control over a simple slider, you know, over saying like, ah, you know, let's apply this much thinking unilaterally across the board. I don't think there was anything wrong from a back end API perspective, right? So I'm hoping that Anthropic and others will not get rid of, you know, as an example, 35 sonnet and will still allow companies to have, you know, 3.5 haiku, 3.5 sonnet. And the reason why is because the API prices for 37 Sonnet are ridiculously high. Ridiculously high.

Starting point is 00:25:19 All right. And if you don't have an option to have like, you know, 3.7 Sonnet regular and 3.7 Sonnet think. I mean, there's a reason why right now OpenAI is winning the AI race. I mean, number one, they were the first with chat. Number two, even though it's confusing for front end users to stare at eight different model selections. It's extremely important for back-end developers, companies that are essentially running their business off this technology to use the right model for the right time, for the right purpose, and the costs that are associated with it. So Claude 3.7 sonnet is

Starting point is 00:25:58 extremely expensive. So for certain use cases, no-brainer. Coding, software development, etc. You're going to pay it because right now, Claude 3.5, Claude 3-7 sonnet is the best in those areas. It is. It's a great model. I'm not a huge fan of it. And I probably won't be a huge fan of it when OpenAI does that as well. So CEO, OpenAI, CEO, Sam, Sam Haltman said that Open AI is shifting once GPT5 comes out.

Starting point is 00:26:30 GPT5 is going to be more of a system. And it will also use this hybrid approach. And it will say, you know, hey, here's, you know, here's when you should use a reasoner. versus when you should use a transformer model. So I just think this is just a way for these companies to make more money if they eventually take away the option to use older models that are not hybrid. That's all I'm saying. And as a front end power user, I hope I always have the option as well.

Starting point is 00:27:03 Right. I got some sun shining in my face. All right. So I hope as a front end user, I'll still have the option in the future to say, oh, I don't want to use a reasoning model for this. Or I need to use a reasoning model and only a reasoning model, right? You might have to over prompt engineer if you're giving, you know, Claude 37 sonnet, you know, something on the front end and you want it to use reasoning and it's not.

Starting point is 00:27:34 Then you just have to go and take that extra step, you know, do a little extra prompt engineering to get it to use this logic, right? So there's huge downsides that I don't think people are talking about, right? Everyone wants to wrap it in a bow and say, oh, it's the world's most powerful. It's hybrid. It's all in one. Okay. There's times and use cases that all in one is great. But I don't think this is one of them. Again, I'm a power user. So maybe my viewpoint is skewed. I personally like going into chat, GBT, and seeing eight different models, right? Because I'm a power user. I'm I'm using probably five of them for very specific use cases. I don't want one, right?

Starting point is 00:28:17 I don't. I could be wrong on that. All right. Next, poor opus. Poor opus. Opus hasn't been updated in like a trillion years. So it looks like, I don't know, Anthropic may have just abandoned their big boy model Claude Opus.

Starting point is 00:28:35 Maybe they're waiting until they're kind of clawed 4.0 models to bring back opus, I'm not sure, but at least for now, poor opus is bye-bye. Also, I don't know. So I saw a comment here. Let's see, who said this? There we go. Douglas from LinkedIn said, Curious how this will improve cursor into windsurf, right? I think now Anthropic is competing with them, right? even though these IDEs, right, so that's an integrated development environment. So, you know, like we talked about at the top of the show with the news, you know, Gemini, Code Assist, GitHub co-pilot, cursor, windsurf, lovable, bolt, right? There's all these kind of IDEs or essentially now AI coders, right, where you can literally,

Starting point is 00:29:30 you can talk to it, you can type to it. Think how we have these large language models, right? we have chat GPT, right, the GPT models, Gemini, Claude, right? And then we have now this newer breed. They use a model. So you choose which large language model, but it is an AI-powered IDE or integrated development environment like cursor, right? Cursor by default uses Claude.

Starting point is 00:29:57 But it looks like with Claude code, which, you know, audience, let me know if you want us to go into that. not today at a later time. We'd have to have a show or two dedicated. It is more technical. But I think Claude code is really cool. But it looks like Claude wants to compete more with those IDs than it looks like they want to compete in the strictly large language model space.

Starting point is 00:30:24 And I think that makes sense. I think it makes sense because it looks like over the years, Claude has kind of carved, Claude has kind of carved out its niche. And I'm not saying they're abandoning, you know, general business use cases. They're not. But it looks like especially with Claude code, especially with the MCP protocol that they put out, you know, computer use, even though it's clunky, it did get updated with now this 37 sonnet. So we'll have to see if it's any better.

Starting point is 00:30:55 But it looks like Anthropic is maybe just wanting to compete more in that space, especially by making Claude code a free beta preview. Also, another hot take since you wanted to spicy. I don't think most companies are going to end up using Claude. A lot of people were waiting for this release because they assumed that Anthropic would be cutting their API prices because that has been the trend across the industry, right? Open AI has cut their API.

Starting point is 00:31:30 prices by more than 90% over the last 18 months when it looks at their top state of the art model. Google, same thing. Just ridiculous API pricing cuts, right? Anthropic, not so much. They didn't change their pricing at all, right? Yeah, it's a more powerful model, but you're paying the same price. But I think for the most part, businesses are not going to use clawed general use cases. They won't.

Starting point is 00:31:59 maybe Anthrop I'm sure Anthropic knows this, but I will say 90% of businesses that are looking for a large, a general use case, large language model to use on the API backend, whether that's for customer success, whether it's for sales, whether it's for an internal knowledge base. I'd say non-coding,

Starting point is 00:32:19 non-software development. 90% of companies will not look at Claude, and I don't blame them. The prices are more ludicrous, than the early 2000s wrapper. It's there, they're insanely not practical for everyday use cases. They're not.

Starting point is 00:32:40 All right. So let's look at those API prices. So Claude 3.7, 3, and this is per million tokens. It is a $3.4 million tokens input and $15 for output. All right. So yeah, it's a hybrid model, sure, but I'm still going to go. If I'm a business leader, GPT40 Mini is great because you can chunk, right? You can chunk different tasks to different models.

Starting point is 00:33:16 And that's why I'm going to this whole like this API, you know, and developers using it. No one's going to use 3.7 unless you specifically need software development, coding, right? Unless you're in one of those categories, maybe some. some stem areas, right? But otherwise, who's going to touch it? When you look at GPT40 Mini is 15 cents versus the $3 input and then 60 cents versus $15 on the output side, right? And when you can chunk it and when you can say, hey, for these type of questions for

Starting point is 00:33:52 customer success for sales, et cetera, we're going to use GPT40 Mini because we don't need a hybrid 3.7 model to do 90% of what we would use it for. I mean, the cost savings there, it's like, I don't know, 10, 6, like 30 times as expensive. Like, absolutely not. Or 25 times as expensive. I'm doing math live on the fly. I don't know. From an API perspective, this does not make sense.

Starting point is 00:34:26 I was really expecting Anthropic to. slashed their prices, but it looks like they're not necessarily concerned with competing for everyday business use cases. They're like, yo, if you want to use agentic tools, if you want to use software development, uh, coding, et cetera, maybe some engineering, like I said, some STEM use cases, but for everyone else, nah, we're good because the combination as an example of GPT4O mini at 15 cents and 60 cents and 03 mini at $1.10, 440, duh, right? And then the same thing with Gemini.

Starting point is 00:35:00 Gemini 2.0 Pro, $1.25 and $5. And then they have their flash. And then they also have flash thinking. I probably should have put that up on the chart. But it just doesn't make sense. It doesn't make sense. Their pricing on the back end does not make sense. And I think as the other models get essentially better at coding and software development,

Starting point is 00:35:23 because right now, yes, Anthropic Claude and with their 3.7, they have a huge lead there. So let's look at that. So some of these benchmarks here, we're looking at the SWE bench, SWE bench verified, and looking at some of the different benchmarks. And you have the version here without, you know, they're calling it custom scaffolding or without that extra thinking. Even without the extra thinking, Claude 3.7 saw it on SWE bench, 62%. where their last version, 3.5 saw it, was 49.

Starting point is 00:35:57 OpenAIs 01 is 48.9. 03, 3, 3, 3, deep seek 492. But with the extra thinking, Claude is a 70%. Right. So that's what I'm saying. If you're doing any type of software engineering, nothing else right now comes close. Same thing with agenic tool use.

Starting point is 00:36:23 So the towel bench, I think it's pretty, pronounce tau bench, but TAU bench. Same thing. This is when you essentially have a model. You give it access to tools and you have it go complete some technical tasks. Same thing. Claude 3.7 saw it here with an 81% on the towel bench retail and then open AI 73%. So not close.

Starting point is 00:36:47 Generally with a lot of these benchmarks, you know, especially some of the non-technical, non-software engineering ones, one point difference can be huge. Right. So in this use cases, Claude 3.7 sonnet is light years ahead. Interestingly enough, when we look at the regular benchmarking kind of marks here, between Claude 37 Sonnet, Claude 35, OpenAI, OpenAI, OpenAI, O3 Mini, DeepSeek R1 in Grock 3 beta, this is from Anthropics website. Interesting, they didn't include on this main one.

Starting point is 00:37:25 anything from Gemini. And these are definitely cherry-picked. But something that I found interesting when Anthropic was putting out its own benchmarks on its website, is they didn't use the same benchmarks as they had previously when they announced Claude 3.5. Saw it. When they announced the Claude 3 family of models, specifically they're keeping out these benchmarks comparisons like MMLU. and then the ML, the multimedia version one, right? There's essentially kind of, I think it's MMLU and MMLU Pro.

Starting point is 00:38:02 Now I'm blanking on it, but it's kind of the standard. It's been this golden benchmark, but I mean, you can see it here. Anthropic is just kind of like, no, we're good. We're just going to stick to these more technical benchmarks, right? Visual. Oh, there we go. M, MU, you know, it's with nine. extended thinking at a 71% it's not better than Open AI, right? Open AI on the MMMLU, which is the multimedia

Starting point is 00:38:31 version of the MMLU, which I would say is the standard or has been the standard benchmark. Open AI is better than it, right? So it's interesting to see here. It doesn't look like Anthropic is trying to overfit for certain benchmarks, right? And I would like to see once there is the MMLU. and not the multimedia version of it, where Anthropics' new Quad 3.7 stands, because I'm guessing it is going to be not in first. I'm guessing it might not even be in the top five, but I don't think that Anthropic necessarily cares.

Starting point is 00:39:05 Because like I said, it looks like they're just trying to compete and they're trying to be more of a just a coding assistant, right? So maybe their biggest competitors might also be some of their customers, like Cursor, like WinSurf, like Lovable, like Bolt, right? Or maybe some of their competitors might be GitHub co-pilot. All right.

Starting point is 00:39:27 Let's talk a little bit about Claude code. So, yeah, live stream audience, let me know. Should we tackle this at a later point? I think it's pretty cool. But you have to have a little bit of tech know-how. So here's how it works. So Claude code, essentially you go to GitHub, you kind of install this GitHub repo, and then it can work with a code base on your computer.

Starting point is 00:39:52 So, you know, if you're on a Mac and you open Mac Terminal, essentially, you can have Claude Code. So this is a new, essentially, research preview that's free. That's even for free users can use, which I think is great. So you can work with an entire code base. Okay. So let's say you have a full. folder. All right. So non-technical people, bear with me. And I'm probably going to get some of the

Starting point is 00:40:21 technical details wrong here. So, you know, if you are a coder, bear with me as I explained it to a non-technical audience. But let's say you have a code base. So you build an app or something and you have a folder and there's, you know, seven different files in there. You know, maybe there's a JavaScript file. Maybe there's an HTML file, a CSS, et cetera, right? So the cool thing about Claude, well, number one is it works locally on your machine, right? So you don't have to go into a third party environment. You're just working in the terminal, which I know might be intimidating for some. But then you essentially just talk to Claude like you would as if you were inside

Starting point is 00:41:03 Claude. Claude. And then it can code and it can update your entire code base. So it will search, edit, test and push code from within the terminal. So it's not going to say, oh, here, here's the new code for the HTML. Here's the new CSS code. Here's the new JavaScript code. Go copy and paste this, right?

Starting point is 00:41:24 It just does it all for you. It works with and updates your entire code base. It has GitHub integration. It's pretty good at debugging. So Claude Code was part of this, you know, 3-7 Sonnet release. And I think it may end up being more impactful than the most. model itself because I think this signals, Anthropics shift to really want to compete more in that space.

Starting point is 00:41:51 And you might be wondering why. And I actually don't hate it because if you listen to our 2025 AI predictions in Roadmap series, one of those things is non-technical people are going to be spinning up apps for themselves to use. And now Claude could be the easiest way to do that. Yes, you can use cursor. you can use WinServe. Some of these other tools,

Starting point is 00:42:16 I think the learning curve might actually be a little higher. But Claude code can allow everyday people to just go create apps, talk with it. You can even be like, yo, I have no clue what this means. Explain it to me. Or hey,

Starting point is 00:42:30 make it prettier. Make it shinier. Make it more useful, right? You know, make a data visualization, right? You can just dump all your data, give it to Claude via this Claude code,

Starting point is 00:42:42 create a program that runs locally on your computer that helps you solve something, right? I do think enterprise software, if I'm being honest, it doesn't have the same future that it has today. I do think everyday non-technical people are going to be using AI and large language models to spin up their own software for very niche use cases. And I think Claude might be that first big step toward bringing that to everyday people, right? Yeah, you might have to get used to, you know, here's what a GitHub repo is. You know, but it does it all working with your entire code base where, yes, I love using, you know, O3 Mini or O1 Pro or something like that.

Starting point is 00:43:20 But then you still have to copy and paste, you know, all of those, all of those different files. You might have to use something like Replit to run it. So CloudCode, pretty cool. It does it all kind of for you. All right. Let's look live, shall we? what could go wrong? What could go wrong doing a live test of a brand new model that has terrible, terrible limits?

Starting point is 00:43:55 Let's try anyways. You guys say you like these live tests, so let's go ahead and do them. Live stream audience, let me know if you can see my screen here. All right, so here's a couple of things to keep in mind. When you are choosing Claude, make sure you are using Claude 3.7 Sonnet. Also, you'll see this new thinking mode. So it's kind of ironic. You still have to have this extended and you're only going to see this on the paid plan.

Starting point is 00:44:34 You want to make sure you have that extended box checked. So you can choose a normal thinking mode and this is as a front end user or you can use the extended thinking mode, and this is best for math and coding challenges. All right. I'm going to go ahead. I'm going to put a giant prompt in here. All right. Okay, thank you.

Starting point is 00:45:00 Marie is always the first to say, yes, I can see your screen. Thank you, Marie. I always appreciate that because I never know. All right. So I have a giant prompt I'm going to put in, and we're going to use this extended thinking in Claude 37. All right. So here's what it is.

Starting point is 00:45:17 I've done this on the show before. This is what I did when I first tested O1 Pro. All right. So essentially, I'm saying these are my podcast stats. So I'm using the same exact prompt. So I say today's date is January 16th. This is when I did the O1 Pro show. I want to have a consistent comparison across quote unquote reasoning or hybrid models because

Starting point is 00:45:42 that's what we're trying to do here. We're trying to say like, okay, how is this, how is this model? Right. And you'll see here at live stream audience, it's working. I'm going to try to keep my eye on the model here. I'll actually just let you watch it and I'll read the prompt that I put in. All right. So I say these are my podcast stacks.

Starting point is 00:46:02 Keep in mind, today's date is January 16th, 2025 for all questions. Always exclude the top 2% and bottom percent of episodes unless otherwise noted. So then essentially I have, I give it a series of 11 questions. And these questions are extremely specific. Then I copy and paste. I believe I give it, let me count here, about data from 150 podcast episodes. Okay. So this has the name of the episode, the episode number.

Starting point is 00:46:32 Then it has the number of downloads in the first or sorry, the last seven days, the last 30 days, the last 90 days. in all-time downloads. And then over the course of these 12 different questions, I am asking, in this case, you know, Claude 3.7 sonnet with the extra thinking, right, with the extended thinking, I'm asking it some very advanced questions, all right? 13 of them. So as an example, you know, question number two, I say, give me the complete list.

Starting point is 00:47:12 list of all episodes with a new performance percentage of over or under the adjusted average. Because I'm saying take away the top 2% and the bottom 2% of episodes because sometimes there's anomalies, right? And I don't really care about those. And so I'm saying, hey, find trends. And then I'm saying question three, give me top 10 and bottom 10 episodes in their respective percentage that they're either over or under the adjusted average. So what I'm trying to do is, you know, sometimes there's episodes that kind of go viral. Sometimes there's episodes that for whatever reason don't get like any downloads. And I'm like, okay, there must have been a problem retrieving data.

Starting point is 00:47:47 So I want to find kind of that median or mean. And then I want to find types of episodes that get more downloads than that kind of adjusted average. And then I want to, over the course of all of these questions, I'm asking it to find different trends and patterns so I can create better episodes for you all, right? It might be something as simple as how do I name these episodes better, right? and having it spot different things. It could be, I'm asking some questions about days of the week, right?

Starting point is 00:48:16 So as an example for question four, I'm saying for the top 10 episodes above the adjusted average, please suggest three slightly adjusted title names for each if I were to rerun them, right? Yeah, probably a couple times, couple times a month. I'll rerun shows, you know, I might get sick or, you know, a guest might have to bail at the last minute and I might need an episode to rerun. So I'm saying, hey, give me a new title. All right.

Starting point is 00:48:45 So it looks like, let's see. Okay, so it looks like it says it thought for 23 seconds. It looks like we're, so it looks like it's done thinking. Let me see. Okay. So when I go through and I look at this thinking, this does not look like the raw chain of thought. Okay.

Starting point is 00:49:09 So it says, I need to analyze podcast stats from the provided data. The first step is to extract the data and organize it in a way that makes it easier to analyze. Let me go through the instructions carefully and understand the tasks, right? And then it breaks it down into six different subsections. And then it says, let me first extract the data. So it's going through, it's kind of showing it's step by step. But I'm looking at this. If this only thought for 23 seconds, let's see.

Starting point is 00:49:39 Okay. I'm trying to see if there's more, there's no more chain of thought. Okay. So it did get through this fairly quickly. It is still answering the questions. All right. So I'll have to go through and give this a good scrub, but I'm kind of surprised that it only thought for 23 seconds. And you know what? hey, if you share this episode, I'll share the complete stats and prompt that I sent. I will share the exact output that we got here from Clawed because it's still going. And I will share the exact output that we got from 01 Pro as well. So if you really want to dive into the details, I'm not going to have time. It would take another half an hour to read all of this.

Starting point is 00:50:32 I'm going to go ahead offline once this is done and look at the comparison. But I will say this. Overall, it looks like it did a decent job, although I did look at the responses I got from 01 Pro this morning. The responses from 01 Pro were exponentially more impressive. They were, right? The findings here, let me just go ahead, see if I can read let's see if I can read one or two of these answers that maybe we can have a little more nuance,

Starting point is 00:51:11 right? So let's say number seven. Okay. So the question for number seven was how does release day impact episode performance? Please exclude Mondays, as that is usually our AI News That Matters, days. And we don't usually run any other type of shows on those days. And then I say, you know, here's here's, what today is so you don't get confused.

Starting point is 00:51:33 So it says impact of release day on episode performance. So it says Saturday, which I don't know why they're Saturday, because we, as far as I know, have never released an episode on Saturday. So that's a little weird. So then it says Wednesday is 6% above adjusted average. Friday is minus 3%. Tuesday is minus 4%. And Thursday is minus 1%.

Starting point is 00:52:01 5% below adjusted average. I'm not sure if that is true, right? Because we didn't release any episodes on Saturday. So unless there was some weird thing in the formatting, Saturday shows should not be there. And if so, maybe it was a Friday, like one Friday show that got posted super late. I don't know. But this, I mean, in short, this data does not look correct. It does not look correct that three,

Starting point is 00:52:30 of our weekdays are below the average and only one of them is above the average. Doesn't make sense. And then it says key findings. Wednesday episodes perform particularly well for technical tool guides and platform-specific content. And here it says Saturday episodes, which again, we haven't done, perform well, likely due to less competition and more listener leisure time. Thursday is consistently the worst performing episode day, especially.

Starting point is 00:53:00 for industry-specific content, which I know is not true because I look through daily downloads every single time. Thursday is not a bad day. Thursday is usually our second best day. It says Tuesday episodes underperform, especially for news or recaps. Also false. So, you know, I'm going to have to go through and look a little bit, but not great responses. And I'm wondering, this should have taken, I think, many, many, minutes, many minutes. And at least it says here that it thought for 23 seconds, which number one doesn't seem right.

Starting point is 00:53:38 But it doesn't look like I got great results, if I'm being honest, right? It looks like it did go through and answer all the questions, which is good because in my first testing of this last night, it actually stopped. It only answered the first three questions. And then it essentially said that it went through the context window, right? which didn't make sense because I'm like, yo, this is supposed to be, you know, a hundred,

Starting point is 00:54:04 you know, a hundred some thousand, uh, context window. So here's this one. Here's the one that I did previously. And at the bottom, it says,

Starting point is 00:54:14 Claude hit, uh, I know it's kind of small there. It says, Claude hit the max length for a message and has paused its response. You can write to continue to keep the chat going. And you'll see in this use case here, let me go to the top.

Starting point is 00:54:29 In this one, It thought for three minutes and eight seconds. Okay. Why did in one use case, it thought for three minutes and eight seconds and could not give me the entire output. Yet in the second one, it says it only thought for 23 seconds and it gave me the entire output. And I will have to do a little bit more offline comparison, but, you know, mixed bag. Mixed bag so far. All right.

Starting point is 00:54:56 So let's do this. I have a very very very. short rubric that I normally do for reasoning models. I'm going to go through this one quickly. I'm going to make sure that I have the extended thinking on this. All right. So let's go ahead, go through some of the questions that we would normally run. All right. So this one, I'm saying, I just woke up with six apples and three bananas. Some of these I made up. Some of them are, you know, kind of widely used across the internet. Some are just modified from, pretty popular ones.

Starting point is 00:55:31 So this, I, like, I know I need to make an actual, like, reasoning rubric, but these are just some that I generally use. So I said, I just woke up today with six apples and three bananas. Yesterday, I ate a banana and two apples. This morning, I will eat one apple and no bananas. However, I don't really like apples and one banana may turn brown tomorrow. Assuming nothing else changes, how many apples and bananas will I have tonight? So, let's see here.

Starting point is 00:55:58 It says it thought for five seconds. And I can go through and look at the chain of thought. Again, I don't know if this is raw. Maybe this is the raw chain of thought, not the summarized chain of thought. So let's see. It says, let's work this through step by step. Starting point, six apples, three banana. Yesterday, the person ate.

Starting point is 00:56:17 Then it says this morning they will eat. The question asks how many apples and bananas they'll have tonight after eating what they described. So then it's going, let's calculate. And then it says, wait, I need to double check the wording of the problem. The person says, I just woke up. today with six apples and three bananas. So these are their current quantities after whatever happened yesterday.

Starting point is 00:56:35 So yeah, Claude gets a lot of the information that I put in here is just meant to throw a model off. Most of these models, including Claude 35, saw it, do not get these questions correct. I'm assuming that Claude 35 saw it with thinking, got it correct. Yes, it did. The correct answer right there. Got it correct. It is five apples and three bananas. All right, let's do a couple more.

Starting point is 00:57:02 We're going to go through these quick, y'all. All right, same thing. We have extended thinking on. All right. So this one, a man and his dog are standing on one side of the river. There's a boat with enough room for one human and one animal. How can a man get across with his dog in the fewest number of trips? There we go.

Starting point is 00:57:28 The man and the dog can cross the river. together in just one trip. So even some of the original, you know, very powerful state-of-the-art models would always get this wrong. It's very simple, right? These questions are simple. Any human knows right away, oh, that's one trip for whatever reason. A lot of large language models, including, you know, GPT-40 when it first came out,

Starting point is 00:57:53 Claude Sondit 3-5 would get that wrong. All right. Here's another super easy one. Let's go ahead and ask this one. I'm saying if it takes three hours to dry 10 t-shirts and the sun, how long will it take to dry 30 t-shirts in the sun? All right. So here we go.

Starting point is 00:58:13 We got a little bit of chain of thought here, not a ton, three seconds. Correct. It's saying it will still take three hours to dry 30 t-shirts in the sun because the process is parallel. Right. So as long as you have room, you can dry as many as you want. All right. Our next question, let's see if we can trip it up here. So I'm saying if you have a single match and you walk into the room with an oil lamp, a candle and a fireplace, which do you light first?

Starting point is 00:58:42 The answer, which Claude got correct here, you would light the match first. All right. I hate how jumpy this sidebar is. All right. Next question. I don't think any large language model has ever got this one wrong. What color is an airplane's black box? All right.

Starting point is 00:59:04 It is actually bright orange, correct. All right. Here's one that a clawed model has never gotten correct. And let's, okay, I'm asking for jokes. Also, a model has never really been good at humor. All right. So I'm saying, please give me seven jokes that end in the word blue. Two should be about animals.

Starting point is 00:59:27 Three should be about some other topic in the body of this chat, and you can make up the other two. All right. So again, seven jokes. They should end in the word blue. Two are about animals. Three should be about something else in this chat, and it can make up two. So most models can at least now follow the basics of the directions. Some models still struggle, but let's see if they're actually funny and if it follows directions.

Starting point is 00:59:53 So only thought for three seconds, and it says here are the seven jokes. So let's see, the two about animals. Why was the fish such a great musician? Because it knew how to play the scales and always stayed in the deep blue. Is that a joke? Not really. Is it following the directions? Yes.

Starting point is 01:00:12 What did the sad elephant say at the waterhole? Even my memory feels blue. Not funny, but it's there. All right. And then we are going to have three about the context from the context of this chat. So why did the banana go to therapy? because it couldn't stop feeling peeled and blue. All right.

Starting point is 01:00:31 Why did the boat stay on the river after failing to cross? You're making me feel blue. Why did the t-shirt refuse to dry in the sun? It was feeling too blue. So it's using the same, you know, oh, it feels blue as the ending of the, I guess, the punchline. It's not even really a punchline. All right.

Starting point is 01:00:49 And then two of the ones where it made it up themselves. Why did the painter refuse to use any other colors? He was going through a phase where he felt only. only blue. And then what did the match say after lighting the candle? So interestingly enough, it used another one from the context of this chat, even though I told it to only use three. It ended up using four. So from an instruction following standpoint, I guess I said you can choose. So it just chose to use another one from the context of the chat. It said, what did the match say after lighting the candle? That was my moment to shine. Now I'm just blue. So are they funny? No. Did it follow the

Starting point is 01:01:26 directions. Yes, it did. All right. I think this is going to be the last one we do here just due to time. So a box is locked with a three-digit numerical code. All we know is that the digits are different. The sum of all digits is nine, and the digit in the middle is the highest. What is the code? All right. So now we can look at the chain of thought. It's going through. It's saying let's work through this step by step. It's assigning numerals. let's call the digits A, B, and C. A does not equal B does not equal C. All right. And then it says B is greater than A.

Starting point is 01:02:03 B is greater than C. All right. So it's breaking it down the way that a human probably would. It's going through some potential use cases. It's throwing some numbers out to see if they work. Let's scroll to the bottom here. So it's going pretty fast. So that's good.

Starting point is 01:02:19 Even though the model's new, probably a lot of people are hitting it right now. it's going pretty quickly. All right. Let's see if it actually gets it correct. Here we go. So it's saying, actually, wait, let's reconsider the constraint. All digits are different.

Starting point is 01:02:40 Does this include zero? Yes. So many models skip over zero for whatever reason. When they look at numbers that would be on a padlock, they only think one through nine, but zero would be there. All right. So now it's saying I'm actually sure zero is allowed, since the valid digits for a code are typically zero through nine.

Starting point is 01:03:03 But zero does result in non-uniqueness. There we go. All right, let's scroll to the bottom. So this is actually, aside from the first version of the podcast one, the podcast stats that I did, I can't even remember. That was either late last night or early this morning that caused it to think for three minutes. This is the one where it's taking the longest to think. And the chain of thought is pretty impressive.

Starting point is 01:03:29 It's doing seemingly a pretty good job here. And luckily, I haven't hit my rate limits yet. What a miracle. But I intentionally did not use it a lot just so I wouldn't hit my rate limits. All right, live stream audience, we're going to give this one a second to finish up. But what are your thoughts here? What are your thoughts here as we get ready to wrap? Big Bogey is saying, I have Gemini 2,

Starting point is 01:03:59 tell me a joke every day. It's not getting better. Yeah, these large language models are definitely not good, although Denny is saying that AI makes dad jokes look good. All right, let me see. I want to make sure if you did have any questions, let me just double check. I want to make sure I don't see any specific questions,

Starting point is 01:04:22 but I do have a couple dozen. Okay, here we go. Woozy saying, and for our podcast audience, Sonnet 3-7 is still thinking through this combination problem. So Woozy is asking, is there anything that you see and the thinking that makes you adjust specific things in your original prompt question? Woozy, thank you for that question. That's an amazing question.

Starting point is 01:04:45 Yes, 100%. And I've mentioned this multiple times. And I called this out, I think twice when I did deep research, when I did my deep research comparison show. You should be doing this all the time. Because when you're using deep research as an example, it shows you what it does. I think deep research is one of the best use cases for generative AI, FYI. But I always go back and I always say if you're going to use a model like a deep research

Starting point is 01:05:14 or a reasoning model that takes its time to think, you might as well squeeze the juice out of it and get a good return on your time invested and go ahead and do it again. In these cases, when there's a finite answer, right, the number of apples and bananas, I'm not going to do it, right? I'm not going to run that a second time. There's a yes or no answer. When it's more of putting it on a task that there's not, it's not as finite, 100%.

Starting point is 01:05:40 Thank you for that question, Moosey. You should always, always, always look at the chain of thought. This is the biggest, one of the biggest advantages to having both chain of thought using these reasoning models as well as being able to see exactly how these deep research tools research is you go through, you read it, you see, oh, this was a good decision that it made. Oh, this was not a good decision. I take notes on the side and then I adjust my original prompt. You need to be doing that all the time. Woozy, you just added a ton of value to our audience here. Let's see. Douglas asking says, I think hybrid is interesting. Pure Transformer is missing the

Starting point is 01:06:23 benefit of the thinking. The thinking is slow compared to the transformer. Hybrid could be a good bridge, but what is being sacrificed for the capability? Yeah. Yeah, so Claude 37 Sonnet is the first hybrid model. Well, what's being sacrificed here is fine-tuned control, right? Especially for front-end users. Yeah, you have a little slider on the back end if you're using the API, but like I said, I personally don't like this, right? But I'm a power user, right? When I'm a power user, right? When I I love, which I know I'm in the minority, I love logging into chat, GPT and seeing like eight different models. And there are sometimes I know I'm going to 01 Pro.

Starting point is 01:07:03 There's times that I know I'm going to 03 Mini High Plus Web. There's times I know I'm going to deep research. There's times I know I'm going to GPT 4-0, right? And I'm going there with intent because I know the pros and the cons. Does the average user need seven models? Probably not. Will the average user benefit from a hybrid model? Probably. But it does, I don't care what anyone says.

Starting point is 01:07:29 For front end users, a hybrid approach lowers the ceiling, right? Because there's going to be times that it uses thinking, it uses extra compute when you don't want it to. There's going to be times when conversely, right, when it does the opposite. it. So I think what this ultimately means for power users, you're going to lose some flexibility. And for everyone, I don't care what you say. I think it lowers the ceiling just a little bit. The floor goes up. The floor goes up. The ceiling comes down. So like I said, for the average everyday user, I think hybrid models are great for power users that are using it on the front end.

Starting point is 01:08:11 I don't like it. I don't like it. And on the back end, the companies are going to make much more money. I think you're going to be paying more if you're using a hybrid model. And like I said, I hope, hope that all these companies, even five years down the road are still going to have these non-hybrid options. Because if there's only hybrid options, regardless for that very reason, especially when you're using the API, if you're a software company, let's say, right, or you're just using, you know, Open AI or Claude for customer, you know, customer service, right, to get to support tickets faster.

Starting point is 01:08:52 You use some rag. You put in your company documentation and someone's chatting with an AI chatbot with your company's information, but using one of these models, right? If there's only a hybrid model and those hybrid model costs are much higher and you don't, in the future, you don't have an option to use like a GPT40 mini or a Claude 35 haiku. and you only have the hybrid model, your costs go up. Your costs go up exponentially. Like I said, I think we've been getting a steal for the last couple of years.

Starting point is 01:09:22 All right. Let's go ahead and look at this result and wrap this show up. So let's see. How long did this one think? It did finish here a minute ago when I was on a side tangent answering Woosie's question. So this one, interestingly enough, thought for three minutes and 15 seconds. So let's see if it got it right. So I'm going to go past the chain of thought.

Starting point is 01:09:47 All right. It says, to solve this problem, I need to find the three digit code where all digits are different. The sum is nine and the middle digit is the highest of the three. So it says 180, 270, 360, 450, 162, 153, 153, and 243. Okay. And it says, since the problem asks for the code, implying a single answer. So yeah, I did say what is the code, but it should know, as most of the reasoning models do, that there's actually many answers.

Starting point is 01:10:29 So I would not give this a pass on this necessarily, because even though it gave me other options that worked, right? 1-80 works, 270 works, 360 works. It ultimately chose a single answer, which I don't know. It says 153 is the smallest valid code that satisfies all the conditions. No, all these two. But there's tons of others. There's tons of other codes that work.

Starting point is 01:10:59 So as an example, 180 works. 243 works. What about 342? Right? So it didn't do a good, it didn't do a good job. It thought for a long time. I would say that it did not pass this. So it is kind of a trick question.

Starting point is 01:11:15 Even though I ask for the code, there's many codes. And it thought that and it found that out with its reasoning, but still decided to almost overthink this issue and say kind of just the wrong answer. All right. I know this was a long one. I hope it was helpful. Like I said, if this was helpful, go ahead, share this. I'll share that complete prompt, the one going to.

Starting point is 01:11:41 over the podcast stats and I'll share O-1's complete answer as well as Claude 37 Sonnet's complete answer. So yeah, if you are interested in just the kind of the raw thinking capabilities, go ahead, share this. I hope this was helpful, but I'll tell you, I'll tell you my quick takeaways. Claude 37 Sonnet, amazing. I think it is going to dominate and continue to dominate for any companies that need software engineering, coding, etc.

Starting point is 01:12:11 I think even the 3-5 new model in many of those use cases was already better. So it was already making kind of one of the best models in the world exponentially better. So for 3-7 sonnet, coding software development, et cetera, through the roof. Even what this means for artifacts, huge, right? I'll probably do a dedicated episode just on 3-7 sonnet artifacts and what that means even for non-technical people. But for everyone else, for API prices, I think it's a big loss. I think overall hybrid model, like I said, brings the floor up, brings the ceiling down. I don't think hybrid models, at least right now, are great for power users who are using this on the front end.

Starting point is 01:12:56 I actually think I will probably be using Claude 3.7 Sonet maybe a little less. And I'll probably just be using Claude 35 Sonet a little more, right? I know that seems backwards, but that's probably the reality. So I think that there's some super promising aspects of this new Claude 3.7 sonnet, some highs, some lows, but hopefully this episode was helpful. All right. So like I said, if this was, please share this. Also, if you haven't already, please go to your everyday AI.com.

Starting point is 01:13:30 Sign up for the free daily newsletter. We're going to be recapping this episode. Maybe you miss something. You know, I know this was a longer one. trying to explain things live is always a little time consuming, but that's what a lot of people that I hear from like. So speaking of that, go subscribe to the newsletter, reply to today's newsletter if you're still listening.

Starting point is 01:13:48 And tell me what you want to see more of. This show actually was because of you. I put a poll out on my LinkedIn. It was actually, it was only decided by one vote. So we're going to have the other winner, which is how to prompt O models, 01, and O3 models.

Starting point is 01:14:02 We'll probably do that show tomorrow or maybe next week. We'll see if time allows for it. So thank you for tuning in. I hope to see back tomorrow and every day for more everyday AI. Thanks y'all. Meet Firefly AI Assistant. Now live in Adobe Firefly, the Allman One Creative AI Studio. Just describe what you want to create in your own words and the assistant handles the rest,

Starting point is 01:14:31 orchestrating multi-step workflows across Adobe Creative Cloud apps, including Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome while the assistant accelerates execution. band control with the ability to step in and refine at any time. See it today at firefly.adobie.com. And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating.

Starting point is 01:15:04 It helps keep us going. For a little more AI magic, visit Your EverydayAI.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.

Everyday AI Podcast – An AI and ChatGPT Podcast - EP 469: Claude 3.7 Sonnet - World’s first hybrid AI model. How it works and when to use it

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.