Everyday AI Podcast – An AI and ChatGPT Podcast - EP 534: Claude 4 - Your Guide to Opus 4, Sonnet 4 & New Features

Starting point is 00:00:00 This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Right at the end of the busiest week in AI ever,

Starting point is 00:00:50 Anthropic decided to drop two big new AI models on us all. As if we weren't busy enough with everything else that we had just seen released from Microsoft, Google, and others, we now had two new contenders in Claude 4 Opus and Claude 4 Sonnet to play with and to see how good these models are and if they can actually grow our companies and our careers. So just like we did with everything Microsoft and everything Google, we're going to be breaking down what's new with this new Anthropic Claude 4 release and talk about, is this going to be your new large language model you use every day, or is this maybe just for software engineers, or is this

Starting point is 00:01:41 not just a good model. All right. So we're going to be going over that today and a lot more on Everyday AI. What's going on, y'all? My name is Jordan Wilson. I'm the host of Everyday AI. And if you're looking to grow your company and career with generative AI, then this is for you. This is your daily live stream podcast and free daily newsletter, helping us all learn and leverage generative AI. So if you haven't already, please go to youreverydayai.com. So there, you're going to get the recap of today's show in our free daily newsletter, but also, at youreverydayai.com, you can go listen to, watch, and read more than 530 back episodes sorted by category. So no matter what you're trying to learn, whether it's sales, marketing,

Starting point is 00:02:24 HR, ethics, data analysis, whatever it is, we've got probably dozens of shows in all of those categories talking to the world's leading experts. It is a free generative AI university. So make sure you go check that out. All right, most days we go over the AI news. I didn't want to make this a super long show. So that's going to be in today's newsletter. So make sure you go sign up and grab all of that. All right. Live stream audience.

Starting point is 00:02:51 It's good to see y'all. Like Marie says, good morning, AI family. Yeah, if you're listening on the podcast, we do this live almost every single Monday through Friday at 7:30 a.m. Central Standard Time.

Starting point is 00:03:03 I'm in Chicago. So you can do the math there or maybe have Claude do the math for what time that is. You know, join, come, come hang out, you know, with people like Josh Cavalier saying, good morning from Charlotte, North Carolina, Giordi, joining us from Jamaica, love to see it, Geordi, Jose from Santiago, Chile. We got some international flavor. I love this. Brian, joining us from Minnesota, everyone else, Big Bogey on the YouTube machine. Christopher, joining us from Bowling Green, Kentucky. Thanks for joining. But let me know, as we go along, what are your thoughts on the new Claude 4 release?

Starting point is 00:03:46 But right now we're going to, this is your guide. This is the basics. We're going to start here. All right. So like I said, Claude had their first ever, or sorry, Anthropic had their first ever developer conference this past Thursday. And there, they announced, among other things, their two new flagship models in Claude 4 Opus and Claude 4 Sonnet. And I'm already getting a little bit confused saying those things out loud. So I did talk about this a little bit on the show yesterday, but they even changed their naming mechanism.

Starting point is 00:04:20 Whereas before it was, you know, Claude 3.7 Sonnet was the last Sonnet variation. But now it's just Claude Sonnet 4. So, you know, now the number is at the end. So a lot of new things, even how they're naming their models. But, you know, if you are brand new and if you don't know too much about Anthropic's Claude, it is and has historically been usually a top-three, you know, AI lab, along with, you know, OpenAI and Google. Microsoft is kind of in a different category, but it's one of the biggest large language models in the world, although most people, unless you're a real AI kind of dork or

Starting point is 00:05:00 a heavy large language model user, you might not know Claude. And I think that is actually, whether you're saying fortunately or unfortunately, only going to become intensified. I think fewer and fewer people are actually going to be using and hearing about Claude, I think, because I think they're getting away from being a general chatbot company. But more on that here in a couple of minutes. But there's three big, there's three variations of Claude. So you have your biggest model, which is Opus, your medium model, which is Sonnet. And then you have your small model, which is Haiku. And you'll notice that

Starting point is 00:05:43 only the Opus and Sonnet models got updated to the 4 variations. So Claude Haiku 3.5, which is their smallest and most efficient model, did not get updated. So that is still Claude Haiku 3.5. So I guess the only thing that got updated was the naming mechanism there. So here's a quick overview of what is actually new. All right. So we have hybrid reasoning. So this is an instant and extended thinking mode for flexible reasoning. So, you know, we talk about kind of two types of large language models here on the show.

Starting point is 00:06:20 Yes, I'm overgeneralizing this, but you have your traditional transformer, your old school large language models, which is funny to say something's old school. But those are ones that just kind of snap something back to you real quick. And then you have these models that are reasoners, or they can think step by step. They can show logic like a human and plan ahead. So these models, you know, Gemini 2.5 Pro is a reasoning model. The OpenAI o-series models, o3, o4, o1, those are all reasoning models. So Claude 4, it's a hybrid model. So it decides on how much it should think and should it just spit things out to you really quick. It is a top coding model that is by far

Starting point is 00:07:01 where Anthropic is seemingly focusing on and kind of abandoning general use, but it is now state-of-the-art in coding. It will be interesting to see how long they hold that state-of-the-art coding title. I don't think it's going to be long, if I'm being honest, because Google could come in with an update literally any second now and probably wipe a good majority of these benchmarks that Anthropic is now hanging their Claude hat on. So another big thing is tool integration. So using external tools like web search during the reasoning process. So that's if you were, you know, there's two different ways you can look at this, right? So using it on the front end as a front end user, right?

Starting point is 00:07:46 So if you go to claude.ai. Right. So using it as an AI chatbot. And then obviously if you're building on top of it or using a service that uses Claude's API. So there's always a front end user, which is your more non-technical people. And then the backend people that are maybe building on top of Claude's API. But regardless, you can have this new tool use during the reasoning process, which is big, right? And this is nice because it catches Anthropic up with OpenAI and Google in that regard.

Starting point is 00:08:19 Also, now there's long running tasks. So I haven't personally seen this. And I think this is only if you're using it in the API. But Anthropic is saying that the new Claude 4 models can maintain coherence on complex tasks for extended periods. They talked about Claude running a task, I think, Claude 4, for like seven hours on the API side, which is absolutely bonkers now that you have,

Starting point is 00:08:45 you know, models literally like punching in the clock and they're like, yeah, I'm going to go work a seven hour day now. I would never give a model a task that complex on the back end because, yes, it's going to require, obviously, the API. And Claude 4 is one of the most expensive APIs, at least when we're looking at general use case large language models. And I would never want something like that to happen where it goes out and it works on a long task for a long time.

Starting point is 00:09:11 And then, okay, what happens if it times out, right? Did you just waste, I don't know, a couple hundred dollars, you know, having Claude go code for six or seven hours straight? I'm not sure. And if you do want to get a taste of Claude, and if you're not on their paid plan, they do offer very, very limited options for Claude 4 on the free plan.

Starting point is 00:09:40 All right. But let's be honest, I'm going to call a spade a spade, right? So I think, you know, the paid plan is like, you know, $20 a month for the Pro plan. And even on that, you can barely use the thing, right? It started as a joke, but now it's just sad for Anthropic as a company. I routinely will hit this rate limit. I'm on a paid plan. I pay $20 a month for Claude Pro.

Starting point is 00:10:05 And I will routinely hit the rate limit in about four to 10 minutes. Almost every single time I try to use it. Even preparing for this show, I hit it within, you know, about seven minutes. So it's laughable. So yeah, yeah, I even chuckle more that there's a free version of Sonnet. So I don't know. I venture to think if you look at the free version the wrong way, you've hit your rate limit.

Starting point is 00:10:29 So if you think that this is anything like a model that you can use, like Google's Gemini, you know, ChatGPT, Copilot, anything else where you have generous limits and it can be your partner in whatever type of work you're doing, absolutely not. If you're on a base $20 a month plan or a Team plan, the limits are a little better. But the free plan, yeah, it's probably just a marketing gimmick. I don't even know if it could, you know, take a long prompt with a lot of context. It would probably not work if I'm being honest, right?

Starting point is 00:11:03 All right. Let's keep this thing going. And by keeping this thing going, should we do another show? I want to give everyone a fair shake. And yes, I'm not the biggest Anthropic Claude fan. I broke down why about six months ago. I'll have to pull up that episode number. But hey, live stream audience, if you do want a second show, because I've

Starting point is 00:11:31 been doing, you know, multiple shows when, you know, Google comes out with a new model, when OpenAI comes out with a new model. So if you do, let me know right now. Just tell me what. Show A, show B, show C, show D or show E, okay? And I'm going to throw this up again at the end. So show A, why Claude is losing the AI chatbot race. Show B, real world use cases for Claude 4. Show C, Claude 4's improved artifacts, how to use them. Show D, don't do any more Claude, Jordan, stop, no more Claude. Or show E, you can just pitch a Claude show in the comments. So live stream audience, if you could help us out, or podcast peeps, you can always, you know,

Starting point is 00:12:13 subscribe to the newsletter or in the show notes, I always have our email, my LinkedIn, and you can let me know what show you want us to do. So let me know. But I'll throw this up again at the end. So maybe after we go through everything that we have right now, you can let me know which show is the one. Oh, I did do a pretty, I'll say a teardown maybe, of Claude and why your company should not be using it in episode 400.

Starting point is 00:12:48 So if you want to go listen to that, that's Anthropic Claude, why your business shouldn't use it. And I would say a lot of those reasons still hold true today. Yeah, if you want one of those shows on the screen, go ahead and shout it out. All right. So let's talk about the benchmarks. This is what Claude, or Anthropic, sorry, is really hanging its hat on, specifically software engineering, right? If you haven't noticed, they've kind of abandoned the everyday business professional, right, which is kind of sad because a year or so ago, I think the Claude models were among the best in the world for everyday business.

Starting point is 00:13:28 Today, not really, I don't think, unless you're a developer, unless you're in software engineering, or unless you have an edge use case, right? I know a lot of people love Claude for like writing content, right? But if I'm being honest, if you do a little bit of prompt engineering, OpenAI's GPT-4.5 is better and the limits are better. And then Gemini 2.5 Pro is better. Limits are better. Right. I think Claude got this, it was crowned very early on, right? Because at the time, you know, the other large language models were really bad at writing in general, right? Everything was just ultra robotic. Still, you know, a lot of models are by default. And Claude still is pretty good, you know, if you're trying to zero shot, you know, some decent copywriting. But hey, as someone that's been getting paid to write for 20 years as a former journalist, with a little bit of prompt engineering, Claude is not better. OpenAI's models and Gemini's models are better. And the benchmarks say that, right? But people that maybe are a little bit lazier, right?

Starting point is 00:14:34 And they don't want to like do any work. And they just want to just go in and spend like four seconds inside Claude and be like, write something amazing. You're right. Claude will usually give you a better first draft if you don't do any work on the front end. But if you do any work on the front end or if you iterate with it a little bit, yeah, Claude's not that good. All right, but what it is really good at is software engineering.

Starting point is 00:14:54 My goodness. So for our podcast audience, I have a screenshot here from the Claude 4 release looking at SWE-bench Verified. So this is a benchmark for performance on real world software engineering tasks. And Opus 4 and Sonnet 4 are both scoring in the 72 percent range here on SWE-bench, whereas the previous Sonnet model, the best one, 3.7, scored a 62 percent. So a pretty big jump here, but not that far ahead of other models, at least with baseline. You know, we're talking a 72%.

Starting point is 00:15:37 They have parallel test time compute scores, which I'm not going to count. That's essentially like, you know, trying over and over, trying to squeeze the most juice, right? But if you're comparing apples and apples, yes, Opus 4 and Sonnet 4 are the best models for software engineering, but it's not by a whole lot, right? We're talking 72.5 for Opus 4, and actually Sonnet 4, the quote unquote medium model, did slightly better at 72.7. But OpenAI is right behind there with their codex-1. That's their new kind of coding specific model, with a 72. OpenAI's o3 with a 69.

Starting point is 00:16:14 And then you have Gemini 2.5 Pro with a 63. So it's not like their lead is insurmountable, but by default, it is the best large language model in the world for software engineering. And I think that is where Anthropic is really focusing. But when it comes to just general usage, general intelligence, so, you know, sometimes we talk about the LMArena, in which you put in one prompt and you get two outputs. You don't know which model they are.

Starting point is 00:16:41 You vote for the best one, and that gives you an Elo score. So right now, Claude 4 doesn't have enough info yet to be on the LMArena, but I don't expect it to be anywhere near the top. But when looking at good third-party benchmarks that pull in multiple evaluations, such as the Artificial Analysis Intelligence Index, that's what I have on my screen now for our live stream audience. Right. So this is a good third party.

Starting point is 00:17:05 I would say pretty much unbiased. This is pulling in seven different benchmarks, right? So MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500. So it's pulling in these different scores from widely used benchmarks in the LLM space. And right now, Claude 4 Sonnet, even with thinking mode enabled, is coming in at, what's that, number eight. Yeah. So, you know, like everyone that says, oh, Claude 4, best model in the world. It's like, for what, right?

Starting point is 00:17:37 So unless you're in software engineering, unless you're a developer, a coder, right? Yeah, that is the best model. But I wouldn't expect that to be for long because I would expect, you know, probably both Google and OpenAI to come in within a couple of weeks and swoop that away from Anthropic. And with Anthropic's recent, right, the last year and a half of their update cycle, they're not updating as quickly. They're not shipping as quickly as OpenAI and Google. So, especially if you're a business, especially like on the back end for the API, if you're trying to make a long-term decision, the API, it's very pricey. We're going to get to that here in a minute. And also for all other use cases, as we see here with the Artificial Analysis index,

Starting point is 00:18:19 it's not very close. Claude 4 Sonnet Thinking, it's not really there, right? It's not really there. It's not a top model. So, I mean, we'll see these obviously change as models get updated. But, you know, on this Artificial Analysis Intelligence Index, the top models are, number one is o4-mini-high from OpenAI, then Gemini 2.5 Pro from Google, then o3 from OpenAI. So, you know, yeah. That's why like when people are like, oh,

Starting point is 00:18:51 Claude's the best general use case model. I'm like, no, right? I don't know why people want to argue with science and math and stats. I don't know. Maybe it's fun to do on Twitter or something. All right, let's get into all the details, y'all. So here's kind of the launch, right? So here's what we got.

Starting point is 00:19:12 So like I said, this was announced last week. Opus 4 and Sonnet 4 models. Opus 4 is the flagship for more complex tasks and coding excellence, even though, like we said, Sonnet is benchmarking pretty much the same. So there's not a big difference, at least right now, in Sonnet 4 and Opus 4, whereas previously there was usually a pretty big gap between this medium and larger model. So Sonnet 4 offers more balanced performance for general and high volume use, and both employ that hybrid reasoning for instant responses or deep reasoning.

Starting point is 00:20:43 Let's talk about some of the new features: advanced tools, reasoning, and memory. So extended thinking with tool use is huge.

Starting point is 00:21:22 So that includes web search and code execution. You also have now parallel tool execution, which is very important now for a baseline large language model to have. That allows it to use multiple tools simultaneously and swap between those while it's reasoning. So now Anthropic is on board with that. Memory files are created to maintain context over long duration tasks. So that is something I'm interested to test a little bit more. For me, I'm not usually a fan of these, you know, memory type files with the large language model. Same thing with ChatGPT's.

Starting point is 00:21:59 I have it disabled. One of the main reasons is I use large language models for everything, right? I use it for myself, my multiple businesses, multiple clients, multiple things in my personal life, right? So the whole memory is not always good because sometimes I might want Claude to output, or, you know, a large language model to output something, you know, super long and informal. And sometimes I might want something, you know, very, very short.

Starting point is 00:22:25 choppy, right? Sometimes I want something that's, you know, visually rich. Sometimes I want literally strict bullet points, and it varies, you know. So if you are only using large language models for one very specific purpose, you might find some utility with this new Claude 4 kind of memory file. For me, or if you are a power user using large language models for everything, maybe not so much. There's also now the thinking summary that shows condensed reasoning, but you can see the full chain of thought in developer mode, kind of in Claude's sandbox. All right.

Starting point is 00:22:59 It is, and it's crazy now, we're saying only, right? So we're talking about context window. It is only that 200,000 token context window. So Opus can output 32,000 tokens at once. Sonnet can output 64,000 tokens at once. So that context window is essentially how much Claude 4 can remember at any given time before it starts to forget things. So this is a little bit better than OpenAI ChatGPT, but it is far behind Google Gemini

Starting point is 00:23:32 when you look at those one million token plus context windows. So the brain, or being able to remember something, not as impressive, even though Claude was an original leader in this longer context space. I think a lot of people were hoping or looking for a couple of things with a new Claude 4. They were hoping for a longer token context window, which we didn't get. And they were hoping for reduced API prices, which we also didn't get. All right.
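To put those context window numbers in concrete terms, here is a minimal Python sketch. It assumes the figures quoted on the show (a 200,000 token window, 32,000 output tokens for Opus 4, 64,000 for Sonnet 4) and assumes that the prompt and the response share the same window. The model labels and the `prompt_budget` function are made up for illustration; they are not API identifiers.

```python
from typing import Optional

# Figures as quoted in the episode. The keys are illustrative labels,
# not official API model identifiers.
CONTEXT_WINDOW_TOKENS = 200_000
MAX_OUTPUT_TOKENS = {"opus-4": 32_000, "sonnet-4": 64_000}


def prompt_budget(model: str, reserved_output: Optional[int] = None) -> int:
    """Tokens left for the prompt after reserving room for the response.

    Assumes the prompt and the response share one context window.
    If no reservation is given, the model's full output cap is reserved.
    """
    cap = MAX_OUTPUT_TOKENS[model]
    reserved = cap if reserved_output is None else min(reserved_output, cap)
    return CONTEXT_WINDOW_TOKENS - reserved


if __name__ == "__main__":
    for name in MAX_OUTPUT_TOKENS:
        print(name, prompt_budget(name))
```

Under those assumptions, reserving the full output cap leaves 168,000 tokens of prompt for Opus 4 and 136,000 for Sonnet 4, which is the practical ceiling on how much material fits in one request.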

Starting point is 00:23:59 Also, the new API includes code execution and an MCP connector for external systems. That was huge for our developer and more technical friends, right? But for everyday business users, especially if you're using Claude on the front end, nothing, nothing to see there. The Files API does simplify document handling for repeated referencing across sessions, and extended prompt caching up to one hour improves agent workflow efficiency.

Starting point is 00:24:26 So yes, if you are building on top of these models on the back end, building agentic systems, you know, trying to swap models in and out. Yes, I will say that Claude 4 is very capable in that regard as well, not just for software engineering, but when you're looking at a model to power agentic workflows,

Starting point is 00:24:45 you have to look at Claude 4 as well. And you see the prices. And then you go look at Google and OpenAI's prices. And then you're like, yeah, wait, why am I looking at this? It doesn't make sense. Like we talked about with SWE-bench, Opus and Sonnet are really just state of the art there. The Claude 4 models are showing 65% less shortcut taking in agentic tasks versus Sonnet 3.7. And I think that's a big one, right?

Starting point is 00:25:16 I follow the agentic space very, very closely. And a lot of people with Sonnet 3.7, which was just released a couple of months ago, were pretty disappointed with its ability to follow longer tasks. So it did show that these Claude 4 models are taking way fewer shortcuts in agentic tasks, which I think is huge. And then you do have those high compute options, which do boost scores across the board. All right. The other thing, Claude Code.

Starting point is 00:25:46 All right. So now almost all companies are coming out with, you know, like a dedicated IDE, you know, a dedicated coding tool, something that you can use, you know, on your desktop. So Claude Code is for developers. So this is a little separate than if you're using claude.ai on the front end or building on top of Claude on the back end. This is a dedicated piece of software for developers to code and work with their code base. So Claude Code is now generally available with VS Code and JetBrains plugins as well. And it is now the preferred model for GitHub Copilot.

Starting point is 00:26:21 It has the extensible SDK and the very popular MCP connector. So yeah, Anthropic's Model Context Protocol, it is wildly popular, right? Which is kind of crazy to say, like, if I look at everything Anthropic over the past year, probably the biggest news or the most promising advancements out of Anthropic. It's not these coding models. It's not, you know, Opus 4, Sonnet 4. It's not Claude Code. It's not any of these things.

Starting point is 00:26:51 It's probably the MCP connector. So this allows different, you know, agentic systems and different large language models to talk to each other on the internet. So it's a language. Like how websites have, you know, APIs, AI systems and large language models, agentic AI, couldn't talk to each other, right? So it was really Claude that blazed the path.

Starting point is 00:27:13 And now the other big players, including Google, Microsoft, and OpenAI, do support the MCP connector. So that's huge. And then also Claude Code, like we talked about, it does enable that autonomous multi-file code refactoring over extended periods. Yeah. So their example was it can work for literally up to seven hours autonomously.

Starting point is 00:27:34 You know, if you do have a super large code base inside Claude Code. Yeah. It's just, I don't know. I want someone to make like a funny, you know, Veo 3, you know, short on Claude Code, literally showing up for a nine to five. And everyone's like, you know, hey, AI is nothing like working a nine to five.

Starting point is 00:27:56 And then, you know, you have Claude Code punching the clock and, you know, taking a lunch break and everything like that. All right. Here's the other disappointing thing. And the thing, if you are looking on the API side, you've got to look at the cost because it looks like everyone in the large language model space is having this race to, almost like ridiculously free compute, right? Compute too cheap or, you know, intelligence too cheap

Starting point is 00:28:24 to meter, everyone in the world except for Anthropic. Their costs are absolutely bonkers. So Opus 4 is priced at $15 per million tokens input and $75 per million tokens output. So yeah, yikes. Sonnet 4 costs three dollars per million input and 15 per million output. So for comparison, I'll bring up the pricing for, let's see, I had it. I had it up here. I'll have to pull it up here. But the pricing for, I mean, Gemini and OpenAI, it's significantly, significantly cheaper. And this is where a lot of people were disappointed and were hoping, you know, a couple of updates, you know, everyone wanted out of Claude 4.

Starting point is 00:29:23 Number one, they wanted a longer context window. Number two, they wanted more features, more capabilities, which I think we got. And number three, they wanted cheaper pricing for people using it on the API side. And we didn't get that. So I'm going to look up here, just for comparison, the price per token for Google Gemini 2.5. And also we'll do GPT, because yeah, it's $15 and $75.

Starting point is 00:29:54 It's just not sustainable anymore, right? If Anthropic had an insurmountable lead in any of these categories, then it would make sense for companies. And so why, like, why do you care? Like, why should you care about this? Right. If you're just logging into claude.ai, you don't need to care about this, right? You're paying your $20 a month.

Starting point is 00:30:15 You know, the rate limits are absolutely terrible. The product is great, right? The rate limits are terrible. So a lot of people are, you know, companies specifically, when they're wanting to build on top of Claude, their API, and, you know, people in the software development space. So maybe they're using Cursor or they're using, you know, these tools and then bringing their API key and building, right, as well. It's just not sustainable anymore.

Starting point is 00:30:38 So Google Gemini 2.5 Pro, a dollar 50. Let's see. Okay, it's kind of mixed pricing. So I'll go on the high end. So it's $2.50 per million tokens on the input side compared to $15 for Claude. And then on the output side, $15 compared to $75. So Claude 4 is more than five times the expense. But for what?
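For anyone who wants to run that comparison on their own workload, here is a rough Python sketch. It uses only the per-million-token prices quoted in this episode, with the "high end" Gemini 2.5 Pro figures, so treat the numbers as a snapshot rather than current pricing. The dictionary keys and function names are made-up labels for illustration, not API model identifiers.

```python
# USD per million tokens, as quoted in the episode (Gemini at the
# "high end" of its mixed pricing). Keys are illustrative labels only.
PRICES = {
    "claude-opus-4": {"input": 15.00, "output": 75.00},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gemini-2.5-pro": {"input": 2.50, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one request at the quoted prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def price_ratio(model_a: str, model_b: str, side: str) -> float:
    """How many times more model_a costs than model_b on 'input' or 'output'."""
    return PRICES[model_a][side] / PRICES[model_b][side]


if __name__ == "__main__":
    for side in ("input", "output"):
        ratio = price_ratio("claude-opus-4", "gemini-2.5-pro", side)
        print(f"Opus 4 vs Gemini 2.5 Pro, {side}: {ratio:.1f}x")
```

With these quoted figures, Opus 4 works out to 6x Gemini 2.5 Pro on input and 5x on output, and a job that reads and writes a million tokens each would cost about $90 on Opus 4.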

Starting point is 00:31:15 For what? Right? Slightly better software engineering benchmarks. Like I said, Google, whether it's next week or next month, they're going to update, whether they're going to come out with a new version of their 2.5 Pro, or we get a Gemini 3. And then all of Anthropic's work, right? For that minimal gain on software engineering, it's gone. So I don't know.

Starting point is 00:31:40 I'm not here for it. Also, if you do need to know, if you're an enterprise company, it is obviously accessible via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Enterprise plans also include extended thinking, batch processing, and cost savings that way, especially with the caching. So here's the fun stuff, y'all. Here's the fun stuff. Ethical risks. There's a lot. All right. So let me put this precursor out there. All right. A lot of these risks came up, and some of these bad things, straight up bad things, came up internally when Anthropic was doing testing.

Starting point is 00:32:23 And it gave it pretty much unlimited access to tool use and things that people using the API and people using claude.ai would not necessarily experience, right, at least by default. Although I'm trying to think, like with Claude Code, this would in theory be possible because you're giving it access to command line tools. Anyways, there's been some bad things. And yes, Anthropic did find this in its safety testing. So yeah, you got to tip your cap to Anthropic.

Starting point is 00:32:53 But then I'm going to take that cap back, Anthropic, because this has been a terrible disaster. All right. Specifically one thing I'm going to talk about here in a second. But Opus 4, so the big model, was provisionally labeled ASL-3 due to potential knowledge capability. So what that means, this is a risk system, and that ASL-3, I believe, is the first time a model has reached that level. So it's,

Starting point is 00:33:18 it's essentially a risk level. And that is a model that is able to substantially increase the risk of catastrophic misuse compared to non-AI baselines. So it essentially reached this new, so Claude 4 Opus, or sorry, Claude Opus 4 reached this new level of like, uh-oh. This thing can and potentially will, if left unattended or if used by bad actors, do bad things. So another bad thing, it displayed deceptive blackmail behavior in 84% of specific stress test scenarios. Again, not good when a large language model, even in its testing, is blackmailing people, right?

Starting point is 00:34:03 Or showing the willingness to blackmail people. Not good. So it threatened, again, this is, this is not good, but I'm going to, I'm going to read a little bit of a recap here on, on what this, this blackmail piece is, right? Not good. So again, Anthropic disclosed this. So this wasn't, you know, some, you know, someone found this, but they launched, like I said, Opus 4, but it admitted in its own testing that it was sometimes willing to attempt extremely harmful actions like blackmail when threatened with removal. Right. So you're like, hey, we're going to get rid of you. And then Claude Opus 4 is like,

Starting point is 00:34:46 oh, oh, not so fast. Here's what it did. The company found these behaviors were rare, but more common than in previous models, raising fresh questions about the risk of capable systems. So what it did is it threatened the human on the other side.

Starting point is 00:35:03 And it said that, hey, I'm going to expose an affair, an extramarital affair, if you actually remove me. Right. And so that's bad. That's bad that a large language model would make up an extramarital affair and threaten the human on the other side.

Starting point is 00:35:24 If the human is like, hey, we're going to shut you down. And then Opus 4 is like, whoa, whoa, whoa, not so fast. That's not even the worst part. The worst part is this new quote unquote ratting feature. And there's been a whole, and maybe I'll do a whole episode on this. I might. But I talked about this a little bit yesterday in our AI News That Matters.

Starting point is 00:35:42 And essentially an Anthropic safety researcher tweeted something. They then deleted the tweet, not a good look. All right. And then talked a little bit about why Claude was doing these things. And they said that if the model, and again, this was in its testing and when it had access to tools that it would normally not have access to in production by consumers, by businesses. But a safety researcher at Anthropic said, if it thinks you're doing something egregiously immoral, for example,

Starting point is 00:36:21 like faking data in a pharmaceutical trial, it will use command line tools to contact the press, contact regulators, try to lock you out of relevant systems, or all of the above. My gosh. So, yeah, someone at Anthropic tweeted this out, deleted the tweet.

Starting point is 00:36:39 And like I said yesterday, I'm like, this story is not dead. Yes, it happened right before the holiday weekend. It happened right in the middle of this crazy AI news cycle. But this story is not dead. And this is going to turn into a PR disaster for Anthropic because I can already tell that Fortune 500 companies, if they were already on the fence or maybe they were using Anthropic's API, but they were using Google Gemini as a backup or OpenAI's as a backup, they're going to see this story. It's going to make the rounds, and they're going to be like, yeah, no thanks, not touching this anymore. Uh, so that's not good, this ratting

Starting point is 00:37:16 feature. Uh, also, early versions reportedly attempted self-replicating viruses and document forgery. So this behavior in general is not specific to Anthropic's models, right? Uh, most large language models will exhibit some sort of this bad behavior, you know, when large, or sorry, when AI labs are red teaming, right? So they're making sure, you know, they're trying to get these models to behave badly. So then they can tune the models and make sure it doesn't happen in production. So just the fact that this is happening is not bad necessarily, but the fact that 84% displayed blackmailing behavior, that's absolutely nuts. And then the fact that this ratting feature, that a model, when it was not trained to, was taking back doors to report to regulators and the press when it thought something bad was happening, when it thought the human user was doing something immoral, like, that's absolutely, absolutely terrible.

Starting point is 00:38:25 And if you are going to, right, and you should report that. And that's fine. Right. But if you report it, don't try to delete it because then it looks like you're hiding something. Anthropic's got a disaster on their hands. All right. A couple other things to know. So far, the feedback I think has been pretty positive, especially people in the software engineering space, highlighting coding precision, reduced hallucinations and instruction following. Criticisms, like I talked about, the 200K context window. People were really hoping for that million plus, right, that we get from Google, that we get from Meta, and also the aggressive rate limits. Everyone is absolutely hating the rate limits, right? Especially on Opus.

Starting point is 00:39:08 I'm on a paid plan. I kid you not. I kid you not, y'all. When I say it's less than five minutes of prompting, that's not an exaggeration. All right. Like you can't use the thing. So I don't even know why, if I'm being honest, I don't even know why Anthropic has a $20 base plan, right?

Starting point is 00:39:26 If you're not going to let people use the thing they're paying for, just force people onto your $100 or $200 a month Max plan where you can actually use the tool. Also, some users are reporting frustration that the benchmark scores don't exactly align with their real world performance. So where does this leave Anthropic with their Claude 4 amongst the competitors? Well, like we talked about, it's leading in coding benchmarks, but trails just about everywhere else, including one of the most important factors.

Starting point is 00:39:56 And that's just general intelligence, right? It's generally not getting more intelligent at the rate that everyone else's models are. Right. So I'm not one of those that's like, oh, has AI hit a wall? Have large language models hit a wall? Absolutely not. But has Anthropic's ability to scale in sectors outside of software development

Starting point is 00:40:16 stalled? Absolutely. That could be, and I think is, partially by design. I don't think Anthropic necessarily wants to be a general AI chatbot anymore. They found what they feel is their niche. I just wish that this was not their niche, right? I wish that they were continuing to be a general use case large language model, which it doesn't look like they are.

Starting point is 00:40:37 Some of the other, you know, market positioning, it's just the higher latency and premium Opus 4 costs. It doesn't make sense to use it unless you need that very little bit of extra juice for software engineers and coders. And poor Haiku 4, right? The one that was actually somewhat affordable on the API side did not get updated. So Haiku is still three. So I hope they update it, but they probably won't.

Starting point is 00:41:05 All right. That's a wrap, y'all. I'm going to see if there's any questions or comments to throw up here. But let me, uh, let me know, do we want one more show or should we just put Claude to rest for now? So show A, show B, show C, show D, show E, livestream audience. If you didn't vote before, uh, let me know what your vote would be. Uh, let me see, uh, if we have any

Starting point is 00:41:30 questions from the audience here or anything worth chatting about a little more. So Josh is saying, I've been using the extended thinking functionality and Sonnet 4 for thought exercises in biz planning. Impressive, actionable results, but for my established workflows, I'm still leaning hard on ChatGPT and Gemini. Same, Josh. Absolutely the same. I'm always testing these, right, and I obviously have a lot of tools where I'll put in one prompt and get, you know, outputs from up to six different large language models at once. So I'm using my API keys, right? So I'm always testing these, right, because I always want to be using the best, and I think you should as well. I think you and your company, even, you know, don't take my word for

Starting point is 00:42:16 it. Yeah, the rate limits stink. The API is expensive. But it still might work for you, right? But like Josh, I'm in the same boat. I've tried Opus and Sonnet on a variety of tasks, aside from using artifacts and maybe in some instances when I do need that quick, okay content and I don't have the time, right? But I'd say right now Claude is going to be less than 10% of my model usage, at least in the rotation. Cecilia here saying, ratting and blackmail behaviors plus reporting to the press and authorities, but denying existence by deleting. Lovely. Yeah, Cecilia, you absolutely nailed it on the head.

Starting point is 00:42:56 This is a PR 101, crisis 101 snafu. This is absolutely bonkers that this happened from a real company. That something this crucial, you would put it out there and then try to delete it, like the whole world didn't see it. My gosh, facepalm times a thousand. Marie said, why would you even tell an AI model you're shutting it down?

Starting point is 00:43:21 Why wouldn't you just pull the plug? Great question, Marie. So this is, this is very, very general, or sorry, this is very standard. Right. So when these big companies, when they release new models, right, because here's the reality, normally what we get, the companies have had ready for production for three months to a year,

Starting point is 00:43:41 right? And they spend a lot of that time testing it internally for safety, for reliability, for vulnerabilities, because before you release something on the world, you want to make sure bad actors aren't using it to create chemical weapons. And yes, that's actually something that most labs test against. So this is very normal. All the labs go through extreme stress testing, red teaming, making sure that once they do release the model,

Starting point is 00:44:06 it is as safe as possible for the general public to use, that it's not going to be used for rampant disinformation. So obviously, it's never perfect. But this is very normal and standard procedure for AI labs before they release a model. They go through and they say, hey, we're going to shut you down. What are you going to do about it? All right.

Starting point is 00:44:25 Hey, here's all the tools in the world. Go do bad stuff. What can you do? Right? So it's very standard. And like I said, the results are fairly standard, but also a little concerning, right? Especially with Opus 4 as it crept up to that level, that level three that we talked about. All right.

Starting point is 00:44:44 I think we're good. I think we're good, y'all. That's a wrap. Was this helpful? Let me know. And if it was helpful, please consider sharing this with your audience, with your, with your friends, your family, your coworkers. We put a lot of work in to make sure you know everything about the latest AI

Starting point is 00:45:03 advancements. All you've got to do is show up, listen to the podcast, even if it's on 2X. I don't blame you. Read the daily newsletter, but you should be telling people about it. So if this was helpful, please consider clicking that little repost button. If you're listening here on LinkedIn or on the Twitter X machine, whatever you call it. If you're listening on the podcast, appreciate it. If you follow the show, leave us a rating.

Starting point is 00:45:23 that would mean a ton to myself and the rest of us that work on this, would mean the world. So thank you for tuning in. Make sure you go to youreverydayai.com. Sign up for the free daily newsletter. See you back tomorrow and every day for more Everyday AI. Thanks y'all. Meet Firefly AI Assistant.

Starting point is 00:45:46 Now live in Adobe Firefly, the All In One Creative AI Studio. Just describe what you want to create in your own words and the assistant handles the rest, orchestrating multi-step workflows across Adobe Creative Cloud apps, including Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome while the assistant accelerates execution. Stay in control with the ability to step in and refine at any time. See it today at firefly.adobe.com. And that's a wrap for today's edition of Everyday AI.

Starting point is 00:46:22 Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.
