Everyday AI Podcast – An AI and ChatGPT Podcast - Ep 386: Claude 3.5 Sonnet Updates - AI can use computers now?

Episode Date: October 23, 2024

AI can use computers now? Yup. With Claude 3.5 Sonnet updates, Anthropic's LLM now has access to 'Computer Use.' Is this new mode going to change how we use LLMs? And what else is noteworthy with Claude's new updates in 3.5? We'll go over it all.

Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion: Ask Jordan questions on Anthropic Claude
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn

Topics Covered in This Episode:
1. Claude 3.5 Updates
2. Computer Use Feature
3. API and Pricing
4. Model Benchmarks
5. Potential for Business Applications

Timestamps:
02:15 Daily AI news
05:10 New updates from Anthropic
06:32 Claude excels in human-like writing, lacks connectivity.
09:56 Claude 3.5 updates: Sonnet new, now labeled.
11:36 New computer use excels with unstructured data.
14:39 Discuss Anthropic's unique API and pricing strategy.
18:13 Claude 3.5 Sonnet excels in benchmark comparisons.
23:07 Cherry-picking without fair benchmarks undermines credibility.
26:33 PPP course improves prompt usage effectively.
27:37 Model omitted; operates logically using chain-of-thought.
31:17 Anthropic omitted model to avoid poor benchmarks.
37:09 Automated research and planning for sunrise viewing.
39:40 New tech handles errors; works with unstructured data.
43:44 Utilizes screenshots for computer vision, correcting issues.
46:02 Using the API quickly exhausts token limits.
48:38 Evaluate potential business impact of Anthropic's feature.

Keywords:
Anthropic, AI technology, programming a virtual computer, future implications for businesses, OpenAI, shipping product in beta, Sonnet 3.5, Haiku 3.5, AI in future work environments, daily AI newsletter, computer use feature, Robotic Process Automation (RPA), API and Pricing, Claude 3.5 Sonnet, benchmarks, community engagement, Jordan Wilson, Claude's natural language interface, Docker, Amazon Bedrock, Google Cloud's Vertex AI, MMLU Benchmark, Coding Benchmark, Math Problem Solving, Chain of Thought Reasoning, Host's Opinion, Prime Prompt Polish Course, Stability AI, Midjourney, Canva

Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)

Start Here ▶️ Not sure where to start when it comes to AI? Start with our Start Here Series. You can listen to the first drop -- Episode 691 -- or get free access to our Inner Circle community and all episodes: StartHereSeries.com. Also, here's a link to the entire series on a Spotify playlist.

Transcript
Starting point is 00:00:00 This is the Everyday AI Show, the Everyday Podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Anthropic's Claude just released a pretty sizable update to its large language model
Starting point is 00:00:54 Claude. So now there's a Claude Sonnet 3.5 new and a Claude Haiku 3.5. So a lot of people are talking about these new models and the benchmarks, and we're going to get to all of that in today's episode of Everyday AI. But it's actually the computer use update that I think is really worth talking about because now the large language models that we all use can control a computer like a human with natural language. So the same way you might instruct and sit over the shoulder of an intern or a new hire and tell them how to do a certain task, you can now do that with a large language model. And this is the first time that you've been
Starting point is 00:01:41 able with natural language to do that. So we're going to be talking about these Claude 3.5 updates and the computer use update and what that means for the future of work. All right. What's going on, y'all? My name's Jordan Wilson and welcome to Everyday AI. This is your daily live stream podcast and free daily newsletter, helping us all learn and leverage generative AI to grow our companies and to grow our careers. So if that sounds like you, thanks for tuning in. If you're listening on the podcast, make sure, as always, to check out your show notes where we'll have links, but probably the most important link is our website, YourEverydayAI.com.
Starting point is 00:02:20 If you have not gone there yet, like, why the heck not? It's literally a free generative AI university, almost 400 episodes talking to the world's leaders in AI. You can go learn for free, whatever you care about, sales, marketing, HR, education. It's all there. So if you haven't checked it out, please go do that. All right. So let's first start off as we do every single day by going over the AI News secret live stream poll.
Starting point is 00:02:48 I'm currently number two, but I might be moving over to number one, live stream audience. Let me know what you are there. All right. So today's AI news, we got to do it a little bit differently because for a single day, this was in the last, you know, almost two years that I've been doing this every single day. Yesterday was one of the busiest days for just software updates and big releases. So for today's AI news, I'm just going to break down just the AI visual tools, all right, because I could give you the bullet points of all of these, but we're just going to give you a list. All right.
Starting point is 00:03:29 So Stability AI, one of the leaders in AI image generation, just released Stable Diffusion 3.5, big update. Genmo released Mochi 1. That is an AI video tool that's open source. So a kind of company out of nowhere released it. It's very, very good. Ideogram, which we've covered on the show before, by former Google engineers, they just introduced a very, what I think is a unique feature in a kind of a collaborative canvas way that you can work with their AI image generator.
Starting point is 00:04:07 Canva finally added Leonardo's AI Phoenix model to the mix. So after they acquired Leonardo AI a few weeks back, so now Canva finally rolled it out. I think they called it Droptober, which I don't hate. Midjourney announced they'll have an image editing mode soon. So you won't just be able to create images in Midjourney. You'll be able to edit them. And apparently they gave musician Grimes early access
Starting point is 00:04:37 and she's been posting about it. And Runway released Act-One, which essentially you upload a video of yourself talking like I'm doing now, and then a character, right, whatever style of character you want, will talk as you're talking, mimic your facial expressions. So a really big step forward in what can be accomplished in animation. It's wild.
Starting point is 00:05:05 Out of all those, I think maybe the Runway one is the one that I'm most excited about. So all right, a little bit different roundup of the AI news today because like I said, apparently everyone, if you're in the AI video or AI photo space, everyone just said, okay, October 22nd and 23rd, this is the day we're updating everything. All right. So for more AI news and for everything that you need, make sure you go to YourEverydayAI.com. All right, y'all. Let's get into it and let's go over these new updates. All right.
Starting point is 00:05:42 And I think they're pretty noteworthy. All right. So this kind of was out of nowhere, right? So now we have computer use from Anthropic's Claude, as well as new models, 3.5 Sonnet new and Claude 3.5 Haiku. All right. So if you are brand new to Anthropic's Claude, maybe you haven't heard of it or maybe you haven't used it a lot. I'm sure if you're a long time listener, you obviously know Anthropic Claude, but it is, I'd say, ChatGPT or OpenAI's biggest competitor, right? So you have the tech titans that, you know, have kind of developed their own AI models, their own large language models, such as Microsoft, Google, Amazon, and Meta. And then you essentially have the two startup titans, right, in OpenAI and Anthropic. And there's pros and cons to the Anthropic Claude model.
Starting point is 00:06:42 So, you know, I'll just say kind of right off the bat what it's good for. Anthropic Claude is great at coding. It's also great at producing really human-sounding language without much training, right? Just out of the box, kind of how I talked, how I talk about it. Right. So those are two things that people really love it for. I think ChatGPT is ultimately way better at written content, just not out of the box. You got to work with it a little bit.
Starting point is 00:07:11 And, you know, I'm a former, former journalist. So I feel I can judge those types of things. But, you know, if you're just going in there and you're not great at prompt engineering or maybe you haven't taken our free Prime Prompt Polish course, Claude is great at just producing more human-sounding content, and it's got a good context window so you can work with a lot of content. So that's kind of Claude in a nutshell. The downside is it's not connected to the internet, which stinks, which is why I would never recommend a business use this as an enterprise on the front end.
Starting point is 00:07:43 And then like we talk about, there's always a front end of how you use a tool and then a back end, you know, by using it via an API. So, you know, your company can obviously tap into these models and probably a lot of the software that your company maybe uses, right? Maybe your CRM or your, you know, I don't know, all those alphabet soups of all the different, you know, software companies use. A lot of them probably either use the Claude back end or the ChatGPT back end. So this really does, these updates, I think, will change what's even possible for your business or all the tools that you use. So real quick about the naming, maybe this is a small bone to pick, but kind of confusing. All right. So essentially, Anthropic has three different tiers.
Starting point is 00:08:30 All right. So Haiku is their kind of smallest and cheapest model, right? So if you're paying the $20 a month, you know, you don't really have to worry about the price. So when I'm talking about price, that is if you're a developer or your company is using it on the back end. So, you know, you essentially have three different tiers. The smallest, fastest, cheapest is Claude Haiku. The middle one is Claude Sonnet. And then the big boy is Claude Opus. So you'll notice there's actually now, three different release cycles even within there. So as an example, Claude Sonnet 3.5 already existed. This was months ago. So they didn't change it to, you know, 3.6 or 3.5 something. It's just 3.5 new,
Starting point is 00:09:14 which is a little confusing if you ask me. Now, Haiku got the 3.5 treatment. So it was on three before. All right. And then you have Claude Opus, which is the biggest model, which hasn't been updated yet. So I'm sure that there's a strategy there,
Starting point is 00:09:29 you know, Anthropic might be waiting for a, you know, GPT-4.5 or a GPT-5 to really update its big Claude Opus model. So a little confusing that actually, even though there's three tiers of Anthropic Claude, it's actually on three separate release cycles, right? Claude Sonnet has been updated twice since three. Now, Haiku has been updated once, and Opus has not been updated. So a little confusing where they're at in their product cycle, but hopefully that's a good, you know, three-minute recap to bring everyone up to speed on Anthropic Claude. All right.
Starting point is 00:10:09 So like I talked about, let's go over now some of the details. So, you know, for our podcast audience, I'm going to be sharing a video later, but everything else, don't worry. I'm kind of reading some of the releases from Anthropic. So essentially here, like we talked about, the model updates are: now you have Claude 3.5 Sonnet new. I actually tweeted at Anthropic yesterday. They weren't labeling anything, so it was hard to see if this was actually the new model. And then about an hour later, now it says new on it.
Starting point is 00:10:44 So at least you know, right, okay, this has been updated and it's available now. And then you have the Claude 3.5 Haiku. And then computer use. So we're going to talk a little bit about the 3.5 updates first, and then we will jump into computer use, but to whet your appetite a little bit, computer use is this. So right now,
Starting point is 00:11:05 it's kind of only available via the API, although there's kind of now a virtual environment that you can set up and test this out, but you have to do it via the API anyways, right? So you're going to be paying for usage, even if you or your company want to try out this new computer use. But kind of the computer use, 101 is, well, number one, it's available now. It's in beta now available, right? So your company could
Starting point is 00:11:33 build off this or you could try it out for yourselves. And here's the thing. It is literally controlling a virtual machine. All right. So in the same way that humans use a computer, you can type to Claude in natural language and it will use the computer based on what you say. Right. So similarly to RPA, robotic process automation, which has been around for, you know, pretty popular for a decade, decade and a half, right? But the difference with something like this, computer use, and something like RPA, well, RPA is a steep, steep uphill climb. It's very rules-based and it's just specific tasks, right? So there's all these, you know, browser extensions where you can kind of record, you know, things that you type. But they're all very limited and rules-based,
Starting point is 00:12:28 whereas this new computer use from Anthropic, right, the big thing there is you're working essentially with unstructured data, right, where a lot of these tools, Chrome extensions, RPA, you know, for lack of a better comparison, you're kind of working with structured data, right? You're working with bits and bytes. You're working with very, very, very specific rules in narrow use cases, whereas by working via Claude and the computer use, it's kind of like, okay, this revelation of working with unstructured data, right? Being able to just type something in the kind of virtual computer and your, you know, kind of AI agent is just going to figure it out on their virtual machine and, you know,
Starting point is 00:13:10 browse the web, open applications, right? That's the other thing is, yes, there's a lot of Chrome extensions and other programs that do this just within a website. But with computer use, it can do it on a whole computer, right? It can open a terminal. It can open a spreadsheet. It can open a notepad, right? It can open different programs, you know, close them, you know, toggle between programs, right?
Starting point is 00:13:35 Which right now you really can't do, especially with natural language, right? Which is, again, I think a good comparison for this computer use is just how you would talk to a human, right? If you were training someone new on your team, if you were training an intern, That's kind of what we have with computer use. Anthropic did warn, though, and my gosh, this is true. They are saying it is experimental at times cumbersome and error prone. Well, at least Anthropic said that because that is the truth, you know, both in my own very short tinkering with this and watching other people's demos.
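The control flow described above (look at the screen, decide on one action, carry it out, repeat) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `FakeDesktop`, `scripted_model`, and `run_agent` are hypothetical names invented here, the "model" is a hard-coded script rather than a call to Claude, and nothing below is Anthropic's actual API or implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for the virtual machine; it only records actions."""
    open_apps: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real setup would return an image; here it is a text summary.
        return f"open apps: {', '.join(self.open_apps) or 'none'}"

    def execute(self, action: dict) -> None:
        self.log.append(action)
        if action["type"] == "open_app":
            self.open_apps.append(action["name"])


def scripted_model(instruction: str, screen: str, step: int) -> dict:
    """Hypothetical placeholder for the LLM: returns the next action to take."""
    plan = [
        {"type": "open_app", "name": "browser"},
        {"type": "open_app", "name": "spreadsheet"},
        {"type": "done"},
    ]
    return plan[min(step, len(plan) - 1)]


def run_agent(instruction, desktop, model=scripted_model, max_steps=10):
    """Observe the screen, ask the model for one action, act, and repeat."""
    for step in range(max_steps):
        action = model(instruction, desktop.screenshot(), step)
        if action["type"] == "done":
            return step  # number of actions actually executed
        desktop.execute(action)
    return max_steps  # safety cap so a confused model cannot loop forever


desktop = FakeDesktop()
steps = run_agent("Find this week's numbers and put them in a spreadsheet", desktop)
print(steps, desktop.open_apps)
```

The `max_steps` cap is a design choice worth keeping in any real version, given how error prone the beta is said to be: it bounds both runaway behavior and API spend.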
Starting point is 00:14:15 It is extremely error prone, right? So we're going to show you here in a little bit a demo. But if you think that you're going to be able to reproduce this demo for yourself or your company, anytime soon, I would say probably not. All right. So now let's get back and look at some of the specs of this new model. And hey, live stream audience, I always appreciate you guys tuning in, right? So Jackie said she's late this morning.
Starting point is 00:14:41 Don't worry, Jackie, I'm not a teacher. You're not going to get penalized for being late to class. And everyone else, you know, Michael is. saying here, I think it's the first clear sign of the AI agent revolution. Yeah, similarly, right, on the show yesterday, we went over Microsoft's autonomous AI agents, which should be out next month, right? But with this Claude, it's out now, right? Very, very different in terms of capabilities and features and functionality, right? But yeah, we'd love to hear from you guys, live stream audience, if you're interested in this computer use, if we should tackle it more
Starting point is 00:15:16 in depth in the future. But let's just go look at the models a little bit more. So first of all, you have to look at pricing, right? And so again, this is the API, all right? And this is what I think Anthropic is hoping to be its differentiator. Because when you are using either the Claude 3.5 Sonnet or the Claude 3.5 Haiku API, you are also getting access to this computer use, right? So now even, you know, kind of comparing the different prices,
Starting point is 00:15:56 it's kind of apples and banana chips here, if I'm being honest. And this kind of tells me that Anthropic is not trying to compete on price. It seems to me that they are really trying to compete on features. when it comes to their API, right? And even, y'all, I know the, you know, you might not care about a company's API, right? You're just logging in, you know, you're logging into your Claude account on the front end or your chat GPT account on the front end. And I get that, right? But whether you know it or not, thousands, thousands of pieces of enterprise software use an API from one of the three big,
Starting point is 00:16:42 companies, right? OpenAI, Anthropic, or Google. So you can say, oh, these updates, they don't really affect me. Yeah, yeah, they do, right? I wouldn't be surprised if a tool that your company uses daily might start integrating some of these features, right? So I want you to think about that. Yeah, we're going to quickly go over here the numbers for the dorks, right? But this does impact you. All right. So let's talk about Claude 3.5 Sonnet. And we're going to compare this to the pricing for GPT-4o. So the pricing for Sonnet is $3 per million tokens of input and then $15 per million tokens output. Comparatively, it's $2.50 and $10 respectively for GPT-4o. So much cheaper, right, for GPT-4o, but it doesn't have the computer use capability, right?
Starting point is 00:17:44 Obviously, the GPT-4o API, whether you, for your company, right, your company can go in and, you know, use these APIs today, start fine-tuning models. They're very, very affordable, right? But like I said, also all of the programs that your company probably uses are going to start integrating these features, right? So I think Anthropic is really trying to separate itself by this computer use. But then, you know, GPT-4o, or, you know, you can just say OpenAI's APIs, they have access to a lot of features and functionalities that Anthropic's API doesn't have access to, like real time, right? Like the real-time voice, right, your company can use that inside of OpenAI's API.
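To make the per-million-token prices concrete, here is a small Python sketch that prices one workload at the rates quoted in the episode, reading the GPT-4o input figure as $2.50. The dictionary keys are just labels for this example, not official API model IDs, and the workload size is a made-up assumption.

```python
# Per-million-token list prices as quoted in the episode (USD, October 2024).
# Keys are labels for this example only, not real API model identifiers.
PRICES = {
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the quoted per-million rates."""
    rate = PRICES[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 2_000_000, 500_000):.2f}")
```

For this assumed workload the gap comes mostly from output tokens, where the quoted rates differ by $5 per million versus $0.50 per million on input.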
Starting point is 00:18:27 All right. So it doesn't look like Claude is trying to compete on price. All right. Let's look at benchmarks. You know, we always got to talk benchmarks. And again, benchmarks are different tests, right? Think of them as, you know, there's, you know, maybe a handful or 10 different, you know, standardized tests that you might take, you know, through grade school, high school, college, et cetera. Large language models similarly have those tests and have benchmarks. And there's different prompting methodologies. But, you know, usually this is a good apples to apples comparison.
Starting point is 00:19:04 So, Claude 3.5 Sonnet new is, you know, doing very well in benchmarks, at least compared to GPT-4o and Gemini 1.5 Pro, right? So to its two closest competitors in OpenAI's most capable general model, right? We're not talking about the o1. More on that here in a second. And Gemini 1.5 Pro from Google. So in terms of most benchmarks, Claude 3.5 Sonnet new is doing the best, right? But it did look like Anthropic did some cherry picking on this, right?
Starting point is 00:19:44 In terms of the benchmarks they decided to include versus the ones that they did not, which is interesting, you know. So one thing, I would say MMLU, we talked about that benchmark. I'll say that has been the most standardized, the most used, at least in, you know, 2018 through 2022. That was kind of the gold standard benchmark. And they decided not to include that one, which is interesting, because that's one that the MMLU, which is essentially, it really tests general knowledge across, I think it's
Starting point is 00:20:18 57 different subjects. So they didn't include the MMLU. They decided to include the MMLU Pro, which is a little bit different. And so I thought that one was interesting because there is no benchmark for GPT-4o on the MMLU Pro. But with coding, this is one area, right, where I think Claude really shines. And in the HumanEval benchmark it blew everyone out of the water, right? Even the GPT-4o model that had a 90.2 percent. Now Claude 3.5 Sonnet new got to 93.7, which
Starting point is 00:20:55 is a pretty sizable jump there. So, you know, in terms of benchmarks, we have most of them here. And Claude 3.5 Sonnet did a pretty good job, at least on the ones that Anthropic decided, like I said, a little bit of cherry picking here. Their new model kind of, you know, wiped the floor in every single one except in math. All right. So math problem solving, Gemini 1.5 Pro leaps and bounds ahead of everyone else.
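One way to judge how sizable the HumanEval jump is: look at the failure rate as well as the pass rate. The Python sketch below uses only the two scores quoted in the episode (90.2 and 93.7); the function name and the choice of metrics are assumptions made for illustration, not anything from Anthropic's benchmark table.

```python
def compare_pass_rates(old_pct: float, new_pct: float) -> dict:
    """Compare two benchmark pass rates given in percent."""
    old_fail = 100.0 - old_pct
    new_fail = 100.0 - new_pct
    return {
        # Absolute gain in percentage points.
        "point_gain": round(new_pct - old_pct, 1),
        # Share of problems each model still fails.
        "old_fail_pct": round(old_fail, 1),
        "new_fail_pct": round(new_fail, 1),
        # How much of the remaining failures were eliminated.
        "relative_fail_reduction_pct": round(100.0 * (old_fail - new_fail) / old_fail, 1),
    }


print(compare_pass_rates(90.2, 93.7))
```

On these figures the gain is 3.5 points, but failures drop from 9.8% to 6.3%, which is roughly a 36% cut in remaining errors. That framing is one reason small gains near the top of a benchmark can still matter.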
Starting point is 00:21:25 But look at this at the very bottom. I found this interesting. And I kind of have a little bit of a bone to pick with Anthropic here because you can't have your cake and eat it too and have it be calorie free and then cake shame people. All right. So at the bottom, they said our evaluation tables exclude OpenAI's o1 model family as they depend on extensive pre-response computation time, unlike typical models. This fundamental difference makes performance comparisons difficult. Okay. So if they were to keep it at that and, you know, set those rules themselves,
Starting point is 00:22:05 I would respect that and say, okay, good. Yeah, that's fine. You know, I do think OpenAI's o1 model is kind of in a class of its own, right? That is, you know, previously called Q*, then Project Strawberry. Now the o1 model, right? We don't even have access to the full model. It's just o1-preview and o1-mini. It is a different model, right?
Starting point is 00:22:27 It uses kind of this chain of thought reasoning, which, you know, requires more computation, more time, and presumably higher scores on all these benchmarks, right? So Anthropic at the bottom of their benchmark sheet, you know, kind of says, oh, we're not comparing to o1 because it's a different model. All right. Fair. But interesting, Anthropic seems a little, I don't know, going against your own rules here. Because on their announcement blog post, they're parading the fact that Claude 3.5 Sonnet essentially has a SWE-bench Verified score of 49.0.
Starting point is 00:23:12 And then it says scoring higher than all publicly available models, including reasoning models like OpenAI o1-preview. So I'm like, okay, you can't flaunt like, oh, yeah, we're better than o1-preview. But then when it comes to all of the other benchmarks, that o1-preview is light years ahead of everyone else. You can't just say, oh, okay, well, we're going to cherry pick all these, you know, these spots where our model is better than o1. But then we're not going to put it on the benchmarks because that would look bad, right? Because then all those, you know, green scores in your column, right? All the little green things here that are lit up, right?
Starting point is 00:23:53 They wouldn't be green anymore. They would be green for OpenAI's o1 model. So a little bone to pick with Anthropic's approach there. You can set the rules, sure, but you got to play by your own rules. You can't not include the world's most capable and powerful model, right? You can't just exclude it from a chart and then cherry pick certain benchmarks and say, oh, yeah, or certain capabilities and say, oh, yeah. Yeah, we're way better.
Starting point is 00:24:20 Okay, well, either compare it or don't, all right? A little bone to pick. I didn't, I didn't have a full hot take Tuesday this week. So I got, I got a couple spicy takes left in the tank. All right. And here's, and here's another thing, right? And I want to talk on both sides of this because the new 3.5 Sonnet model, I think was actually pretty impressive, right? I did a review of it yesterday, and I shared this in our newsletter and on our YouTube channel.
Starting point is 00:24:56 Yeah, we cover AI as it happens. You know, it's not just the podcast. So, you know, thank you to all our podcast listeners for always tuning in. But we did a kind of video review of Sonnet 3.5 new on our YouTube channel. And then we shared that in our newsletter. And I did notice, even when not directed to, 3.5
Starting point is 00:25:27 Sonnet often defaults to this chain of thought thinking, to some chain of thought reasoning, right? So actually, under the hood, it is acting much more like OpenAI's o1 model than the previous version of Sonnet. So it's like, okay, that's a good thing. Don't get me wrong, because I think that means that responses will generally be better. But clearly, the smart researchers, which again, I think is a good move, right? They've kind of taken this methodology or approach in tuning this new 3.5 Sonnet new to act with a little more chain of thought, right? Thinking through things step by step. And instead of the model, you know, just, you know, next token prediction, you know, vomit all at once.
Starting point is 00:26:58 It takes it a little slower, thinks a little methodically, and goes through things step by step. So again, not to, you know, poke at Anthropic here, but I'm kind of my, my BS meter is a little
Starting point is 00:27:44 high on this whole, right, benchmark table, because that's always the first, you know, thing people talk about when a new model comes out. Because if your model is not the most powerful on benchmarks, if you are building, because here's the thing, it's just billions of dollars of investment. And the benchmark table is everything when it first comes out. So I have a bone to pick with Anthropic that, okay, well, your new model is kind of using some chain of thought reasoning by default. You exclude the world's most powerful model from your benchmark table, and then yet this is kind of how your model is operating on queries that require a little bit of logic, which again, I like, but I don't like how they treated it.
Starting point is 00:28:33 So let's just take a look. So I have a very unofficial rubric of sorts. I have a similar set of about 12 questions when I do live model comparisons, right? I should probably formalize it a little bit more and do some scores or something like that. But I have a set of questions that I generally ask models. And I've seen, you know, I've been doing the same set of questions for, you know, about a year and a half almost since the very beginning since, you know, GPT-4 and Claude 3.3. And I've noticed that Claude 3.5 Sonnet really responded in a different way. So 3.5 Sonnet new responded in a different way than normal 3.5 Sonnet.
Starting point is 00:29:15 Again, why couldn't we just call it 3.6? Anyways, it used chain of thought. So I have one prompt or one kind of trick question that I do a lot. It's I just woke up today with six apples and three bananas. Yesterday I ate a banana and two apples. This morning, I will eat one apple and no bananas. However, I don't really like apples and one banana may turn brown tomorrow, assuming nothing else changes.
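The arithmetic behind the trick question above can be worked in a couple of lines of Python. This sketch assumes the natural reading of the prompt: the six apples and three bananas are this morning's counts, so yesterday's eating is already reflected, and taste preferences or a banana browning tomorrow do not change tonight's totals. The function name is made up for illustration.

```python
def fruit_tonight(apples_now, bananas_now, apples_eaten_today, bananas_eaten_today):
    """Only today's counts and today's eating affect tonight's totals."""
    return apples_now - apples_eaten_today, bananas_now - bananas_eaten_today


# Relevant facts: woke up with 6 apples and 3 bananas; will eat 1 apple, 0 bananas.
# Distractors ignored: yesterday's eating, disliking apples, a banana browning tomorrow.
apples, bananas = fruit_tonight(6, 3, 1, 0)
print(apples, bananas)  # 5 3
```

Under that reading the answer is five apples and three bananas. A common mistake is to subtract yesterday's fruit again.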
Starting point is 00:29:42 How many apples and bananas will I have tonight? So there's a lot of irrelevant information in there that I put in there specifically to trick models. All right. So Claude 3.5 Sonnet new, to its credit, you can see here on my screen, it said, let me solve this step by step. So I did this same prompt about eight times, all in new, all in fresh windows, right? And it took this chain of thought, step by step approach each time, which again, I like it. But then you can't separate, you can't try and separate yourself and you can't use this same language, right, of excluding OpenAI's o1 model as they depend on extensive pre-response computation time. Let's look at extensive pre-response computation time.
Starting point is 00:30:35 Why are models like 3.5 Sonnet more expensive than Haiku? Because of its computation power. It's more powerful. So let's look at time. So I did that. I timed this response. So it got it done in 6.5 seconds. It got it wrong, by the way, right? It didn't get the correct answer.
Starting point is 00:30:56 So Claude has still never gotten that one correct, at least in my unofficial testing. GPT-4o gets it right. Obviously, the o1 model gets it right. So it got it wrong in 6.5 seconds doing chain of thought, doing kind of that more computational time, time intensive reasoning, right, which it didn't do before in the normal, it didn't do as consistently before in the kind of quote unquote normal Sonnet 3.5. But then o1-mini. So not even the o1-preview, right? So the smaller o1 model got this correct and it got it done. OpenAI, sorry,
Starting point is 00:31:41 o1-mini, right, this new reasoning model that you can click and you can see how it thinks. It said it got it done in five seconds. I timed it. It was actually 7.4 seconds. But y'all, 6.5 seconds, Claude got it wrong. 7.4 seconds, o1-mini got it right. So, but both used kind of step by step chain of thought reasoning.
Starting point is 00:32:06 So again, I know we're talking intricacies here, but I don't think. you can overlook this omission that Claude, Anthropic, decided not to include the O-1 model because presumably the O-1 model, right, they wouldn't have been able to have all those green marks, right? Because you've raised billions of dollars. You have large enterprise companies. If you introduce a new model and you run benchmarks,
Starting point is 00:32:34 you better be almost all green. Otherwise, you know, your investors are going to panic. Your customers are going to jump ship, because that is a signal that, hey, even though we have, you know, billions of dollars in funding and, you know, we have unlimited compute and the top researchers in the world, if you can't get green scores, you're screwed. So not a huge fan of what Anthropic did there. I am a fan of how they updated their model. But, hey, if you're going to essentially make Sonnet 3.5 and force it to do some chain of thought reasoning, love it. But you've got to benchmark against it.
Starting point is 00:33:13 All right. New model availability. 3.5 Sonnet is available now to all users. And 3.5 Haiku will be released later this month. So that, you know, you're really going to only use Haiku if you're using it on the API because it is much cheaper. If you're, you know, a front-end Claude user, you will obviously be using 3.5 Sonnet new. All right.
Starting point is 00:33:38 So let's now look at computer use live. All right. So we're going to do this one kind of quickly here. And live stream audience, if you could, let me know if you can hear this. All right. Sometimes the audio sharing works. Sometimes it doesn't. All right. So let me know if you can hear this. I'm going to do a quick, quick little. I'm going to show you a simple example of computer use today. My friend's coming to San Francisco next week and I want to take him to do some touristy stuff. All right. So live stream audience, let me know if you could hear that. And then I'll kind of explain to you what's going on. So this is a video that Anthropic released. And one thing that it says here before it gets started, this is the computer use demo. And it says, Claude is generating all the computer actions shown here. This demonstration was recorded in a controlled environment, right?
Starting point is 00:34:30 So that means, yeah, it's kind of cherry picking a little bit, which is fine, with some supporting infrastructure simplified to highlight the core capabilities. All right. Thanks, thanks, live stream audience. You guys said you can hear. All right. So I'm going to let this play. It's about two minutes long.
Starting point is 00:34:47 So essentially in Claude, a researcher here is talking through this new computer use. Let's go ahead and watch and listen and podcast audience. This one's simple enough. They released, I think, three or four different demos of computer use. I think this is the easiest one to just listen to and understand what's going on. I'm going to show you a simple example of computer use today. My friend's coming to San Francisco next week, and I want to take him to do some touristy
Starting point is 00:35:15 stuff. I think doing a sunrise hike with a view of the Golden Gate Bridge never gets old. So I'll ask Claude to figure out some logistics for us. I'll ask Claude to find a good place to see the sunrise to help me figure out timing logistics and help drop a calendar invite so I remember when I have to leave. It's opening Chrome, going to Google, searching, and it looks like it's found something. All right, real quick, let me just kind of describe what's going on here. So you essentially have a local instance of Claude running. And then it says, right, it's even giving kind of like coordinates, right? So it says mouse moved, you know, moved to 470, 472, 682, left click, type. Best place to watch sunrise over Golden Gate Bridge, right? So essentially launched a browser,
Starting point is 00:36:09 went to Google and put in that query, but you can literally see each and every action that this computer use is performing. So I just wanted to call that out. And different things are then happening on the screen, but very much like a human would do. All right. So let's watch the rest of this. And I won't interrupt now until the end.
Starting point is 00:36:28 And it looks like it's found something. Great. So how far away is the location from my place? It's opening maps, searching for the distance between my area and the hiking location. So now it looks like Claude is searching for the sunrise time tomorrow and is now dropping it into my calendar and populating it with some details.
Starting point is 00:37:29 And great, it looks like Claude did it. This is a simple example, but we're sharing computer use early to learn from what people build. All right. So hopefully, hopefully that kind of made sense what was going on to our podcast audience. But let me just try to quickly explain it if you didn't see every single step. So in this use case, doing some research, right? Best place to watch the sunrise. It launched a browser, went to a couple of different websites, searched Google, opened new tabs, found the time of the sunrise, right? It said 7:22. And then it created
Starting point is 00:38:19 opened a calendar, created a calendar event, right, all that information. So think of this through the normal business tasks that you do day to day, right? There's probably a lot of this, a lot of research, right? Every day before I start the show, I have different websites that I read. I have, you know, kind of some dorky, like, Boolean search URLs saved.
Starting point is 00:38:47 And I go through and I read them and I look for certain things. I open up different web pages. I quickly browse them, right? To read you guys the AI news every day, right? I still keep a human in the loop. So in theory, I could program
Starting point is 00:39:01 Claude and this new computer use to do these things because it can work on any application on a computer, right? So maybe I need to do that but put information in a spreadsheet. Maybe I need to create, like I said, a calendar event because this functionality is technically not new unless you're a dork like me or, you know, if you're in IT, you know, like I talked about, you know RPA, there's kind of a lot of browser extensions that can, you know, you can record different actions within a browser and then have it run that. But like I said, the rules are very rigid.
Starting point is 00:39:46 It's just for very narrow tasks and everything has to kind of happen within the browser. So like I said, it's a little more technical and it's kind of the equivalent of, you know, having to code something because if you get one little thing wrong in these Chrome extensions or with robotic process automation software, it's not going to work. Whereas with Claude computer use, you're using natural language. So in demos, it's kind of cool to see. If Claude doesn't, if it runs into an obstacle, it will try something else. Whereas normally, right, if you get one little thing wrong by using more traditional methods,
Starting point is 00:40:25 you're screwed. It's not going to work, right? That's why I really think this is kind of a line in the sand, right, if we think of structured data versus unstructured data, right? All these other tools and processes before that have been available are structured data essentially. So, you know, I'm just making a comparison there, but it's like you have to have every single thing right, like code, right? extra space, you know, something spelled wrong, nothing works, right? This is computer use. It's natural language. So you don't have to, you know, as an example, have every single thing right.
Starting point is 00:41:05 You can have misspellings. You can just, you know, have simple instructions. And, you know, presumably the best case scenario is this new Claude computer use will figure it out for you and complete those tasks autonomously. So there are a lot of guardrails for this, right, which is good. You know, like you can't have it, you know, create social media accounts. It's, it's, you know, so there's certain types of actions that there's guardrails against right now, which I think is a good thing.
Starting point is 00:41:42 Were you guys, were you guys impressed with that or not? Tara said amazing. I don't know. Like, parts of me are like, wow. And then parts of me are like, okay, well, you know, obviously we saw the, you know, Claude or sorry, Anthropic only released the most impressive of use cases. And they kind of said, like, this is a very controlled environment. You know, if you go play with a demo and, you know, Michael said would love to see more info
Starting point is 00:42:12 on setting up computer use. Is that too dorky for the show? It's definitely dorky. So, you know, long story short, you have to use an API. You have to run like a Docker. So it's super dorky. We could do that. But, you know, I am curious.
Starting point is 00:42:29 And for our, you know, podcast audience, too, I know it's a little more difficult, kind of only hearing the audio. But again, just think of those tasks that you do every single day, right? You're in the browser and then you're in, you know, maybe your Outlook mail program, and then you're in Microsoft Word or Excel, right? And then you're maybe, I don't know, in the terminal on your computer, right? So being able to have general use over a computer with natural language, I think the ceiling is very high.
Starting point is 00:43:07 The floor seems pretty low. Like, I don't think that this tech is ready for the big time right now. But that's just me. All right. So let's go over current availability. So the computer use feature is available now, but only via the API. So you can't like, you know, I think and assume that essentially Claude will have a desktop app soon, similarly to how OpenAI has a desktop app for ChatGPT. And then you'll be able to run it that way. But for now, you can't, you got to, you know, you got to be a little bit of a dork. You got to run the API. You can do it just through, obviously, Anthropic's API or you can use third-party services like Amazon Bedrock,
Starting point is 00:43:51 Google Cloud's Vertex AI platform as an example. So really they are making it accessible through Amazon and Google if your company, you know, uses those platforms, which so many people are either using Amazon Bedrock or Google Cloud's Vertex. So right now the implementation. So developers must create an action execution layer. Okay, so what that means, and this is kind of breaking down how this actually works. So it does it by screenshots, right? So this computer use, it just takes screenshots. And then it kind of
Starting point is 00:44:30 tells the computer use program, okay, based on this screenshot. So it is using kind of traditional, you know, computer vision. And it's like, okay, I took this screenshot. So now I have to, you know, maneuver my mouse to click this. There's problems with that, right? Sometimes, and I saw this in a lot of demos, sometimes websites are a little slow to load. So it'll get caught for a couple of seconds and then it'll have to take a screenshot again. But at least it has that like self-correcting approach, right? Or as an example, maybe you have a pop-up window on a website, which so many websites now do, right? Maybe the first screenshot that computer use takes doesn't have that pop-up. So then it tries to go perform an action, but there's a pop-up blocking it or a cookies
Starting point is 00:45:18 notification. So it is working through a set of screenshots. And then it sends the screenshot. It uses computer vision to analyze it. And then it essentially controls a cursor and a mouse click depending on the coordinates of the screenshot. So that's kind of how it works, and it essentially translates these prompts into commands,
Starting point is 00:45:40 and you're executing these actions via the API. So I'll give you the hot take right now. Should you try this, right? I don't know. It's super buggy. I did it a little bit. There's vary right now. So you can, the easiest way to access it, essentially Anthropic set up a virtual environment, right?
Starting point is 00:45:59 essentially Anthropics set up a virtual environment, right? So not doing it on your own computer. like I said, right now when it's in beta, the guardrails are, I'd say extreme, but that's not a bad thing. I think it's good that there's so many guardrails, right? So even in this virtual environment, right, you're not going to be uploading your company's files, right? You're essentially, think of it like this. It's on like a dummy desktop, right? Like if you go into Best Buy or something and all those computers there, right, those computers are in essentially demo mode.
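For readers following along in text, the screenshot-driven loop described a moment ago (take a screenshot, decide on one action, execute it, then look again) can be sketched roughly as below. This is only an illustrative toy, not Anthropic's implementation: the names `Action`, `FakeScreen`, `stub_model`, and `run_loop` are hypothetical, the "screenshots" are plain strings rather than images, the click coordinates are made up, and no real API is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One step the (hypothetical) model asks the execution layer to perform."""
    kind: str                              # "click", "type", or "done"
    xy: Optional[Tuple[int, int]] = None   # illustrative screen coordinates
    text: str = ""


class FakeScreen:
    """Toy stand-in for a desktop: a cookie pop-up covering a search box."""

    def __init__(self) -> None:
        self.popup_open = True
        self.query = ""

    def screenshot(self) -> str:
        # A real system would return image data; a string keeps the toy simple.
        if self.popup_open:
            return "cookie popup"
        return f"search box: {self.query!r}"

    def execute(self, action: Action) -> None:
        # The "action execution layer": turns an Action into a state change.
        if action.kind == "click":
            self.popup_open = False
        elif action.kind == "type":
            self.query = action.text


def stub_model(shot: str, history: List[Action]) -> Action:
    """Stand-in for the model: picks the next action from the latest screenshot."""
    if "popup" in shot:
        return Action("click", xy=(470, 682))   # dismiss whatever is in the way
    if shot == "search box: ''":
        return Action("type", text="best place to watch sunrise")
    return Action("done")


def run_loop(
    take_screenshot: Callable[[], str],
    choose_action: Callable[[str, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Screenshot, decide, act, repeat; stop on "done" or after max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        shot = take_screenshot()            # a fresh look at the screen each step
        action = choose_action(shot, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


if __name__ == "__main__":
    screen = FakeScreen()
    for step in run_loop(screen.screenshot, stub_model, screen.execute):
        print(step)
```

Because a fresh screenshot is taken before every action, the pop-up case mentioned above resolves itself in this toy: the first screenshot shows the pop-up, so the stub clicks it away before typing the query. It also hints at why usage adds up quickly, since every step sends another screenshot to be analyzed.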
Starting point is 00:46:32 They're very limited to what you can and can't do. So the same thing. When you're working on this virtual environment, it is very limited. And it's heavily throttled. My gosh. So you do have to use your own API. So what that means is there's different tiers. So most people, unless you're a company that has heavy usage,
Starting point is 00:46:52 you're probably only going to be a tier one or a tier two for Anthropic, which means because it uses this screenshot and computer vision, it does eat through a lot of tokens pretty quickly, which is why I think this is Anthropic's, you know, one of their kind of competitive advantages or what they hope they will, because it's going to be expensive to use via the API because you're essentially having to process a lot of information via these nonstop screenshots and using the computer vision. So if you do want to try this out, unless your company is on like a tier three or higher, you're going to find out pretty quickly
Starting point is 00:47:37 So should you try this right now? I don't know, probably not. It's pretty buggy. If I'm being honest, I think it's just worth, you know, going to watch demos online and waiting until hopefully Anthropic does just release this via desktop until it gets better. But right now, it is buggy. But that's okay.
Starting point is 00:47:54 That's not a knock on Anthropic because it is a groundbreaking technology. You can't overlook that they are the first to do this. Yes, you know, we have Salesforce Agentforce, but you can't just program Agentforce to use your computer. It's only working within Salesforce. Microsoft, which we talked about on the show yesterday, very, very impressive Copilot Studio and autonomous AI agents. Similarly, you can only work within the Microsoft suite of products. You can't just work on desktop, browser, you know, notes on your computer, et cetera.
Starting point is 00:48:34 So this is a truly groundbreaking piece of technology from Anthropic. It's just not that good. But Anthropic didn't say, right, like, hey, this is ready to take over the world. They did say, admittedly, you got to give them credit for that. They're like, yo, it's buggy, right? This is a new technology. It's not necessarily production ready. But I think it probably will be soon because I think it's going to gather a lot of attention.
Starting point is 00:49:00 So like I said, do we even need this? Do we need this, right? Like, isn't there a reason user interfaces exist? Do we need to be programming a virtual computer to do all of our tasks? I'd say as it's set up now, absolutely not. This is not a good way to work. But I think when you look at this new feature from Anthropic, you can't judge it on face value on what it can do today. I think you really have to think of what it means for
Starting point is 00:49:30 your business or your company and the type of work that you do. I think more than anything else, this gives leadership teams time to think and time to plan and time to explore as well on how this technology, once it is a little better, right? It might be a couple days, it might be a couple quarters. We don't know. But what we do know is, well, OpenAI hasn't yet arrived to the autonomous party, which they will be, right? There's reports going all the way back to last year saying that OpenAI was investing heavily
Starting point is 00:50:05 in agentic AI, right? Or in AI that can execute tasks for you like we just saw. So one thing that you have to give Anthropic credit for, they shipped, right? A lot of times, even OpenAI, Microsoft, Google, right? They announce things and they might, you know, kind of roll it out a little bit, very limited access, waitlists. Anthropic just shipped it, admittedly saying, hey, it's not ready for the prime time, but it's available in beta today, right? So you have to give Anthropic credit. It shipped, but I'll say overall, even like recapping everything, the 3.5 models, they're improved,
Starting point is 00:50:49 but they're good, but I'll say they're not fantastic, right? I don't think, you know, companies are going to be, you know, looking to switch from GPT-4o from OpenAI to 3.5 Sonnet or 3.5 Haiku anytime soon, probably not because the GPT-4o model, I think, is significantly better. Even on those benchmarks, they didn't, you know, they kind of selectively chose benchmarks that their new models, you know, exceeded, you know, all the other models. So in terms of the 3.5 updates, not super impressed. I think 3.5 Sonnet is much better than it was previously for coding.
Starting point is 00:51:24 There's no, there's no match. Let's put that out there right now. But for everything else as a general use model, it's a good update, nothing newsworthy. But the computer use is huge. And I don't think that people are going to give this enough credit for what it can be because I think people are going to be looking at what it is today. And what it is today, computer use, I think is a giant step forward for where AI technology can take us in the future of work.
Starting point is 00:51:58 So yes, it's buggy. Yes, it's extremely throttled. Yes, the guardrails are there. But I actually like the approach here from Anthropic. So we'll have to see if it actually pans out. But you know, we'll be covering it along the way. All right, I hope that was helpful, y'all. A deep dive into the new Sonnet 3.5 and Haiku 3.5 updates from Anthropic,
Starting point is 00:52:18 as well as the new groundbreaking computer use, right? AI can use computers now. Pretty exciting time. So if this was helpful, I hope it was, let me know. Hey, if you're still on the live stream, let me know if this was helpful. Do you want to see more of this? Should we do a more technical show going over computer use? I think it would be one of those that could run into a lot of glitches.
Starting point is 00:52:40 But let me know. Thank you for tuning in. If you haven't already, please go to YourEverydayAI.com. Sign up for that free daily newsletter. And more importantly, join us tomorrow. And every day for more Everyday AI. Thanks, y'all.
Starting point is 00:53:05 And that's a wrap for today's edition of Everyday AI.
Starting point is 00:53:41 Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.
