Everyday AI Podcast – An AI and ChatGPT Podcast - Ep 662: Opus 4.5: New king of the AI hill or just a niche model for coders?

Starting point is 00:00:00 This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio. Just describe what you want to create and the assistant handles the rest, orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome. The assistant accelerates execution. Is Google's reign on top of the AI throne already over?

Starting point is 00:00:52 Just a couple days after Google released Gemini 3 Pro, Anthropic quietly released what they called the best model in the world for certain tasks. And now the rest of us are left here wondering, what the heck happened? How can we keep up? And is Anthropic's new model, Claude Opus 4.5, the new king of the AI hill, or is it just a really good model for developers and people who work with data? Well, we're going to be diving in on today's episode of Everyday AI to find out. Let's get after it. What's going on, y'all? My name is Jordan Wilson. Welcome to Everyday AI. If you're new here, this is your daily live stream podcast and free daily newsletter,

Starting point is 00:01:40 helping everyday business leaders like you and me make sense of these AI updates. Yeah, apparently, you get the world's best model every other day. All right. So this is where we help you keep track of all those updates, leverage them and grow your company and your career. So if that's what you're trying to do, awesome. Starts here with the unedited, unscripted live stream podcast. But to take it to the next level, you've got to go to our website, the cheat code,

Starting point is 00:02:05 not in disguise, youreverydayai.com. Go sign up for the free daily newsletter. We're going to be recapping today's episode and a whole lot more, including the AI news for today. Who knows? By the time the newsletter's out in a couple of hours, maybe we'll have a completely new leader in the AI world. All right.

Starting point is 00:02:22 So let's get into it. And first, apologies, y'all. So this is our Putting AI to Work on Wednesdays series, where on Wednesdays, we do more of a practical look at new large language model updates or new features, usually from one of the big four: Anthropic, OpenAI, Google, or Microsoft. And on Monday, y'all, I said, hey, we're going to be taking a deeper look at Gemini 3. But then hours later, after our Monday show, Anthropic made this huge splash out of nowhere,

Starting point is 00:02:55 and now we have Opus 4.5. And according to some benchmarks, it is the best model in the world for specific tasks. So we'll still do another deep dive on Gemini 3. Don't worry, but today we've got to take a look at this new model from Anthropic, Opus 4.5. So on today's show, we're going to dive into those benchmarks from Opus 4.5, industry leading in many different categories. We're going to showcase three overlooked feature updates that Anthropic barely even mentioned. And we're going to explore three different use cases for Opus 4.5 across agentic research, file creation, and data analysis and coding.

Starting point is 00:03:37 All right. So let's just start there. Let's start at the end. All right. So live stream audience, let me know. Can you see my screen? Hopefully you can. So we're going to end up showing you three use cases.

Starting point is 00:03:52 We're only going to be able to do two of them live because I did some token math, and because of Anthropic's terrible rate limits, which, even though they just said that they updated them, are still terrible. We're only going to be able to demo two of these live, but I did run another one earlier that we'll be able to take a look at. All right. So we're going to get these started like any good cooking show. Let's get in the kitchen. So I'm going to describe a little bit what we're doing later on these, but let me just read the different prompts and let everyone know.

Starting point is 00:04:22 So podcast audience, this is probably not going to be a super visual episode, but I do always recommend just going to our website. Again, youreverydayai.com, and check out the video version of this. All right. So the first one here, again, I'm selecting Opus 4.5, which is now available for paid users. And my first task, we ran a couple of these last week in our kind of hands-on, initial hands-on with Gemini 3 Pro from Google. So the first one, I'm saying, read the last five newsletters at read.youreverydayai.com. So, yeah, you can actually go read every single newsletter we ever had on that website. And I'm saying, then find 10 recent AI news stories

Starting point is 00:05:03 that were not covered in those five issues. Research each story and outline how each would look as a podcast segment. Suggest episode title, main points, context background, and a short show outline, ground answers with links to sources. All right. So again, when I go over these use cases, I'm giving you actual use cases that we test different models on, and how we use different modes and features across the big large language models. It's a question I get all the time: Jordan,

Starting point is 00:05:32 how are you using AI? Well, here, I'm showing you. We're putting AI to work on Wednesdays. All right, so let's get that first one cooking, get it in the oven, 450 for 20 minutes. I don't know if it's actually going to be 450 for 20 minutes. We'll see. All right, next one that we're going to get going.

Starting point is 00:05:48 I am uploading three different documents from my podcast. So these are just different stats. We use Buzzsprout as our podcast host, so we get some different stats here. So what I'm saying is analyze my podcast stats. I've uploaded the files. Using Claude Artifacts, I'm going to talk more about artifacts later. It's always been one of my favorite modes across any large language model.

Starting point is 00:06:12 So I'm saying using Claude Artifacts show me the top 10 obvious trends, the top 10 hidden trends, 10 biggest growth opportunities, and 10 episode ideas I should plan for December 2025 based on recent trends. Using Claude Artifacts, build a dashboard that is extremely interactive, sortable, and filterable and useful as if it were a full stack high priced SaaS application. Again, using Opus 4.5. So the first prompt, more on the agentic research side. The second one, at least that we'll be able to run here live a little bit more on data

Starting point is 00:06:49 analysis, visualization, coding, etc. All right. We're going to let those cook in the background and let's get into more of the details now on what's new inside Claude Opus. Again, we had kind of heard some rumors and rants recently that Anthropic might release Claude. We actually, in our newsletter, it's the hard part of a daily newsletter, y'all. I remember hitting send, and I kid you not, it was seven minutes later. I go to my wife, oh, my gosh, Anthropic just released a new model.

Starting point is 00:07:22 And she's like, all right, didn't you just say this last week, like twice? There's a new best model in the world. Yes, I did. That's how quickly AI news happens. So there's kind of, we knew something was coming, didn't know it was coming this early. So here's kind of what's new and the biggest takeaways from Claude 4.5 opus. So I got to get in the right habit. Technically, Opus 4.5.

Starting point is 00:07:47 It used to be the number, then Opus, but it's Opus 4.5. I have it wrong on my little visuals here. So Opus 4.5 is Anthropic's top model. All right. So they have their three different tiers. All of them now are on the 4.5 variant. So you have your Haiku, which is your kind of fastest, but least intelligent. You have your Sonnet, which is kind of your middle of the road, middle intelligence, middle speed. And then you have your Opus, your most powerful, but usually a little bit slower and more expensive if you're using it on the API side. For the most part, aside from when we're talking about API pricing,

Starting point is 00:08:21 we're talking about using Claude on the front end or using Opus 4.5 on the front end. So that's when you go to claude.ai and you're using it as a front-end chatbot. So they did say that this is their new top model, achieving state-of-the-art results in coding and agentic tasks. The API price cut, huge. We're going to talk about that in a couple of slides here. And it did set a new benchmark on SWE-bench Verified, which is one of the, you know, if you are a developer, software engineer, coder, you know SWE-bench Verified. It is one of the more notable benchmarks that AI models go through, specifically when it comes to coding and completing different kinds of bug creation, or bug fixing, to put it in layman's terms. This also introduces an effort parameter

Starting point is 00:09:09 on the dev side to trade response thoroughness for token use and latency. So that's on the dev side. And then there's also some new enhanced agentic features. So tool search and context compaction, or the infinite chat, which improves long-running multi-step workflows. Benchmarks. We've got to talk about them, because even though Anthropic was kind of quiet with their announcement, right? Simple blog post, a couple tweets, right? Not the usual type that we're used to out of Silicon Valley, right? Big splashy live stream, you know, big production. Anthropic just kind of, you know, put out a couple videos and just a little blog post. But they were extremely splashy by calling this, again,

Starting point is 00:09:58 and I'm going to quote their words, the quote-unquote best model in the world for coding, agents, and computer use. And in the benchmarks they shared, aside from three of the benchmarks where either Gemini or OpenAI came out on top, in the other, what is that, in the other six or seven, Claude Opus 4.5 was tops against their older versions, Sonnet 4.5 and

Starting point is 00:10:25 Opus 4.1, and then Gemini 3 Pro and then GPT-5.1. What's important to note, there already is a more powerful version of GPT-5.1, which is GPT-5.1 Pro, which is extremely impressive. I think for me even, I mean, we'll see how Opus 4.5 fares. I don't think it's going to end up being my daily driver model. My daily driver model probably now is going to be Gemini 3 Pro, but when I need a little bit more power, a little bit more juice, I'm probably personally going to be using GPT-5.1 Pro, an amazing model, unfortunately only available on that $200 a month Pro plan. But Opus 4.5,

Starting point is 00:11:03 at least for these benchmarks on agentic coding, agentic terminal coding, agentic tool use, scaled tool use, computer use, and novel problem solving. These are the benchmarks that Anthropic shared on their website, and it's pretty far ahead of everyone else. So if you're just looking at Anthropic's website and reading their blog posts, you might just think, oh, this is the most powerful model in the world. And that's literally, like I said, what they said themselves. Is it? I don't know. Let's look at some third party, some third parties here. So on the Artificial Analysis website, and this is, we've mentioned it before on our show, this is a great, unbiased third-party site to look at when you're trying to figure out what is the best model for

Starting point is 00:11:52 what use case. So this is kind of an aggregate score of some other different benchmarks, including LiveCodeBench, SciCode, Terminal-Bench Hard, and some others. But you'll see here, even on the coding index, not even the Artificial Analysis Intelligence Index, which is kind of like their overall or kind of their final boss metric, so to speak. So even on the coding one, even though Anthropic said best model in the world for coding, when you start to look at aggregate coding benchmarks, it's not. So Gemini 3 Pro comes in ahead of Claude Opus 4.5. So interestingly enough, on aggregate benchmarks, they aren't the best in the world.

Starting point is 00:12:42 And two points actually, and this is a pretty comfortable lead. And then even on the Artificial Analysis Intelligence Index, it's not in the lead either. It is tied for second place with GPT-5.1 High, which again is not OpenAI's best model. It's not benchmarked. Their GPT-5.1 Pro is not benchmarked. It just came out last week. There's no API access yet. So that's why GPT-5.1 Pro has not been benchmarked in a lot of different places.

Starting point is 00:13:09 But even on the Artificial Analysis Intelligence Index, which is the conglomerate of all the different available third-party benchmarks, Gemini 3 Pro is fairly ahead of Claude Opus 4.5, which I said right now is tied with GPT-5.1 High. So it seems like some cherry-picking and marketing to me from Anthropic in those claims. But other third-party benchmarks do show that it's, you know, doing very well. In this case, on LiveBench, it does hold a slight lead over Gemini 3 Pro and GPT-5 High on their kind of aggregate scoring system by less than a point,

Starting point is 00:13:54 about 0.7 points over GPT-5 High and about 0.3 points over Claude 4.5 Opus. So very slim lead. But is it a top-tier model? Absolutely. Especially, I think, if you're working on any agentic tasks, if you're using Claude on the back end, absolutely, dev, software engineering, sure. But a lot of our audience, I believe, and hey, let me know in either the live stream comments

Starting point is 00:14:26 or in the podcast comments, if I'm wrong. But I think a lot of our audience is using these models on the front end and a lot of times with a team. And in that case, I don't think really Claude 4.5 Opus for general tasks is a clear runaway king of the AI hill, so to speak. But let's highlight some of the more notable kind of accomplishments and also just what's new across the board with this model. Because it's a lot more, I'd say, than just a small little marginal update to Opus 4.1 or

Starting point is 00:14:59 to Sonnet 4.5, whichever you consider to be their most powerful model. So obviously, the headlines, software engineering and coding. It's actually interesting. Before I started recording, I wanted to look at the history of what Anthropic itself has benchmarked itself against. And if you go back to some of its earlier models, you know, a year and a half ago, they're not using the same benchmarks. They've essentially given up on being a general purpose model.

Starting point is 00:15:31 And they're really only focused on, I think, making a play in the vertical space. I really think they only want to compete in the long term kind of on the software engineering and financial and the agentic sides, right? So I don't think Anthropic is really concerned with, you know, anymore being like a great strategist, a great creative thought partner, you know, being overly creative, those types of things. I don't think that's where the model has been developed

Starting point is 00:16:05 over the last six to nine months. Anyways, on the software engineering and code side, it did receive an 80.9 on SWE-bench Verified. That's absolutely frontier engineering ability. Also, according to Anthropic, it can reduce multi-day team projects to hours, you know, giving a boost to your team's engineering velocity. And it outperformed any human ever on Anthropic's own two-hour engineering exam. And this was, Opus 4.5 was the first model from Anthropic that actually outperformed any human ever.

Starting point is 00:16:42 So Anthropic has their own internal engineering exam. And this is the first time that it outperformed any human that had ever taken it. So let's talk a little bit on the API side because that's the other big play here. All right. So I am at the very end going to give you kind of the three under the radar features. But I mean, if I had to say number four, it would be the cost reduction. So previously, it didn't make sense anymore for anyone, if I'm being honest, to use Anthropic's models on the back end.

Starting point is 00:17:18 They were ridiculously expensive depending on what your task was. You know, not only before this came out, before Opus 4.5, Anthropic was not a top three model, even for engineering and coding, at least if you look at any relevant benchmark, right? Actually, let me just scroll back here a couple spots here on my screen. So you'll see, let's see, yeah, here we go for the coding index. So again, this is the aggregate of multiple coding benchmarks. Up until this week and Opus 4.5, they did not have a top-five coding model.

Starting point is 00:17:59 All right. Claude 4.5 Sonnet was not a top-five coding model. And so the fact that Anthropic's API was still anywhere from 3 to 10x more expensive than the other models that were better than them at coding. You know, I said that on the show probably two months ago. I'm like, no one out there should be using Anthropic, like pretty much, period, because across all benchmarks. And this was after, you know, essentially after Gemini 2.5 Pro, 2.5 Pro flash,

Starting point is 00:18:31 after some, you know, even OpenAI's Codex models that they released here earlier in November, it didn't make sense. It was way more expensive in most benchmarks. It literally couldn't compete even in the top five. So big move here from Anthropic to maybe go in and change that. They said, okay, yeah, we're going to come back on the top of some of these leaderboards. And yes, we are going to try to remain or, sorry, maintain our stranglehold as, you know, coders' or software developers' favorite model on the API side.

Starting point is 00:19:06 Not only that, but we're going to come in with a huge API price cut, which is still not the best price per performance, but at least now it makes sense financially. Right. So they did cut their API pricing by two thirds. So now it can, you know, they kind of took their model from the penthouse to the ground floor where the work is being done. All right.
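To put the two-thirds cut in concrete terms, here is a minimal arithmetic sketch. The episode gives only the size of the cut, so the starting price and the monthly token volume below are made-up placeholders, not figures quoted on the show.

```python
from fractions import Fraction

# Illustrative numbers only: the episode states the size of the cut
# (two thirds), not the per-token prices. OLD_PRICE_PER_MTOK and
# MONTHLY_TOKENS are hypothetical placeholders.
OLD_PRICE_PER_MTOK = Fraction(15)   # hypothetical dollars per million tokens
CUT = Fraction(2, 3)                # "cut their API pricing by two thirds"
MONTHLY_TOKENS = 200_000_000        # hypothetical monthly volume


def price_after_cut(old_price: Fraction, cut: Fraction) -> Fraction:
    """Return the new price after reducing old_price by the fraction `cut`."""
    return old_price * (1 - cut)


def monthly_cost(tokens: int, price_per_mtok: Fraction) -> Fraction:
    """Cost in dollars for `tokens` tokens at a per-million-token price."""
    return Fraction(tokens, 1_000_000) * price_per_mtok


new_price = price_after_cut(OLD_PRICE_PER_MTOK, CUT)
cost_before = monthly_cost(MONTHLY_TOKENS, OLD_PRICE_PER_MTOK)
cost_after = monthly_cost(MONTHLY_TOKENS, new_price)

print(f"before: ${float(cost_before):,.2f}  after: ${float(cost_after):,.2f}")
```

Whatever the real starting price is, a two-thirds cut means the same workload costs one third of what it did before.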

Starting point is 00:19:30 Where before, you know, even three to four weeks ago, I didn't really know. And I talked to a lot of people in the space, obviously. I didn't know anyone that was doing big numbers on the API side that was still using Claude. And if they were, it just means that they didn't do their proper due diligence on having a modular approach. And you have to be modular, especially when a model like Gemini 3 Pro can come in. And I'm sure we're going to be seeing some Flash-Lite versions coming soon. You have to be ready to swap those out. Right. So no one was actually using, right, at least huge enterprises that are spending millions of dollars annually

Starting point is 00:20:16 were not still using any of Claude's models. So big play there. And it should be interesting to see how and when and if, I guess, Anthropic continues to cut their prices. Obviously, they've had a longer standing partnership with Amazon and using Amazon's chips, but we also saw news last week for a multi-billion dollar partnership with Google to use Google's TPUs, which as of recently have been getting a lot of love. So, you know, kind of a combination of those things. And maybe behind the scenes, Anthropic reducing its reliance on Nvidia has maybe allowed it to be a little bit more competitive on the API side. All right.
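The "modular approach" mentioned a moment ago, being ready to swap one model out for another, can be sketched roughly like this. Everything here is hypothetical: the two backends are stand-in stub functions, not real vendor SDK calls, and `ModelRouter` is an illustrative name rather than anything from the episode.

```python
from typing import Callable, Dict, Optional

# A "model" here is any function that takes a prompt and returns text.
# In a real system each entry would wrap a vendor SDK call; these stubs
# are placeholders so the sketch runs on its own.
ModelFn = Callable[[str], str]


def stub_model_a(prompt: str) -> str:
    return f"[model-a] {prompt}"


def stub_model_b(prompt: str) -> str:
    return f"[model-b] {prompt}"


class ModelRouter:
    """Keeps application code independent of any single provider."""

    def __init__(self) -> None:
        self._models: Dict[str, ModelFn] = {}
        self._active: Optional[str] = None

    def register(self, name: str, fn: ModelFn) -> None:
        self._models[name] = fn
        if self._active is None:
            self._active = name  # first registered model is the default

    def use(self, name: str) -> None:
        if name not in self._models:
            raise KeyError(f"unknown model: {name}")
        self._active = name

    def complete(self, prompt: str) -> str:
        if self._active is None:
            raise RuntimeError("no model registered")
        return self._models[self._active](prompt)


router = ModelRouter()
router.register("model-a", stub_model_a)
router.register("model-b", stub_model_b)

first = router.complete("summarize this report")
router.use("model-b")  # swapping providers is a one-line change
second = router.complete("summarize this report")
print(first)
print(second)
```

The design choice is that the rest of the application only ever calls `router.complete`, so changing models when pricing or benchmarks shift does not touch the calling code.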

Starting point is 00:21:07 Next, agentic workflows and tool orchestration. So they've really optimized 4.5, Opus 4.5, for reliable agents handling more complex, multi-step reasoning. Also, again, a lot of this is more on the API side too. So tool search, being able to dynamically find needed tools from large libraries, so avoiding context pollution, right? I mean, if you are a software engineer, I'm sure that's a pain point for you, right? Working in longer context windows, you know, having to really pile on top of the scaffolding of these now very agentic models, right?
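The general idea behind tool search, picking a few relevant tools out of a large library instead of loading every definition into the context, can be illustrated with a toy sketch. This is not Anthropic's implementation: the tool names, the descriptions, and the simple keyword-overlap scoring are all assumptions made up for illustration.

```python
from typing import Dict, List

# Hypothetical tool library: name -> one-line description.
TOOL_LIBRARY: Dict[str, str] = {
    "read_spreadsheet": "read rows and cells from an excel spreadsheet",
    "create_slides": "create a powerpoint slide deck from an outline",
    "search_web": "search the web for recent news articles",
    "send_email": "send an email message to a contact",
    "resize_image": "resize or crop an image file",
}

# Common words ignored when matching, so "the" or "a" never counts as a hit.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "for", "from", "of"}


def keywords(text: str) -> set:
    """Lowercase, split on whitespace, and drop stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def search_tools(task: str, library: Dict[str, str], top_k: int = 2) -> List[str]:
    """Return up to top_k tool names whose descriptions share words with the task."""
    task_words = keywords(task)
    scored = []
    for name, description in library.items():
        overlap = len(task_words & keywords(description))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


selected = search_tools("update the excel spreadsheet rows", TOOL_LIBRARY)
print(selected)
```

Only the selected tool definitions would then be placed in the model's context, which is the "avoiding context pollution" part; a production system would use something far more robust than word overlap.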

Starting point is 00:21:48 It got a little cloudy. So apparently, Opus 4.5, a little bit better at that. And also it enables multi-agent systems that self-refine in fewer iterations. Another benchmark we didn't go over, but very impressive, and on third-party benchmarks as well, is just the vision capabilities. And that's one that I'm personally interested to test out a little more. Again, when I'm talking about these things, I always, especially on Wednesdays, right, I always want you to be thinking of your use cases, right?

Starting point is 00:22:16 So maybe you have extremely complex diagrams, right? And this is a big part of what your company does, you know, parsing information out of manuals with, you know, complex illustrations inside of PDFs, et cetera. You know, Opus 4.5, extremely high benchmarks, even on third-party benchmarks for vision and multimodal capabilities. So they did score an 80.7% on MMMU validation. Also, it added a zoom tool. OpenAI

Starting point is 00:22:51 kind of stole the show and kind of went a little viral among us AI nerds last year with this. But Anthropic finally followed suit with a zoom tool for inspecting screen regions at full resolution, just to be able to better understand images, right? The future, well, not the future, but in 2025, right, the default for large language models, they have to be multimodal. They have to be able to understand images as an input. Unfortunately, right, that's a game

Starting point is 00:23:22 Anthropic's not playing on the output side, even though, you know, obviously Google and OpenAI can output video and images. That's an area that Anthropic is not yet playing in. All right. Next big category of improvements with Opus 4.5, just improved tool support for Excel, which we're going to be talking about here in a little bit, and Chrome.

Starting point is 00:23:41 Another one we're going to be talking about.

Starting point is 00:24:58 They also have said that they've worked on their context issues. And I'm not going to frame this as a new feature. No, this is them trying to fix something that has made their platform absolutely unusable. I've talked about this before on the show, especially on the front end for front-end users. Single prompts, y'all, single prompts have busted not just the rate limits for Claude, but also the context window. I've run single prompts. So we'll see if that gets any better. Also, Anthropic says that Opus 4.5 produces more consistent domain-aware spreadsheets, slides, and documents for precision verticals. That is a huge, another kind of under-the-radar

Starting point is 00:25:47 feature that was announced a couple of months ago. The go-to-market on Anthropic is wildly poor. Let me just say that unequivocally, right? They have some of the most groundbreaking innovation inside large language models, but when they roll it out, they just roll it out to their, like, Max subscribers. So no one's talking about it, because who's a Max subscriber? Like, you know, 10 people on Twitter. So, you know, it's great features. Their file creation feature, amazing. But when they rolled it out, no one had access, so no one talked about it. So then when they kind of gave access to the masses, it's, well, it's old news, right? No one cared about it then because no one had it,

Starting point is 00:26:27 so you can't create a grand opening of a feature twice. But low-key, one of my favorite features, we actually did a dedicated show on that a couple of weeks ago. All right. So that's kind of the high level of what's new. Now, let's go over three overlooked feature updates because with this new Opus 4.5, Claude, or sorry, Anthropic also kind of snuck in

Starting point is 00:26:50 some unrelated feature updates that maybe should have just been their own announcements but weren't. So a lot of people missed these. I didn't. I saw these. I was scrolling through their kind of announcement thread. And I was like, wait, how is this in the middle of a tweet thread, right? But I guess like any good sandwich, right?

Starting point is 00:27:10 Sometimes the meat is in the middle. So number one, the Claude for Chrome extension rolling out to more people. So previously, again, speaking of Anthropic's absolutely terrible go-to-market strategy and just marketing in general. Chrome extension looked really good, right? So kind of in a similar way that you can use the Atlas browser from OpenAI, right? And it can see and understand what's in your browser. There's some elements of that in the new Claude for Chrome extension.

Starting point is 00:27:40 But when they rolled it out, they rolled it out to a thousand people, right? And this was months ago. So now it's open, well, it's not open to everyone. It's only available to those on Claude Max and, I believe, Enterprise plans, but it is noteworthy because now, at least if you are on that $100 or $200 a month plan for Anthropic, you at least have access to the Claude for Chrome extension. And with kind of their visual understanding, their agentic capabilities inside Opus 4.5, and also its kind of ability to handle that context.

Starting point is 00:28:20 This is a big, big update, right? So, yeah, if you are on a Max plan, you probably should be trying it out. I might have to begrudgingly upgrade to Max just to use this and try it for you all. So the Chrome extension also demonstrates improved robustness against browser prompt injection attacks compared to previous models. And, you know, obviously it kind of brings, you know, agentic capabilities to Chrome, right, that you would have in other browsers, like as an example, to handle tasks across, you know, multiple open tabs inside Chrome. All right, the next overlooked feature would be Claude Code coming to desktop. So Claude Code is now available within the Claude Desktop app.

Starting point is 00:29:08 So you don't have to run it separately in the terminal. So this is, I think, a boon to non-technical users. The Claude Desktop app is actually really good. I'm surprised more people don't talk about it. I'd say the capabilities leap, right? If you look at, as an example, you know, OpenAI's, you know, chatgpt.com versus OpenAI's ChatGPT desktop app, you know, fairly apples to apples. I think when you look at claude.ai, so the Claude chatbot, and Claude Desktop, the desktop is a leap better, right? Just because of everything that you can do kind of with MCPs, you know, you can have it read, right? I'm a Mac

Starting point is 00:29:54 user, so it can read your iMessages, so it can, like, look into different programs. So it has some really complex and very robust capabilities. So also now, Opus 4.5 support on that as well on desktop. And it does allow software engineers to run multiple local and remote coding sessions simultaneously. And then last but not least, number three of the underrated new feature updates. So now, Claude in Excel. So the specialized Claude for Excel product is now generally available where before it was in a beta, and it did roll out now to Max, Team, and Enterprise users. So unfortunately, if you're on that $20 a month plan like myself, you're not getting Claude

Starting point is 00:30:39 in Excel. But if you do have Excel Copilot, FYI, or sorry, Microsoft 365 Copilot. And if you're in Microsoft's Frontier program, you will have a similar version. It's just the Copilot-branded version of this, but it's actually powered, I believe, by 4.1 from Claude. So this delivers, the Claude in Excel, Anthropic's version, delivers a step change improvement for knowledge workers creating spreadsheets, documents, and slides with more professional polish. So Claude for Excel uses programmatic tool calling to efficiently read and modify spreadsheets with thousands of rows. So this is one of those things, you know, kind of about where I said originally, it seems like Anthropic is really just trying to compete vertically, right? They're not trying to compete on the app layer. They're not trying to be an everything app like Google or Microsoft

Starting point is 00:31:32 or OpenAI. It seems like they're really just sticking with, you know, software engineer, coding and people in finance or working in data, right? So if you're in spreadsheets all the time, Claude might be a model that you really want to consider, or if obviously you were in software development. All right. That's it for kind of what's new and noteworthy. Now let's check back on our cake. All right. So we're going to start with the third one, the use case that I actually couldn't show because,

Starting point is 00:32:09 unsurprisingly, Claude could not handle it, even the new 4.5. And one thing that Anthropic did specifically promote, another kind of new feature, was its extended context window, right? By far on the chat side, again, this is different than when you're looking at the API. Claude chat was absolutely terrible, right? So if you're in a chat and you dump a bunch of information, you know, literally I've had single prompts bust the context window for Claude. So my third kind of example here that I couldn't do live because I also did the math

Starting point is 00:32:56 and I realized that it would push me over my current session limit. So I'm showing this on my screen here. So my two other prompts that are complete use up 54% of my current session. And when I did this third test on its own, it took up the, more than 50%, so I knew that I couldn't do all three. Anyways, it failed. So it doesn't matter. It failed even when I did it on its own.
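The "token math" behind skipping the third demo is a simple budget check. A minimal sketch, using the two percentages mentioned on the show (54% already used, and a third task that needs more than 50% on its own); the function name and the 100% limit are illustrative assumptions.

```python
def fits_in_session(used_pct: float, next_task_pct: float, limit_pct: float = 100.0) -> bool:
    """Return True if the next task still fits within the session limit."""
    return used_pct + next_task_pct <= limit_pct


USED_BY_FIRST_TWO = 54.0   # "use up 54% of my current session"
THIRD_TASK_MIN = 50.0      # lower bound: the third test took "more than 50%"

# Even at the lower bound, 54 + 50 = 104 > 100, so the third task cannot fit.
can_run_third = fits_in_session(USED_BY_FIRST_TWO, THIRD_TASK_MIN)
print(can_run_third)
```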

Starting point is 00:33:18 And this is another example that I tried to do a month ago when talking about this new file creation feature. So I wanted to try to have Claude, or in this case, obviously, Opus 4.5, do some tool calling, right? Do some multi-step agentic work. So the file creation feature, if you haven't used it, it is absolutely amazing. I think that and artifacts are two standout features for Anthropic's Claude. So in the file creation, it can create PowerPoints. It can create Excel spreadsheets that you can actually download.

Starting point is 00:34:01 So obviously now Gemini can do that. They just rolled out their kind of slides version that we went over a couple of weeks ago as well. OpenAI can do that in ChatGPT. It's a little clunky, though, because you have to use agent mode and it takes forever. And most of their slides, or if you're talking about PowerPoints, are very ugly. Claude makes beautiful, beautiful PowerPoints. In this case, though, it failed. Let me kind of show for our live stream audience. I'll kind of say quickly what I was doing. I uploaded an older PDF that I did, a presentation, an older podcast. And I told Opus 4.5, this is more than a year old.

Starting point is 00:34:42 First, you're going to have to go research. This, you know, I gave it the website. Here's the website that went along with this presentation. You need to analyze the PDF. You need to go find the corresponding transcript or podcast episode. I gave it the link. And then I said, you need to update this slide, right? Or this slide deck.

Starting point is 00:35:00 It was about a 25-page slide deck. And so, hey, it's a year old. Go see what's still relevant. Go research everything that's not relevant anymore. So we'll see here, Claude started off by doing a pretty good job. So good instruction following early on. It kind of broke the task, a complex task, sure, into multiple steps. It started to do it, right? So it started to read what I uploaded. It was kind of calling on its PowerPoint skill. Claude has this kind of new skill feature. It went through. It correctly went

Starting point is 00:35:33 and found that URL. It found other episodes on the website that talked about agents, which is what I encouraged it to do. So it did a pretty good job, you know, calling other tools and instruction following. However, it just busted. And I did try this in three separate chats, but in each case, it just died, right? And I tried, you know, the standard, please continue, right? And it did 10 different attempts to continue. And it just died. So it busted the context window in one prompt. All right. So for all of the hoopla of people talking that, oh, Claude on the front end has extended their context window, still not that good if you're doing a complex query.

Starting point is 00:36:19 All right. So that was technically prompt number three. Let's go back to prompts one and two and see how we did. So the first one, remember, I said go find the five latest newsletters on our website, then go find 10 recent AI news stories we didn't cover and then outline those as if they were podcast episodes. All right. So we'll see here. It went in.

Starting point is 00:36:41 Not sure why. It used, uh, okay, interesting. It used my Canva connector, even though none of them were connected. Okay.

Starting point is 00:36:52 That's interesting. Oh, no, it just went to Canva.com, uh, and found the, uh, news there.

Starting point is 00:36:58 Okay. So, uh, it went, it did a little bit of research. It looks like it read, hopefully it read through the last couple of newsletters. I don't actually know if it did.

Starting point is 00:37:10 So interestingly enough, it went to our main website at YourEverydayAI.com. I explicitly told it to go to our subdomain, read.YourEverydayAI.com, which is not the same thing. So interestingly enough, I'm looking through the chain of thought. Always read through the chain of thought. Like if there's one thing you get from AI at work on Wednesdays, read the chain of thought, right? These agentic or hybrid models that think and reason and plan ahead and show you,

Starting point is 00:37:40 right, except for Gemini 3 Pro, all right? Gemini team, please add tool calling to the chain of thought so we can see what these models are doing, a little bit more transparency versus just a summarized chain of thought. So, but I don't see that it actually went to the actual newsletter. So instead, it went to the episode page and it looked at podcasts, which is wrong. That's not what I wanted it to do. So instead, it's just searching Everyday AI and it did not do it. So in this case, the instruction following, bad.

Starting point is 00:38:23 All right. So it said, I now have a good overview of recent Everyday AI newsletter topics, which is false. It went to our website, which does not have our newsletter on it. You had to go to this subdomain. So in terms of instruction following and calling the right tools to do the right thing, pretty big failure here. And instead, it just looked at topics that we covered on the podcast, which are two completely different things. So failure here. Let's at least see if it did a decent job at creating us kind of podcast segments anyways.

Starting point is 00:39:01 So it looks like it ran some JavaScript, interesting. It said, let me create a comprehensive Word document, but instead it just created a JavaScript file. Interesting. So I could obviously go through and convert this JavaScript file. I'm kind of scrolling through it. So yeah, it looks like it probably planned some of this information out. But yeah, it said that it was going to, I can see in the thought process that it actually went and did it.

Starting point is 00:39:37 But big fat failure here, big fat failure. All right. So next one, and this is one I was looking forward to. I'm going to hit refresh on this. Okay. Let's see. Do we get another? Did we get another failure?

Starting point is 00:39:59 That would be surprising. Let me see. I'm wondering if something is up with my browser. I'm going to go ahead and publish this artifact. Open it in an incognito window here. And let me see if it actually rendered. So interesting. Did we get a bunch of failures from Claude?

Starting point is 00:40:26 You know what? I'm going to give it a second chance here. And I'm going to say, you know, nothing rendered. Please try again and rebuild this using Claude Artifacts. Okay. So interestingly enough, even though I go through, I'm going through, right, this is where I gave it my podcast stats. I said, you know, find 10 obvious trends, find 10 hidden trends, and then find me 10 episode ideas that maybe I should be planning based on these trends that you find, you know, in these thousands of rows of data in the spreadsheets that I upload. It seemed like it didn't even render anything inside Claude Artifacts.

Starting point is 00:41:12 So I'm giving it a second try here. So we'll see what it does, right? And I don't want this to go on for too long because here we are at the 40 minute mark. But I will tell you, Claude Artifacts is one of my absolute favorite features, right? I will say that has probably recently been leapfrogged by Google Gemini Canvas. Google Gemini Canvas, especially with the new Gemini 3 Pro, is unfair, right? If you aren't using it, all right, I'm going to say this out loud. If you're not using Google Gemini Canvas every day, you're leaving immense value on the table.

Starting point is 00:42:00 I'll just say that. It is not just extremely flexible in what it can accomplish, but it is visually, aesthetically, just stunning, right? Which is, yes, I know that there's a matter of taste to that. But the overall flexibility and utility of Gemini Canvas is amazing. So, you know, it's obviously something very similar to Claude Artifacts, where you can run and render code, right? So you can create interactive dashboards. You can create little micro apps or full-blown like SaaS type apps, right,

Starting point is 00:42:33 inside Google Gemini Canvas. And you can do that inside Google's AI Studio using their build tool as well. So similarly, you know, Artifacts was actually kind of first on the scene in its ability to write code, right? Not just write code because that's obviously one thing that the Claude models are great at. And here we're testing out Opus 4.5, but the ability to render it kind of using their Claude Artifacts engine. So here for our live stream audience, as I'm drawing this one out, sorry, you know, it's

Starting point is 00:43:04 writing the code on the left side. And how it's supposed to work is it is supposed to render the code on the right side. So this is supposed to give me a nice looking dashboard. I said it should be like SaaS worthy, right? So it should be so good. It feels like a product that you pay for, but it's personalized to me, to my information. So, you know, I don't have to know how to write code to do any of this. I just have to know, right, here's a bunch of information.

Starting point is 00:43:35 Go make me something super useful and use Claude Artifacts, right? All right. So I'm going to give it. I'd hate to put it on a shot clock because now I've already drawn it out like three minutes. I'm wondering, I'm going to go ahead and maybe see if I can quickly. Let me go up, because I did use a very similar prompt inside the Google Gemini version of this that I did probably last week. So I'm going to say, let's see, top 10 obvious trends. I'm going to see if I can just quickly pull up the Google version of this. All right. So doing a little multitasking

Starting point is 00:44:14 here, live stream audience. Thank you for bearing with me. Let's see. Let's see if I can find the one. I have so many paid accounts. It is hard to find them. I must have done it in my personal account because that is where my Ultra, Google Ultra subscription is. So let's see if I can find that. There we go. All right, cool. So let's go ahead while we're waiting because, you know what? What a bummer. What a bummer, Claude. Opus 4.5.

Starting point is 00:44:51 Let's scroll down. It looks like it's another failure. Another failure. Oh, no. Still writing and rendering. All right. So we might just have to end on this, showing you the version that Gemini,

Starting point is 00:45:05 Gemini 3 Pro built. So good. I was like flabbergasted last week. Again, running some similar problems, or some similar use cases or internal benchmarks that we did last week when going over Gemini 3 Pro. This is really good. I mean, our live stream audience, this literally looks like a full-blown SaaS application, all just based on my data. So everything that we just asked Opus 4.5 to do, Gemini 3 Pro just absolutely crushed it.

Starting point is 00:45:42 Beautiful, interactive. It shows the overview. It shows the trends, the top, top 10 obvious, top 10 hidden trends, opportunities. They're all color-coded, nice icons, right? Really, really good. And then it also included some of the raw data. Let's see. I didn't even see if the search works.

Starting point is 00:46:01 Oh, the search works, you know, interactive search. So it looks like unfortunately, all right, because I'm going to, I'm going to stop it there. Claude is still going. I think the first version might have taken like 15 minutes. I should have checked. Yeah. So you know what? If it finishes, if Claude can redeem itself with this Opus 4.5 with this artifact, we'll go

Starting point is 00:46:23 ahead and put it in our newsletter. All right. So we're going to have to wrap it there leaving you on a cliffhanger, but not my fault. So again, live demos of generative AI usually never a good idea, right? But hey, last week when I did the same with Gemini 3, Gemini 3 Pro was good, right? Everything worked. You know, today, mainly hiccups with Opus 4.5. So don't take my word for it, right?

Starting point is 00:46:49 This is obviously, I'm still going to be using this model every single day. Sometimes live use cases aren't the best, but hey, I showed you. Sometimes best benchmarks are still going to have hiccups. So I hope this episode was helpful. And, hey, on our Wednesday series, I always want you to think about what is your use case. Tell me, right? Tell me in the comments. I always go through and read the

Starting point is 00:47:12 comments on LinkedIn or on the podcast. So if you're listening on Spotify, thank you. You can leave a comment there. So thanks for tuning in. If you haven't already, please go to our website at YourEverydayAI.com. Go sign up for that free daily newsletter. Thanks for tuning in. FYI, happy Thanksgiving to everyone. So tomorrow for Thanksgiving and Friday, actually, the podcast and the newsletter are taking a little, little break, right? People tell me how tired I look. I'm going to try to sleep a little bit. So we will see you Monday for the AI News That Matters. So thank you for tuning in. And we'll see you next time on Everyday AI.

Starting point is 00:47:49 Thanks, y'all. Meet Firefly AI Assistant. Now live in Adobe Firefly, the all-in-one creative AI studio. Just describe what you want to create in your own words and the assistant handles the rest, orchestrating multi-step workflows across Adobe Creative Cloud apps, including Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome while the assistant accelerates execution. Stay in control with the ability to step in and refine at any time.

Starting point is 00:48:22 See it today at firefly.adobe.com. And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.
