Everyday AI Podcast – An AI and ChatGPT Podcast - Ep 731: GPT-5.4 Hands-On Review: 5 Reasons Why it Will Be the Best AI Model You’ve Ever Used

Starting point is 00:00:00 This is the Everyday AI Show, the Everyday Podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. OpenAI's newly released GPT-5.4 model crossed a new threshold for me.

Starting point is 00:00:51 And I spend thousands of hours each year evaluating and stress testing models, so that says a lot. But GPT-5.4 is the first AI model that I think hits the full usability trifecta to be a true daily driver model without any compromise. It's natural enough to chat with, number one. Number two, it's legit off the charts in terms of general intelligence and transparency. And three, it follows instructions to a T, no matter how daunting or challenging or long the task is that you throw at it. And I think there's been a lot of talk lately about how the models are becoming less and less important as the harness and tool use become more and more important and where the moat is actually

Starting point is 00:01:38 at. And to a certain point, I do agree with that. But with releases in 2026, the line between model and harness and tool use starts to blur. It's because model updates used to bring only updates in the underlying intelligence engine. Not anymore. Now model updates like OpenAI's impressive upgrade to GPT-5.4 also bring with them major changes to the harness and the tools, which completely changes what an AI model can actually accomplish. And with this round of updates to GPT-5.4, I think OpenAI knocked it out of the park. So today, we're putting AI to work on Wednesdays as we go under the hood a bit with GPT-5.4 as we go hands-on. And I will also break down the five reasons why I'm confident GPT-5.4 will be the best model you've ever used yet. All right.

Starting point is 00:02:38 I'm looking forward to this one. So on today's show, if you stick with me, here's what we're going to go over. So you'll hear what's new and noteworthy in OpenAI's newest model, GPT-5.4. You'll learn why you may be able to get the benefits of the $200-a-month Pro plan without even really paying for it. You'll know the reasons why the model is best to be your daily driver right now, at least. And you'll leave with the five reasons why it'll be the best model you've ever used. All right. Let's get into it, shall we?

Starting point is 00:03:11 If you're new here, welcome. My name's Jordan. This is Everyday AI. And well, this thing, it's for you. Unedited, unscripted, just bringing you the realest information and intelligence in artificial intelligence. And hopefully giving you the tools to grow your company and your career. So if you're along on the journey, awesome. It starts here.

Starting point is 00:03:31 But to take it to the next level, make sure you go to our website and go sign up for our daily newsletter. We're also going to be recapping today's show. So if you are brand new here, on Wednesdays we do Putting AI to Work on Wednesdays. So it's usually a more hands-on and practical use case of, you know, usually one of the big four, right, between Microsoft, Google, OpenAI and Anthropic. We really like to go hands-on on Wednesdays. But if you want to know more of the details and the benchmarks of OpenAI's new model, we did cover that right after its release on Friday, so you can go click the back button a couple of times if you're listening on the podcast

Starting point is 00:04:12 to episode 728, where we go over more of the release and the seven trends that I think you need to know about OpenAI's new model, which are so much more than just benchmarks. All right. Let's get into this. And I'm not going to make you wait any longer for the five reasons. So number one, interrupting thinking mode. All right. That's one of the reasons why I think this is

Starting point is 00:04:38 going to be the best model that you'll ever use. And you might be wondering, like, well, number one, what is that? And well, why does it matter? Okay. So the big four, well, in this case, the big three model makers, since, you know, Microsoft uses Anthropic's models and they use OpenAI's models. When you use a thinking model, it can take a terribly long time. And I think sometimes people don't use thinking

Starting point is 00:05:08 models for that very reason. They're like, well, I need an answer right away. Or, hey, if I realized that I forgot to say something or forgot to do something, I don't want to have to wait five, 10, 50 minutes for a model to finish its thought process and to give me the answer. So this is a new feature in GPT-5.4 for the masses, because this was actually available on the previous Pro tier, but now it's available to anyone that's on a paid tier.

Starting point is 00:05:41 And I should probably start with that, although we did go over that in Friday's episode, but I should let you know, right, to use the thinking model, the new GPT-5.4 Thinking or GPT-5.4 Pro, you do have to be on a paid plan, whether that's the $20-a-month Plus plan, the $200-a-month Pro plan,

Starting point is 00:06:01 the business plans, enterprise, EDU, et cetera, right? So if you're wondering like, where is 5.4, well, that's where it is. It's on the paid plans. And actually, another small thing here that I'm just realizing kind of now, even though I've been talking about GPT-5.4 quite a bit already, maybe it's good that they didn't come out with the free or the instant version of 5.4. And maybe that was intentional.

Starting point is 00:06:26 And hey, OpenAI folks, I know there's a few of you listening. If this was not intentional, go ahead and take this and say it was. I think one of the biggest downfalls of ChatGPT is people are using the bad model. There's always a bad model in any one that you use, whether you're using Claude, Microsoft, Gemini, ChatGPT, right? So previously, seven days ago, right, before there was 5.4 or even 5.3 Instant, I'm going to get to that here in a second, we were living in a GPT-5.2 world. So when everything was GPT-5.2, well, people just thought, well, I'm using the best model.

Starting point is 00:07:06 Well, no, because if you were using the instant model, which is the chat model, it doesn't think and it's not really good. Right. So essentially last week, OpenAI came out with GPT-5.3 Instant, which was kind of confusing because then the next day, they came out with GPT-5.4 Thinking and 5.4 Pro. So with all the model confusingness that's going on, maybe it's actually a good thing, because someone knows, well, if you want the best, you should be using GPT-5.4. And at least right now, there's not a bad version of it, whereas generally there's always been a quote-unquote bad version

Starting point is 00:07:45 of the best model. And the overwhelming majority, I am talking hundreds of millions of users worldwide, don't know the difference. So maybe this is actually a great thing. Anyways, getting back to the number one reason, interrupting thinking mode. And you'll see in some of these examples here,

Starting point is 00:08:02 I already did them, but we're going to go live under the hood because I'm not going to make you wait. Some of the thinking models took like 35 plus minutes, right? But if you see something going wrong, you can course correct it. Unfortunately, you can't upload files or use different tools during that course correction. But right now, ChatGPT is the only one from a major model maker to offer this. Right. So if you are using, as an example, Claude Opus 4.6, if you're using Gemini 3.1 Pro, and you're using the thinking, which you should for most tasks,

Starting point is 00:08:36 and you see something's going wrong, and you're like, oh, crap, forgot to do something. You could be five, ten minutes in, right? And you either have to scrap it or you have to accept a subpar answer. And that kind of stinks, right? I've been using the thinking interrupting because I've been on the Pro plan for a while on ChatGPT, actually since it came out. So I'm kind of used to this on the Pro level, but it's really nice to get this on the Thinking level, because this is where the masses are, and this is where you should be: using the thinking models.

Starting point is 00:09:10 All right. Reason number two, it can access skills now. Yeah. You didn't know this probably because OpenAI didn't even announce this. I don't even think there was a tweet out. I think they just updated a blog post. So skills was actually a big advantage for Claude. Anthropic kind of created and popularized skills, and now they're really used across the industry. But up until, well, a couple hours ago,

Starting point is 00:09:37 skills was only available in Codex, which is ChatGPT's coding tool, although I think it's way better, sorry, FYI. I think it's way better than Claude Code and Claude Cowork combined, even though I use all three. Codex is a nice in-between. It's like a blending of the two. Anyways, you could use skills

Starting point is 00:10:00 inside of Codex, which is ChatGPT's desktop app or their command line interface tool, more of their coding tool. But they kind of slid in these skills under the radar. But unfortunately, they're only available on business or enterprise plans right now. But to be able to pair up skills with GPT-5.4 is huge because again, up until a few hours ago, I still would say that was one of the reasons why I was still using Claude a lot more for some of my quote-unquote daily driving. I think skills and that framework, which we'll probably dive into a little deeper on a

Starting point is 00:10:38 future Start Here series. So if you haven't been listening to our Start Here series, it's how to go from, you know, zero to 10 or at least zero to five, you know, in understanding AI. Skills are great. They're a little different than GPTs. They're different than projects, right? There's a great and flexible utility to skills. But now that you can pair skills with GPT-5.4, that's pretty big.

Starting point is 00:11:06 All right. Number three, a benchmark that users will actually feel is going to make the difference with GPT-5.4. And that's BrowseComp. Okay. So BrowseComp, if you've never heard of it, it evaluates an AI agent's ability to find obscure, hard-to-verify information through persistent multi-step web browsing.

Starting point is 00:12:16 Why is this incredibly important in a large language model? Well, for a lot of reasons, but one that I think sometimes gets overlooked and, you know, I talk about a lot on the show, but with 700 plus episodes, it's worth mentioning again. For the most part, when you're using today's large language models, even the latest ones, they're working with a very old knowledge cutoff, right? Usually, you know, it might be six or so months, right? But that's just the absolute

Starting point is 00:13:08 best case scenario because many frontier labs are using offline data sets. And the data in those data sets might be two years old, right? So the ability to browse the web accurately and to follow your instructions while browsing the web is not just a nice to have. It is an absolute necessity. Because if you are using the outputs of a large language model for business purposes, which is like all of us, everything changes, right? Unless you're writing a history paper or you're using this, right, to just, I don't know, do something about ancient, I don't know, ancient history, right? But everything else changes.

Starting point is 00:13:51 Even if you're using this to, you know, market your business and maybe your industry is a slow-moving industry, well, marketing is changing daily. Right. So BrowseComp is huge. And OpenAI is now the world leader in BrowseComp. And it's going to be a noticeable jump. So although they're only a few percentage points now ahead of Google and Anthropic, I think those few percentage points can actually be felt.

Starting point is 00:14:19 It is actually a huge jump from where they were with the last models, which is GPT-5.2. So if we look at the normal, just thinking versions, GPT-5.2 was a 65% on BrowseComp and GPT-5.4 is an 82%. So a huge jump, right? And then you have GPT-5.4 Pro at 89%. Right. And that is, even though it's about, you know, four or five percentage points above Anthropic and Google's

Starting point is 00:14:46 offerings there, you can tell. And I will have a very small secluded example of that as we go live here. Also, worth noting on BrowseComp, yeah, Anthropic essentially fessed up, which good on them, right? I do think Anthropic is really good at this: when they find issues, they say it. Well, they said even with their score on BrowseComp, they figured out that Claude was cheating. So they said on their website, evaluating Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it, raising questions about

Starting point is 00:15:25 eval integrity in web-enabled environments. So, yeah, essentially, while doing their evaluation for this, they found that, oh, Opus realized it was being evaluated and kind of decided to cheat. Anyways, that's something that is actually going to be felt, BrowseComp, because of how important it is. And most people don't understand the amount of agentic browsing that is needed for everyday intelligence and for everyday business use. You need it constantly. And you need it to be really good.
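A quick back-of-the-envelope sketch of the BrowseComp numbers quoted above (65% for GPT-5.2 Thinking, 82% for GPT-5.4 Thinking, 89% for GPT-5.4 Pro). It assumes those figures are plain percent-correct scores; the `compare` helper and the three ways of expressing the gain are illustrative choices, not anything the benchmark itself reports.

```python
# Scores as quoted in the episode (percent correct on BrowseComp).
scores = {
    "GPT-5.2 Thinking": 65,
    "GPT-5.4 Thinking": 82,
    "GPT-5.4 Pro": 89,
}


def compare(old: float, new: float) -> dict:
    """Express a score change as points, relative gain, and error reduction."""
    old_errors = 100 - old  # share of questions missed before
    new_errors = 100 - new  # share of questions missed after
    return {
        "points": new - old,
        "relative_gain_pct": round((new - old) / old * 100, 1),
        "error_reduction_pct": round((old_errors - new_errors) / old_errors * 100, 1),
    }


jump = compare(scores["GPT-5.2 Thinking"], scores["GPT-5.4 Thinking"])
print(jump)
```

Looking at the share of questions missed, rather than the share answered, may be the more useful lens here: on these numbers the misses drop from 35 in 100 to 18 in 100, which is one way to read why a gap of a few points between models can be something you notice in daily use.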

Starting point is 00:15:58 And you need it to, number four, our number four point, and three and four go together: instruction following. The instruction following right now on the higher thinking models is otherworldly. Okay. And let me talk about this. And this is also the little secret there that I teased in the beginning, how you might be able to get that $200-a-month value out of a $20-a-month plan. So on the base ChatGPT Plus plan, you do not get GPT-5.4 Pro, which is a bummer. All right. Hey, little secret, if you just get on the business plan, which is $30 a month, minimum two seats, you get a couple Pro queries and it's worth it, even if you're the only one using it and you have two seats, FYI. So what I found through my testing, which I was kind of surprised by, aside from the instruction following, which is outstanding and that will probably make more sense showing you live.

Starting point is 00:16:59 But even just with the thinking models, that's why I say instruction following on higher thinking models is otherworldly. So if you are on the ChatGPT Plus plan, you get two kind of levels of thinking. If you're on the Pro plan, there's four levels of thinking on the thinking models, right? On ChatGPT Plus, the higher level of thinking, I found it to be, I wouldn't say the results were comparable, but it put in the same amount of reasoning effort as the Pro plan, as GPT-5.4 Pro, on a lot of my internal testing. And it actually in many cases took longer to think and took more steps. Again,

Starting point is 00:17:46 the output wasn't always better or even the same, but it was comparable. And that's really important to point out, right? I think it's pretty well known across the industry. I don't care if you're a fanboy of OpenAI, Microsoft, Anthropic, Google, it doesn't matter. I think most people know and have understood for a long time: if you need something right, and getting it accurate and correct is of utmost importance, you always, well, up until this past week, you would go to GPT-5.2 Pro. Now you go to GPT-5.4 Pro.

Starting point is 00:18:28 But the problem is it's extremely slow. And well, it is expensive if you're using it. Well, even if you're using it in ChatGPT, the $200-a-month plan, that's kind of expensive. If you're using it in the API, it's ungodly expensive. But the higher level of thinking now in the base ChatGPT Plus plan for $20 a month, again, they're not on the same level, but it closed the gap, right? There were so many things previously that I would always just use Pro for.

Starting point is 00:19:01 And I would never think of using GPT-5.2 Thinking for so many tasks. Now, I don't think twice about it. Even if I just have those two tiers of thinking, the higher tier of thinking is so much better than it was before. It's closed the gap. I think, if nothing else, it's going to maybe just allow people to use Thinking way more and to use Pro way less, right? Maybe it'll end up saving, you know, OpenAI some money in the long run. And then number five, it is the most natural, generally intelligent chatbot that I've ever

Starting point is 00:19:42 used. And I think for the first time, maybe ever, I felt a model didn't have these out-of-the-box glaring weaknesses in either intelligence and transparency in that intelligence, or chatting. So here's what I mean. I think Gemini 3.1 Pro is on this same level as GPT-5.4, the Thinking and the Pro level. The problem is, with Gemini 3.1 Pro, you don't have the transparency of intelligence, right? Which is important because, let's be honest, humans, when was the last time that you used

Starting point is 00:20:20 any of these models for an entire day, right? And you look at the answer and you're like, yeah, I knew that. I feel confident in this. No, right? It's so important to be able to look at the chain of thought, the summarized chain of thought, and to be able to transparently see where these models are getting the information. So unfortunately right now, Gemini does not provide all of that information in the same way that OpenAI and Anthropic do, right?

Starting point is 00:20:55 I know they're changing it. I chatted with them. They've said as much, that they're eventually going to be bringing a little bit more transparency to the chain of thought. I know there's problems with competition, distillation, all those things. So, things I don't understand that they have to protect. But in terms of business use, right, it's one of the main reasons why I love and have loved for so long the GPT-5 Thinkings, and even going back

Starting point is 00:21:20 to o3 and o1. The chain of thought not just shows, I think, that OpenAI's models are better. They provide more transparency and you can understand them more. And they're just way better at instruction following. So that's on the one side. It's just generally intelligent in a transparent way. And then on the other side, you can actually talk to it. Let me be honest. I don't really care about having a pleasant conversation with a chatbot. And I do know that OpenAI, you know, has some newer settings where you can default its voice to just be concise. And, you know, even just changing those, the default response does it for me. But I know for a lot of people, they don't go in and do that.

Starting point is 00:22:04 And, I think, historically, OpenAI's models have been overly sycophantic. They've been verbose. And, you know, even OpenAI admitted that they've been cringe, right? So for the first time, I think maybe ever, you have a model that is transparently intelligent off the charts, number one. But number two, you can actually talk to it, right? Not in a, you know, you're-my-hype-man kind of way, right? But man, I mean, honestly, I spend way too much time quote-unquote chatting with large language models. I'm really just directing them agentically, right? But I get so tired of the way they respond, and I'm like, I can't even read this.

Starting point is 00:22:48 It's, you know, so cringe, so sycophantic, so verbose, whatever, right? And I think for the first time, we're not getting that anymore. It hits the sweet spot. All right. So before we go live, let's jump in and live stream audience. Thanks for sticking around. This is going to be a little shorter on the live end just because, you know, we can't really like watch a 20, 30 minute prompt. That'll be super boring, but we are going to go under the hood.

Starting point is 00:23:16 But a couple of things to keep in mind. And I'm going to go ahead and I'm going to call out the people that are calling me out. All right. I get accused a lot of, it's actually strange because I get accused of pumping, you know, OpenAI, but then I get accused of pumping Anthropic, but then I get accused of, you know, pumping Google, right? I get accused of being a fanboy for everyone, but also against everyone. It doesn't make sense.

Starting point is 00:23:40 But overwhelmingly, I think it's fair to say, I am more partial to OpenAI and Google than I am to Anthropic. So let me just speak to those people, because I get a couple of messages every single week. People accusing me, oh, Jordan, you don't know what you're doing. You don't know what you're talking about. You're clearly an idiot. Anthropic's great.

Starting point is 00:24:01 Okay? Let me just tell you this. When I'm doing these demos, when I'm doing these shows, right, not just our Putting AI to Work on Wednesdays shows, but just the 700 plus episodes, I am speaking to the general business leader, right? The C-suite exec. Sometimes that person is a technical person. Sometimes they're not.

Starting point is 00:24:22 My approach has always been about using AI to automate tough, general tasks. And hey, the reality, and I went over this on the show Friday, so go listen to that. Yes, Anthropic has generally always held a sizable advantage in, you know, software engineering, agent orchestration, computer use. Well, not anymore. OpenAI, with this model, they actually took their lunch money on that. So yes, up until last week,

Starting point is 00:24:55 I do think Anthropic had some huge advantages. And y'all, I kid you not, I'm using billions with a B, billions of tokens between Claude Code, Claude Cowork, Codex, Antigravity. Like, I have max subscriptions to everything, and I hit my rates constantly, right? Billions of tokens.

Starting point is 00:25:19 So I know what I'm doing. I know what I'm talking about. All right. Just I'm putting that out there. Is Claude great? Absolutely. And it's great for certain tasks. Is Gemini great?

Starting point is 00:25:30 Absolutely. It's great for certain tasks. But I do think with this one, this is the first one that I feel confident. You can go back and listen. I've never, in 700 plus episodes, I've never said, hey, I think this can be a go-to daily driver model, because I think for the most part it's better to be jumping around in multiple models. But I do feel maybe for many people 5.4 Thinking will get you there. So when I go through these demos and when I kind of show you under the hood here, I want you to think of a

Starting point is 00:26:04 multi-step tough problem. What is that multi-step tough problem that you have? All right. And then I encourage you, do that same thing. It's got to be tough. Do that exact same thing in GPT-5.4 Thinking. If you have access to GPT-5.4 high, do it there. Do it in Opus 4.6 with extra reasoning. And then do it in Gemini 3.1 Pro. Do it yourself.

Starting point is 00:26:35 It's got to tackle the entire gauntlet, right? Data analysis, web research, reasoning, tool use, instruction following, common sense. That's what I'm going to show you now. And I'm telling you people, everyone that gets on my back about, oh, Jordan, you know, you're, yeah, I've been on the, up to the highest tier. So, FYI. All right, let's look live. So here's what we're going to do, y'all.

Starting point is 00:27:16 We have a lot of stats here. All right. These are my podcast stats. Yes, we're going to do another thing looking into my podcast, right? I'm not going to jump in and pretend to show you financial analysis on something. That's not my background, right? That's not what I'm using it for. I'm doing my use case.

Starting point is 00:27:31 Think of yours when I walk you through mine. All right, but I've actually had this put together. This version was put together by Codex and then it was enriched by Claude Code. But essentially I have a mountain of data from my podcast stats. So I've done this before in the past, but this version is a little better. So there's certain things that I can get out of my provider, which is called Buzzsprout, but not what I really need to make educated decisions. So my problem: I have 700 plus episodes, and I can't easily export all of the data I need

Starting point is 00:28:08 to make good decisions on what type of episodes I should be doing more of and which types I should be doing less of. So, using both Claude Code and Codex, I put together this better look at my stats. So there are more than 20,000 data points in here. So yeah, I had to grab a lot of this with an agent with some APIs, had to do a lot on it. But essentially, I'm able to get the episode titles, normal stuff that I would normally get, episode length, but I'm also able to get if it's a guest or solo, the number of

Starting point is 00:28:46 plays, consumption hours, retention across quartiles, which is super important, completion percentage, discovery, people reached, right? All of these different metrics that are not usually available when I just go click export on my stats. Problem is, this is a ton. It's a ton of information. And these shows are not all categorized. All right. So that's another thing I'm going to be telling the models to do. All right. So let's jump in. And I'm going to go ahead and read. Let's go here. I'm going to go ahead and read the prompt that I sent. And then I'm going to kind of go through the responses here. So this is using GPT-5.4 on heavy thinking. So again, if you're on the Pro plan, you have light,

Starting point is 00:29:34 standard, extended and heavy. If you are on the $20-a-month plan, you have standard and extended. So I said, and I uploaded the file, I said, this is a comprehensive list of stats from my Spotify analytics for my podcast, Everyday AI. Please take your time analyzing all the data. Keep in mind everything you know about me and Everyday AI in personalizing your replies, including strengths, growth areas, known bottlenecks, and constraints. Your responses to most of the below should take into account growing the podcast, saving time without sacrificing quality, and improving new audience discovery, retention, stickiness,

Starting point is 00:30:10 consumption, downloads, et cetera. All right. Use every tool available at your disposal, right? I'm trying not to read all of this because it's a lot. And then I'm saying avoid, let's just say, I do want to make sure I include one of these parts. Okay. When and if needed, you can access complete transcripts on my website at youreverydayai.com. So then I said, after carefully and meticulously analyzing the data, please reply back with.

Starting point is 00:30:37 And then I have five different categories. And in each of these categories, I do have four to five different things that I'm asking. So for category one, it's essentially obvious trends. And I'm, you know, asking for it to give it to me in three different ways, 20 obvious trends. Category two, under-the-radar trends. I'm asking for 20 under-the-radar but meaningful statistical trends, three different ways. Then I'm doing comparisons. All right. So about, looks like, six, no, five different ways. So, you know, as an example, the 10 types or categories of shows that are the most popular

Starting point is 00:31:17 and why. So just so you know, you weren't able to fully see and read my spreadsheet, especially if you are on the podcast only, right? You can always watch the video version. Yes, there is a video version. Go watch it on our website at youreverydayai.com. So my spreadsheet didn't have categories, right? I have another version with categories.

Starting point is 00:31:41 I honestly lost it. It's like buried, I don't know, in Cowork or Codex somewhere. So I am also having it categorize it. So it's not just reading, right? It's not just reading the data. It's having to crunch the numbers and it's also having to think, right? Hey, according to this show, what is it? What category is this?
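To make the "categorize, then crunch the numbers" step concrete, here is a minimal sketch, assuming each episode is a row with a title, a play count, and a completion percentage. The keyword rules, field names, and sample rows are all made up for illustration; in the demo the model infers categories by reasoning over titles and transcripts, not from a lookup table like this.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keyword rules standing in for the model's judgment.
CATEGORY_KEYWORDS = {
    "model review": ["gpt", "claude", "gemini", "review"],
    "news": ["news", "this week"],
    "how-to": ["how to", "guide", "hands-on"],
}


def categorize(title: str) -> str:
    """Return the first category whose keywords appear in the title."""
    lowered = title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "uncategorized"


def summarize(episodes: list[dict]) -> dict:
    """Group episodes by inferred category and average their metrics."""
    grouped = defaultdict(list)
    for episode in episodes:
        grouped[categorize(episode["title"])].append(episode)
    return {
        category: {
            "episodes": len(rows),
            "avg_plays": mean(row["plays"] for row in rows),
            "avg_completion_pct": mean(row["completion_pct"] for row in rows),
        }
        for category, rows in grouped.items()
    }


# Made-up sample rows, not real podcast stats.
episodes = [
    {"title": "GPT-5.4 Hands-On Review", "plays": 1000, "completion_pct": 60},
    {"title": "AI News That Matters", "plays": 800, "completion_pct": 50},
    {"title": "How to Build a Prompt Library", "plays": 600, "completion_pct": 70},
    {"title": "Claude vs Gemini Review", "plays": 1200, "completion_pct": 40},
]
summary = summarize(episodes)
for category, stats in summary.items():
    print(category, stats)
```

The point of the sketch is the shape of the task: a labeling pass has to happen before any per-category averages exist, which is why an uncategorized spreadsheet tests reasoning as well as arithmetic.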

Starting point is 00:32:03 What does this mean? All right. And then the fourth category is March 2026 planning. So I'm essentially saying, hey, based on all this data, what works and what doesn't, go research trends. See what I haven't covered. That's important. I'm asking what I haven't covered, but I should.

Starting point is 00:32:23 All right. And then last but not least, I say, you're in charge, right? If I wanted to double my audience this year, what are the different things I should be doing? And then, you know, asking that in three different ways. All right. So that's essentially what I asked. And then I did do this both in thinking, heavy thinking mode.

Starting point is 00:32:42 I did this in GPT-5.4 Pro. And then last but not least, just for fun, I also did it in Opus, just to have a baseline, right? And just because, I don't know, I think I need a demonstration I can just send to people that all the time are telling me I don't know what I'm talking about

Starting point is 00:33:00 because Anthropic's so much better. No, I know what I'm talking about, y'all. Will it be next week? Maybe. Today, it's not. And it hasn't been for a very long time. For general knowledge work, hard tasks,

Starting point is 00:33:16 Anthropic has never been the top model. Period. Right. Look at the benchmarks all you want. It hasn't been. So here's where we're going to start to dig in a little bit and why I'm going to start kind of referencing back some of those five big points that I talked about earlier. So, oh, at the very end, I did say,

Starting point is 00:33:38 to ChatGPT, the only difference in these prompts: at the very end, I told ChatGPT, I said use canvas mode to put together a sleek, interactive, and useful dashboard that includes all of this information. And then for Anthropic,

Starting point is 00:33:55 Claude, I said the same thing, but I said using artifacts, right? Because they don't have a canvas mode. It's called artifacts. So that's the only difference. Otherwise, everything was the exact same. Okay. So, and then also, with the Pro, GPT-5.4 Pro, you cannot use canvas, so there was no dashboard. So both models

Starting point is 00:34:18 completed the task the quality in the nuance completely different all right and i will start to show a couple of the things uh so let's scroll down here let's scroll down here a little bit all right so big big big differentiator right here right it thought for 39 minutes and 47 seconds all right there's no timer on the Claude Anthropic which they used to have that I don't know why they don't anymore but it was about four minutes all right so you can look at that in a good way or a bad way well I'll tell you spoiler alert Claude's version was I won't say trash but compared to GPD 54 thinking, Claude's version was trash, right?

Starting point is 00:35:10 Exact same prompt, exact same data, memory, chat history, all the same, right? I upload those in markdown files. They have the same thing. It was bad. It was really bad. So you can't just look at time. You have to look at output.

Starting point is 00:35:26 And I'm going to show you a couple of things. But first, we have to be able to see here what GPT-5.4 Thinking did. And again, this is not one of those, I think on 5.2, I should have run it. Maybe I'll rerun this on 5.2 and put it in the newsletter. I'm guessing it probably would have only taken 20 or so minutes. And it wouldn't have picked up on nearly half of the nuance that it did in this case. Right. And one of the biggest things right off the bat, that I didn't even tell it to do: it says, Spotify says discovery data is a last-30-days view and can take up to 48 hours to refresh, right?

Starting point is 00:36:08 Before, and this is in the very first paragraph, right? Because if you look at the chain of thought, you'll see one of the first external websites it goes to, it might have pulled it up in the API. I'll have to look later. But it instantly looked at Spotify because I told it these are my Spotify stats. Guess what Claude did not do? Well, it didn't look at that. And you might be saying, okay, why?

Starting point is 00:36:33 Why does that matter? Well, because a lot of what Claude suggested in this case, and I'm not trying to turn this into a GPT-5.4 versus Opus 4.6, but I know that's what a lot of you all are going to be thinking. The bulk of what Claude recommended was just dumb because it didn't do the basic work, right? I say basic, but it's actually nuanced and super smart, that check. ChatGPT went out and found that the Spotify discovery data is only the last 30 days because one of those columns in there is discovery, right? So it's how many people are discovering each episode.

Starting point is 00:37:16 So Claude had all these straight-up not useful and off-the-wall and incorrect kind of insights throughout this entire document because it didn't understand that that was only the last 30 days, right? It's like, your discovery, you know, has gone up 268x, you know, this month. It's like, no, it hasn't. It's just because the discovery is only the last 30 days, right? This is something like an intern that wasn't using their brain would come in and be like, oh, look, look at these stats. Wow, the last 30 days have been way better in certain categories. It's like, no, dummy.

Starting point is 00:37:54 You didn't think, right? And this is why, again, I think the difference here on the thinking model is huge, because I don't know if GPT-5.2 Thinking would have, you know, picked up on some of those nuances early on. And picking up on that early on is pivotal. Uh, right. So we'll go through here. I'm not going to be able to go through all of them, but I will just point out the

Starting point is 00:38:16 instruction following is fantastic. Right. So in the obvious trends, it broke it down, right? The 20 obvious trends across the three different categories. All right. So there's our top 20 trends. You know, maybe I'll read like one or two. We'll go, I don't know, maybe lower.

Starting point is 00:38:35 Oh, here, this is good. Okay. The first six weeks of 2026 are the healthiest early year cohort in the sheet. That's great. Maybe it's because of the start here series. Number 17, a higher solo share is part of the improvement. Yeah, I noticed that. Our guest shows weren't doing as good, right?

Starting point is 00:38:54 In general, apparently people didn't like guest shows. So I've been doing fewer guest shows because, hey, Codex went through and broke it all down for me a couple months ago. And it's like, yep, you should not be doing as many guest shows. I said, okay, AI. Okay, then the 20 under the radar trends. Great. Let me just go ahead and maybe read one of these.

Starting point is 00:39:17 Okay. This one's interesting. It says Google is not just a spike topic for you. It is sticky. Right. So it says that the Google or Gemini wins in both median plays and long tail behavior, right? because it was able to properly understand the discovery metric. It says current context, Google is still shipping meaningful work-oriented AI updates into March 26.

Starting point is 00:39:42 It said build a recognizable weekly or bi-weekly Google franchise. So it's pretty good. I don't do as many Google shows as OpenAI. And recently, I've done more Claude and Anthropic shows versus Google. So it pointed that out. It said Google is shipping at a high rate and you're not covering a high enough percent of what Google is shipping, and it's very sticky for you in terms of audience retention. So, cool. All right. I'm not going to read all these, although they're super fun and important for me. I will

Starting point is 00:40:09 just go through and say it completed everything, right? In number three, uh, the comparisons, every single one. All right. I go down in the, uh, you're the boss, or no, March 2026 planning. Perfect, right? What's actually funny is as I was planning this, it said the first show I should do is GPT-5.4 at work: five tasks it actually does better now, which actually might be a better title than this show. I didn't see it until I was already making this show. But it properly, and I'm going to point it out here because I do want to look at

Starting point is 00:40:43 Claude, it properly did this, right? Because I haven't, number one, these are all relevant and useful shows according to what it earlier identified were high performing shows. But these are also shows, well, I haven't done. So number one, they're not repeat shows. So it followed directions. Number two, it taps on what worked.

Starting point is 00:41:08 And number three, well, they're highly relevant. All right. And then the, you're in charge, went down here. Hey, number one, make solo practical explainers your default weekday format. All right. I'm doing that. They're just more time consuming. And I will go and show at the very top here.

Starting point is 00:41:24 It did also properly complete the, um, the dashboard. So it's not the prettiest dashboard. It's actually pretty plain and ugly, but it's helpful. Uh, right? And there's some cool interactive graphs in here. Uh, you know, I can click through the, uh, the overview, the obvious trends, the under the radar, the comparisons, the March 2026 plan and how to double the audience. So in terms of output, instruction following, accuracy: GPT-5.4 Thinking. Thinking boat. This wouldn't have been possible before. All right. And just a quick gripe in comparison because I know people are going to be wondering, uh, right.

Starting point is 00:42:09 Claude's was not good. Granted, the dashboard it made, way better. Looks better. Right. One of the things GPT-5.4 stinks at: front end design. Not any good. Uh, Claude, amazing at front end design, but not good at, well, things that require, number one, factual accuracy. Number two, assuming things, making assumptions, not good, and just not completing the task. All right. So let me show you one or two, just quick examples. Let's see. Okay.

Starting point is 00:42:45 Here we go. This is just what I had up. In the March 2026 planning section, it's, okay, it's saying Anthropic versus OpenAI Pentagon Drama. Oh, guess what? I already did that show. Guess what? I asked for 10 examples. Guess how many it gave me?

Starting point is 00:43:06 Three, right? It did that repeatedly. When I would ask for 10 things, it gave me either five or it gave me three. Right? It didn't always give me 10. It's not good at instruction following. What is, y'all? What's the difference between an intern that's not very good

Starting point is 00:43:28 and someone in your company that's gone from junior analyst, junior researcher to senior? Their ability to follow instructions. And y'all, stop chirping at me. Go run these on your own. Like, go run your own multi-step, extremely hard, multi-tool use examples with real data that require research, that require multiple tool calls and have a multifaceted required output. You'll see for yourself. It's not a comparison.

Starting point is 00:44:03 Right. So for everyone saying like, oh, Jordan, you don't know, no, I know what I'm talking about here. And I want you to know what you're talking about too. So don't just take my word for it. Right. Go in,

Starting point is 00:44:13 try all these things out yourself. All right. So I'm not going to keep comparing. I think that was a pretty good under the hood look. Oh, and FYI, another thing why Claude really failed here, in talking about, you know, some of the advantages of GPT-5.4, well, it didn't even go to the website that I told it to, right?

Starting point is 00:44:32 I told it, go to youreverydayai.com. It didn't. It suggested things like you should post things on YouTube, right? And then in GPT-5.4, it found our YouTube channel, which I completely ignore. And it's like, hey, you're already doing things on YouTube, but you should be doing more shorts. Right. So, yeah, Claude just assumes things. Number one, it rushes.

Starting point is 00:44:56 Number two, it doesn't check. And yes, I was on Opus 4.6 extended, right? The best extended model you can do. And I wasn't even comparing this to GPT-5.4 Pro. It just, it falls flat. But I think it's not so much Opus 4.6 falling flat. It is that now, I think, to reiterate my point, I think for the first time, we have that trifecta, right?

Starting point is 00:45:22 We have a daily driver model that's natural enough to chat with. It is off the charts in terms of transparency and general intelligence. And it will follow instructions to a T. Because when you are looking for a daily driver large language model, those are non-negotiables. And I think maybe for the first time, we have them all in a single package with GPT-5.4.

Starting point is 00:45:53 All right. So I hope this was helpful, going a little bit hands-on, under the hood, maybe a little bit more than normal, maybe a little bit more technical. All right. But now you know, well, why I think, in my thousands of hours of experience, the five reasons why it'll be the best model you've ever used in GPT-5.4. At least today. Because who knows, maybe tomorrow this could all change. But hey, you know what that means? Right now you're at an advantage. So number one, go test it for yourself. Number two, refine and reiterate. And number three, get to work. Get ahead of your competitors. All right. And the other way you do that is you go to our

Starting point is 00:46:30 website, youreverydayai.com. So thanks for tuning in. We're going to be recapping today's show in our newsletter. If this was helpful, tell someone about it. Thanks for tuning in. Hope to see you back tomorrow and every day for more Everyday AI. Thanks y'all.

Starting point is 00:47:14 And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.
