Everyday AI Podcast – An AI and ChatGPT Podcast - Ep 728: GPT-5.4 Released: 7 Takeaways you need to know about OpenAI’s New model
Episode Date: March 6, 2026

OpenAI dropped GPT-5.4 Thinking and Pro. 🚨
Annnnd? The usual.
↳ Benchmarks are wild.
↳ Performance is elite.
↳ Everyone's impressed.
Cool. Now let's talk about what actually matters.
This wasn't a model update. It was a direction change.
OpenAI is going all-in on long-running tasks, computer use, deeper research, and serious AI workflows. That's not a chatbot play. That's a work system play. We break it down and dish 7 big takeaways on the new GPT-5.4 model.

GPT-5.4 Released: 7 Takeaways you need to know about OpenAI’s New model -- An Everyday AI Chat with Jordan Wilson

Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion on LinkedIn: Thoughts on this? Join the convo on LinkedIn and connect with other AI leaders.
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn

Topics Covered in This Episode:
GPT-5.4 Pro and Thinking Model Overview
OpenAI’s Computer Use and Tool Agent Upgrades
GPT-5.4’s 1,000,000 Token Context Window
Model Naming Confusion in OpenAI Lineup
OpenAI vs Anthropic AI Model Rivalry
Codex Desktop App and Non-Coding Uses
GPT-5.4’s Native Spreadsheet and Excel Integration
Benchmark Performance: GDP-Val and Real Work Outputs
Pro/Thinking Model Tiers and User Impact
State-of-the-Art Agentic Workflow Abilities

Timestamps:
00:00 "GPT-4.5: A Bold Shift"
03:54 "Boosted Accuracy for Knowledge Work"
08:00 OpenAI's Model Naming Issues
10:39 "GPT Updates: Confusion Explained"
15:52 "OpenAI's Codex Expands Horizons"
18:08 "Updates and Tools Overview"
22:33 Free vs Paid AI Plans
24:49 Enhanced AI Thinking & Steering
27:40 "OpenAI's Smarter Agent Tools"
31:51 "AI's True Test: Economic Value"
34:30 "AI Milestone: 82% Benchmark Achieved"
37:11 "Wednesday AI Hands-On Session"

Keywords: GPT-5.4, GPT-5.4 Pro, GPT-5.4 Thinking, OpenAI new model, AI benchmarks, model performance, AI workflows, long running work, tool use, deep research

Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)

Start Here ▶️
Not sure where to start when it comes to AI? Start with our Start Here Series. You can listen to the first drop -- Episode 691 -- or get free access to our Inner Circle community and all episodes: StartHereSeries.com Also, here's a link to the entire series on a Spotify playlist.
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Meet Firefly AI Assistant, now live in Adobe Firefly, the All In One Creative AI Studio.
Just describe what you want to create and the assistant handles the rest,
orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome.
The assistant accelerates execution.
OpenAI just dropped their latest model in GPT-5.4 Thinking and Pro,
and it's obviously state-of-the-art in performance and benchmarks.
Yeah, we knew that.
In technical talk, that means just a really, really good model.
But this wasn't just a model update from OpenAI.
It was a flex.
GPT-5.4 feels like OpenAI is making a
direct play at developers, researchers, and anyone building serious AI workflows.
And if that kind of sounds like a flex against Anthropic, you're right.
But let's leave behind the competitive side of this GPT-5.4 release, which comes on the heels of
impressive updates from both Anthropic and Google.
But between the benchmarks and features and tokens from the announcement, I think there's
something much bigger in terms of takeaways.
There's two stories here.
The public story is okay, impressive new model, but the real story I think is okay, bold new direction.
GPT-5.4 shows OpenAI is leaning harder into long-running work, better tools, deeper research,
and computer use.
That means the gap between chatbot and work system is officially dead with this release.
So today I want to unpack why this launch feels like
a step towards systems that work for you, not just smart models that you talk to.
So on today's show, here's what you're going to learn and what we're going to go over.
Quickly separate what's important and what's just marketing when it comes to these new models,
GPT-5.4 Thinking and GPT-5.4 Pro. We're going to break down the meaningful benchmarks in simple
language and how they'll impact your use. Then we're going to dish out what I think are seven
of the more important or not-talked-about takeaways that you need to know about this latest update.
And then I'm going to reveal at the end the one most important takeaway that no one is really
talking about.
All right.
Let's do this thing.
Welcome to Everyday AI.
My name's Jordan Wilson.
If you're new here, well, you can guess by the name we do this every day.
It's an unedited, unscripted, live streaming podcast helping everyday business leaders make sense of the
nonstop updates,
because yeah, they're nonstop.
I tell you what matters, what doesn't,
and you take that information
to be the smartest person in AI at your company.
So it starts here, but if you really want the goodies,
that's on our website, YourEverydayAI.com.
Make sure you go sign up for the free daily newsletter.
And hey, while you're there, also just go subscribe to our podcast.
You can always go find the links on there.
So let's talk about the basics here.
What's new in GPT-5.4?
If you're a little confused by the names and everything, well, let's start there with,
well, actually, let's not start there.
Let's give you the basics first.
And then I'll give you some of my takeaways.
So on paper, I mean, this is OpenAI's most capable and efficient model.
And right now it is available in ChatGPT, the API, and Codex platforms,
if you are a paid subscriber, all right?
If you're a free subscriber, you're not going to see GPT-5.4, at least not for now,
and maybe not for a while. We'll see.
GPT-5.4 Pro offers maximum performance for complex tasks and professional use cases,
and it integrates top coding abilities, tool use, and document handling improvements.
Let's talk a little bit about real-world accuracy and just work.
Well, one of the things that it's great at now is spreadsheets, presentations,
and creating documents with higher factual accuracy.
So cutting hallucinations down, according to OpenAI, by 33%,
and just improving greatly, both visually and in different benchmarks, on just doing real work, right?
Literally creating presentations, documents, and spreadsheets all with a model, right?
Not even having to go into a different mode.
A lot of people still don't know that about ChatGPT and its base models,
but yeah, it does all that out of the box.
And right now, it does outperform previous versions in knowledge work across 44 occupations and industries.
And it significantly reduces errors and hallucinations for more reliable outputs compared to its previous models.
But, I mean, some of the biggest advancements are on the computer and tool use side.
And we will go over these benchmarks here in a second.
So this is OpenAI's first model with native computer use
capabilities for desktop and browser tasks.
So yeah, if you are needing some of that, right,
whether you're a developer or wanting to use it on the front end, right?
Until recently, it was Anthropic.
Very, very recently, I think Google Gemini and their 3.1 Pro came along.
And now, you know, OpenAI is kind of leading with this, right?
I was actually surprised, at least from their marketing and messaging, that they leaned so
heavily into tool search, right? They talked multiple times about improved tool search,
you know, cutting down on the amount of tokens that it takes to even grab tools. So a much more
technical angle in this release, but really pumping up browser use, computer use,
and just being able to handle a larger tool ecosystem with lower token usage and faster
responses. And the big number, a million. That is a million token context
window, but not inside ChatGPT.
That would be great.
I don't think we're going to get that really from any provider anytime
soon, but it is available in the API and in some use cases in Codex as well.
That's one of the reasons I'm using Codex all the time. But OpenAI does
say that there's enhanced safety, cybersecurity, and reasoning monitoring for professional
deployment.
So like I said, you do have to be on a paid plan to use
this right now. And if you are using the older model, which is actually GPT-5.2
(yeah, confusing, I know), that is going to sunset in June.
Let's take a quick look at the benchmarks. And I will get into these a little
bit more here in a minute. But I mean, really good. Right. You don't have to be on the
live stream and see the benchmark
screenshot from OpenAI. So obviously, as with any new release, you have to keep in mind,
they're always going to cherry pick what they're showing, what they're not showing. You know,
sometimes there's different versions of benchmarks. So I mean, usually when you see this from a
company, you know, they're not going to put every single benchmark, even the ones that they,
you know, maybe aren't the leaders in. So in this one, for the most part, you know, across OSWorld-Verified,
WebArena-Verified,
GDPval, BrowseComp,
SWE-Bench Pro,
GPQA Diamond, FrontierMath, right?
All of those, you know, OpenAI,
I think all of them except one.
OpenAI is winning against their competitors.
In this instance,
they're showing it against Claude Opus 4.6
and Gemini 3.1 Pro,
which are the most up-to-date
and latest frontier offerings
from their main competitors in Anthropic and Google.
So some pretty big jumps across the board.
And there's a thing that we have to
realize that's, you know, good to talk about now.
Because at least on the chart, if you're reading this left to right, right, we have
GPT-5.4 Thinking and Pro, and then we have GPT-5.3 Codex, right?
But here's the thing.
Almost every single ChatGPT user, right?
All 900 million of us.
No one was using that because it wasn't available inside of ChatGPT.
Right.
So for the most part, people are going to be a little confused because they were maybe
using GPT-5.2 yesterday and then it's GPT-5.4 today.
All right.
But let's get into the seven takeaways because we're going to start there where I just ended
because takeaway number one, OpenAI has still not solved the model naming problem.
Right.
Back last summer, OpenAI CEO Sam Altman said, yeah, we're going to solve the naming problem.
And at first, it seemed like maybe they were, right?
Because at the time, you had models like GPT-4,
GPT-4.1, GPT-4.5.
You had o3, o3-high, o1, right?
It was super confusing because you had two different classes of models.
So Sam Altman said, all right, well, we're going to come out with GPT-5.
It's going to have this smart, you know, model router.
And, you know, that's it.
And you're just going to go in there and it's going to route you to the right model you need.
Obviously, that didn't work.
GPT-5 Auto or GPT-5.2 Auto is still
an option. So we'll see what happens with that between, you know, free users, paid users.
We'll get to that. But takeaway number one, it's about more than OpenAI being confused
with their model naming. It's bad. It's actually confusing consumers, probably hundreds of
millions of them, right? I do a lot of trainings, right? Corporate trainings, you know, virtual, whatever.
I don't think I've ever met anyone that really knows what model is what.
And that's probably a bad thing.
Right.
And I think this is something where Google has done probably a much better job in.
You know, they have 3.1.
You know, go use 3.1, right?
I think unfortunately, you know, OpenAI is probably the furthest behind in this.
Anthropic is not, you know, not that much better,
although they did get a little better by bringing all their latest models to 4.6 two weeks ago.
But before that, you had a couple different tiers as well.
So not only is it confusing having multiple models, but even the last week or so, and the last two months, has also been confusing.
Because like I said, we just jumped from 5.2 to 5.4.
So most people for the last couple of months, they've been using GPT-5.2.
But there's all this talk of GPT-5.3.
But everyone's like, where is it?
Right?
Everyone's like, I don't see this GPT-5.3 Codex model.
Well, that's because it was only in Codex.
And then to make that even more confusing, earlier this week, OpenAI teased the GPT-5.4 release.
So we obviously knew this was coming.
And then they released GPT-5.3 Instant the same day, which is really only for free users, right?
And that's all who should be using GPT-5.3 Instant.
So there was technically a GPT-5.3 Instant,
a GPT-5.3 in ChatGPT.
That was the quote unquote latest model, but like for less than 24 hours.
So takeaway number one,
this is confusing for consumers,
for business,
small businesses,
enterprise across the whole AI landscape.
Because like I said,
I talk to so many people.
And number one,
most people don't even know,
right?
The overwhelming majority of people use the default model.
And that's different depending on
what plan you're on.
And that's usually a very bad thing because you should always be using a thinking variety,
right?
I tell people,
humans are impatient, right?
Because sometimes people, oh, I don't want to use thinking or a high version of thinking.
You know, I just want the answer.
Okay, well, you're going to get an answer.
And then you're going to spend five times that trying to make it better, right?
Because a default, especially the GPT default models, not good.
They're not, right?
Yes, 5.3 Instant is a little
better, but compared to what you have, they're very bad, right? And that makes the whole lineup
seem even more messy and confusing. All right. Takeaway number two, OpenAI is going for
Anthropic's throat with this one. All right. That's the first thing as I'm looking,
both at the benchmarks and in the marketing and how they're angling this, you know, like showing
some of their use cases. Yeah. You know what? I don't think the OpenAI
team has taken the recent kind of pseudo rivalry and some of the shots from Anthropic too lightly.
Is it, you know, coincidental timing that we got this release like 24 hours after the report of
Anthropic's CEO, you know, in a leaked memo, kind of taking some shots at OpenAI?
Obviously the Super Bowl commercial taking a shot at OpenAI, which is something
Anthropic hasn't really been known for, right?
Anthropic up until the last like five weeks, they've kind of been the good guy, right?
The good guy that's concerned about safety.
And now it doesn't seem like that.
Now it seems like they're trying to bully their way into relevance.
Yes, I do love the anthropic models, especially over the last five or six months.
I think they've gotten much better.
You unfortunately have to pay $200 a month to get any utility out of them. That's beside the point.
But they've really, I think, changed in their persona, right? Maybe it's intentional. I don't know.
Maybe they need to be the loudest AI lab in the room, because outside of, you know, us,
quote unquote us, right? If you're listening to this podcast, you obviously know Claude, right? Everyone
knows Anthropic. But no one else does, right? They had these studies, like, speaking of the Super Bowl
commercial, I think it was like they had 7% recognition. No one knows outside of the
AI bubble that many of us choose to live in. No one knows Anthropic, right? Most people know
Google Gemini. Most people know Copilot. Just about everyone knows, you know, OpenAI, ChatGPT.
It's become synonymous with AI, right? No one knows Anthropic. So Anthropic trying to, you know,
bully their way maybe into a little more relevance probably wasn't the right move. And I do
think that this model of GPT-5.4 is going straight for their throat, because everything that
Anthropic has hung their hat on over the past 18 months is exactly what OpenAI updated in GPT-5.4,
right? Just around tool usage and efficiency in that tool usage. My gosh, like, everyone always goes
crazy, which I get it, the models are great. But I pay $200 a month for Claude, right? I also
pay $200 a month for ChatGPT. I was doing some side-by-side tests.
There's certain prompts, right?
Everyone's like, oh, you know, Claude, agentic, blah, blah, blah, right?
A single prompt, I kid you not.
And it crashes every single time because it runs over the context window.
The compaction inside Claude breaks it.
Right.
So for all this stuff about, oh, you know, Claude's, you know, the context window and tokens and the tool use.
Okay, good.
But long context.
OpenAI is in a league of their own, right?
Especially when it comes to long context with transparency in tool use and not breaking.
So not just that, but OpenAI with 5.4 improved token consumption and improved its ability to call those tools as well.
So just really making a harder play in long horizon tasks.
Adobe just introduced an entirely new way to create,
bringing the power and precision of its creative suite into one conversational experience.
Meet Firefly AI Assistant, now live in the Adobe Firefly app, the All-In-One Creative AI Studio.
Powered by Adobe's Creative Agent, Firefly AI Assistant lets you start with your vision,
just describe what you want, and shape the outcome as it takes form with the Assistant.
The Assistant orchestrates multi-step workflows, drawing on 60-plus pro-grade tools across Adobe Creative Cloud apps,
including Photoshop, Illustrator, Premiere, Lightroom Express, and more to help bring your ideas to life.
You can also get started with creative skills, a growing library of pre-built workflows for common creative tasks,
like batch editing photos, creating mood boards, portrait retouching, and creating social variations.
Every step the assistant takes is visible so you can refine, redirect, or take over at any time.
You stay in the driver's seat as the creative director.
Adobe Firefly AI assistant now in public beta.
See it today at firefly.adobe.com.
So let's go to takeaway number three.
Codex is becoming a requirement.
All right.
So not only is that one million token context window
something that you can take advantage of in Codex,
which is great.
I think that takes the tool from a nice desktop app, right?
They just released it for Windows this week, after releasing a dedicated Mac app
for Codex last month, right? So I think it's gone from, okay, this is a software development tool,
right, an IDE, to, you know, great for vibe coding. And now it's like, you know, able to refactor
entire codebases. But I think Codex is becoming a requirement for non-technical people,
right? Because at least right now, if you have a ChatGPT paid account, you have Codex, right?
And I think there's still a couple weeks where the limits are double, right? I've yet to hit limits.
And I know you're tired of me saying this. Yes, I'm always running Codex. Like, I'm running Codex right
now. Every single time I'm recording, Codex is running. 24-7. I've never hit limits. It's absolutely wild. Yes,
I'm on the $200 a month plan. There's double limits. But regardless, I'm doing so many
non-technical, non-coding tasks in Codex, right?
I was actually chatting with someone at OpenAI, and I'm like,
you guys need to push this, right?
Like, sometimes I just give advice.
Sometimes companies ask me things, and I'm like, you guys need to push this.
Because Claude is really pushing Cowork, right?
Anthropic has their Claude Code and Claude Cowork.
And I think Codex does both, right?
But a lot of people are looking at Codex like it is just, you know,
their version of Claude Code.
And it's so much more than that.
And I think the new updates in 5.4 really emphasize that just with the computer use agent
capabilities, right?
So right now, if you want to take advantage of that, you're not seeing that inside of
ChatGPT.
You know, I don't know if agent mode is going to, hopefully, eventually it'll be updated with
some of these, you know, the 5.4 model and some of those computer use capabilities.
But right now, if you want to use that, it's either on the API or inside Codex, right?
Playwright Interactive.
I'll probably do an entire show on it.
You know, it's new.
So they just came out with Playwright Interactive, but they've had the Playwright CLI command-line
tool, right?
But that's essentially a browser.
People don't understand that.
Right.
Like, yes, Codex has a browser that it can control.
It can access your machine.
So, I mean, at that point, it's much more than a coding tool.
It is a co-worker, right?
Too bad.
Claude got to that name first.
Great name, by the way.
But the new Playwright Interactive kind of plug-in for Codex is
great.
So it pushes Codex into a more, I think, serious testing and debugging and execution workflow.
All right.
Takeaway number four, ChatGPT is coming for data and analyst roles.
All right.
So analyst roles, data roles, not just data analysts, but that's one of them too.
Let's be honest.
The original ChatGPT couldn't add.
Right.
And even models a year ago couldn't edit a spreadsheet.
All right.
And you really had to know your way around prompt
engineering to get it to save a spreadsheet.
So not only do the models now, by default, they're agentic and they can do all those things.
People don't know.
You don't got to click a button.
You can just be like, yo, GPT-5.4, go do all this research for me, put it through my own
personal lens via my memory and what you know about me and go create spreadsheets and documents
and PowerPoints.
And it will literally do that and they work and they're there.
Right.
But they now also just released a dedicated Excel
integration, right? So another thing that Claude, right, when Claude announced this, or Anthropic
announced this, and, you know, they've been coming out with these plugins and these
skills. Yeah, Anthropic's been shipping great stuff. But it's been moving the markets, right? You've had some
legacy, you know, per-seat software providers, you know, some legacy financial institutions that
have seen their stocks crash, right? When Anthropic is releasing some of these plugins and,
you know, some of these, like their Excel integration.
Okay, well, ChatGPT just did that.
I'm not quite so sure about the timing of this.
I would have, I don't know.
If it was me, I would have saved that for maybe not the same day, because that's
actually a big freaking deal that, again, no one's really talking about.
So there is a dedicated app for the Excel integration with ChatGPT.
That's huge, right?
Because the world runs on Excel. But they are also coming out with one soon for
Google Sheets as well, and I'm looking forward to that one. Also, Gemini has gotten so much better,
right? Late 2025 and early 2026, like, Gemini in Sheets is actually great, but I'm still going to
use ChatGPT in Sheets as well. And I think what we're seeing here is it's shifting the GPT
models, especially with GPT-5.4. It's shifting from like, oh, is this an AI tool or a junior
researcher, right? Like that was kind of the, maybe with 5.1 and 5.2, maybe that's the conversation
people are having, right? Like, oh, at what point is this AI tool, a junior researcher? Now I think
we're way past that. I think it's like, okay, is this a junior researcher or a senior researcher
or a junior analyst or a senior analyst, right? Another reason why I've been saying the
consulting industry is going to get absolutely smashed because of this, right?
All right, takeaway five, thinking models are much more than chat, right?
And I think with 5.4, I've only had a couple of hours, you know, to play with the model.
I unfortunately didn't get early access like some people.
So this is my takeaway from just a couple of hours.
And it depends on what plan you're on.
And let me explain that.
If you're on the pro plan, there's four tiers of thinking.
If you're on the regular $20 a month plan, there's
two tiers. If you're on the free plan, you can click the little light bulb icon and I think you
get like one of those a week or something like that. But for the most part, if you're on the paid
plan, the base $20 a month plan, which I think is what most people are on, you know, there's two
thinking levels. And I think that using those in the older models, right, 5.1, 5.2, it just felt like
a smarter chat, right? 5.4, it feels much more than that.
Right now when you're using thinking, it feels like a system, not a smarter chat.
And there's a couple reasons for that.
But it seems like Open AI put a lot of emphasis on the thinking versions.
Right.
So using 5.1 and 5.2 Thinking versus 5.1 and 5.2 Pro, the gap was enormous.
Right.
I don't feel that gap is as big.
I mean, we'll see as more and more benchmarks come out.
But especially when you're on the Pro plan and you have four different tiers of thinking,
you know, using the heaviest thinking tier on the Pro plan.
So not, okay, this is confusing, right?
Not using GPT-5.4 Pro, but using the highest version of thinking on the Pro plan,
because it's an extra higher version, right?
It felt like the premier model, but it was a thinking model.
So I think that gap between the Thinking tier and the Pro
tier actually closed.
And I'm not saying that means that GPT-5.4
Pro didn't get much better.
It did.
But the thinking, again, I think the older versions
were just smart chats.
Now it feels like you are working
with an agentic system.
Couple updates on that.
Number one,
OpenAI did specifically
say that they made
some improvements. So GPT-5.4
Thinking also now includes deep web research.
So it seems like a specialized or a mini version of deep research.
That kind of technology, which I think uses like dual models,
that's available in the thinking mode.
So they've definitely put a little bit more technology just into the thinking side,
which is huge.
Another update to thinking is you can steer it, which is really nice.
Before, that was something you could only do in the Pro version.
So what that means, especially if you're giving it a lot of data and it's going to do a lot of
research and a lot of tool calling, which again, we need to get more comfortable with this,
non-technical people.
All tool calling means is, oh, it's going to use Python, right?
If you throw a lot of numbers at it, it's going to use Python and write some code or it's going to
use web search, right?
That's all that means.
You know, the harness and, you know, all that, you know, people use all the fancy words.
I just try to simplify it for you.
Right.
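For anyone reading along rather than listening, here is a minimal sketch of what "tool calling" boils down to, assuming a toy setup. The tool names, the `handle_tool_call` helper, and the shape of the `call` dictionary are all made up for illustration; this is not OpenAI's API, and the web search is a stand-in that never touches the network.

```python
from typing import Callable

def python_tool(numbers: list[float]) -> float:
    """Stand-in for the model writing and running Python on your numbers."""
    return sum(numbers) / len(numbers)

def web_search_tool(query: str) -> str:
    """Stand-in for a web search; returns a canned string, no network call."""
    return f"(pretend search results for: {query})"

# Hypothetical registry of tools the model is allowed to use.
TOOLS: dict[str, Callable] = {
    "python": python_tool,
    "web_search": web_search_tool,
}

def handle_tool_call(call: dict):
    """Look up the tool the model asked for and run it with its arguments."""
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])

# A made-up request of the kind a model might emit when you hand it numbers.
call = {"name": "python", "arguments": {"numbers": [10, 20, 60]}}
print(handle_tool_call(call))  # average of the three numbers
```

The point of the sketch is only that the model picks a tool by name and hands over arguments, and something else runs it and passes the result back.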
But it can take longer now.
The GPT-5.4 thinking modes, because they have this new deep search built in.
So it's actually nice that you can steer it.
So, you know, if normally it might take 10, 15 minutes and you're like,
ah, frick, you know, I'm reading the chain of thought and I'm reading
how it's thinking and I see it's going in the wrong direction. Normally with the thinking models,
you'd have to just wait or just say, all right, well, I'm just going to click cancel or, you know,
there's an answer now button. Now you can steer it, which is really nice. All right, takeaway number six.
All right. I was kind of having fun with this one, being a little cheeky, but I said 5.4 is SOTA
CUA, all right? State-of-the-art computer-using agent, right? 5.4 pushes computer use agents
into more complex cross-application workflows.
All right.
So like I was saying earlier,
OpenAI said this is their first model
with state-of-the-art computer-using agent abilities built in, right,
by default.
So I kind of see this and view this.
If you remember back to GPT-4,
and then you kind of went to GPT-4o, right?
So technically, GPT-4 used three different models to give you responses.
And GPT-4o, where the o stood for Omni, meant it was all one model.
So that's kind of what I'm seeing and feeling now with how OpenAI is starting to integrate computer use into their model.
So, yeah, unfortunately, we may not actually get to realize that in the old chatgpt.com, maybe until they update agent mode.
Please, OpenAI, update agent mode.
So much, right?
Like just so much potential there.
It just needs some love, right?
But by default, if we're talking about in Codex, if we're talking about on the API side,
that's important.
So you may not touch this directly, right?
You may not touch their new built-in computer use agent.
Like I said, they have a demo of it.
You can go play with it on GitHub, you know, download the repo.
You can do it that way.
You can use it in Codex.
But ultimately, this is for builders, but this is going to be so much of the technology
that we all use, right?
That's what I'm excited for: all of a sudden this just made, number one, agents much
smarter, but it also gave us non-technical humans, I think, the potential capabilities to direct
agents or direct smarter agentic browsers at scale, because it is going to now be faster,
better, and more token efficient. So huge win. Now, OpenAI, again, going after
Anthropic's lunch money, right?
OpenAI said, oh, okay, Anthropic, you want to come after our, you know, our decision
to run ads with a, you know, not super truthful Super Bowl ad?
Ironic, right?
Okay, we're going to come for your lunch money.
So, yeah, pretty, pretty impressive.
All right.
And speaking of state-of-the-art computer use, some of
those benchmarks, yeah, just absolutely crushing it.
And the noteworthy thing here, I think, is some of the jumps from 5.2 to 5.4.
Because like I said, hardly anyone used 5.3 because it was in Codex.
Right.
So now these are state-of-the-art.
These are topping the charts.
So in GDPval, which I'll talk about here in a second, SWE-Bench Pro, right?
That's how good the model is at fixing real software bugs, state-of-the-art.
OSWorld-Verified,
that's how well the model can use a computer like a person would,
state-of-the-art. Toolathlon,
that's how well the model uses tools correctly, right?
State-of-the-art. BrowseComp,
I think that might be the one.
They're like 0.1 points behind someone else,
but essentially state-of-the-art.
That's how well the model finds and uses information on the web.
So essentially, anything related to computer use, tool calling,
I mean, OpenAI, just
crazy, right? And this is not, at least in my opinion, when you're looking at kind of the three-way race over the last, you know, year and a half,
this is not something that OpenAI has been known for.
So again, I think this is the bold new direction.
And then last but not least, and this is both takeaway seven and the one thing I teased for the end, the thing that I think most people are overlooking.
I know you're going to get tired. If you're an avid listener, I'm sorry. I'm going to talk about GDPval
one more time. And here's the reason why. I hate benchmarks. Everyone cares about them.
Everyone talks about them. Everyone in the lab. Guess what? In the real world here in Chicago,
Illinois, right, where I'm from, right? I think this is just like a Silicon Valley thing. Maybe.
I don't know. But that drives the narrative. Everyone's so concerned about these 50 benchmarks. So I got to
talk about them. Because if I don't, you know, I'm going to get 50 emails about them. But I think my
seventh takeaway, my last one here, is most benchmarks don't matter anymore.
But I think one matters even more.
And there's a couple of reasons for that, right?
And I think GDPval is the one that matters more.
Number one, it's harder to game.
But I think that most frontier labs are just, you know, benchmaxing, right?
All they're doing is they're playing the game, you know, tweaking
things just to get really good scores on certain benchmarks.
And guess what?
Benchmarks don't pay bills.
Benchmarks don't help us do work better.
In theory, you know, A leads to B, B leads to C.
But if the end goal is C, that's GDPval.
That's the work getting done.
Not just, oh, here's this random test, you know, ARC-AGI, right?
Humanity's Last Exam, right? All these things that are, you know, these tests that are set up that are more just like, oh, here's a bunch of random things that are very hard that just aren't in training data, right?
That's not what we should be looking at models for and how good they are.
We need to be looking at how good they are at creating economically valuable work on their own, which is what GDPval is, right?
It measures how good a model is at creating deliverables in the same way a human would.
I've talked about this a little bit, but let me just tell you how absolutely bonkers this is, right?
And I'm actually going to go to my next slide first.
All right.
GPT-4o.
All right.
So remember, GPT-4o, I'm trying to do the math here.
Yeah, 10 months ago, nine or 10 months ago, this was the best model, all right, the best general purpose model.
No one used the o models.
I did.
I love the o models, right?
o1, o3.
No one used them.
Everyone used GPT-4o.
If you're looking at this GDPval, stick with me here.
Right?
And this is just a model would get, there's a benchmark.
And it's, you know, you have to go in and do real work like a human would.
The whole thing, front to back.
All right.
One, you know, one task.
But it's creating spreadsheets, it's doing, you know, multi-step research and creating something on the back end.
And then it's judged by experts in the field, right? So this is across 44 different, you know, real-world jobs. And then it's judged by a panel of experts.
So essentially, the 50% mark, that is parity with an industry expert, right?
And then, you know, there's groups of experts that judge both the expert human that submits the deliverable and the AI model.
So GPT-4o, which was the best model 10 months ago,
got a 12% on that, right?
Not very good.
All right.
And even GPT-5 high,
still not very good, 38%.
Okay.
So why is this the benchmark that matters and why is GPT-5.4 such a huge step?
Number one, I was not expecting this, right?
Because GPT-5.2 had a 70%, all right?
So win-tie rate.
So 70% of the time, it either won or it tied the expert human, blindly judged by expert judges.
In my 2026 AI prediction and roadmap series, I thought I was being kind of bold by saying, yeah, I think we'll get to 80%.
Guess what? We already got there, because GPT-5.4 Pro got 82%.
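To make the win-or-tie number concrete, here is a minimal Python sketch of the arithmetic. It assumes each blind comparison is recorded as "win", "tie", or "loss" from the model's side. The function name, the labels, and the 70/12/18 split are invented for illustration; only the 82% total and the other scores are the ones quoted in the episode, and this is not presented as how the benchmark's authors actually compute it.

```python
from collections import Counter


def win_or_tie_rate(judgments):
    """Share of blind comparisons where the model's deliverable won or tied."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one judgment")
    return (counts["win"] + counts["tie"]) / total


# Hypothetical judgments: the win/tie split is made up, chosen only so the
# total lands on the 82% quoted for GPT-5.4 Pro.
sample = ["win"] * 70 + ["tie"] * 12 + ["loss"] * 18
rate = win_or_tie_rate(sample)
human_outright_win = 1 - rate  # share where the expert's work was preferred

print(f"model wins or ties: {rate:.0%}")
print(f"expert wins outright: {human_outright_win:.0%}")

# Scores as quoted in the episode (percent win-or-tie), compared with the
# 50% mark that the episode describes as parity with an industry expert.
quoted = {"GPT-4o": 12, "GPT-5 high": 38, "GPT-5.2": 70, "GPT-5.4 Pro": 82}
for model, score in quoted.items():
    print(f"{model}: {score}% ({score - 50:+d} points vs. parity)")
```

Read this way, the 18% figure is simply the complement of the win-or-tie rate: the share of comparisons where the judges preferred the human expert's deliverable outright.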
The more I like, I've read this, this benchmark, you know, the study that came along with it multiple times.
And the more and more I read about it and think about it and revisit it when new models come out.
It just to me just talks about the sheer gap and it's a knowledge gap.
It's an educational gap.
And I'm going to end with this, right,
which I know is a weird way to end a recap show about GPT-5.4,
but I do think this ties into what is OpenAI's bold new direction.
It is getting work done, right?
And that's what these models and that's what GPT-4, sorry, GDPval,
the benchmark shows.
82% of the time, this new model wins or ties against a human.
That's wild.
And if that doesn't change, right?
So think, think if you've been in an industry 10 or 15 years, right, you're an expert and you sit down and you have a project, right?
You have to do some research.
You have to use your smart brain.
And then you have to create something of value, right, a document, a spreadsheet, a PowerPoint presentation.
And then a group of people are going to judge it.
You only have an 18% chance to beat GPT-5.4 Pro.
So I'm not wild, right?
Everyone's like, oh, these AI models are so dumb.
That's what I'm saying.
Where we've come in the past year is not normal, right?
Going from a year ago, models couldn't edit spreadsheets to now they're better than almost all experts.
And I think the GPT-5.4 model maybe, just maybe, might be the model, at least from OpenAI, that starts that conversation.
And it moves away from "ChatGPT is a chatbot" to "oh, ChatGPT is the place where work gets done."
All right.
I hope this show was helpful.
If it is, let me know.
Should we do a more hands-on version of this on Wednesdays?
Right.
On Wednesdays, we do our AI at work on Wednesdays.
So let me know if you actually want to see some, you know, go under the hood with GPT-5.4, test some of these things out.
I've been having fun in the little time I've had testing so far.
So I hope this was helpful.
Make sure if you haven't already, go listen to our 2026 AI prediction and roadmap series.
That's episodes 712 and 713.
I get a lot of people always, right, asking questions, emails a day, you know, asking me questions that would take me a long time to answer.
And I feel like a jerk sometimes, but I'm like, go listen to this episode, right?
I cover so much in there.
If you listen to that and then go read the newsletters that come along with it,
I guarantee you you're going to be the smartest person in AI in your company, right,
for the most part, unless you're working at Google or Open AI.
Right.
So, and then when you're done doing that, make sure you go to YourEverydayAI.com.
Sign up for the free daily newsletter.
Thanks for tuning in.
Hope to see you back later for more Everyday AI.
Thanks, y'all.
And that's a wrap for today's edition of Everyday AI. Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
