Everyday AI Podcast – An AI and ChatGPT Podcast - EP 509: OpenAI o3 and o4 Unlocked - Inside the newest, most powerful AI models
Episode Date: April 22, 2025

OpenAI's newest models are already topping the charts. It's almost like another week, another chart-topping AI release from a big player. So what's different about OpenAI's new o3... (and o4 mini)? And is it really better than Google's super impressive Gemini 2.5 Pro?

Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion: Thoughts on this? Join the convo.
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn

Topics Covered in This Episode:
OpenAI o3's Advanced Tool Use
o3 Versus o4 Model Comparison
Agentic AI: o3's Tool Chaining
o3 Full Model's Global Benchmarks
Limitations of o3 in Practical Use
OpenAI's Image and Python Capabilities
o3 and o4 Mini Context Limits
AI Model Intermodel Communication

Timestamps:
00:00 OpenAI's new models
02:30 Daily AI news
06:50 Most powerful models?
09:05 New OpenAI O-Series Released
11:54 O Three: Powerful AI Model
16:30 ChatGPT Context and Safety Enhancements
18:09 ChatGPT's New Features Overview
22:21 "Gemini 2.5 vs. O Series"
24:41 "Hybrid Language Models Prevail"
29:03 Top AI Model: O3 and O4
32:05 "OpenAI Browsing Benchmark Update"
35:33 GPT Usage Strategy: O3 vs. O4 Mini
38:06 Comparing Models in Autonomous Decisions
41:33 "Analyzing AI Tools: A Guide"
44:44 OpenAI's GPT-5 Release Delayed

Keywords: OpenAI's o3, OpenAI's o4, Most powerful AI model, OpenAI vs Google, AI news updates, Huawei 910c chip, US restrictions on NVIDIA, Competitive AI chips, OpenAI memory with search, AI in education, AI executive order, K-12 AI training, ChatGPT, Autonomous AI, Agentic AI models, o series models, AI model naming issues, AI tool use, Visual input reasoning, AI context window, 200k token context, AI reasoning capabilities, AI benchmarks, Large Language Models, AI model speed and efficiency, Agentic tool chaining, AI coding capabilities, AI's impact on education, AI-driven insights, AI model comparison, Chain of Thought reasoning.

Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)

Start Here ▶️ Not sure where to start when it comes to AI? Start with our Start Here Series. You can listen to the first drop -- Episode 691 -- or get free access to our Inner Circle community and all episodes: StartHereSeries.com Also, here's a link to the entire series on a Spotify playlist.
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
There's a new most powerful AI model in the world.
Yeah.
Sometimes I feel like DJ Khaled.
Because each week it's like another one.
Another one.
Another most powerful AI model in the world.
Y'all, the last couple of weeks, couple of months,
it has been a back and forth, I think specifically
between OpenAI and Google, for the ever-changing title of most powerful AI model in the world.
And I think now with OpenAI's new o3, specifically, it is the most powerful AI model in the world.
Is it the most flexible?
Will it be the most used model?
I don't know.
But we're going to be going over that and a lot more today on Everyday AI
as we talk about OpenAI's new o3 and o4 mini models, unlocked: inside the world's newest,
most powerful AI models.
All right.
What's going on, y'all?
My name's Jordan Wilson, and I'm the host of Everyday AI, and this thing is for you.
It is your daily livestream podcast and free daily newsletter, helping us all not just
keep up with AI, but how we can use it to get ahead to grow our companies and our careers.
If that's what you're trying to do, you are in the right place.
So you need to go to youreverydayai.com.
And there on our website, you can not just sign up for our free daily newsletter,
where we will be recapping the most important aspects of this show and sharing a lot more.
But we are going to share with you everything else that's happening in the business world,
in AI world.
So you can be the smartest person in AI at your company or in your department.
All right.
So make sure, if you haven't already, to go to youreverydayai.com to do that.
So I am very excited today to talk about the new o3 and o4 models from OpenAI.
But before we do, let's start as we do most days by going over the AI news.
And hey, livestream crew, this is technically a two-part show, so I need your help.
Let me know as I go over the AI news: what o3 use cases should we cover in tomorrow's show, in part two?
All right.
Here's what's happening in the world of AI news.
A couple big things.
So Chinese tech giant Huawei is preparing to begin mass shipments of its new 910C AI chip in May,
aiming to fill the gap left by U.S. restrictions on NVIDIA's H20 chips, according to Reuters.
So the new chip from Huawei, the 910C, achieves performance comparable to NVIDIA's H100
by combining two existing 910B processors representing a key shift for Chinese AI developers
who need domestic alternatives.
So Washington's latest AI export controls have pushed Chinese AI companies to seek more
homegrown solutions making Huawei's 910C likely to become the main AI chip for China's
tech sector.
So yeah, looks like Nvidia could potentially
have a strong new competitor in Huawei.
All right.
Next.
A small thing, but I think that could have a big impact.
So OpenAI has quietly introduced memory with search, much different than the memory
feature they rolled out about two weeks ago.
So this allows ChatGPT to use personal details from prior chats specifically to tailor web search
queries.
All right.
So yes, OpenAI rolled out their
expanded memory feature a couple of weeks ago that allows ChatGPT to use personal details,
but that did not apply to web queries. So this new update means ChatGPT can now rewrite
user prompts to reflect individual preferences while browsing the web, such as, you know,
whatever you share with it, dietary restrictions, location, etc., to bring you more accurate
search results. So this move follows recent upgrades that let ChatGPT reference users' entire
chat history, further distinguishing it from competitors that don't have this feature enabled.
Users can turn off this feature in settings, but the rollout appears to be very limited so far,
with only a few accounts reporting early access. So yeah, make sure to keep an eye out for that.
All right. One last thing to keep an eye out on is bringing AI into the classroom.
So the Trump administration is weighing an executive order that would require federal agencies to promote artificial intelligence training in K through 12 education.
And this is according to a draft obtained by the Washington Post.
This is technically super breaking news only a couple minutes old.
So the draft policy directs agencies to train students in using AI and integrate the technology into teaching tasks,
signaling a potential national shift in how schools
approach technology education. So agencies would also partner with private companies to develop and
implement AI-related programs for students, aiming to better prepare them for careers shaped by AI.
So the proposal is in draft form right now, is still under review, and could change or be abandoned.
However, if it is enacted, it could significantly shape how the next generation learns and works
with artificial intelligence. I would love to see this
happen. Personally, a little tidbit, y'all, I haven't shared this much, but I just saw,
you know, Jackie here in our comments holding it down. I'm teaching a course at DePaul here in Chicago,
and I'm flipping the script on its head. I'm saying you have to use AI at every single juncture. Like,
don't go old school, right? In all of these aspects, you should be using AI, in every single
aspect. So it should be pretty interesting to see how this new executive order unfolds and
if it actually is introduced. All right. A lot more on those stories and a ton more on our website,
youreverydayai.com. All right, let's get into it. Let's talk about the newest and I think
the most powerful AI models in the world, from OpenAI. But again, I don't necessarily think
that just because it's the most powerful, it's necessarily the best or the most flexible, right?
Those are three very different things.
I do think by far the new OpenAI o3, which is the full version.
And then we have the o4 mini and o4 mini high.
Yeah, the naming is terrible.
OpenAI has said that they're going to address this naming problem because it's
extremely problematic, right?
But the new o3 and o4 models are extremely impressive, specifically the o3.
All right.
And if you're confused, like, oh, Jordan, why is the o3 better than the o4?
Well, that's because the o4 is a mini.
So we have o4 mini and o4 mini high.
But now we have the o3 full model, right?
Whereas previously we had o3 mini and o3 mini high.
Confusing,
but this is the first kind of full o model that we've had since o1.
Yes, I know it's confusing.
They skipped o2 because of some naming rights with, I believe, a British telecom.
Very confusing with the model names.
But here is what is not confusing.
This new model is extremely impressive.
All right.
So, livestream audience, good morning, good morning.
Like what Will said
on LinkedIn. Love, love to see it. Everyone, let me know what questions you have about these new
o3 and o4 models. You know, I'll either tackle them today later on our livestream here,
or I will, you know, make sure that we do this tomorrow in part two. So it's good to see everyone on
LinkedIn and on YouTube. Thanks, thanks for tuning in. Everyone, love to see us learning together
live. All right. Let's get into it, shall we? So here's the overview on the new o3 and o4 models.
So these were just released about a week ago. And these are kind of the newest successors in OpenAI's o-series.
So yeah, I just laid out a bunch of o's, which by the way, has anyone had O's, the cereal?
I was talking about this with my wife. They are so
underrated, like maybe my top favorite cereal. That's beside the point. But so many different
o's, right? You have o1 still, right? So they got rid of o3 mini high. But, you know, if you're on a
Pro plan right now as an example, I believe you have o1, you have o1 pro. You have o3 full and then you
have o4 mini, o4 mini high. It's five different o-series models across three different classes.
Extremely confusing. Right. And obviously, you know, OpenAI
is in the future moving away from this and treating GPT-5 as a system.
But essentially, if you're wondering, what's all these O models?
These are the thinking models.
These are the models that can reason and plan ahead, step by step under the hood before they
give you a response.
Whereas the GPT models, so as an example, GPT-4o or GPT-4.5, they are more instantaneous, right?
They're not necessarily thinking like a human would, step by step,
using this chain-of-thought reasoning under the hood before they give you a response.
So I like to say there's two very different classes of models from OpenAI.
You have your quote unquote old school transformers.
And then you have your quote unquote new school o-series models, which are your thinkers and your reasoners.
All right.
So this was just released less than a week ago.
And here's the biggest part.
It is capable of using all of OpenAI's tools, which is the biggest differentiator between o3 and the
o1 and o3 mini models, which could not use every single tool.
Because when we talk about agentic AI, and yeah, that's what I think o3 is.
It is an agentic model at its core.
And we're going to see that, I think, tomorrow when we go through some of these use cases
live.
But the biggest difference, or one of the biggest differentiators here, is o3 can use all tools:
web search, Python, file uploads, computer vision with the visual input
reasoning, and also image generation.
It can literally do everything, whereas the previous o-series models were a little limited, right?
And some of them were different.
You know, even now you can use Canvas, which is more of this interactive mode that can run
and render code, inside the new o3 model.
Whereas before, it's like, okay, the o1 model is the only one that could use Canvas.
But o1 wasn't very good at many things because o1 pro and o3 mini were better, or sorry,
o3 mini high.
And then o3 mini high could use the internet, but you couldn't upload files and it couldn't use Canvas, right?
And then you had o1 pro that you could upload files to, but you couldn't use Canvas and it couldn't browse the web, right?
So it was kind of hard with all these different o models and, you know, they all kind of had their own unique features.
But now, o3, I do think this is an agentic model, right?
And I know that sounds crazy to say, but it is extremely powerful and it can use every single tool
under its kind of tool belt. And it's trained to autonomously decide when and how to use these tools.
That is what I think makes it probably the most powerful AI model in the world. And it responds with
rich answers typically in under a minute. And right now, if you have a paid plan to
ChatGPT, you have access to it.
So whether that's ChatGPT Plus, Pro, Team, et cetera, you have access.
It's also available in the API.
There are limits though.
All right.
So if you are on either a ChatGPT Plus account, that's your standard paid account at
$20 a month, or if you're on a Team account or Enterprise account, it's pretty limited.
So you only have 50 messages a week with the best one, which again is o3,
not o4, right? So, o4 mini is not the best one. o3 is, right? I'm just going to say o3 full.
It's what a lot of people, including myself, are calling it since we previously had the o3 mini.
And then we're having to deal with the o4 mini and people are confused. So o3 full is the best model.
But right now, if you're on a paid plan, you only have about seven messages a day or about 50 messages a week.
So not a ton. With o4 mini, you have 150 messages a day. In o4 mini high, you
have 50 messages a day. So if you are a power user on a paid plan, you might want to start with
o4 mini high, right? You have 50 messages a day, and then maybe save those seven messages a day for the
time that you really need a little more juice, a little bit more compute, more smarts;
then you can hand those over to o3 full. If you are on the Pro plan, which is $200 a month,
you have quote unquote, near unlimited access.
So, you know, Open AI says, yeah, there's, you know, some, you know, fair use things that you have to adhere to.
But for the most part, it is unlimited.
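The rationing strategy described a moment ago can be written down as a tiny routing rule. This is only a sketch: the limits are the figures quoted in this episode for the $20 plan, and `pick_model` and its arguments are hypothetical names used for illustration, not anything that exists in ChatGPT.

```python
# Limits as quoted in the episode for the $20/month plan (they may change).
O3_PER_WEEK = 50
O4_MINI_HIGH_PER_DAY = 50
O4_MINI_PER_DAY = 150

# 50 messages a week works out to roughly 7 a day.
O3_PER_DAY = O3_PER_WEEK // 7


def pick_model(needs_max_power: bool, o3_used_today: int, o4_high_used_today: int) -> str:
    """Hypothetical helper: choose which model to spend the next message on."""
    # Save the scarce o3 messages for the prompts that really need them.
    if needs_max_power and o3_used_today < O3_PER_DAY:
        return "o3"
    # Otherwise default to o4 mini high while its daily budget lasts.
    if o4_high_used_today < O4_MINI_HIGH_PER_DAY:
        return "o4 mini high"
    # Fall back to the model with the largest daily allowance.
    return "o4 mini"
```

For example, `pick_model(True, 0, 0)` spends an o3 message, while the same call after seven o3 messages that day falls back to o4 mini high.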
So I have free plans, I have $20-a-month plans.
I have multiple team plans.
I have multiple enterprise accounts for companies that hire us to train their employees.
So yeah, if you're trying to do that, you can reach out to us.
We can train your team.
Right.
So it is kind of weird, I'd say, that the Team accounts and the Enterprise
accounts have the same limits as the Plus account. You would think or hope it would have
2x, 3x, especially the Enterprise. Y'all, OpenAI, you got to get it together. I'm hearing
a lot of grumblings from companies that have invested heavily into Enterprise accounts.
And they can't, you know, they can't get kind of the same power that you can get with an
individual account. I know it comes down to pricing, right, paying, I think, anywhere between
$30 and $50 for an Enterprise seat versus $200 for a Pro seat. But so many of
these companies are investing in hundreds or thousands of seats for their enterprise teams.
OpenAI, you got to give them more juice.
Just saying, all right?
So what the heck is new?
Let's go over it.
So advanced tool use.
So like I talked about, it has autonomous access to browsing, coding, and visual tools.
The image understanding, it is improved.
The visual capabilities are much improved.
And o3 does a great job at interpreting complex visual inputs, like, as an example, research papers.
It has a much larger context window in the ChatGPT interface.
Finally, right?
So finally, within the ChatGPT interface, we have a 200K token context window.
Okay.
So it can handle longer multi-step tasks seamlessly.
And you can share a ton of information without it forgetting things.
Whereas previously, you know, unless you were on an Enterprise plan,
we still, for the most part, had a 32,000 token context window on the chat side of
ChatGPT, right?
It was different on the API side.
But for a lot of users inside of ChatGPT, especially if they were copying and pasting a lot
of information, ChatGPT was forgetting things, right?
Because that 32,000 context window, it's about 27, 28,000 words of input and output, which isn't a ton.
So it's a welcome sight to see a 200K token context in the new o models.
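For a rough sense of scale, the word counts mentioned here can be reproduced with a one-line conversion. This is a back-of-the-envelope sketch: the 0.86 words-per-token ratio is an assumption inferred from the "32,000 tokens is about 27, 28,000 words" figure quoted in the episode, real ratios vary by tokenizer and text (a commonly cited rule of thumb for English is closer to 0.75), and `approx_words` is a hypothetical helper.

```python
# Assumption: ratio implied by the episode's "32,000 tokens ~ 27-28,000 words".
WORDS_PER_TOKEN = 0.86


def approx_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate how many words fit in a context window of the given size."""
    return round(tokens * words_per_token)


old_window = approx_words(32_000)    # previous ChatGPT chat window
new_window = approx_words(200_000)   # window quoted for the new o models
print(old_window, new_window, 200_000 / 32_000)
```

Whatever ratio is used, the jump itself is the same: 200K tokens is 6.25 times the old 32K window.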
Improved reasoning.
Another thing that's new.
And the ability to chain multiple tool calls together for layered analysis.
And I think that is probably the standout feature.
And there's some new safety features as well, right?
OpenAI doesn't want to accidentally start a biochemical war, which you might be like kind of chuckling and rolling your eyes,
but no, seriously.
So good on OpenAI for addressing these things, you know,
when they release new models and they give it essentially levels or warning levels.
So they address that on their website as well.
And there's new features that can reduce the risk and enhance trust.
All right.
You might be a little confused and you're like, wait, this is the new feature? I thought it was a different
feature. Yeah. Let me quickly get you up to speed. If you've been sleeping under an AI
rock for three weeks, here's what else is new at OpenAI and ChatGPT, because you might be
confused. And I want to really tell you, no, no, no, no. This is separate, right? So yeah,
we've been hearing a lot of buzz the last couple of weeks about this new GPT-4o image gen. Okay. That is
different. This is, you know, o3, a different beast altogether, but it can use the image gen.
Then in April, we had the memory rollout across all chats. So essentially, if you have this
enabled, ChatGPT can pull in information from past chats,
which is different than memories, which were essentially individual nuggets that were stored
in kind of a memory bank. But now ChatGPT does this via kind of a search, pulling in semantic,
you know, keyword matching, and then delivers you kind of personalized results.
I personally hate this, right?
Because it's always trying to personalize things based on my past chats.
But that's new.
All right.
And then we also had the Google Drive connector rollout for ChatGPT Team accounts about
three weeks ago.
And then also last week, we got, was that last week?
Yeah.
My weeks are starting to blur together, y'all.
So yeah, it was last Monday that OpenAI released another set of new models.
So don't get confused.
These other models were GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.
However, those are not available inside ChatGPT.
Those are only available on the developer side.
All right.
So I think the highlights of those: a context window of up to a million
tokens, huge. Actually, the GPT-4.1 mini was stealing a lot of the headlines, rightfully so,
because it was really outpunching its mini moniker. But, you know, the 4.1 models, I think,
were much better in coding and just, you know, a pretty big improvement, both on cost and performance,
when it came to the model that it was following in GPT-4o. All right. So these new o-series models
are not that, right?
But I do think it was worth pointing out, yeah,
there's been a lot of new things happening inside
ChatGPT that are not these o-series models.
So I figured I'd take two minutes here to get you caught up.
Yeah, like what Jackie's saying,
need a cheat sheet.
Yeah, maybe I should create one.
Kevin is saying, Kevin from YouTube is saying,
it's annoying in the paid education version,
I still can't access it.
So I'm guessing,
Kevin, you're talking about o3.
Yeah, it should be rolling out.
You know, I know this sounds weird.
It's kind of like, oh, you know, restart your computer.
You know, take out the SNES cartridge and blow on it, right?
So many times it is like a cookie issue or a caching issue.
So if you, you know, log out
of your ChatGPT account, maybe clear your cache and log back in,
it might be there.
That's actually the way I always do it.
Whenever there's new models announced, I do that like two or three times a day to try
and get access a little earlier, even though OpenAI does kind of control those rollouts.
All right.
Let me answer the question.
Is this the best model in the world?
So yes and no.
I think it is the most powerful AI model in the world.
I think best depends on your use case.
Is it the most flexible right now?
No.
So let me say that again.
Yes, I 100% believe it is the most powerful
AI model in the world.
It is not the most flexible.
And whether it's the best
depends on your use case.
So obviously,
right now, it's kind of jabbing back and forth
with Gemini 2.5 Pro from Google.
And we'll see as, you know,
more user feedback starts to roll out.
But when it comes to just pure upside,
just the ceiling,
strictly power.
I think o3 is unmatched right now.
Does that mean that I, personally,
am only going to be using o3?
Absolutely not.
Right.
I'm still going to be using Gemini 2.5 Pro all the time.
The big difference is y'all, and we're going to talk about this a little bit with benchmarks.
Gemini 2.5 Pro is a hybrid model, which makes it much more flexible because in certain instances,
especially if you're having iterative conversations,
back and forth conversations with a model, which is what you should be doing.
Sometimes if you're using these o-series models, you can ask a very simple query or a very
simple follow-up query and it might think for like minutes, right?
So in terms of flexibility and usability, it might not always be the best for some of those
conversations that are a little more nuanced and don't just require, you know, big AI brains.
But if you need big AI brains in an
agentic type of large language model interface,
o3 is it.
And it is so, so impressive, right?
But let's look at some of the benchmarks.
And here's here's one thing that I kind of wanted to call out, right?
So on this show, we talk a lot about LMArena, right?
And this thing called an Elo score.
And what that means is you put in a prompt,
okay, and then you get two blind outputs and you decide which one is better, output A or output
B. All right. And essentially, over time, when there's enough votes, a new model that gets released
gets an Elo score. You know, it comes from Elo scores in chess, and it's like, hey, head to head,
this is what humans prefer the most. So right now the top on that list is Gemini 2.5 Pro.
And here's why I'm bringing this up as a caveat: right now, o3 full
does not yet have enough votes to be on the Chatbot Arena leaderboard.
That could change in a couple of hours or in a couple of days.
It could be up there pretty soon.
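The head-to-head voting idea maps onto the classic chess-style Elo update, sketched below. This is the textbook formula only; the leaderboard's actual statistical method may differ, and the starting rating of 1500 and the K-factor of 32 are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one blind A-versus-B vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    # The winner gains exactly what the loser gives up.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two models start level; one vote for model A moves them apart.
print(update(1500, 1500, a_won=True))
```

Because each vote only nudges the ratings, a new model needs many votes before its score settles, which is why a fresh release can be missing from the leaderboard at first.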
However, I do not expect the o3 full model to do very well when it comes to head-to-head
human comparisons.
And here's the reason why.
When you look at o3 mini high, right, which was my workhorse model, right?
Before Gemini 2.5 Pro came out, I'd say o3
mini high, that was getting about 60% of my usage.
Humans head to head for the most part don't prefer it.
Right.
And one of the reasons why, I think: you have these traditional large language models that focus on kind of quick, snappy responses.
You have these thinking models which just take longer and really only showcase their
abilities when it comes to when you're asking it for a very tough question, right?
And then you have your hybrid models.
So I think ultimately, the hybrid models are going to be the ones that on a head-to-head
ELO score, those are going to be the ones that do best.
I don't think these thinking models, strictly thinking models, are ever going to do that
great in human comparison.
The way I think about it is like, okay, think of someone you know that's, you know, super personable
and has a ton of business savvy and is super smart, right?
That's like Gemini 2.5 Pro.
Then you think of someone like Einstein, right?
And a lot of people, when they're putting queries, you know, into LMArena,
you know, it's kind of quippy things, fun things, right?
Like, you know, write me a haiku explaining large language models
using basketball terms, right?
Not something that an Einstein-level model would necessarily excel at.
So I'm just putting this out there.
Once the o3 full model hits the Chatbot Arena, I don't necessarily foresee it, you know,
being, you know, a top three model.
I do think probably Gemini 2.5 Pro, because it is a hybrid model, will still retain its lead
on that specific benchmark.
However, look at some of the other comprehensive sets of benchmarks that have already gone through with the new o3 full, or as some people are calling it, o3 high.
And it's the best.
So as an example, if you look at LiveBench, okay?
So LiveBench is a benchmark for large language models designed with test set contamination and objective evaluation in mind.
So I'm reading off their website here.
It has the following properties.
LiveBench limits potential contamination by releasing new questions regularly.
So that way it won't get into, you know, models' training sets.
Each question has verifiable, objective ground-truth answers.
Right.
So it eliminates kind of the need for a large language model judge.
So it's fact or fiction, no gray area.
And then LiveBench currently has a set of 18 diverse tasks across six
categories, right? So language, data analysis, math, coding, reasoning, et cetera. And then you have a
global average. So on LiveBench, which I think is a good third-party benchmarking system,
o3 is better than Gemini 2.5, with a global average of 81.5, and Gemini 2.5 is the next best model,
aside from OpenAI's o models, which actually take up the first three spots. So Gemini 2.5 comes in at
77.4. So o3 high, much better at 81.5. Similarly, another one that we talk about a lot
is the Artificial Analysis index. So again, a very reputable, and I'd say probably one of the
most trustworthy third-party benchmarking services out there. So they haven't done o3 full yet,
I believe because not all of the capabilities are available in the API, whereas on o4 mini high,
they are. Okay. So o4 mini high, which is a mini model, on the intelligence index, is the best
model or the most powerful model in the world. All right. So right now, it is ahead of Gemini
2.5 Pro by two points. All right. And this I think is pretty important because again,
you are comparing a mini model.
So I assume once the full model is put through some of these tests, it will be even further ahead.
But the o4 mini high is two points ahead of Gemini 2.5 Pro.
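To put the LiveBench numbers quoted in this episode side by side, here is a small sketch that computes the gap in points and as a percentage. The two scores are the ones read out here (81.5 and 77.4); the helper names are hypothetical, and the Artificial Analysis scores are left out because only the two-point lead was mentioned, not the underlying values.

```python
# LiveBench global averages as quoted in the episode.
livebench = {"o3": 81.5, "Gemini 2.5": 77.4}


def point_gap(leader: float, runner_up: float) -> float:
    """Absolute lead in benchmark points, rounded to one decimal."""
    return round(leader - runner_up, 1)


def relative_gap_pct(leader: float, runner_up: float) -> float:
    """Lead expressed as a percentage of the runner-up's score."""
    return round(100 * (leader - runner_up) / runner_up, 1)


print(point_gap(livebench["o3"], livebench["Gemini 2.5"]))
print(relative_gap_pct(livebench["o3"], livebench["Gemini 2.5"]))
```

On these figures the lead is 4.1 points, or about 5.3% over the runner-up.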
So when it comes to unbiased third-party benchmarks that look at a lot, it has been decided:
o3 and o4, right?
This is the most powerful model in the world.
Could Google clap back next week and release a brand new, you know, 2.6 pro?
Absolutely.
I'm sure they have something ready to go.
But today, if you are looking for the most powerful model in the world, o3 and o4,
it's where it's at.
So the standout feature, which is something that we're going to be doing in part two tomorrow,
uh, and let me know again, what use cases do you want to see in our part two?
but the standout feature by far is agentic tool use.
So if you're listening on the podcast,
this will make a little bit more sense on the live stream
where I have a couple of graphics here.
Okay, but as an example,
and this is from OpenAI's kind of website going over o3,
it says, I took this pic earlier.
So again, visual understanding,
the ability to reason with photos and,
kind of on its own terms, decide when and how often to
use these tools. So it says, I took this pic earlier. Can you find the name of the biggest ship you see
and where it will dock next? All right. This is tricky because in this photo that they upload,
all right, the ships are, number one, out of focus. They're a little blurry. But also,
they're at different perspectives, right? So it could be one ship just appears bigger because it's
closer. And the other ship could be larger, but it's further away. So it
reasoned for only a minute and a half.
And it even is talking it through.
Right.
So like here's kind of the chain of thought or the reasoning that the model is going through.
It says, I think I missed the ships in the crop.
They seem to be off to the left, which my human eye did not even see.
It says, I'll zoom in to better inspect.
Then after it literally crops in, zooms in, gets a clear kind of view
of the photo, then it says, okay, I see the ships now. The largest one appears to be the red-hulled
tanker on the right with another smaller ship in the center. The user wants to know the name of the
largest ship and where it will dock next. At this distance, the name on the hull may be too blurry to read;
maybe zooming in further will help get a clear view. So it essentially enhances the image,
continues to zoom, and then it decides at a certain point, okay, I've now understood the location, right?
So then it goes on and it uses things like location data.
It looks up using the internet to correctly identify what that ship actually is.
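The zoom, read, then search sequence in the ship example is a loop in which the model keeps choosing its next tool until it has what it needs. Below is a toy sketch of that control flow only. Every tool here is a stub with placeholder results, the "legible at 4x zoom" rule is an invented assumption, and none of this reflects how OpenAI actually implements o3.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class State:
    zoom: int = 1                      # current zoom factor on the photo
    ship_name: Optional[str] = None    # set once the hull text is legible
    next_port: Optional[str] = None    # set by the stubbed web lookup
    log: List[str] = field(default_factory=list)


def zoom_in(state: State) -> None:
    """Stub image tool: crop and enlarge the region of interest."""
    state.zoom *= 2
    state.log.append(f"zoom_in -> {state.zoom}x")


def read_name(state: State) -> None:
    """Stub vision step: read the name off the hull (placeholder value)."""
    state.ship_name = "EXAMPLE SHIP"
    state.log.append("read_name")


def web_search(state: State) -> None:
    """Stub browsing tool: look up the next port (placeholder value)."""
    state.next_port = "EXAMPLE PORT"
    state.log.append("web_search")


def choose_tool(state: State) -> Optional[Callable[[State], None]]:
    """Decide the next tool from the current state, or None when done."""
    if state.ship_name is None:
        # Assumption: the name only becomes legible at 4x zoom or more.
        return read_name if state.zoom >= 4 else zoom_in
    if state.next_port is None:
        return web_search
    return None


def run(max_steps: int = 10) -> State:
    state = State()
    for _ in range(max_steps):         # cap the steps so the loop always ends
        tool = choose_tool(state)
        if tool is None:
            break
        tool(state)
    return state


print(run().log)
```

The point of the sketch is the shape of the loop: the decision about which tool to call next is made fresh at every step from what has been learned so far, rather than being scripted up front.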
So also, there is a BrowseComp agentic browsing benchmark from OpenAI.
And I think this is worth pointing out because if you've ever used the 4o model,
and if you've uploaded an image and then had it go browse, such as the case in this example,
4o is not good, right?
So it only has a 1.9% accuracy rate.
Whereas now, right, when you look at o3 with Python, okay?
So again, that means it can kind of create its own code and render code to help solve
problems on the fly.
So when you have this new reasoning model that has a better visual understanding,
it can run code to help it solve problems, and it can browse the internet.
That 1.9% accuracy from 4o with browsing goes to nearly 50% with o3.
An extremely impressive jump.
All right.
And also, FYI, I threw this in here.
Should have been a couple slides back, but we did cover when we talk about use cases since we're going to be jumping into
use cases tomorrow. There's actually some use cases. I think a lot of people are sleeping on that we
went over in the new 4-0-0-0-1-1. But this also, the new model, can do image gen in o3. So here's the
overall features and takeaway as we wrap up today's show. So o3 is a powerhouse of
reasoning. It excels in coding, math, science, and visual tasks. So it provides deep insights
and complex solutions.
And it does this by tackling intricate coding,
science data, and creative tasks.
It can quickly analyze complex data sets.
Yeah, you can upload files,
and it can create new intelligence
with those files that you upload
for human level insights.
It thrives where deep understanding
and factual accuracy are essential,
and it's ideal for applications
demanding high-level expertise, right?
So if you've used OpenAI's deep research,
that was actually the only, I guess, tool or mode previously that used o3, the full version, right?
Whereas, you know, for the last couple of months when we've had deep research,
it was not using o3 mini, right?
And there's a huge jump between o3 mini and this o3 full or o3 high, whatever you want to call it,
right?
And it does a fantastic job of this agentic browsing on the web and iterating and kind of
changing course midway
through, again, depending on what you start with.
And it's ideal for applications demanding high level of expertise.
o4 mini, if I'm being honest, unless you're using o4 mini because you don't want to run out
of prompts, right, of those like 50 messages a week,
otherwise, there's no reason to use it on the front end.
There's not.
But I think o4 mini will be probably in the long run more for developers because right now it's
faster.
And it's more efficient.
So the big thing with o4 mini here, it's speed, scalability, and efficiency.
It's a smaller model, but it balances reasoning with computational efficiency.
And it excels where speed and costs are key.
And it's ideal for high volume use.
It's quicker.
Yet it is still insightful in interpreting data.
And it streamlines workflows with adaptable processing and connectivity.
So, yeah, I think if you're on a paid plan, you know, and using ChatGPT
on the front end, you should probably never prefer to use o4 mini.
It should really only be if you've kind of hit your quota for the week with o3.
But, you know, if you're a casual user and you're like, okay, 50 messages a week,
I can get by with that for o4 mini.
But if you're a power user, yeah, you might have to use o4 mini for some of those tasks.
And then kind of pocket o3 for the more complex things or things that require, you know,
kind of juggling these tools.
And that's ultimately where o3 excels.
And, you know, its agentic use of multiple tools and researching and changing course.
It's extremely impressive.
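To make that rule of thumb concrete, here is a minimal sketch in Python. It is only an illustration of the advice above, not anything OpenAI ships: the `Task` class, the `pick_model` function, and the `conserve` flag are hypothetical names, and the model strings are plain labels rather than API identifiers. The only number taken from the episode is the 50 messages a week.

```python
from dataclasses import dataclass

# The "50 messages a week" figure quoted in the episode for o3.
O3_WEEKLY_LIMIT = 50


@dataclass
class Task:
    """Hypothetical description of a request you are about to send."""
    description: str
    needs_tool_chaining: bool  # e.g. photo analysis + web search + Python


def pick_model(task: Task, o3_messages_left: int, conserve: bool = False) -> str:
    """Return a model label following the rule of thumb from the episode.

    - Out of o3 messages for the week: fall back to o4 mini.
    - Power user rationing messages (conserve=True): send simple tasks to
      o4 mini and save o3 for tasks that juggle several tools.
    - Otherwise: prefer o3.
    """
    if o3_messages_left <= 0:
        return "o4 mini"
    if conserve and not task.needs_tool_chaining:
        return "o4 mini"
    return "o3"


if __name__ == "__main__":
    simple = Task("tidy up a paragraph", needs_tool_chaining=False)
    complex_task = Task("read a photo, look up prices, build a chart",
                        needs_tool_chaining=True)
    print(pick_model(simple, O3_WEEKLY_LIMIT))                  # casual user
    print(pick_model(simple, 12, conserve=True))                # power user rationing
    print(pick_model(complex_task, 12, conserve=True))          # worth an o3 message
    print(pick_model(complex_task, 0))                          # quota used up
```

The `conserve` flag is the power-user case: simple one-off tasks go to o4 mini so the weekly o3 messages are left for work that needs several tools.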
So tool chaining.
That's something you're probably going to start hearing a lot.
And that's why it's important.
And that's why I think what makes it the most powerful model in the world is the ability
to use multiple of these tools at the same time for you to be able to upload files for
you to start with computer vision, right? Or start by, you know, uploading a photo and have it
be able to reason over that photo, the ability to essentially do deep research, right? So it's not
just blanket doing one search and pulling in all of that aggregate data and thinking over it at
once. It's going literally step by step and it's researching. And if it finds something in its
research, I've seen this, it will change course. I've had it a couple of times start by using
computer vision, then it goes and starts on the web, then it goes and starts using Python to
create something. And then in the middle of that, it's like, oh, wait, I need to go back to the
web. And then it's like, oh, wait, I need to go zoom in on that photo. Right. So that's where this
really excels, and this is kind of the special sauce and why, when I first started using this,
my jaw kind of dropped, which is hard for me to do as someone that spends so much time on
AI tools: it's its agentic tool chaining,
and putting these different capabilities together
and deciding on its own when it should use what tool
and then going back and reiterating on its own.
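The loop being described here, where the model picks a tool, looks at the result, and then decides what to do next, can be sketched in a few lines of Python. This is a toy illustration only: the three tool functions, the `choose_tool` policy, and the placeholder ship and port names are all made up for the example, and nothing here reflects how OpenAI actually implements tool chaining inside o3.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the tools named in the episode. Each one reads
# and updates a shared state dict and returns a short observation string.


def look_at_image(state: dict) -> str:
    """Pretend vision step: each call zooms in one more level."""
    state["zoom"] = state.get("zoom", 0) + 1
    if state["zoom"] >= 2:
        state["ship_name"] = "EXAMPLE SHIP"  # placeholder, not real data
        return "name on the hull is readable"
    return "too blurry, zoom in further"


def search_web(state: dict) -> str:
    """Pretend web search: only useful once the ship name is known."""
    if "ship_name" not in state:
        return "need the ship name first"
    state["next_port"] = "EXAMPLE PORT"  # placeholder, not real data
    return "found a docking schedule"


def run_python(state: dict) -> str:
    """Pretend code step: turn what was gathered into a final answer."""
    state["summary"] = f"{state['ship_name']} docks next at {state['next_port']}"
    return "built the summary"


TOOLS: dict[str, Callable[[dict], str]] = {
    "image": look_at_image,
    "web": search_web,
    "python": run_python,
}


def choose_tool(state: dict) -> Optional[str]:
    """Toy policy: pick the next tool from what is still missing."""
    if "ship_name" not in state:
        return "image"
    if "next_port" not in state:
        return "web"
    if "summary" not in state:
        return "python"
    return None  # nothing left to do


def run_agent(max_steps: int = 10) -> list[tuple[str, str]]:
    """Run the choose, act, observe loop and return the trace of steps."""
    state: dict = {}
    trace: list[tuple[str, str]] = []
    for _ in range(max_steps):
        tool = choose_tool(state)
        if tool is None:
            break
        observation = TOOLS[tool](state)
        trace.append((tool, observation))
    return trace


if __name__ == "__main__":
    for tool, observation in run_agent():
        print(f"{tool}: {observation}")
```

In this toy run the image tool is called twice because the first look is too blurry, which is the "go back and zoom in" behavior in miniature. A real system would replace `choose_tool` with the model's own reasoning.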
So it can think with images.
It can crop, zoom, and rotate visuals during analysis.
The 200k token context is great for deep layered workflows.
And then to seamlessly chain together tools,
the web, Python, and ImageGen for complex queries
like forecasting things, right?
and then to have this autonomous decision-making.
So complex queries, this is your model, right?
Because of that autonomous ability to chain together these different tools.
So Google has a shorter, smaller version of this.
But for the most part, when I'm using Gemini 2.5, I don't see Gemini 2.5's ability to go back and forth and reiterate on its tool use.
So yes, it can create things in its canvas mode in Gemini 2.5 Pro.
It can query on the web.
But for the most part, it is more of this unilateral approach where o3 does these
in parallel.
And it iterates on its own tool use, right?
Which is it is, right?
I don't know if people remember when I used to talk about plug-in packs and how they were
so powerful back when ChatGPT had plugins.
And I'm like, y'all are missing the big thing here.
Right.
And it hasn't been until now that
I've had that same feeling because essentially, right, you look at these different agentic
tools, kind of like plugins or tasks, right?
So part of it will analyze the image and then it will use that information to go find,
you know, updated information on the web.
Then it will pull that and maybe start using Python.
Then it'll look at the image again.
So I almost think of it as kind of like multiple specialists working together, but they'll work
one at a time. And then the researcher will come and find things and then bring, you know,
bring that back to the data analyst, which is, you know, Python, right? And it'll keep working
iteratively and then even use the canvas mode. So it's almost like, you know, you have a UI/UX
designer, right? So it does all of these things iteratively where I don't think we've really
had that with any models, right? So even with Gemini 2.5 Pro, again, this model hasn't been out for very
long. It does seem and feel and under the hood look like a more unilateral approach, where I think
where o3 shines is that it can adapt its own strategy on the fly.
It reacts to information.
It refines its tool use and it can tackle those tasks requiring up-to-date data,
expanded reasoning and diverse outputs.
All right.
That's a wrap, y'all.
I'm going to scroll through and if I see any questions.
Joe just says, thanks for this report.
Very helpful.
I wonder how OpenAI has resolved intermodel
communications for chaining. Yeah, we'll see. Right. So we have heard, and this has been pushed out,
right, that in the future, you're not going to be able to decide which model to use, right? And
GPT-5 will actually be an architecture that houses some of these modes or some of these models under
the hood and you may not get to choose. I don't want that to happen. I don't want GPT-5, right? I want
to be able to choose my own models, right? So it should be interesting to see how that happens.
All right, we have a LinkedIn comment here.
Someone said in your newsletter, you mentioned you have been struggling to push past o3's limits and would love to hear more about that.
What limits have you been pushing?
Yeah, great question.
And yeah, sorry, for whatever reason, LinkedIn settings, I don't see your name.
It's been very easy for me to push models to the limit.
And one of the reasons is you give them complex tasks that would normally unfold over the course of like an hour-long conversation,
right, you know, saying, hey, analyze this photo, then go, you know, create a
chart where you forecast something based on information that you pull from this photo.
So as an example, you know, here's a photo with a bunch of AI tools.
And this is probably an example I'll do tomorrow, right?
Go look up pricing for all these tools.
Go look up, you know, what's included on a free and paid tier.
Then, you know, using, you know, kind of your coding abilities, create a chart.
but then go out and also create, I don't know, a website or an interactive graph on this.
So, you know, it's been difficult for me to kind of break some of these models because they don't
have essentially complex tool use.
And o3 does.
And it seems like, at least in my very initial testing, which hasn't been a lot, right,
I've probably only been able to give o3, I don't know, maybe 10 or so hours so far.
I've been very busy.
I had a keynote and a workshop and I moderated a
panel at 1871 and, you know, planning all these episodes.
So I haven't had my normal amount of time.
You know, we had the Easter weekend.
So I was, you know, trying to spend as much time with family as possible.
So I haven't had as much time to break it.
But I haven't been able to break o3 yet because it's extremely, extremely capable.
So McDonald asking, do you recommend using this for building games?
It depends, right?
I still would probably start that in Gemini 2.5 Pro. Again,
just because o3 is the most powerful model in the world does not mean it's necessarily the best.
I think the use cases are going to be when you need to string together all of these agentic use cases.
At least for me, if I'm looking for a one-off, you know, building games as an example,
I'm not a coder, but I would probably still do that in Gemini 2.5 Pro.
It's going to be faster.
And its coding capabilities are outstanding.
All right.
Let me just real quick before we wrap this up, see if there's any more questions.
I always try to get to questions at the end.
Big bogey face from YouTube saying,
why use a sledgehammer when a rock and hammer will do?
Yeah, that's a great point.
Renee is asking what about Manus?
So Manus is a little different.
You have to choose a model for Manus.
Manus is not publicly available yet, right?
You have to get on a wait list, get access.
And it's different, right?
That's why people, you know, sometimes are like,
oh, you know, what about Perplexity?
Well, Perplexity at its core is not a large language model.
Neither is Manus, right?
Manus, you have to use
a model and then Manus is essentially a collection of tools.
And right now it runs on Claude Sonnet.
So it is completely different.
That is a true kind of operating agent, whereas this is more interfacing inside of a chat
like you would a traditional large language model.
All right, we have some proposed use cases for tomorrow.
All right, we have one more question here from Kieran saying,
How might the advancements in the o3 and o4 mini models influence the development of future AI systems, such as the anticipated GPT-5?
That's a great question.
Kieran, I don't have the answer, right?
I'm lucky enough.
I, you know, have contacts over at OpenAI that I chat with.
I don't know the answer to this.
As I get the answer, I will get it to you.
But again, OpenAI has delayed GPT-5.
And they said that they've been struggling to essentially put all of these capabilities under this kind of umbrella and turn it into a system.
So yeah, like I said, personally, I'm not looking forward to GPT-5.
I love, right, even though a lot of people look at it as this chaotic mess,
I love going into my ChatGPT account and seeing, you know, seven to 10 different models to choose from, right?
Because I'm a power user.
I know what I'm doing.
And generally, I have a better idea than a GPT-5 system probably would of knowing which model is best because I've used them all for hundreds of hours for my own use cases, right?
Maria is saying, I'm still waiting for the OMG model.
For me, you know, I think the Gemini 2.5 was an oh my model.
And o3 is an oh-my-gosh model, right?
So I went from "oh my" with Gemini 2.5 Pro, where most of the time with models, I'm just like,
okay, you know, cool. This is nice. Gemini 2.5 was "oh my" and o3 was "oh my gosh."
All right. So we're going to continue this tomorrow. So make sure you tune in for part two.
We're going to be going over different use cases, but also let me know what do you want to see.
So if you're listening on the podcast, thanks for tuning in.
Make sure to go to YourEverydayAI.com. Sign up for the free daily newsletter.
We're going to be recapping the most important takeaways from today's episode.
But also, you can just reply to today's email that's going to come out in a couple of hours
and let me know what is the use case you want to see tomorrow, right?
I really want to tackle things that are on your mind.
I call this your everyday AI because it's for you.
So I want to hear from you.
What do you want to see this new o3 model tackle?
Yeah, maybe you have limited messages and you don't have
kind of the message budget, so to speak, to tackle this.
I've got unlimited.
Put me to work.
Let me know what you want me to see.
Or let me know what you want to see in our part two.
If this was helpful, please, this would help.
Click that little repost button, y'all.
Share this with your network.
I know you're trying to be the smartest person in AI at your company, in your department.
That's what we try to help you with at your everyday AI.
But this thing only works when you share it with others.
So if you're listening on social, please share it with others.
If you're listening on the podcast, please follow the show.
Click that little button.
If you could, leave us a rating.
I'd really appreciate it.
So thank you for tuning in.
We'll see you back tomorrow and every day for more everyday AI.
Thanks y'all.
And that's a wrap for today's edition of Everyday AI. Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't
get left behind.
Go break some barriers and we'll see you next time.
