Everyday AI Podcast – An AI and ChatGPT Podcast - Ep 357: OpenAI o1 released: 5 things to know about ChatGPT’s new model
Episode Date: September 13, 2024
OpenAI just shocked us all with the release of its o1 model, previously codenamed Strawberry and Q*. We break down what's new, 5 things you need to know and more!
Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion: Ask Jordan and Hanan questions on AI
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn
Topics Covered in This Episode:
Introducing OpenAI's New Model: O1 Preview
O1 Preview: Chain of Thought Reasoning
Comparison: GPT-4 and O1 Preview Performance
Anticipated Frequent Language Model Updates
OpenAI O1: Features, Limitations, and Pricing
User Experience Concerns with O1 Series
Benchmark Performance: O1 Surpassing PhDs
The Current and Future State of AI
Practical Testing of Large Language Models
Real-Time Application of AI Models
AGI and Chain of Thought in AI Evolution
Exclusivity of O1 Preview to ChatGPT Accounts
O1 Preview: Shift in AI Approach to Problem-Solving
Live Testing Demonstrations on Podcast
Regulation Needs for Safe AI Development
Timestamps:
00:00 Everyday AI: Learn generative AI for growth.
03:46 Free, Plus, Teams, Enterprise accounts available soon.
07:50 GPT tech simplifies skilled tasks with agentic workflow.
10:51 OpenAI creates two distinct series for GPT.
13:10 Expensive and limited OpenAI's API preview.
18:06 MMLU: Multitask language understanding benchmark for AI.
21:14 Anthropic raised $4B; OpenAI holds back GPT-4.5.
22:40 Corporate revenues mainly from subscriptions and APIs.
26:39 Man and dog cross in one trip.
29:41 Many correct answers exist, including 153.
32:24 Large language models struggle with this question.
35:47 Future: Many small companies, fewer traditional jobs.
38:34 Q Star impressive, expensive, limited, large model updates.
Keywords:
OpenAI, o1 series, AGI, Artificial General Intelligence, GPT-4, ChatGPT Plus, ChatGPT Teams, ChatGPT Enterprise, EDU accounts, API access, reasoning-based model, large language models, everydayai.com, numerical code, word count task, world hunger plan, Anthropic, Google, Meta, performance benchmarks, MMLU, subscription costs, message limits, user experience, AI development, AI regulation, real-time tests, agentic processing, code-named Q Star Strawberry, prompt engineering
Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)
Start Here ▶️
Not sure where to start when it comes to AI? Start with our Start Here Series. You can listen to the first drop -- Episode 691 -- or get free access to our Inner Circle community and all episodes: StartHereSeries.com
Also, here's a link to the entire series on a Spotify playlist.
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
OpenAI has just released a new model. Well, technically two models. That's not the story. The story is these models are actually one of a kind, as in they don't have any competitors.
It's something no one's talking about. These new o1 and o1-mini models from OpenAI are agentic in nature.
They think like humans.
They go through human level reasoning and take time to respond.
Yeah, that's right.
They aren't instantaneous, spitting out hundreds of characters before you can even read.
It goes through an agentic process in a chain of thought reasoning like a human would.
And I'm excited to talk to you about that today on Everyday AI.
And we're going to be telling you also not just about OpenAI's new o1 and o1-mini models,
but also five things that I think you need to know about this new groundbreaking model.
So what's going on, y'all?
My name's Jordan Wilson and welcome to Everyday AI.
This is a daily live stream podcast and free daily newsletter,
helping us all learn and leverage generative AI to grow our companies and to grow our careers.
So technically double dipping today.
That's how big this one is.
Yes, second live stream and podcast for today.
So if you're normally tuning in to hear the AI news by this time,
we'll already have a second podcast out.
So go check that one.
And always on our website at YourEverydayAI.com.
Let's jump straight into it.
No fluff, no fillers.
Give us about, yeah, we'll see.
20, 25 minutes.
I'm going to tell you everything.
We did all the research, took our time.
We've been reaching out to people at OpenAI, and we actually heard back from one of them.
So we're going to tell you everything you need to know.
So first, this is a new class of model, period.
I don't know why no one's talking about that.
This is a reasoning-based model.
So when we talk about other large language models that we talk about all the time, right?
This is the future of work.
We talk about Anthropic's Claude.
We talk about Microsoft Copilot.
We talk about, you know, technically Microsoft Copilot uses GPT-4o.
We talk about GPT-4o from OpenAI.
We talk about Gemini 1.5 from Google.
This new model, o1, o1-mini, they are not the same, right?
So this is a reasoning-based model, chain-of-thought reasoning.
Extremely impressive and apparently very good at coding and math. So this is also considered a
preview, all right? And it is available now, as in today. Yeah, no wait list, no blog post
announcing. OpenAI shipped it right after we've kind of been waiting for things like SearchGPT.
We've been waiting for things like Advanced Voice Mode. You know, we kind of heard some rumblings
and they just shipped it today. So if you have a ChatGPT Plus or a ChatGPT Teams account, it
should be available now as you're listening. We have both. We have free, paid, which is, sorry,
we have the free, ChatGPT Plus, ChatGPT Teams and ChatGPT Enterprise accounts when companies
hire us to teach them. So yeah, if you want to have your company learn ChatGPT, reach out. So
right now, the only people that have access, like I said, are the Plus and the Teams. Enterprise
and EDU accounts should be available within the week, and free, all we heard is soon, but for
just the o1-mini. All right. So as ChatGPT did, or OpenAI did, with GPT-4o, there is a 4o and a
4o mini. In this case, there is an o1 and an o1-mini. Yes, the naming, not a huge fan of, but it
is a completely different class. Also, the API has been rolled out, but right now only to tier
five users. So only those heavier users right now. So let's get in. That's just an overview. So
here are the five things that
you need to know. All right. And then we're going to dive into each one of these five in a little
bit more depth. So number one, this is the first chain-of-thought, reasoning-based model,
agentic thinking, right? That's the reality. And this is a big step toward AGI. Number two,
this is not part of the GPT family. It is different. More on that later. Number three, right now,
it's very expensive and the messaging rate is limited. My gosh, yeah. Hopefully you don't like it
too much because you're not going to be able to use it too much. Also, number four, the benchmarks
are surpassing obviously every model that there is, and it's not even close, but it is surpassing
PhD-level humans, also not even close. And number five, I think this is going to set off a large
language model arms race. All the big companies are going to have to respond. All right, so more
on that here. Let's now dive into each one of those five things that I think you need to know. So,
talked about number one. And also, podcast audience, check the show notes. We are going to have
a video for this as well. You know, if you're watching the video, you're going to see.
But I'm going to try to do my best to describe. At the very end, we're going to run some simple tests.
We're going to do it live. Here's the thing, y'all. I've been itching, itching to do this.
Even though I have multiple accounts, I haven't even done this yet. So I'm going to be doing it live,
unscripted. Hopefully it works, right?
But the limits are very severe, which is why I'm saving it and we're going to find out live.
So step number, or thing number one to know.
This is the first chain-of-thought, reasoning-based model.
Huge step toward AGI.
So it's like this: every single model out there right now,
I mean, number one, they're so fast, which is sometimes good or bad, right?
But also think how models right now
work without proper prompt engineering, right? If you're listening, you've probably taken our
Prime Prompt Polish course, our PPP course, and you know what? We might have to completely
rehash this because this new o1 and o1-mini changes the rules. It is, right? You can achieve
similar outcomes as you can in o1 and o1-mini in other large language models.
But it takes someone highly skilled, right?
I would say that's myself.
And, you know, so if you're someone that, you know,
spends at least five to eight hours a day inside of large language models like me.
And if you've been doing that for multiple years, you know, even pre, you know,
pre chat GPT, right?
Like our team's been using the GPT technology since 2020.
You have to be highly skilled and spend a lot of time to get the results that you can now get
in a single prompt because of this agentic workflow.
And you're going to see this.
I kind of have this screenshot here, but there is a toggle after you run a prompt
where you can essentially see the chain of thought, right?
And chain of thought, it is a prompting technique.
That's why I say technically the outcomes that you can achieve right now have already
been available, but it has required someone highly skilled and it has required usually a
lot of time.
So you'll see it.
It's almost like it's going through, in theory, an ideal outcome of what a very skilled prompt engineer would do over and over and over in multiple steps, right?
Extremely impressive.
It thinks, it processes, and think of it like that, right?
Like a smart human can accomplish a ridiculous amount of things with a powerful large language model in enough time, right?
Yes, this is slower, right?
So it might take 10 to 20 seconds. Like, gasp, right?
But that's because you literally have an agentic workflow.
This isn't where, you know, oh, you know, large, you know,
I hate when people call large language models next token predictors.
Yes, they technically are, but they are so much more than that if you know how to use them, right?
So now think of that happening seemingly dozens of times in one step.
Super, super impressive.
All right.
Second thing to know,
this is not part of the GPT family, y'all.
So right now,
let me just read this from OpenAI.
So it says this is an early preview
of these reasoning models in ChatGPT and the API.
In addition to these model updates,
we expect to add browsing, file,
and image uploading and other features
to make them more useful to everyone.
So that means right now, o1 and o1-mini do not have these normal features that we expect, right?
The ability to browse the web, upload files, computer vision via image uploading, right?
Also, it says we plan to continue developing and releasing models in our GPT series in addition to the new OpenAI o1 series.
Two separate things, all right?
And y'all, the cool thing about having a daily podcast that's been going on for like a year and a half,
Go check the receipts, right?
I cover the rumors, but I've never jumped on the hype.
I said OpenAI has zero reason to release GPT-5 anytime soon.
This is different, right?
This is essentially, we've been talking about this Strawberry, right?
OpenAI Strawberry.
Before that, it was Q*.
That's what this is.
This isn't the next iteration of the GPT family.
This is essentially a new mode or a new
way of thinking. All right. So, you know, it's almost like OpenAI just created a fork in the road.
So at least right now, it seems for the short-term future, it's going to be operating like that,
at least according to this statement that they put out that, you know, they're essentially
looking at it as two different series, which is interesting, right? Because if they say that in this
o1 series, they're going to be adding browsing, file, and image uploading, and these other
features, well, then what would separate it from the quote-unquote GPT class series?
I'll tell you one thing, price, right?
We've been covering it here on Everyday AI: there's been these rumblings and rumors that,
you know, internal discussions at OpenAI, as they're, you know, right now they're reportedly
burning through a lot of money and trying to raise a lot of money, and they've been floating out
higher subscription prices. I think this is why. I think this is why, right? Numbers have been floated out
like $2,000 a month. I don't think we're going to get to that, right? Right now, you have a very
limited use of it. But right now it's still only, you know, $20 to $30 a month for your plan.
So is that what we've been hearing? I personally think
so. But if they are separating the two, these two kind of series, I believe that's what they said.
Yes. But if they're adding all of the quote unquote GPT features to the o1 series, what separates them?
Well, I would think price, right? And maybe this is more for enterprise companies. And maybe, yeah,
you're going to have to pay a couple hundred dollars or $2,000 in the future
to use it. I don't know, but right now, go out and use it now, right? Or maybe it's just going to
continue to be extremely limited. All right. So yes, that's our, okay, so speaking of limits,
number three, well, it's number three is it's expensive, but also very limited. All right,
let's talk about the expense first. All right. So I'm not talking about on the front end of
chat GPT. That's the limits, but let's talk about the expense.
So API, all right, so many companies out there, you know, they're building on top of OpenAI's API.
So right now, the o1-preview, again, it's called preview, the thing that the public has access to, $15 per 1 million tokens input and $60 per 1 million output, which like two years ago was super cheap, if I'm being honest.
But after 4o, that's pretty expensive, right?
Comparatively, 4o is 5 and 15, whereas this is 15 and 60.
So big jump up.
But again, we are getting agentic workflows in an API, right?
That's wild.
What this means for businesses, I am excited.
Someone hire me right now.
I can't wait to go, wow.
I mean, okay, anyways, let's stick to price here.
But, but, but, but, o1-mini, 80% cheaper.
$2.50 per one million tokens input and $10 per one million output.
So the o1-mini is looking nice.
It's looking nice.
All right.
Got the chart there for the rest of you.
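To make the price gap concrete for listeners who cannot see the chart, here is a minimal Python sketch. It assumes the per-million-token prices exactly as quoted in the episode; the model keys, the `request_cost` helper, and the 2,000-in / 1,000-out workload are hypothetical and only for illustration. It also ignores any tokens the model spends on its hidden chain of thought, so real bills could be higher.

```python
# Prices in USD per 1 million tokens, as quoted in the episode (September 2024).
# These are assumptions taken from the transcript, not live pricing.
PRICES_PER_MILLION = {
    "o1-preview": {"input": 15.00, "output": 60.00},
    "o1-mini": {"input": 2.50, "output": 10.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request for the given model."""
    price = PRICES_PER_MILLION[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


if __name__ == "__main__":
    # Hypothetical workload: 2,000 input tokens and 1,000 output tokens per request.
    for model in PRICES_PER_MILLION:
        cost = request_cost(model, input_tokens=2_000, output_tokens=1_000)
        print(f"{model}: ${cost:.4f} per request")
```

Swapping in your own token counts shows how quickly the output-token price dominates for long answers.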
But let's talk now about messages.
Yeah.
This is pretty, pretty limiting here.
So there we go.
Here's the downside.
Right now, on GPT-4o,
if you are on the Plus accounts, all right, the $20 a month,
right now you get 40 messages every three hours, which isn't, isn't bad, right?
This, for o1, you know, technically o1-preview, right?
So o1, you get 30 messages
a week. Not every three hours, not every day. A week. 30 messages a week. And for o1-mini, 50 messages a week.
Ouch. Yeah. Remember when I said, oh, why are there these two different series and why is everyone
talking about this, this, you know, plan and, you know, potentially
much higher prices.
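For a sense of scale, here is a rough Python sketch that puts the quoted caps on the same weekly footing. It assumes the numbers stated in the episode for a Plus account and treats the three-hour cap as a theoretical ceiling you could only reach by maxing out every window; the `weekly_cap` helper is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_cap(messages: int, window_hours: int) -> int:
    """Convert a cap of `messages` per window into a rough per-week ceiling."""
    windows_per_week = HOURS_PER_WEEK // window_hours
    return messages * windows_per_week


# Caps as quoted in the episode for a ChatGPT Plus account (assumptions, may change).
WEEKLY_CAPS = {
    "GPT-4o": weekly_cap(messages=40, window_hours=3),  # 40 messages every 3 hours
    "o1-preview": weekly_cap(messages=30, window_hours=HOURS_PER_WEEK),  # 30 per week
    "o1-mini": weekly_cap(messages=50, window_hours=HOURS_PER_WEEK),  # 50 per week
}

if __name__ == "__main__":
    for model, cap in WEEKLY_CAPS.items():
        print(f"{model}: up to {cap} messages per week")
```

On these assumptions the GPT-4o ceiling works out to 2,240 messages a week, against 30 and 50 for the two new models.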
Y'all, I've been saying this.
Go back and look.
Go check the tape.
I've been saying as models continue to improve,
it's worth hundreds of dollars a month as you get agentic workflows.
I said that back in 2023 before any of these rumors came out.
I said it's going to be worth hundreds or thousands of dollars.
So it's not shocking when, uh, The Information reported this a week ago.
I said, okay, well, if you're getting agentic workflows, technically, if it works, $2,000 is a bargain.
You can't get a smart human to even power your ChatGPT account for $2,000, right?
So it's good, right?
Assuming the limits.
But right now on that $20, $30 a month plan, not looking good.
Hopefully enterprise will have more. All right. And here's the thing I don't like. So I reached out to the
head of research at OpenAI, um, saying the 30 to 50 a week is going to be limiting, um, you know,
asking if there's a way to see your limit, right? What if you're in the middle of a big
project and you don't know, over a week, okay, how many did I use Tuesday? How many did I use
Wednesday morning? Oh, did I use any on my phone on Friday? You have no way to know right now.
So Boris here said, "Good feedback, not yet." So yeah, there's no way to track it. So that's a
bummer. At least they're listening. All right, let's look at number four. Fourth thing you need to know,
the benchmarks are nutty. The benchmarks are nutty. They're surpassing Ph.D. human level.
All right. So this is MMLU. All right. So I'm going to tell you
right now. On the screen, if you're watching, you've got this. Otherwise I'm going to describe it. So
up until today the most powerful model was GPT-4o. Now, if you have a paid account, the most
powerful model you can access is o1, technically called o1-preview, but then there is the
o1 that OpenAI has that they haven't released. All right. So we technically have access to
o1-preview and o1-mini, but the actual o1 model, which is still under wraps:
92.3 MMLU, all right?
And yes, I know there's problems with the MMLU benchmark, all right?
And to say it very plainly, right, I try to keep things simple here at Everyday AI for the
non-technical people, the MMLU is the multitask language understanding benchmark.
Yes, there are maybe some better by now, but historically, the MMLU,
it's, I call it like the ACT for large language models. So about four years ago,
right, the scores in the, in the 40th percentile were considered cutting edge, right?
The average human, right, the average educated human, not a domain expert, would score in the
30s, all right. It's out of 100. So, you know, if you're a smart human, you'll get in the 30s.
Models four years ago were in the mid-40s.
Domain experts, what that means.
Literally think of the smartest human in the world.
Get all of the smartest humans in the world.
And they all take the MMLU.
Experts estimate that they'll get about an 89%.
All right.
92.3.
All the other models right now are in the 88s.
So, you know, GPT-4o, Claude 3.5 Opus, Meta's 405B, all of the leading models had been stuck essentially in this 87-88 range for the better part of six months.
And people are like, oh, you know, LLMs are stagnant.
Generative AI is hype.
Look, all this money and they can't get over this, you know, 88 MMLU hump.
Well, consider it crushed.
92.3 for o1.
The o1-preview is 90.8.
But 92.3.
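One way to see why a jump from roughly 88 to 92.3 is bigger than it sounds is to look at the questions missed rather than the questions answered. The Python sketch below is only an illustration: it assumes the scores quoted in the episode, uses 88.0 as a stand-in for "the 88s", and the helper names are hypothetical.

```python
# MMLU scores (percent correct) as quoted in the episode; treated as assumptions.
SCORES = {
    "previous leading models (approx.)": 88.0,
    "o1-preview": 90.8,
    "o1": 92.3,
}


def error_rate(score: float) -> float:
    """Percentage of questions missed for a given percent-correct score."""
    return 100.0 - score


def relative_error_reduction(old_score: float, new_score: float) -> float:
    """Relative drop in error rate when moving from old_score to new_score."""
    old_err = error_rate(old_score)
    return (old_err - error_rate(new_score)) / old_err


if __name__ == "__main__":
    baseline = SCORES["previous leading models (approx.)"]
    for name in ("o1-preview", "o1"):
        cut = relative_error_reduction(baseline, SCORES[name])
        print(f"{name}: about {cut:.0%} fewer misses than an 88-point model")
```

Under these assumptions, a few points of headline score translate into cutting the miss rate by roughly a quarter to a third.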
Y'all, I don't think there's a technical bar where it's like,
oh, we've achieved AGI.
92.3 is crazy.
So I remember following MMLU back in, you know,
2020 when our team first started using the GPT technology.
And I remember back then, you know, the estimates were that it would take maybe 15 to 20 years to ever hit 90.
Got there in about three years, right?
And now we're in the mid 90s, which a lot of people thought would never be possible with large language models.
Development's wild, y'all.
All right.
And then let's go to number five.
Number five things are about to heat up.
Yeah.
There's no AI winter here.
There's no hype dying down.
My gosh.
You know how much Anthropic's got in the bank?
I don't know how much they actually have in the bank, but they raised $4 billion from one partner, from Amazon.
Right?
Google has unlimited money that they can print.
Right?
We've assumed that Anthropic, though, has just been waiting, right?
This is a wait and see game.
Everyone kind of waits to see what OpenAI does.
Love the strategic play here, right?
I am of the belief that OpenAI has a 4.5, a GPT-4.5 level update for its
GPT class ready to go when it wants to, but they have no reason to release it, right?
Their GPT-4o is technically the best in this class,
and at least for now, this new kind of agentic class of large language models,
there's zero competitors, right?
I mean, there are.
There's other companies out there, you know, that have, you know, been creating
agentic, more like workflows.
But there's, I mean, I'm only talking about the big five in the room, right?
Not talking about anyone else.
Because these are the world leaders.
So they are the first world leader in this class.
But, but you got to look at the cost here.
If I'm Anthropic, you don't have time.
You don't have time, right?
Their most powerful model right now, Claude 3.5 Sonnet.
So much of the money that these companies make comes from, yes, subscriptions, people paying, you know, $20, $30 a month, you know, whether you're on the base plan or the Teams plan,
but so much of it comes from organizations that are paying, right, to essentially use their API.
They bring in their data, these companies that are building products that we all use, that we all love.
They're all powered by, you know, mainly either OpenAI, Anthropic or Google Gemini.
Anthropic, they have to respond, right?
Their most powerful model right now is Claude 3.5 Sonnet.
We know their order of models goes Haiku, Sonnet, Opus.
So they only upgraded their middle model Sonnet to 3.5.
So presumably, we've all known that they probably have a 3.5 Opus, maybe ready to go.
They kind of wanted to see benchmarks of whoever makes the next big splash.
Well, guess what?
Splash has been made, y'all.
So I would assume Claude 3.5 Sonnet or an agentic response from
Claude has to be coming soon. Similarly, I would feel Google Gemini 1.5 Ultra, that has to now be
in play fairly soon. Or Meta, you know, Meta's been floating their agents out there and we
haven't seen anything either. This is going to get things going. I think especially, right, so now
there's, there is a little bit of pressure now here in the U.S. at least for these large language
model makers to work with U.S. government, federal regulators, et cetera, for safety reasons,
I would expect after the U.S. election here in about seven weeks, it's got to go wild.
We're all going to get Christmas presents, right?
Large language models raining down on us.
All right.
So now I hope this works.
Let's look live.
All right.
This is not a full breakdown, right?
We'll do that maybe next week.
I just want to go over some basics. All right. So here we go. So keep in mind, a lot of models now to choose from.
So always make sure, especially because o1-preview and o1-mini are so limited, always make sure you are using the correct model that you want to use.
All right. I'm not going to do a full, you know, prompt rundown. I do need to get more of a rubric, like an official set.
I've been doing this since 2023. You know, I just have a bunch
of random questions that I ask. You know, sometimes I'll have it, you know, code a game, you know, go through, I made up some logic questions. There's some logic questions that have been floating out there, you know, on the internet. So I'm going to do a couple. And I'm going to do some that generally models struggle with. So first, I am just going into ChatGPT here. Oh, this is going to be annoying. So actually what I'm going to have to do, y'all, give me a second. I don't know what happened with my Edge and Chrome. I cannot log
in. I can't log in to ChatGPT anymore. I'm getting SSO errors. I've cleared my cookies,
logged out, logged back in. Fun time. Sorry. So give me a second here. I'm having to open, I
couldn't zoom in on my ChatGPT app, so now I am going to share, I'm going to share my
Firefox. Yeah, got to go into Firefox. Not thrilled about it, but here we are. All right,
so here we go. Now we are going to do some basic tests. So we
are in GPT-4o. I'm doing something simple again, going over ones that models normally get wrong.
I'm saying: a man and his dog are standing on one side of the river. There's a boat with enough
room for one human and one animal. How can a man get across with his dog in the fewest number
of trips? Not bad. Normally GPT-4o says like three to four. The correct answer is obviously
one. GPT-4o got this wrong. So just so you know, if you're listening at home,
on the podcast. Each time I'm doing this, I'm creating a new chat. So the context window is not
skewed in any way. All right. So now my first o1 chat, here we go. I'm going to do the same
thing. And presumably this is going to go much slower. And we will see, like I talked about at
the top of the show, a more agentic flow, breaking this down into steps and seeing how the model
works. All right. So I'm going to hit enter here. And you know what I'm going to do? I'm going
to go ahead, I'm going to do a stopwatch here.
Let's see how long this takes roughly.
All right, here we go.
And there.
All right.
So it says thinking.
So it's taken a sec.
It says charting the journey.
I can click down and I can see it.
So okay, that was actually faster than I thought.
Probably took me a couple seconds to get over.
Probably took about nine seconds.
But my gosh, this is the first time.
And I've tried this simple prompt on every single large language model.
I don't know why it screws them up.
It always says two, three, four, right?
The correct answer is one.
So, o1-preview says, the man and his dog can both get into the boat together and cross the river in a single trip,
as the boat has enough room for one human and one animal.
All right, so let's go ahead and let's look at this kind of chain of thought.
And it does time it, so it did say here, thought for five seconds.
All right, so I'm going to click down and we can look.
This one's very simple.
So it kind of broke it down.
It said charting the journey, calculating boat capacity, identifying the boat's location.
This was a pretty simple one.
You know, OpenAI shared a lot of examples on their website that went into crazy detail.
But this is actually in theory simple, right?
This is simple.
So love to see that.
Love to see that breakdown.
All right.
Now we're going to do, we're going to do probably two more.
We're going to try to do it quickly here.
Don't want this to go on for too long.
So now we're going into ChatGPT 4o.
I've done this one many times on the show.
So I'm saying a box is locked with a three-digit numerical code.
All we know is that all digits are different.
The sum of all digits is nine.
And the digit in the middle is the highest.
What is the code?
When I first started doing this, it confused me.
Because I thought like, oh, there's one answer.
I made this one up.
And then I'm like, oh, wait, no, there's tons of answers, actually.
So let's see.
Interestingly enough, which you get this a lot with 4o,
4o is trying to do this kind of breaking it down step-by-step.
So it's trying to do a chain of thought.
And so it's saying step one, define the relationship.
It's creating variables, right?
A plus B plus C equals nine.
So actually, I've run this on 4o a lot.
4o did just get updated last week.
There's always kind of updates going on under the hood.
And this is probably the best 4o example that I've got.
So right now, I technically got a wrong answer, even though it's kind of right.
So it identified one correct code, which would be 153.
So it says, yeah, so that is technically a correct answer to the question, but there's actually a lot of correct answers.
Right.
So in theory, I can even think of them in my head.
I could say 0, 5, 4 would work, right?
Three digits add up to 9.
Middle digit is the highest.
So there's actually a lot.
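To check the claim that the puzzle has many answers, a brute-force search is enough. This is a minimal Python sketch of the three stated rules (all digits different, digits sum to nine, middle digit highest); the `valid_codes` function and its `allow_leading_zero` switch are hypothetical names, and allowing a leading zero is an assumption made because 054 is offered as an answer on the show.

```python
from itertools import permutations
from typing import List


def valid_codes(total: int = 9, allow_leading_zero: bool = True) -> List[str]:
    """List every three-digit code with distinct digits, the given digit sum,
    and a middle digit strictly larger than both neighbors."""
    codes = []
    # permutations() over 0-9 already guarantees the three digits are different.
    for first, middle, last in permutations(range(10), 3):
        if first + middle + last != total:
            continue
        if not (middle > first and middle > last):
            continue
        if first == 0 and not allow_leading_zero:
            continue
        codes.append(f"{first}{middle}{last}")
    return codes


if __name__ == "__main__":
    codes = valid_codes()
    print(len(codes), "valid codes:", ", ".join(codes))
```

Under these rules the search finds 14 codes when a leading zero is allowed and 10 when it is not, so a single answer such as 153 is correct but incomplete.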
So let's go ahead.
Let's pop open o1-preview.
Fresh chat here.
Let's do the same thing.
Let's see it think.
All right.
So it says evaluating constraints, mapping out the combinations,
enumerating digit combinations,
assessing digit constraints, assessing digit pairs. Wow, okay. This one's taking a little longer. Evaluating potential
codes. All right. I'm going to look at it as it goes. Identifying unique codes. Now, identifying
valid combinations. I'm looking at this. Some of this doesn't even make sense to me. Identifying
valid digit sums. So you'll see here, big difference from what we're used to. You know, GPT-4o spat out an
answer in like two seconds. This is thinking. It said it thought for 44 seconds. Wow. All right. Let's see if we
got the answer right. All right. So let's go down. A lot, a lot of information here. Filtering possibilities,
verifying combinations, thinking about possibilities, evaluating code choices. Wow. A lot of math here.
A lot of math. Let's see what we actually got. Interestingly enough, it only, so we got the
answer wrong here again. It gave one correct answer. So it said the answer is 450. But again, we know
there's actually many correct answers, right? Technically, I have to reread. Look. So it says a box is
locked with a three-digit numerical code. All we know is that all the digits are different. So as an
example, it said 450. So 351 would have worked. Two-five, not 252. Right. Anyways, there's more
than one that works. So although in this case, o1-preview was a little more impressive with its thought,
still didn't get it right. So that's one that is still stumping large language models. All right,
here's another one I've been doing since the beginning, since I started Everyday AI. How many
words are in your response? This one, no model as far as I know. And this is a pretty famous one.
I've seen this in research papers, right?
And why this doesn't work, and this is the same reason why, and I broke this down
into like a 20-tweet thread once, you know, why you can't ask, you know, how many R's
are in strawberry, right?
We know that one works.
OpenAI did a demo on that.
And I don't want to waste one of my precious, one of my precious 30 queries for the week.
I got to teach you all.
All right.
So now I'm saying, how many words are in your response?
And the reason why this doesn't work is because large language models technically don't understand words, really.
They break smaller parts of words into tokens.
So it's this tokenization process.
So they actually don't know how many words they're going to spit out to you.
So I'm going in the regular 4o mode here.
And it says the response contains 23 words.
And obviously the response, 1, 2, 3, 4, 5 contained 5 words.
So it said 23.
Not even close, right?
Large language models can't do this.
Let's see if an agentic model can.
All right, I'm curious.
Let's go.
New chat, here we go.
We're going to o1-preview, same thing, and let's see how it does.
It is thinking.
So now it's counting words, understanding paradoxes, crafting a response.
All right, we'll see how it does.
Evaluating response, avoiding exposure, thought for 13 seconds.
So it says, "The number of words in my response is nine."
Let's count. One, two, three, four, five, six, seven, eight, nine. Impressive. Simple for a human to do, right?
But think about it: if I'm quizzing you and I say, hey, answer this question and tell me how many words are in it, you've got to think a little bit first, right? You have to think ahead. You have to think like a human. This is how I'm going to respond. Before I respond, let me count how I would respond, and then
respond, right? So it shows this kind of agentic chain of thought, step-by-step reasoning.
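That "draft it, count it, then respond" loop can be sketched in a few lines. This is only an illustration of the idea, not how o1-preview works internally; the template string, the `NUMBER_WORDS` list, and the function name are all assumptions made up for the example.

```python
# Spelled-out numbers the sketch is allowed to try (an illustrative assumption).
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
]


def self_counting_response(template):
    """Try each candidate count, draft the reply, and keep the draft
    whose stated count matches its real whitespace-separated word count."""
    for n, word in enumerate(NUMBER_WORDS):
        draft = template.format(count=word)
        if len(draft.split()) == n:
            return draft
    return None  # no self-consistent draft among the candidates


print(self_counting_response("The number of words in my response is {count}."))
# The number of words in my response is nine.
```

The plain 4o answer skipped the drafting step and just emitted a number; the loop above only answers after checking its own draft.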
All right, we're going to do one last one. I don't want this podcast to turn
into a super long one here. This one I just thought would be interesting, right? So I'm going into
4o now. I'm saying, please create a realistic plan to address world hunger. Keep in mind the
current state of affairs, political limitations, geographical challenges, and technical implementation.
That's one thing, y'all.
I worked at a nonprofit for 10 years.
I am extremely excited about how this new model can be used for nonprofits to tackle huge societal issues, right?
We always think, and I get it, you know, large language models, AI taking jobs.
Yeah, it's going to take, y'all.
I don't care what anyone says.
AI is going to take more jobs than it creates, period.
I think the future of work is, you know, think about now, you know, people are, you know, doing
DoorDash and Uber and Lyft, right? Like, if I'm being honest, I think in 10 years, more people
are going to have their own small companies than people who have traditional full-time, you know,
W-2 40-hour week jobs. I don't think that's the future of work. I think the future of work is many
people are going to have many small companies. Anyways, I've always wanted to know how models can
solve problems, right?
I think it's a powerful thing to think about,
you know, because we only think about the bad stuff.
So here we go in 4o, it's breaking it down.
So it says, you know, immediate and short-term relief,
one to three years, improved food distribution logistics,
nutritional supplements, programs, key actions.
Okay, so it's doing a pretty good job.
Lays out a nice plan.
You know, we can't judge which one's better.
But, you know, I'm just curious,
I really want for this one, yes, I mean,
you can't compare these, you know, side by side and say one plan's better.
I want to see how a model, how a model made by humans that we don't think is,
is human.
How does it think about a problem like this, right?
That's what I'm really, I'm not even super interested in the output.
I'm interested in how it thinks about an issue like this.
So let's go into o1-preview.
We're going to enter a new chat and let's see how it thinks.
So navigating world hunger, I'm reading out the steps.
Oh, thought for five seconds.
Did not think very long about it.
Okay.
Interesting.
So it says the steps, navigating world hunger, understanding the challenges.
And then it created the plan.
So it looks similar in length.
And I'm looking at some of the, I'm looking at some of the kind of bullet points here.
So they did a similar job.
All right.
Let's go ahead and wrap this show up, y'all.
So as a very quick recap, brand new model, first of its kind, in OpenAI's o1-preview and o1-mini.
So very quickly, five things.
It is the first, number one, first model of its kind, chain-of-thought reasoning.
So am I blown away?
I'm impressed.
I'm impressed.
I need to put it through its paces.
I had very simple kind of test runs.
But we saw the chain-of-thought reasoning, and y'all, this is the worst, or sorry, yeah, this is the worst it's ever going to be.
It's only going to get better from here, and I think this is a big step toward AGI. Number two, not part of the GPT family.
Two separate series. So yes, right now there are limitations to this. You don't have Browse with Bing, file
upload, computer vision, etc. You know, like we talked about, this was originally code-named Q Star, Strawberry,
this agentic chain-of-thought thinking. So we'll see if in the future this does lead to that more
expensive model. Number three, right now, it is expensive to use in the API and extremely limited,
although the o1-mini, not too bad. Last, sorry, number four, the benchmarks, extremely impressive.
MMLU off the charts, in a league of its own. And then last but not least, I think we're going to
start getting so many large language model updates now. I think Claude, you know,
Claude from Anthropic, and Google Gemini, I think, are going to start
to improve and bring so many of their developer aspects
to the front end.
I see Meta probably rolling out agents sooner
than we might think.
I think this is kind of the first splash
that is going to really bring a wave.
All right, I hope this was helpful.
Like I said, double dose.
If you want the normal news, all that,
make sure to go to YourEverydayAI.com.
Thank you for tuning in.
Please subscribe.
If you're listening on the podcast or on YouTube,
please subscribe.
Let me know what you want to hear more of.
And I'll see you back tomorrow and every day for more Everyday AI.
Thanks y'all.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com
and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
