Better Offline - Exclusive: How GPT-5 Actually Works
Episode Date: August 22, 2025
In a bonus third episode, Ed Zitron reports exclusively on how OpenAI’s new "router-based" ChatGPT-5 makes it impossible for the company to cache the static prompt for any model or tool it uses ...every single prompt, doubling token burn for mediocre gains.
Better Offline listener deal: Get $15 Off Where's Your Ed At Premium! Deal goes until the end of August. https://edzitronswheresyouredatghostio.outpost.pub/public/promo-subscription/better-offline-discount
YOU CAN NOW BUY BETTER OFFLINE MERCH! Go to https://cottonbureau.com/people/better-offline and use code FREE99 for free shipping on orders of $99 or more.
BUY A LIMITED EDITION BETTER OFFLINE CHALLENGE COIN! https://cottonbureau.com/p/XSH74N/challenge-coin/better-offline-challenge-coin#/29269226/gold-metal-1.75in
---
LINKS: https://www.tinyurl.com/betterofflinelinks
Newsletter: https://www.wheresyoured.at/
Reddit: https://www.reddit.com/r/BetterOffline/
Discord: chat.wheresyoured.at
Ed's Socials: https://twitter.com/edzitron https://www.instagram.com/edzitron
See omnystudio.com/listener for privacy information.
Transcript
Cool Zone Media.
Hi, my name's Ed Zitron and welcome to Better Offline. This is also Jackass.
So you've just had a cheery two-part chucklefest about how generative AI may tank
our markets and our economy. So I'm going to give you a lighter one. An episode about
GPT-5, which is a model from OpenAI, and why just under three years of hype have led to the software
equivalent of the launch of St. Anger, except every time Lars Ulrich hit the snare drum, it cost them $55,000.
Now, if we look at the positive reviews, we see takes ranging from Simon Willison's tepid remark that
GPT-5 is "just good at stuff," to SemiAnalysis's completely insane statement that GPT-5 is setting the stage for
ad monetization and the ChatGPT super app, in a piece that makes several assertions about how the router that underpins GPT-5 is somehow
the secret way that OpenAI will inject ads, which is just distinctly silly. I'll get into
this in the episode a little bit, but with everything you're going to hear, you're going to
realise that this is just someone just saying stuff. Took four bylines to do that shit too.
I'm also British. I'm going to say "rooter." I might say "rowter" as well, because I've been here
a while. Make fun of my voice if you really must. But with that out the way, here's a quote from
SemiAnalysis's coverage: "Before the router, there was no way for a query to be distinguished,
and after the router, the first low-value query could be routed to a GPT-5 mini model that can answer with zero tool calls and no reasoning.
This likely means serving this user is approaching the cost of a search query."
This does not make any sense.
None of this makes... like, it's just a bunch of assumptions.
Why would this be the case?
The article also makes a lot of claims about the value of a question and how ChatGPT could, I am serious,
"agentically reach out to lawyers."
I'm not going to edit that out, because "agentically" is not a fun word to say. It's just complete nonsense.
And in fact, I'm not sure this piece reflects how GPT-5 even works at all. Again, quoting it:
"The router serves multiple purposes on both the cost and performance side. On the cost side,
routing users to mini versions of each model allows OpenAI to service users at a lower cost,"
or with lower costs even. To be fair to SemiAnalysis, it's not as if OpenAI gave them much help.
OpenAI's official writings about the router aren't exactly filled with details,
talking in glowing terms about what it does, but not how.
Here's what they say.
"ChatGPT's real-time router quickly decides which model to use
based on the conversation type, complexity, tool needs, and your explicit intent,
for example, if you say 'think hard about this' in the prompt.
The router is continuously trained on real signals,
including when users switch models, preference rates for responses, and measured correctness,
improving over time.
Once usage limits are reached, a mini version of each model handles remaining queries.
In the near future, we plan to integrate these capabilities into a single model."
And that last bit really doesn't make sense, but in any case, the launch of GPT-5 has been very, very weird.
At first, some people seemed really happy about it, chief among them software YouTuber Theo Browne, who has over 468,000 subscribers.
He's also known as Theo t3.gg, and he said,
"I didn't know it could get this good.
This was kind of the, like, 'oh, fuck' moment for me in a lot of ways.
And I've had to fight, like, a slow spiral into insanity.
It's a really, really good model."
He finished by saying,
"and keep an eye on your job because I don't know what this means for us long term."
Pretty crazy, right?
Comments on the video included people saying things like,
"if OpenAI is holding you hostage, blink twice,"
and yes, that is a verbatim quote.
Another saying, "this dude is everything wrong in IT today."
Another saying, "this video is sponsored by OpenAI."
Another saying,
every test project I gave it today. It's a lie in my experience. Maybe they haven't ramped up the GPUs.
Now, from what I can tell, Theo Browne played with GPT-5 in OpenAI's offices and did all the benchmarking there.
OpenAI, by the way, fucking hell. Come on, you can't benchmark in their offices. Anyway, OpenAI's API-based
access to GPT-5 models, you know, the thing that you use if you want to integrate GPT into your app,
does not route them, by the way, nor does OpenAI offer access to its router or any associated models.
Important detail. Just want you to know that because we need to make sure we're very clear.
Now, a week later, Theo Browne would put out another video called "I Was Wrong About GPT-5,"
which he would open by saying,
"So first and foremost, I want to make sure it is very, very clear that the experience that you probably are having with ChatGPT and GPT-5 right now
is not the experience that I had when I was first testing it."
Browne goes on to explain that he was not paid by OpenAI at all,
that he was sincerely impressed by the company and GPT-5,
and that he'd actually spent over $25,000 in inference testing it on his own company's software,
and indeed also that he turned down a grand appearance fee.
I mean, that's a very British thing, a thousand dollar appearance fee,
not just like a really nice one.
Browne claims he asked OpenAI to try it out,
and after they declined to let him test it early on his own,
he was invited to try it on camera with a small group of other people at OpenAI's offices,
where they'd film his reactions.
He said that the API was incredible, but that it's become apparent that the models he was using in the video were not the same as those released to the public. He made a post on August 13th on X, the everything app, saying that GPT-5 was nowhere near as good in Cursor as it was when he was using it a few weeks ago, complaining that things that worked while demoing it at OpenAI no longer did, and adding that there was somebody else on Twitter who said they had a similarly great experience of GPT-5 on launch that has since decayed.
It isn't completely clear what happened here, but I'm going to guess that OpenAI showed Theo Browne and others in their offices some sort of heavily modified version of the model that burns significantly more compute to provide its outputs, though I'm also very suspicious of how significant the difference is here.
Browne's video attempts to show the difference between the generations that he received from the model when it was good and when it was bad, and I'll include a link to it in the episode notes, but if I'm honest, they look pretty similar in that they're kind of mediocre. I'm not saying
that as a hater, by the way. They just kind of look like shit. It's just kind of, okay, like shit.
They look like regular, fucking generated websites. They don't look special. The good one is fine and the
bad one has weird gradients on it. This whole thing sucks, though, and was a clear setup by
OpenAI to overstate the abilities of GPT-5, one that fell apart with the lightest brush with reality.
I imagine their assumption was that Browne would post a glossy video and then walk away,
and I give Theo some credit for straight up stating he was misled. This
was a desperate move and one that blew up in the face of OpenAI, along with the rest of the GPT-5 launch.
People hate the model. Customers are mad at OpenAI for taking models like 4o away and have
remained mad even with their return. And the ChatGPT subreddit is almost entirely people complaining
about how ineffective the new version is and how even GPT-4o is not the same. They got game of
brain baby. As I said in last week's monologue, I believe OpenAI has grown a fandom rather than
any kind of sustainable product-market fit, and they're now suffering fandom-like hate,
with every minor change they make in an attempt to push GPT-5 further
only aggravating people that barely understand why they use the product to begin with.
Yet at the center of the anger lay the reason for GPT-5's launch,
the belief that this was somehow a cost-cutting measure,
where OpenAI had added a router to ChatGPT as a means of sending certain requests to cheaper
models to save money.
But when I hear router, I hear latency, and I never, for even a second, believed that this
would somehow be cheaper to run. It didn't make sense. I'm a curious little critter, so I went and
found out how ChatGPT-5 actually works. And unlike the following incredible products that you should
buy, it's actually kind of a big piece of shit.
And we're back.
And from here on out, I will define two things:
GPT-5, referring to the model and its associated mini and nano models, and ChatGPT-5,
referring to the current state of ChatGPT, which features Auto, Fast, Thinking,
and Thinking mini model selections.
You also can see legacy models, but that's not what we're talking about today.
And that's also only for a little bit.
It's a distinction I have to make, by the way, and make early, because the two things are different.
They work in different ways, and ChatGPT-5's structure introduces a bunch of trade-offs and downsides
that, as I'll discuss later, make this whole thing even more wasteful.
In discussions with a source at an infrastructure provider familiar with the architecture,
it appears that ChatGPT-5 is in fact potentially more expensive to run than previous models,
and due to the complex and chaotic nature of said architecture can at times spend upwards of double the tokens per query.
Tokens, for those who don't know, are basically chunks of text that the AI models do stuff with.
I'm simplifying this. Do not email me and correct some minor thing. Nobody cares.
A sentence like "the quick brown fox jumps over the lazy dog" will be broken into lots of
smaller four-character chunks. There are different kinds of tokens and they're all priced
differently. An input token refers to the data you send to the model when you ask it a question.
Output tokens are used to measure the size of its response, with bigger responses
requiring more tokens. The more tokens you burn per query, the more expensive it is to run that
query. The fact that ChatGPT-5 can, in certain circumstances, burn twice the number of
tokens per query means that every question costs more. ChatGPT-5 is also significantly more
convoluted, plagued by latency issues, and more compute-intensive thanks to OpenAI's new
"smarter, more efficient" model routing system. In simpler terms, every user prompt on ChatGPT,
whether it's in Auto, Fast, Thinking or Thinking mini, starts by putting the user's prompt
before the static prompt. I don't want to lose you here. This is important. A static prompt is the
invisible instructions given by OpenAI to ChatGPT, the models themselves, and the
tools associated with them to tell them how to operate. Instructions like "you are ChatGPT, you're a
large language model, you're a helpful chatbot, do not threaten them with a knife," and so on and so
forth. These static prompts are different with each model you use. A reasoning model will have a different
instruction set to a more chat-focused one, such as "think harder about a particular problem
before giving an answer," or "break down problems into component answers when you get a certain
thing," like if someone asks you a coding question, query a coding tool, that kind of thing.
A user prompt is exactly what it sounds like. The thing that a
user wants the AI model to do. The new order in ChatGPT-5 becomes an issue when you use multiple
different models in the same conversation because the router, the thing that selects the right model
for the request, has to look at the user prompt. It can't consider static instructions first
because they may be different based on what the user asked. In fact, the order has to be flipped
for the whole thing to work. Put simpler, previous versions of ChatGPT would take the static prompt
and then invisibly append the user prompt onto it. This static prompt would typically be cached,
massively reducing the amount of compute the model needs to perform a task.
ChatGPT-5 cannot do this.
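To make the ordering point concrete, here is a minimal sketch of why prompt order matters for caching. It assumes the cache works on exact matching prefixes, which is how prompt caching is generally described, and it uses made-up word-level "tokens" and a made-up static prompt. The names `STATIC_PROMPT`, `USER_PROMPTS`, `shared_prefix_len` and `cacheable_tokens` are hypothetical, and none of this is OpenAI's actual implementation.

```python
def shared_prefix_len(a, b):
    """Count how many leading tokens two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical stand-ins: a real static prompt is far longer than this,
# and real tokens are not whole words.
STATIC_PROMPT = "You are a helpful assistant . Follow the rules below .".split()
USER_PROMPTS = [
    "Summarize this email".split(),
    "Write a haiku about routers".split(),
]


def cacheable_tokens(requests):
    """For each request after the first, count the leading tokens it shares
    with the previous request, i.e. what a prefix cache could reuse."""
    return [shared_prefix_len(prev, cur) for prev, cur in zip(requests, requests[1:])]


# Old order: static prompt first, user prompt appended.
static_first = [STATIC_PROMPT + user for user in USER_PROMPTS]
# Flipped order: user prompt first, static prompt after.
user_first = [user + STATIC_PROMPT for user in USER_PROMPTS]

print(cacheable_tokens(static_first))  # the whole static prompt is reusable
print(cacheable_tokens(user_first))    # nothing is reusable: the first token differs
```

With the static prompt in front, the second request shares every static token with the first. With the user prompt in front, the two requests diverge at the very first token, so a prefix cache has nothing to reuse.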
Every time you use ChatGPT-5,
every single thing you say or do can cause it to do something different.
Attach a file? Might need a different model.
Ask it to look into something and be detailed?
Might trigger a reasoning model or a different depth of reasoning.
Ask a question in a weird way?
Sorry, the router is going to need to send you to a different model entirely.
Each time, it comes up with new instructions based on the subtle interpretation of what you asked it.
Every single thing that can happen when you ask ChatGPT to do something may trigger the router to change model
or request a new tool, and each time it does so requires a completely fresh static prompt,
regardless of whether you select Auto, Thinking, Fast or any other option on ChatGPT.
This in turn requires it to expend more compute,
with queries consuming more tokens compared to previous versions.
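As a rough illustration of what "more tokens" means in money, here is a back-of-the-envelope sketch. The per-token prices and token counts are invented purely for the arithmetic and are not OpenAI's actual rates or measurements; `Pricing` and `query_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    input_per_million: float   # dollars per 1M input tokens (hypothetical)
    output_per_million: float  # dollars per 1M output tokens (hypothetical)


def query_cost(input_tokens, output_tokens, pricing):
    """Cost of one query: input and output tokens are priced separately."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Invented numbers, chosen only to show the shape of the calculation.
PRICING = Pricing(input_per_million=1.0, output_per_million=10.0)

baseline = query_cost(2_000, 500, PRICING)
doubled = query_cost(4_000, 1_000, PRICING)  # same query, twice the tokens

print(f"baseline ${baseline:.4f}, doubled ${doubled:.4f}")
```

Because cost is linear in token count, a query that burns twice the tokens costs twice as much, whatever the real prices are.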
It's like you started a job, and every time you do a task,
write an email, make a cup of coffee, attend a meeting, email someone with a threat,
your workplace requires you to complete the entire mandatory onboarding training first.
Want to edit a spreadsheet? Not before you brush up on your anti-bribery legislation first, you prick.
As a result, ChatGPT may be smart, but it doesn't really seem efficient in the GPT-5 version.
Now, to play devil's advocate, OpenAI likely added the routing model as a means of creating a more sophisticated output for a user,
and I imagine with the intention of cost saving.
Then again, this might just be the thing it had to ship.
After all, GPT-5 was meant to be the next great leap in AI and the pressure was on to get it out the door.
By creating a system that depends on an external routing model, likely another LLM in this case,
OpenAI has removed the ability to cache the hidden instructions that dictate how the models generate answers in ChatGPT,
creating massive infrastructural overhead.
Worse still, this happens with every single turn, as in message, on ChatGPT-5,
regardless of the model you choose, creating endless infrastructural baggage with no real way out
that only compounds based on how complex a user's queries get, or how much they change.
They could be simple, but just going in different directions every time.
Could OpenAI make a better router?
Sure.
Does it have a good one today?
No.
Every time you message ChatGPT, it has the potential to change model or tooling based on its own whims,
each time requiring a fresh static prompt, and short of totally reworking the architecture
of ChatGPT-5, there's no way to change this.
And if it's an LLM choosing which model, I don't know, maybe it hallucinates.
Just a guess.
It doesn't even need to be the case where a user asks ChatGPT-5 to think.
And based on my tests with GPT-5, sometimes you can just ask it a four-word question,
and it will think about it for no apparent reason.
OpenAI has created a product with latency issues and an overwhelmingly convoluted routing system
that's already straining capacity, to the point that this announcement feels like OpenAI is walking away from its API entirely.
This, as a reminder, is the thing that people use to incorporate OpenAI's models into their apps,
while also running said models on the infrastructure OpenAI rents from Microsoft, CoreWeave and, at some point, Oracle as well.
And this API thing is really weird, by the way, because these are new models, but OpenAI is really not talking about the models themselves that much.
Unlike the GPT-4o announcement, which mentions the API in the first paragraph, the GPT-5 announcement has no reference to it,
and only has a single reference to developers at all when talking about coding.
Sam Altman has already hinted that he intends to deprecate any new API demand, though I imagine he'll let in
anyone who will pay for priority processing, which is essentially OpenAI's way to require
minimum commitments and extra payments from API customers, just so they never feel the bite of
any compute shortages and throttling, which they absolutely will do to people that don't pay.
ChatGPT-5 feels like the ultimate comeuppance for a company that has never been forced to build
a product, choosing instead to bolt increasingly complex tools onto the side of models in the
hopes that one will magically appear. Now each and every feature of ChatGPT burns more money than
it ever did before.
ChatGPT-5 feels like a product that was rushed to market by a desperate company that had to get something out the door.
In simpler terms, here, it's actually really funny.
When I worked this out, I chuckled vigorously.
This is just a case where OpenAI has given ChatGPT a middle manager.
But now I'm giving you the chance to open up your hearts and do something better.
Open up your wallets too and send money to a company that follows here.
Behold my advertisements.
And we're back.
Like every great middle manager, ChatGPT-5's router creates more work based on its own interpretation of what's going on,
and as a separate large language model, I can't imagine it has a ton of training data available.
If I had to guess, and this is a guess, by the way, OpenAI has done and will do a lot of fine-tuning and reinforcement learning to make it work.
Though to give it a little grace, this is a new thing that it's doing, and it's doing it at sort of a huge scale.
The problems start, by the way, with the fact that ChatGPT-5 is taking the user's initial prompt and then deciding which model to use.
Previous versions sent your prompt directly to the model along with the static prompt,
which was cached and came first, an important feature in how these models limit token burn.
OpenAI now starts with a router model that takes what you ask, gives it to ChatGPT, and tags it based on what kind of thing your question might need.
The thing might be a tool, such as whether it has to do a web search to spit out the thing at the end,
a reasoning model, whether it needs to use a coding language, and so on and so forth.
Once ChatGPT has bounced your query across various models, burning compute along the way,
it then pushes it towards the chat portion of the generation.
And each time you ask ChatGPT a question or to do something,
a new specialised static prompt is generated, sometimes several,
making it impossible to cache them in advance.
In simpler terms, each time you message it, ChatGPT has to dump all cached information and instructions
for what it needs to do and reload them with each prompt.
Now, here are some examples of what ChatGPT-5 has to reload every single
time you prompt it. Whether or not to use a browser or search the internet and under what conditions
to do so, because they will change with each prompt. How to approach a particular problem based on
what the user asked, including any specific ways it's meant to answer, tone, brevity and so on,
based on their request. Specifics around how it might use, say, OpenAI's code interpreter,
such as the usage rules for running a Python script or how you want the code's output, which, again,
will be different based on each prompt. And you can even say "do it in exactly the same way,"
and because it's a large language model, it may
hallucinate something different. Every single goddamn time you prompt ChatGPT-5, it has to do this.
Worse still, a particular conversation can involve you using multiple different models and tools,
requiring it, with each and every prompt, to inject a different static prompt for each component
that ChatGPT-5 uses. And you can't cache the static prompt before the user's intent,
because if you did that, it might send an instruction to a model that doesn't make sense,
such as telling a reasoning model to give a quick and simple answer, or a mini or nano model to do some
sort of deep reasoning, which would create a crappy answer and burn tokens in the process.
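Here is a toy sketch of that dependency: the static prompt can only be picked once the user's prompt has been read. The keyword rules, route names and static prompts below are all invented for illustration. As described in this episode, the real router is a separate model and its prompts are not public.

```python
# Hypothetical static prompts, one per route. The real ones are not public.
STATIC_PROMPTS = {
    "fast": "Answer quickly and simply.",
    "thinking": "Reason step by step before answering.",
    "code": "Use the code tool and return the script's output.",
}


def route(user_prompt: str) -> str:
    """Toy keyword router standing in for the routing model (an assumption)."""
    text = user_prompt.lower()
    if "python" in text or "script" in text:
        return "code"
    if "think hard" in text or "explain why" in text:
        return "thinking"
    return "fast"


def build_request(user_prompt: str) -> tuple[str, str]:
    """Pick the static prompt only after the user prompt has been inspected."""
    chosen = route(user_prompt)
    return chosen, STATIC_PROMPTS[chosen]


conversation = [
    "What's the capital of France?",
    "Think hard about why that city became the capital.",
    "Write a Python script that prints it.",
]

routes = [build_request(turn)[0] for turn in conversation]
# Every change of route means a different static prompt has to be loaded.
switches = sum(a != b for a, b in zip(routes, routes[1:]))

print(routes, switches)
```

In this three-turn conversation every turn lands on a different route, so no turn can reuse the static prompt loaded for the turn before it.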
And this is all thanks to the complicated way that OpenAI insisted on building GPT-5.
Every single thing you send to ChatGPT can trigger it to use a different series of
models, audio, vision, reasoning, each with their own instructions, static prompts, all while
pulling in different tools, each requiring their own instructions based on what you asked, and reasoning
models even have different depths of reasoning. Unlike 4o, which is a multimodal model combining
text, vision, and voice, GPT-5 is a rat king of OpenAI's models and tools that gets reborn
every single time you ask it to do anything. It can prompt-cache some things, but the core
instructions, not so much. But let's get a little more granular, because I know I've been quite
repetitive, but this is detailed. So from what I've been told, there are either one or two models
at work for the routing. I'm going to go with what I think is most likely based on the discussions I've
had with people familiar with the architecture. I've heard the term orchestrator thrown around,
potentially suggesting the router may be more omnipresent throughout the process, but I was unable
to confirm its existence. Reach out if you hear differently. I'll explain things as they were explained
to me, though. When a user sends a prompt, it goes through the splitter leg, which decides to send the
query down one of two paths. One is called the fast path, where a query is straightforward, such as a text-only
conversation that doesn't require any analysis or extra tools. The other is thinking, a path where the query
may require reasoning, or more complex tools like code generation or access to a web browser
for research. To be clear, there are prompts that may be split into multiple parts that trigger
multiple models or tools, each requiring their own static instructions. From what I understand,
the splitter model is a completely separate large language model, though we don't have a ton of
details about it. I also, based on conversations I've had, think there's a chance that there could be a separate
model that sits above the splitter that does much lighter classification of how a query
might be routed. So you ask it to do something, it might just go, okay, this looks like it needs
a tool. But, and going off, why now? In any case, none of this can be cached because all of this
exists before inference, and, by the way, inference is something I've misstated in the past
as inferring meaning. Inference is everything that happens to get an output to you. So all
of the stuff that's happening. And by the way, this is all a completely new cost that OpenAI
has created. No one does this like this. It's so fucking stupid. But now we get to the chat leg.
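Before the chat leg, the splitter step just described can be sketched in a few lines of Python. This is a rule-based stand-in with hypothetical names (`Query`, `split`, and the two boolean flags are invented here); the splitter is described above as a separate large language model, so the real decision would not be a pair of if-statements.

```python
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    needs_tools: bool = False      # e.g. code generation or a web browser
    needs_reasoning: bool = False  # e.g. multi-step analysis


def split(query: Query) -> str:
    """Pick a path for the query: 'fast' or 'thinking'.

    Hypothetical stand-in for the splitter model: straightforward text-only
    queries go down the fast path, anything needing reasoning or more
    complex tools goes down the thinking path.
    """
    if query.needs_tools or query.needs_reasoning:
        return "thinking"
    return "fast"


if __name__ == "__main__":
    print(split(Query("What's the capital of France?")))
    print(split(Query("Research this company's filings", needs_tools=True)))
```

The sketch leaves out the case mentioned above where a single prompt is split into multiple parts, each of which would get its own path and its own static instructions.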
Now that OpenAI has added layers of abstraction, it can begin cooking up the output, by which I mean
do inference. The chat leg is where the pieces that the splitter model created are pulled together,
each loaded with their respective static prompts based on what the user asked ChatGPT-5 to do.
Each piece of the model, a tool to generate Python, an image generation tool, a reasoning model
to generate an output, has to process an entirely new static prompt. And again, that's every
interaction. Remember, static prompts are effectively instructions, so the splitter model has told each
piece of the pie how to act to create a particular output. As a result, much of this can't be
cached, creating more and more repetitious token burn per response, and causing me to have to repeat this stuff
so that you really get it. The upshot of the chat leg's static prompt baggage is that you can do a little
more here, at least in theory. Because each component can be instructed separately, they can, again,
in theory, be made to give more individualized specialized outputs, like creating an image with text,
that is, as I'll give an example of, very shortly, generated using a specific reasoning model.
I'm clutching at straws here. I don't really know if this is better, but I'm trying to be reasonable.
I'm trying to be normal. Every day I try and be normal. Previously, OpenAI's advantage was that a model
like 4o was kind of a jack of all trades. But to get the benefits of ChatGPT-5, and that's in air quotes,
it's engaged a conductor model that can just make things more convoluted, even in the case of simple requests.
An example. You upload a chart of NFL players' stats and ask ChatGPT to decide which is the best
of the group and create an image to show the results. In GPT-4o, ChatGPT would use one model
and thus one static prompt to look at the image, decide which tools to use and then how to format
the response. You only needed one prompt, which was cached, because one model can look at
the stats, pull the data and make the decisions and then use the image generation tool to make
the final image. In GPT-5, the ChatGPT conductor model would see the stats routed to a vision
model requiring its own static prompt, then a separate text-only reasoning model, one that has no
ability to use tools, but that might be cheaper to get an answer from, and also requires a static
prompt. And that would then decide which players are best, and then spit out an output,
and then route it to a completely separate model that can generate text to query the image tool.
Again, you need a static prompt for this to then generate the image. On top of all this onerous
baggage lies another problem: GPT-5's various models are just more complex. By splitting out the
component elements of what a model can do and allowing each model to have different levels of
reasoning, even the cheaper ones like Mini and Nano, OpenAI has created an endless combination of
different reasons to have to make a brand new static prompt instruction, all automated by a
router, a large language model that chooses what large language model to choose for a query.
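The NFL chart example above comes down to counting static prompts, which a short Python sketch can make explicit. The `Step` type, the component labels, and the `tokens_per_prompt` figure are hypothetical names and numbers chosen for illustration; only the one-versus-three count comes from the example as described, and the router's own overhead is not counted.

```python
from typing import NamedTuple


class Step(NamedTuple):
    component: str
    job: str


# GPT-4o as described: one model, one static prompt, for the whole request.
GPT_4O_FLOW = [
    Step("multimodal model", "read the chart, pick the best player, call the image tool"),
]

# GPT-5 as described: three components, each with its own static prompt.
GPT_5_FLOW = [
    Step("vision model", "read the uploaded chart"),
    Step("text-only reasoning model", "decide which player is best"),
    Step("text model for the image tool", "write the query that generates the image"),
]


def static_prompts_needed(flow: list) -> int:
    """One static prompt per component in the flow."""
    return len(flow)


def uncached_static_tokens(flow: list, tokens_per_prompt: int, cached: bool) -> int:
    """Static-prompt tokens processed from scratch for a single request.

    tokens_per_prompt is a made-up size; real static prompts vary in length.
    """
    return 0 if cached else static_prompts_needed(flow) * tokens_per_prompt


if __name__ == "__main__":
    print(static_prompts_needed(GPT_4O_FLOW), static_prompts_needed(GPT_5_FLOW))
```

With an assumed 1,000-token static prompt, `uncached_static_tokens(GPT_5_FLOW, 1000, cached=False)` comes to 3,000 tokens for the request, against zero for the cached single-prompt flow. The absolute numbers mean nothing; the ratio is the point.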
It is, if I'm honest, kind of funny. Reasoning models work, when simply described, by breaking
up a prompt into component pieces looking over them and deciding what the best course of action might be.
ChatGPT's router is effectively an abstraction higher, breaking up the prompt into component pieces,
then choosing different models for each of those pieces, which may in turn be broken up by a reasoning model.
While I wouldn't say this is a hat-on-a-hat situation, it is at this point unclear what exactly the benefits of ChatGPT-5's new architecture are.
Less hallucinations?
Better answers?
Based on what I've been told, this was a decision made to increase the model's performance.
What I can say is that this very likely increased OpenAI's overhead at a time when it needs to do the exact
opposite. Even if ChatGPT-5 pushes people towards cheaper models, it does so while guaranteeing extra
costs and latency, and whatever signals it may learn as people use this will have to create significant
benefits, massive 100%-plus gains, for it to be anything close to worthwhile. While OpenAI's router may be
smart in terms of the nuance of how it might answer a query, and even then I question it, it most decidedly is not
more efficient and may have actually increased the burn rate for a company that will lose as much as $8 billion
this year. And I think that number might be low too. Yet what I'm left with in writing
this script is how wasteful all of this is. OpenAI, a company that has already incinerated upwards of
$15 billion in the last two years, has chosen to create a less efficient way of doing business as a
means of eking out modest-at-best performance improvements. It just sucks. In our own lives, we're
continually pushed and pressured and punished if we get into debt, judged by our peers and our parents,
if we spend our money recklessly, and if we're too reckless, we find ourselves less likely to receive
anything from credit to housing. Companies like OpenAI live by a different set of standards.
Sam Altman intends to lose more than $44 billion by the end of 2028 on OpenAI, and graciously told
CNBC, like Lord Farquaad, that he was willing to run at a loss for a long time, where he was treated
like he was this smart, reasonable decision-maker rather than someone that needed to rein in their
horrendous spending habits and be more mindful.
The ultra-rich are rewarded far more for their errant spending habits than we ever are for any
thriftiness or austerity measures we make, and none of us are afforded the level of grace that
clammy Sam Altman has been, and "has-been" feels appropriate. ChatGPT-5 is an engineering nightmare,
a phenomenally silly and desperate attempt to juice what remains of the dying innovation and excitement
within the walls of OpenAI. It's not November 2022 anymore, and let's be honest, there really hasn't
been anything exciting or interesting out of this company since GPT-4. There's nothing exciting
happening at this company. As many as 700 million people a week allegedly use ChatGPT, but nobody
can really say why. And OpenAI, despite its massive popularity, cannot seem to stop losing billions of
dollars. And it can't seem to explain why that's necessary other than this shit's really expensive,
dude. Can anyone actually articulate a reason why we need to burn billions of dollars to do this?
What are we doing? Why are we doing it? Has everybody just agreed
to do this until it becomes completely untenable? Do we all yearn for the abyss so much that we
can't find camaraderie in admitting we were wrong? Look at GPT-5. This is, if you believe the hype,
the best-funded, best-resourced company in the world with the greatest mind at its helm and the
greatest minds within its walls, and this is the best they've got. A large language model that chooses
which large language model will answer your question. Gee-fucking-whiz, Sam Altman, sounds dandy,
and how much better is this, you say? Oh, you can't really say?
Fucking brilliant. Hey, does it do anything new? No? Oh, what's that? It's actually our job to work that out for ourselves? Thanks, man. I love it. I love this shit. And if you're someone that is a hype merchant listening to this, and you've done really well getting to the end of the third part, by the way, I respect you. I want you to email me and explain why they should be justified in burning billions of dollars. If you tell me Uber, if you tell me AWS, I will eat you alive. I mean that, I mean that completely literally.
I'll unhinge my jaw, I'll eat you like Kirby and shit out a dunce. I've said that one before, but I'm
going with him. In any case, this three-parter has also really reminded me how ridiculous this is, how
nonsensical things have become, and how much waste has been kind of justified, justified on this
idea that this will become something by people that don't really know what it does today or might
do in the future. None of this is going to end well. And not even the boosters seem to be having fun
anymore. Everybody's just flailing around waiting for it to end. Even Sam
Altman seems tired of it all. I know I bloody well am.
Thank you for listening to Better Offline. The editor and composer of the Better Offline
theme song is Matt Osowski. You can check out more of his music and audio projects
at mattosowski.com. M-A-T-T-O-S-O-W-S-K-I.com. You can email me at
ez@betteroffline.com or visit betteroffline.com to find more podcast links and of course my
newsletter. I also really recommend you go to chat.wheresyoured.at to visit the Discord and go to
r/BetterOffline to check out our Reddit. Thank you so much for listening.
Better Offline is a production of Cool Zone Media. For more from Cool Zone Media, visit our website,
coolzonemedia.com or check us out on the IHeartRadio app, Apple Podcasts, or wherever you get your
podcasts.
