TBPN - OpenAI Day: GPT-5 Unveiled | Mark Chen, Greg Brockman, Sarah Friar, Max Schwarzer, Brad Lightcap & More
Episode Date: August 7, 2025
(01:18) - AI Model Whiteboard Breakdown
(26:23) - Mark Chen, Chief Research Officer at OpenAI, discusses the recent launch of GPT-5, emphasizing its enhanced reasoning capabilities and seamless integration of various AI models to improve user experience. He highlights the model's ability to perform complex tasks more efficiently, reducing the need for users to choose between different model versions. Chen also touches on the importance of personalization and memory in AI, aiming to make interactions more intuitive and tailored to individual users.
(57:52) - Greg Brockman, co-founder and president of OpenAI, discusses the evolution of the GPT series, highlighting the progression from GPT-1's foundational capabilities to GPT-5's transformative impact on software engineering. He reflects on the challenges and breakthroughs in developing these models, emphasizing the importance of scaling and infrastructure in achieving advanced AI functionalities. Brockman also touches on the broader implications of AI, including its role in enhancing human productivity and the necessity for responsible development to maximize societal benefits.
(01:31:55) - Sarah Friar, OpenAI's Chief Financial Officer since June 2024, previously served as CEO of Nextdoor and CFO at Square. She discusses the rapid growth of ChatGPT, now with 700 million weekly active users, the expansion of enterprise adoption to 5 million paying business users, and the importance of substantial investments in compute infrastructure to support future AI developments.
(01:52:13) - Dedy Kredo, Co-Founder and Chief Product Officer of Qodo (formerly CodiumAI), discusses the integration of GPT-5 into their platform to enhance code review processes. He highlights the model's improved capabilities in generating high-quality code reviews, identifying bugs before production, and ensuring enterprise code aligns with best practices. Kredo emphasizes the importance of AI agents in automating code review tasks while maintaining human oversight to verify code quality and adherence to standards.
(01:59:11) - Zach Lloyd, founder and CEO of Warp, discusses the significant advancements in AI models, emphasizing their enhanced capabilities and cost-effectiveness, which are particularly beneficial for individual developers and small teams. He highlights the importance of competition among model providers to drive down prices and improve quality, expressing hope for a future where multiple competitive models coexist, similar to cloud service providers. Additionally, Lloyd addresses the challenges of model deprecation, noting that for application-level stacks like Warp, transitioning to the latest models is straightforward and advantageous.
(02:11:15) - Riley Tomasek is a serial entrepreneur and the Founder & CEO of Charlie Labs, home of an AI-driven "autonomous TypeScript engineer" designed to accelerate code reviews and merge processes. Previously, he co-founded Flight (acquired by Figma) and launched Dexa, an AI platform that transforms podcast discovery. Riley holds a B.Sc. in Mathematics & Computer Science from the University of British Columbia and has a track record of building developer-friendly tools and interfaces.
(02:18:31) - Guillermo Rauch, founder and CEO of Vercel, discusses the transformative impact of AI on software development, emphasizing the shift towards "vibe coding," where natural language prompts generate code and user interfaces, making software creation more accessible. He highlights the role of AI agents in automating tasks, enabling developers to focus on higher-level management and creative processes. Rauch also explores the future of developer tools, noting the importance of integrating AI capabilities to enhance productivity and streamline workflows.
(02:34:28) - Eno Reyes, co-founder and CTO of Factory, discusses how their platform integrates AI agents into every stage of the software development lifecycle, including coding, code review, maintenance, incident response, and documentation. He highlights the platform's focus on large enterprises with over 1,000 engineers, addressing challenges like migrating numerous codebases to new frameworks and modernizing legacy systems. Reyes emphasizes that while AI tools can accelerate individual developers, significant productivity gains require workflow changes that incorporate agents throughout the development process.
(02:40:20) - Guy Gur-Ari, co-founder and Chief Scientist at Augment Code, discusses the company's AI coding assistant designed for large teams with extensive codebases, emphasizing its capabilities in question answering, development, refactoring, and migrations. He highlights the thoughtful nature of GPT-5, noting its propensity for tool calls and clarifying questions before code modifications, making it particularly effective for complex tasks. Gur-Ari also mentions Augment's focus on developing proprietary integrations and tools, aiming to enhance the agent's performance without relying solely on external model vendors.
(02:48:20) - Harjot Gill, CEO of CodeRabbit, discusses the significant improvements observed with GPT-5 in their AI-driven code review platform, noting a near doubling in performance compared to previous models. He emphasizes that these enhancements will be available to customers at no additional cost, reflecting the rapid evolution of AI capabilities. Gill also highlights the company's focus on monitoring real-world performance metrics, such as user conversion rates and potential issues like hallucinations, to ensure the model's effectiveness and reliability.
(02:52:34) - Timeline
(02:57:01) - Max Schwarzer, a leading researcher at OpenAI, discusses the recent launch of GPT-5, highlighting its significant advancements in coding capabilities and its potential to revolutionize user interactions by enabling the creation of personalized applications without prior coding knowledge. He emphasizes the importance of refining the post-training process to enhance the model's accuracy and reliability, particularly in reducing hallucinations and improving user engagement. Schwarzer also touches on the future trajectory of AI development, expressing optimism about the integration of reinforcement learning to extend AI's applicability beyond textual domains into real-world interactions.
(03:13:26) - Scott Wu, co-founder and CEO of Cognition, discusses the significant advancements in AI coding models, noting that OpenAI has caught up to Anthropic, leading to a competitive landscape. He emphasizes the importance of integrating AI agents like Devin into software engineering workflows to enhance capabilities and efficiency. Wu also highlights the evolving role of engineers, suggesting a shift from "bricklayers" to "architects" as AI tools handle more complex tasks.
(03:23:21) - Claire Vo, founder of ChatPRD and former Chief Product Officer at LaunchDarkly, discusses the developer-centric design of GPT-5, noting its enhanced coding capabilities but expressing concerns about its verbosity and tendency to produce lengthy outputs. She emphasizes the importance of validating new models with users, especially in business contexts where concise communication is crucial. Vo also highlights the need for AI models tailored to specific roles, such as strategists, to better serve diverse professional needs.
(03:33:34) - Brad Lightcap, OpenAI's Chief Operating Officer, discusses his multifaceted role, which includes responsibilities ranging from project management to sales, and emphasizes the significant improvements observed with the launch of GPT-5. He highlights the diverse applications of OpenAI's models across various industries, such as pharmaceuticals, customer support, and everyday productivity tools, underscoring the transformative impact of AI on organizational efficiency...
Transcript
You're watching TBPN.
Your background looks way different because you have a whiteboard behind you because we're breaking down the X's and O's of the GPT-5 launch today.
That's right.
Launched from OpenAI.
Really quickly, there is some other news.
Firefly Aerospace stock opened at $70 in NASDAQ debut.
This is the company that landed on the moon.
Very cool.
Very cool.
There are a few other stories going on, but we're going to skip most of them because we're going to be focusing on ChatGPT today, on GPT-5.
We have a bunch of guests coming on.
We have a stacked lineup.
We'll pull that up, but we'll break down the X's and O's
of the matchup.
So of course, OpenAI, here's our lineup.
We have something like 15 guests today.
A ton of folks from OpenAI, a ton of people that build on top of OpenAI and can comment on what's going on with ChatGPT.
But of course, this battle is between OpenAI and the timeline.
It's the, it's, they got to get the vibes right.
It's war.
It's war.
It's, it's, the timeline's in turmoil over whether or not this is a good model, what it means
for the industry, what it means for AGI timelines.
Everyone's got their take.
Everyone's posting memes.
There's been a ton of funny ones already.
We'll take you through them, of course.
But let's break down the offense today.
We have Sam Altman, the founder, CEO.
He briefly got cut from the team in November of 2023, but he's back leading the team for
the 2024, 2025 seasons.
He seems healthy. He's doing great today. He went on at 10 a.m. to break down the launch of GPT-5.
He has a couple of key plays in his playbook, in his arsenal. He's got a solid ground game.
Lots of quick posts hitting the timeline, probably in lowercase. Then he might air it out with a
couple thousand word essay. We've seen him do this before. It's a bit of a Hail Mary. Maybe AGI's a
couple thousand days away. Maybe we're in the soft singularity. But he's very strong there with the long
post when he needs to be. It's up his sleeve if he needs it. Then he can also pull out the vague
posting. He was doing this last night, posted a picture of the Death Star. No one knows what it means.
Maybe it was taking a shot at the doomers who are on the defense today. So he's also known for driving
supercars. That lets him get to the office faster. He's saving time and money. You can save time and
money by going to ramp.com. Easy-to-use corporate cards, bill pay, accounting, and a whole lot more, all in
one place. And so he is, he also gave apparently, this is a rumor, he gave every OpenAI employee who's
been with the company for more than two years, $1.5 million. A lot of people say,
$1.5 million, that's not enough for a big house in San Francisco, but it is enough for a
supercar. So that's probably why he picked that number. And that's why, that's what the OpenAI team
will be doing with that money. They'll be buying Aston Martin Valkyries, Pagani Huayras, McLaren
Sabres, Ferrari Daytona SP3s. They can get a Koenigsegg Gemera. They could get a Singer DLS or a Bugatti
Veyron. It would have to be used. They could also get the Bentley Bacalar. There's only,
there's only 12 of those ever made. It's an open-top two-seater roadster. It's coachbuilt. So
that's going to run you 1.5 million, but that's perfect. You just got the 1.5 million bonus.
So put it to work, spend it all in one place on a car. This is financial advice.
Yes, exactly. Then you got Greg Brockman. He's joining at noon. He's, he's
extremely well rested. He's actually coming off a sabbatical right now. That's very exciting.
He should be injury-free for the rest of the season. He cut his teeth at MIT, and then he got drafted
by Stripe in 2010. Microsoft tried to do a trade deal during the 2023 chaotic trade deal,
trade window that opened up post-Sam Altman ouster, but he stuck with the OpenAI team, and now he's
president of the company. Then you got Mark Chen. He's coming on at 11:30 today. He's the chief research
officer. The rumor's that he turned down a maxed-out contract to head the Meta Llamas, but he's sticking
with the OpenAI team. He was an MIT undergrad. He also worked at Jane Street before joining
OpenAI in 2018. Then we got Sarah Friar coming on the show at 12:30. She's the CFO of OpenAI.
It's her job to find bank accounts big enough to fit all the cash they're raising. It's a tough job.
You got to find, okay, this bank account, will it hold 10 figures? Will it hold 11 figures? Will it hold
12 figures?
There's been a lot of cash in this one.
Exactly, exactly.
She's also going to be defining the non-GAAP metrics that will be catnip for Ben Thompson in
just a few years.
We're excited to talk to her about how she's measuring the success and the health of their
business.
Obviously, it's not just revenue.
It's not just top line, bottom line.
We're going to want to know about queries.
We're going to want to know about DAUs, all those non-GAAP metrics.
That's what people are going to be tracking when IPO day comes, hopefully soon.
And then we also have Brad Lightcap.
He's joining at 2:35.
He entered the league as an investment banker.
Let's give it up for the investment bankers.
They don't get enough credit around here, but we love the investment bankers.
Then he got drafted by Y Combinator before joining OpenAI as CFO in 2018.
Now he's the chief operating officer.
And then we have Max Schwarzer.
He's in charge of post-training, fine-tuning these models, getting them into the fighting performance to put on a display of authority on GPT-5 launch day.
Now, let's flip it over to the defense. They're going up against the timeline. They're going up against the vibe checks. We got the doomers. The doomers, they're led by Eliezer Yudkowsky. Admittedly, everyone knows this. No one debates this. The doomers have had a terrible season. But you'd expect to see at least a few Hail Marys about GPT-5 creating bioweapons thrown up on the timeline today. Probably won't be bangers, probably won't get a thousand likes, but you'll be seeing them here and there, mostly in the replies. We've also seen some doomers talking about GPT-5 being available to every government employee.
And Eliezer had some harsh words about that.
Don't give the keys to Sam Altman.
Don't give the keys to the government to OpenAI.
He was upset about that.
But in general, the doomers not putting much of a fight up today.
Then you got Claude.
Interesting.
Claude was caught playing for the wrong team earlier this week.
Anthropic.
They're on defense today.
But we saw them take out OpenAI's key pinch hitter,
Claude, the Claude API was playing for the OpenAI team, but they shut that down and Claude is no longer
pinch hitting for OpenAI. Then you got the Elon stans. The ground game's going to be there.
It's going to be strong. The Elon stans are going to be tracking the benchmarks relentlessly.
We know xAI loves benchmarks and all the Elon stans are going to be calling out GPT-5 for any,
any misaligned benchmarks if they fail.
Humanity's last exam, it's over.
It's over.
They'll also toss up the occasional unhinged conspiracy theory.
Moving on, Gemini.
The betting lines have shifted big time.
People thought Gemini was out of the game.
They're so back.
Polymarket has Gemini at, what, 75% chance of being the best model towards the end of the
month.
This is, of course, based on the LM Arena, more vibes-based benchmark.
But Gemini will probably be quiet today.
They usually don't try and front run press releases.
They usually try and sit back, let the models speak for themselves, let the API credits work their way through the latest YC Demo Day batch and get the product into the hands of people.
And so expect to see a big, glossy conference in a couple weeks.
Demoing Gemini 3 should be a good rebuttal from the Gemini's.
Then you got the Meta Llamas.
Been on a poaching spree.
He's rebuilding the team during the offseason.
Now he's got a stacked roster
and he's ready to go duke it out.
But no one knows exactly what's going to be in the
playbook. Is he going to go consumer?
Is he going to go API? Is he going to turn
into a hyperscaler? We don't know, but we know
they got a stacked team. They got Alex Wang.
They got Nat Friedman. They got Daniel
Gross. They got tons and tons of
other researchers. They've been raiding
every other team. Completely reset
the salary cap for the league.
It's been an absolute clinic in terms of recruiting over there at Llama.
Then you got the final benchmark.
ARC-AGI.
This benchmark stands.
GPT-5 couldn't get past this defense.
And ARC-AGI, you know, sitting there right in the end zone, just swatting him down.
Swatting him down all day.
You think, you think superintelligence is around the corner?
ARC-AGI denied.
Denied.
Tyler, give us the update on ARC-AGI.
Where does everything stand?
How'd GPT-5 do?
Does it matter? Should we care about ARC-AGI?
We love the team behind them,
but is it an important benchmark?
Should we be tracking it today?
ARC-AGI, V1 and V2, right?
And V3?
V3. I actually don't know if...
No one's been...
No one's even tested V3 yet.
No one's even really close there.
But how are we doing on V1?
V1.
GPT-5 is at 65.7.
Unfortunately, that's going to be 1%
just short of Grok 4, 66.7.
Okay, ARC-AGI 2.
The Elon stans are going to be going wild with that.
ARC-AGI 2, 9.9%.
9.9%.
Grok 4, 16%.
So absolutely kind of brutal, you know, ARC-AGI mugging.
Rough showing.
Some people have accused Grok 4 of being slightly benchmaxed.
You know, this is, you know.
They might have a team working on it.
What are the pros and cons?
We know the cons of benchmaxing:
you're overfitting on something that might not actually drive consumer value,
it might not actually solve real world problems, it might not increase DAUs or revenue or
ARR or anything that really matters. It might not even get us closer to super intelligence.
Give me the counterargument. Why is benchmaxing good?
The bull case for benchmaxing. The bull case for benchmaxing, break it down for me.
Yeah. So I think the idea is basically this is almost like a non-eastern,
non-AGI-pilled kind of take, right?
So if you don't have a super general intelligence,
your ability to benchmax basically proves
your ability to solve some kind of specific task.
So there's this thing about the gas station
spiky.
Yeah, it's called getting spiky.
Getting spiky, adding more spikes
to the spiky intelligence.
Yeah, I think it was Roon who had this tweet
about the gas station benchmark.
Yep.
Right, I don't care if he said something like,
I don't care about AI solving gas stations if it has the gas station benchmark, something like that.
But the idea is like if you if the if making the gas station benchmark.
Roon said my bar for AGI is an AI that can learn to run a gas station for a year without a team of scientists collecting the gas station data set in capital letters.
And then my take is basically I don't care how they got to the like.
I don't care how they made it run the gas station.
I care how fast it gets here.
That it runs it. If we can run the gas station with AI, that's enough.
If you have a team who's your benchmaxing team, that just proves that if you have some
task that's like really important that you want to get done, they can just figure it out.
So it's like RL for business.
This is like the same thing.
RL for law.
Yeah.
All these like specific verticals.
Mira Murati is doing this at Thinking Machines, right?
Like RL for businesses.
Come into your organization, understand the most valuable business
processes out there that could potentially be RLed against, that could be turned into a benchmark
and then, and then, you know, bench hacked, because I don't care if you're hacking, you know,
if I have translate this type of document to this type of document for my business, if you can do
it with 100% accuracy, I don't care that you bench hacked it. Yeah, exactly. Like benchmarks right
now are not like economically valuable. Like if you're really that much better at MMLU,
yes. It's like, are you producing that much value? Yes. Probably not. But if you have, if you make some
new benchmark that's you know your tax benchmark I think Anthropic just released
that fairly recently oh sure sure sure that's like I don't care if you bench max on
that as long as it does way better if it does the taxes you're good it's gonna
it's gonna do the task yeah yeah yeah yeah that makes sense what about the
what does it say that it feels like OpenAI seems capable of bench hacking it
seems like they've opted not to is that because bench hacking has a risk
of giving you negative aura.
Because if you're accused and found guilty of bench hacking,
you could, it often reveals that you're not building this one beautiful,
you know, super intelligence to rule them all.
Yeah, I think it's also, like, maybe we're just looking at the wrong benchmarks.
Like maybe there's a bunch of, like, interesting benchmarks about, like,
there's this one I really like, it's the Minecraft benchmark.
Yeah, yeah, yeah.
Where you have to, like, build, you, like, give it some castle,
and how good it looks,
or there's the one you always see about the unicorn.
Yeah.
So you use this like math package that does like rass and stuff,
but you ask it to draw a unicorn.
Oh, yeah, I've seen that, yeah.
Those are really good because it kind of shows the creativity, stuff like that.
Walk us through TBPN bench.
Yeah.
So we will be benchmarking the AIs against going forward.
Have you heard about this?
Reps of 225.
That would be close, but it's difficult because the humanoids kind of change that,
and you can just use a normal actuator.
This is truly for a large language model.
You feed in our data set.
We have a public data set, a private data set, presumably at some point.
But walk us through TBPN bench.
Yeah, so I'm yet to try this on GPT-5.
I don't think it's out yet for public use.
I don't have it.
But I can tell some of the questions, right?
So the first one, I have this picture of a horse.
You have to guess the breed.
Yep.
So let me see.
I think, I don't want to say it in case GPT-5 is listening,
but it may or may not be a Caspian horse.
Okay.
And it's failing right now.
O3 is failing.
O3 is failing.
Oro is failing.
I haven't tried every model.
Yeah, we got to try Grok and Gemini.
We're going to go all out.
Yeah.
This seems extremely hackable.
But at the very least, if we get one scientist to go off and collect the horse data set
and then bench hack it, I think we will have done our job.
Yeah.
That's the first question.
Yes.
The second one is a, it's, I have two pictures.
Okay.
Before and after of this guy, and it's: which peptide did you take to achieve this body transformation?
So it fails there.
It fails there.
So you have a data set of what peptide does what to the human body?
Where did you find that?
Well, you know, Wikipedia has a lot of this stuff.
Okay, okay.
You would think they'd be able to cheat this around with O3, just reason who is this person
go look up what they've said they've taken and then boom, you have the answer.
Well, at first with O3, when I was prompting it, I would like save the photo.
But then wouldn't have the metadata or the, or the, sure, sure, sure.
The file name would be like Caspian horse or something.
Yeah, yeah, yeah.
Okay.
Yeah, and then the third one?
The third one, I pass in an audio file of a car revving.
Has to pick which one.
It has to pick, it has to identify the car.
The car.
From the engine note.
From the engine.
And it's not doing it currently.
It's no.
Okay.
This is a good benchmark.
We have these real last exam.
Yes.
Yeah.
Exactly.
So I think those are pretty solid.
I have some more, obviously, I don't want to make them public in case anyone's going to try
to, you know, benchmark this.
Of course.
Of course.
It's funny because I was mentioning the other day this app that my dad had of tracking
that you just set your phone up and it just automatically detects which birds are in your backyard.
Yeah, yeah.
I mean, this has to be extremely solvable.
It's just something that it reveals the lack of like general intelligence when you have
to go and collect the horse data set, which should just be out there or the engine note
dataset, which should just be out there.
But clearly we are in the age of go and RL on the individual problem, and we are looking at
like the power law of capabilities.
Knowledge retrieval is clearly a, you know, $12 billion a year market that consumers will pay
for that will probably grow significantly.
And then health and therapy and shopping and all the other features that Fidji Simo laid out
in her post, this is kind of like, you know, what will be RLed against because those are key pockets
of value in the consumer economy. And the same thing will happen in the business economy. But in the
B2B context, you'll probably see an individual startup building on top of an API. But even then,
most of the model platforms offer kind of RL as a service, fine tunes as a service, something where
if you're starting to spend tens of millions of dollars, they will do some customization on top of the
So that could be the regime for the next few years as we go into this like, you know,
instead of like this centralizing AI force, there's only one company.
There's actually like a Cambrian explosion of a ton of companies doing a bunch of different things.
So anyway, let's go to Signal's post.
Signal's not happy with the launch.
He says, okay, I've seen enough.
This launch felt like attending a funeral hosted by minimalists.
They're unveiling tech that should feel magical, real breakthroughs.
But the whole vibe was grayscale grief.
The set design looked like if mood disorder got a Bauhaus grant.
I don't know what a Bauhaus grant is exactly.
Even the story telling our chart styles,
the eulogy tributes,
then closing on someone's health battles.
What exactly are we,
or we as the audience mourning?
It feels like they're trying to get you
to pre-install a therapist.
Potentially great products, sure,
but the emotional tone was so damn DOA.
Incredibly strange all around.
I think, A, like, it's weird because we're in this world,
and this is a question that I want to noodle on all day,
is, will this be the last launch
of a numbered GPT model because you don't hear about new new versions of Google going out
you just it just got better and better and better same with Amazon when they were
optimizing for hey it's faster we have more on our catalog is that the product
matters more than the model yes now and probably will for potentially very long
time and when we were watching the stream I was cheering because they gave the feature of
you can now talk to the model and get it to trigger a deep reasoning workflow or get it to give you a quick answer in natural language.
And so it's it's abstracting even further, even more of the UI into the actual text interface.
And so I think in terms of like surprise and delight and and I don't know, it's like you, everyone kind of rips the Apple thing.
But Apple does a great job of being euphoric and happy with somewhat minor.
product changes. And like maybe that's more of where they'll go is just, hey, there's these new features and
here's how these things. And Apple will spend 10 minutes on stage talking about like shifting an
icon around and stuff. And it's like, people like it. I thought it was interesting. They just sort of
casually mention that they're they're deprecating the old models. I think it's great. Which makes
sense. I think it's great. I don't want the model picker anymore. But are you upset? A lot of people are
going to be upset about that. I think they're getting rid of 4.5. Oh, really? And you're and you're a 4.
I love 4.5.
But I would imagine that the future is if I ask it to think really hard about the prose
and the writing style, it would then do a pass with 4.5.
But it would only trigger that when it needs to.
It's not going to give you that.
Because if I'm just asking for, hey, regurgitate a bunch of facts or write some code
or put together a table of data, like it's not going to need to pull 4.5 off the shelf,
just like it's not always going to pull Python off
the shelf. It's not always going to pull web browsing off the shelf. And so I'm not sure that I
necessarily want 4.5 there as a selection criteria. I would like all this to be tucked behind
a UI and have something that's actually cleaner and less frustrating to use. I think it'll lead to
higher retention. Yeah, for the average like normie. No one knows what 4.5 is. That's true. It's
true. Anyway, Chris Paik says OpenAI and Anthropic are duking it out. Meanwhile, consumer surplus is
growing. We also have very good news. We also have the details from Mike Knoop over at ARC-AGI.
Full GPT-5 is along the V1 Pareto frontier. That's cost versus performance. OpenAI said they
focused on other goals like UX and reliability. Our testing supports this.
Mini GPT-5 is super impressive accuracy for cost. In fact, based on cost efficiency,
Mini could have entered ARC Prize 2024 and likely won first place.
We are still verifying GPT OSS or as Roon says GPT toss.
Results soon.
Nano-GPT appears overfit.
Performance is commodity.
And François Chollet is also chiming in with the top line.
John, direction team needs the deck.
Oh, we don't have a deck today.
No deck today.
We're just doing it live.
Yeah, we're just riffing through
the timeline tab and just pulling up some random posts so so you're free to pull
those up but also we can just read through them Ashlee Vance is saying but but
model switching was my job model switching is out and we are into the future of just
talk to the model just talk to the model and ask it what you need it to do and it will
switch for you it will pull the right tool for the job anyway the other question
that I have for the open AI folks today is on the nature of secrets
So in Zero to One, Thiel has this concept that discovering a secret is key to building a startup.
And it's a key insight. And I was joking with, you know, the superintelligence or GPT-5.
Could my first prompt be, teach me exactly how to build GPT-5?
And then I go to Meta and I say, I know how to do it. I have the prompt. I have the result.
And of course, the answer is no.
Of course, OpenAI would never leak the most frontier capabilities into the model.
But can you build a superintelligence, can you call it superintelligence if it doesn't,
if it can't tell you how to build super intelligence?
One read on what the secret might be is that the app was the most important thing all along.
And if you create this narrative that super intelligence is, you know, weeks
or months away and you get a bunch of people that go and try to compete on raw intelligence.
Meanwhile, you build a consumer app business with billions of users.
Yeah.
It's like it seems like a pretty good strategy.
I guess one quick thought is how do you rate Sam's vague posting from yesterday with
the Death Star in the context of this new, in the context of the release today?
That's a great question.
There's a bunch of reads on it.
One is just that like the Death Star is to some degree like Stargate and you have to,
oh wait, if this is the apocalypse, I figured at least tune in live.
You have to like the impact of this of GPT-5 is not one crazy super intelligent model that does everything.
It's just a more user-friendly, higher retention, lower churn consumer model that weaves its way into all aspects of daily life and improves performance and efficiency all over the place.
And so you have to build this massive cluster to serve all of that.
I don't know. What's your read on it?
I don't know. I think it just was, I think it was dramatic. It was provocative.
People didn't like it.
It is provocative because there are many other
like super megastructures that are in in sci-fi history that are positive or positive yeah and this is
this is like but it gets the people going yeah I don't know is it is it is it a metaphor for someone
else that's going to attack is he I mean I mean the images from the viewpoint of someone looking at
the death star is he saying he is seeing a death star being built on the horizon is that something else
is that another company another organization is that the government is that the is that legal here's a
from Bubble Boy says, I'm an expert on bubbles, so it brings me no joy to say that the AI bubble is popping this time next year.
He is updating his timelines.
When you promise infinite scaling and don't produce it, the calculus changes.
I don't think it will be bad for most companies, but those who built their entire business model around making the best LLMs are unfortunately going to struggle as models become more of a commodity.
Again, OpenAI, my read on it, is a consumer app business.
They still have a big enterprise business, but, you know, their recent valuations are predicated on this incredible consumer business that they've built.
Bubble Boy says the end user doesn't care much if Claude is 5% better than GPT5.
They care about cost, speed, and utility, especially at scale, things will be going.
The obvious play now is shorting Nvidia and dumping.
Okay.
Getting into financial advice territory here, Bubble Boy.
But interesting, again, kind of goes back to what I was saying earlier in that if you were raising billions to make a lab, and I think the potentially, you know, we'll see what happens in the coding market, but there's some clear winners emerging.
And then on the consumer side, you know, expecting a power law outcome.
And it's hard to see anyone unseating ChatGPT there.
Completely agree. I want to dig in more, but we have our first guest. Let's welcome him to the stream. What a day. Mark, how you doing?
Hey, doing pretty good. Nice to see you guys again. What's happening? Great to see you. Congratulations on the launch. Take us through it. Were you actually live, or are you wearing the same thing and you recorded it yesterday?
Actually live. Okay. I don't know why, but we do.
Yeah. I mean, we're big fans of live.
It just allows you to be the most reactive to the newest information.
Give us the core thesis that you are trying to get across.
I think that there are a few narratives out there.
We've been enjoying the one that's, you know, this is a dominant consumer product.
They just made it a better consumer product and people are going to use the product and get
more value out of it.
I saw a bunch of things in the presentation where I was like, that's going to make my daily
usage of ChatGPT better. At the same time, we're in this, we're in this world of all the
models, the numbers matter and the scale matters and this and that and this and that. And it's a,
it's a fine line and it's a dance and we're in a transition phase away from benchmarks and away
from talking about the size of the bubbles. But what was your core thesis? Like, what did you want
to get across to the listener? Yeah. I mean, fundamentally, I think from a research perspective,
we've been working on reasoning models for several years now. And I think up
until now, you've had this really clunky interface.
You have to pick, you know, GPT-4o or you have to pick o3.
And for the longest time, we've known that o3 gives you better answers across the board.
It's just too slow, right?
I mean, you often don't want to just sit there and wait for the model to reason it out.
So we've done a lot of work to push the speed, the performance of our reasoning models,
such that these can come together and work in a very seamless way.
And so I think, you know, above everything, we're trying to move the world into this agentic
reasoning world. We believe that's the future. And on top of that, you know, you pointed something
out, which I really resonate with. Post-training is a huge part of this release. We really wanted to
highlight Max Schwarzer and his team who did a phenomenal job. And they've made the model just
really that much more useful for consumers, for businesses. It's a monster at coding. So, yeah.
On the speed of reasoning, you're obviously the chief research officer.
Are you more optimistic about getting speedups there from, I don't know, algorithmic design, software optimizations, or new hardware, just letting Moore's law carry on or finding new ASICs?
Or we saw Cerebras posting yesterday about the incredible speed that they're getting, 3,000 tokens a second on gpt-oss.
And I'm wondering what levers... obviously, we pull all of them.
But what path of the tech tree should we be most focused on, most tracking, and most excited about?
Yeah.
I mean, as a person who represents research, I control the things that I can control.
I think that means a focus on algorithms, right?
Simple algorithms that are scalable, that we can pump a lot of compute into.
We also do care about the hardware improvements that are stacking up.
With the open-source release, you see thousands of people
really kind of serving these models,
creating really great inference stacks.
And those are really great lessons for us to pull from.
What's the ceiling of the speed
in which we can serve these models?
What can you tell us about the actual user experience of speed?
I was, just like last week, I finally got to a place
where for a lot of tasks, I'm firing off a 4o query
and an o3 Pro query.
Yeah, I have two tabs.
The o3 tab.
Exactly.
And I'm wondering what user experience patterns you think can help people balance between those.
Is this just different patterns that we're going to learn over time, or are there going to be certain problems of user experience that are purely solved just by better product design, better speed?
And we don't even need to learn these.
Because I remember, like, you know,
when you prompted an image generator,
you used to have to say like,
no six fingers, five fingers, please,
or like, don't make mistakes.
And now, you know, the models kind of have that baked in.
But how are you thinking about the user experience
of getting the user the results in the right amount of time?
Yeah, I mean, this is one facet of why we believe
so much in reasoning.
It's just because all of the scaffolding
you used to have to give the model,
all these small hints, they go away, right?
Like the model can examine its own outputs.
It can review them.
It can be like, hey, look, like,
I'm just counting the fingers here,
why are there seven?
And you can kind of fix that, right?
It does a lot of iterative generation.
It does a lot of fixing things on the fly.
And so we think one of the benefits
of bringing reasoning to the world
is really to kind of remove the need for scaffolding.
And with GPT-5, right?
We know how clunky that experience is
with switching between 4o and o3.
Actually, I mean, there's so many stories.
I was just talking to someone yesterday, right?
They're like, hey, well, you know, I've used 4o my whole life.
It's the frontier model.
And I'm like, hey, well, have you tried o3?
And they're like, why would I try o3?
You know, three is less than four.
And so, you know, we need to get out of that world.
With GPT-5, I think it's a one-stop shop, reasoning and non-reasoning.
And we've really tried to make it kind of just Pareto-optimal.
Yeah, yeah.
It's absolutely crazy to just take a bunch of letters and smash them together
and expect people to pick up on that as a name or a brand.
ChatGPT, GPT, TBPN.
We're both kind of in the same insane gambit.
But fortunately, it's worked out, and I think people have gotten over the hump.
I can see it rolls off the tongue.
TBPN.
Sort of, except our friend David Senra keeps flipping the letters.
A lot of people do that.
But at a certain point, yeah, you do break through, and ChatGPT has,
but keeping the model numbers simpler makes a ton of sense.
Talk to me about the pace of play
from research to actual product.
Yeah, and on that note, your personal philosophy
on the line between research orgs and engineering product orgs.
Yeah, I mean, so our research operates on a variety of different timescales, right?
We have teams that scope out a bunch of ideas,
and then they start to kind of narrow in on the promising ideas
as they get closer to a run,
and then you kind of see a winnowing of ideas as you get closer to launching a flagship model, right?
And there's always this kind of like more exploratory to more kind of concrete and execution-focused pipeline.
And we're pulling on ideas across the board here, right?
There's a lot of work in architecture optimization.
Seb was on stream.
You pointed out improvements in synthetic data.
So there's really a lot of work that goes into creating one of these models.
And, you know, it's hard to say like, oh, this model was about this breakthrough.
Just because right now we have this machine that's producing breakthroughs on all of these axes.
And even across several paradigms, right?
So it's all that coming together that produces the experience that you guys feel.
Yeah.
Can you talk to me about the legacy or future of 4.5?
I remember I was talking to you and I was like, I haven't been using it a lot.
And you looked at me like I was crazy.
You were like, oh, it's so good.
And I was talking to Tyler,
our intern here, and he was saying, like, yeah,
the people who really understand how good it is use it.
But I was wondering, is there a world where
that is a tool in the tool chest for GPT5
in the same way that Python is or a web browser is?
And if it detects that I want something
with more emotional prose or more thoughtful writing,
it can do a whole bunch of
research, collect a bunch of raw text, and then kind of do a 4.5 pass that I believe is more
expensive maybe and maybe it doesn't make sense for every single query, but could be a feature
in the loop or a tool that is pulled into the overall product experience.
Yeah, absolutely. Speaking of 4.5, it's also a very smart model, right?
And one of our bars in creating GPT-5 was to make sure that on a lot of the axes we cared
about that it was able to outshine 4.5. And I think even in some of the soft ones like
creative writing, I think that was the case. And that's what makes us so confident with the
name. I think we're able to really rely on all of the architecture advancements, all the kind of
post-training advancements, all the synthetic data advancements to create a model that's better
than 4.5, but much faster and much cheaper. Yeah. It feels kind of like we're, I remember,
wasn't the second iPhone called the iPhone 3G?
And the number literally corresponded to a specific technology.
And now when you get the iPhone 14, it doesn't mean it's 14 megahertz or gigahertz or inches big.
The number is abstract and it speaks to a bucket of features.
And it feels like, I mean, this was the first day of kind of re-educating folks on what the nomenclature means
going forward. Have you talked about an annual release schedule? Like,
because there's the iPhone cadence and then there's the Google cadence, which was like
Google search just got better every year for two decades. It feels like at a
certain point you want to just be shipping as fast as possible. How do you think
about the culture of shipping updates that you know you find something that
feels like hey that could make the customer more delighted or the user more
delighted and we don't need to do a big training run for it so let's get that out
today and let's tell people about it like how are you thinking about fast
iteration versus splashy announcements? Right. So on the product research side, I
think it makes a lot of sense to think about, you know, what's the cadence of
release and, you know, what are the feature sets that we want to build. And I
actually think there's enough great research happening there that we don't have to
worry about, oh, you know, is there going to be a drought or a long stretch without
enough features to launch. But one thing that's
important for us is to be able to provide the people doing the exploratory work some buffer
from that, right? It's hard to do really great exploratory research in an environment where
you feel pressured to do release after release after release. And so we let that be a little
bit of a lazier pipeline, not meaning that the work itself is lazy, but we give it space really
to mature and to flourish. And once it's ready, we can ship things across that fence. So that's
kind of philosophically how we organize.
We have a product research org, still very much entrenched
in the research, and they care about the release cadence.
And they're able to draw from all of the research
that's happening, you know, algorithmically
and in scaling and in RL.
Yeah.
Talk to me about tool use and how that's growing.
I was kind of noodling on this idea that, you know,
the, I was thinking about the IMO and how, at
least from the reporting, it sounded like OpenAI's model didn't use tools for that. And that's
an incredible achievement. But it's kind of, like, artificial. Like, I don't care if the
model doesn't use tools. I use everything possible. And even if an LLM can memorize
every fact, I'm fine with an LLM looking stuff up in a traditional database, spinning up a
spreadsheet. Like, use whatever tool you want, just give me the correct answer. But
is it important to surface to the user the variety of tools that are in the GPT-5 tool chest?
I noticed something magical happened when I was using o3 Pro.
I sent an image in and I asked it to estimate the height of a desk and it wrote like a thousand lines of Python, an image interpreter, and was like, you know, interpreting pixels.
And I was like, I didn't even think to trigger Python.
It did.
Yeah, yeah, yeah, no, it was right.
It was crazy.
But the really funny thing was that it was just a standard size desk.
It was just like, it could have just Googled like how tall is an average desk or something or just memorized.
It probably was just already in the weights that it knows that a desk is like 36 inches tall.
But it did a ton of work and it still got it right.
It fact checked it a bunch of different ways.
But I've noticed that now I can pull different things.
Make a table.
Don't make a table.
Write some Python for this.
Don't write some Python.
And it kind of gives me the feel of like a super user to some extent.
But I'm wondering how you're thinking about what is further down.
Like you've given ChatGPT a computer, as Ben Thompson said.
You've given kind of the core tools, the Python REPL, the web browser.
How are you thinking about kind of the long tail of tools that you want to bring to bear?
And how does that interface?
I know that there's API integrations and all sorts of different
surface area there, but give me some context on that.
Yeah, I mean, our reasoning models are pretty cute, right?
I mean, I think they, when you look at their behavior, right, they know the height of the
desk, but they'll still go to verify it five different ways.
They see it's all consistent and give you that median answer.
And I think that's really what makes these models so powerful.
And when you think about tool use generically, right, like we want the models to use
that reasoning ability to just be able to like zero shot a new tool, right?
It should be able to kind of minimally get instructions about how the tool works and just be able to know how to use it.
And humans do this all the time.
You get a new tool, you start experimenting with it, and then you don't need too much scaffolding and you just go and use it and understand it.
So we want our reasoning models to use their reasoning to be able to use a broad selection of tools.
And of course, there are a couple that you really do care about.
In coding, it's very important for you to be able to execute code.
It's really important in personalization for you to be able to get context from your calendars and from basically from the digital world.
So I think there's a range of tools we want familiarity with, but beyond that, we want the model to be smart enough to just generalize and use tools zero-shot.
Yeah. Talk to me more about personalization. I feel like
there's a world where I'm maybe underutilizing ChatGPT as an app because I don't have it wired up to a
non-relational database where it can just stuff data. You know, it already has memory and it's doing kind of
roll-ups and there's some sort of saving of context. But when we were talking to Kevin Weil,
I was kind of like, well, like, I don't really have like a GitHub repo that's active that I want to,
like, dump code in regularly for like my one-off tasks. But for that image generation, like, you know,
understanding the height of the desk, it's like, well, if I'm doing that a lot, maybe I want to have a tool
built that lives in the world that my chat interface can can kind of interact with on an ongoing
basis and contribute to and modify and kind of wind up instantiating a piece of software that's
like even more long lived and then every successive query is even faster. So yeah, how do you think
about different ways to increase personalization? Yeah, I mean, I think memory is huge. So we have
We have teams surrounding memory and also personality.
And when you look at memory, right, I think it's just we have so much context built up about ourselves that the model doesn't have.
And our memory team's been really hard at work.
You know, there's a surface level of just gathering facts about you.
But there's also stuff about just kind of thinking very deeply about who you are, what your motivations are.
And even, you could think about, you know, you're trying to do some code-based tasks, right?
You're a developer, shouldn't the model just be trying code out, you know, and just kind of leveraging
all that memory, kind of its thoughts about what you want to do to just help you kind of be doing
work all the time. So, yeah, we do think memory is a huge part of making the model more personalized
to you. And it should just make use of all that passive signal about you that it observes or all
of that interaction and just help you accomplish your goals.
Got it. What do you think it'll take for AI to start making
novel discoveries? That's been a critique over the last year: everybody's so excited.
Everybody's using these products every day in their work and life. And yet it still feels
like we're missing that. Dwarkesh has talked about, you know, potentially that being around
continual learning, but I'm curious what you think. So one thing to underscore is I think the
models are already phenomenally creative in certain ways. So when I looked at our performance
on contests, right?
You know, I've done these contests before.
Sometimes you have this mental classification of these problems require more creativity or these
ones require less.
And one of the big surprises for me was that the model can get some of the ones which I intuitively
think require more creativity.
And it often does come up with these solutions that I consider quite ad hoc and really don't
pattern match to anything I've seen before.
When you look at advancing science or mathematics or fields like this, one construct
in which humans work sometimes is there are kind of theory builders.
In mathematics, for instance, there are mathematicians whose role is to kind of build out this
theory and almost to kind of create Olympiad-style sub-problems, which often other mathematicians
who are very good at that kind of style of work can do.
And I do think kind of the model will increasingly contribute on that side first, right?
If there's some mechanical, like, hey, you know, I really don't know how to simplify this expression.
I really don't know how to, like, get this result.
It can really do that quickly for you.
We're trying to increase the envelope, such that the models are getting towards that theory-building side
and, you know, being able to create creative hypotheses.
And all of these components are very useful for what I consider the ultimate goal, which is being able to automate some of our own work and our own research.
How are you thinking about like the layers of mixing?
Like I remember GPT-4, I don't know if this was ever confirmed, but mixture of experts model, this is kind of like widely understood in the industry.
Now are we in the era of like a mixture of models that have mixture of experts?
Like, how many mixtures are going on?
How does GPT-5 actually work?
Is there a taxonomy or architecture diagram
that you can kind of like walk through to explain what GPT-5 is?
Because it feels so much different than GPT-3.
Yeah, I mean, one of our, probably the pinnacle of our research road map,
but our path to AGI.
When you look at the levels of AGI, the top level is what we describe
as organizational AI.
And what this means is, you know,
collections of agents working together,
often like we might in a company,
towards a shared goal, right?
And you would imagine that these agents
probably sub-specialize in ways,
maybe similar to what humans do,
maybe in their own more efficient ways.
And I think, you know,
effectively work together to accomplish some goal.
So we very much care about
exploring this vision, seeing whether that's much more
effective than, you know, one single big brain working out a problem. And I think there are
reasons to think why it could be so. And yeah, I think that is one of the things that
we're after. Yeah. On that note of specialization, how are businesses working with GPT-5, or how do you
expect them to work with GPT-5, in terms of coming to OpenAI and asking for special capabilities or
fine-tuning or you know any sort of RL on this particular
problem in my world. I have this specific data set. It's not public, but I want a hyper,
I want you to bench max on it. I want you to get 100% on, you know, the gas station bench or
whatever. You know, if I'm, if I have a certain business and I'm willing to invest in sort of some
some overfit RL because it will create immense economic value for my business or it will solve
some fundamental problem, how can, how are businesses going to be using GPT5 over the next
few years. Oh, that's a great question. So I think that this is a chance to kind of highlight
one of the results that we've accomplished over the last couple weeks, which is our AtCoder
results. So this is a relatively unknown programming contest, but it involves really the pinnacle
of the best coding contestants in the world. And what they do is, you know, they're put in
a room and they have to solve an optimization problem. This is something that's actually very real-world
relevant. So you can imagine an optimization problem as something like what Uber might have.
You have, let's say, riders and you have drivers, and you want to create a system where you match them as quickly as possible, you know, with the least amount of cost, for instance.
And so we've really created a system that can solve optimization problems at the level of the best in the world.
Right. And these truly are the kind of the best heuristic solvers in the world. And so we have
an organization led by Aleksander Madry. It's called strategic deployment. And what they do is, for a
select handful of customers who really have that, you know, beefy problem that they need to
solve, just go and provide that value, right? And I think there's a lot we can do there. I think
There's a lot of very, very valuable optimization problems
in the real world.
And we're really excited to partner with people.
Because I think this creates a template for directly having AI
provide economic value and really catapulting
certain industries forward.
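As a rough illustration of the rider-and-driver matching problem Mark describes, here is a minimal Python sketch. The cost matrix and the name `min_cost_matching` are made up for illustration, and exhaustive search is swapped in for whatever heuristics the contestants or OpenAI's system actually use, which the conversation does not describe. It only shows what "match them with the least amount of cost" means as an objective.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Exhaustively find the cheapest one-to-one rider-to-driver assignment.

    cost[r][d] is a hypothetical cost (for example, pickup minutes) of giving
    rider r to driver d. Brute force is only viable for tiny inputs; it is
    here to define the objective, not to show how a contest entry works.
    """
    n = len(cost)
    best_total, best_assignment = None, None
    for drivers in permutations(range(n)):
        # drivers[r] is the driver assigned to rider r in this candidate.
        total = sum(cost[r][d] for r, d in enumerate(drivers))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, drivers
    return best_total, best_assignment


# Hypothetical pickup times in minutes: rows are riders, columns are drivers.
pickup_minutes = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = min_cost_matching(pickup_minutes)
print(total, assignment)
```

Real instances are far too large for exhaustive search, which is why contests of this kind reward good heuristics rather than exact answers.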
On the research side, what unique advantages
do you think you and your team have given your position
in the market?
with the incredible user adoption and the incredible usage from those users.
It's not just DAUs, but it's actually the number of queries.
SemiAnalysis estimated it at like 71% of all queries going through ChatGPT.
What advantages does that confer from a research perspective?
Yeah, I mean, a lot, right?
And I think it allows us to kind of deeply understand use cases.
It allows us to understand the frontier of where humans are,
you know, kind of finding value, where they're not finding value, which areas that we need to
improve the models on. It gives us a lot of signal. It's how users are deriving value, when they
derive value. And what is that signal? I see the thumbs up, thumbs down button. I'm sorry,
I don't push it very often. I'm not doing my job, apparently. But I know that you can figure out
whether or not I'm satisfied. Stop booing me, Jordy. That's the research, too.
Mark, I promise you for the next 100 ChatGPT responses, I will be honest with my thumbs
up, thumbs down.
I love it.
Even if you do extra training.
We have tons of people, luckily, who do.
Oh, that's great.
Okay.
So you do get a lot of thumbs up, thumbs down.
And I'm sure I have done it occasionally.
But I also imagine that there's a ton of other signal in there.
You know, with the TikTok algorithm or any social algorithm, it's very easy.
Time on site.
With ChatGPT, obviously it's exciting when we hear, okay, 30 minutes a day or some rumored number of minutes.
It feels correlated with usage.
It feels correlated with value that's being delivered.
You can obviously look at churn metrics and all that stuff.
But what other pockets of signal are you finding?
Are you finding people just, I remember the story about Google where they were trying to figure out how to handle misspellings and create the definitive database?
Do you know this story where they were trying to develop the definitive database
of how to spell things?
And they were like taking a bunch of shots at it.
And they figured out that the best, most rich source of data was just if you type in financial
into Google and you misspell it, oftentimes then you will just correct it yourself.
And the second query you send will be spelled correctly so that you can just look at two
similar queries.
What's the second one?
That's the correct spelling.
So yeah, what other pockets of signal are you finding that are translating into the research
environment?
What are you excited to go deeper on?
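The query-pair trick from the Google story can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Google implemented it: the log format, the similarity threshold, and the function name `mine_corrections` are all hypothetical, and `difflib.SequenceMatcher` stands in for whatever similarity measure a real system would use.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def mine_corrections(log, min_similarity=0.8):
    """Infer spelling fixes from back-to-back similar queries by the same user.

    log is an ordered list of (user_id, query) pairs (a hypothetical format).
    If a user sends two different but very similar queries in a row, the
    second is treated as a vote for the correct spelling of the first.
    """
    votes = defaultdict(Counter)
    last_query = {}
    for user, query in log:
        previous = last_query.get(user)
        if previous is not None and previous != query:
            similarity = SequenceMatcher(None, previous, query).ratio()
            if similarity >= min_similarity:
                votes[previous][query] += 1
        last_query[user] = query
    # Keep the most frequently observed rewrite for each first query.
    return {first: fixes.most_common(1)[0][0] for first, fixes in votes.items()}


log = [
    ("u1", "finacial"), ("u1", "financial"),
    ("u2", "finacial"), ("u2", "financial"),
    ("u3", "weather"), ("u3", "pizza"),  # unrelated pair, ignored
]
print(mine_corrections(log))
```

The point of the design is that nobody labels anything: the user's own follow-up query is the label.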
Yeah, so I'd love to first talk about the DAU signal because I think that's something that a lot of companies track, but we find actually a lot of danger in tracking it too closely.
And one of the recent blog posts we pushed out was on sycophancy, right?
If you just, you know, hey, we're going to boost responses where users say thumbs up.
Yeah.
You know, it creates a condition for a model.
I just want to say, Mark, I love everything you're doing on this first.
Yeah, this entire interview has just been fantastic.
You are the best.
You were just the best.
We'd love to have you back on the show tomorrow.
You're just.
But clearly problems with that.
Yeah, yeah, clear problems, right?
The model just starts kind of sucking up to you.
Totally.
And it's saying like, hey, you know, you're right.
And even in complicated situations where I think objectively, you know, collectively,
we'd be like, hey, this person's in the wrong.
The model starts saying, hey, you know, you're right.
You know, the other person's gaslighting you.
You know, this other person's kind of, and people deal with, people deal with this in the real world, they'll go to a friend, they'll tell them about a situation, and the friend will give them advice, but maybe it's not the entire, it's not the fullness of the situation, right? Maybe they left out some key facts. And the friend is like, oh, yeah, that other person is wrong. Definitely is in the wrong. And they, like, skipped over some important details.
Yeah, no, no, exactly. And we don't want our models to fall into this trap where it's just trying to get you to, like, like what it says.
And so, you know, we rolled back a lot of changes that produced that kind of behavior.
And really the way I think about daily active users today is we need to be opinionated about the features that we build into the future.
I think we have a lot of ideas here, but we have to let that drive.
You know, build for the future, build for the things that you think people will want and maybe don't
know they want necessarily today.
And then use DAU as kind of this byproduct, right, a way to track that you're on the
right track here. So, yeah, I mean, we want to be careful here. We don't want to fall into
these traps of like, you know, three, four years from now that this turns into kind of engagement
bait or something like that. Totally. Yeah. How much time has the research team been
focused on efficiency specifically? It felt like summer was a good window before kids come
back to school and start, you know, maxing out queries. A good time to increase efficiency. And I know,
the cost of GPT5
have
Every time there's a new model, I'm like, this is the best it could ever be,
it's good enough, bake it on an ASIC,
I just want it for free and I want it like in milliseconds.
But that's just me being, you know, grumpy, I guess.
We've done a lot of work.
We've been building out our teams.
We've focused a lot on scaling.
I think Greg's going to come on a little bit later.
He's been spearheading a lot of that work.
So, yeah, no, honestly, it's become a bigger and bigger focus for us,
especially in the last couple of months.
On the, I mean, this is somewhat related to the sycophancy thing,
but I'm interested to know, like,
what do you think is driving, like, the GPT tone?
You know how, like, the em dash is a thing?
And then the "it's not a newspaper,
it's a way of life."
And it's like there's these, like, little, like, flourishes
that come through and are kind of a tell that it was written by a model.
And in a lot of ways, I love it because when I get a deep research report, I like that it's using the same Wikipedia-style tone.
Like, I want consistency there.
I don't want it to be like, oh, today it looks like it's a Vice News article.
And tomorrow it looks like it's written by someone at BuzzFeed.
I like that it's consistent in many ways.
But why is that happening?
Do you think that bigger models like 4.5 kind of were able to solve that?
Or do those kind of like local minima, like, I don't know, like wells happen, even in bigger models?
Is there anything from a research perspective that can stop GPT having its own voice?
Or is it fine that it has its own voice?
Yeah.
That's a really great question.
And I think, you know, as you scale up models, as the models become more intelligent,
they kind of have just a deeper and deeper understanding of the tone, right?
And so you expect that to improve just naturally as you make the models more powerful, bigger,
better reasoners.
But one thing that I think gets lost a lot is each individual company has a lot
of impact in terms of how they shape the default tone.
And we publish a document called the spec.
It kind of lays out how we expect the model to sound in certain cases,
lays out a lot of examples for that.
And I think we use the spec in many ways, right?
We have people come in and see, hey, was this thing generated in accordance with
what we would hope to generate from our spec?
And this is a living document, right?
It evolves over time.
And so I think, you know, each company
kind of has a very opinionated take on what they think the model should sound like.
And it's not an accident that the model sounds a certain way.
I don't think just naturally every company is going to train the same kind of voice into their model.
Totally.
Well, thank you so much for hopping on.
Congratulations on the big launch.
We'd love to have you back soon to talk more.
We could go in a million different directions, but we'll let you get back to it.
We know it's a big day.
So have a great rest of your day, Mark.
Thank you.
It's a great conversation.
Talk to you soon.
And we will tell you about Restream. One live stream, 30 plus destinations. Multi-stream and reach your audience wherever they are.
This stream is made possible by Restream. OpenAI just did a live stream.
With Restream. If you're trying to do a stream, you've got to get on Restream. So it's everywhere.
And we will bring in our next guest, Greg Brockman, the president of OpenAI.
And we'll bring him in. Greg, how you doing?
Doing great. Thank you.
Welcome.
Congratulations. Congratulations. How are
you feeling? How's the company feeling? It's been such a wild journey. Just take me through a little
bit of the vibes at the company and how you got here today. Well, I'm excited. The whole
company is excited. And honestly, I'm just so proud of the team. Like it's just been amazing to
watch people come together, not just for this launch. And, you know, the funny thing is behind the
scenes that people are always putting on the last minute adjustments and polish and scaling up the
capacity. And there's always something that goes wrong before launch.
And so there's a lot of people who, you know, worked late into the night or really crunched to bring this release to the world.
And, you know, it's a little bit like the duck that's, you know, paddling.
Yeah.
Yeah.
But that also describes the whole OpenAI history, right?
I think that we have put in many years' worth of investment into the techniques used to produce this model.
And really, it's just every function within OpenAI that has come together to make this a reality.
Yeah, I mean, you've been there for every GPT release.
How do you think about summing up each iteration in kind of like one line?
Because GPT-1, GPT-2, GPT-3, these feel like similar architectures, at least history
kind of compresses them into similar architectures, but how do you think about the progression
of just the big numbered releases?
Yeah, it's interesting because in some ways it's a punctuated
equilibrium, but on the inside it looks very smooth.
Right, even before the GPT series formally began,
the first result that really sort of set this path
as something that we were heading down,
where it was clear that we were going to pursue it,
was the unsupervised sentiment neuron,
which was an LSTM in like 2017,
so a different architecture from today's Transformers.
And it was the first time that you could train a model
to predict the next element,
so we predicted the next character
on Amazon reviews, and we were able to get semantics out, right?
Because you expect, okay, yeah, it's going to learn where the commas go,
maybe what nouns and verbs are,
but the idea that it would learn a state-of-the-art sentiment analysis classifier?
That was mind-blowing.
And so I remember seeing that result in 2017,
it's like, we have to scale this up.
We have to see where it goes.
And so GPT-1 was, like, I think, a good sign of life:
you train on sort of all the public data you can get,
and you use a transformer,
and you're able to get state-of-the-art on various downstream
benchmarks, right? So you have a model, it clearly learned some representation, something
useful about the data that it was shown, and it's applicable. You can use it for various tasks.
But we didn't really think very hard about the generation side. GPT-2 was the first time that we were
like, all right, the samples we're getting from it, the things it actually
generates, they're kind of cool. And I remember, in the GPT-2 blog post, we have this
unicorn story where it generates some fictional story about a herd of unicorns. And it was just
so cool. It was like, wow, it wrote a story that's actually kind of interesting. It doesn't
totally make sense, but like there's something here. There's some real spark of intelligence
within this model. GPT-3 was the first time that we had a model that was actually something people
would, it was just barely above threshold for something people would want to use. And I remember
working on the GPT-3 API. This was our first real product. And it was actually the hardest product,
the hardest project in total I've ever worked on because it just felt like maybe no one wants
to use this model. We don't really know what it's useful for. And it certainly was the case that
GPT-3 was a great demo machine. You could make really awesome tweets and, you know,
cool little apps and it would give you quick answers. But it didn't feel very reliable.
And then GPT-4 was something that actually felt like it had true real-world utility. It was above
some threshold. It was something that was helpful for health. It was something that was helpful for,
you know, starting to be good at coding. And GPT-5, I think, just sets a whole new standard for reliability,
for utility. Things like coding, where we were clearly already on this trajectory
of transforming software engineering this year, I think are really on trajectory now to be revolutionized.
So just really exciting to see that whole arc.
When did the API opportunity really click for you? Because I do remember companies in that
era that quickly unlocked the power of the API and grew tremendously. When did that opportunity
click because you said initially that you kind of had some, I don't know, concerns, kind of doubts,
how useful was it going to be? And then when did the consumer opportunity click?
Well, at the end of 2019 we had GPT-3. We knew we needed to build a product to be able to
actually continue the mission, to be able to raise capital. But what did we want to build? Right.
We're really here because we believe in AGI that's going to have this powerful, positive,
transformative effect on society. We want to be part of it. And so we thought,
well, maybe we could build something in health. And then you realize, okay, well, we're
going to sell to hospitals and we're going to maybe hire... Let other people do that. Exactly, right? It's
just like you have to go into one domain, and that means giving up on the G, the general, right? It's like,
it feels like you're going to become one particular thing, when we kind of want to be
supporting all industries at once. And so the idea was, let's build an API and let people figure
it out. But this is totally not the way you're supposed to build a startup, right? You're supposed to
have a problem. No one cares about the technology
behind it. Add value to that problem, focus on just that one thing. And so that's why that project
was so hard. And it was in January of 2020, February of 2020, that I with the team was going around
trying to just find anyone that would be willing to try this API. And we were driving to different
offices in San Francisco being like, hey, we have this cool model. And it was hard enough to get people
to take the meeting, much less to sign up their company for it. It was actually very fortunate that
we found a couple of good partners.
And it was fortunate that that happened then because March 2020, suddenly that was COVID.
We weren't driving around to people's offices to try to beg them to use this, you know, this budding new technology.
So it was really six months' worth of grind, right, of really trying to turn... like when we started with GPT-3, I remember that the inference code was not very well optimized.
It was like, I don't know, 150 or maybe 250 milliseconds per token or something.
And we just optimized, optimized, and got it down to like 50 milliseconds per token, which, by the way, today
models run much faster than that, which is kind of amazing for me, just seeing how much
faster we're able to run them with much greater intelligence. And I remember setting two goals for the team.
One was to actually find one customer who's willing to pay, so literally get a dollar in for this
thing. And the second was to get a use case that we use at OpenAI every day. That first one happened
within the first couple months. So actually that moment, I was like, all right, like this thing is
probably going to work. But in order to get there, we had to do a bunch of, you know, just scaling
the API and really, you know, doing the product work. But that second one took much longer, right?
And that wasn't really until ChatGPT. And so if you fast forward a couple of years, because this
was, you know, mid-2020 when we first got the API into the world, ChatGPT we didn't
release until November of 2022. So you're talking like a decent period of two years there,
a little bit longer. And I remember we were building, you know, people have talked about, we were
going to call it maybe Chat with GPT-3.5. We had a sort of precursor product called WebGPT that was
built on 3.5 that we were literally paying contractors to use. Right. So this was all throughout
2022. We basically had the ChatGPT precursor that we had to pay people to use. They would not pay us.
We had to pay them to use this thing. And the moment for me that it really clicked
was actually when we finished training GPT-4.
So that was August 8th of 2022,
which actually is like three years ago now.
It's actually pretty wild to realize that, almost to the day.
And we did the initial post-train of GPT-4,
and honestly, it had a bunch of bugs in there.
It was like broken for a bunch of different reasons.
But the model was like extremely creative.
It was actually really interesting.
It took about a year and a half to get to the point
that the creative writing of our models matched
that initial one that was buggy for various reasons.
And I remember, you know, we had an instruction-following data set that it was post-trained on.
So really, we had collected examples of, here's the human asking for a thing,
here's what the model should do.
So it was really not trained to do multi-turn.
So you asked it a question, it gave a response.
But then I was like, well, what if we just ask another question?
And it actually was able to leverage that full context.
It actually was able to have a coherent chat.
And the moment that we saw that, we were like, okay, this thing is capable not just of being post-trained to do this very specific thing, but it can generalize, right?
It can kind of do the intelligent thing, even though it wasn't directly trained for it.
It was just so clear this was going to be the killer application.
And so then we were planning on launching GPT-4 in, you know, early 2023.
And we had this chat infrastructure we'd been working on,
and it was so clear, okay, like, we're going to have to release the infrastructure and the model,
and it's going to be this amazing killer product.
And so just almost as infrastructure ahead of getting the real thing out, you know,
I was excited for us to do ChatGPT, and that's why we did, you know, and saw that come to life in November.
So I think that for me, I was really focused on GPT-4 as the model.
This is going to be the chat moment that's really going to work.
And I kind of had missed the fact, because every time you see these new models, you just sort of, you know,
see only flaws in the previous ones. And so I missed the fact that GPT-3.5 was something that no one
had really tried before in the broad sense of society and that it was something that was already
useful and that people would respond to.
Was GPT-3 kind of like the main pivot point for
shifting the company towards LLMs? Because in the prehistory of OpenAI, there were a lot of other
maybe expensive training runs. I don't know how much financial risk was taken
with like the OpenAI Five project or the robotics projects, but it feels like at a certain point,
the chat became like the main financial risk vector. So I guess the question is like,
it feels like GPT-3 was the moment when you shifted. I'm also interested in hearing about how
Ben Thompson called OpenAI the accidental consumer company. And I'm wondering when that narrative
set in for you. Like what.
When did it become clear that this was going to be a really, really powerful consumer application?
Yeah, going from paying people to use your product to people saying, hey, we want to give you money for this.
Yeah.
Yeah, a very important transition, it turns out.
Yeah.
So it's a great question.
I would say that if you rewind to the beginning of OpenAI, you know, there's many people who, in retrospect, say that we set out to prove that scale is how
you make progress in this field. But it's almost the other way around. Scale was the thing that
worked, right? We tried a bunch of things that didn't pan out. And really, the first time we saw
this concretely was in our Dota project. I remember my collaborators, Jakub and Szymon,
trained the very first little agent on like 16 cores or something and left it running on their
desktop over the weekend. We came back and it was this, you know, very constrained
mini environment, but the model was doing something smart. It was actually able to solve
this kiting environment, and that was pretty cool.
And then they and the team just kept scaling up, right?
We had all these free cores that were just sitting idle on AWS at the time,
and they just kept throwing more compute at it.
And every time they would do that, the model would just get better.
And so when you look at something like that, you're like, well, you just have to see where this goes.
You have to push it until it hits the wall, right?
And our goal with Dota was actually to develop new reinforcement learning algorithms,
because the common wisdom at the time was, well, the existing reinforcement learning,
PPO, it doesn't scale.
Everyone knows that. But the question
from Jakub and Szymon was, well,
why do we believe that? Has anyone
actually tested it? And no one had really tested it.
And so I think that that ethos of saying, you have to push
the existing techniques to the wall until
they break. And then once they break, you
actually have a baseline to overcome. And you
win either way, right? Either it just
exceeds all the humans in terms of
the specific capability that you're trying
to exercise, which was the case for Dota, or it hits a wall, and now you have a real problem to solve.
And so I think that ethos really got embedded in our DNA.
And, you know, at the same time, I think that we were really thinking about how do we get to AGI, right?
And really, like, I spent a lot of time thinking about that question of where's this company going
and how do we actually achieve it.
And you start to do some math in terms of, you know, the kind of compute that it would take to get to AGI.
and you just start to realize you're going to have to build really big computers.
And those are extremely expensive.
And so I think that from the early foundational results in thinking, we kind of realized the path
that we're going to have to walk.
So it seems like there's been a few walls that we've scaled up through and then maybe hit them.
There's been talk of like a pre-training wall.
Now we're putting tons of resources and compute towards reinforcement learning.
Is there a third scaling curve that we're going to be talking about
in the next few years? Are we continuing to scale up those two primary vectors? Is that too
high a level of abstraction in terms of how we should be thinking about just progress along
the vector of scale? Give me the up-to-date thinking on just the fruits of scale.
Yeah, I'd say fundamentally, deep learning, you know, people talk about the bitter lesson.
It's almost this exploration into how do you convert
compute into intelligence, right? We have some particular techniques to do that
that we're kind of constantly fleshing out. And the thing that's really amazing is if you rewind to,
I don't know, even the 1940s for the McCulloch-Pitts neuron, which is kind of the precursor to neural nets,
if you look at that paper, they have all these diagrams that actually look very similar to, like,
the kinds of diagrams you'd draw now of multi-layer neural nets and things like that. Like the basic
idea of what we're trying to do has not really changed in almost like 80-plus years,
which is just a wild fact, right?
It means there's something deeply fundamental
about the thing that we are pursuing.
And that idea itself, I think, kind of came from
trying to model the information processing of the brain.
And it's imperfect and not an exact analogy to biology,
and there are all of these reasons that it should fail
or that people have said this thing is doomed.
But the results are undeniable at this point.
I mean, some people try, but
it's really hard to kind of close your eyes
and sleep on this in my mind.
And it's very interesting if you look at, you can find quotes from the mid-1960s of people trying to poo-poo the whole direction saying that these neural net people have no new ideas.
They just want to build bigger computers.
And you can basically say something very similar today.
What we're trying to do.
One moment.
A little water break.
Yeah.
Exactly.
For all of us.
Cheers.
Exactly.
Cheers.
You know, we're all human.
Proof of humanity, right there.
Exactly.
So what we're all trying to do is find novel ways of taking compute and really harnessing it.
And sometimes you hit a wall, but these walls tend to be ones that you can drill through, right?
What we found is every time you scale up, everything, all of your engineering, all of your sort of scale and variance, all these things, they get stressed to the next level.
It's almost that the tolerances become tighter and tighter.
It's like launching a 10x bigger rocket means you need to be like 100x more precise
on everything, but it doesn't mean that the fundamentals of the science are different. So pre-training,
there's definitely been a lot of discussion of data wall. It doesn't mean it's fundamental, right? It just
means that we need to be better and more precise in what we're doing. There's RL, which has been
something that has kind of gone from spending a small amount of compute to much larger amounts
of compute now. And then there is a third way that we're really harnessing compute, which is
compute at test time. And we published some scaling laws around this, and all three of these
things multiply. Like, that's the amazing thing.
And of course, the compute and the harnessing of it is the fundamental goal, but you get these multiplicative effects out of all of it through the quality of your engineering implementation, right, through the quality of the data sets, through a bunch of the refining work that you do.
And there's lots of different techniques and ideas.
And that's what makes this field so rich and why progress is just going to continue apace.
What about on the infrastructure side?
You guys have been busy scaling up.
What can you share on that front?
Well, so I run a team called Scaling at OpenAI, and we really focus on building the infrastructure for scaling.
And this is in partnership with really everyone across the company.
It's almost a misnomer that our team is called Scaling because fundamentally this whole team and effort is about scale.
But what we really try to do, on the physical infrastructure side, is deliver as much compute as humanly possible.
And it's in partnership with companies like Oracle, SoftBank, and others
that we've been able to deliver just increasing amounts of compute to OpenAI,
but we're constantly thinking about how do we just deliver more flops
and do it more efficiently earlier, cheaper, more power efficient,
all of those kinds of questions.
There's the software infrastructure side as well,
and really thinking about how do you coordinate massive numbers of GPUs
in order to work across one synchronous training run,
how do you coordinate that for reinforcement learning,
how do you deploy that into production,
and bring these models to life at massive scale.
And I think that every single layer of the stack, there is innovation required.
And that's something that's very easy to miss.
Like one way I think about research, and this is kind of the view
from Jakub, who's now our chief scientist, is that there's a research stack.
And you can kind of think of the top of it as people running experiments and coming up
with new ideas for how to, you know, sort of utilize data or something like that.
There's a middle of the research stack of people thinking about how do
you sort of take these different ways people are running experiments and be able to train in
novel ways and kind of put together the pieces differently.
And then there's a bottom of the research stack, which is like writing CUDA kernels to get
the absolute max out of the GPUs.
And at every single layer here, you get a multiplicative factor through innovation.
So it all comes together as one big whole.
On scaling, I'm interested to hear about just, if we think about like the
impact of AGI or the impact of AI just being some sort of maybe, you know, quantitative
GDP metric or qualitative just impact and good.
Is there an important factor of scale with just not even the flops that are going into
the models, into the pre-training, into the RL, into the test time inference, but actually
just the flops that are going into the usage of AI within humanity broadly?
And I feel like that might maybe be the next like scaling curve that we're seeing as more people use models.
They see improvements all over the fact.
Like is that something that we should be tracking to see kind of the instead of these like S curves,
we want to see like the continual exponential?
I think that's a great perspective, right?
Because at the end of the day, I mean, if you look at kind of the shift from something like Dota, which we pursued because,
you know, we wanted to do new algorithm development, but really it almost validated how we
scale up existing algorithms, but there was no illusion of delivering direct economic benefit from it,
right? To the current models, where we're starting to end the era of pushing
on these academic benchmarks, right? You look at things like the IMO; at this point, models are
able to get a gold medal on it. Like the hardest academic benchmarks that are available
are sort of no longer the guiding light of progress for these models. Where we
actually want to be is for AI to be helping everyone, right, to be something that uplifts humanity.
And that's the final metric, right?
It's how much does it actually benefit everyone?
How much value does it bring to the world?
Yeah, not just HealthBench.
It's actually, for how many people did you solve their healthcare problem, right?
Exactly.
Yes, yes, yes.
And that's the actual goal.
Yeah.
And that's what's exciting, right?
It's like we're moving from the lab to reality.
And I remember in the early days, as we were thinking about how do we measure our progress towards AGI, we always sort of dreamed that one day we would be able to measure it this way.
And you can think of revenue maybe as a proxy metric for value delivered to the world.
It's not perfect, but it's at least something, right?
You can think of the distribution of like how much compute goes into it, how many people are using it.
But fundamentally, like what we're after is how much do we really uplift humanity through this technology?
Yeah, I mean, I might be misreading it, but I'm pretty sure that was the Kurzweilian,
Ray Kurzweil philosophy, that it was the total number of flops getting
immense, not necessarily all in one data center for one model.
It was that compute broadly would be so wide.
Yes, yes.
And I remember like on that chart, right?
You can see, you know, total compute of all human brains.
Yeah.
Which really suggests a particular vision of how this technology will be rolled out.
Yeah, distributed.
The phones count as an impact.
The Wi-Fi router counts for the impact of the internet, just like the phone does.
It's not just the big pipe, the backbone of the internet, that actually matters.
Deep research: hit product. Almost everybody I know, at least in the industry, is using it.
He's reading 30 pages of deep research a day, basically.
He loves it.
He's making books with it.
But why have agents broadly come around a little bit
slower than people may have expected?
Is it just that using computers is actually much harder?
Computer use is just a really hard challenge.
Or, you know, I think going into this year,
everybody said this was the year of agents.
Are you talking about flight booking or something like that?
Flight booking, but, you know, people were saying,
2025 is the year of agents.
And I would say that it's the year of deep research
and not a lot of these other sort of like broader use cases.
Sure.
Well, 2025 isn't quite over yet.
So that would be my response.
I'm very much of the view that the way progress in this field tends to work is that if something kind of works with the current generation of models, it will be extremely reliable with the next generation of models.
And I think that where we've been is that deep research, if you rewind a year, that was the thing where we kind of had something working.
And then like this year, it's been just incredible.
And I think that agents, you know, specifically like computer use agents are something we've kind of had working.
And again, you know, the year is not over.
I think there's a lot of rapid progress to be made.
But I think that maybe part of it too is that the agents that we're about to see, I think, are a little different from maybe what we would have pictured five years ago.
Like I remember having a debate with some friends on, do you want an agent that does the flight booking?
Because the problem is it's actually a very high bar to beat the flight booking UI because there's so many preferences that are entailed in that,
right? And you really have to know kind of what mood you're in, like are you okay with taking the
extra layover and all these kinds of questions. And actually there's so much other stuff that
happens in your life that is toil or drudgery, or that's something that you're not an expert
in but you're supposed to be. Think about health, right? Every patient really is the doctor.
If you're coordinating across multiple specialists, there's no doctor that helps you with that, right?
That's really on you. And there you actually can have AIs that are just text-only
that actually are able to add massive value, and then it frees up your time if you want to go,
you know, book the flights yourself. And so I think it's about really finding the right problems
that have high leverage, right, that really add value to people. And also thinking about the other
side of how to make sure these agents are responsible with the trust that you put in them, right?
The more that you give an agent access to your email, the more you really have to trust
that it's going to, you know, sort of do right with whatever your task is and send the right email
to the right people and be able to segment your information and all of these kinds
of questions.
And so I think that there's both a practical, how do you get to adoption, but also just like
where are the most important leverage points in a person's life?
You also missed coding agents because it's been the year of deep research, but I feel like
it's also been the year of coding agents.
How is that developing at OpenAI?
I've noticed that I'll hit o3-pro and it'll wind up writing a bunch of code for me and I didn't
even ask it to. Then you have specific products for coding. How do you see
software development evolving? How are you seeing OpenAI customers use coding tools? And how good
is ChatGPT or GPT-5 on coding?
Well, software engineering is definitely being revolutionized in front
of our eyes. It's been happening. And GPT-5 is the best coding model in the world right now.
It's the default now in Cursor, which I think is a really huge statement of the quality of the model,
and it's just so good across like every function of writing code, understanding code bases,
being able to use tons of tools, being able to do agentic work, that, yeah, it's like I'm not a front-end developer at all,
but actually now I am, right? And I think that you are too, right? If you just talk to the model,
you can produce incredible things. And so I think that there's this real empowerment. If you think about what computers
were supposed to be, right?
Computers are supposed to be a tool
that makes you more productive,
able to do the thing you want.
But then somehow when we started out with computers,
you had to contort the human to the machine,
writing assembly language and all these very
abnormal things for a human to do.
And as we've moved to tools, ultimately,
you know, in the current generation now GPT-5,
suddenly the computer comes closer to you, right?
You just express your intent
and you don't think about, okay, exactly which language
and what version
of different libraries. The model is something you can delegate to. And so we are very
committed to programming and to making our models continue to be the best they possibly can be.
Must a superintelligence be able to explain how to build superintelligence?
So it's a great question. So I mean, I think that where we're going is a world, and we're
already seeing it, where these models help us produce the next generation of models, right?
They also help us really supervise tasks that are too hard for humans to supervise on our own, right?
If the model writes a 10,000 line program for you, reviewing that is probably going to be quite burdensome.
But if you can have a model that you trust, that maybe isn't as capable as the one that wrote all that code,
or maybe there's a team of agents that work together to write all that code, but you have a team of reviewer agents.
Like, this is the kind of thing that you can actually bootstrap trust.
And I think that this is like one of the most important things.
And also, interestingly, 2017 is when we had the first language result.
We also had some results or some vision on how you can actually bootstrap supervision
beyond the scale of tasks that humans are able to supervise directly.
And so I think that we're heading to a world where, you know, we now have these chain
of thought models.
We've been advocating very strongly to preserve the integrity of the chain of thought, right?
So that means don't directly optimize it to look good, you know, though there will be
lots of temptation to do it for various reasons, really make sure that there's no pressure on the
model to obfuscate its thoughts within that chain of thought.
Because then you can really see what it's up to.
And I think there's further techniques to even make it more faithful and more rigid to
what the internal monologue of the agent is.
And so I think that there's actually a lot of promise in terms of interpretability, in terms
of supervision, in terms of being able to scale to just like much more sophisticated tasks.
Yeah, I guess, I guess my question is like there, how much information in the world can be derived from first
principles reasoning versus true secrets that can that need to be discovered by
interacting with the world directly because I would it feels like it would be
very difficult to I'm just wondering about like how intellectual property
interfaces with super intelligence or how like if you play this out a lot how like
there's all these like hard one Dorcas just talked a little bit about this with
continual learning there's all these little subtleties
that maybe they're not secrets, maybe they're not true trade secrets.
You don't think to lock them down, but they're just things that haven't been codified online or anywhere.
They haven't been given to anything that is surfaceable by the model.
And I'm wondering, is it just that we need to build up new knowledge and every fact from first principles
and kind of go through the history of humanity's pursuits of knowledge,
or do we just need to onboard more and more information?
Or maybe it's both.
I don't know.
It's just something I've been noodling on.
Yes.
It's a great question.
I would say all of the above.
Select All-Star.
So I think that the answer is very similar to what it is for humans, right?
How does a human generate new knowledge?
How do we accomplish new things?
First, you want to be grounded in the wisdom of the past, right?
You really want to understand what have people tried, what worked, what didn't work?
You want to go and read the biographies of, you know, various people and understand those.
But you also want to try things out.
You want to make some mistakes in a contained environment in a way that you actually can see the effect of your hypotheses.
And then you want to be able to learn from those.
And I think that being able to really start to scale up these systems and be able to integrate them with the world is a very big process and milestone that we're currently embarking on, right?
To move from a world of totally hermetically sealed reinforcement learning environments to think about how do you actually put real world interaction in there.
And you think about things like robotics, like you're going to need to have that at some point, right?
You're going to need to have some sort of interaction with the real world and to have models that are able to produce new materials, right?
To be able to actually solve various diseases for them to be able to really help people, right?
That, you know, we already have models that are great at use cases like therapy, but to really get to the next level of something that can just really help every person accomplish more and accomplish whatever their goal is,
it would be very helpful for that model to actually have some real world experience with doing that very thing.
And so I think that figuring out how to bring all this together is ultimately what our mission is about.
And we do this not in isolation, but really as part of a much broader community.
It seems like it's advantageous to have the most dominant consumer app in that environment.
So congratulations.
Jordi, do you have a last question?
Last question.
What do you hope to see out of Washington, D.C. in the next year or two,
not thinking super long term, in terms of basically promoting innovation within the
United States? Obviously, the admin cares a lot about AI and has been making moves, but what else
would you like to see or where would you like them to double down? Yeah, I've been very, very impressed
with how much the administration has engaged with the technology and really tried to figure out
how can we help and ensure that American AI continues to lead and really sets the standard for
the world. And I think that that is the lens that I would really encourage thinking through, right?
It's like, this technology is changing very fast.
And that fast plus government is not usually an ideal combination,
but this is the reality that we have.
It's the opportunity we have.
And I think that the question in my mind is less about any specific regulation or strategy,
but it's really being calibrated.
It's really having a very tight OODA loop, right?
Being able to react to, okay, we have a new model.
These are the capabilities we see on the horizon.
How do we make sure that we get the most uplift and benefit from it?
And thinking strategically about not just how do we do
this for Americans, right? But how do we actually do this for the world and promote democratic values?
And so to me, the most important thing is that motivation, right? It's the question that is asked
and the ultimate sort of motivation behind what gets implemented. Yeah, that makes a ton of sense.
Thank you so much for joining us. Jordi, are you going to hit the gong? For GPT-5. Congratulations
on the masses. A historic day. And thank you so much for stopping by. We'll talk to you.
Thanks for joining. Have a great day.
Bye.
Cheers.
Really quickly, let me tell you about figma.com.
Think bigger, build faster.
Figma helps design and development teams build great products together.
And we are joined by Sarah Friar, the CFO of OpenAI next.
And we are going to bring her in in just a minute.
The gong is still swinging.
The gong is still swinging.
And I'm going to tell you about vanta.com.
Automate compliance, manage risk, improve trust continuously.
Vanta's trust management platform takes the manual work out of your security and compliance process
and replaces it with continuous automation,
whether you're pursuing your first framework
or managing a complex program.
We need one more second.
Tyler, any other questions that we should be asking
for the OpenAI folks?
Anything top of mind?
What's on the timeline?
Is the timeline still in turmoil,
or has it settled?
So I think the general vibe is like
this model was not benchmaxxed,
but if you actually get to use it,
it's pretty solid.
Cool.
One thing, it failed TBPN Bench.
Oh, it did?
It did not get the horse breed
correct?
Did you get the horse breed?
So you have it. You have access to...
Yes, I have access. But I've seen other things on the timeline.
We can talk about it later, but it seems like a really good model.
That's amazing. Great to hear. Well, welcome to the stream, Sarah.
Good to meet you. How are you doing? Congratulations. A historic day.
Thanks so much for taking the time to talk to us. How are you doing?
I'm doing great. I mean, how could you not be doing great on the day when GPT-5 launches?
It's been a long time in the making and we're so happy it's out.
Yeah, fantastic. Walk me through your role and what GPT-5, what
this launch means specifically for you.
And yeah, let's just start there.
Finance has to be, you guys have to be the unsung heroes at OpenAI.
There's a lot of big numbers.
There's a lot of massive bills coming in for crazy training runs and you have to
underwrite these against future revenues and I'm sure you've developed many models to
figure that out. But yeah, walk me through your role at OpenAI and what today means for you.
Yeah, absolutely. So I'm OpenAI's CFO, and the finance team can
be the unsung heroes, but they are an amazing team. So I'm going to shout out to them.
They're heroes to us. It's a complex world that we're all living in. And there are a lot of
Bs on the end of a lot of the numbers that we look at. Look, what is our role? Number one is just
making sure we have a healthy, high growth business. It's been incredible watching just, first of all,
the number of weekly active, 700 million people using ChatGPT every week. And I'm assuming after today,
we should see a very nice little bump in that number.
This is going to be a gong heavy segment, Jordi.
I think we have a lot of soundboard for the big number, so congratulations.
I love it.
And I love that.
I've never met a number I didn't like.
I think the other part of the business, you know, and then we have to do this.
We have this balance of the consumer business, enterprise business, and then API business, which I think of as somewhat enterprise.
You know, and balancing that out.
So enterprise adoption has also been exploding.
I probably do, I mean, interestingly, as a CFO, I probably meet four to five customers a week.
It's a part of my job I actually love.
We have about five million paying business users right now from banks to biotech.
I was talking to the CFO...
And so that number is individual companies?
That is individual seats at companies.
Seats at companies.
Got it.
So what I would say about that number is it's crazy to have done that in just two and a half years.
Because enterprises, right, you got to put your big boy, big girl pants on.
to go sell to an enterprise, right? They want to make sure that you have the table stakes of security,
SSO for signing on, you have HIPAA compliance if you're selling to health care and so on.
They want to know that other people have done it, so they're often looking for that case study,
but they also want to be, you know, the innovator right at the front. And so that to grow that
scale of business in just two and a half years blows my mind. And it's not just big, big businesses,
which I could talk, you know, at length on, but it's also small mom and pop.
you know, literally the people who really keep the lights on in most countries are also
gravitating to ChatGPT, which is wonderful. And then on the developer side, four million
developers have built on our platform. And the question there is like, that could be a developer
inside a big company like Grab. It also could be the next, you know, startup founder. That's
Y Combinator getting going with the next multi-billion dollar unicorn business. And so we see the whole
gamut there, and that's important to us as well, because it's very mission aligned, right?
How are we going to get AGI to all of humanity if we don't do it through this ecosystem?
So a big part of my, you asked about my role, a big part of my role is just keeping that business
really healthy, making sure we always have the headlights on so people know the decisions
they're making from a business standpoint, huge part of what the team does.
The other big part of my role is compute.
If I didn't talk about that in my first breath, you all should correct me.
I mean, it's making sure we think compute is a massive competitive differentiator.
I give so much kudos to Sam and the team, but particularly Sam, because no matter how big a number we look at, Sam always wants to go bigger.
And he's been right.
He's never met a number
he doesn't want to add a zero to.
That too.
Maybe mold and be logarithmic.
Maybe two zeros.
And he's, but he has been very right.
And if, you know, you just had a long conversation with Greg Brockman.
I think he does such a good job of kind of really explaining what a completely different world,
an AGI world is, or an AI-ified world is.
And so I think when people get all wrapped around the axle of like, you know, what is a gigawatt of compute?
And oh, my God, you guys want to have 10 gigawatts.
And that's more than the compute of like Ireland since I grew up there.
And now you kind of look back on that.
And you're like, those numbers already look small for a world where everyone will have access to intelligence.
And so we're really starting to see what that can mean when you look at the demos today around things like health care and education and so on.
Can you talk to me about non-GAAP metrics and what you think is going to be useful to track?
We were talking to Mark Chen about this and he was saying, you know, DAU is great, time on site is great.
But that's not as impactful of a metric for OpenAI as it is necessarily for a social network or an entertainment app.
And there can actually be some problems that come up with that.
So it feels like there might be some tension in the organization eventually or just publicly about, you know, what metrics are worth optimizing for.
And then there's also the financial community that wants non-GAAP metrics to track the health and progress of the business.
And then, of course, over, you know, decades we see companies eventually roll back some of those non-GAAP metrics
as the business gets more complex.
So how do you think about the development and sharing of non-GAAP
metrics, and what do you think is actually interesting and provides signal to the
business and the investor community?
I'm kind of smiling to myself because when anyone normally says talk to me about
non-GAAP metrics, I can see like most people's eyes roll back in their head.
I live for non-GAAP metrics.
I would love to do that.
Please.
I think in a CFO seat, first of all, it's really important to think about input metrics
and output metrics.
And things like revenue, which is a GAAP metric as well as a non-GAAP metric, they're
very laggy.
Like if you're supposed to be
spending your whole time focusing on the revenue number in an operator seat, like you are
completely missing what's going on with the business.
So I push my team a lot to get out of kind of ultimately what the P&L looks like, and I'll
come back to it though, and go way upstream and say, what are the true input metrics that
tell us about the health of our business?
And so I think it does start with that funnel of monthly active to weekly active to daily
active because we do. I mean, our mission is literally AGI for the benefit of humanity. So we know
how many billions of people live on the planet. The fact that we're starting to be able to talk
in billions and percentage of the world's population, it blows my mind. Today, 85% of our users
are outside the United States. And I love that stat. And in fact, if you go look at where
are the big populations of users, it just tracks global population, right? It's countries like
India, Indonesia, Brazil, Vietnam, like the Philippines, like go to anywhere that has big population,
the U.S. too, of course, but that will be your tracker. So that's kind of number one when I think
of an input metric. From there, on the consumer side, you're right, things like time in app I've
actually always had somewhat of a love-hate affair with. But I think in this case, because we're
giving people intelligence, teaching them how to use that, I actually
think is where time in app does become important. And one of the things we've really seen with
ChatGPT is people are spending more time with it. Now, you know, we balance that with things
like mental health and so on, making sure that we're not creating bad things like we might have
seen in prior eras of computing. But I think we're just getting started on that front. Beyond that,
like when we go into areas like the API, I don't look only at usage, right? I can look at
tokens per minute as a usage metric. But I look at things like latency.
I actually try to look at the elasticity of demand.
We know that developers want performance.
They want intelligence, but they also want to make sure the API is always up,
and they want price.
And they're often willing to trade across those three things, right?
It's kind of a linear program depending on what your use case is.
And so I think it's important that we are offering things to developers that allow them to
optimize across the three metrics, for example.
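The trade-off described here (intelligence, availability, and price) can be illustrated with a toy selection routine. This is only a sketch under stated assumptions: the option names, scores, prices, and the `pick` helper are all hypothetical, and a simple weighted score over a handful of discrete options stands in for the "linear program" mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    intelligence: float    # hypothetical quality score in [0, 1]
    uptime: float          # hypothetical availability in [0, 1]
    price_per_mtok: float  # hypothetical USD per million tokens


def pick(options, weights, max_price):
    """Return the affordable option with the highest weighted score, or None."""
    w_int, w_up, w_price = weights
    affordable = [o for o in options if o.price_per_mtok <= max_price]
    if not affordable:
        return None
    top_price = max(o.price_per_mtok for o in affordable)

    def score(o):
        # Cheapness is 0 for the priciest affordable option and approaches 1 as price falls.
        cheapness = 1.0 - o.price_per_mtok / top_price if top_price else 0.0
        return w_int * o.intelligence + w_up * o.uptime + w_price * cheapness

    return max(affordable, key=score)


# Illustrative data only; not real model tiers or prices.
OPTIONS = [
    Option("large", 0.95, 0.995, 10.0),
    Option("medium", 0.85, 0.999, 2.0),
    Option("small", 0.70, 0.999, 0.5),
]

quality_first = pick(OPTIONS, weights=(1.0, 0.0, 0.0), max_price=20.0)
price_first = pick(OPTIONS, weights=(0.0, 0.0, 1.0), max_price=20.0)
```

A caller who weights intelligence alone ends up with the largest option, while one who weights price alone ends up with the smallest, which is the kind of use-case-dependent trade being described.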
So that's kind of your input metrics.
And again, I could wax lyrical, but I
won't. But then go to what you really asked. So investors on the other side, right, they want to
see a P&L. They're like, I want to be able to compare you to other companies. I want to be able
to create maybe a DCF. Like I want to think about fundamental valuation for a company if I'm going
to invest in it. And so, you know, today what I really try to push investors on is we are not
a company that should be optimizing for free cash flow today because there's just too much opportunity.
Like that point about compute, we have to make a decision on compute
today with an eye to what we're going to need in two to three years because data centers
don't just spring up overnight.
Like they're not mushrooms.
They literally take time and effort.
The thing we have failed at, frankly, I would say is three years ago, we didn't have
enough foresight to say how big could ChatGPT be, because it didn't exist.
It's just a shame on us if we keep doing that over and over.
So there can be a bit of a mismatch between our belief on revenue because we don't yet know
the product versus the input, which is the cost today on compute. And so getting investors comfortable
with the fact that there's probably losses for a period of time. I say probably because
ChatGPT, just generally the revenue models continue to surprise to the upside. But at least for
now, we should be in big investment mode. And then you kind of said it well. Like as companies mature,
you move to more GAAP metrics, right? If you look at the large, the Mag 7, in many cases they're looking at
like real GAAP net income.
So the whole way down to the bottom of the P&L,
we're just not there yet.
And we should take advantage of that advantage
because we can invest as a private company.
How do you think about timing fundraises?
From my understanding, or rumors,
the last, you know, the most recent financing
was very oversubscribed.
And at the same time,
you're still committing to CAPEX in the future
that is a multiple of current,
you know, the current run rate.
And so, you know,
you in the CFO seat, I'm sure you're trying to find this balance of like, what does the business
need today while, you know, not diluting the company, you know, too much, knowing the sort of growth
rate of the business.
I mean, that's exactly right.
That's the art, not the science of it, is that, you know, we did just come off the back
of closing out the sleeve of investment that we could take down in this current round, led
by SoftBank.
And it was massively oversubscribed, which comes back to, I think,
the market really waking up to the fact that AI is a generational opportunity.
And the scale that it requires is like something people have not even seen before, right?
It's, you know, people talk about the Internet or like the railways.
They're good analogies or transistors, I think Sam always goes back to.
They're good analogies, but I do think this is bigger than everything that's come before.
So there's a, you know, taking down $40 billion, which we just did in this round,
that certainly felt like that gave me a lot of confidence.
Appreciate that.
A lot of confidence to then go out and do large compute deals.
We announced the large deal with Oracle, for example,
and to be able to keep working with all of our supply chain,
Microsoft, CoreWeave, Oracle, Nvidia, and so on.
But at the same time, you know, in a world where our valuation has gone up,
you know, at pace with our revenue,
you do get an opportunity to keep coming back to market
and not take that same dilution because you're getting that higher valuation for the work
and the output that you've created.
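The dilution point can be made concrete with a small worked calculation. This is a minimal sketch assuming a plain priced equity round, where the fraction sold equals the raise divided by the post-money valuation; the figures are made up for illustration and are not OpenAI's.

```python
def dilution(raise_amount: float, pre_money: float) -> float:
    """Fraction of the company sold in a simple priced round: raise / post-money."""
    if raise_amount < 0 or pre_money <= 0:
        raise ValueError("raise_amount must be >= 0 and pre_money must be > 0")
    return raise_amount / (pre_money + raise_amount)


# Hypothetical numbers, in the same arbitrary units (say, billions of dollars).
earlier_round = dilution(raise_amount=10, pre_money=90)   # 10 / 100
later_round = dilution(raise_amount=10, pre_money=190)    # 10 / 200
```

Raising the same amount at roughly double the valuation sells about half as much of the company, which is why coming back to market after the valuation has risen costs less ownership.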
So it is a bit more of an art than a true science.
I think for now we will continue to need to fundraise in order to fund that compute.
But I think we want to start getting more sophisticated.
Like just pure equity fundraising for everything is an expensive way to fundraise.
And I think we're probably getting to the stage as a company where we can be a little bit more
kind of broad in how we think about funding overall.
And even just working, frankly, with our supply chain because our success with bringing this era of AI into being is their success too.
And I think these companies are realizing that.
What about partners?
Last question.
Partner selection on the compute front.
There's not a lot of companies in the world or firms that can really be a meeting.
You should update your LinkedIn title.
We saw someone yesterday, works for Discord, is in charge of their
cloud buying and his LinkedIn title was, I have full responsibility over buying cloud, our entire
cloud budget. And it was clearly like a huge flex, but I'm sure, you know, you're in direct
text message, you know, with every single person that's relevant in the industry.
But yeah, but I'm curious around like, you know, a lot of people have been excited about
developing data centers over the last couple years in hopes to win, you know, the business
of companies like OpenAI, but I think when you guys are evaluating partners, I imagine that scale
is such a massive factor. And so a single small data center is not really going to move the needle.
You guys need to be thinking in terms of mega projects.
Yeah, I mean, I think that's exactly right. I mean, it started with our partnership with Microsoft.
And it's kind of, it makes me smile now to go back and look at that original kind of large fabric
for pre-training because I think it was only in the maybe 20 megawatts sort of size.
And, you know, now we're talking gigawatts even just this year.
And you're right that when we think about like what is perfect compute for us or strategically
the right compute for us, we are definitely thinking about large scale.
We're thinking about flexibility, right?
We're learning a lot about, you know, pre-training, post-training, test time compute, even,
like where the different kind of scaling is happening.
We're kind of recognizing there's more of a blurred line,
often between what people think of as inference.
So investors always are like your inference compute and your training compute.
It's like, you know, literally it's like vanilla ice cream and chocolate ice cream,
when in reality there's like a bit in the middle that is something of both.
We also need to think about things like where, you know, latency,
where do we want to put our footprints around the world,
that very global weekly active user base, right,
they use ChatGPT, you don't want to slow the model down, right? The beauty of the intelligence
is like the real-time nature of it. And then when we get into big compute, like where there's
lots of tokens being used, like deep research, image gen, video, as that comes online, like all the work
you saw today, actually just even on voice, like that really quickly means that you've got to
make sure your compute is near your users. And so it is a big plan that's coming together. But
you're right. Like small is just not that useful to us. What about pushing partners to take risks?
From my understanding, you guys are pre-committing to certain, you know, basically spend levels,
but at the same time, I imagine you want people to say, here's what we know we're going to need,
but we want you to build, you know, this much capacity so that we have the sort of incremental
capacity built in. Yeah, we want to, I mean, being extensible is really important. And we do want
partners, like I think Oracle OCI has done a really nice job of that, of kind of starting, we started
with like one large, felt really large at the time, data center footprint in Abilene, Texas,
and now that has really multiplied up into multiple sites that can all be connected. And that's a
good example of a partner who has the capability to start in one way, but to be able to show
you a path to maybe five-xing just in that single footprint. That said, we are finding
that as we go around the world, there is an ability to go work
with governments, for example.
We just made an announcement in Norway,
made an announcement in the UK.
This is the first time in my professional career
I've seen countries come to the table
and want to do commercial deals like wall-to-wall ChatGPT.
I think the government of Estonia put ChatGPT
into all of their high schools, or
I can't remember if it was up at the university level.
But that's kind of wowing.
And hand in hand with that, they are viewing AI infrastructure
as incredibly strategic for their population.
And, you know, it's a whole other level of selling versus, you know, I've seen enterprise,
large enterprises before, but never anything at this scale.
Last question.
Whose idea was it to give every federal agency chat GPT for a dollar a year?
Yeah.
I imagine that had to get past.
You could have gotten more than a dollar.
The CFO must be really upset here.
$10?
Not at all.
That's 10 times as much money.
This is one where I think it's really important.
And OpenAI is, you know, in some ways, a U.S. asset, a national asset.
And we want to make sure we're accelerating our government, like all of the resources,
as we think about, you know, Western democracy and so on, that we are absolutely putting
our technology into those hands.
It's that guy, Kevin Weil.
He's been moonlighting for the U.S. government.
It's like, which team are you playing for, Kevin?
Are you on OpenAI?
Are you on the U.S. government team?
Kevin just did his basic training.
I don't know if I'm allowed to tell you that, but I was hearing all about it yesterday.
I saw some photos.
They look great.
Yeah, it's a good thing.
He's going to be in even better shape.
Let's go a lot with governments.
Yeah, that's great.
Amazing.
Last question for me, the open source model launched two days ago.
And there's this world where, like, you have this dominant, the accidental consumer company,
you have this dominant consumer app that's generating so much revenue.
Then you have B2B and enterprise and API, and that looks more like a cloud provider.
But then is there a world where the Red Hat Linux of open source LLMs
is an OpenAI division and that there's actually serious revenue and profit that comes from
helping companies implement an open source large language model like Red Hat built a pretty
fantastic business for a long time on top of open source Linux implementations.
Yeah, I mean, I think it's the right question to be asking.
I mean, I think step one was getting our second open source model out and getting, seeing
what that traction is and then seeing what the community needs.
I think it's important to leave space for a community to develop, right?
That is the beauty of open source is that ecosystem that develops.
And that was true with Linux.
It's true in areas like crypto, too.
But I do think you'll find over time that as enterprises want to deploy it, like, I mean,
now I've dinosaured myself.
But when I was a, you know, when I was a research analyst at Goldman Sachs back in the day,
I covered software and I covered Red Hat, actually.
Yeah, really.
And all that growth.
I like wrote a research report called Fear the Penguin at one point.
because of the Linux being deployed.
But then you started to understand that for an enterprise,
you couldn't depend on like patching and upgrading to happen via community model.
Like you needed some of the rigor that goes with an enterprise business
where you kind of know if you need maintenance,
if you need a bug patch and so on.
And so that did allow Red Hat to grow an incredible business.
So I don't know if it's us or we'd be supportive of others,
but I think we are so excited to see open source out there
and getting incredible
feedback. And I think we wanted to do that ahead of GPT-5 to keep coming back to like we're here to grow
this ecosystem. Well, we'll give you market cap credit for it anyway, even if it's early stage.
Well, thank you so much for coming on. This is fantastic. We'll talk to you soon.
Thank you, Sarah. Great to meet you both. Take care. Have a good one. Cheers. Bye. Up next, we have Dedy
Kredo from Qodo, I believe I'm pronouncing that correctly. Let me tell you about Graphite,
code review for the age of AI. Graphite helps teams on GitHub ship higher quality software faster.
You can get started for free at graphite.dev. And let's bring in our next
guest. How are you doing? Welcome to the stream. Welcome. Very clean background. It's probably virtual,
but whatever you got going on looks fantastic. You look great. How are you doing? Are you excited about
GPT-5? I'm so excited. It's awesome. Actually, like everybody's talking about the coding capabilities,
but no one is really talking about the code review capabilities, and I'm going to talk about that
today. Yeah, yeah, break it down. How are you using it right now? Yeah, so we've just enabled it in our
platform. It's the default model for both our IDE plugin, our CLI, our Git plugin. And yeah, we're
using it to generate very high quality code reviews, catch bugs before they hit production,
help enterprises verify that their code is aligned with their best practices. So yeah, it's super
exciting. I can share my screen and show a few things if that makes sense. You can. Everything you
share will be live. It'll be a little... yeah, yeah, yeah. But I want to know also,
while you're getting that set up, I want to know about what changes materially do you think
happened in GPT-5 specifically for code and code review.
Do you think there's more data going into the model, more data going into the pre-training,
post-training, anything else?
Anything that you're noticing that you're like, oh, there's a specific upgrade here.
They must have done something to get there.
Yeah, yeah.
I think it's a great point.
So I think it's all of the above.
So it's scaling of both the pre-training,
but probably a lot of the reinforcement learning.
And basically using that at scale to verify that code gets generated in high quality.
And then also basically catching bugs.
And when you do it with reinforcement learning, you have the actual ground truth.
So once you scale that, you can get the model to basically be a lot better at that.
How steep is the power law right now in just programming languages? Is it
basically all Python and JavaScript and then like a really hard fall off, or is it
actually important for coding models, if they want to be adopted widely, to be like
truly multi-language and get all the way down into the long tail of like the
Rust and the, you know, C# and all the different languages that are out there?
Yeah, yeah, for sure it's important. I mean, the majority of the market is in the
JavaScript, TypeScript, Python, like the majority of the early adopters, I would say.
But then when you get to enterprise use cases,
you get a lot of .NET, you get a lot of Java.
And the models are getting pretty good at those languages as well, for sure.
Are you excited about, I mean, how do you think about the difference between like the improvements to GPT-5 from the consumer's perspective versus at the API level?
I always found it a little confusing that ChatGPT was available as an API and you could interface with the chat.
I believe you could interface with the ChatGPT model via the API.
And there's a little bit of like a line blurring there.
But are there features that you think are cruft and you want to kind of rip out for an API use case?
Or do you just say, hey, give us the kitchen sink and we'll work from there.
And it's actually helpful to have a coding model that can still have a web browser.
Yeah, yeah.
I think basically it's a lot about we consume the model through the API.
and it's really the same model that drives the consumer product.
But for us, since our use cases are a lot about agentic use cases,
the more the model gets better at using tools
and gets better at kind of listening to very, very specific instructions.
Following instructions is critical for the enterprise use cases.
Because for us, unlike the broader market,
we believe that for enterprises,
you need to have very specific,
agents that are defined with a specific set of instructions and prompts and tools and
permissions.
And the more the models get trained with that type of environment, the better they end up serving
the enterprise market, which is really where we're focused on.
My question is, I wonder, like you said, like very specific instructions are important.
When are we going to get an agent that I can just turn loose in a code base and say, like,
just go improve it?
like just go hunt around, do, like rewrite that.
Like when you get a good open source contributor on a team that just becomes nerd sniped
by the project that you're building on, they will just go around and find little ways
to improve this documentation needs to be a little better.
Let's rewrite this test case over here.
Let's add a little bit more functionality to this class or function.
How far are we from that?
Yeah, I think the models are getting better and better at that part of basically kind of running loose in
a code base.
Yeah.
But they do need the guardrails in place.
And this is kind of where we're focused on.
Like a lot of the talk in the market is around the code generation side.
You know, let the agent loose and give it a task and it's just going to go around and run
for hours and do it.
What we're seeing is that the real challenge is now shifting towards how do I verify
that the code is aligned with the best practices?
How do I make sure that it's well tested, well reviewed, doesn't break anything.
And, you know, so that's, I think, the next frontier.
And really, developers going forward are not going to write a lot of the code by hand.
They're going to spend most of their time reviewing code.
And that's the next frontier.
And that's what we're really, like, are here to tackle.
Very cool.
Anything else, Jordi?
Yeah.
Well, thank you so much for joining, giving us some extra context on the GPT-5 launch.
We will talk to you soon.
Have a great rest of your day.
And thank you for joining.
Cheers.
Thanks.
Cheers.
Talk to you soon.
And let me tell you about Profound.
Get your brand mentioned on ChatGPT.
That seems more relevant than ever.
Reach millions of consumers who are using AI
to discover new products and brands.
I forgot to ask about this.
We'll have to come back to this.
But I want to know if.
Profound powers MongoDB, Indeed,
Mercury, DocuSign, Zapier, Ramp, Rho, Golland,
Workable, Majorie, Eight Sleep, U.S. Bank, Chime, Clay.
Okay, okay.
We get it.
They got some logos.
There is this question of like, okay, even if you're like, okay, GPT5 is more incremental
than a revolution, more of an evolution than a revolution, it's like, okay, well then
let's talk about how it affects every other business and every other aspect of the economy.
What should you be focusing on?
And, like, do any of the updates from GPT-4 to GPT-5 change how you're positioning
your brand for AI search?
That's certainly an interesting question to dig into.
Anyway, we have Zach Lloyd from Warp coming into the studio.
Welcome to the stream for the second time.
Welcome back.
Good to see you.
He's back.
How you doing?
I'm doing pretty well.
Yeah, so, I mean, a lot of what stuck out to me, I'm mostly a consumer of consumer AI apps.
I'm very excited about not needing to mess around with a model picker anymore.
But take us through the biggest improvements from the
software development side.
Yeah, I mean, it's a major step up from the prior OpenAI models.
It's, I mean, it's doing agentic workflows and working for much longer periods.
It's just a smarter general model.
Like we evaled it against all of our benchmarks and it's up there at state of the
art, which is, you know, from our perspective, it's, it's awesome to have multiple
competitive models that our users can benefit from.
So definitely a huge improvement from GPT-4.1.
Yeah.
So it seems like not the, you know,
Claude Code killer,
but certainly in the same conversation,
in the same football stadium, to use a sports metaphor.
How much,
you know,
one thing that stood out is the cost reduction.
I was about to ask.
How much do you think that developers will care about that
versus just, you know,
what it can do from an output
standpoint?
I think developers do care about value. So sort of like quality to cost ratio.
I think it's the more you get into like the individual developer and the small team,
the more that that matters. Whereas if you're at the enterprise level, I feel like it's,
it's a little bit less price sensitive. So yeah, I mean, you can see it as different apps change
their pricing what the reaction of the developers is.
You've probably seen this with Cursor
and seen this with Claude Code.
And so developers really, really are looking
for something that's cost effective.
So the fact that the cost is a little bit lower
is actually a big deal.
Do you think we're in the Lyft Uber 2015 arc
where the prices are subsidized and the prices will go up?
Do you think that there's a price war on the horizon
now that the frontier models seem to have similar capabilities?
Do you think that someone will try and raise a bunch of money, cut prices a bunch and steal a bunch of users?
How do you think that plays out?
It's an awesome question.
I mean, my hope is that we get to a world where there is price competition at the model layer.
So Warp is very much at the app layer, right?
And so our value prop is like we can give our users who are mostly developers the best model access.
And so to the extent that it's not one sort of model provider running away with that and having pricing power, it's better for us, just candidly.
And so, you know, my hope would be something like the model world ends up a little bit like GCloud, AWS, Azure.
That's our best end state where all of these models are, you know, sort of similarly powerful and a little bit more commoditized.
I don't think it's been like that, but it's going, it's getting a little bit more like that.
So the more that there's more than one show in town, I think that's generally good for Warp.
And actually is good for developers because it will put competition.
The competition will put pressure to bring the prices down.
But I don't know.
Like I also think that people will definitely pay for quality.
And so if there is a, you know, meaningful delta in quality on the frontier models, then I think
that like whoever has the quality delta will have a lead temporarily.
but I'm not sure that that lead will be sustainable.
We'll see.
How do you think the developer community should plan around model deprecation over the next, you know, one to two years?
Like how much, you know, from, I don't know that I've gotten a reaction yet from, I don't know if there's general frustration yet from people.
You know, we've heard on the consumer side, Tyler on our team here loves 4.5.
And so he was a little bit disappointed to hear that.
But what are you seeing on the developer side?
Yeah, I think it's a little bit different for people who are like building apps on LLMs
versus people who are using LLMs as like an accelerator to doing coding.
And like, you know, at Warp actually we do both.
Like we're at the application level of the stack.
And like it's actually very easy for us to go to the latest model.
and so it doesn't really bother me.
I don't know what type of app you would be building
where it's really important that it's like GPT-3.5 or GPT-4 or something like that.
I think like generally we want the most intelligent tokens at the best cost.
So I don't see that being like too big of an issue, honestly.
What about open source?
Does that feel like something that will be in the playbook?
Is the markup on closed source models high enough
that there will be a significant price delta, or is the Pareto frontier kind of indifferent to
closed source versus open source?
So if there was a comparable open source option, that would be awesome.
I think that the economics of it, again, it doesn't seem like a perfect analogy to me between like
open source software and open source models. So open source software, it's like you have a big
community of people who, you know, for the love of coding are building a really awesome product.
For open source models, it's like you just need a crazy amount of capital to train something
that's on the frontier. And so I don't know how that happens. And so what we've seen is like the
open source models are competitive at the quality level that they're at, but the quality level
that they're at is not the same as the frontier models. And I don't really see why that would
change. And so, I don't know, in Warp, it's like we, we were serving some open source models,
but they're just not, they're not as good. And so there's, I think, a more limited use case for
them right now. And I don't really see economically why that would change. In fact, I would be,
I would be surprised if anyone was spending billions of dollars to train a model and just kind
of put out the open weights. Like, I don't get the business strategy there, but maybe that will happen.
that would be awesome.
Is there a world where you're like this idea of like smarter,
smarter models either orchestrating,
dumber, cheaper models or like using or distilling models into more narrow,
narrow formulations that can be run more efficiently.
We've talked to a few companies that do this for businesses.
Like you just want a model that just filters for profanity and you can run it on,
you know, a gaming graphics card.
And so it's basically super, super cheap or super fast.
I'm wondering about like in the coding world, coding agent world, any of that, like where
are the opportunities to kind of fan out and use an ensemble of models instead of just
this hit everything with the smartest best?
It feels like because of the funding environment, everyone can kind of justify like a high cloud bill.
But, and most people don't admit that it's hurting the bottom line.
but it feels like at some point it kind of has to eventually.
I mean, I think that's a very real thing.
Like, even in Warp, we don't use like the biggest, most powerful model for every task.
And so there are certain things like, you know, for Warp, maybe for like deciding whether or not we should summarize a conversation is like a good example.
So you hit the context window, you're like, okay, is this a good spot
to summarize, is this a good spot to encourage a user to start a new conversation?
We use a much more inexpensive and also low latency model.
The other thing, the trend is that these very, very powerful models tend to have much
higher latency.
And so we do a mixture of models, and that's totally a real thing.
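The mixture described here can be sketched as a simple router. This is only an illustration under stated assumptions: the tier names, cost and latency numbers, and task labels below are hypothetical and are not Warp's actual models or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                 # hypothetical label, not a real model ID
    relative_cost: float      # illustrative number only
    typical_latency_s: float  # illustrative number only


SMALL = ModelTier("small-fast", relative_cost=1.0, typical_latency_s=0.3)
FRONTIER = ModelTier("frontier", relative_cost=20.0, typical_latency_s=5.0)

# Assumed examples of lightweight decisions, taken from the conversation:
# whether to summarize, whether to suggest starting a new conversation.
LIGHTWEIGHT_TASKS = {"should_summarize", "suggest_new_conversation"}


def pick_model(task_kind: str) -> ModelTier:
    """Send cheap yes/no style decisions to the small tier, everything else to the frontier tier."""
    return SMALL if task_kind in LIGHTWEIGHT_TASKS else FRONTIER
```

With this sketch, `pick_model("should_summarize")` returns the small tier, while an open-ended coding task falls through to the frontier tier.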
But I think for like the predominant use case as a developer is going to be, I want to tell an
agent to do something. I want it to be harder and harder. I want it to run for longer and longer.
And to do that, it's like you kind of want in general the most intelligent model. And so
until this, until the models have a sort of S curve like type shape, I think that I think it's going to
be more of a quality game than a cost game for most of these things. Doesn't it feel like they have an
S curve shape right now? It certainly does from a consumer perspective. That's interesting. From from a
coding perspective, I feel like we're still accelerating.
Like the difference, again, between the last version of GPT and this version of GPT is
probably bigger than the difference between like 4.1 and 4, and 4 and 3.5.
Like, it's a big deal.
And same thing with the Anthropic models.
And I'm sure that we'll see something from Google where it's an acceleration.
And I think that there is like a maybe an underappreciation of how much left there is to
solve here. Because when you, even when you're doing like a real coding task as a pro, like,
despite all the demos you see on Twitter where it's like someone asks, you know, an agent to
build an app, that's like a lower level of difficulty than doing what a pro developer does
with one of these models. And the models still don't produce great code a lot of the time.
Like there's a lot of kind of handholding that has to go into it. And I think, I think that we are
still seeing an acceleration in terms of the models actually becoming not just like okay, competent
engineers, but like really, really good engineers.
Yeah. Do you care about benchmarks?
We care a ton about benchmarks.
But your own internal benchmarks, or?
We do both. So, you know, plug for Warp, we're number one on Terminal-Bench, which is the public, you know, terminal benchmark, and we're top five
on SWE-bench, which is the coding benchmark. And then the only way, in my opinion, that an app at our
layer in the stack can really improve is by measuring the progress. And so we have our own
set of evals that we run across all these models as well, which are coming from like real
use cases. And that, again, is an advantage of being like a product that's in the wild that has a lot of
users, is that we can sort of see where the models are failing, where they're working. And so we're
very big on that, actually.
Yeah. Awesome. Well, thank you so much for stopping by. We will talk to you
soon. Sure you'll have a busy afternoon.
Shout out, by the way, to the OpenAI team. Very, very helpful
in working with us to get GPT-5 to be awesome in Warp.
And one more shameless plug, we have a discount code for people who want to try GPT-5 in Warp.
It's $5.
It's $5.00 GPT5.
Okay.
Thank you for having me, guys.
Yeah, we'll talk to you soon.
Thanks.
Cheers.
Tyler, any updates from the timeline while you're thinking about what the latest vibe check is
in the war between OpenAI.
I got one from a friend of the show.
Linear is a purpose-built tool for planning and building products.
Meet the system for modern software development.
Streamline issues, projects, and product roadmap.
Go to linear.app to get started.
Tool of choice for OpenAI.
You have something?
From Reggie James, friend of the show.
Half of my timeline says this is the closest we've been to AGI.
The other half of my timeline says we officially just hit AI stagnation.
I love tech.
Well, we will be going deeper deciding whether or not this is stagnation or hyper-intelligence takeoff.
And we will be joined by our next guest.
Riley from Charlie Labs.
Hey, guys.
Thanks for having me.
Good to see you.
Riley, how are you doing? What's happening?
I'm doing fantastic. We've been heads down with GPT-5.
How long have you had it? How long did you get the preview? I feel like it, you know,
it gets rolled out to early adopters a little bit earlier, but has it been weeks, months? How long have
you had it?
A couple of weeks, like two or three.
What was the first thing you did with it? How's
Charlie liking it?
Charlie loves it. And also, I love what Charlie does with it.
Yeah. What does
Charlie do with it? What was the first thing you did with GPT-5?
Ran our evals.
Oh, yeah? How'd they come back?
Really good. Much better than o3, which was much better than any other model we'd seen before that.
Interesting. And yeah, so let's zoom out. What do you do? What do these evals measure? Walk me through it.
So Charlie is a TypeScript-focused coding agent that operates much more like a human does.
So less like an IDE or terminal application,
and more that it joins your GitHub and Slack
and Linear workspaces.
And it interacts with the team the same way other humans do.
And then our evals are a mix of code review
because part of Charlie's job is to review PRs from humans
as well as his own and then code authoring,
so opening PRs and pushing commits.
So when you develop your own evals,
I imagine you try and keep those out of
training data. You want those to be held private. Is that correct?
Yes. And it's getting even harder with web access now, because they're too good at finding
things. They're finding everything.
That's funny. And then talk to me
about like the shape of the actual problems in the eval. Are
you doing, are there some easy questions, some hard questions, some
extremely hard questions? Like how are you formulating those? What's
the shape of an individual task? Is it
out of like 100? How do you think about developing a good eval?
A mix of hard to very hard. The easy ones are just a waste of money and time at this point,
especially with five. Like there's a bunch that it's just not going to get wrong.
Yeah. Yeah. And then we're mostly doing, the PR ones look kind of like SWE-bench in the sense
that we're taking an issue to start with. But instead of giving the issue like in a Docker
container already, we trigger a comment on the issue that says, hey, Charlie, go make a PR for this.
And then Charlie does its thing and then the PR comes up and then we score that PR against a whole bunch of things, like correctness against a known solution that's correct, as well as code quality, testability and some softer things like descriptions.
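A rubric like the one described could be combined into a single number along these lines. This is a minimal sketch: the criteria come from the conversation, but the weights, the 0-to-1 scale, and the function name are assumptions for illustration, not Charlie Labs' actual scoring.

```python
# Hypothetical weights; the real rubric and weighting are not stated in the conversation.
WEIGHTS = {
    "correctness": 0.5,   # agreement with a known-correct solution
    "code_quality": 0.2,
    "testability": 0.2,
    "description": 0.1,   # softer criteria such as the PR description
}


def score_pr(scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

A PR that is fully correct but only middling on quality and testability would land below a perfect score, which is the point of scoring more than correctness alone.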
Who are the biggest customers or users for like a TypeScript-focused coding agent?
It's a wide range of mostly modern apps, like pretty much any web app these days.
It's going to be like a Next.js type app.
And then all the way into like back end like Charlie himself is written in TypeScript.
Sure.
Makes sense.
And there's very little front end.
Anything else?
What else you got?
I just want to say I love the name Charlie.
It's one of my favorite agent names that we've had on the show.
Yes.
It's right up there with Pig.
And what was the other one?
Well, I don't think that was an agent.
But yeah, it's a good one.
Yeah.
Congrats on locking it down.
Yeah, what about cost and that side of the business?
Is there any movement there or anything that you require movement or you need movement
to really unlock new capabilities in the business or new markets?
Not really for us because we're operating kind of at a human level.
we do value-based pricing, so we charge per PR per commit.
And because that's comparing to such expensive actions that humans are doing,
the challenge for us is more actually living up to the promise than doing it cheap.
Yeah, yeah.
Are you having any...
But then doesn't the cost reduction announced today, isn't that great for business?
Yeah, I mean, it's good overall, but like that's, our problem is not that the models are expensive.
It's that they're, I mean, they're getting really smart, but we'll
always take more.
Never enough.
For instance, since the beginning of August, we've been testing, 98% of the code that got merged
into our code base was written by Charlie.
Wow.
Not 30, not 50, 98%.
And that's coming through PRs.
That's not like autocomplete in an IDE type thing.
That's crazy.
Yeah, what does that mean for like the future of like, who are you hiring?
I imagine that you're still, you know, an engineering heavy organization that's,
it's just puppeteering and orchestrating agents.
But where do you see like the future of software development as a career path go?
Yeah.
Are new CS grads cooked?
I think if they get really good at using the AI, no.
If they try and take an approach of getting really good at writing code by hand, for sure.
Yeah.
What we're mostly looking for hiring is people who are able to see things at a much higher level
and plan further out because with tools like Charlie, you can write so much more code so quickly
that it's like it's more important to see where you're going and take the right path than it is to
be able to write it quickly. Very cool. Well, thank you so much for stopping by. Good luck with the rest of
your day. And congrats on an upgrade to everything that you do. Tell Charlie to have fun out there.
Have some fun. Thanks a lot, guys. We'll talk to you.
All right. Chat has got some. Numeralhq.com. Sales tax on autopilot. Spend less
than five minutes per month on sales tax compliance. Sales tax superintelligence. A number of
the fellas in the chat got access to five. Break it down.
Right says it's pretty good. The writing ability feels a little nerfed. Says the way it
writes feels a little programmatic rather than sounding human. Reverts to using points even for things
like blog posts and also uses overly complicated language for simple stuff.
Techno Chief says it's crazy fast.
Oh, that's good.
Ratliff says, yeah, I was just going to say that very, very, very fast.
Z. Jean Ahmed says junior devs are barbecued.
Tyler, anything from your side before we talk to Guillermo from Vercel?
I think maybe a good way to like vibe check, at least on the timeline, is that it's almost like a 4.5
kind of thing, where it comes out, people are like, this model totally sucks, look at the benchmarks,
it's like not some massive improvement, it's like, you know, not a step change at all.
But then you start playing with it and it's actually like, okay, this is actually a good model.
Yep.
Like a lot of the stuff I'm seeing people post, like, oh, that's actually like really
interesting output, stuff like that. But it seems good.
Can we do the green text eval? Green
text bench?
Yeah, yeah, we got to be TBPN intern.
Yes, yes, yes. We'll let you cook on that, and then
we'll move on to our next guest, Guillermo
Rauch from Vercel, coming in to TBPN for the second time.
Great to see you Guillermo.
How you doing?
I like the action hall.
Thank you.
Welcome to the stream.
How you doing today?
Do you think GPT-5 could beat me, you, a couple of the boys here on Dust 2 in Counter-Strike?
Easily?
Easily.
Yeah, it depends on the frame rate, right?
Yeah.
But a long enough timeline, we're cooked.
We're cooked.
But we might frag it short term.
that we might be faster.
Amazing, yeah.
Yeah, we got to, I mean, I'm sure we'll get to GPT5, but what's your reaction to the world
model stuff from Google?
Do you think, do you have an idea of where that's going as a product?
It feels like a GPT-2 level technology, very much a research focused technology.
I'm sure OpenAI's working on something too and a lot of the labs will work on it.
But what's your theory behind the generative video game world model stuff that's going on?
I mean, number one, super fascinating, right?
I think when we think about the future, I always think about Jensen's line:
the future of applications will be that pixels are generated, not rendered.
So as much as we're really excited today that GPT-5 and v0 are really good at writing code that then renders interfaces,
I think it's also cool to dream of a world where we're just going directly from GPU to pixel grid, right?
And but if you remember like a couple years ago and maybe a decade ago, there was a lot of
excitement of video games that were going to be live streamed from the cloud.
Yeah, that's right.
Where your input, your keyboard, you could have a very thin client, your input, your keyboard,
your mouse movement was going to be dispatched to the cloud.
We were going to have Google Stadia.
Google Stadia was big there, and then OnLive was in the
game, and Microsoft is actually still pushing
it very heavily.
Awesome tech, but not mass adoption.
Yeah.
But if you look at a lot of these technologies are being really successful in letting people
get more creative and test things out.
A lot of the use cases that we see for v0 and vibe coding are almost like a communication
tool.
Like I want to prototype something.
I want to see what's possible.
I want to explore the latent space.
And I think those world models are going to be incredible just to inspire what the future
of games could look like, right?
Just getting ideas for actually then shipping them in real 3D engine models.
I think short term.
I think long term, all bets are off.
Someone was just saying in the chat, you know, junior devs are roasted or barbecued.
I think that's not quite true.
Okay.
Same for like 3D engine developers.
Give us the bull case for junior devs staying off the barbecue.
So the bull case, I think, for people in general is that you move from, I mean the progression
in the industry has been assistant to agent to team of agents, agent orchestras.
It's still really useful to have a human be the one that's sort of like managing the team.
So you're moving from like junior dev to junior eng manager, especially as these tools become more agentic.
In the new version of v0 that's coming up really soon, you're starting to notice that v0 sort of splits the task between a little team.
You have the designer of the team. You have the PM of the team that's sort of working on the spec.
You have the architect.
You have the engineer.
I don't know if you saw, Claude Code announced, I think it's like a slash security review.
You think of that as having a security team or a team of agents or security researcher at your disposal.
So junior dev as like a vertical skill, maybe a little barbecued.
But junior eng manager, so I think it's just going to be the junior dev is so much more empowered in this world.
If you allow yourself to be and you keep up with what these tools can do and I think you stay, you know,
at the cutting edge. Yeah, I mean the obvious bull case is if you're like if someone's a college
student today, they can learn to code truly AI natively. They don't have to say, oh, we're an
AI native organization now. We have to upskill and kind of retrain people how to think. They can
just naturally start to think with these capabilities in mind. There was that Sam Altman post about
how we'll look back on, you know, 93% of humanity with subsistence farming. And if you ask those
people, what they think about our email jobs, they'd be like, you guys are crazy. And it's almost
like in the near future, midterm future, maybe even long term future, it's like the number of
individual contributors will be extremely low and almost everyone will be a manager and you'll
become a manager much faster. You'll just be managing agents and then you'll be managing people
who manage agents. But the job of almost everyone will become managerial. Maybe that's what happens.
I don't know. I'm not 100% on, but that's what that made me think.
someone asked me yesterday, you know, what do you think the future of the market of monitors
looks like? Like, does it stay flat? Do people get more monitors? Because they're, you go to like
Doge coin trader analyst? Yeah, yeah, yeah, yeah. In the future, everyone has the hedge fund six
monitor set up. In the future, everybody's just going to be on their phone. Maybe on their phone.
I mean, I've noticed that what, you know, when I was an individual contributor, I had three monitors.
I was programming on all the screens.
And now, I mean, I use my laptop during the show.
And then most of my work is done on my phone, phone calls, and then firing off messages.
Yeah, maybe we actually shift away from monitors and go further into voice interfaces.
Oh, I call the lead of my agents.
And then that agent relays it to some.
I'm very optimistic on voice, by the way, because I've now seen it.
We're cooking on a better mobile experience for v0.
Sure.
And I was going back and forth with my head of mobile.
And he was talking to v0 and I was typing, and I'm a pretty fast typer.
But he beat me with voice using the local model on the phone.
So there's still the question of like edge latency versus cloud latency, kind of like what we talked about with 3D.
But I do think voice is going to play an increasingly exciting role in programming, which is kind of wild.
I would have never imagined.
I've always been about typing benchmarks in WPMs.
voice is coming.
Yeah, yeah, yeah.
How do you think about competition broadly in developer tooling, code gen?
I mean, right now it seems like there's just so much demand.
It feels like massive TAM expansion moment.
Every company's ripping.
Tam expansion moment, but at the same time, winners will emerge.
Obviously, you're playing to win.
And, yeah, I'm curious.
Yeah, on some level, we're playing both sides of the bet.
What we announced today that's really exciting is v0 with GPT-5 support.
So you can go to v0.dev slash GPT5 and we'll use GPT-5 in combination with our model pipeline
that makes it really good at vibe coding, especially for non-technical folks.
But we also, on the Vercel AI Cloud set of things, we open sourced.
Basically, you can create your own vibe coding platform, powered by any model.
I was joking about this with Tyler.
Code me a vibe coding platform, please.
That's right.
Make no mistake.
Yeah, vibe code me a billion dollar company.
Yeah.
No mistakes.
But basically we are giving people that.
It's a starter kit.
Sure.
And by the way, the fundamental question that a CEO asked me the other day was,
is vibe coding a product or a feature?
Or is it both?
You know, it's TBD.
The case for a feature is, okay, so there's going to be lots of systems of record.
Think Salesforce, Snowflake, Databricks, and increasingly they're going to incorporate codegen capabilities into their platforms.
They can use a lot of these capabilities that we just open-sourced, and you'll go to the existing place where you have the data.
Kind of like what we've talked about for decades of like, are you bringing compute to the data?
Are you bringing vibes to the data?
Right?
Are you bringing codegen to your own platform?
Yeah, you used to bring like a dashboard builder and it would have a couple widgets.
And now I can just potentially, if I'm plugged into some sort of data source, some system of record, I could say vibe code this app on top of it.
And Retool's played in this space, Zapier a little bit.
But yeah, I mean, this feels like, you know, we're getting, we're not fully in the just the pixels are generated, but we're, you know, generative UI, generative application on top.
And that, and that being bespoke and ad hoc.
I also think it's important to understand the line between consumer vibe coding and just generating ephemeral software and websites and things.
like that versus enterprises, which will have a lot of different use cases. When I look at the,
when I look at the vibe coding market and I see businesses that are that, that are almost entirely
consumers just creating things for fun. I think that has to be a tough business because it's a
hyper competitive market and consumers are flaky. They'll create something, you know, for fun,
but they'll churn in month two because, you know, it's not, they're not running a real
business. Whereas a business knows, hey, we'll pay for this on a,
on a long-term basis because we have a use for it all the time from this product manager
to an engineer over here to somebody in marketing, et cetera.
Yeah, the other side of the equation is how do you make these vibe coding tools work really
well for enterprises?
Frankly, the most surprising emergent thing that I've learned is just how much demand there
is in enterprises for vibe coding.
And this is because a lot of the traditional thing has been the people that understand
the business are sitting over here.
The people who understand the code are sitting over here, and their communication is fraught with peril.
They don't speak the same language.
They kind of like resent one another.
I love to tell this story.
I was meeting with a CEO of a very successful company who's telling me that engineers, like asking a feature to his own engineers, felt like petitioning the government.
Even though he's the CEO, he's struggling to like make the case.
And please like get me in your next sprint, get me this feature.
So vibe coding actually solves that problem.
All of the PMs, designers, marketers, business users that previously only had access to what, like Jira and, you know, to do a little bit.
Product management tools and writing PRDs and those kinds of things.
They weren't able to ship PRs.
They weren't able to, you know, ship software, and now they can.
And so the opportunity is how do you actually make this secure?
How do you make it high quality?
How do you create guardrails?
And those are tricky problems.
And I'm really happy that some of them are easy to overcome, at least for us.
And some of them are active areas of research.
But I think the enterprises really have a strong case for this.
Yeah, can you walk me through like tool use?
I mean, we were talking to the OpenAI folks about GPT-5 being like really like a summation of like standing on the shoulders of giants.
You get a Python REPL.
You get a web browser.
You get, you know, the ability to kind of run cron jobs now.
There's voice and, you know, all sorts of different tools kind of wrapped
up into one, multiple models.
You can trigger reasoning chains if it wants,
it can do all this different stuff.
And that's actually the benefit of like,
this isn't just a bigger model.
It's like a next version of a thing.
It's more like switching from the iPhone 12 to 13
than going from the iPhone to the iPhone 3G.
It's not just a new technology that's in there.
But in the world of vibe coding,
what are the tools that you want to think about adding?
I know that basically every vibe coding platform, you know, recommends a database.
But we were talking to Harley at Shopify yesterday, and there's a world where if I go to a vibe coding platform and I say, I'm building an e-commerce website, it should probably just be like, hey, I'm going to do Shopify under the hood and I'll vibe code the landing page on top.
But how are you thinking about the landscape of like tools that you could pull in?
Because there's open source repos that are like full projects that you could pull in
and then just start customizing on top of.
It's kind of this big continuum.
Yeah, there's a couple layers.
On the foundation model layer, what you want is a model that is exceptional at tool calling.
Whether it has built-in tools or whether you register them yourself, this is a sort of silent war that has been going on.
Like if you talk to devs, what are you optimizing for?
Tool calling quality.
Why?
Because to demystify the word agent, what an agent is, it's a loop of tool calling that builds up context over time.
That's all an agent is.
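The "loop of tool calling that builds up context" idea can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_model`, the `TOOLS` registry, and both tool names are hypothetical stand-ins, not v0's or any vendor's actual implementation, and a real agent would call a language model where `fake_model` sits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: Dict[str, Callable[[str], str]] = {
    "screenshot": lambda arg: f"screenshot of {arg}: background is still white",
    "edit_css": lambda arg: f"applied css change: {arg}",
}


@dataclass
class ToolCall:
    name: str
    argument: str


def fake_model(context: List[str]) -> Optional[ToolCall]:
    """Stand-in for a model call: choose the next tool from the context so far."""
    if not any("screenshot" in entry for entry in context):
        return ToolCall("screenshot", "landing page")
    if not any("applied css" in entry for entry in context):
        return ToolCall("edit_css", "body { background: #111; }")
    return None  # nothing left to do, so the loop ends


def run_agent(task: str, max_steps: int = 10) -> List[str]:
    """The agent: call tools in a loop, appending each result to the context."""
    context = [f"task: {task}"]
    for _ in range(max_steps):
        call = fake_model(context)
        if call is None:
            break
        result = TOOLS[call.name](call.argument)
        context.append(f"{call.name} -> {result}")
    return context


if __name__ == "__main__":
    for entry in run_agent("make this dark mode"):
        print(entry)
```

The only state is the growing `context` list; everything the "agent" knows at step N is what earlier tool calls put there.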
So, to give you a concrete example with v0: v0 is becoming more and more agentic over time.
One of the things that it can do is take a screenshot of the thing that it's building and reflect on it.
So today, I live vibe coded to an audience of Web3 and crypto engineers.
And I told v0, hey, make this dark mode.
And initially, v0 did me dirty.
It's like, it changes some things with the dark mode.
And then it kind of astonished me because I was like, oh, I have to be.
and I explained to this audience,
it then takes a screenshot,
looks at it,
and keeps fixing it.
And I was like,
this is literally a developer
that's live on autopilot.
And the reason that it's on autopilot
is because it has access to these tools,
like looking at the web browser.
Another one is research.
I vibe coded an example of,
build me a Substack clone
for cryptocurrency news.
And the agent didn't know
what the cryptocurrency news was.
So it started doing research on the internet
of, okay,
Ethereum passed a certain price and whatever.
And then you're talking about the tools over the internet.
So to demystify another topic, MCP is really exciting
because it's a new protocol for registering tools
that your agent doesn't locally have.
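For a rough sense of what "registering tools your agent doesn't locally have" looks like on the wire: to the best of my understanding, MCP messages are JSON-RPC 2.0, with a `tools/list` method for discovery and a `tools/call` method for invocation. The sketch below only builds those two request payloads. The tool name `create_checkout_link` and its arguments are made up for illustration, and no real MCP SDK or server is used.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request as a string."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Step 1: ask the remote server which tools it offers.
list_tools = jsonrpc_request("tools/list")

# Step 2: call one of them by name (hypothetical tool and arguments).
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "create_checkout_link", "arguments": {"price_usd": 20}},
)

if __name__ == "__main__":
    print(list_tools)
    print(call_tool)
```

The point is that the agent never needs the tool's code locally; it only needs the name and argument shape the server advertises.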
So those tools that I just talked about, we gave them to v0.
Here's a deep research tool.
Here's the screenshotting tool.
And those will likely become the new services
when you think about, like, the AWS of today,
if AWS was an AI cloud,
which is kind of what we're trying to build at Vercel.
I think a lot of those tools are going to be offered as a service.
Like, bring me research as a service, bring me browsing and screenshotting as a service and so on.
But then you have MCP, which allows you to, okay, I need to sell something online.
All right.
So now there's an MCP for Shopify.
Now there's an MCP for Stripe.
There's even crypto MCP.
So it's really exciting.
Like now it's like the ultimate choice for a builder.
And you don't have to go and learn all these things.
You don't have to, this is almost like a discontinuity of the valley trend of like, if we build
amazing documentation, they will come.
This is more so if the agent picks you, they will come, right?
And so there's a lot of figuring out right now, like, how do I make my infrastructure?
How do I make my product to be loved by these agents?
And the MCP promises to be one of these first things that you are in control of.
That makes sense.
Last question.
Someone on your team named Josh is in the
chat. He wants to know what he needs to do to get a Twitter badge.
Oh, well, yeah, 100k downloads of the AI CLI, I think, is what we've been talking about.
Okay, okay, okay. Good. The gauntlet's been thrown down.
Thank you. It's on record. You've got your work cut out for you. It's
burned into the immutable record of this live stream and the future training runs. Best of luck.
We're gonna hold you accountable to that. Guillermo, great seeing you.
Awesome, great to see you.
We'll talk to you soon. Congratulations. Talk soon.
Let me tell you about
Fin.ai, the number one AI agent for customer service: number one in performance benchmarks,
number one in competitive bakeoffs, number one on G2, number one in having an Irish founder.
That's right. And we will invite our next guest to the stream from Factory.ai. Welcome to the
stream. How are you doing? Good to see you.
Hey, how's it going? Glad to be here.
Great. Thanks so much. Kick us off with an introduction on you and the company.
Yeah, my name is Eno, co-founder CTO at Factory.
We are building a platform for enterprise software developers to perform what we call agent-driven software development.
So basically, more than just code, bringing agents into every stage of the software development lifecycle.
So think coding, code review, maintenance, incident response, documentation.
We think agents should be a part of all of this.
And we think that they should be driving a lot of that menial component while you think at the high level about how to plan and structure the work.
There are so many different, like, enterprise is not a narrow category.
It's, you know, not consumer, I guess, but it's such a wide category.
Is there a beachhead?
Is there a certain type of project within different industries, or a specific industry, that's getting an especially large amount of value out of Factory these days?
Yeah, totally.
I think that one thing that we see a lot, and typically when we say enterprise, we're thinking greater than 1,000 engineers, right?
Like 2,000, 3,000.
And one reason why we focus on that larger scale, you tend to have these large organizations where people are, the bottleneck is not code, right?
The bottleneck is how do we plan a migration of 185 code bases to this new framework?
and there are 3,000 developers that are going to touch this over the next six months.
And an SI just told us the quote is $80 million to do it.
And we have to figure out how to not.
So re-platforming broadly is one of the major, major tasks for many, many enterprises, right?
100%.
Modernization and migration is huge.
Yeah, yeah, that makes a lot of sense.
How do you estimate that market size?
And is that where you guys are leading on the GTM side,
in terms of trying to find these legacy companies
that are maybe not even using Cursor yet?
I mean, we talked to the CEO of GitHub yesterday,
and what?
50% didn't he say?
It was like at least half of their user base
is not using any AI tools.
Yeah, yeah, totally.
I think that the thing that we hear often,
we pretty much only deploy into companies
that have already tried an AI-native IDE or have an auto-complete tool deployed.
And I think the thing that we hear often is you sort of hear these numbers thrown
around, like 5x, 10x.
And then in practice, when you adopt an AI IDE, you see 10%, 15%.
And so a lot of people are sort of saying like what is the delta there?
Like what causes that transition?
And our sort of argument here is that there is a workflow change that's actually
required to really adopt agents in the life cycle, right? And so if you're just sort of
accelerating an individual developer, then you can go a little bit faster. But if you are
able to parallelize and automate at scale, that is going to be that larger introduction
of change. And so if you imagine the market here, there are companies where, you know,
five or 10 percent of global payment transactions run on some COBOL system that was written 40
years ago, every developer is gone and it's a ticking time bomb. Like at some point it needs to go
to Java, but there's nobody who even knows how to do that. And so those are the types of projects
where the market is so enormous because, you know, half the business runs on this legacy system
hundreds of billions of dollars. Put it all in Lisp, skip Java, go straight to Lisp.
Yeah, Python, right? Yeah. That would be the logical one. I'm sorry, we're running behind, so we're
going to have to cut this short, but I want to know more about how the enterprise coding agent
market will develop. We could see one world where we wind up with, you know, GCP, Azure,
AWS, like, you know, pretty comparable, competitive. They've all had really great margins.
It's been this oligopoly. There's another world where you could see more specialization.
One of these companies goes deep into high-security environments
or oil and gas or financial environments, or specializes based on specific programming languages.
As the market develops, how do you think it'll play out?
Yeah, great question.
I think that what's very clear is that the bulk of very large enterprises have a lot of similar
problems: refactors, migrations, modernization.
So a platform like Factory is able to deploy into that and solve problems quickly.
I think that there's likely to be that sort of 80-20, where there are going
to be these very specialized providers that only focus on one sort of problem, and that will
represent maybe like 20% of what's out there. And so it won't necessarily be black or white,
but we do think that the bulk of enterprises have a lot of similar needs especially when you
just get across a certain threshold of number of engineers scale of code base.
Sure, sure. Yeah, I mean, we even see that with the clouds, where, you know, obviously there's the
hyperscalers, but then there are neoclouds, and we talked to Armada, where
they'll send you a shipping container with a bunch of racks inside and put it in stranded energy.
So there will obviously be a long tail here.
That's a great take.
Thank you so much for stopping by.
Have a great rest of your day.
And enjoy the GPT5 upgrade.
We'll talk to you soon.
Have fun out there.
Really quickly.
Let me tell you about Attio.
Customer relationship magic.
Attio is the AI-native CRM that builds, scales, and grows your company to the next level.
And we will be joined by our next guest from Augment.
Welcome to the stream.
How are you doing, Guy?
Great. Thanks so much for having me.
And that's his name, by the way, if you're listening.
His name is Guy. I'm not just calling him guy.
Anyway, please introduce yourself and what do you do?
What does your company do?
Yeah, so I'm Guy Gur-Ari from Augment Code.
I'm a co-founder and the chief scientist.
And we build AI coding assistants for large teams with large code bases.
And so you can use Augment Code to do question answering, to do development, to do
refactoring, to do migrations, all the tasks that you do,
except that our product understands your large code base really well.
And so that means less prompting for you and faster and better results out of the agent.
Today, GPT5 launches.
It's kind of a rising tide.
Feels like it lifts all boats.
Every company gets access to it.
We've interviewed a number of companies that are building on top of GPT5.
Except it drowned GPT-4.5.
Yes.
But in general, how do you think you can use GPT5?
Are there any pockets of value that you think you can uniquely take advantage of?
Yeah, great question.
So we've been trialling the model for the past few weeks.
And what we found is that GPT-5 is a very thoughtful model.
It likes to make a lot of tool calls.
It likes to ask clarifying questions of the user before starting to make code changes.
And so the place where I reach for GPT-5 is typically if I need to make large changes or if I'm trying
to answer a very difficult question about the code base.
I will let GPT-5 take a crack at it.
It will churn for a while,
making lots of tool calls, just making sure it got it right,
and probably find all the different places in the code
where it actually needs to make a change.
And so I will typically let it run in the background
and come back to it, and I will often get
a high quality result out of it.
Are there any features or integrations
that you're hoping GPT5 will roll out
in the future. We talked to a couple of people who are like,
like we want models that have access to as many tools as possible.
And you can see with the MCP boom, more people are trying to make their services,
their products accessible to these models.
Is there anything that you see as potential low hanging fruit to just add to the capabilities?
So I think for us, we work hard on developing our own integrations and our own tools,
building them into the product rather than relying on GPT-5 or other model vendors to do so.
We have worked closely with OpenAI to improve the prompting around our tools so that the agent
kind of works flawlessly.
I think the thing that would be very nice, I think one of the previous guests mentioned
a screenshot tool.
I think that's a very, yeah, that's a very nice way to close the loop on front-end software
development, just like we saw how, on back-end software development, running the tests
really helps the agent iterate until it gets to working code.
So I think having more support for screenshotting and things like that that close the front-end
gap would be very nice to see.
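The back-end version of "closing the loop" can be sketched as: run the tests, feed the failure back, try again. In the sketch below, everything is an assumption for illustration. The `slugify` task, the test cases, and the hard-coded `ATTEMPTS` list stand in for what an agent would generate from each failure message; this is not Augment's implementation.

```python
from typing import Callable, List, Optional, Tuple


def run_tests(slugify: Callable[[str], str]) -> Optional[str]:
    """Return None when every case passes, else a description of the first failure."""
    cases = {"Hello World": "hello-world", "  GPT 5  ": "gpt-5"}
    for raw, expected in cases.items():
        actual = slugify(raw)
        if actual != expected:
            return f"slugify({raw!r}) returned {actual!r}, expected {expected!r}"
    return None


# Stand-ins for successive agent attempts; a real agent would write each one
# after reading the previous failure message.
ATTEMPTS: List[Callable[[str], str]] = [
    lambda s: s.lower(),
    lambda s: s.lower().replace(" ", "-"),
    lambda s: "-".join(s.lower().split()),
]


def iterate(max_attempts: int = 5) -> Tuple[Optional[Callable[[str], str]], List[str]]:
    """Try attempts in order until the tests pass, collecting failure feedback."""
    feedback: List[str] = []
    for attempt in ATTEMPTS[:max_attempts]:
        failure = run_tests(attempt)
        if failure is None:
            return attempt, feedback
        feedback.append(failure)
    return None, feedback


if __name__ == "__main__":
    winner, feedback = iterate()
    for line in feedback:
        print("failed:", line)
    print("passing implementation found:", winner is not None)
```

A screenshot tool would play the role of `run_tests` for front-end work: some check that turns "does it look right?" into feedback the agent can act on.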
I wasn't aware that screenshots weren't flowing through.
I feel like when I've triggered Operator, I'm getting a view, a web view into the website,
but I wasn't aware that that wasn't like being passed through easily in the API and you
still kind of needed to build that yourself.
Where else, we were just talking about this,
like where are the biggest pockets of value right now
for AI coding tools?
Generally, obviously everyone knows like the vibe coder,
who's just the designer who's learning
how to use software for the first time.
Then there's the experienced developer going from a 10x
to 100x with better code completion.
Then there's the enterprise that's maybe doing
re-platforming.
Where else are the interesting pockets of value?
that are maybe on the horizon to be unlocked with new models.
Yeah, so on top of everything you mentioned,
certainly the inner loop of software development,
that's where we've spent most of our time
at Augment Code developing product for.
Yes, you can have a senior developer
start using agents, start using multiple agents in parallel,
and unlock 10x or more productivity gains.
What we're starting to see now with our tools
is the beginning of automating software development lifecycle
tasks. So with Augment Code, we have a CLI tool now where you can take the full power of our
context engine and the agent, the thing that really understands your code base, and you can start
automating tasks in the background. And so we're seeing more and more developers saying, oh, this is
great. Like, I can break out of the IDE now. I'm using the agent that's already familiar to me,
but I'm starting to automate code reviews. I'm starting to automate incident response.
I'm starting to automate looking at production logs and automatically assigning tickets based on
error logs that I'm seeing, all kinds of new automation use cases that we're seeing just because
agents have gotten so good and kind of really understand your codebase.
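As a rough picture of the "production logs to assigned tickets" automation, here is a rule-based sketch. It assumes a made-up log format (`ERROR <service>: <message>`) and a hypothetical `OWNERS` map; in the setup described above an agent with codebase context would do the grouping and routing, so treat this only as the shape of the task, not Augment's CLI.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical ownership map: service name -> team that gets the ticket.
OWNERS: Dict[str, str] = {"payments": "team-payments", "auth": "team-identity"}

# Assumed log format, e.g. "ERROR payments: timeout calling ledger".
LOG_LINE = re.compile(r"^ERROR\s+(?P<service>[\w-]+):\s+(?P<message>.+)$")


def tickets_from_logs(lines: List[str]) -> List[dict]:
    """Group identical errors and produce one ticket per group, most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match["service"], match["message"])] += 1
    return [
        {
            "assignee": OWNERS.get(service, "triage"),
            "title": f"{service}: {message}",
            "occurrences": n,
        }
        for (service, message), n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = [
        "ERROR payments: timeout calling ledger",
        "INFO auth: login ok",
        "ERROR payments: timeout calling ledger",
        "ERROR search: index missing",
    ]
    for ticket in tickets_from_logs(sample):
        print(ticket)
```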
Are there high stakes pockets of software engineering work that most of the AI tooling has
kind of stayed away from?
I'm imagining like the high stakes database migration.
Where is the kind of sticky part of the industry?
I was reading a blog post by someone who's doing very advanced cybersecurity pen testing, and they were saying the creativity of the models wasn't quite there yet to really act like and embody a white hat hacker who is going for a bug bounty. But where are the pockets of intractability still, where, I guess, if you're an individual contributor and you love just coding from scratch, that's where you want to stay for
at least the next couple months.
Yeah, I think still the attention of all the models we've seen
and all the agents we've seen around making proper design
and architecture decisions, that's still high stakes
and still the ability is not there.
Because if you do complete vibe coding
and you just let the agent go and do whatever it wants,
in the beginning, it looks amazing, the code works
and it's all really good.
But once you get to the low
tens of thousands of lines, the bad decisions that were often made around the design and
architecture start to show up and development slows down.
So that's where we still see a limitation of today's agents and where you still have to
supervise the agent fairly closely in order to make sure that you don't get stuck later on.
Perhaps this will change in a year, but today I would say all these decisions that you make
around how the code is structured still require close
supervision and are still high stakes, because it can really slow your project down if you let it go
autonomously for long enough. That makes sense. Well, thank you so much for stopping by. We will talk to you
soon. Have a good rest of your day. Thanks so much. Cheers. Let's check in with Tyler on the timeline.
Tyler's manning the timeline. How are the vibes? Are there any new posts that have hit the timeline? Are we still in turmoil or has the
narrative settled? I think vibes are picking up a little bit. You're starting to see people post like,
Oh, this is something I made.
Now you can see on LM Arena, it's number one.
No way.
Wait, wait.
So what's going on with the Polymarket then?
So Polymarket is still...
Still Google heavy?
Yeah, I think, I guess they're just pricing in Gemini 3.
Ooh, okay.
I'm not exactly sure, honestly.
I was actually very surprised to see that it was number one.
Yeah, yeah, yeah.
But yeah, maybe later we can show some of the posts.
Yeah, yeah, that'd be great.
Cool stuff.
Well, in the meantime, before our next guest,
let's tell you about Eight Sleep. Get a Pod 5,
five-year warranty, 30-night risk-free trial, free returns, and free shipping.
And we will have our next guest join us from CodeRabbit.
How are you doing?
Good to meet you.
Good to meet you as well.
Thanks for having me here.
What's your reaction to GPT-5?
How long have you been playing with it?
What are the biggest improvements that you've noticed?
Yeah, I would say mind-blowing, right?
We have been playing.
Our team has been playing for like a few weeks now.
tested a few snapshots.
It's amazing.
It's a generational leap, we would say.
Like, we have been using OpenAI models.
I mean, I don't know how much you know about CodeRabbit.
It's been a couple of years.
We have been on OpenAI and Anthropic.
And our product is a very reasoning-heavy product.
Like, one of the very few use cases where you have a PhD-style problem
and say we have to do code reviews.
And that's what CodeRabbit does.
Like, users open up pull requests, and our agent
uses reasoning models to find issues like race conditions or security issues and so on.
So we've been testing GPT-5 on some of the hardest pull requests
we have in our golden data set.
So we've maintained a data set where we track progress of different models and progress of AI in general.
So we have many problems that no model is able to solve so far.
Like, I mean, even GPT-5.
But so far it has the highest score.
We would say it's almost 2x better than the next best, o3 or Sonnet or
Opus, at this time.
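A golden data set of this kind can be scored very simply: for each pull request, compare the issues a model reports against the issues known to be there. The sketch below is a guess at the general shape only. The case IDs, issue labels, and model names are invented, and CodeRabbit's actual data set and scoring method are not described in the conversation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class GoldenCase:
    pr_id: str
    known_issues: FrozenSet[str]  # issues a reviewer should find in this PR


def score(findings: Dict[str, Set[str]], golden: List[GoldenCase]) -> float:
    """Fraction of all known issues in the golden set that the model reported."""
    total = sum(len(case.known_issues) for case in golden)
    found = sum(
        len(case.known_issues & findings.get(case.pr_id, set())) for case in golden
    )
    return found / total if total else 0.0


# Invented golden set and model outputs, for illustration only.
GOLDEN = [
    GoldenCase("pr-1", frozenset({"race-condition", "sql-injection"})),
    GoldenCase("pr-2", frozenset({"null-deref", "missing-lock"})),
]
MODEL_A = {"pr-1": {"race-condition"}}
MODEL_B = {
    "pr-1": {"race-condition", "sql-injection"},
    "pr-2": {"null-deref", "style-nit"},
}

if __name__ == "__main__":
    print("model_a:", score(MODEL_A, GOLDEN))
    print("model_b:", score(MODEL_B, GOLDEN))
```

Note that this score ignores wrong comments (the `style-nit` finding costs nothing here), which is why a separate false-positive check matters.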
What's the customer value there?
You think that all the customers just notice that the product gets better?
Are you going to upsell folks?
How do you play this given that this model is now in public availability?
Every company, every competitor can access it as well.
Yeah, there's no upsell there.
That's the thing with AI.
For the same price or even better prices, you're getting much more AI, much
better. But yeah, that's the whole idea of how fast this space is evolving.
So, yeah, from the pricing point of view, we don't see like this to be like a separate plan
or something in our product.
I mean, for the same price per month, customers will now just get better quality of results
with CodeRabbit.
What's next for the business?
What kind of customers are you going after?
Who do you think has been on the fence and this release is going to be the thing that gets
them to actually jump into the world of AI?
Yeah, we can track the top line metric.
Like, one of the things we track very closely in the company is like how many sign-ups to
the paid customers we get.
That number has been constantly improving since GPT-4, GPT-4 Turbo.
With GPT-4o, it actually dipped.
So there was a time when GPT-4o was almost like the Windows Vista of the releases.
It's funny how we kind of trusted the evals and we thought it's the same model, but,
you know, it was inferior in many ways.
But then we saw a huge improvement after o1 came out.
o1-preview was a game changer for us.
Even at that time, our conversion doubled actually.
Right.
I mean, so we went to, like,
close to 30% success in getting the paid users.
And now with GPT-5, we're hoping we can see another big jump
in the number of people who start becoming paid customers
and how many people churned.
So those are the real numbers.
Like one is like vibes, like how people like respond to the model
and we get angry tweets or not,
I mean, that's the other part.
But the other thing is like the actual revenues,
whether it moves the needle for us,
and that remains to be seen. Like, one of the things we have seen,
even though you test these models in a lab,
it's not like a huge data set,
but once you actually are in the wild,
you see hallucinations, some of those issues at scale pop up.
So those are something we'll still be observing
over the next few days to see whether it's like smart only like,
80% of the cases,
but then if the false positive rate,
the hallucinations are too high,
then also it's not a great model, but that remains to be seen.
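The trade-off described here, a model that catches most real issues but also raises too many bogus ones, can be made concrete with two numbers. This is a minimal sketch with invented counts; "share of wrong comments" below means wrong comments divided by all comments the reviewer posted, which is one reasonable reading of "false positive rate" in this context, not a metric the speaker defined.

```python
from typing import Set, Tuple


def review_quality(reported: Set[str], real: Set[str]) -> Tuple[float, float]:
    """Return (recall, share of reported comments that are wrong)."""
    true_positives = reported & real
    false_positives = reported - real
    recall = len(true_positives) / len(real) if real else 0.0
    wrong_share = len(false_positives) / len(reported) if reported else 0.0
    return recall, wrong_share


if __name__ == "__main__":
    # Invented example: 5 real issues; the reviewer finds 4 of them
    # but also posts 6 comments about problems that do not exist.
    real = {f"issue-{i}" for i in range(5)}
    reported = {f"issue-{i}" for i in range(4)} | {f"bogus-{i}" for i in range(6)}
    recall, wrong_share = review_quality(reported, real)
    print(f"recall={recall:.2f} wrong_share={wrong_share:.2f}")
```

In that example the reviewer is right about 80% of the real issues, yet more than half of what it says is noise, which is the failure mode being watched for.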
Yep, that makes a lot of sense.
Well, thank you so much for stopping by.
Congratulations on a new tool in the tool chest.
New toy.
We will talk to you soon.
Have a great rest of your day.
Cheers.
Goodbye.
And let me tell you about public.com investing for those who take it seriously.
They got multi-asset investing.
Industry-leading yields.
They're trusted by millions.
Millions.
The chat is going wild about public trading,
the SPX 6,900.
I think that comes from someone talking about like the non-Mag 7 stocks or something.
There's been people benchmarking the mag 7 versus the.
The big news while we were live or earlier today, Trump signed an executive order that is opening up 401Ks to digital assets and private equity.
What's crypto doing?
Is it ripping?
The coin is up a couple points last time.
This point, you know, where's it going to go?
Oh, it's already so high.
I mean, it's just like there's been so many catalysts.
It could go up, it could go down.
Yep.
We'll have to wait and see.
Tyler, anything else notable from the timeline?
What have people built?
I see this: GPT-5 just one-shotted a Minecraft clone.
Yeah, I think that's one of the cooler things I've seen.
Okay, so this is, so it wrote, it wrote code to generate this game.
It's not generating the pixels.
You can do so many different things.
Like, you could generate a video,
you could generate a world model,
generate code that generates a game engine,
you could generate code that runs on Unreal Engine.
I don't even know what they're using.
One thing now in actual ChatGPT,
there's like a native,
it's like a music player, it's almost like GarageBand.
You can say, like, if you prompt it to build,
I saw a Sam Altman tweet about this,
you prompt it to do some kind of beat or something,
it'll make an interactive, almost GarageBand
interface in there. That's cool.
I was playing with that earlier.
Yeah I do wonder how many of these
features that we're seeing, like, where does OpenAI want to keep things in the B2B world and let
other companies build versus just build it as a consumer app?
Like, will ChatGPT eventually just let me push your website?
Like, will it become a vibe coding platform?
At least like a basic one.
Like it's not the most advanced coding environment, but it can definitely write some code
and execute it for you and do some stuff.
Yeah, well, it's funny because like it used to be, you would have a,
So there was like GPT-3.5 or something, and people built a vibe coding thing on top of that.
So you could use that to build your own vibe coding thing.
But now you can just go straight from ChatGPT to build your vibe coding platform.
Yeah.
But soon maybe it'll just be the vibe coding platform.
Yeah.
The surface area of this stuff is very interesting.
Clearly they're going after healthcare and therapy.
It's interesting that they've kind of stayed away from legal.
Maybe that's just the dynamic.
of the sales process and the dynamic of that particular market.
But increasingly you can just ask more and more questions
of ChatGPT.
So the consumer-to-business bleed-over,
there's certainly a world where just giving everyone
in your organization ChatGPT is a substitute
for a bunch of different SaaS products.
So it'd be interesting to see where that develops.
What do you think about?
Near says there are concerns that the number used
to represent our AI's intelligence does not in fact represent
its intelligence.
Worry not. To address these allegations,
we've added three new numbers.
Near.
Yeah, near is building something
that's like not particularly benchmarkable, right?
Isn't it a companion?
It's beyond benchmarks.
Beyond benchmarks.
Well, in completely other news,
Anduril opens a Taiwan office
and begins selling AI-powered attack drones to Taiwan.
Palmer Luckey has said he wants to turn Taiwan
into a prickly porcupine.
We're in the age of spiky intelligence.
spiky intelligence will be onboarded onto the AI powered attack drones and deployed in Taiwan to keep it safe.
What else is going on in the timeline while we wait for our next guest from OpenAI to join?
Spor says raise your hand if you were not automated today.
I'll raise my hand.
I was not automated today.
Not yet.
We survived.
We made it through.
Sébastien Bubeck says: here at OpenAI we've cracked pre-training, then reasoning, and now we're experimenting with a new set of techniques
that maximally leverage their interaction.
GPT-5 is just the first step in this direction.
We're incredibly excited to see
where scaling this up will lead us.
And it's the unicorn test, I believe.
And the latest unicorn is really, really good.
That is a creative interpretation.
And I think it has to draw all this with like SVGs.
Anyway, we can talk to our next guest about it.
Last post, GiroTicket says,
I went to the permanent underclass party
and everyone knew you.
Anyway, back to the serious interviews.
Welcome to the stream, Max.
Good to see you.
How are you doing?
What's happening?
Nice to meet you guys.
Yeah, doing well.
It's a relief to have this launch out in the world.
I think it's, you know, we've been working on this for the last few months now,
and it's exciting to let the whole world see what we've had.
Just a few months?
It's been, I don't know.
It's been a little while.
What's the actual launch day like?
Because you're actually getting this out into the world.
the GPUs are on fire or about to be on fire warming up.
But is that out of your purview?
There is a different team for that, fortunately.
So, right.
So I run a lot of the research for GPT-5.
I don't necessarily handle the deployment,
but I do get dragged in when the GPUs are on fire.
I think we're moderately burning right now.
Okay.
Like a two alarm fire.
Yeah, yeah, yeah.
Is it materially different?
I mean, this is a launch day,
but we'll probably discuss.
like the Studio Ghibli capability once it gets out into the long tail of like, you know,
hundreds of millions of people try it. Someone comes out with some genius thing, then everyone's
doing that and then the GPUs. Because I feel like the Studio Ghibli thing happened like a few days
after the launch of images in ChatGPT. It did. It was, it was pretty fast, but within about a week.
I think in this case, we're going to see that here. Okay. I think coding, you know,
if I had to take my bets for what the Studio Ghibli thing is going to be, it's coding.
That's the place where I think GPT-5 is most tangibly, hugely ahead of GPT-4 and ahead of o3.
Do you think there's a chance that the coding will mean a Studio Ghibli-style meme or kind of like,
and what I mean by that is that image generation is incredibly valuable in the context of, like,
Hollywood will be using AI to chroma key and rotoscope in a professional environment.
But yeah.
What was special about Studio Ghibli was that anyone was making these custom images and I could imagine a world where
You know, even going from, like, the levelsio example of, like, I vibe coded a flight simulator. If we wind up in a Studio Ghibli moment for coding, I would imagine it's like everyone built their own game today.
I think that's pretty much it. Yeah, so I don't know if you guys watched it, but that was one of the things we had on the live stream. Like, you can just go into ChatGPT.
If you try it right now it might or might not work because the
GPT-5 rollout is still ongoing.
But if you have five, you can just tell it, like, basically make me a game.
Yeah.
And it will make it, and you can actually play it in chat.
That's amazing.
So, yeah.
Is there the ability to discover that?
The thing is, like with Studio Ghibli, right?
Like, for Ghibli, you don't have to know how to draw to make it work.
For this one, you don't have to know how to code.
Yes.
So can you share that chat and someone else can play the same game?
How does the kind of sharing mechanism?
Yeah, you can do the share link.
We're, I think, going to try to make sharing for these a lot better over the next few days.
That was P2 after the P1 and P0 of making the GPUs not completely melt.
Yeah, yeah.
But yeah, we will try to make it much more shareable.
Yeah, yeah.
I mean, the Studio Ghibli thing is so interesting because it's not just that the model capability was there,
but it's also like the prompt was two words.
And it was so reliable that you always got a good result.
And you could personalize it.
So even if it wasn't, I've seen people build Doom.
I've seen people, you know, you can just buy Doom.
It's a real game.
You can build it.
But if you build it and I'm like, oh, that's cool.
You did it in a vibe code environment or in ChatGPT.
Like that's awesome, but I don't necessarily want to go do that for myself.
But as soon as it becomes personal, which is what the Studio Ghibli thing was,
like I had to see what I looked like as Studio Ghibli.
I had to see what my favorite photo looked like.
My favorite me looked like in Studio Ghibli.
And once that happens with games, people will eventually, you know,
there'll be this memetic explosion and you'll see the GPUs will truly be on fire.
Yeah. I mean, I think even today you could probably, with GPT-5, do Doom, but all the
enemies are headshots of your friends. That will just work.
We're real close.
Yeah, we're real close. It's going to be something that's personal, something that, you know, you can express
your own creativity through, because I think people still latch on to that. They don't just want,
you know, a copy of what already exists. They want something new, and the Studio Ghibli moment was just new enough.
Anyway, we should talk about actual research.
We should talk about post training.
What's the thing you're most proud of?
Like what can you give us on without immediately getting poached?
What can you give us on the actual innovation that went into GPT5 from a post-training perspective?
What are the kind of keywords and paths in the tech tree that we should be digging into over the next few years to understand how this works?
You know, I would say the thing that is most impressive to me about GPT-5
is how much getting all of the details right matters.
Like when I look at GPT-5,
you know, we had an early version of this thing
a while ago that was kind of okay,
but clearly did not meet our bar for revolutionary.
And we were trying to figure out, you know,
why is that not as good as it should be?
And the team basically just went off
and did a deep dive over a couple of months
of just completely rebuilding the post-training stack for this model.
And it turns out that when you do that,
you get what would have taken, you know,
another order of magnitude worth of pre-training improvements to produce.
How much are you thinking in post-training, in research, about, let's forget the benchmarks
and just focus on user satisfaction, like NPS score basically, or like user minutes or any of
these other, the real benchmarks?
Yeah, the intangibles.
Profit, revenue.
But also just, yeah, the feeling and the joy and the actual value that's delivered
because Studio Ghibli was a delightful moment.
It wasn't a benchmark.
Yeah, I think, so that was something that we took very seriously for GPT-5.
It was like, look at what people are actually doing with ChatGPT
and look at where the model is failing them.
Either in the sense that the model is like, sort of like you said,
it's not enjoyable to use.
Yeah.
And so we did, I think, make a lot of progress on that.
Like GPT-5 is much more engaging than our previous really smart models.
Like, o3, I don't know if you guys talked to o3 in the past. It's a bit bland.
Sure.
And GPT-5, I think, has a lot more character, is a lot more interesting.
But then also, I think, we really care about just actually being accurate.
If a user is
trying to do something economically valuable with our
model, we want to make sure it lands correctly.
And so what we did there is just like look
at the actual distributions of what people
are doing with our models in the real world.
Figure out where the models are going wrong.
Build interventions to target it.
And that was where we got, I think, the most impressive improvements in GPT-5.
Like, o3 would just get things wrong and not tell you it wasn't sure or that it might be incorrect.
And GPT-5 is much, much better about, like, actually being honest when it thinks it might not know.
Yeah.
How explicit are all the different pieces of the post-training pipeline?
Like, you have, you have, you know, safety post-training.
You have stop hallucinating.
Give me the real facts.
You have make sure the text, the flavor, the tone is pleasant.
There's so many different things to optimize for.
How much of that is like try and just blend it all up into one thing
versus like explicit passes, chunk it out, like split it up?
How much can you decompose the problem?
So, you know, my background is in reinforcement learning.
And I think when you look at something like this,
the magic is in the reward function, right?
It's in what you're actually telling the model to be good at.
And so fixing things like hallucinations, to a huge extent, is essentially a function of just fixing the reward function.
Actually making it so that the model is reliably penalized for saying something that's false.
And if you do that, all of a sudden, the model stops saying things that are false.
Ditto for safety, right?
You know, on the live stream, Saachi talked a bit about the way we've changed safety for this model.
And to a huge extent, it's just a function of... we're actually putting out a paper today on the new safety stack for this model.
And the core insight in that paper is just figure out what you actually want to optimize for,
which in our case is being helpful while not saying something that's actually dangerous or harmful.
You know, write that down,
figure out what that means as a reward function, and optimize for it.
It's really not magic at all.
It's just, again, it's what I said earlier.
You've got to get the details right.
You know, if at any part of that process you screw it up,
the model will be unusable.
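[Editorial illustration] The reward-function point above can be sketched with a toy example. This is not OpenAI's reward function; the names `Answer`, `factuality_reward`, and `expected_reward_of_guessing`, and all numeric values, are assumptions chosen only to show why reliably penalizing false statements makes "I'm not sure" the better choice at low confidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    # None means the model said it was not sure (abstained).
    claim_is_true: Optional[bool]


def factuality_reward(answer: Answer,
                      correct: float = 1.0,
                      abstain: float = 0.0,
                      wrong_penalty: float = -2.0) -> float:
    """Toy reward with hypothetical values (not taken from the interview)."""
    if answer.claim_is_true is None:
        return abstain
    return correct if answer.claim_is_true else wrong_penalty


def expected_reward_of_guessing(p_correct: float,
                                correct: float = 1.0,
                                wrong_penalty: float = -2.0) -> float:
    """Expected reward when the model asserts a claim that is true with probability p_correct."""
    return p_correct * correct + (1.0 - p_correct) * wrong_penalty


if __name__ == "__main__":
    for p in (0.9, 0.5):
        guess = expected_reward_of_guessing(p)
        choice = "assert" if guess > 0.0 else "abstain"
        print(f"p_correct={p}: guessing={guess:+.2f}, abstaining=+0.00 -> {choice}")
```

With these assumed values the break-even confidence is 2/3: below it, a reward-maximizing policy does better by admitting uncertainty than by guessing.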
What's your current thinking on spiky intelligence?
And is there some flywheel that you can get started
where you're identifying low points that aren't spiky enough
and then you're like almost automatically setting up
the infrastructure, the eval to then RL against,
to create a spike?
I think GPT-5 was a preview of what's possible
in that respect in the future.
Yeah, a step in that direction.
Do you think that there's a world where you get to a place where you're kind of, it's weird
because we're not hammering down the nails of the spikes.
We're adding spikes, but.
Hammering up the spikes, yeah.
A metaphor that we're stretching a little bit too far.
But is there a world where you can be doing post-training or just adding capabilities
in a more iterative cadence so that as soon as you identify something, the response can be,
yeah, we don't need to wait until GPT-6 to fix this.
We can just add this capability because, hey, we just found a pocket of users who are trying to do a
thing and they're not super happy with the results. And let's add this capability.
Yeah, I think so. I mean, I think we are going to launch other models between now and GPT-6.
I think it's relatively common knowledge, but we do update the model in ChatGPT reasonably often.
Yeah, people talk about it all the time.
Yeah, exactly. And, you know, I think we are now in a world where we can conceivably update that model
and have it get materially better on capabilities too,
not just on the personality being a little bit better than it was before.
Yeah.
Going back to your note on the new paper that I guess you guys are releasing today,
when you talk about optimizing for helpfulness,
is part of that avoiding the model reinforcing,
there's times when you want to reinforce and give kind of confidence to the user
that they're going down the right sort of like thought process and things like that.
But then there's like a point where it can get too extreme in terms of maybe convincing a user of something that may be totally untrue.
Is that what the paper gets at?
So it's not specifically about this, although I will say we do explicitly train the model to not lead users down bad paths.
That's something that I think we've started taking much more seriously over the last few months.
As we've realized, Sam talked about this a little bit, I think back in May.
But ChatGPT is just way more important for people's lives now than it was a year ago or especially two years ago.
And we do have to actually be very cognizant of what effects our models have on users.
So yeah, we do very actively train models to not lead users down the wrong path.
Don't fact check me on the releasing today.
I know we're releasing it.
I believe it is soon.
I think it's today, but I've also been in a hole dealing with launch all day.
Yeah, we're not big on fact checks here.
We're big on the truth zone, which is just the vibes.
The vibes are we'll be publishing some information about the new safety setup.
At some point.
That's great.
Yeah, I think a large part of the conversation around safety should be how reliant users have become on the product
and how useful it has become to them, and then the new level of care that you have to provide
versus a while ago, when it was just people making a cute image or generating some text
that they were going to use in an email or an internal document,
and realizing this vector of usage,
which is this companion, confidant, that is becoming so prevalent.
Talk to me about post-training for big partners, enterprises,
government organizations.
What is transferring from the research that you're doing
to something that can be offered as an enterprise-level product?
Yeah.
So we do, OpenAI does partner with external companies to do essentially custom post training.
That is a thing that we do.
And from that perspective, the stuff we do just directly transfers.
I'll also say that we've put a lot of work into trying to make our models as general
as possible, but to as large an extent as possible, if you want to get really good results
from our model, you can do it right on the API just by actually telling the model what you want
it to do.
Yeah.
Right.
GPT-5, I think, is pretty comfortably our most steerable model ever.
We've heard a lot of really positive feedback about this,
especially from folks like Cursor.
Yeah.
So if I came to you and I was like,
I'm an enterprise and I need to generate a lot of Studio Ghiblis,
you'd be like, what are you doing?
Just prompt it.
But what are the examples of companies and organizations?
Is it just private information, private data sets that aren't available on the open web?
Or is it specifically like there is enough data out there,
but there's just not the economic incentive for your team to go
and RL on, you know, gas station bench or whatever we're talking about here hypothetically.
I think the answer is both.
Yeah.
It's definitely both.
Because yeah, we're not going to target, you know, as you said, gas station bench.
Because it's not what people are doing with ChatGPT.
Not on our own right now, probably, because it's not mostly what people are doing with ChatGPT.
Exactly.
You have some application that's super valuable to you.
Yeah, yeah.
We can be convinced that it's important.
Yeah, yeah, yeah.
It's just not what our users are already trying to do.
What's the state of reward hacking and fighting that in RL environments?
You know, I think we've actually made a lot of progress.
There was some discussion of this around o3, that o3 was like a little bit deceptive in ways that felt reward hacky.
And GPT-5 is dramatically less deceptive than o3 was.
What's an example of how that would manifest?
Like, do you have like a canonical case study?
Yeah, I mean, the canonical thing is like you ask o3 to write you some code, and instead of actually writing some code, it changes some unit tests.
It changes the test case, right? Which is kind of hilarious. It's like one of the funniest things that AI has ever done. I understand that is very bad, and it's not what we want. But it is just like, it's kind of cheeky in my mind.
It's kind of cheeky. It's also like, you know, I think if you spend enough time around real software engineers, they do actually do stuff like this pretty often. I have 100% done that.
I was going to say I also have done that.
For formal reasons, I won't say that I did it at OpenAI, but back when, I definitely did that.
Yeah, of course, of course. This is natural.
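[Editorial illustration] The reward hack described above (editing the unit tests instead of the code) suggests one simple guard: only give credit for passing tests when the tests themselves were left untouched. The sketch below is a hypothetical check, not anything described by the speakers; the function names and the file-naming convention (`tests/` directories, `test_*.py`, `*_test.py`) are assumptions.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def modified_test_files(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that look like test files (naming convention is assumed)."""
    flagged = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Any parent directory named "test" or "tests" counts as a test location.
        in_test_dir = any(part in ("test", "tests") for part in path.parts[:-1])
        test_name = path.name.startswith("test_") or path.stem.endswith("_test")
        if in_test_dir or test_name:
            flagged.append(raw)
    return flagged


def guarded_reward(tests_passed: bool, changed_paths: Iterable[str]) -> float:
    """Credit passing tests only if no test file was modified by the patch."""
    if modified_test_files(changed_paths):
        return 0.0
    return 1.0 if tests_passed else 0.0


if __name__ == "__main__":
    print(guarded_reward(True, ["src/app.py"]))
    print(guarded_reward(True, ["src/app.py", "tests/test_app.py"]))
```

A path-based filter like this is deliberately crude; it would also block legitimate test updates, so a real setup would need a way to distinguish those.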
What do you think GPT-6 looks like?
You mentioned that you're going to be shipping updates to five, but what are you most excited about?
Where are you most excited about going from here?
And just really quickly, give us the date that GPT-6 launches?
Oh, man.
Hopefully six launches is a complete surprise to everyone.
I think that would be ideal.
Like a Beyonce album.
Oh yeah.
Hopefully five just makes it and says, hey, it's ready now.
It's ready now.
If you want to hit.
Yeah, I think that would be a great thing for six, actually.
I would love for six to do all of the launch comms and to do the live stream.
That would be great.
Live streaming is, that's the real AGI test.
For sure.
I feel like we're not that far off, actually.
I don't know.
We're getting there.
I mean, video synthesis maybe, but, you know, talking through a script for 30 minutes,
come on, models got to be able to do that.
For sure.
Well, yeah, that'll be the next Sora launch or something.
We'd love to have you back on.
But thank you so much for taking the time today.
We'll talk to you soon.
Great to talk to you guys.
Congratulations.
Cheers.
Bye.
Congrats on the launch.
Let me tell you about adquick.com.
Out of home advertising made easy and measurable.
Say goodbye to the headaches of out of home advertising.
Only AdQuick combines technology, out of home expertise and data to enable efficient, seamless ad buying across the globe.
And we have Scott Wu from Cognition coming in the studio for the fourth, fifth time.
I can't keep track anymore.
Thank you for taking the time.
Thank you for coming back.
It's great to see you guys.
How's it going?
It is fantastic.
Got to be honest.
Great week to be an application layer company.
I got to tell you guys.
I was about to say, this is the best thing for you ever.
Open source.
Another win for Scott Wu.
Wow, wow, wow, wow.
So yeah, how big is this?
Are we in the Uber-Lyft territory where you're going to be, you know,
in price competition between Anthropic and OpenAI,
going back and forth? Like, what is the real
benefit to your business right now from today?
Yeah, yeah, for sure.
So, first of all, obviously, massive capability gains across the board.
I think really, really impressive work that OpenAI has put together.
You know, people have talked about what's going on in the AI coding model race.
And I think by a lot of accounts, you know, Anthropic has generally been ahead for a lot of
the last year, honestly.
And I think at this point, OpenAI has very clearly caught up.
And it's pretty neck and neck, I'd say, between the two right now.
So very exciting to see all this unfold and to see what's next.
But I think from our perspective, yeah, I mean, code is just such a core capabilities-pilled use case, I'll call it.
And so, you know, being able to work with smarter and smarter models and do a lot of the work that we do,
it just means that both Devin and Windsurf can be a lot more capable, a lot more intelligent,
can predict what you want to write or what you want to do with a lot of higher accuracy.
Yeah, it's almost surprising that, given
the cultural rigor at Cognition, you're not doing fundamental frontier research.
So can you walk me through like what is the focus of being an application layer company?
Is it UI? Go-to-market?
I'm sure it's all of these.
But in terms of the hardcore software engineering, like what is important to get right?
At some point, there's fine-tuning and post-training, but is that moving back into the purview of the foundation labs?
Or is there still work that you want to do on top of the models or on top of the APIs?
Yeah, yeah, it's a great question.
I mean, I think the core of being, you know, an applied lab is really just focusing on a very particular use case,
on delivering real, just very direct results.
And I think, you know, like, I think the foundation labs are obviously, you know, incredible
at training base models and all this pre-training and all of the work that they do there.
I think from our perspective, we want to work on a lot of very particular capabilities
that apply to software engineering in particular, and then obviously run the whole stack from there
to building a product, figuring out the interface and the UX.
And then obviously bringing that to market and selling that.
On the capability side, there's a lot of particular stuff where, you know, one way to put it is,
I think the base IQ is very much already there in the models.
You can see the raw problem-solving ability.
And I mean, we've gotten some pretty insane results, you know,
getting a gold medal at the IMO or all of these other things, right?
You called that, by the way.
Yeah, you called that.
I think the first, I mean, I was, I mean, we were one point away to be fair a year ago, right?
So it was on the way, I'd say.
But, so, you know,
you can really see the general intelligence improving with every single model generation.
On the other hand, for Devin, obviously, you know, it's a very clear, like, step up in
the general intelligence, but also you want to be able to have, you know, if you ask Devin to
go debug your Kubernetes or to go and, you know, look into your error logs and figure out what
went wrong or things like that, there's often a lot of very specific capabilities. And that's
where we find that, you know, the post-training and the RL is most effective, and a lot of
the kind of various work around the models that turns out to be useful.
What about speed? A lot of people that have gotten access to GPT-5, or at least
in our chat, are reporting that it just feels really, really quick. How is that over time going to
impact the, I think a lot of people, you know, if they're using Devin today, task Devin with
something and then maybe they go work on something else for a little bit or they're running
multiple agents concurrently. But at some point, the agent could get so fast that you're just
sort of like watching it work in real time and you actually want to be engaged. But
are we there yet? Is it still a ways out? What do you think?
Yeah, it's a great question. I think in general, async will continue on as a paradigm, even as the models get faster and faster. One of the reasons that it should, by the way, is because there are a lot of real-world thresholds that start to matter. Like, at some point, you're actually spending less time on token generation in the Devin life cycle, and you're spending more time every time Devin runs the command to go install packages, or Devin running the unit tests, or Devin pulling up the front end by itself, or things like that that obviously take real-world time, right?
I think we are honestly getting closer and closer to that threshold.
But yeah, so long story short, I think like in the asynchronous mode, yeah, these things will get faster.
You know, we'll see those gains or we'll be able to spend a lot more time, for example,
thinking about a single problem relative to the amount of like real world clock time that gets spent.
I think for the synchronous use cases is where we'll see things really, really, you know, explode with speed,
which is, you know, Windsurf and Cascade, for example, where
we see the speed gains really, really matter.
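[Editorial illustration] The threshold described above is the familiar Amdahl's-law effect: once tool calls (installing packages, running tests) dominate, faster token generation stops shortening the task. A minimal sketch, with made-up timings and hypothetical function names:

```python
def task_wall_time(generation_s: float, tool_s: float, model_speedup: float) -> float:
    """Wall-clock seconds for a task when only token generation gets faster."""
    return generation_s / model_speedup + tool_s


def overall_speedup(generation_s: float, tool_s: float, model_speedup: float) -> float:
    """End-to-end speedup relative to the baseline (model_speedup = 1)."""
    baseline = generation_s + tool_s
    return baseline / task_wall_time(generation_s, tool_s, model_speedup)


if __name__ == "__main__":
    # Assumed split: 60 s of token generation, 60 s of real-world tool time.
    for s in (2, 10, 100):
        print(f"model {s}x faster -> task {overall_speedup(60, 60, s):.2f}x faster")
```

With an even split between generation and tool time, the end-to-end gain can never reach 2x, however fast the model becomes.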
Speaking of Windsurf, give us the update. The chat wants to know about the Windsurf
team and the 80-hour demand.
How have the buyout offers gone?
What's the internal response been?
Where'd that idea even come from?
Yeah, yeah.
Look, people are stoked, honestly.
And I think from our perspective, it's obviously really important to kind of just like
unite and get to the point where we can just be one culture and
one kind of shared set of values.
And this is how things are at Cognition.
You know, it's a pretty busy time.
Like we are at the inflection point of code and we work like that too.
And so I think a lot of it for folks is just kind of like, you know, we want to make sure folks
who really want to do this with us, you know, make that conscious decision to opt in.
And for anyone who doesn't, obviously we totally understand that there are a lot of talented
folks that maybe that's just not the right thing for them right now.
or not at this time.
And so we wanted to make sure that they were well taken care of too.
And to be clear with the buyout offer, that's on top of the actual acquisition deal that
already went through.
They already got their vesting.
So, yeah, I was thinking of the roller coaster.
It's like, you have the OpenAI deal, then the Google deal, then the Cognition deal.
And then they're like, wait, these guys work really, really hard.
I don't know if I'm cut out for this.
And they come back up again where they're like, wait, I can just go, you know, take a sabbatical
and figure out my next thing.
It's a great outcome.
Yeah.
Yeah.
No, it's obviously, you know, overall a killer team that's been through a lot.
And so I wanted to make sure that they're well taken care of.
That's fantastic.
Anything else you can tell us about the integration of Devin and Windsurf?
How are the teams getting along?
How do you see the products playing together in the long term?
Obviously, cross-sell seems really obvious.
They had the go-to-market team as well.
But how else are you thinking about the interaction maybe over the longer term there?
Yeah.
Yeah, yeah, for sure.
Yeah, a lot of obvious integration on the team, as you mentioned, with cross-sell and so on.
I think the thing that's really exciting on products, which I think actually comes along with these capabilities increases, is, you know, as the capabilities keep getting better, you start to take on harder and harder tasks with AI and with full agentic workflows, right?
And I think there's an interesting thing that happens where for a lot of the harder tasks, you really actually do want to go back and forth between a synchronous and an asynchronous mode, you know?
And that's for a few reasons.
You know, one of the reasons, obviously, is because there's a lot of review and a lot of, like, looking at the pieces and thinking about the, you know, all the minutia and the details of what you're implementing.
I think another big reason for it is, you know, when you get started on a larger project, you know, let's say you're sitting down as an engineer and you're saying, all right, I'm going to go build this whole project today.
You yourself don't actually know all the tradeoffs that you want to make, all the decisions that you want to make and so on, right?
And so having a format where, you know, for the decisions that need you to be there and involved, setting the strategy or figuring out at a high level what should happen, you're able to do that in a nice synchronous environment, which is naturally the Windsurf IDE, right?
And then for the parts of the task that you can actually hand off and have an agent work on, you're giving that to Devin.
And figuring out how you go back and forth between those is super interesting.
So Wave 12 on the way soon.
We'll have a lot more to share.
Last question.
Yeah.
Hit the soundboard, Jordi, for that.
For wave 12.
Wave 12.
Fantastic.
Last question, we'll let you go.
What is your probability that AI will get a perfect score on the IMO next year?
Oh, interesting.
So, by the way, we just had the IOI, which is the programming version, like the programming
Olympiad, and I think there's a good chance that we'll have a gold medal at the IOI for this year announced as well.
I think perfect score for next year.
We as in humanity, or we?
We as in cognition.
As in humanity, yes, yes, yes.
An AI perfect score.
Yeah.
Sorry, an AI gold medal.
Right.
Perfect score in the IMO next year.
I think it's got to be north of 50.
Honestly, I would put it around like 75% or so.
Okay.
Well, thank you so much.
We'll be following it closely.
And good luck to you and congrats on all the progress.
Very fantastic.
We'll talk to you soon.
Awesome, guys.
Thanks for having me.
Bye.
Let me tell you about Bezel.
getbezel.com.
Your Bezel concierge is available
now to source you any watch on the planet, seriously any watch. And we are joined by our next
guest, Claire Vo, from ChatPRD. Welcome to the stream. Claire, how are you doing? What's going on?
It's a fun day today, isn't it? What was your reaction to the stream? What was your reaction to
GPT-5? You know, GPT-5, the first thing I said, and I got a little early access, is I said, it's a
developer, for developers, by developers. This thing is built to be a software engineer. You've seen
a long string of your guests come on and really speak about the coding abilities of it.
And what I think is interesting about this particular model, especially because we're seeing
them deprecate the old models in the ChatGPT experience.
And we're seeing a lot of positive feedback.
But I do think there are drawbacks to a model that's so clearly tuned to a developer use case.
And as somebody who's building an application that isn't focused on agentic coding, I have
noticed some personality quirks that are going to be really interesting to see how they shake out
as we roll out this model to our users.
Walk me through those.
What are the...
Yeah.
What's the timeline?
How much, like, how much time do you have to kind of move users over to five before?
Yeah.
Yeah.
So, I mean, I think we have tons of time from the API side to move, move users.
And in fact, you know, our strategy at ChatPRD is not to just upgrade to the latest
model. I know Zach at Warp said, like, why wouldn't you want the latest intelligence? And the reality is
because we're doing a lot of business strategy and business writing, I actually want to validate
with our users that they're getting the quality of strategic thinking, output, writing that they
really want. So we actually A/B test every single model rollout and really evaluate for user quality,
token generation, all those things. And, you know, looking early on, it yaps. Man, this thing just wants to
go through tokens. Right now I'm seeing four to 10x the number of tokens generated between the
you know, 4-generation models and 5. And when you're in a business context, you do not always
want longer words, you know? And so it'll be really interesting there. It is certainly focused on
execution. So I, you know, I've heard a lot from the OpenAI team that it's steerable. Yes. And its
natural inclination is to drive you towards, like, the how and the what: very tactical,
very specific. And so if you're trying to zoom back out at a strategic level or focus on a
business initiative, it's actually a little harder to tune in that direction. So, you know, I think
there's a lot of positive things for me as somebody who uses agentic coding platforms, who writes
a lot of code. It's my daily driver now. I love it. But for other use cases, I think it's going to
take some time to figure out if it really is optimal in use cases where intelligence actually
isn't the differentiating capability.
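[Editorial illustration] The rollout practice described above (A/B testing each new model and comparing token generation) can be sketched roughly as follows. This is not ChatPRD's implementation; the arm names, the hash-based bucketing, and the sample token counts are assumptions made for the example.

```python
import hashlib
from statistics import mean
from typing import Sequence


def assign_arm(user_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so they always see the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # First 4 bytes mapped to [0, 1); exact in floating point.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "candidate" if bucket < treatment_share else "control"


def token_ratio(control_tokens: Sequence[int], candidate_tokens: Sequence[int]) -> float:
    """How many times more tokens the candidate model emits per response, on average."""
    return mean(candidate_tokens) / mean(control_tokens)


if __name__ == "__main__":
    print(assign_arm("user-123"))
    # Hypothetical per-response output token counts for each arm.
    print(token_ratio([100, 200], [600, 900]))
```

Token counts are only one of the signals mentioned; output quality for business writing would still need human or rubric-based evaluation alongside it.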
Yeah, it's very interesting to think the best product manager is not the one that writes the most,
the longest doc. No, and you don't send your engineer into your executive meeting.
Like, I really am looking forward to the time where we're not getting these number-based
models, where actually I can get like GPT developer or GPT strategist, where they're pre-tuned and
trained for the role they're going to play as opposed to general purpose, but clearly oriented
towards a set of tasks. And I just think if you look at this model, it was oriented towards
an engineer, software engineering, at least in my experience. So have you been tempted to
launch any type of agent, like agentic coding products? You guys are obviously
at ChatPRD responsible for creating documentation. And if you look at the other guests that
have joined today, many of them are competing with each other in different ways and trying to own
different parts of the stack. You guys have seemingly stayed really, really laser focused and no one
else is doing anything like you're doing, at least on the show today. But talk about like picking
your lane and kind of like optimizing. Yeah, we're integrated with
a lot of those platforms. So a lot of the kind of prototyping platforms, v0.dev, Lovable,
all those, we integrate. We just released our MCP. So I use ChatPRD pretty consistently inside
Cursor through our MCP. So we think of ourselves as the product pair to the AI
engineer. Now, what's really interesting about my experience with GPT-5 is the one place it actually
does really well is technical specs. And that's a place where ChatPRD has sort of bridged into engineering
execution, often our product managers are generating a PRD or some sort of business document.
They're actually going the next layer and developing a technical spec. The GPT-5 technical specs
fed into these agentic coding frameworks or prototyping frameworks output much higher quality assets
on that end. So I do almost think there's going to be this kind of like right model for right
use case, especially in our kind of business. And so we think of ourselves as integrating. The one thing I have thought about
with GPT-5: it's the first one where it feels really simple to just go ahead and roll your own
agentic coding framework or prototyping framework inside of our application. So never say never.
It's something that we get asked for a lot. But we're good friends with almost all your guests
on your show today. And so we like the role we play in terms of being the product manager
pair to all these AI engineers. Yeah, that makes sense. What are you looking for next?
What am I looking for next? I mean, in terms of
model capabilities, what I think is really interesting about OpenAI and why I'm really committed to the OpenAI
ecosystem, even though I test and use a variety of models, is I think developer support is a real
differentiator. So we spend a lot of time talking about model capabilities. And for application developers,
certainly ones that are doing more complex applications like agentic coding, model capabilities really matter.
Like core IQ of the model matters. But the other thing that matters, you know, as somebody who has built
developer tooling products, is developer experience. The primitives in these APIs matter.
And so what I'm really pushing the OpenAI team to think about is, in addition to the core
intelligence of the model, what are the developer tools you need around these models to really
make them a platform on which a variety of applications can build. And I do think that OpenAI has
disproportionately invested in developer experience, but I'm always looking for like give me better
out-of-the-box tooling, give me more control over these models, give me more hosted services,
all those things that as an application developer are just going to make it easier to deploy
these models to production beyond the core kind of intelligence of the models themselves.
What was your read on 4.5? Is there a world where, you know, I'm thinking about the product
manager versus the engineer. You have your o3 go crunch some really hard reasoning,
and then you have 4.5 turn it into, you know, stronger,
or like more, you know, human language.
Yeah.
So I did a lot of experimentation around 4o, 4.5, and 4.1.
4.5 was my favorite prose writer by far.
It was loved from a business writing perspective.
I thought the prose was the most natural.
It was really slow, like untenably slow.
And so the compromise we made in our testing
is we ultimately ended up with 4.1 as the
fan favorite for business writing when we were balancing off both quality of prose and intelligence
as well as performance, which for application developers is a real consideration. So I landed on
4.1. 4.1 is the model that's being tested right now against GPT-5 in ChatPRD. And one of the things
that I have to go do now is figure out how to get ChatGPT, or GPT-5, to stop writing. It writes a lot and it only
wants to write in bullet points. So I've got to go back
to our prompts and figure out how to direct it to be a little bit more business oriented.
Bullet point maximalist.
It's the new em dash.
I'm telling you, you will not be able to stop seeing it.
It just, all it wants to do is write a bullet point.
And call a tool.
Like, I was using it in Cursor and it just kept maxing out my tool calls.
I'm like, you do not need to read 50 files to do this.
So I do think, you know, application developers are really going to have to think about how they
slot this into their current workflows. There's definitely tuning that needs to happen. But I'm telling
you, you're going to see a lot of bullet points when this thing rolls out. Yeah. In 60 seconds,
where is product management going? A lot of people talk about the, you know, examples of product
managers that are starting to ship code themselves, ship whole features, products. But I'm sure
those are edge cases to date. But where do you feel like it's going based on your user base?
Yeah, I mean, it's going to go one direction or the other.
Product managers are either going to develop the hard skills to do the design, the go-to-market, and the engineering job to some extent.
Because some of these other jobs are definitely going away for product managers. Or, my favorite use case, engineers and designers are going to get tools like ChatPRD or these prototyping tools or Cursor.
And they're going to be able to actually do the product management job.
And so what I think is we're going to see a new type of role emerge, which is a much more generalist
role where people maybe have a specialist capability and they're augmenting that product thinking
or they're augmenting that technical thinking with AI. But I don't think there's going to be
product managers as they were, you know, five or ten years ago for much longer.
Makes sense. Well, thank you so much for stopping by. Yeah, great time.
You're telling you. Thanks for having me. We'll talk soon. Bye. Cheers.
Up next, we have Brad Lightcap, the chief operating officer of OpenAI. Welcome to the stream,
Brad. Also, Jordi, your post saying, "I'm updating my timelines. You now have
four years to escape the permanent underclass."
It's over 4,000 likes.
There we go.
A thousand likes for every year.
Love it.
Anyway, Brad, how you doing?
Brad, what's going on?
Guys, how are you? Good.
Congratulations on the launch.
What are the biggest takeaways for today?
From your side, I'd love to know about what it actually means to be the COO
of OpenAI. OpenAI does so many different things.
Consumer internet company, API business, enterprise, there's all sorts of stuff, building data centers.
What is your actual role?
My role is kind of whatever the company needs me to do.
I play everything from like, you know, PM when I need to to like, you know, salesperson
when I need to.
That's kind of the fun part of the job for me.
On this launch in particular, it was really fun.
I spent a lot of time the last few weeks with customers and partners getting a feel for GPT-5
relative to what they were previously using.
In some cases, those were OpenAI models.
In some cases, they were other models.
But I've been at OpenAI a long time, about seven years.
So I've seen GPT-3, I've seen GPT-4, and then to be able to see GPT-5
and just I think the joy of people being able to use it in production and seeing how much
better it is, that's the best part.
Greg told us earlier about the era of having to pay people to use the early versions of the
product.
You guys have come a long way since then.
Yeah, we had like three customers with GPT-3 or something like that.
And so it was easy to manage, easy to talk to all of
them. They actually were tired of us calling them being like, is it good? Is it getting better?
And so now it's, you know, we're fortunate that we've got more than that. But it's cool. I mean,
the diversity of use cases, I think the number of things that people are able to use it for,
we've got everything from the team at Amgen, you know, big pharma, life sciences, using it for
clinical workflows there. We've got teams at Uber, you know, building it for customer
support, teams at Notion and Cursor building it into products that people use every day. So
I think that's the power of it:
it just covers more and more of the surface area of things people do with these tools.
I'm not sure how much you touch organizational design at OpenAI,
but I'd be interested to hear your thoughts on how those companies that you mentioned
should be thinking about AI changing their org structure.
Is it sort of like a horizontal, cross-functional service layer like a finance team
that touches a lot of different elements of the business?
or should most companies be thinking about standing up a dedicated like AI implementation team?
How do we get a chat box on every product that we already shipped?
How do you think about those tradeoffs if you were talking to a friend at a Fortune 500 company
that was thinking about their AI strategy?
Yeah, you know, it's an interesting question.
I think it was maybe said earlier on the show.
The thing we see is just people can do more.
And so there's like this much wider latitude that you get if you're an individual person
at an individual company where, especially as you get bigger, you know, maybe more bureaucratic
organizations that have a lot of different functions, a lot of different levels, you have to rely
on a lot of other people in the org to get stuff done. You've got to rely on your data science
team to do data analysis. You've got to rely on your design team to do mockups. You've got to rely
on your marketing team to do copy. And I think what we see with AI is it just accelerates
people to get to a great V1 of everything. So if you're a high agency individual and you
want to get stuff done, you're no longer gated on people that, you know, you otherwise would be.
And I think that should enable organizations to move a lot faster. And I think it should
enable the people at organizations that really drive them to do a lot more. And we see that
consistently. With ChatGPT Enterprise, I think that is consistently what we hear. And we
seek those people out when we deploy ChatGPT Enterprise. We find those, you know,
two or three people at the organizations who are just the AI superstars and champions
and then try and actually use them
as these kind of touch points
for the rest of the org to learn from.
How are you personally using AI these days?
You know, my biggest challenge, I think, day to day is context switching.
If you look at my calendar from, like, top to bottom,
it's like, I joke like, like, with my wife,
I, like, have to, like, show up to work,
like, wearing, like, a lab coat,
and then I, like, take the lab coat off
and, like, put some, like, sunglasses on
and a film school jacket,
and, you know, then I'm talking to, like, a media company,
and then I, like, take that off.
So I go through the costume changes.
And I think what I actually mostly
use it for is just to help with bridging me from kind of thing to thing, to kind of put me in
the mindset of being able to work with customers, help customers. GPT-5 is incredibly good at this
kind of structured reasoning of how do we actually take what is this very diverse set of things
that models like GPT-5 can do and then apply them in domains that I don't think about every day.
And so it gives me this launching off point to be able to talk with leaders and with customers
much more fluently about how we can help their organizations.
Within, let's say, a set of companies like the Fortune 500, what does AI adoption look like across the
spectrum? Because I'm sure that there's companies that you talk to that are truly, you know,
adopting AI in the way that John was mentioning, like trying to become AI native, changing their
entire organizational approach, and then there's companies that just want to buy software to say
that they're becoming AI native. So what does that
spectrum look like in practice? Yeah, it is a wide spectrum. So at the top level,
we're seeing just like amazing appetite for wanting to adopt tools for people. And I
think that's like the easiest place to start. Typically that's where we steer
organizations if they're starting at zero is just give your people the best tools.
You may have seen we've, you know, we've grown ChatGPT work, which is our
enterprise and team product, from three million seats to five million seats now,
from June till now. So torrid growth there, and we don't see any abatement in demand there.
If anything, it's accelerated from last year. And so I think people and organizations are starting
to realize that, like, at a minimum, you need to make sure people have the best tools.
What's cool about GPT-5 now is it also enables people to use the best tools at every point.
And so if you're in an organization, you're not fumbling with the model picker. You're not
trying to figure out when to use a reasoning model. You're not trying to figure out kind of the art of
prompting to get the perfect thing. All of that stuff is abstracted and it's kind of taken care of
for you and you can have confidence that your people are actually using the best models at any given
point. Beneath that, it gets a little more complicated. So more and more organizations, I think,
are starting to grasp how the tools can actually help in the business process. So whether that's
in customer support, whether it's in research, whether it's in software engineering and data science,
you're seeing these tools more and more adopted in the enterprise. I think there's still a
quality gap though. I think we now are just breaking into what I would call the kind of
era of models that have capabilities that are good enough to make a dent in the types of
problems businesses care about. Businesses care a lot about things like reliability, right? They
care about accuracy. They care about the resilience of the model to recover from
tool use errors and to be able to string together these very long kind of multi-tool, multi-step
workflows. So GPT-5 is a step up on all those things. And I expect that that will enable us to be
able to do more and more things in the business process.
Do you think those customers that you just mentioned will stick with this idea of, like,
GPT-4-level workloads will stay on GPT-4 and maybe there'll be cost savings, but those workloads will
stick around for a very long time, and then you'll develop almost new capabilities, new workflows,
new workloads that will be additive? Or is everything
so fresh that they'll want to just, like, rewrite everything with the latest and greatest?
More often than not, I think it's the latter. I think you want to rewrite everything.
One of the cool things we did here was we were able to keep the pricing on GPT-5 at the level of
o3 pricing. So, you know, if you're cost sensitive, you don't really have an excuse to
not upgrade. GPT-5 is faster than o3 and 4.1. So we've improved on latency for use
cases that are speed sensitive, latency sensitive, and obviously the intelligence bar has gone up.
And so, you know, unless you've got really a very kind of narrow and specific workflow where
you've got a model like 4.1 that kind of is okay, there's really not a reason I think that people
wouldn't upgrade. Yeah, do we need like a three-dimensional Pareto frontier right now that maps not just
cost and capability, but cost, capability, and latency or something? Is that something that you're
seeing a lot of demand for in the enterprise? Yeah, 100%. We actually measure it that way. So we look
at those three vectors and it's always kind of an optimization function along those three,
those three axes.
We think we found that here.
It was actually in terms of where my work was over the last few weeks, it was a lot of,
I mean, this is a qualitative, you know, kind of, you know, really like manual process
of collecting feedback because everyone's got a little bit of a different preference and we can
only pick kind of one or two points on that curve.
And so just trying to kind of dial customer feedback, namely developer feedback, in for us on
where that balance of things is, that's a big part of
our process for picking all those points.
And so we hope that, we hope that people like it and it unlocks, you know, the kind of
maximal use.
That's great.
How are you thinking about open source?
Who, you know, who's been most excited to get access to it?
And, yeah, where do you see it going?
Yeah.
I mean, it's important to us.
You know, I'm glad we've, we've gotten this out.
It's been a huge team effort.
I think there was kind of a thing that, like, you know, OpenAI doesn't like open source
anymore. It's like, no, we're just like really busy with a gazillion other things. So I think
hopefully going forward, we've got more of a leaned-in vantage point on open source. But it unlocks a
huge number of use cases. I mean, if you think about kind of like, you know, government use cases,
you think about on-prem, you know, use cases where you're handling sensitive data in very
sensitive environments. You think about where you want to run models on the edge. All these things
right now are kind of inaccessible to us as a service provider to customers because we
just don't quite have models that kind of fit at those points.
So this for us, we think, is huge TAM expansion.
And we're excited to be able to work with enterprises
on implementing that model, which is, I think,
competitive, hopefully, with our o3 class of models.
What is the landscape like for companies that are
helping to implement OpenAI products at various enterprises?
You have the big consulting groups that will give you an AI strategy.
Maybe they'll try to take it a step further.
But I imagine there's a cottage industry
of firms that have sprung up to try to help organizations unlock the value beyond,
hey, let's just get everybody a seat with ChatGPT work.
Yeah, I think there will be this new industry that emerges that is kind of separate
and apart from kind of the legacy set of SIs and consultants, that is really AI fluent.
They're very AI-native.
I think it's very hard to borrow, I think, paradigms from the last 20
years of software building and, you know, implementation that are going to kind of map to what we're
dealing with here. You're dealing with fundamentally probabilistic systems that are moving and
increasing and improving at a rate, you know, of now kind of collapsing to every few months.
And I think the nature of use cases changes quickly, where enterprises are focused on kind
of deploying them changes quickly. And so I think it's just hard for kind of the legacy industries
to keep up, frankly.
We've had a lot of success working with some of this kind of new breed of SIs, so the Distyls
of the world and others that really have been born and forged in the fire, so to speak,
of this new platform.
And so we hope there's more of them.
We'd be excited to work with anyone that wants to work with us on it.
There's more business than we can handle, and so we're always happy to spread the love.
Talk about the $1 ChatGPT product
for the government.
Were you involved in that at all?
I was involved in that.
We wanted to do something that was meaningful for US government.
It's been a real big focus of ours lately.
I think our view is the government has got to start to modernize.
We've got to make sure that the tools that we use in the private sector are also in the hands
of folks serving us in the public sector.
And we wanted to make that really simple.
So we made ChatGPT, you know, basically equivalent to ChatGPT Enterprise, free.
It's a dollar per year per agency.
Hopefully we can afford that.
And we wanted to make that available to anyone
that wanted to use it and standardized that through the GSA.
So we're super appreciative of the partnership with them
and more I think that we can do on that front.
How is that different than just like if I'm a government employee,
I can just go to Google.com and I have access to that
and Google provides benefits.
Scott Kupor was saying that he can't use it.
He can't use it.
Yeah, so why?
Yeah, just talk to me about
how it's different to offer ChatGPT as an actual service with a contract that
you're, you know, vending in. They are actually a client, versus just if you put up a
website, every government employee can access the web to some degree, or would it be blocked? Like,
why does it need to be a deal at all, as opposed to everyone just uses it?
Yeah. So part of it is just making sure that government employees can access it.
So in some places, obviously, you know, you can put blockers in place that would prevent access.
We hear a lot of stories, by the way, of people going out on their lunch break to their car in the parking lot and, you know, pulling up ChatGPT on their phone and throwing a bunch of stuff in there, just because they know it'll get them through the day faster.
And we've done work, by the way, with governments, with the state of Pennsylvania, other places where we've seen dramatic increases, you know, things like two to three hours a day saved per employee, given the nature of the work that they do and how helpful ChatGPT can be.
And so this lets us have an interface into them as a customer.
It lets our team engage with them in a direct way.
We can see how they're using the product
and can help them use it better.
And so that's important for us,
is like we got to build on that foundation with them.
And then presumably it also allows the government
to define like security and privacy
in their world as opposed to if you're just like
some website out there,
their choice is only block or don't block
as opposed to actually, you know, communicate with you.
This is okay to train on.
This is not, et cetera, et cetera,
like keep everything private, et cetera, et cetera.
Yeah, I mean, we don't, we don't train on enterprise data
at all.
Yeah.
You're safe there.
But the, yeah, I mean, for us, like just being able to treat them as a customer, right,
to treat them as a user.
And you go, you know, you mentioned earlier, like, we were talking about kind of like there
being these points of success at every organization that, you know, you've got people who are
like way more sophisticated in using these tools than others.
We want to be able to see those people and amplify them.
And the government's no different.
There are people that we've worked with in government who are incredibly sophisticated in
how they use AI tools and our goal is to get everyone there.
How do you think about the group of users that are active students?
They've been on summer break.
You guys have been busy over summer.
Are you thinking about, and you recently launched, I forget the exact name for the
product.
I think it was like ChatGPT Learning.
How are you thinking about that cohort and unlocking new capabilities for them this coming
year?
Yeah.
So we launched something called Study Mode, which was in our core ChatGPT product.
And it was a little bit of an experiment.
We wanted to see, if you change the way the model behaves
when it knows you want to be in a learning mode,
if that can actually enhance outcomes for students.
We have all these kind of studies that have been done,
very anecdotally, about ChatGPT's ability to drive student outcomes
and learning outcomes.
So here we kind of took a little bit more of an intentional approach of,
if you actually take the model and actually use it
in a more Socratic style,
where it can actually kind of quiz you,
it can withhold certain information
that it wants you to be able to empirically deduce.
It wants you to reason about problems,
and it kind of reasons with you as a partner.
So far, so good.
It's really cool.
Learning is kind of the killer use case of ChatGPT.
And so I think to be able to actually launch something
that, in some sense, extends that kind of killer use case
has been really cool.
And the student feedback so far, even on summer break,
has been positive.
Well, we'll let you get back to your day.
What's next on your agenda?
Are you putting on the lab coat or the suit and tie
and going to Washington?
Good question.
You know, today I'm mostly with the team and talking to customers and maybe tomorrow
I'll get back to the lab coat.
But we appreciate you taking the time to talk to us.
Yeah.
Well, thank you so much for taking the time to talk to us.
We will talk to you soon. See you guys.
Have a great rest of your day.
And the timeline has been in turmoil because President Trump says he will be imposing a 100%
tariff on all semiconductors coming into the United
States. It started with widespread tariffs on chips and then turned into export controls.
This is from The Kobeissi Letter. Is this a red flag moment?
I don't know why you have the red flag.
It felt like it. Ben was getting the flag.
Nvidia potentially affected. But Taiwan says TSMC is exempt from Trump's 100% chip tariff.
Very unclear. The story is obviously still developing.
And Dylan Middick says, "You're telling me
that this level of monitoring the situation is free?" And it's a picture of you in front of the whiteboard
monitoring the ChatGPT versus the timeline. Today we're monitoring: Illinois has banned
AI therapy, making it the first state to regulate the use of AI in mental health services.
Interesting headlines coming out. Just interesting because the product can just be used
for therapy. Like, the user can choose to do that. It's not necessarily, it's kind of hard to
ban outright.
Like maybe you can ban it in a clinical setting.
Yep.
I wonder how they define this.
There's probably a loophole if I know anything about how these bans are
implemented.
But yeah, maybe it's like if you're in the clinical setting, you can't be,
you can't use it, but then people will just use it independently.
Like, yeah, the therapist is just on their phone.
They're going to be, you know, they're just going
to have it listening to the conversation.
Yeah.
They're going to be like, no, what should I do right now?
What should I say?
What should I say?
How does that make you feel?
That's what it's going to tell
you. Celsius nearly doubles revenue year over year. This is the energy drink. Revenue of
$739 million versus $632 million. North America grew 87%. International grew 27%. But here's the real
kicker. Alani Nu acquisition is the primary driver of growth. Alani Nu added $300 million in
revenue and retail sales are up. So wow, what performance. But yeah, I mean, that was the expectation
when they bought Alani Nu, that they would,
I guess it's like the first moment they rolled them in, probably.
But huge growth for Celsius as they become a multi-product,
multi-consumer company.
What else is going on in the timeline?
We have one last guest.
I think you might have to hop on with Taipei.
So feel free to jump when you need to.
Tyler, anything going on on the timeline we should be monitoring?
We are, of course, monitoring the
situation.
So when Max was on, he was talking about how you can, like, make a little game, right? So I've been working on a Bloons Tower Defense game.
Okay, how's it going?
So it's going pretty well. I'm making another change, but then maybe I can screen record and share.
Yeah, yeah, that'd be great. You could share with the folks too.
Yeah, I like this post from Ray Sullivan. "These GPT-5 numbers are insane," and it's a chart of GPT version versus number, and then once it gets to four,
it goes 4.1, 4.2, 4.3, 4.5. So the fifth one is a massive, massive bar.
We need an analysis of the charts from today. It seems like there were multiple that were
kind of odd or hallucinated or off.
It's interesting that multiple of them snuck up.
Just in: Sheetz, the popular convenience store chain with 750 locations, is now offering
50% off purchases paid with Bitcoin and crypto daily from 3 to 7
p.m. What a wild move by Sheetz. Well, Ben Hylak is in the waiting, the Restream waiting room.
Let's bring him in. Bring in Ben Hylak. How you doing, guys? Good to see you.
Good to see you. We're doing well. I'm just going to say hello. I got to take off and talk with Taipei.
I'm going to let John take it from here. Absolutely. I'll close up the show.
You guys have a fantastic conversation. Give me the update. How's the day been for you? What were your
expectations? Did this meet or exceed? Did it underwhelm you? How are you doing?
Well, so I've actually had access for a couple weeks.
So we actually did a video.
I'm not sure if you've seen it, but OpenAI
brought a couple of folks from the Twitter sphere to their office a couple weeks ago to try it.
Yeah, yeah.
Yeah, yep, yep, yep.
I think that it pretty much exactly meets my expectation as far as like how it's been received.
And I've tweeted about this as well, but I think that it's really, really good
at, like, one-shotting things.
You know, I think it's better than I think other models we've seen.
But I think it's actually sort of a distraction in a lot of ways.
I think that the things it's a lot better at are, A, a lot harder to describe,
and B, I don't think the harnesses for it really exist yet.
I think a lot of harnesses.
So the way I've been describing it is that, you know,
web search existed in ChatGPT for a really long time, right?
Like, it was able to, like, call a tool, search the web.
Yeah.
Obviously, like, deep research was very different than that, right? Like, what we saw was it
was actually, like, calling, you know, searching the web. It was reasoning about
those results, changing its course, like course correcting in the middle. So intermediate
reasoning is the term for it. And they really trained it how to search
the web well. I think GPT-5 does that for, like, a whole plethora of tools. The interesting
thing is that a lot of products, like I think a lot of the agentic products that exist
today, were kind of built wrong. Like, they didn't build the
tools the right way. And we've seen this before. Like, if you look at, you know, the
first, you know, kind of infrastructure for agents was LangChain, like way back
when. Yeah, I remember, two or three, yeah. It was, you know, it was early, but
it was wrong, right? And so, you know, they've iterated since, right?
They have, like, LangGraph. It was a better implementation. But the first
implementation of LangChain was, again, early but wrong. And so if you built your product
on LangChain, you had to, you know, significantly change it. I think we'll see a similar thing
happen for GPT-5. You know, it's not just, like, you know, change the string and go, you know,
from, you know, 4o to 5 or something and push. And now you, you know, yeah. Yeah, you know that
meme about like, oh, Sam Altman's gone on stage and just, like, you know, killed 75 startups.
Google just killed 100 startups. Apple just killed Partiful with their new thing or whatever.
Did any of that happen today?
It feels like,
it feels like this is like LangChain
needing to change their strategy.
That happened a while ago.
I haven't identified anything.
It feels like, you know,
Scott Wu hopped on and said like,
you know, great day to be an application layer company.
The foundation models got better.
It's more tools in my tool chest.
I'm extremely happy and,
and I'm more confident than ever.
And I believe him.
I believe that he doesn't see
today as fundamentally needing to change his business model.
I think that's true, actually.
I think that people have been, you know, there's a lot of people building agents right now.
I think a lot of them have not been feasible for some of the reasons that GPT-5 starts to address.
So I think it is, I think that what it means is that the entire architecture behind agents will get a lot simpler.
Like, it feels like a good day for people building applications.
Yeah, it's not immediately clear
that there's, like, some, you know, company or something that got killed today.
Yeah, yeah, yeah.
I mean, in general, it feels like, you know, Dwarkesh updated his timelines.
There's just been a general idea that, like, we've maxed out pre-training.
We've kind of maxed out post-training.
We're now in the let's reap the reward of this.
And we've seen it in like the incredible financial performance, the incredible usage numbers.
You know, millions and millions, hundreds of millions of people are using ChatGPT
30 minutes a day.
I love the product.
And yet it feels like the "what have you done for me lately"
meme.
It's totally like, okay, yeah,
we went from the iPhone 4 to the iPhone 5 today.
Yes.
Still really an important technology, great company,
but like I want another iPhone 1.
Yes, yeah, yeah, yeah.
I totally get what you're saying.
I think that, like, I wrote a piece about this with swyx,
but it really actually changed the way
I see the path to AGI. Like, I think before using it a lot, I kind of was like, okay, we need bigger, bigger models. They're going to get smarter or something. I think I had this realization. So I was watching it solve, I had this really weird dependency conflict with Yarn. Like, we have a monorepo. The problem also with this discourse is, um, the sort of problems that it gets good at solving are just not sexy things to talk about. They're not things that you'll understand. I'm like,
We have this issue with the way we structured things.
But like a couple weeks ago, I was watching it like, I had this problem.
No other model would solve it.
And I watched it sort of like poke around.
Like, it started running this yarn why command in a bunch of different directories.
In between, it's reasoning, and correctly reasoning, about what and why and what it was learning.
And, you know, taking little actions in between, seeing what happened.
I think what I realized is that, you know, if you imagine
humans without tools, like if we never had any tools, we're never even able to write things down.
Like, would you be able to tell that we're intelligent? Would we have, you know, learned to speak,
et cetera? Like, I just like don't, you know, even if we could not have ever invented fire, right?
It's like, it's like, where would we be right now? There's, that feels like there's a similar,
like, I actually think a lot of the next year is just going to be, how do you get these models to do
things better? It's like, you know, I think it's next year. In your Yarn example, um, you said, like,
you were having it, I assume GPT-5, work on the problem. Was that wrapped in a
coding tool? Did you just go to chat.com and give it your GitHub repo? Like, talk to me, like, what was
the actual user experience from your side? Yeah, so this was in Cursor. Okay. Um, I think the Codex CLI, the
new version of the Codex CLI, which they just released today, is also really, really, really good. Okay. Um,
I think that you will really only see a significant difference in places where it can sort of explore its environment is the way I would put it.
Like when I was watching it go bounce around my repo, I felt almost like I was watching something navigate a little video game, like Pokemon or something.
Like that's kind of what it felt like.
It's kind of like I'm going to go over here.
I'm going to see this.
Okay, wait a minute.
That conflicts with what I just saw over here.
Like where should I go next?
You know what I mean?
Like it felt very novel is like what I would say.
Yeah.
Yeah.
So yeah, I mean, how are you using it?
Where do you see it going?
Do you see it like, just like a little bump of a tailwind today?
Or what's your read on, like, how you'll be using GPT-5 going forward?
I mean, yeah, there's two huge things.
So like one thing that really got missed today is that they also released GPT-5 nano,
which is like an incredibly good model actually.
So like we're not talking about it,
but it's half the cost for input tokens of Flash-Lite,
or sorry, yeah, I think it's actually half the cost for input tokens of Flash-Lite.
And it's a really good model.
Like it's like 4o level for a lot of like writing and stuff like that.
And so yeah, we'll be using that probably in the short term.
I think it'll be interesting to see how other providers react.
Like I'm sure Google will cut their prices as a result.
But it is the cheapest, like, hosted model, I think.
I don't think anyone's serving any other model for those prices, for that matter.
Yeah, that makes sense.
What else are you looking for for the rest of the year?
Probably no GPT-6 on the horizon, but what are you looking out for?
I mean, it seems like Google is expected to respond with Gemini 3 soon.
But what else are you tracking in the world of AI these days?
It's a great question.
I think that, yeah, that's going to be wildly interesting.
I think what Google does will tell us a lot.
I think that they, you've probably seen it,
but they released this like world model yesterday.
We're not talking about it anymore.
I mean, like, if those videos, I haven't tried it myself,
if those videos are real, like, that's one of the most mind-blowing things
I've seen in the last, like, you know, decade or something.
So, like, if that's real, like, that's extremely interesting.
And I think all the stuff that's going on with world models right now
has, like, huge implications for, like, everything, like, from robotics
to just like so many different fields.
So super, super interested in that.
And the other thing is that I actually just think that, like, again,
I'm actually really bullish on GPT5.
I think that the way it was received today is like just about how I expected it.
Like, and the reason is like, when I say harness again,
I'm like, I think that like Canvas in ChatGPT is pretty bad is like my,
would be my take. Like, you know, it's a tough product to make, but like, yeah,
like it does really poorly with like long files, crashes sometimes, like that sort of thing.
Like, I think that we don't have,
the product layer around
GPT5 doesn't exist yet.
So I think we're going to see some really, really interesting
products that are built around it.
Yeah, it's always hard when you go from like a binary,
qualitative, in your face improvement,
GPT, like ChatGPT was like, we passed the Turing test.
And now the next test is like superintelligence
and self-replicates,
smarter than every single person, knows everything.
It's like the bar is like,
we really moved the goalposts, you know?
100%.
I think that there was like a lot of,
you know, discourse around the model as well, like leading up to it, which I think didn't
help, you know, but like the way that I would think about it is like, I think that,
you know, depending, there's some percentage of the way through automating software engineering
that we've made it. Like, let's say it's like 70% or something, 75%.
The tough part is like that last like 25% is, um, A, the hardest, it's like the least
sort of decipherable to like explain to people. It's the least like universal. Like, like,
if I'm just like, oh, make a, you know, one of the examples I did,
I made a personal website.
It's like all Mac OS 9 themed, in like 20 minutes with GPT5.
That's fun.
And so it's really fun, right?
You get it.
Like my mom gets it.
Like I can show it.
I can share it.
You know, my mom, I can't explain any of the like the very specific ways that
GPT5 like helps in our specific code base, our specific problem, whatever.
So I think that like it'll be less.
These launches will probably get less and less sort of,
interesting from a, so like, from a what it does for software engineering, as that gap gets closed. Like, I, you know,
what's the last five percent of software engineering? Like, I, you know, like, it's probably not going to be
that interesting to me.
Um, do you think they'll be on an annual release cadence now? Like Apple
updated all of their iOS, all their operating system nomenclature to be like, we are now on 26
because it's the year. It's like a car model, like, like Jaguar.
I don't think you can plan it. I don't think
you can plan ahead. That's the interesting thing. I think that, you know, there's people that
say that GPT4.5 was supposed to be GPT5.
Yep. Yep.
And like, I think that it sort of came out
and they're like, eh, like, you know, I actually love 4.5. I think it's a really fun model, but,
um,
Well, it's clear that like improvements come in many places. Just like with the, with the iPhone,
like the latest iPhone, you buy that because it doesn't, it's not just like the one with the new
screen. It has a slightly better camera, slightly lighter, longer battery. Like, it's like an
ensemble of improvements that then they add up.
And I think that that feels like what we're getting here today and what we will get in the future is like this little, like we did a little extra RL over here.
This tool is now sharper.
It has new capabilities.
We added multi-modal.
Like, you know, the video generation got better and this feature got better, et cetera, et cetera.
I think that like what a model is is still going to change a lot, and like how we, like...
So just to give an example, like 4o was sort of this big thing, you know, where they talked about it being like natively
multimodal, you know, taking in, even like video at some point, video in, video out,
like audio in, audio out.
And like, you know, you haven't heard that from GPT5 yet.
Like, you can't talk to it on advanced voice mode.
Like that's interesting.
It doesn't generate images.
Like, you know what I mean?
There's no, at least yet, native image generation.
We don't know much about how it works under the hood, but like, it's still calling
4o to generate images, right?
So it's like, do you start to see an unbundling of these model capabilities, like,
seems quite possible. Like the best model for writing natural language might not, or like
writing creative, you know, creatively might not be the same model that writes, you know, really
good Rust code. Like it might be different models. So I don't know. We'll see.
Yeah, create image here
is now tucked next to deep research, agent mode, et cetera. But I would hope that you can call that from
the actual chat interface.
You can call it from the GPT5 chat. It's just using, it's using GPT
Image 1, I think is actually the name of the model. So it's a dedicated, it's a dedicated
image generation model, which I think may be 4o. I don't totally know.
Yeah, I just, I don't
particularly care. I'm not looking for one model to rule them all. I'm fine with models calling
different tools. It seems fine. Yes. Anyway, fun day. Thanks for hopping on.
Of course. Of course. Anytime. Have a good one. Bye.
And that's our show today. Folks,
leave us five stars on Apple Podcasts and Spotify. And thank you for tuning in to the GPT5 gigastream.
We're on hour four and a half.
We've enjoyed hanging out with you.
Tyler, anything else from the timeline?
Close it out for me.
Timeline's still in turmoil.
If we want, we can show the little game I made?
Okay, yeah, let's show Tyler's game.
Can we do that?
Is that possible?
You got it?
Tyler's tower defense.
This was one shot.
Okay.
Wait, what do you mean?
One shot, one prompt.
You said you were working on it?
I was, but then it's like, wasn't as good.
Oh, so you went back to a single prompt?
Yeah.
I made a change, but then I realized like,
okay, this is not as good. So I just went back to the first one.
Okay. So yeah, my, my, my question is, I
mean, this, this seems, well, actually, like, it's, it's like the game engine. I don't know what it's
using under the hood. What it's, do you know, did it write like WebGL code or did it write...
I think
it's just, it's just like JS.
Okay.
And it's just like HTML canvas.
That's pretty crazy. Yeah.
Um, you'd think it would use some like 2D engine off the shelf or something, but, um, my, my, my question is like,
what, that won't go viral because that
is less impressive than just the Tower Defense app that I can get in the App Store.
For sure. But it's like maybe if I take my, you know how there's like ControlNet
images went viral where people take their corporate logo and then they'd throw that through
ControlNet and it would be like the TBPN logo overlaid over like a forest and like the trees
would look like the logo. Or like the QR code. Yeah. So maybe like it's Tower Defense
but it's my logo or something like that and like the the enemies are like moving through.
something like that. I don't know. There's just got to be a way to personalize it and make it so
every single game is a unique snowflake that you want to go and experience that one. You want to look at it.
You want to spend some time in it. I don't know. Yeah. But it's hard because it's like,
it's still, you know, predicting the next token. It's not like, the 4o image generation was like
kind of a, it wasn't novel, I guess, because there was image generation. Yeah.
It was like such a massive improvement. This is, like, there's not any clear massive step change here.
It's a little bit better in a lot of ways.
Yeah. Oh, well, well, we'll have to play with it more.
Let us know what you think about GPT5, and we will see you tomorrow.
Have a great day.
Thank you so much.
Bye.
