CoRecursive: Coding Stories - The Pre-Training Wall and the Treadmill After It
Episode Date: May 9, 2026

I've been confusing Don with frontier-lab links late at night for a bit. Ilya Sutskever told a NeurIPS audience that pre-training as we know it would unquestionably end. There's only one internet, and... the data isn't growing. The frontier labs call this the pre-training wall. A leaked Google memo from 2023 argued they had no moat. R1 is on GitHub. Llama is on Hugging Face. OpenAI's secondary-market valuation has climbed past $850 billion. Don was confused. So he came over and we made an episode about it.
Transcript
So hi, I'm Adam Gordon Bell.
This is CoRecursive, and today I have Don.
Hi, I'm Don.
I'm here.
Yeah, so Adam's been obsessed with AI and LLMs for way too long.
He keeps sending me tweets and articles.
In fact, you sent me a bunch just like last night.
Like 10 o'clock, too.
It wasn't, it wasn't early.
It was late.
It was late.
And I'm like, what is?
I think it was like nine.
It was like, I don't know, close to 10.
It was like 940 or something.
I mean, it's not like I was busy.
It's fine.
920.
920.
I win this round.
Okay.
But you ended at 950.
So basically Open AI has a new release and so they're out pumping it up.
Yeah.
And the thing I sent you that I thought was super interesting was this quote from Greg Brockman,
who's the president of the company.
He said, I think of Spud as a new base, a new pre-train.
And I'd say it's like we have maybe two years worth of research that is coming to fruition in this model.
And I have no idea what those words mean.
What's Spud?
What's a base?
What's a pre-trained?
two years away. So we'll get into that. And what was the second one I sent you?
The other one was older. A leaked memo from inside Google, three years old, a line you had highlighted
said, we have no moat and neither does OpenAI. And moat, like, why are they making a castle
analogy? Like, I mean, I feel like you kind of, you know what a moat means. I do. I do know
what a moat means. They're creating walled gardens, right? So like, they're like, well, hey, you know,
we're making this thing and we've got billions and billions of dollars in funding,
but there's nothing that stops somebody else from just doing this thing,
which is like the whole core of what the internet was created for,
like way back in the day, right?
It was just a bunch of people figuring things out.
Everything was open.
Then, you know, corporations moved onto the scene,
and all of a sudden it's like,
how can we monetize and make walled gardens and force people into our ecosystems?
Which I think ties to the third quote I sent you.
Do you want to share that one?
Sure.
It says R1 is on GitHub.
Llama is on Hugging Face.
and what's this $850 billion for?
Yeah, that one, it's cryptic,
but I feel like this gets at exactly what you are getting at.
But yeah, I call this format StackTrace,
working name, we'll see how that goes.
But it's like, you know, when something blows up
and you get a giant stack trace,
and you have to kind of figure out what the error is
and peel back the layers one by one, right?
So I thought if we could peel back through these quotes
from these articles that I thought were super interesting.
The Brockman one is like brand new, right?
It's now the 29th.
I think it was like a couple days ago they released their new framework.
So I thought if we can walk backwards from whatever the business case to the engineering,
what they're building, how it works, because none of this makes sense unless you understand
what pre-training is, what a base model is, like what Open AI is even doing or trying to do
with their new models.
And like once we add some meat onto these bones, maybe we can figure out if these companies
make sense, if they'll be profitable, if the world will change, et cetera.
Yeah, no, that sounds like a good idea.
Okay, so let's start with what training is.
So did you ever use the old school co-pilot where it was like auto-complete in VS code?
I used something similar in IntelliJ.
So I didn't use VS Code too much, but IntelliJ had like auto-complete and it started
getting smarter and smarter.
Like it would sort of look at like the context in what you were writing and try and propose
something.
I would say that maybe 60% of the time it was useful, but like 40%, it was like, way off.
It was like, I don't want that.
And then it would, it would try and you get into this state where it's like, press a button to auto
complete it.
It's like, but I don't want to.
So now it's interrupted my flow, right?
Because I can't just press a button or else it'll spew out all the stuff I don't want.
So I'd have to like hit another button to cancel it out.
I don't know if this is just a me problem, but it got, it got in the way of me actually
writing the code of like, no, leave me alone.
You're suggesting something that's not useful.
Yeah, I feel like people had different reactions to it.
Like some people are still using that form factor, but many people aren't.
But that was, like, part of the first iteration of these LLMs.
It was just picking the next token.
So you have like all this code and then it's like, hey, what comes next?
And it tries to guess that.
That's the entire like training objective.
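That training objective, predict the next token, can be sketched in miniature. Here a bigram count table stands in for the transformer, and the corpus is an invented toy string; the point is only the shape of the task: see lots of text, learn what tends to come next.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    # "Training": count, for every token, what tends to follow it.
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, token: str) -> str:
    # "Inference": emit the most frequent continuation.
    return counts[token].most_common(1)[0][0]

# A real model does this over trillions of tokens with a transformer;
# here "the" is followed by "cat" twice and "mat" once, so "cat" wins.
model = train_bigram("the cat sat on the mat the cat ran")
print(predict_next(model, "the"))  # -> cat
```

Scaling that up, per their bet, meant the same objective with more parameters, more data, and more GPUs.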
So before Copilot even launched, back in 2017, Google publishes the paper Attention Is All You Need,
and it invented the transformer.
The transformer is the T in like GPT, right?
And it was just in Google, they figured out, hey, let's take this transformer thing,
let's feed it all the internet, every Wikipedia article, every book.
Every Reddit, yeah.
Everything you've ever posted on Reddit is preserved somewhere.
And you just get it to predict what the thing is that comes after that, right?
And Open AI, at this point, you know, they were kind of a research lab.
And they did, like, these Dota 2 battles where they were trying to beat professional players.
They had, like, a physical robot hand that was trying to solve Rubik's cubes.
Like they did all this stuff.
It was like supposed to be a researchy type organization.
The GPT was like one of their bets.
And it was complicated because to get it to consume all of the internet was this complicated
training run and it was a bit finicky.
But it worked, right?
Like it started to become good at predicting this next token.
And like as we know this all, you know,
became this huge industry.
But the first thing that they kind of figured out,
even maybe, you know, in the very early days,
was that, hey, we have something here
where if we throw more compute,
if we throw more GPUs at this,
it just gets better.
Yeah.
Like that is kind of unusual, right?
Like most problems, I don't know,
most problems can't just be solved by like,
give it more CPUs.
I find the opposite.
That throwing more hardware at it is kind of like a,
it's kind of like a trope in our,
in our line at work, right?
If you have some code that's not well optimized,
it's using a lot of memory, it's taking a long time,
you throw more hardware at it and, problem solved,
until it starts slowing down again.
That's true, right?
Like, why optimize this code?
Yeah, that is a trope.
Why optimize this code?
Just buy a bigger server.
Just buy a bigger server.
Yeah.
Just upgrade it to the next node size.
Yeah.
So they have this very clear idea.
Like, hey, if we can throw enough compute at this,
then we'll have AGI or something.
We'll have something very intelligent.
Hey, we've got something that seems like it can think a little bit.
We can't chat to it yet.
And, like, if we just throw more compute at it, it can think even better, right?
And so let's just keep doing that.
And they call this process training, simple.
All the text of the world feed it to this thing, give it as much compute as you can.
We get something smarter and smarter.
So this original hypothesis of just, like, scaling up came from 2014, even before the transformer.
There was this guy, Ilya Sutskever,
and he had this paper called Sequence to Sequence,
and he argued that, yeah, with a big dataset and enough compute,
success is guaranteed of, like, building, you know,
some sort of prediction machine.
What does success look like, like, define, like, success?
The predictions are only useful if they're accurate most of the time.
Yeah, so he had come out of the University of Toronto,
this, like, deep learning group,
and they had this great success on what had been this really hard
problem at the time, which was identifying images, like picking what the things were in images
and tagging them. And people had been competing from all different places to do the best
labeling of these images. And this group, what's the head guy's name? Hinton. So this is Hinton,
right? He was the professor. And yeah, they beat this benchmark of identifying images. They were
just like so much better at it. And they did it with deep neural networks and just like a lot more
compute, right? They blew it out of the water and revolutionized the field of machine learning.
This guy in the background, this is Ilya right here, right? He was one of his students.
They entered this ImageNet thing, which was an annual computer vision competition,
and their submission was called AlexNet, and they trained it on two consumer gaming cards
that they had in the basement of U of T. I don't know a lot about GPUs, but you probably do.
So they had two GTX 580s. Is that good, or?
Those were good, yeah, those were good cards.
I'm still rocking a 1080 TI.
It's old.
Yeah, I don't know.
I'm not up on the field of GPUs.
But the point is, like, they, they were able to use neural nets and they just blew this, this
benchmark out of the water that had been kind of, people have been inching up, right?
Like getting a little bit better at identifying things.
And that year when they submitted, like, the runner-up got 26.2% of the questions wrong, and they
got 15.3%.
So, like, everybody had been slowly climbing into the 20s and they just, like, cut it in half.
They're like, we got everything except these.
But 15% is actually not bad.
What kind of questions are we talking about?
It's identifying all these different things, but for some reason, there's a lot of dogs in it.
So like to guess the dog breed and like circle like, oh, this is a whatever.
That's better than I would do.
Because of all those poodle ones.
Like, who knows?
Yeah.
Yeah, there's like a million poodle crosses.
Yeah.
So every researcher in this room grew up to become very important because this was like
a revolution when they built this, when they beat this using.
new approach, right? So three years later, based on this, Ilya, this guy, he co-founds
OpenAI. And he co-founds it with Sam Altman and Brockman, from the original quote. And I'm
sure you've heard of Sam Altman. Then there was this other party, Elon Musk. Who's that guy?
Elon Musk. Oh, I haven't. Yeah. Not familiar. Interesting. Anyways, so Ilya is like, he's, he's the research brains,
right? He's the researcher who does all the research, right? Brockman is like the engineer, you know,
like let's actually productize all this, right?
So he becomes chief scientist, and then in 2020, there's a paper published.
They actually write a formal paper, more than just vibes, that says, like, here is how, given more compute, we can learn more things.
Right.
Like here's, this is their paper.
They say, like, here's how we do this.
Which is interesting because like before this, all these people were competing for ImageNet and it was like, you try a bunch of things, right?
you have your pile of GPUs in your basement or in your research lab, and it's like, try some stuff,
see if you can do better. But here they're like, dude, we have a graph. And obviously the graph was
based on some, like, concrete data, because, I mean, anybody can make a graph and be like, oh, you know,
I improved the performance by 20% when I, like, increased the hardware by this much, so therefore
my graph goes to the moon. Yeah, or like, you got married 10 years ago, so by the time I'm 60, I'll be
married five times. You need, like, a solid base of some comprehensive data points at the beginning of
the graph to make a prediction. So I think it makes sense to be skeptical, but from their perspective,
there was something great they could do with their graph, right? Which is to say, like, hey, if you give
us more money, we can buy more GPUs. And like, per our graph, we'll have a smarter thing. And so
it becomes like a fundraising thing. Well, I mean, like, yeah, why else would you make a graph unless you
wanted to like, you know, convince somebody to give you money? Yeah, like, I have this thing here on the
graph, but think what I could do instead of those two like Don GPUs if I had all the highest end
ones that I could fit in this room. Oh, now we're talking, right? So this is sort of what they do, right?
They kind of have this graph and this published paper, you know, and it's published in a reputable place.
So people have vetted it. Yeah, so originally it was financed by this guy who you said you didn't know,
uh, Elon Musk. And, um, I don't know. He was just like, cool. Let's
build the super AGI of the future, right?
There was a bit of a shared belief in the early days among all the people involved in this,
where they all believed in this idea that we could build a superhuman, intelligent machine.
And that belief meant that when somebody got a published graph saying, like,
GPUs go up, smarts go up,
they're like, let's do this, right?
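The graph being described, from the 2020 scaling-laws paper, is roughly a power law: loss falls predictably and smoothly as compute grows, a straight line on a log-log plot. The constants below are invented for illustration, not the paper's fitted values.

```python
def loss(compute_flops: float, a: float = 10.0, b: float = 0.05) -> float:
    # Power-law form L(C) = a * C^(-b): more compute, predictably lower loss.
    # a and b here are made-up illustrative constants.
    return a * compute_flops ** -b

# Every 100x in compute buys the same *fraction* of loss reduction,
# which is exactly what makes the graph a fundraising pitch:
# buy more GPUs, and per the curve, get a predictably smarter model.
for c in (1e18, 1e20, 1e22):
    print(f"{c:.0e} FLOPs -> loss {loss(c):.3f}")
```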
I guess where I'm getting hung up is what was the thing that they were buying for $850 billion?
Like, a smartness?
Yeah, I'll buy, you know, here's $500, give me some
units of smart. We can have a thousand units of smart. Cool. Here's $850 billion. Are they just
still buying units of smart? That's not a good business plan. Yeah. So they built an API, right? So
early like GPT3, I used it. So this is before the chat. It was just token completion. Like,
I tried to use it to write tweets for my work because like I would write a blog post and
don't want to write the tweets. But you would have to be like give it an article and then a
sample tweet and an article on a sample tweet. And then when you give it your article, then
it's like, oh, I get the pattern.
Complete it for you.
You couldn't say like, hey, man, write me a tweet.
Like, it didn't understand that.
You couldn't communicate with it.
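Concretely, using GPT-3 through the API meant building a prompt that demonstrated the pattern and letting the model continue it. The articles and tweets below are invented stand-ins for that workflow:

```python
# No chat, no instructions -- just show two example pairs and leave the
# pattern dangling for the model to complete after the final "Tweet:".
prompt = """\
Article: We benchmarked five JSON parsers and found a 10x speed spread.
Tweet: We tested 5 JSON parsers. The fastest was 10x quicker. Details inside.

Article: Our new post walks through debugging a memory leak in Go.
Tweet:"""

# The completion endpoint would then generate text continuing the prompt,
# hopefully a tweet about the Go post. Asking "write me a tweet" directly
# didn't work: the model only continued patterns.
print(prompt)
```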
Anyway, so they put this on an API.
They charge for it.
People like it.
It's exciting.
But it's small. They're still small.
But they're like, we're on to something.
Let's raise more money.
And yeah, their business strategy, the one they said you could
write down on a single grain of rice, was scale.
Like, the word scale. Because they're like, we have this thing on this API and people
are paying for it.
And it's, you know, it's one unit of
smart. If we had 10 times the amount of GPUs, we could have 10 units of smart, or whatever the graph
was. So they were like, we got to go, man. Like we have this thing. It's going to change the world.
But like, anybody can look at what we're doing and be like, oh, we could do the same thing, right?
There's no secret sauce from their perspective. They're like, dude, if people knew all we're doing
is trying to get as big as possible as fast, like we'll be in trouble.
Yeah. So like the underlying algorithms are easy to replicate.
And that's bad because they want to inevitably be the people that hold the keys?
Yeah, they had this idea that the first super intelligence that came around would be all powerful.
And so it better be us because we're great upstanding people who control it and not China or Iran or just that guy down the road.
But like the same rationale is the atom bomb.
Yeah.
but from a corporate side, because at this point,
I don't think national security has gotten into it yet.
At this point, it's just a bunch of nerds building this thing, right?
Not yet, but the U.S. will get involved.
Okay, let's keep going.
So in the year 2022, DeepMind, which was a group within Google, right,
they published this paper called Chinchilla.
So at this point, like, the GPT thing, you know, came out of Google.
Google wrote a paper, but they didn't really build anything on it except an internal
LLM, and then OpenAI ran with it.
Google sees it's something important
and they keep working on it.
And so they put out this chinchilla thing.
And it shows that their graph
is kind of wrong.
Oh. Which isn't good, right?
That's not new.
And so the chinchilla people,
they built a whole bunch of models
of varying sizes and varying training.
And they found that like, no, there's actually a very
clear relationship. It has to do not
just with the compute, and not just with how
big the model ends up,
but with the amount of data that you trained it on,
which makes a lot of sense.
It's like you give it more information for it to get smarter.
Yeah, it needs to know more
so that it can have a bigger library to draw from.
Yeah, and so you can build it bigger with less information going in,
and it's just, it's not smarter.
It's just bigger, right?
So key factor is the data.
Yeah, if we can give more resources to this thing, it'll be better, right?
So Chinchilla doesn't really break it.
It just adds a new important wrinkle, right?
It sounds like it reveals an important factor in making it work.
It's not just compute.
You need the data.
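Chinchilla's rule of thumb can be sketched numerically. The commonly cited version of the result is about 20 training tokens per parameter, with training compute approximated as C ≈ 6 · N · D FLOPs; this just solves for a balanced model under a fixed budget.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    # Solve C = 6 * N * D with D = 20 * N for the parameter count N,
    # then derive the matching token count D.
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# At roughly Chinchilla's own budget, this lands near its actual
# configuration: about 70B parameters trained on about 1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```

Build it bigger without the matching data and, as they say above, it's not smarter, just bigger.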
And so at the same time, the same month as Chinchilla, OpenAI publishes a paper that they call InstructGPT.
Guess what InstructGPT is?
Are instructing it to do something?
So you're giving it data to learn on?
How would you instruct it?
You would have to feed it similar data for what you want to accomplish.
So, like, I mean, you're sort of close.
But no, this is ChatGPT.
So it's a different way of asking it to do something?
Yeah, because before, you would give it a bunch of texts and it would predict the next token.
So now you're just talking to it.
Now you're talking to it.
Instructions.
So they call it Instruct GPT.
So the thing that happens is that becomes the post-training step.
So there was training and now we have this post-training step where we make it more human.
But then they decide to rename that first training step where it consumes the whole internet pre-training.
Right.
Which means you end up with this weird world where,
They have a pre-training step and then a post-training step.
There's no actual training step.
Like, they've accidentally...
They've removed the training.
Yeah, exactly.
The training is gone, even though it still exists.
Okay, so then mid-2023, they start a new training run.
Same idea, right?
Let's just, let's do an even bigger pre-training, right?
So we've trained on the whole of the internet or a lot of it, like, let's add even more, right?
As we know, I guess they trained it on a lot of books that they probably didn't properly have access to, but it's like, let's feed it more data.
We understand the formula. Let's make it even bigger. We'll have an even smarter model. So that's
in mid-2023. They call this Orion. And this was supposed to be GPT-5. So 3.5 was the first
ChatGPT, and then there was 4. And they came really close. And they're like, let's make
5, because the difference between 3.5 and 4 was really big. But I remember when they shipped it,
because Sam Altman said something like, hey, we have this new model. It's pretty cool. Not sure if
we'll release it for a long time. But, like, he kind of downplayed it. He's like, it's all right,
you know what I mean? He wasn't like, this is the most exciting thing. And so this was, it was
supposed to be GPT-5, blow everybody's socks off. We're at the next level of smarts. But they
released it as 4.5, and it was super expensive if you use the API. And then I used it a little bit,
and it felt kind of like more natural. I used it to get critiques in my writing at the time,
like, hey, what's wrong with this essay? And it felt like more, it's hard to describe. It felt like
more human or something. I thought it was great, but they took it away, right? It's gone,
you can't use it anymore.
What was, why did they take it away?
So, interesting theories, right? But one for sure is, you know, it was 10 times the size of GPT-4-point-whatever, right? A lot more expensive. Like, it requires 10 times more servers. And it's, like, a bit better. Like, some people were like, yeah, I mean, I can tell it's a bit better. But, like, you're like, yeah, but it costs 10 times more, and, you know, the results are maybe not as obvious as you'd want.
If I play against the chess bot that's on my phone, like, it will beat me, right? If I play against
whatever alpha chess, the best chess player in the world, it will also beat me. It'll also
beat you. You don't notice. So yeah, reportedly, it's $500 million that they spent on that
first run. And it was, it was like the model was fine, right? We got to try again. We got to get
the smarter one. By that time, you're up to a billion dollars in compute. So have you ever
had a project this big fail? No, no, I haven't. I can speak from experience: I've never had a $500
million project, like, flop. Because that's a huge failure. But there's also this problem, right?
The business case is predicated on them making this forward progress. So it could be
devastating. So that's probably why they did it a second time. They're like, we're not giving up on this.
In the middle of this, Ilya, the guy we're talking about, he left. He just left OpenAI. Not a good
sign. There has to be some kind of mitigating factor as to why. But to this point, they've been operating
on the premise that if they just give it more compute and data,
it will increase according to this chart.
Yeah, we shall find out, right?
I think, like, a lot of people have this perception
that these labs, like, open AI,
like they're huge, they're making all this money,
they're sitting on these big piles of cash,
people are paying them for this product,
like it's an amazing place to work.
But if you think of it, it's really high stress, right?
Like, they need to keep this promise going.
It's very important for their valuation
that they're always able to have the next exciting model.
Like, the whole thing,
is premised upon, you know, number go up.
And that's most corporations.
But just being the hottest one with this huge, like, valuation.
And it's not like Apple, where they have phones and stuff, like installed.
It's like they have this API that you call.
And if it's not getting better, and if there's alternatives, like, it can fall apart very
quickly.
Yeah.
And I guess the thing is that when people didn't like it, and you say, well, why?
It's like, well, it just didn't feel as
good. It's like, are the results based on feelings? How do they quantify it, I guess? The benchmarks, right?
And, like, in the early days, they tested against, like, the LSAT, the lawyer test, the GRE,
the graduate test. Right. And then when they had the $500 million model that was 10 times more
expensive, did they run those benchmarks, and was it way better? Was it like 10 times better?
No, no, it was like a little bit better. Right. Oh, okay. So, like, that's what they're basing this
result on. Yeah. It's like, the old one got an 82, this one got an 84, and you're like, oh, but it's
10 times more expensive. And you're like, maybe it's one of those phenomena where, yeah, that last
bit, like, being perfect is very hard. It's like a logarithmic scale, right? Where you could put in
10 times more, but you're not going to get a 10-times improvement on that score. You're going to
have to put a hundred times in. Yeah, the final 10 percent is the hardest. So this goes on for two years. So they build this
giant model, it's not great. And SemiAnalysis, like an industry research outfit, says OpenAI's
leading researchers have not yet completed a successful full-scale pre-training run that has been
broadly deployed since May 2024. That was GPT-4o. So like that's not good, right? It's like the main
thing they do, they haven't been able to do a new one and time is passing on, right? Internally,
you have to imagine they're trying all these things, and they're not moving like they'd like.
And even inside, people didn't really agree on why this wasn't working.
So they looked into it and they couldn't figure it out? Well, I mean, what's your guess?
Something to do with the core way in which it operates. It got to the point where more hardware
isn't going to actually make up for improvements in the algorithm. So in December 2024, like,
Ilya, who left, so he left OpenAI. There was a whole kerfuffle where he tried to get Sam Altman kicked out.
He failed at that. Corporate drama.
Yeah, corporate drama, right?
The researcher guy tried to pull the power move,
and the executive people won.
I'll make a movie about that someday.
Yeah, right?
Anyway, so he starts a new company,
and then he's at NeurIPS.
He's at this big conference where he's being presented an award
for his great earlier work that led to all this,
and he gives a talk.
Pre-training, as we know it, will unquestionably end.
Why will it end?
Because while compute is growing,
better hardware, larger clusters,
the data is not growing, because we have but one internet.
You could even go as far as to say that data is the fossil fuel of AI.
So like they made these early versions.
They scrape a lot of the internet.
They scrape all these books.
They feed it to it.
And it's great.
And then you're like, okay, we need 10 times more.
So like, okay, we used to download the source code on a GitHub repo.
Now let's get every revision, right?
Let's get all the history.
Well, it's just like less good data, right?
Or, like, we got every important Reddit post, so let's go to the really obscure forums.
It's like there's just less good data out there.
Well, it seems that they've reached the limit to what the core algorithm can actually
solve given its data.
We operate every day without the whole internet to, like, figure out answers to questions,
right?
Yeah, yeah.
So if you need the whole entirety of the human internet to be a little bit better,
then maybe you're not using the data
you have as efficiently as you should be.
You've eaten all the good parts.
Yeah.
Only the crumbs are left and they're not going to get you where you want to be.
Yeah, I agree.
And so they call this the pre-training wall.
Pre-training was the original training, and now they've hit the wall.
They're like, we just can't.
There's nothing here.
Like, we can't get past this.
We've, uh, or as Ilya says it, right, like, there was these fossil fuels,
which was all of the internet and all these books and we ate it all.
We've hit peak oil.
There's nothing left.
There's nothing.
This is, so they have to find another way, right?
Okay.
So to understand what happened next, like how these improvements happened, we have to go back to DeepMind, right?
So DeepMind was the people who released the Chinchilla paper, but the more important thing
was, like, I don't know if you remember, like a decade ago,
there was this AlphaGo.
Do you remember that?
Yeah.
The game go.
Yeah.
Yeah, I remember that.
These guys, they had this DeepMind company that got bought by Google.
And originally they started it playing Atari games. Then eventually they did Go and then chess. And the way that they trained it was
this reinforcement learning. So they create something. They get it to play Go against itself. And then
whichever one wins, they, like, let that one continue, make two copies of it. So kind of like an
evolution-type thing. Yeah. So it learns. But uniquely, it doesn't need the internet. It's not reading
Go books, right? It's playing Go. Creating its own data. It's creating its own data by playing the game
itself. And when they originally created this, Go was considered, like, uncrackable. And then
Google had this big tournament against the best Go player in Korea,
Lee Sedol. And nobody thought that this thing would beat him, and of course it crushed him,
because it had been playing Go against itself for, like, the compute equivalent of a zillion
years, right? It's just learning and learning and learning. So it's creating its own data,
as you said, right? Which is like a great solution to this problem,
but it needs a scoreboard, right?
Somebody has told it what is the preferable outcome.
Like in a game, there's rules and you know when you win, right?
So you can generate data because you can always figure out, oh, did I win?
Yes or no, right?
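That self-play loop, stripped to a skeleton: a game whose rules provide the scoreboard, and winners copied forward with small variations. The toy game and skill numbers here are invented; AlphaGo's real training used deep networks and tree search, but the data-generation shape is the same.

```python
import random

def play(a: dict, b: dict) -> dict:
    # Toy game with a built-in scoreboard: skill plus luck picks a winner.
    return a if a["skill"] + random.random() > b["skill"] + random.random() else b

def self_play(population: list, rounds: int = 200) -> None:
    # Pit two players against each other; replace the loser with a
    # slightly mutated copy of the winner. No external data is ever
    # consumed -- the game itself generates the training signal.
    for _ in range(rounds):
        i, j = random.sample(range(len(population)), 2)
        winner = play(population[i], population[j])
        loser_idx = j if winner is population[i] else i
        population[loser_idx] = {"skill": winner["skill"] + random.uniform(0, 0.1)}

random.seed(0)
pop = [{"skill": 0.0} for _ in range(8)]
self_play(pop)
print(max(p["skill"] for p in pop))  # skill climbed with zero outside data
```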
So they started with this training, right?
That became pre-training.
And they added on this chat thing, the instruct step.
Now they add on this new step, which is reinforcement learning.
So they call it RLVR, reinforcement learning with verifiable rewards.
But basically, they need an action for the LLM to take
where we can verify if it got it right or not.
What's an example of something that's easy to verify if you got right?
Like math.
Yeah, math or anything that has like a right or correct answer, right?
Or I think the most impactful one of recent years is like coding, right?
You can write code.
And it will work or it won't.
It'll work or it won't, right?
You can run the compiler and see if it worked.
There's some nuance there because you can write code that will work but isn't good.
Yes, I know that.
I used to work with you.
I know.
Yeah, yeah.
Oh, thanks.
It's a cheap shot.
Cheap shot.
Yeah.
And so the cool thing is that they just lean into this, right?
So this is a new way to generate data that OpenAI comes up with in their panic, and they kind of keep it to themselves.
They can ask the LLM to, yeah, come up with the solution to a bunch of calculus problems.
And they ask it to, like, think out through all the steps, right?
So let's ask it 12 times to solve this calculus problem and think it out step by step.
And, like, most of them are wrong, but maybe one is right.
And so then they take that one where it got it right, and they can feed that back in as training data, like, update the weights.
And they just start doing this in loops, right?
Because once they get it to successfully do some calculus, then they update all the weights.
Now it's a little bit better.
And they can give it more problems, get more right answers.
Now they're doing this deep mind like go thing, right?
They can take their LLM, do, like, thousands and thousands of generations,
getting better at problems, as long as the problem has like somebody to say, like,
is this right or not?
So now they're generating their own data.
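The loop just described is essentially rejection sampling against a verifier. The model below is a stub that guesses numbers; in the real pipeline you would sample step-by-step completions from an LLM, check the final answer, and fine-tune on the traces that pass. The problems and the 12-sample count are invented to mirror the description above.

```python
import random

def model_attempt(problem: dict) -> int:
    # Stub standing in for an LLM: guesses near the truth, sometimes right.
    return problem["answer"] + random.choice([-1, 0, 0, 1])

def verify(problem: dict, answer: int) -> bool:
    # The scoreboard: math has a checkable correct answer.
    return answer == problem["answer"]

def collect_training_data(problems: list, samples_per_problem: int = 12) -> list:
    kept = []
    for prob in problems:
        for _ in range(samples_per_problem):
            ans = model_attempt(prob)
            if verify(prob, ans):
                # A verified-correct trace becomes synthetic training data;
                # the real loop would update the model's weights on it.
                kept.append((prob["question"], ans))
                break
    return kept

random.seed(1)
data = collect_training_data([
    {"question": "17 * 3", "answer": 51},
    {"question": "9 + 4", "answer": 13},
])
print(data)
```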
So this becomes O-1, this GPT model.
And so in a way, it's like they have this wall of training and like they hit this wall and then
they found just like a new dimension, right?
So they can grow by generating their own data in another direction.
Yeah.
I mean, going back to his analogy of how that's the fossil fuel of AI, it's like they've just come up
with a more efficient combustion engine.
Or a renewable resource, right?
Because here the thing, it's like you take the LLM
and it can play its own games.
And if it succeeds, you're feeding that back in, right?
So it's renewable and it's generating its own data
if you have a way to score it, right?
I mean, in the places where you can verify the answer, it can learn.
But okay.
Yeah.
So in February, so now we're like coming close to modern day, right?
So in February 2025, we haven't even talked about Anthropic,
but Anthropic releases Claude Code and the cool thing,
I don't know where it happened with them and they've never confirmed it,
but all of a sudden, these LLMs,
they don't necessarily start doing a lot better at all different trivia,
but they just start getting super good at coding.
And the theory that I think is pretty much confirmed, right,
is just like, Anthropic builds Claude Code,
but they can train on Claude Code as well, right?
So they have all these problems and then they can run Claude Code through it,
and then when it works, they're like, good job Claude Code,
and they reinforce it.
And so it gets better and better at coding.
It's not necessarily better at all kinds of other things,
but this is like a very clear signal.
If we have a bug on this project and it can solve it,
it learns to get better and better at these things, right?
And so that makes all this synthetic data, right?
If they have Claude Code run on a problem
and it solves it correctly,
then, you know, they end up with this thing
where it's step through things, right?
They're creating their own data, as we said.
But so this is a big,
this is like a second big breakthrough
by OpenAI, right?
They have a new way to generate more data.
They kind of keep it closely held.
But at the same time, the government is getting antsy about, you know, AGI.
The U.S. is like, I hope we get AGI and not China.
And then they outsmart us and destroy the world.
Or maybe the other idea is the government's just worried, like, hey, this is going to be huge
industry.
We want this industry to be American, right?
And so they start putting in place controls on NVIDIA, telling NVIDIA, like, don't sell
GPUs to China.
Like we just don't want that.
And then Nvidia doesn't love that because they're like, we like to sell these things.
We make a profit on them, right?
Companies can buy these Nvidia GPUs, but they are handicapped.
So they're super good at doing GPU stuff, but they have a very low memory.
They did that back for Bitcoin mining too.
That's when it started.
But now it's not a tariff, I guess, but it's like an export constraint.
The Chinese just can't get the ones that you can buy here.
Yeah, there are proprietary processors meant specifically for AI that Nvidia makes.
They're not allowed to sell them, yeah, to China.
So in China, there's this company called Deep Seek that we talked about at the beginning.
And they spun out of this hedge fund because the hedge fund wanted to do all this machine learning,
I'm assuming to predict the stock market.
They decide like we're going to build, we're going to build our own AI, right?
And because they couldn't get the Nvidia chips.
Yeah.
So they could get the Nvidia chips, but not the really high-end ones.
They could get the less high-end ones.
And so they couldn't get the H100, which is the frontier lab ones that cost
like $30,000, you know, apiece, and in the server, they end up putting like eight of these in.
So they're very expensive.
I mean, they would have bought them, but they weren't allowed.
Right.
So they could only get this one called the H800s that are just much less good at talking to each other.
And the problem is you need a whole cluster of these to make it work.
And so DeepSeek is like, hey, we got to crack this code, right?
They released a paper about what they did.
So here's one of the things.
Low-precision training is often limited by the presence of outliers
in activations, weights, and gradients.
So this is one of their tricks where they were able to lower the bits.
Like, it's like they're running like a, like an N64 game on like an NES 8 bit.
Like they were able to lower the bits without losing the accuracy somehow, which let them...
Like MP3s, right?
Yeah, and then...
They're able to do more with less.
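The trick they're describing, storing numbers in fewer bits without letting a single outlier wreck everything, can be illustrated with a toy block-wise quantizer: each small block of values gets its own scale, so an outlier only coarsens its own block. This is just the general idea, not DeepSeek's actual FP8 recipe.

```python
import numpy as np

def quantize_blockwise(x, block=4, bits=8):
    """Toy low-precision storage: split x into blocks, scale each block
    by its own max so a local outlier only hurts its own block, then
    round to signed integers. Returns the int codes plus per-block scales."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for 8 bits
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                        # avoid divide-by-zero
    codes = np.round(x / scales).astype(np.int8)
    return codes, scales

def dequantize(codes, scales):
    return (codes.astype(np.float32) * scales).reshape(-1)

weights = np.array([0.01, -0.02, 0.03, 0.015,       # ordinary small values
                    5.0, 0.02, -0.01, 0.03])        # one big outlier
codes, scales = quantize_blockwise(weights)
restored = dequantize(codes, scales)
# The outlier's block uses a coarse scale, but the other block stays precise.
assert np.allclose(restored, weights, atol=0.05)
```

Like the MP3 comparison in the conversation: you throw away precision where it matters least and keep the result close enough to the original.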
Only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
So what happened there is these things were handicapped
at how quickly they could network to each other.
And so they found a way to use some of the layers as like a network card so that they
could more quickly talk to each other.
We employ customized PTX instructions and auto-tune the communication chunk size.
So basically they figured out the instruction set or maybe it's known for Nvidia GPUs.
And instead of using the normal like SDKs, they wrote in assembly how the instructions
would work, sidestepping, how Nvidia does things so that they get performance speed.
And so they published this whole paper on this.
They published this new LLM and it blew people's minds.
Like it's a more gritty approach, right?
It's like we're constrained.
So we need to come up with a different way.
Yeah, I mean, and that's how a lot of things were back in the early days of, you know,
software development.
And like, you had to be very aware of how many bytes of data certain fields were because
you only had so much to work with.
So the government, like the U.S. government, tried to prevent China from getting ahead
by putting these constraints in place.
But the constraints actually just taught these Chinese companies how to do more with less, right?
Maybe it even advantaged them because now they can operate on a smaller budget.
Yeah, I remember when this happened, they came out with, like, their DeepSeek release.
And it was like, Nvidia was freaking out because, well, if they can do this with their constraints,
what's going to happen to us, right?
The gravy train might be coming to an end here because obviously, like, you know, we have,
all the unlimited hardware.
And we can't perform as well as this.
It's almost like we should have been looking at how to optimize our AI instead of just
throwing hardware at it.
Yeah, exactly.
So they published in their paper that it cost them to do this, about $5.6 million,
which was a little bit misleading because they were only talking about one specific stage
of the training.
But that got published and people were using it.
And they're like, this thing's amazing.
And meanwhile, OpenAI is saying, like, we spent a billion and it didn't...
We didn't get the same results.
Yeah, if we didn't get an improvement, yeah, everybody panicked, right?
The Nvidia stock fell.
Everybody was like, what's going on here?
Other thing that happened is when they published this paper and they released this model that they called R1.
So one thing is this has to do with the moat, right?
So we found ways to work with less, right?
The other thing is the deep seek people publish in their R1 thing, this whole reinforcement learning idea.
For OpenAI, this is their new secret, right?
They're like, oh, we can give this thing rewards, have it think out, provide this feedback.
So R1 uses the same trick.
They came up with it on their own.
Hey, try to solve these problems, try to think it through, and we'll take all these results, and then we'll feed them back.
And they publish exactly how they train it.
OpenAI's new trick that's going to blow away the market, this Chinese company just...
It's just open.
Just put it in a PDF and put it on GitHub, right?
Eventually, like, enough people are going to come to the same conclusion in the end.
I mean, that's how most inventions happened, right?
Maybe, but like, not right away, right?
And not right away.
Yeah, I guess they were hoping that they could keep that secret a little bit longer.
That's our secret sauce.
That's our secret sauce.
Yeah, so they called it RL on verified reward loops, and they described this like multi-stage
pipeline and that there was this like aha moment where they saw after doing this feedback that
the LLM started to talk to itself and say, like, oh, that seems like the wrong answer,
maybe I should try this.
And in its thing, you're seeing like, oh, whether it's thinking or not, it doesn't matter,
but it's starting to be able to put out reasoning loops of like following a chain down one path,
backtracking, going down another.
They're like, oh, we're on to something, right?
So they get these reasoning loops where it's succeeding.
It's like thousands of generations of it generating a bunch of questions,
verifying which are right, feeding it back in.
It's learning.
It's generating its own data.
And they just put the model out open weight for people to use.
They put out, here's how we built it, right?
because they're part of a hedge fund.
Like, this isn't how they make their money.
And it's like, oh, my God, this is a crater into this whole, like,
capitalistic venture of building these amazing models, right?
It's because, like, on the one hand, these AI models,
they're amazing.
Like, the work that they can do, it's ridiculous, right?
But, like, it's just the most, it's such a powerful tool.
But on the other hand, yeah, you create it.
And then somebody else is right behind you.
And then quickly the value of them is, like, is, like, going towards zero.
Well, it could use like an older analogy.
If you think way back in the day when, you know, they came up with TCP/IP.
And if they had walled that technology off and they're like, only we know how TCP/IP works, right?
Somebody else would have figured it out.
Yeah, yeah.
It's tricky with networking, because networking is very much about interconnecting.
But yeah, I get your point.
Yeah, it was a technology that, you know, they could have held on to it and said, this is how, this is how it works.
And now we are the holders to anybody who wants a network.
Proprietary technology that, like, makes networking work.
and you have to pay us a license to use it.
But other people are going to figure that out eventually, right?
Or they'll come out with an open standard and be like, well, you know, everybody should use
this because it's easier.
Yeah, exactly.
So if you go back to where we started, right, we have this like pre-training that's actually
really training, and then this, like, thing to make it chat-like, and then this thing to do
this reinforcement to generate its own data, right?
So like Deepseek figured out how to do this part, right?
That nobody else could.
and they figured out how to do it very cheaply.
But like, doing that first step of consuming the whole internet is still really expensive, right?
And so, like, you could think, like, oh, that's a, that's like a moat, like getting all this data and putting it together.
And the barrier there is higher because you have to consume the whole internet.
And that's something that's logistically hard to do.
Yeah.
But enter Mark Zuckerberg, right?
So at the similar time, I'm not sure the exact time frame, right?
Facebook, meta, they start building their own base model.
They're like, we don't want to be left out of it. And then when they release it, they say, like, hey,
we're not actually in the business of being like an LLM-serving API or something. Like, we sell ads.
About... I don't know what their ads are. Yeah, I don't, I don't use Facebook. Yeah. And so
they say, like, hey, if you're a researcher, you can just download our model. Just ask for permission
and you can download it. And so they do that, and then very quickly, one of those researchers
downloads it and just puts it on BitTorrent
because literally
yeah, why not?
Why not? And then Facebook demands that they
meta, I guess, demands that they take it down
and then, but it's too late.
And so they change their stance.
They say like, no, this, I mean, you can use this.
If for non-commercial purposes, just grab it and use it.
Right. And this becomes Llama.
So this is like the first, I think, open weight model.
If you have the GPUs that you can run this on,
like you can just grab it.
and use it for free.
So Facebook spent whatever,
the $500 million to consume the whole internet,
which is weird, right?
Like, why would they do that?
I don't know.
Here's what Zuckerberg said.
In the early days of high-performance computing,
the major tech companies of the day
each invested heavily in developing their own
closed source versions of Unix.
It was hard to imagine at the time
that any other approach could develop
such advanced software.
Eventually, though, open-source Linux gained popularity.
Today, Linux is the industry standard foundation for both cloud computing and the operating systems that run most mobile devices.
And we all benefit from superior products because of it.
I believe AI will develop in a similar way.
And that makes sense to me, right?
It's like the premise of open source software.
So there's like this business strategy.
I heard about it from Joel Spolsky, and it was called commoditize your complement.
Right?
And so it's like, if you sell a product, and along with that product,
something else is used, if you can actually decrease the cost of that thing, it makes your
thing more valuable.
Right.
Like, if electricity is super cheap, electric cars are more valuable, right?
So if you're an electric car company, if there was some magic trick to make electricity
cheaper, it would help the value of your car.
And so, Meta's like, we're not in the business of LLMs, but we're going to need them, right?
We're going to need them to, like, judge if somebody's spamming comments or whatever, right?
We don't want to pay these exorbitant fees to OpenAI or whatever.
We'll just build our own.
And because it's not our business, we'll just give it away.
And it also allows them, you know, probably to give the finger to these other companies, right?
Yeah, subvert them.
Subvert them in a way, right?
So that's what they did, right?
So that erodes like another thing, right?
Now the base thing that's very expensive, you can just get it.
I mean, maybe it's not as good, but it does exist, right?
It creates an atmosphere of competition.
Exactly, right?
except he's not doing it.
It's not necessarily for charitable reasons.
And then it's helpful, like, from his perspective,
if we make the one that we give away for free
and everybody else builds on it,
we benefit from all of those things, right?
Yeah.
If you build some ORM internally,
that's like a crazy Don creation,
you have to maintain it and whatever.
But if you build one and then release it
and the industry starts using it,
they make improvements, and then you can pull those in, right?
Oh, there's an interest in increasing their own business.
Yeah, I forget where I am.
So yeah, what were the original text that I sent you?
Do we, have we answered any?
Yeah, I think we have.
We've covered it because, like, R1 is on GitHub, Llama's on Hugging Face,
and what's this $850 billion for?
R1 was the model that DeepSeek came up with.
So it is, it has the more efficient algorithm because they were constrained by
hardware restrictions.
So they came up with like a better, a better way of doing it.
that wasn't locked in to like, you know,
Nvidia's model.
And Llama was the,
was the Facebook base model that had all of the internet included in it.
So you didn't have to go through all the work of combing the whole internet.
There's no,
there's no secret sauce.
There's no special sauce.
Everything's open.
You can,
you can get a model.
It's all,
it's all out there for people to develop.
There's no,
there's no reason why somebody couldn't make their own product.
So then I think,
I think the only thing we haven't answered is, like, the very first part, right? Which is, like, Greg
Brockman being like, oh yeah, I think of Spud as a new base model, like a new, a new pre-train.
It sounds like he's saying we're two years, we're two years ahead of everybody else. So Orion, we
talked about that, was like their $500 million and then a billion-dollar run. Yeah, it just
didn't amount to anything. It became, it became 4.5, but then they, they pulled the plug on it. So Spud is one of
their new models, and so Brockman, in that video I shared
was like from a couple days ago talking about Spud, their internal model,
and then they released it.
So it became GPT 5.5, right?
Six days ago, they released GPT 5.5.
It's a new class of intelligence for real work.
Okay, well, like, every time they release a new model, they say revolutionary.
Of course, yeah.
But there's some interesting things going on here, right?
And I think his quote is the only thing we haven't unpacked.
It's a new base.
It's a new pre-train.
It's a new pre-train, right?
But like, so what is it?
Because, like, we discussed this problem wherein they tried making them bigger, but there was no good data.
They made one 10 times the size and it wasn't better.
There were diminishing returns.
And now they have another big one.
So what is their secret?
Yeah.
Well, wait a couple weeks.
It'll get leaked or you'll figure it out.
Because there's this other problem that happens, right?
And so there's this process called distillation.
If I am building a new model, I have this 4.5, let's say, that was super
huge and very expensive to run and was like a little bit better. Well, I can chat with it. And
similar to this reinforcement learning process, I can take that chat log and I can take a smaller
model that's not giant and I can train it on that. Right. And so it's learning from the bigger
model. It's like it's teaching a small model that I can run at lower cost with the great answers
that the bigger model had. And like, it can't learn it all, because it's just, it's not as big,
but it will get a lot closer. It's like a senior teaching a junior, right? It's like
I went through some
the last 20 years
I've seen some stuff
Yeah
You don't have to do all that
You don't have to make all my mistakes
Just do this
Yeah and so they
They call this like distillation
And if it's like my son
He won't listen to me
He'll make the mistakes anyway
And then he'll be like
Oh yeah it turns out you were right
That's the learning process
But like the big labs do this
themselves right
There is like
GPT-4.4,
and there's, like, GPT-4.4
mini.
And mini costs less
and it's still really smart,
but it's much smaller and faster
because it's like they took this big model
and they distilled it,
they got all these important lessons from it
and gave it to the small one.
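Distillation as described, a small student trained on a big teacher's outputs, can be sketched in a toy numeric form. Here the "teacher" is just a fixed linear map and the "student" is fit by least squares on the teacher's answers; real distillation matches token probabilities over chat logs, but the shape of the idea is the same:

```python
import numpy as np

rng = np.random.default_rng(42)

# "Teacher": the big expensive model. Here it's just a fixed linear map.
teacher_w = np.array([2.0, -1.0, 0.5])
def teacher(x):
    return x @ teacher_w

# Collect "chat logs": prompts plus the teacher's answers. The transcript's
# point is that the logs alone are enough to train a rival model.
prompts = rng.normal(size=(200, 3))
answers = teacher(prompts)

# "Student": fit a cheaper model to reproduce the teacher's answers.
student_w, *_ = np.linalg.lstsq(prompts, answers, rcond=None)

# The student now answers almost exactly like the teacher on new prompts,
# even though it never saw the teacher's weights, only its outputs.
test = rng.normal(size=(20, 3))
assert np.allclose(test @ student_w, teacher(test), atol=1e-6)
```

That last assertion is the whole moat problem in miniature: the product (the answers) is all a competitor needs.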
Well, Open AI and Anthropic
made all these allegations
against DeepSeek and other open-weight companies
that this is what they're doing.
Because whenever there's like a new super smart model out,
six months later,
there's an open weight model that seems just this smart.
The only thing that you need
to do this distillation, really,
is the logs of chats from these smart models,
which is in fact their product.
If you want to make a smarter model and I have one,
just like have lots of chats with my one
and then train your model on it.
So it's even more like competitive pressure, right?
If I come up with a really smart model,
by definition, you can use it to make your smarter, right?
So this is another moat problem, I guess, right?
And if you look at OpenAI,
they have these thinking models,
but it won't actually tell you what the thinking is.
Well, they don't want people to know.
how it's thinking, because that's their trade secret. That's their secret sauce. And so even with
that black box, they're still alleging, I don't think they're lying, that these Chinese labs and
other open-weight companies are just using their service and using that to train their models. Because,
like, why not? Then you can use their models and you can see exactly how it's reasoning.
Yeah. But so every time they come up with a super smart model, like a couple months later,
probably there'll be a new smart model from the open-weight companies because it's easy to extract
things, right? It's like the problem of MP3s. Like, it's easy to copy this information. Not as easy
as an MP3, but like, sort of easy. The moat is leaking. So if we go back to the 880 or 850,
like, what can be, where's that value hiding, right? Well, I think a lot of investors are asking
this as well. But, like, I'm not so negative on OpenAI. I think they could be a very valuable
company, but where is that value? Well, what's a good way to try and get some money? They're not
surviving on subscriptions, but they do have subscriptions. I mean, I'm a subscriber. Yeah.
But I think that overall the money that they get from subscriptions isn't at the levels that they would need in order to call it a success.
So ads, man.
That's the worst.
I don't want that.
That's, I think they're talking about it.
Like, that's how they have to monetize, right?
A lot of people are using their product, even though there are alternatives.
Brand loyalty.
Yeah, brand, whatever.
But it just means that you're going to have to meet them where they are.
I don't know.
Like, businesses are using.
I think if you look at it frozen in a moment, right?
Like the amount of spend that my company is spending.
So it's Anthropic, but it could easily be OpenAI.
Just for like coding subscriptions, right?
It's massive.
Like, the amount of money we're pouring into this company is huge.
So I think at any given moment, like, that's real, right?
They're making a profit off that.
But the challenge is, yeah, that it can quickly diminish and there's competitors.
If there's an open source alternative as well, right?
Like, why would your business keep paying a subscription for hundreds and hundreds of
dollars in credits when they could just, you know, use this open source alternative that's maybe even
locally hosted.
Yeah.
Or just like there's a company that hosts it, you know, at cost because they didn't have to
build it, right?
They just host the open source model.
Lots of those exist.
So it's a Red Queen's race, right?
There's, have you ever heard this term before?
No.
I love this term.
So it's like through the looking glass from Alice in Wonderland and the Red Queen has a race
and they're running, right?
And Alice says like, why aren't we moving?
Like they're running and it's like they're on treadmills and they're not going anywhere.
And the Red Queen says like, oh here, like you have to run as fast as you can just to stay in the same place.
If you slow down, you go backwards.
But as fast as you run, you stay in the same place.
No one can ever win.
Yeah.
As soon as you have an advantage, keeping that advantage requires working just as hard as you did before.
It's a great metaphor for this process, right?
It's like, okay, every year or so, Open AI is coming up with a new breakthrough.
that lets them push the frontier, or Anthropic does.
So the open-weight models right now are all,
just like, let's say, six months behind.
Maybe, I don't know about this new release,
but previous to Brockman saying we're two years ahead,
the gist is kind of, the open-weight models are six months behind.
But in that six months, like, the models got so much better
that everybody's paying for the premium service.
Yeah, well, it's like movie theaters, right?
You can go see it in the theater.
It's very expensive.
You get to see it first. There's a lot of people who just wait. If you want to be six months
behind, you can use a cheap model and it's fine. Right now the curve is so high that it's,
no, you've got to get on the new thing. Everybody feels that way. If that curve
flattened out, it's over, right? If that curve keeps going up, though,
oh my God, who knows where we're going to end up? It's the Red Queen's race. They,
like, all these frontier labs, if that's their product, they're running as fast as they can
to stay, right? Like right now, ChatGPT has my $20 a month, and Anthropic has my $100 a month for
the coding agent, but if something better comes out than those or just everybody else catches up
and has a cheaper service that's sold at cost, that money's gone. Anthropic has to go as hard as they can
just to keep my money, because I'll just switch. There's this line from Stewart Brand. People
always remember it's like, information wants to be free. But his actual quote was,
information wants to be free, but information also wants to be expensive. Some information is just so
valuable, but at the same time, it's free to share it. And like these companies are in this place
where they have something that's so amazing, this amazing breakthrough with these AIs that are so
valuable. But yet, it's depreciating like nothing, like a peach on a summer day, because
everybody's catching up. And so he comes up with this new thing, right? Brockman's two years of
research is coming to fruition. That's not modesty, right? He's trying to tell people
hey, actually, I think we're more than six months ahead. But the news came out. So SemiAnalysis, who we talked about before,
they said this is the first new scale-up in pre-training since GPT-4.5. A bigger model. So they're back on the curve, right? So this curve that they would follow, this curve up so far,
and then they could never get past it, now they're claiming we're up, we got past this wall. Ilya said there's nowhere else to go up here,
but we, we found a spot, right? Ilya left the company. We're here, we found a way, right? There's the
two founders, one the engineering guy and one the researcher, and the researcher left and said,
like, the fossil fuels have been exhausted, but they're saying, like, hey man, no, actually, it's still
going.
We found something.
What are their fossil fuels?
And one interesting thing is, usually these GPT models have been getting cheaper and cheaper
over time.
This 5.5 that they just released costs four times as much per token per conversation.
Okay.
So it's obviously going to be more of a moneymaker then.
Well, maybe, but 4.5 was also super expensive and then they pulled it. And the reason is, these things get bigger. Like, it just becomes more expensive to host. They're like, no, man, this costs more. Like, this is a big one, right? This is a chunky model, the biggest we've ever shipped. But will the results be, like, compelling enough for somebody like yourself?
Is it worth it, right? Four times more is a lot. So I don't know. And like, whatever, the podcast will go out. Somebody's listening to this and it's a year later. And we'll know. And we'll know.
But like, it doesn't matter because this is the next one and the next one and the next one.
The race keeps going.
I could not get a subscription and use this open weight model from a number of providers,
or I could just pile up some GPUs here and run it or whatever.
I have friends who will say, like, hey, this is all a trick.
The Open AI gets us addicted to coding using these coding agents.
Then they'll jack up the price and everybody forgets how coding works.
That's what I was worried about.
It's a risk for your company and you want to limit the exposure
that you have to that risk. If they start relying on it, then the risk is that Anthropic could
jack the price by four times. Yeah, but I'm saying, hopefully, why there's less of a concern.
If you can jump to a free model. Because the free models are always just a little bit behind.
These companies are actually fighting tooth and nail with each other. If both Anthropic and
open AI collapse, we'll just lose the latest six months because everybody's racing to keep up.
These things are not going away. You can torrent that Llama version. Like, it's not at the lead
anymore, but people use that Llama as the base model. They add all their training on. They do the
distilling that Anthropic and whoever is mad about. And like if all of these companies explode, we still
end up with just like the open weight models of six months ago. And there's a bunch of companies
that host these and you can use like open router.
Yeah, the cat's already out of the bag, right?
Like, cats out of the bag, this isn't going away.
Okay, we got to wrap this though.
Okay, so let's go through it again.
So what were the original quotes?
There was like where the open AI president says,
I think of Spud as a new base, as a new pre-train.
And then there was like the Google memo that was like,
guys, we don't have any moat and nobody does.
So what's your feeling?
True, false?
The memo leaked well before
the announcement of that new base model, right?
So it could have been true at that point, but maybe it's not.
Maybe the moat is now in the new base model.
See, I feel like it's still true because Brockman may think that they have a moat,
but he's saying we have a moat that's two years.
And, like, that's not very long.
He's like, we thought we had six months last time.
Now we have two years.
That's, we need to reinvent ourselves in the next two years.
Yeah, it seems like the underlying premise of them having something over other companies
is temporary.
It's still,
so it can't be something
that won't eventually be discovered.
Yeah, so what do you think?
We have no moat and neither does OpenAI.
True or false?
Um,
I feel like it's false.
Oh,
I feel like it's true,
but yeah,
interesting.
I mean,
I guess it depends on the timeline.
Like,
they have one,
but like,
it's a temporary one.
And like,
they maybe have one for now,
right?
But when that one gets bridged,
they will be stuck
trying to dig out another one.
They do have one,
but it's not,
it's not permanent.
Yeah.
Okay.
And then the first quote,
I think we understand. So he says, I think of Spud as a new base, a new pre-train,
and it's two years worth of research coming to fruition. So the two years of research doesn't
mean that they have two years before somebody figures it out. Like you could spend a lot of time
on the research and development and then release it and somebody like copies it. And then what's
the third one? So the the quote is, if all this stuff is already built, why are you paying
$850 billion? What are you buying with that?
Yeah, I think we actually agree on what the answer is.
The answer is people are betting on this horse.
They're saying, like, we know this one moat is only going to last so long,
but we think this company will build the next moat, right?
They'll keep the treadmill going.
Let's keep the treadmill going.
Everything's like on open source.
Why are we spending money buying $850 billion of something I can fork today?
But you're not buying that.
You're buying the process.
But I think we understand it all.
I think we got through it.
What do you think?
I think we figured it out.
Thank you.
