Latent Space: The AI Engineer Podcast - Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI
Episode Date: June 19, 2025
Solving Poker and Diplomacy, Debating RL+Reasoning with Ilya, what’s *wrong* with the System 1/2 analogy, and where Test-Time Compute hits a wall
Timestamps
00:00 Intro – Diplomacy..., Cicero & World Championship
02:00 Reverse Centaur: How AI Improved Noam’s Human Play
05:00 Turing Test Failures in Chat: Hallucinations & Steerability
07:30 Reasoning Models & Fast vs. Slow Thinking Paradigm
11:00 System 1 vs. System 2 in Visual Tasks (GeoGuessr, Tic-Tac-Toe)
14:00 The Deep Research Existence Proof for Unverifiable Domains
17:30 Harnesses, Tool Use, and Fragility in AI Agents
21:00 The Case Against Over-Reliance on Scaffolds and Routers
24:00 Reinforcement Fine-Tuning and Long-Term Model Adaptability
28:00 Ilya’s Bet on Reasoning and the O-Series Breakthrough
34:00 Noam’s Dev Stack: Codex, Windsurf & AGI Moments
38:00 Building Better AI Developers: Memory, Reuse, and PR Reviews
41:00 Multi-Agent Intelligence and the “AI Civilization” Hypothesis
44:30 Implicit World Models and Theory of Mind Through Scaling
48:00 Why Self-Play Breaks Down Beyond Go and Chess
54:00 Designing Better Benchmarks for Fuzzy Tasks
57:30 The Real Limits of Test-Time Compute: Cost vs. Time
1:00:30 Data Efficiency Gaps Between Humans and LLMs
1:03:00 Training Pipeline: Pretraining, Midtraining, Posttraining
1:05:00 Games as Research Proving Grounds: Poker, MTG, Stratego
1:10:00 Closing Thoughts – Five-Year View and Open Research Directions
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Hey, everyone. Welcome to the Latent Space podcast.
This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host swyx, founder of Smol AI.
Hello, hello, and we're here recording on a holiday Monday with Noam Brown from OpenAI.
Welcome.
Thank you.
So glad to have you finally join us.
A lot of people have heard you.
You've been rather generous with your time on podcasts, like Lex Fridman's, and you've done a TED Talk
recently, just talking about the thinking paradigm.
But I think perhaps your most interesting recent achievement is winning the World Diplomacy Championship.
Yeah.
In 2022, you built like sort of Cicero, which was top 10% of human players.
I guess my opening question is, how has your Diplomacy playing changed since working on Cicero
and then now personally playing it?
When you work on these games, you kind of have to understand the game well enough to
be able to debug your bot.
Because if the bot does something that's really radical, that humans typically wouldn't do, you're not sure
if that's a mistake, like a bug in the system, or if it's actually just
the bot being brilliant. When we were working on Diplomacy, I kind of did this deep dive,
like trying to understand the game better. I played in tournaments. I like watched a lot of like
tutorial videos and commentary videos on games. And over that process, I got better. And then also
seeing the bot, like the way it would behave in these games. Like sometimes it would do things that
humans typically wouldn't do. And that taught me about the game as well. When we released
Cicero, we announced it in late 2022.
I still found the game really fascinating.
And so I kept up with it.
I continued to play.
And that led to me winning the World Championship in 2025, so just a couple
months ago.
There's always a question of, like, centaur systems, where humans and machines work together.
Like was there an equivalent of what happened in Go where you updated your play style?
If you're asking if I used Cicero when I played in the tournament, the answer is no.
Seeing the way the bot played and like taking inspiration from that,
I think did help me in the tournament, yeah.
Yeah.
Do people now ask Turing questions every single time when they're playing Diplomacy?
Ask?
To try to tell if the person they're playing with is a bot or a human.
Yeah, like that's the one thing you were worried about when you started.
It was really interesting when we were working on Cicero because like, you know,
we didn't have the best language models.
We were really bottlenecked on the quality of the language models.
And sometimes the bot would do, would say like bizarre things.
Like, you know, 99% of time it was fine.
But then, like, every once in a while, it would say this, like, really bizarre thing.
Like, it would just hallucinate about something.
Somebody would reference something that they said earlier in a conversation with the bot.
And the bot would be like, I have no idea what you're talking about.
I never said that.
And then the person would be like, look, you could just scroll up in the chat.
It's literally right there.
And the bot would be like, no, you're lying.
And when it does these kinds of things, like, people just kind of, like, shrugged it off as like,
oh, that's just, you know, the person's tired or they're drunk or whatever, or they're just, like,
trolling me.
But I think that's because people weren't looking for a bot.
They weren't expecting a bot to be in the games.
We were actually really scared because we were afraid that people would figure out at one point that there's a bot in these games.
And then they would just always be on the lookout for it.
And if you're looking for it, you're able to spot it, that's the thing.
So I think now that it's announced and that people know to look for it, I think they would have an easier time spotting it.
Now, that said, the language models have also gotten a lot better since 2022.
It's adversarial.
Yeah.
So at this point, you know, the truth is, GPT-4o and o3, these models are passing the Turing test. So I don't think they can really ask that many Turing-complete questions that would actually make a difference.
And Cicero was very small, like 2.7B, right?
It was a very small language model, yeah. It was one of the things that we realized over the course of the project: oh, yeah, you really benefit a lot from just having larger language models.
Right.
Yeah. How do you think about today's perception of AI and a lot of, like, maybe the safety discourse of, you know, you're going to build a bot that is really good at persuading people into, like, helping them win a game? And I think maybe today labs want to say they don't work on that type of problem. How do you think about that dichotomy, so to speak,
between the two? You know, honestly, like after we released Cicero, a lot of the AI safety community
was really happy with the research and like the way it worked because it was a very controllable
system. Like we conditioned Cicero on certain concrete actions. And that gave it a lot of
steerability to say like, okay, well, it's going to pursue a behavior
that we can very clearly interpret and very clearly define.
It's not just like, oh, it's a language model like running loose and doing whatever it feels
like, no, it's actually like pretty steerable.
And there's this whole reasoning system that steers the way the language model interacts
with the human.
Actually, a lot of researchers reached out to me and said, like, we think this is like potentially
a really good way to achieve safety with these systems.
I guess the last Diplomacy-related question that we might have is, have you updated or
tested, like, o-series models on Diplomacy,
and would you expect a lot more difference?
I have not.
I think I said this on Twitter at one point.
I think this would be a great benchmark.
I would love to see all the leading bots play a game of diplomacy with each other and see who does best.
And I think a couple people have taken inspiration from that and are actually building out these benchmarks and like evalling the models.
My understanding is that they don't do very well right now.
But I think it really is a fascinating benchmark.
And I think it would be, yeah, I think it would be a really cool thing to try out.
Well, we're going to go a little bit into the o-series now.
I think the last time you did a lot of publicity, you were just launching o1, you did your TED Talk and everything.
How have the vibes changed just in general? You said you were very excited to learn from domain experts, like in chemistry, about how they review the o-series models.
Like how have you updated since, let's say, end of last year?
I think the trajectory was pretty clear, pretty early on in the development cycle.
and I think that everything that's unfolded since then has been pretty on track for what I expected.
So I wouldn't say that my perception of where things are going has honestly changed that much.
I think we're going to continue to see, as I said before, this paradigm progress rapidly.
And I think that that's true even today.
We saw that with, like, going from o1-preview to o1 to o3: consistent progress.
And we're going to continue to see that going forward.
and I think that we're going to see a broadening of what these models can do as well.
You know, like we're going to start seeing agentic behavior.
We're already starting to see agentic behavior.
Like, honestly, for me, o3, I've been using it a ton in my day-to-day life.
I just find it so useful, especially the fact that it can now browse the web and, like, you know, do meaningful research on my behalf.
Like, it's kind of like a mini deep research where you can just get a response in three minutes.
So, yeah, I think it's just going to continue to become more and more useful and more powerful as time goes on and pretty quickly.
Yeah.
And talking about deep research, you tweeted about if you need proof that we can do this in non-verifiable domains, deep research is kind of like a great example.
Can you maybe talk about if there's something that people are missing?
I feel like I hear that repeated a lot.
It's like, you know, it's easy to do in coding and math, but not in these other domains.
I frequently get this question, including from pretty established AI researchers: okay, we're seeing these reasoning models excel in math and coding and these easily verifiable domains,
but are they ever going to succeed in domains where success is less well defined?
I'm surprised that this is such a common perception because we've released deep research
and people can try it out. People do use it. It's very popular. And that is very clearly a domain
where you don't have an easily verifiable metric for success. It's very like what is the best
research report that you could generate? And yet these models are doing extremely well at this
domain. So I think that's like an existence proof that these models can succeed
in tasks that don't have as easily verifiable rewards.
Is it because there's also not necessarily like a wrong answer?
Like there's a spectrum of deep research quality, right?
You can have like a report that looks good, but the information is kind of so-so,
and then you have a great report.
Do you think people have a hard time understanding the difference when they get the result?
My impression is that people do understand the difference when they get a result,
and I think that they're surprised at how good the deep research results are.
Certainly, it's not 100%.
It could be better,
and we're going to make it better.
But I think people can tell the difference
between a good report and a bad report,
and really,
between a good report and a mediocre report.
And that's enough to kind of feed the loop later
to build the product and improve the model performance.
I mean, I think if you're in a situation
where people can't tell the difference between the outputs,
then it doesn't really matter if you're like, you know,
hill climbing on progress.
These models are going to get better at domains
where there is a measure of success.
Now, I think this idea that it has to be like easily verifiable
or something like that,
I don't think that's true. I think that you can have these models do well, even in domains where
success is a very difficult to define thing. It could sometimes even be subjective.
Something people lean on a lot, and you've done it as well, is the Thinking, Fast and Slow analogy for thinking models. And I think
it's reasonably well diffused now, the idea that this is kind of the next scaling paradigm.
All analogies are imperfect. What is one way in which Thinking, Fast and Slow, or System 1
and System 2, kind of doesn't transfer to how we actually scale these things?
One thing that I think is underappreciated is that the models, the pre-trained models,
need a certain level of capability in order to really benefit from this, like, extra thinking.
This is kind of why you've seen the reasoning paradigm emerge around the time that it did.
I think it could have happened earlier, but if you tried to do the reasoning paradigm on top of GPT-2,
I don't think it would have gotten you almost anything.
Is this emergence?
Hard to say if it's emergence necessarily. I haven't done the, you know, the measurements to really define that clearly. But I think it's pretty clear. You know, people tried chain of thought with, like, really small GPT models, and they saw that it just didn't really do anything. Then you go to bigger models and it starts to give a lift. I think there's a lot of debate about the extent to which this kind of behavior is emergent. But clearly there is a difference. So it's not like there are these two independent paradigms. I think that they are related in the sense
that you need a certain level of System 1 capability in your models in order to have System 2.
Is it able to benefit from System 2?
Yeah.
I have tried to play amateur neuroscientist before and tried to compare it to the evolution of the brain
and how you have to evolve the cortex first before you evolve the other parts of the brain.
And perhaps that is what we're doing here.
Yeah.
And you could argue that actually this is not that different from like I guess the System 1, System 2 paradigm.
Because, you know, if you ask, like, a pigeon to think really hard about playing chess, you know, it's not going to get that far. It doesn't matter if it thinks for a thousand years. It's not going to be able to be better at playing chess. So maybe it's also the case with animals and humans that you need a certain level of intellectual ability, just in terms of System 1, in order to benefit from System 2 as well.
Yeah. Just as a side tangent, does this also apply to visual reasoning? So, let's say now we have GPT-4o, the natively omni-modal type of thing. Then that also makes o3 really good at GeoGuessr. Does that apply to other modalities too?
I think the evidence is yes. It depends on exactly the kinds of questions that you're asking. Like, there are some questions that I think don't really benefit from System 2. I think GeoGuessr is certainly one where you do benefit. I think image recognition, if I had to guess, is one of those things where you probably benefit less from System 2 thinking.
Because you know it or you don't.
Yeah, exactly. There's no way.
Yeah. And the thing I typically point to is just, like, information retrieval. If somebody asks you,
like when was this person born and you don't have access to the web, then you either know it or you
don't and you can sit there and you can think about it for a long time. Maybe you can make an educated
guess, so you can say, like, well, this person probably lived around this time, and so
this is, like, a rough date. But you're not going to be able to get the date unless you
actually just know it.
But spatial reasoning, like tic-tac-toe, might be better because you have all the information there.
Yeah. And I think it's true that with tic-tac-toe we see that
GPT-4.5 falls over. You know, it plays decently well. I shouldn't say it
falls over. It does reasonably well. It can draw the board. It can make legal moves, but it will
make mistakes sometimes. And you really need that System 2 to enable it to play perfectly.
Now, it's possible that if you got to GPT-6 and you just did System 1, it would also play perfectly.
You know, I guess we'll know one day. But I think right now you need System 2 to really
do well.
What do you think of as the things that you need in System 1? So obviously, a general
understanding of the game rules. Do you also need to understand some sort of
metagame? Like, you know, usually this is how you value pieces in different games.
How do you generalize in System 1 so that then in System 2
you can kind of get to the gameplay, so to speak?
I think the more that you have in System 1, the better. It's the same thing with humans,
you know. When humans are playing a game like chess for the first time,
they can apply a lot of System 2 thinking to it.
And if you apply a ton of System 2 thinking to it, like if you just present a really smart
person with a completely novel game and you tell them, okay, you're
going to play this game against an AI or a human that's mastered this game, and you
tell them to sit there and think about it for, like, three weeks, about how to play this
game, my guess is they could actually do pretty well. But it certainly helps to build up that
System 1 thinking, to build up intuition about the game, because it will just make you so much,
yeah, so much faster.
I think the Pokemon example is a good one, where System 1 kind of
has maybe all this information about games.
And then once you put it in the game,
it still needs a lot of harnesses to work.
And I'm trying to figure out how much of,
can we take things from the harness and have them in system one?
So that then System 2 is as harness-free as possible.
But I guess that's like the question about generalizing games and AI.
Yeah, I guess I view that as a different question.
I think the question about like harnesses,
in my view is that the ideal harness is no harness.
Right.
I think harnesses are like a crutch that eventually
we're going to be able to move beyond.
So only tool calls?
And you could ask, you know, you could just ask o3.
And actually, you know, it's interesting.
When this playing-Pokemon thing kind of emerged as this, you know, benchmark,
I was actually pretty opposed to evaling this with our OpenAI models, because my feeling is, okay, if we're going to do this eval, let's just do it with o3, you know?
How far does o3 get without any harness?
How far does it get playing Pokemon?
And the answer is like not very far, you know?
And that's fine.
I think it's fine to have an eval where the models do terribly.
And I don't think the answer to that should be, well, let's build a really good harness
so that now it can do well on this eval.
I think the answer is, okay, well, let's just improve the capabilities of our models
so they can do well at everything.
And then they also happen to make progress on this eval.
Would you consider things like checking for a valid move a harness, or is this in the model?
You know, like chess:
you can either have the model learn in System 1 what moves are valid
and what it can and cannot do, versus in System 2 figuring out which are not.
I think a lot of this is design questions.
Like for me, I think you should give the model the ability to check if a move is legal if you want.
Like that could be an option in the environment of like, okay, here's a, you know, an action that you can,
like a tool call that you can see if an action is legal.
If it wants to use that, it can.
And then there's a design question of, like, well, what do you do if the model makes an illegal move?
And I think it's totally reasonable to say, like, well, if they make an illegal move,
then they lose the game.
I don't know.
What happens when a human makes an illegal move in a game of chess?
I actually don't know.
I don't know.
I mean,
they're just not allowed to?
Yeah.
Like, do you just lose the game?
I don't know.
So if that's the case,
then I think it's totally reasonable to say, like, yeah, we're going to have an eval
where that's also the criterion for the AI models.
Yeah, but I think, like, maybe one way to interpret that in sort of researcher terms is,
are you allowed to do search?
And one of the famous findings from DeepSeek is that MCTS wasn't that useful to them.
But I think, like, there are a lot of engineers trying out search and spending a lot
of tokens doing that, and maybe it's not worth it.
Well, I'm making a distinction here: a tool call to check whether a move is legal or illegal is different from actually making that
move and then seeing whether it ended up being legal or illegal, right? So if that tool call is
available, I think it's totally fine to make that tool call and check whether a move is legal or
illegal. I think it's different to have the model say, oh, I'm making this move. And then, you know,
it gets feedback that like, oh, you made an illegal move. And so then it's like, oh, just kidding.
I'm going to do something else now.
So that's the distinction I'm drawing.
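To make that distinction concrete, here is a minimal sketch of an eval environment, using tic-tac-toe rather than chess to keep it short. Everything in it is an assumption for illustration: the class name `TicTacToeEnv`, the method names `check_legal` and `play`, and the rule that an illegal committed move forfeits the game. Win detection is deliberately left out.

```python
class TicTacToeEnv:
    """Toy environment: a side-effect-free legality check plus irreversible moves."""

    def __init__(self):
        self.board = [' '] * 9      # squares 0..8, row by row
        self.to_move = 'X'
        self.forfeited_by = None    # set to 'X' or 'O' after an illegal committed move

    def check_legal(self, move):
        """Tool call: report whether `move` is legal. Never changes the game state."""
        return (self.forfeited_by is None
                and isinstance(move, int)
                and 0 <= move < 9
                and self.board[move] == ' ')

    def play(self, move):
        """Commit a move. An illegal move cannot be taken back: the mover forfeits."""
        if self.forfeited_by is not None:
            raise RuntimeError("game is already over")
        if not self.check_legal(move):
            self.forfeited_by = self.to_move
            return "forfeit"
        self.board[move] = self.to_move
        self.to_move = 'O' if self.to_move == 'X' else 'X'
        return "ok"
```

A player that wants to be careful can call `check_legal` as often as it likes before calling `play`; what the environment does not offer is a way to call `play`, see "forfeit", and then try something else.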
Some people have tried to classify that second type, playing things out, as test-time compute.
You would not classify that as test-time compute?
There are a lot of reasons why you would not want to rely on that paradigm. Imagine
you have a robot, you know, and your robot takes some action in the world and it breaks something.
You can't say, oh, just kidding.
I didn't mean to do that.
I'm going to undo that action.
Like, the thing is broken.
So if you want to simulate what would happen if I move the robot in this
way, and then in the simulation you saw that this thing broke and then you decided not to do that
action, that's totally fine. But you can't just undo actions that you've taken in the world.
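The simulate-then-act idea can be sketched in a few lines. This is a toy under stated assumptions: the world is a plain dictionary, and `apply` and `choose_safe_action` are hypothetical names with made-up dynamics, not any real robotics API.

```python
import copy


def apply(world, action):
    """Hypothetical dynamics: pushing the vase breaks it, raising the arm is harmless."""
    if action == "push_vase":
        world["vase_broken"] = True
    elif action == "move_arm_up":
        world["arm_height"] += 1
    return world


def choose_safe_action(world, candidates):
    """Try each candidate on a copy of the world; return the first that breaks nothing."""
    for action in candidates:
        simulated = apply(copy.deepcopy(world), action)  # the real world is untouched
        if not simulated["vase_broken"]:
            return action
    return None  # no candidate was safe in simulation
```

Only the action returned by `choose_safe_action` would then be applied to the real `world`, and at that point there is no undo.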
There's a couple more things I wanted to cover in this rough area. I actually had an answer on the
thinking fast and slow side, which maybe I'm curious what you think about. Like a lot of people
are trying to put in effectively model router layers, let's say between like the fast response model
and the long thinking model. Anthropic is explicitly doing that. And I think there's a question about
whether you need a smart judge to route, or a dumb judge to route because it's
fast. So when you have a model router, let's say you're passing requests between the System 1 side
and the System 2 side, does the router need to be as smart as the smart model, or dumb to be fast?
I think it's possible for a dumb model to recognize that a problem is really hard and that it won't
be able to solve it and then route it to a more capable model. But it's also possible for a dumb model
to be fooled or to be overconfident?
I don't know. I think there's a real trade-off there.
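The trade-off can be seen in a toy router. This is a sketch under loud assumptions: `Router`, `length_judge`, and the two model callables are hypothetical stand-ins, and no real provider API is involved.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Routes a prompt to a fast or a slow model based on a cheap difficulty score."""
    estimate_difficulty: Callable[[str], float]  # cheap judge, returns a score in [0, 1]
    fast_model: Callable[[str], str]
    slow_model: Callable[[str], str]
    threshold: float = 0.5

    def route(self, prompt: str) -> str:
        """Send prompts the judge rates at or above the threshold to the slow model."""
        if self.estimate_difficulty(prompt) >= self.threshold:
            return self.slow_model(prompt)
        return self.fast_model(prompt)


def length_judge(prompt: str) -> float:
    """Toy judge: treats longer prompts as harder. It is easy to fool, which is the point."""
    return min(len(prompt.split()) / 20, 1.0)
```

With this judge, a short but very hard prompt such as "Prove the Riemann hypothesis" is scored as easy and goes to the fast model, which is the overconfident-dumb-judge failure mode. Making the judge smarter fixes that at the cost of the speed that motivated the router in the first place.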
But I will say, like, I think there are a lot of things that people are building right now
that will eventually be washed away by scale.
So I think harnesses are a good example where I think eventually the models are going to be,
and I think this actually happened with the reasoning models.
Like, before the reasoning models emerged, there was all of this work that went into
engineering these agentic systems that made a lot of calls to GPT-4o or, like,
these non-reasoning models to get reasoning behavior. And then it turns out like, oh, we just
created reasoning models. And you don't need this complex behavior. In fact, in many ways it makes
it worse. Like, you just give the reasoning model the same question without any sort of scaffolding,
and it just does it. Now, you can still, and so people are building scaffolding on top of the
reasoning models right now, but I think in many ways, like those scaffolds will also just be replaced
by the reasoning models and models in general becoming more capable. And similarly, I think
things like these model routers, you know, we've said pretty openly that we want to move to a world where there is a single unified model. And in that world, you shouldn't need a router on top of the model. So I think that the router issue will eventually be solved also.
Like you're building the router into the model weights itself, kind of.
I don't think there will be a benefit for... I shouldn't say that, because I could be wrong about this. You know, certainly maybe there are reasons to route to different model providers or whatever. But I think that routers are going to eventually go away. And I can understand why it's worth doing it in the short term, because the fact is it is beneficial right now, and if you're building a product and you're getting a lift from it, then it's worth doing right now.
One of the tricky things I'd imagine that a lot of developers
are facing is that you kind of have to
plan for where these models are going to be
in six months and 12 months and that's like very
hard to do because things are progressing very quickly. You know, you don't want to spend six months
building something and then just have it be totally washed away by scale. But I think I would
encourage developers, like when they're, you know, building these kinds of things like scaffolds and
routers, keep in mind that the field is evolving very rapidly. You know, things are going to change
in three months, let alone six months. And that might require radically changing these things around
or tossing them out completely. So don't spend six months building something that might get tossed
out in six months.
It's so hard though. Everyone says this and then, like, no one has concrete suggestions on how.
What about reinforcement fine-tuning? Obviously you just released it a month ago at OpenAI. Is it something people should spend time on right now, or maybe wait until the next jump up the scale?
I think reinforcement fine-tuning is pretty cool, and I think it's worth looking into because it's really about specializing the models for the data that you have. I think that's something that's worth looking into for developers. We're not suddenly going to have that data baked into the raw model a lot of times. So I think that's kind of a separate question.
Yeah.
So creating the environment and the reward model is the best thing people can do right now.
I think the question that people have is, like, should I rush to fine-tune the model using RFT, or should I build the harness to then RFT the models as they get better?
I think the difference is that for reinforcement fine-tuning, you're collecting data that's going to be useful as the models improve as well.
So if we come out with future models that are even more capable, you can still fine-tune them on your data.
That's, I think, actually a good example where you're building something that's going to complement the model scaling and becoming more capable rather than necessarily getting washed away by the scale.
Yeah. One last question on Ilya. You mentioned on, I think, the Sarah and Elad podcast where you had this conversation with Ilya a few years ago about more RL and reasoning and language models. Just any speculation or thoughts on why his attempt, when he tried it, it didn't work or the timing wasn't right and why the time is right now.
I don't think I would frame it that way that, like, his attempt didn't work. In many ways, it did.
So Ilya, for me, I saw that in all of these domains that I'd worked on, on poker and Hanabi and diplomacy, having the models think before acting made a huge difference in performance.
Like orders of magnitude difference.
Like 10,000 times.
Yeah, like, you know, 1,000 to 100,000 times.
It's the equivalent of a model that's like 1,000 to 100,000 times bigger.
And in language models, you weren't really seeing that.
The models would just respond instantly.
Some people in the field, in the LLM field,
were convinced that, like,
okay, we just keep scaling pre-training.
We're going to get to superintelligence.
And I was kind of skeptical of that perspective.
In late 2021, I was having a meal with Ilya.
He asked me what my AGI timelines are,
a very standard SF question.
And I told him, like, look,
I think it's actually quite far away
because we're going to need to figure out
this reasoning paradigm in a very general way.
And with things like LLMs,
LLMs are very general,
but they don't have a reasoning paradigm that's very general.
And until they do,
they're going to be limited in what they can do.
We're going to scale these things up by a few more orders of magnitude.
They're going to become more capable.
But we're not going to see superintelligence from just that.
And like, yes, if we had a quadrillion dollars to train these models, then maybe we would.
But like, you're going to hit the limits of what's economically feasible before you get to superintelligence unless you have a reasoning paradigm.
And I was convinced incorrectly that the reasoning paradigm would take a long time to figure out because it's like this big unanswered research question.
And, you know, Ilya agreed with me.
And he said, like, yeah, you know, I think we need this, like, additional paradigm.
But his take was that, like, maybe it's not that hard.
I didn't know it at the time.
But, like, he and others at OpenAI had also been thinking about this.
They had also been thinking about RL.
They had been working on it.
And I think they had some success.
But, like, you know, with most research, you have to iterate on things.
You have to try out different ideas.
You have to, yeah, try different things.
And then also, as the models become more capable, as they become faster, it becomes easier
to iterate on experiments. And I think that the work that they did, even though it didn't
result in a reasoning paradigm, it all builds on top of previous work. So they built a lot of
things that over time led to this reasoning paradigm.
For listeners, Noam can't talk about this,
but the rumor is that that thing was codenamed GPT-Zero, if you want to search for that line of work.
I think there was a time where basically RL kind of went through a dark age, when everyone went all in on it and then nothing happened and they gave up, and now it's sort of the golden age again.
So that's what I'm trying to identify, like, why, what is it? And it could just be that we have smarter base models and better data.
I don't think it's just that we have smarter base models. I think it's
that... yeah, so we did end up getting a big success with reasoning, but I think it was
in many ways a gradual thing. To some extent it was gradual. You know, there were signs
of life, and then we, like, you know, iterated and tried out some more things. We got, like,
better signs of life. I think it was around November 2023 or October 2023 when
I was convinced that we had very conclusive signs of life, that, oh,
this is the paradigm, and it's going to be a big deal. That was, in many ways, a gradual thing.
I think what OpenAI did well is, when we got those signs of life, they recognized it
for what it was and invested heavily in scaling it up. And
I think that's what led to reasoning models arriving when they did.
Was there any disagreement internally, especially because, like, you know, OpenAI kind of
pioneered pre-training scaling, you know, and kind of like compute is all you need? And then
you're kind of saying maybe that's not how we get there. Was it clear to everybody that like,
okay, this is going to work or was it controversial?
There's always different opinions about this stuff. I think there were some people that
felt that pre-training was all we need, that we scaled it up to infinity and we were there.
I think a lot of the leadership actually at OpenAI recognized that there was another paradigm
that was needed. And that was why they were investing all this research effort into this like RL
stuff. And I think that's also to the credit of OpenAI that like, okay, yes, they figured out
the pre-training paradigm and they were very focused on scaling that up. In fact, the vast majority
of resources were focused on scaling that up. But they also recognized that something
else was going to be needed. And it was worth putting research effort into other
directions to figure out what that extra paradigm was going to be. There was a lot of debate about,
first of all, like, what is that extra paradigm? So I think a lot of the researchers looked at
reasoning and RL as not really about scaling test time compute. It was more about data
efficiency. Because, you know, the feeling was that, well, we have tons and tons of compute,
but we actually are more limited by data. So there's the data wall. And we're going to hit that
before we hit limits on the compute. So how do we make these algorithms more data efficient? They
are more data efficient. But I think that also, like, they are also just like the equivalent
of scaling up compute also by a ton. That was interesting. There's like a lot of debate around like,
okay, what exactly are we doing here? And then I think also, even when we got the signs of life,
I think there was a lot of debate about the significance of it. That like, okay, how much should we
invest in scaling up this paradigm? I think especially when you're in a small company, like, you know,
OpenAI, like in 2023, was not as big as it is today. And compute was more constrained than it is today.
And if you're investing resources in a direction, that's coming at the expense of something else.
And so if you look at these signs of life on reasoning and you're saying like, okay, well, this
looks promising, we're going to scale this up by a ton and invest a lot more resources into it,
where are those resources coming from?
You have to make that tough call about where to draw the resources from.
And that is a very controversial, very difficult call to make that makes some people unhappy.
And I think there was debate about whether we're focusing too much on this
paradigm, whether it's really a big deal, whether we would see it generalize and do various things.
I remember it was interesting that I talked to somebody who left OpenAI after we had
discovered the reasoning paradigm, but before we announced o1. And they ended up going to a competing
lab. I saw them afterwards, after we announced o1. And they told me that, like, at the time,
they really didn't think this reasoning thing, like, these o-series, the Strawberry models, were
like that big of a deal. It was like, they thought we were making a bigger deal of it than
it really deserved to be. And then when we announced o1 and they saw the reaction of their
co-workers at this competing lab about how everybody was like, oh, crap, like this is a big deal.
And they like pivoted the whole research agenda to focus on this. Then they realized, like,
oh, actually, like, this maybe is a big deal. You know, a lot of this seems obvious in retrospect.
But at the time, it's actually not so obvious and it can be quite difficult to recognize something
for what it is.
I mean, OpenAI has like a great history of just making the right bet.
I feel GPT models are kind of similar, right?
Where it started with games and RL, and then it's like maybe we can just scale these language models instead.
And I'm just impressed by the leadership and obviously the research team that keeps coming out with these insights.
Looking back on it today, it might seem obvious that like, oh, of course, like these models get better with scale.
So you should just scale them up a ton and it'll get better.
But it really is, the best research is obvious in retrospect,
and at the time it's not as obvious as it might seem today.
Follow-up question on data efficiency.
This is a pet topic of mine.
It seems that our current methods of learning are so
inefficient still, right? Like compared to the existence proof of humans:
we take five samples and we learn something,
machines 200, maybe, you know, per
whatever data point you might need. Anyone doing anything
interesting in data efficiency?
Or do you think there's just a fundamental inefficiency that machine learning has that will just always be there compared to humans?
I think it's a good point that if you look at the amount of data that these models are trained on and you compare it to the amount of data that a human observes to get the same performance,
I guess pre-training, it's a little hard to make an apples to apples comparison because I don't know, how many tokens does a baby actually absorb when they're developing?
But I think it's a fair statement to say these models are less data efficient than humans.
And I think that that's an unsolved research question and probably one of the most important unsolved research questions.
Maybe more important than algorithmic improvements.
Because we can increase the supply of data out of the existing set of the worlds and humans.
I guess this is a good.
So a couple thoughts on that.
One is that the answer might be an algorithmic improvement.
Like maybe algorithmic improvements do lead to greater data efficiency.
And the second thing is that it's not like humans learn from just reading the
internet. So I think it's certainly easiest to learn from just like data that's on the internet,
but I don't think that's like the limit of what data you could collect. The last follow up before we
change topics to coding, any other just anecdotes or insights from Ilya just in general?
Because like you've worked with him. So there's not that many people that we can talk to that have
worked with him. I think I've just been very, very impressed with his vision that I think like,
especially when I joined and I saw, you know, the internal documents at OpenAI
of like what he had been thinking about back in like 2021, 2022, even earlier, I was very
impressed that he had a clear vision of like where this was all going and what was needed.
Some of his emails from 2016-17 when they were founding OpenAI were published.
And even then he was talking about how he thinks like one big experiment is much more valuable
than 100 small ones.
That was like a core insight that differentiated them from Brain, for example.
It just seems very insightful, that he just sees things much
more clearly than others, and I just wonder what his production function is. How do you make a human
like that? And how do you improve your own thinking to better model it? I mean, I think it is true that,
I mean, one of OpenAI's big successes was betting on the scaling paradigm. It is just kind of odd because,
you know, they were not the biggest lab. You know, it was like difficult for them to scale.
Back then, it was much more common to do like a lot of small experiments, more academic style.
People were trying to figure out these various algorithmic improvements.
And OpenAI bet pretty early on large scale.
We had David Luan on, who I think was VP of Eng at the time of GPT-1 and 2.
And he talked about how the differences between Brain and OpenAI were basically the cause of Google's inability to come out with a scaled model.
Like just structurally, everyone had allocated compute and you had to pool resources together to make bets.
And you just couldn't.
I think that's true,
that OpenAI was structured differently, and I think that really helped them. Like, OpenAI functions
a lot like a startup, and other places tended to function more like universities or, you know,
research labs as they traditionally existed. The way that OpenAI operates, more like a startup with
this mission of building AGI and superintelligence, that helped them organize, collaborate,
pool resources together, make hard choices about like how to allocate resources. And I think a lot of
the other labs, like, have now been trying to adopt paradigms more like that, like setups more like
that. Let's talk about maybe the killer use case, at least in my mind, of these models,
which is coding. You've released Codex recently, but I would love to talk through the Noam Brown
coding stack. What models you use, how you interact with them? Cursor, Windsurf? Lately I've
been using Windsurf and Codex. Like, actually, a lot of Codex. I've been having a lot of fun.
You just give it a task, and it just goes off and does it, and it comes back five minutes later
with like a, you know, pull request. And is it core research tasks,
or like side stuff that you don't super care about?
I wouldn't say it's like side stuff.
I would say basically anything that I would normally try to code up,
I try to do it with Codex first.
Well, for you, it's free. But yeah, for everybody it's free right now.
And I think that's partly because it's the most effective way for me to do it.
And also it's good for me to get experience working with this technology
and then also like seeing the shortcomings of it.
It just helps me better understand like, okay, these are the limits of these models and like what we need to push on next.
Have you felt the AGI?
I felt the AGI multiple times, yes.
Like, like, how should people push Codex in ways that you've done?
And, you know, I think you see it before others because obviously you work closer to it.
I think anybody can use Codex and feel the AGI.
It's kind of funny how like you feel the AGI and then you get used to it very quickly.
So you're really like
dissatisfied with like where it's lacking.
Yeah, I know.
It's magical one day.
I was actually looking back at the old Sora videos when they were announced.
Yeah.
Because like, remember when Sora came out, it was just like...
The biggest news ever.
It was just magical.
You look at that and you're like, it's like, it's really here.
Like, this is AGI.
But if you look at it now and it's kind of like, oh, you know, the people like don't move
like very organically and it's like there's like a lack of consistency in some ways.
And you see all these flaws in it now that you just didn't really notice when it first came
out. And yeah, you get used to this technology very quickly. But I think what's cool about it is that because
it's developing so quickly, you get those feel-the-AGI moments like every few months. So something else is
going to come out and just like, it's magical to you. And then you get used to it very quickly.
What are your Windsurf pro tips now that you're immersed in it?
I think one thing I'm surprised by is how few people, I mean, maybe your audience is going to be more
comfortable with reasoning models and use reasoning models more. But I'm surprised at how many people
don't even know that o3 exists. I've been using it day to day. It's basically replaced
Google search for me. I just use it all the time. And also for things like coding, I tend to just
use the reasoning models. My suggestion is like if people are not, have not tried the reasoning
models yet, because honestly, like we do like people love them, people that use it, love them.
Obviously a lot more people use
GPT-4o and just like the default
on ChatGPT and that kind of stuff.
I think it's worth trying the reasoning models.
Like I think people would be surprised at what they can do.
I use Windsurf daily and
they still haven't actually enabled it as like a default
in Windsurf. Like I always have to dig it up,
like type in o3, and then it's like,
oh yeah, that exists.
It's weird. I would say like my struggle
with it has been that it takes so long
to reason and I actually break out of flow.
I think that is true, yes.
And I think this is one of the advantages of Codex,
that like, okay, you can give it a task that's kind of self-contained
and like it can go off and do its thing
and come back 10 minutes later.
And I can see that if you're doing,
if you're using this thing as like more like a pair programmer kind of thing,
then yeah, you want to use GPT-4.1 or something like that.
What do you think is the most broken part of the development cycle with AI?
Like in my mind, it's like, um, pull request review.
Like for me, like I use Codex all the time and then I get all these pull requests
and it's kind of hard to like go through all of them.
What other thing would you like people to build
to make this even more scalable?
I think it's really on us to build a lot more stuff.
These models are very limited in some ways.
I think I find it frustrating that
you ask them to do something
and then they spend 10 minutes doing it
and then you ask them to do something pretty similar
and then they go spend 10 minutes doing it
and like, you know, I think I describe them as like
they're geniuses, but it's their first day on the job.
And that's like kind of annoying.
Like even the smartest person on earth
when it's their first day on the job,
they're not going to be as useful as you would like them to be.
So I think being able to get more experience
and act like somebody that's actually been on the job for like six months
instead of one day,
I think would make them a lot more useful.
But that's really on us to build that capability.
Do you think a lot of it is like GPU constraints for you?
Like if I think about Codex,
why is it asking me to set up the environment myself when like the model,
if I ask o3 to like create an environment setup script for a repo,
I'm sure it'll be able to do it.
But today in the product, I have to do it.
So I'm wondering, in your mind, could these be a lot more
if we just, again, put more test time compute on them?
Or do you think there's like a fundamental model capability limitation today
that we still need a lot of human harnesses around it?
I think that we're in an awkward state right now.
Where, like, progress is very fast and there's things that are like,
clearly we could do this and the models would be better.
We're going to get to it.
You're just limited by how many hours there are in the day.
Right.
So progress can only proceed so quickly.
We're trying to get to everything as fast as we can.
And I think that o3 is not where the technology will be in six months.
I like that question overall in like there's a software development lifecycle, not just
generation of the code, like from issue to PR, basically.
It's like the typical commentary of that.
And then there's the Windsurf side, which is inside your IDE.
Like what else?
Right.
Pull request review is like something that people don't really.
There are startups there built around it.
It's not something that Codex does.
And it could.
So, like, then there's, like, what else is there, you know, that is sort of rate limiting the amount
of software you could be iterating on? It's an open question. I don't, I don't know if there's an answer.
Anything else on SWE in general? Like, where do you think this goes just in form factors or
what will we be looking at this time next year in terms of how things are, what models will be able to do
that they're not able to today? I don't think it's going to be limited to SWE. I don't think it's
going to be limited to software engineering. I think it's going to be able
to do a lot of remote work kind of tasks.
Yeah, like SWE-Lancer type of work.
Yeah, or just like even things that are not necessarily software engineering.
Okay.
So the way that I think about it is like anybody that's doing a remote work kind of job,
I think it's valuable to become familiar with the technology and like kind of get a sense
of like what it can do, what it can't do, what it's good at, what it's not good at.
Because I think the breadth of things that it's going to be able to do is going to expand
over time as well.
I feel like virtual assistants might be the next thing after SWE then,
because they're the most easily,
like, you know, virtual assistants,
like hire someone in the Philippines,
someone who'd just look through your email and all that,
because that is entirely,
you can intercept all the inputs and all the outputs
and train on that.
And maybe OpenAI just buys a virtual assistant company.
Yeah,
I think what I'm looking forward to is that
for things like virtual assistants,
the models, like if they're aligned well,
they could end up being like really preferable
for that kind of work, you know?
There's always just like this principal-agent
problem where if you delegate a task to somebody, then like, are they really aligned with
doing it as you would want it to be done?
Which is as cheaply as quickly as they can.
Yeah.
And so if you have an AI model that's actually really aligned to you and your preferences,
then that could end up doing a way better job than a human could.
Well, not that it's doing a better job than a human could, but like it's doing a better job
than a human would.
That word alignment, by the way, I think there's like an interesting overloading or homomorphism
between safety alignment and instruction following alignment.
And I wonder where they diverge.
Okay, so I think where it diverges is, is like,
what do you want to align the models to?
Like, that's, I think, a difficult question, you know?
Like, you could say, like, you want to align it to the user.
Okay, well, what happens if the user wants to build a novel virus
that's going to wipe out half of humanity?
That's safety alignment, yeah.
So there's a question of, like, I think alignment, I think they're related, you know,
and I think the big question is, like, what are you aligning towards?
Yeah, there's, like, humanity goals,
and then there's your personal goals and everything in between.
So that's kind of, I guess, the individual agent.
And you announced you're leading the multi-agent team at OpenAI.
I haven't really seen many announcements.
Maybe I missed them on what you've been working on.
But what can you share about interesting research directions or anything from there?
Yeah, there haven't really been announcements on this.
I think we're working on cool stuff.
And I think we'll get to announce some cool stuff at some point.
I think the team name, in many ways, is actually a misnomer because we're working on more than just multi-agent.
Multi-agent is one of the things we're working on. Some other things that we're working on is just like being able to scale up test time compute by a ton. So we get these models thinking for 15 minutes now. How do we get them to think for hours? How do we get them to think for days, even longer? And be able to solve incredibly difficult problems. So that's one direction that we're pursuing. Multi-agent is another direction. And here, I think there's a few different motivations. We're interested in both the collaborative and the competitive aspect of multi-agent. I think the way
that I describe it is people often say in AI circles that humans occupy this very narrow band
of intelligence and AIs are just going to like quickly catch up and then surpass like this
band of intelligence. And I actually don't think that the band of human intelligence is that
narrow. I think it's actually quite broad because if you compare anatomically identical humans
from, you know, caveman times, they didn't get that far in terms of like, you know, what we would
consider intelligence today, right? Like, they're not putting a man on the moon. You know,
they're not like building semiconductors or nuclear reactors or anything like that. And we have those
today, even though we as humans are not anatomically different. And so what's the difference? Well,
I think the difference is that you have thousands of years, a lot of humans, billions of humans,
cooperating and competing with each other, building up civilization over time. The technology that we're
seeing is the product of this civilization. And I think similarly, the AIs that we have today are
kind of like the cavemen of AI. And I think that if you're able to have them cooperate and compete
with billions of AIs over a long period of time and build up a civilization essentially, the things
that they would be able to produce and answer would be far beyond what is possible today
with the AIs that we have today.
Do you see that being similar to maybe like Jim Fan's Voyager skill library idea,
of saving these things?
Or is it just the models then being retrained on this new knowledge?
Because the humans then have a lot of it in the brain as they grow.
I think I'm going to be evasive here and say that like, yeah,
until we have something to announce, which I think that we will in the not too distant future,
I think I'm going to be a bit vague about like exactly what we're doing.
But I will say that
the way that we are approaching multi-agent in the details and the way we're actually going about it
is, I think, very different from how it's been done historically and how it's being done today by
other places. I've been in the multi-agent field for a long time. I've kind of felt like the
multi-agent field has been a bit misguided in some ways and the approaches that the field has taken
and like the way that's been approached. And so I think we're trying to take a very principled approach
to multi-agent. Sorry, I got to ask. So you can't talk about what you're doing, but you can say
what's misguided. What's misguided? I think that a lot of the approaches that have been taken have
been very heuristic and haven't really been following like the bitter lesson approach to scaling and
research. Okay. I think maybe this might be a good, a good spot. So obviously you've done a lot of
amazing work in poker. And I think as the reasoning model got better, I was talking to one of my friends
who used to be a hardcore poker grinder. And I told them I was going to interview you. And their
question was, at the table, you can get a lot of information
from a small sample size about how a person plays.
But today, GTO is like so prevalent that sometimes people forget that you can play
exploitatively.
What do you think is the state?
As you think about multi-agent and kind of like competition, is it always going to be
trying to find the optimal thing or is a lot of it trying to think more in the moment,
like how to exploit somebody?
I'm guessing your audience is probably not super familiar with poker terminology.
So I'll just like explain this a bit.
A lot of people think that poker is just like a luck game and that's
not true. There's actually a lot of strategy in poker. So you can win consistently in poker
if you're playing the right strategy. So there's different approaches to poker. One is game theory
optimal. This is like you're playing an unbeatable strategy in expectation. Like you're just
unexploitable. It's kind of like in rock paper scissors, you can be unbeatable in rock paper
scissors if you just randomly choose between rock, paper, and scissors with equal probability.
Because no matter what the other guy does, you know, they're not going to be able to exploit you.
You're going to like not lose in expectation. Now,
a lot of people hear that and they think, like, well, that also means that you're not going to win in expectation because you're just playing totally randomly.
But in poker, if you play the equilibrium strategy, it's actually really difficult for the opponents to figure out how to tie you.
And they're going to end up making mistakes that will lead you to win over the long run.
It might not be a massive win, but it is going to be a win.
If you play enough hands for a long enough period of time, you're going to win in expectation.
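To make the rock paper scissors point concrete, here is a minimal Python sketch. It is not from the episode: the payoff convention (+1 win, 0 tie, -1 loss) and the helper names `expected_payoff`, `pure`, and `worst_case` are assumptions chosen for illustration. It checks that the equal-probability mix cannot be pushed below zero in expectation, while a predictable strategy can.

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")
# BEATS[a] == b means move a beats move b.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """+1 for a win, 0 for a tie, -1 for a loss, from my point of view."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def expected_payoff(my_strategy, their_strategy):
    """Expected payoff when both players sample independently from their mixes."""
    return sum(
        my_strategy[m] * their_strategy[t] * payoff(m, t)
        for m in MOVES
        for t in MOVES
    )


def pure(move):
    """A strategy that always plays the same move."""
    return {m: Fraction(int(m == move)) for m in MOVES}


def worst_case(my_strategy):
    """Payoff against the opponent's best response.

    Expected payoff is linear in the opponent's mix, so some pure move is
    always among the best responses; checking the three pure moves suffices.
    """
    return min(expected_payoff(my_strategy, pure(t)) for t in MOVES)


uniform = {m: Fraction(1, 3) for m in MOVES}

print(worst_case(uniform))           # the uniform mix is never worse than break-even
print(worst_case(pure("scissors")))  # a predictable player can be beaten every time
```

Rock paper scissors is the degenerate case where the equilibrium only breaks even; the claim in the conversation is that in poker the same kind of strategy also profits from opponents' mistakes.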
Now, there's also exploitative poker.
and the idea here is that you're trying to spot weaknesses
in how the opponent plays.
Maybe they're not bluffing enough
or maybe they fold too easily to a bluff.
And so you start adapting from the game theory
optimal balanced strategy of like you bluff sometimes,
you don't bluff sometimes,
to then playing a very unbalanced strategy
that's like, oh, I'm just going to like bluff a ton
against this person because they always fold whenever I bluff.
Now the key is that there's a tradeoff here
because if you're taking this exploitative approach,
then you're opening yourself up to exploitation as well.
And so you have to choose this balance between playing a defensive game theory optimal policy
that guarantees you're not going to lose but might not make you as much money as you potentially could,
versus playing an exploitative strategy that can be much more profitable, but also it creates weaknesses
that the opponents can take advantage of and trick you. And there's no way to perfectly balance the two.
It's kind of like in rock paper scissors. If you notice somebody is like playing paper for five times in a row,
you might think like, oh, they have a weakness in their strategy. I should just be throwing scissors
and I'm going to take advantage of them.
And so on the sixth time you throw scissors,
but actually that's the time when they throw rock.
And you never really know.
So you always have this trade-off.
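The trade-off can be sketched numerically. The snippet below is an illustration under stated assumptions, not anything described in the episode: `exploit_mix` is a hypothetical helper that blends the equal-probability strategy with "always scissors", and payoffs are +1 win, 0 tie, -1 loss. Leaning harder into the exploit earns more against a paper-heavy opponent and loses exactly as much if the opponent switches to rock.

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")
# BEATS[a] == b means move a beats move b.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """+1 for a win, 0 for a tie, -1 for a loss, from my point of view."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def expected_payoff(my_strategy, their_strategy):
    """Expected payoff when both players sample independently from their mixes."""
    return sum(
        my_strategy[m] * their_strategy[t] * payoff(m, t)
        for m in MOVES
        for t in MOVES
    )


def pure(move):
    """A strategy that always plays the same move."""
    return {m: Fraction(int(m == move)) for m in MOVES}


def exploit_mix(weight):
    """Blend of the uniform mix and 'always scissors' (hypothetical helper).

    weight = 0 is the balanced strategy, weight = 1 is the pure exploit.
    """
    w = Fraction(weight)
    return {
        m: (1 - w) * Fraction(1, 3) + (w if m == "scissors" else 0)
        for m in MOVES
    }


for weight in (Fraction(0), Fraction(1, 2), Fraction(1)):
    mine = exploit_mix(weight)
    gain = expected_payoff(mine, pure("paper"))  # opponent keeps throwing paper
    risk = expected_payoff(mine, pure("rock"))   # opponent switches to rock
    print(f"weight={weight}: vs paper {gain}, vs rock {risk}")
```

In this toy setup the upside and the downside are the same size, which is why committing to the exploit after five papers in a row is a bet rather than a free win.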
The poker AIs that have been extremely successful.
And my background is, like, I worked on AI for poker for several years during grad school
and made the first superhuman no-limit poker AIs.
The approach that we took was this game theory optimal approach,
where the AIs would play this unbeatable strategy,
and they would play against the world's best and beat them.
Now, that also means they beat the world's worst.
They would just beat anybody.
But if they were up against a weak opponent, they might not beat them as severely as a human expert might, because the human expert would know how to adapt from the game theory optimal policy to be able to exploit these weak players.
And so there's this kind of unanswered question of like, how do you make an exploitative poker AI?
And a lot of people have pursued this research direction.
I had like dabbled in it a little bit during grad school.
And I think fundamentally it just comes down to AIs not being as sample efficient
as humans, as, you know, we discussed earlier. If a human's playing poker, they're able to get a
really good sense of the strengths and weaknesses of a player within a dozen hands. It's like honestly
really impressive. And back when we were working on AI for poker in like the, you know, mid-2010s,
these AIs would have to play like 10,000 hands of poker to like get a good profile
of like who this player is, like how they're playing where their weaknesses are. Now, I think with
more recent technology, that has come down. But still the sample efficiency has been a big challenge.
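One naive way to see why pure frequency counting needs so many hands is the standard error of an observed rate. The sketch below is only an illustration: it assumes a hypothetical opponent who folds to a bluff half the time, treats every hand as one independent yes/no observation, and is not a description of how the poker AIs discussed here, or human experts, actually profile opponents.

```python
import math


def standard_error(true_rate, observations):
    """Standard error of an observed frequency over independent yes/no observations."""
    return math.sqrt(true_rate * (1 - true_rate) / observations)


# Hypothetical opponent who folds to a bluff half the time.
for hands in (12, 10_000):
    se = standard_error(0.5, hands)
    print(f"{hands:>6} observations: estimate is roughly 0.50 +/- {2 * se:.3f}")
```

With a dozen observations the estimate is too noisy to separate a 40% folder from a 60% folder, so whatever lets a human read an opponent that quickly has to come from something other than raw counts, such as prior experience with similar players.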
Now, what's interesting is that after working on poker, I worked on Diplomacy.
I think we talked about this earlier.
And Diplomacy is this, you know, it's a seven-player negotiation game.
And when we started working on it, I took a very game theory approach to the problem.
I felt like, okay, it's kind of like poker.
You have to compute this game theory optimal policy, and you just play this.
You're going to not lose in expectation.
You're going to win in practice.
But that actually doesn't work in Diplomacy.
And why it doesn't work, again, is a question of like how much
of a rabbit hole we want to go down on this. But like basically when you're playing like
zero-sum games like poker, game theory optimal works really well. When you're playing a game like
Diplomacy where you need to collaborate and compete, where there's
room for collaboration, then game theory optimal actually doesn't work that well. And you have to
understand the players and adapt to them much better. So this ends up being very similar to the
problem in poker of like how do you adapt to your opponents? In poker, it's about adapting to
their weaknesses and taking advantage of that. In Diplomacy, it's about adapting to their play
styles. It's kind of like if you're at a table and everybody's speaking French, you don't
want to just keep talking in English. You want to adapt to them and speak in French as well.
That's the realization that I had with Diplomacy, that we need to shift away from this game
theory optimal paradigm towards modeling the other players, understanding who they are, and then
responding accordingly. And so in many ways, the techniques that we developed in Diplomacy
are exploitative. Well, they're not exploitative, they're really
just adapting to the other players at the table. But I think the same techniques
could be used in AI for poker to make exploitative poker AIs. If I didn't get, you know,
AGI-pilled by the incredible progress that we were seeing with language models and shift
my whole research agenda to focusing on like general reasoning, probably what I would have worked on
next was making these like exploitative poker AIs. It would be a really fun research direction to go down.
I think it's still there for anybody that wants to do it. And I think the key would be taking the
techniques that we use in Diplomacy and applying them to things like poker.
I think to me the core piece is when you play online, you have a HUD, which tells you,
you know, all these stats about the other player and like, you know, how much they participate,
pre-flop, blah, blah, blah, blah.
And to me, it's like a lot of these models, from my understanding, are not really leveraging
the behavior of the other players at the table.
They're just kind of looking at the board state and kind of working from there.
That's correct.
The way the poker AIs work today, they're just kind of like sticking to their pre-computed
GTO strategy and they're not adapting to the other players at the table.
And like you can do various like kind of hacky things to get them to adapt, but you know,
they're not very principled.
They're not, they don't work super well.
Yeah.
Okay.
Any grad students listening?
Yeah.
If you want to work on that, I think that is a very, very reasonable research direction that
will at least get in front of you and, you know, get some attention at least.
Yeah.
The other thing that this conversation brings up for me is, well, like, one of the
hypotheses for like what is the next step after test time compute is world models. Is world modeling
an important or worthwhile research direction? Like Yann LeCun has been talking about this nonstop.
But like basically LLMs, like they have internal world models, but like not explicitly
a world model. I think it's pretty clear that as these models get bigger, they have a world model
and that world model becomes better with scale. So they are implicitly developing a world model.
and I don't think it's something that you need to explicitly model.
I could be wrong about that.
When dealing with people or multi-agents, it might be,
because you have entities that are not the world
and you're resolving hypotheses of which are the many types of entities
you could be dealing with.
There was this long debate in the multi-agent AI community for a long time about,
and it's still going on,
about whether you need to explicitly model other agents,
like other people, or if they can be implicitly modeled as part of the environment.
For a long time, I took the perspective of, like, of course, you have to, like, explicitly
model these other agents because they're behaving differently from the environment.
Like, they take actions, they're unpredictable.
You know, they have agency.
But I think I've actually shifted over time to thinking that, like, actually, if these
models become smart enough, they develop things like theory of mind.
They develop an understanding that there are other agents that, like, can take actions and
have motives and all this stuff,
and these models just develop that implicitly
with scale and more capable behavior broadly.
So that's the perspective I take these days.
So like what I just said was an example of a heuristic
that is not bitter-lesson-pilled and it just goes away.
Yeah, it really all comes back to the bitter lesson.
Got to cite them every podcast.
So one of the interesting findings and most consistent findings,
you know, I think you were at ICLR and one of the hit talks there
was about open-endedness.
And this guy, Tim, who gave that talk,
has been doing a bunch of research about multi-agent systems too.
One of the most consistent findings is always that it's better for AI to self-play and improve competitively
as opposed to humans training and guiding them.
And you find that with, like, you know, AlphaZero and R1-Zero, whatever that was,
do you think this will hold for multi-agents, like self-play to improve better than humans?
Yeah, so, okay, so this is a great question.
And I think this is worth expanding on.
So I think a lot of people today see self-play as like the next step and maybe the last
step that we need for superintelligence.
And I think if you look at something like AlphaGo and
AlphaZero, we seem to be following a very similar trend, right?
Like the first step in AlphaGo was you do large scale pre-training.
In that case, it was on human Go games.
With LLMs, it's pre-training on, you know, tons of like internet data.
And that gets you a strong model, but it doesn't get you
an extremely strong model.
It doesn't get you a superhuman model.
And then the next step in the AlphaGo paradigm
is you do large-scale test-time compute
or large-scale inference compute,
and in that case with MCTS,
and now we have reasoning models
that also do this large-scale inference compute.
And again, that boosts the capabilities a ton.
Finally, with AlphaGo and AlphaZero,
you have self-play,
where the model plays against itself,
learns from those games,
gets better and better and better
and just goes from something that's like
around human level performance to like way beyond human capability.
It's like these Go policies now are so strong that it's just like incomprehensible.
Like what they're doing is incomprehensible to humans.
Same thing with chess.
And we don't have that right now with language models.
And so it's like it's really tempting to look at that and say like, oh, well, we just need these like AI models to now interact with each other and learn from each other.
And they're just going to like get to superintelligence.
The challenge, and I kind of mentioned this a little bit when I was talking about Diplomacy, the challenge is that Go is this two-player zero-sum
game. And two-player zero-sum games have this very nice property where when you do self-play,
you are converging to a minimax equilibrium. And I guess I should take a step back and say,
like, in two-player zero-sum games, and chess, Go, even two-player poker are
all two-player zero-sum, what you typically want is what's called a minimax equilibrium. This is
that GTO policy, this policy that you play where you're guaranteeing that you're not going to
lose to any opponent in expectation. I think in chess and Go, that's like pretty clearly what you
want. Interestingly, when you look at poker, it's not as obvious. In a two-player zero-sum version of
poker, you could play the GTO minimax policy, and that guarantees that you won't lose to any
opponent on earth. But again, as I mentioned, you're not necessarily going to beat a weak player by much. You're
not going to make as much money off of them as you could if you instead played an exploitative policy.
So there's this question of like, what do you want? Do you want to make as much money as possible,
or do you want to guarantee that you're not going to lose to any human alive? What all the bots have decided
is like, well, what all the AI developers in these games have decided is like, well, we're going
to choose the minimax policy. And conveniently, that's exactly what self-play converges to. If you
have these AIs play against each other, learn from their mistakes, they converge over time to this
minimax policy, guaranteed. But once you go outside of two-player zero-sum games, like in the case
of Diplomacy, that's actually not a useful policy anymore. You don't want to just like have this
very defensive policy, and you're going to end up with really weird behavior if you start
doing the same kind of self-play in things like math. So for example, what does it mean to do
self-play in math? You could fall into this trap of like, well, I just want one model to pose
really difficult questions and the other model to solve those questions. You know, that's like a two-player
zero-sum game. The problem is that like, well, you could just like pose really difficult
questions that are not interesting. You know, you just, like, ask it to do, like,
30-digit multiplication. It's a very difficult problem for the AI models. Is that really
making progress in the dimension that we want? Like, not really. So self-play outside of two-player
zero-sum games becomes like a much more difficult, nuanced question. So I think,
and Tim kind of like basically said something similar in his talk, that there's a lot of challenges
in really deciding what you're optimizing for when you start to talk about self-play outside of
two-player zero-sum games. My point is that like, this is where
the AlphaGo analogy breaks down.
And not necessarily breaks down,
but it's not going to be as easy as self-play was in AlphaGo.
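To make the convergence claim concrete, here is a minimal sketch of self-play in a tiny two-player zero-sum game. It is not the method used in AlphaGo or in any poker bot discussed here; it assumes rock-paper-scissors as the game and regret matching as the learning rule, and the names `PAYOFF`, `strategy_from_regrets`, and `self_play` are made up for illustration. In this setting the players' average strategies approach the minimax mix of one third per action, even though the strategy played on any single iteration may keep cycling.

```python
"""Self-play with regret matching on rock-paper-scissors (a two-player zero-sum game)."""

# PAYOFF[a][b] is the row player's payoff; the column player receives the negative.
# Action order: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
N_ACTIONS = 3


def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / N_ACTIONS] * N_ACTIONS


def self_play(iterations, initial_regrets=(1.0, 0.0, 0.0)):
    """Return both players' average strategies after `iterations` rounds of self-play."""
    # Assumption: player 0 starts biased toward rock so the dynamics are not trivially uniform.
    regrets = [list(initial_regrets), [0.0] * N_ACTIONS]
    strategy_sums = [[0.0] * N_ACTIONS, [0.0] * N_ACTIONS]

    for _ in range(iterations):
        s0 = strategy_from_regrets(regrets[0])
        s1 = strategy_from_regrets(regrets[1])

        # Expected payoff of each pure action against the opponent's current mix.
        u0 = [sum(PAYOFF[a][b] * s1[b] for b in range(N_ACTIONS)) for a in range(N_ACTIONS)]
        u1 = [sum(-PAYOFF[a][b] * s0[a] for a in range(N_ACTIONS)) for b in range(N_ACTIONS)]

        # Expected payoff of the mixed strategy each player actually used.
        v0 = sum(s0[a] * u0[a] for a in range(N_ACTIONS))
        v1 = sum(s1[b] * u1[b] for b in range(N_ACTIONS))

        for a in range(N_ACTIONS):
            # Regret: how much better the pure action would have done than the mix.
            regrets[0][a] += u0[a] - v0
            regrets[1][a] += u1[a] - v1
            strategy_sums[0][a] += s0[a]
            strategy_sums[1][a] += s1[a]

    return [[x / iterations for x in sums] for sums in strategy_sums]


if __name__ == "__main__":
    avg0, avg1 = self_play(100_000)
    print("player 0 average strategy:", [round(p, 3) for p in avg0])
    print("player 1 average strategy:", [round(p, 3) for p in avg1])
```

The guarantee illustrated here is specific to two-player zero-sum games. With more players or a non-zero-sum payoff, the same loop still runs, but nothing ensures the result is a policy you would want.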
What is the objective function then for that?
What is the new objective function?
Yeah, it's a good question.
Yeah.
And I think that that's something that a lot of people are thinking about.
Yeah.
I'm sure you are.
In one of the last podcasts that you did,
you mentioned that you were very impressed by Sora.
You don't work directly on Sora,
but obviously it's part of OpenAI.
I think the most recent update in that sort of generative media space is autoregressive image gen.
Is that interesting or surprising in any way that you want to comment about?
I don't work on image gen, so my ability to comment on this is kind of limited, but I will say, like, I love it.
Like, I think it's super impressive.
It's like one of those things where, you know, you work on these reasoning models and you think, like, wow, we're going to, like, be able to do all sorts of crazy stuff, like, advanced science and, you know, solve agentic tasks and software engineering.
And then there's like this whole other like dimension of progress.
So you're like, oh, you're able to like make images and videos now.
And it's like so much fun.
And that's getting a lot more of the attention, to be honest, especially in the general public.
And it's probably driving a lot more of the like, you know, subscription plans for ChatGPT, which is great.
But I think it's just kind of funny that like, yeah, we're also, I promise we're also working on superintelligence.
But you can make everything Ghibli.
I think that the delta for me was, I was actually harboring this thesis that diffusion was over
because of autoregressive image gen.
Like, there were rumors about this end of last year,
and obviously now it's come out.
Then Gemini comes out with text diffusion,
and, like, diffusion is so back.
And, like, these are two directions,
and it's very relevant for inference,
autoregressive versus diffusion.
Do we have both?
Does one win?
The beauty of research is, like, you know,
you've got to pursue different directions,
and it's not always going to be clear,
like, what is, you know, the promising path.
And I think it's great that people are looking into different directions and trying different things.
I think that there's a lot of value in that exploration, and I think we all benefit from seeing what works.
Any potential in diffusion reasoning?
Let's say your channel.
Probably can't answer that.
Okay.
So you did a master's in robotics, too.
We'd love to get your thoughts on one.
You know, OpenAI kind of started with the pen spinning trick and like the robotic arm they wanted to build.
Is it right to work on these humanoids?
Do you think that's kind of like the wrong embodiment of
AI? Outside of the usual, you know, how long until we get robots, blah, blah, blah,
is there something that you think is, like, fundamentally not being explored right now
that people should really be doing in robotics?
I did a master's in robotics years ago, and my takeaway from that experience, first of all,
I didn't actually work with robots that much.
I was, like, technically in a robotics program, I played around with some Lego robots
my first week at the program, but then honestly, I just, like, pretty quickly shifted
to just working on AI for poker, and I was kind of nominally in the
robotics master's. But my takeaway from like interacting with all these roboticists and seeing their
research was that I did not want to work on robots because the research cycle is so much
slower and so much more painful when you're dealing with like physical hardware. Like software goes
so much more quickly and I think that's why we're seeing so much progress with language models
and like all these like virtual co-worker kind of tasks but haven't seen as much progress in robotics.
Like, physical hardware just is much more painful to iterate on. On the question of
humanoids, I don't have very strong opinions here because this isn't what I'm working on,
but I think there's a lot of value in non-humanoid robotics as well.
I think drones are a perfect example where, like, there's clearly a lot of value in that.
Is that a humanoid? No. But in many ways, that's great. You know, like you don't want a
humanoid for that kind of technology. I think, weakly, that non-humanoids provide a lot of
value.
I was reading Richard Hamming's The Art of Doing Science and Engineering, and he talks about
how when you have a new technological shift,
people try and take the old workloads
and replicate them just in the new technology
versus you actually have to change the way you do it.
And when I see this video of like,
the humanoid in the house,
it's like, well, the human shape kind of
has a lot of limitations that can actually be improved.
But I think people want what's familiar, you know,
it's like, would you put a robot with like 10 arms
and like, you know, five legs in your house
or would it be eerie?
Like, when you get up and you see that thing walking around.
Is that why we use humanoids?
So I think to me there's almost like this local maximum of like, you know, we've got to make
it look like a human.
But what do you think the best shape in-house would be?
I am terrible at product design.
So I'm not the person to ask on this.
I think there is a question of like, is it better to make humanoids because they're more
familiar to us or is it worse to make humanoids because they're more similar to us but
not quite identical?
Like I don't know which one I would actually find creepier.
Yeah.
Yeah. The thing that got me humanoid-pilled a little bit was just the argument that most of the world is made for humans anyway. So if you want to replace human labor, you have to make a humanoid. I don't know if that's convincing.
Again, I don't have very strong opinions in this field because I don't work in it. I was, like, weakly in favor of humanoids. And I think what really persuaded me to be weakly in favor of, like, non-humanoids was listening to the Physical Intelligence CEO and, like, some of his pitches about, like, why they're pursuing, like, non-humanoid
robotics. And conveniently, their office is actually very close to here. So if you wanted to...
They're speaking at the conference I'm running.
Okay, perfect. Yeah.
So I'm looking forward to that.
I'd say, like, listen to his pitch and maybe he can convince you that non-humanoid is
awesome.
The other one I would refer people to is Jim Fan, who recently did a talk on the physical
Turing test, which he did at the Sequoia conference, which was very, very good. He's such a
great educator and explainer of things. It's very hard, especially in that field.
Cool. We're done asking you about things that you don't work
on. So these are just more rapid fires to sort of explore some of your boundaries and get some
quick hits. How do you or top industry labs keep on top of research? Like what are your
tools and practices?
It's really hard. I think that a lot of people have this perception that
like academic research is irrelevant and that's actually not the case. I think that we do
look at academic research. I think one of the challenges is like a lot of academic research shows
promise in their papers, but then actually doesn't work at scale or even doesn't replicate.
I think if we find interesting papers, we're going to try to reproduce that in-house and see
if it still holds up and then also does it scale well. But that is like a big source of
inspiration for us.
Whatever hits arXiv, literally, you do the same as the rest of us?
Or do you have like a special process?
Especially if I get recommendations. Like we have an internal
channel where people will post interesting papers. And like, I think that's a good source of like,
okay, well, this person that is more familiar with this area, thinks that this paper is interesting,
so therefore I should read it. And I, similarly, like, I'll keep track of things that are happening
in my space that I think are interesting. And, like, if I think it's really interesting, maybe I'll share it.
For me, it's like WhatsApp and Signal group chats with researchers, and that's it.
Yeah. I think it is like, I mean, a lot of people look at things like Twitter, and I think
it's really unfortunate that we've reached this point where things need to get a lot of attention
on social media for it to be paid attention to.
That's what the grad students are
trained for. They're taking classes to do this.
I do recommend it. Like, you know, I've worked with
grad students. I work with fewer now because we don't publish as much. But when I was at FAIR
publishing papers, like I would tell the grad students I was working with that like you need to
post it on Twitter and you need to, and we'd go over like the Twitter thread about like how to
present the work and everything. And there's a real art to it and it does matter. And it's kind of
the sad truth.
I know when you were doing the ACPC, like the AI poker competition, you mentioned
that people were not doing search because they were limited to like two CPUs at inference.
Do you see similar things today that are like keeping interesting research from being done?
That might be because it's not as popular.
It doesn't get you into the top conferences.
Like are there some environmental limiters?
Absolutely.
And I think one example is benchmarks. You look at things like Humanity's Last Exam.
Like you have these incredibly difficult problems, but they are still very easily gradable.
And I think that actually limits the scope of what you can
evaluate these models on if you stick to that paradigm. It's very convenient because, you know, it's
very easy to then score the models. But actually a lot of the things that we want to, you know,
evaluate these models on are kind of like more fuzzy tasks that are not multiple choice questions,
and making benchmarks for those kinds of things is so much harder and probably also like
a lot more expensive to evaluate. But I think that those are really valuable things to work on.
And that would fit the same moment. GPT-4.5 is like a high-taste model in a way. There are
kind of like all these non-measurable things about a model that are really good that maybe people are not...
Well, I think there are things that are measurable, but they're just like much more difficult to measure.
And I think that a lot of benchmarks have kind of stuck to this paradigm of posing really difficult problems that are really easy to measure.
So let's say the pre-training scaling paradigm took about five years from like discovery of GPT to scaling it up to GPT-4.
And then we give you, we give test time compute five years as well.
So if test time compute hit a wall by 2030, what would be the probable cause?
It's very similar to pre-training where you can push pre-training a lot further.
It just becomes more expensive with each iteration.
I think we're going to see something similar with test time compute.
Where, like, okay, we're going to get them thinking, instead of for three minutes,
for three hours and then three days and then three weeks.
Oh, you run out of human life.
Well, there's two concerns.
One is that it becomes much more expensive to get the models to think for that long,
to scale up test-time compute. As you scale up test-time compute, you're spending more on test-time compute, which means that there's a limit to how much you could spend. That's one potential ceiling. Now, I should say that we're also becoming more efficient. These models are becoming more efficient in the way they're thinking. They're able to do more with the same amount of test-time compute. I think that's a very underappreciated point, that it's not just that we're getting these models to think for longer. In fact, if you look at o3, it's thinking for longer than o1-preview for some questions, but it's not like a radical difference. But it's way better. Why? Because it's just becoming better at thinking.
Anyway, yeah, these models, you're going to scale up test-time compute.
You can only scale it up so much.
That becomes a soft barrier,
in the same way that with pre-training, it's becoming more and more expensive to train better and better pre-trained models, or bigger pre-trained models.
The second point is that, like, as you have these models think for longer, you kind of get bottlenecked by wall-clock time.
Like, if you want to iterate on experiments, it is really easy to iterate on experiments when these models respond instantly.
It's actually much harder when they take three hours to respond.
And what happens when they take three weeks?
It takes you at least three weeks to do those evaluations and to then iterate on that.
And a lot of this, you can parallelize experiments to some extent, but for a lot of it you have to run
the experiment to completion and then see the results in order to decide on the next set of experiments.
I think this is actually the strongest case for long timelines: that the models,
because they just have to do so much in serial time, we can only iterate so quickly.
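The wall-clock point can be put into rough numbers. The sketch below is a back-of-the-envelope calculation, not a figure from the conversation: it assumes one experiment is a single model run, that each run must finish before the next is chosen, and that nothing else takes any time. The names `THINKING_TIMES` and `serial_runs_per_year` are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# Thinking times mentioned in the conversation, converted to minutes.
THINKING_TIMES = {
    "3 minutes": 3,
    "3 hours": 3 * 60,
    "3 days": 3 * 24 * 60,
    "3 weeks": 3 * 7 * 24 * 60,
}


def serial_runs_per_year(minutes_per_run):
    """Number of back-to-back runs that fit in one year, ignoring all other overhead."""
    return MINUTES_PER_YEAR // minutes_per_run


if __name__ == "__main__":
    for label, minutes in THINKING_TIMES.items():
        print(f"{label:>9}: {serial_runs_per_year(minutes)} sequential runs per year")
```

Running experiments in parallel raises the total number of runs but not the number of sequential decision points, which is the bottleneck described above.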
How would you overcome that wall?
It's a challenge.
And I think it depends on the domain.
So drug discovery, I think, is one domain where this could be a real bottleneck.
I mean, if you want to see if something like extends human life, it's going to take you a long time to figure out if like this new drug that you developed like actually extends human life and doesn't have like terrible side effects along the way.
Side note, do we not have perfect models of human chemistry and biology by now?
Well, so this is, I think, the thing.
And again, I want to be cautious here because I'm not actually a biologist or chemist.
Like I know very little about these fields.
Last time I took a biology class was 10th grade in high school.
I don't think that there's a perfect simulator of human biology right now.
And I think that that's something that could potentially help address this problem.
That's like the number one thing that we should all work on.
Well, that's one of the things that we're hoping that these reasoning models will help us with.
Yeah.
How would you classify mid-training versus post-training today?
All these definitions are so fuzzy.
So I don't have a great answer there.
It's a question people have.
And, like, OpenAI is now explicitly hiring for mid-training,
and everyone is like, what the hell is mid-training?
I think mid-training is between pre-training and post-training.
It's like it's not post-training.
It's not pre-training.
It's like adding more to the models, but like after pre-training.
In interesting ways.
Yeah.
Okay.
All right.
Well, you know, I was trying to get some clarity.
Is the pre-trained model now basically like just the artifact that then spawns other models?
and it's almost like the core pre-training model is never really exposed anymore.
And is mid-training the new pre-training,
and then there's the post-training once you have the models branched out.
You never interact with an actual just like raw pre-trained model.
Like if you're going to interact with the model, it's going to go through mid-training and post-training.
So you're seeing the final product.
Well, you don't let us do it, but you know, we used to.
Well, yeah.
I mean, I guess there's open-source models where you can just like interact with the raw pre-trained model.
But for OpenAI models, like they go through a mid-training step,
then they go through a post-training step and then they're released.
And they're a lot more useful.
Like, frankly, if you interacted with a model that was only pre-trained,
it would be super difficult to work with and it would seem kind of dumb.
Yeah.
But it would be useful in weird ways, you know,
because there's a mode collapse when you post-train it for, like, chat.
Yeah.
In some ways, you want that mode collapse.
Like, you want that collapse for it
to be useful.
Yeah.
I get it.
Yeah.
We're interviewing Greg Brockman next.
You've talked to him a lot.
What would you ask him?
What would I ask Greg? I mean, I get to ask Greg all the time. What should you ask Greg?
Like, to evoke an interesting response that he doesn't get asked enough about, but you know this is something that he's passionate about, or you just want his thoughts.
I think in general, it's worth asking where this goes. You know, like what does the world actually look like in five years? What does the world look like in 10 years? What does that distribution of outcomes look like? And what could the world or individuals do
to help steer things towards like the good outcomes instead of the negative outcomes?
Okay.
Like an alignment question.
I think people get very focused on what's going to happen in like one or two years.
And I think it's also worth spending some time thinking about like, well, what happens in five or ten years?
And what does that world look like?
I mean, he doesn't have a crystal ball.
But he certainly has thoughts.
Yeah.
So I think that's worth exploring.
Okay.
What are games that you recommend to people?
especially socially.
What are games that I recommend to people?
I've been playing a lot of this game called Blood on the Clocktower lately.
What is it?
It's kind of like Mafia or Werewolf.
It's become very popular in San Francisco.
Oh, that's the one we played in your house.
Yeah.
Okay, got to get it.
It's kind of funny because I was talking to a couple of people now who told me that
it used to be that poker was the way that the VCs and tech founders and stuff
would socialize with each other.
And actually now it's shifting more towards Blood on the Clocktower.
Like, that's the thing that people use to, like, you know, connect in the Bay Area.
And I was actually told that a startup held a recruiting event that was a Blood on the Clocktower game.
Wow.
Yeah.
So I guess it's like, it's really catching on.
But it's a fun game.
And I guess you lose less money playing it than you do playing poker.
So it's better for people that are not very good at these things.
I think it's kind of like a weird recruiting event, but it's certainly a fun game.
What qualities make a winner here that is interesting to hire for?
That's the thing.
It's like, okay, I guess you get the ability to lie.
Deception and like picking up on deception.
Like, is that the best employee?
I don't know.
So my slight final pet topic is Magic the Gathering.
So you have, we talked about some of these games, chess, Go, and they have perfect information.
Then you have poker, which is imperfect information in a pretty limited universe.
You only have a 52-card deck.
And then you have these other
games that have imperfect information, like a huge pool of possible options. Do you have any idea of
how much harder that is? Like, how does the difficulty of this problem scale?
I love that you
ask that because I have this like huge store of knowledge on AI for imperfect-information games. This was my
area of research for so long. And I know all these things, but I don't get to talk about it very often.
We've made superhuman poker AIs for no-limit Texas hold'em. One of the interesting things about
that is that like the amount of hidden information is actually pretty limited because you have two hidden
cards when you're playing Texas hold'em. And so the number of possible states that you could be in
is 1,326 when you're playing heads up at least. And, you know, that's multiplied by the number of other players
that there are at the table, but it's still like not a massive number. And so the way these AI models work
is that you enumerate all the different states that you could be in. So if you're playing like
six-handed poker, there's five other players, five times
1,326, that's the number of states that you could be in, and then you assign a probability to each
one, and then you feed those probabilities into your neural net, and you get actions back for each of
those states. The problem is that as you scale the number of hidden possibilities, like the number
of possible states you could be in, that approach breaks down. And there's still this very
interesting, unanswered question of what do you do when the number of hidden states becomes
extremely large? You know, so if you go to Omaha poker where you have four hidden cards,
there are things you could do that are kind of heuristic to reduce the number of states.
But actually it's still a very difficult question.
And then if you go to a game like Stratego where you have 40 pieces, so there's like close to 40 factorial different states you could be in, then all these like existing approaches that we used for poker kind of break down.
And you need different approaches.
And there's a lot of active research going on about like how do you cope with that?
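The counts quoted above can be checked directly. The sketch below is only an illustration of the enumeration idea, not the implementation of any actual poker AI: it assumes a standard 52-card deck, ignores card removal from your own hand and the board, and uses a uniform starting belief as a placeholder for whatever probabilities a real system would assign. The names `DECK`, `hidden_hands`, and `uniform_belief` are hypothetical.

```python
import itertools
import math

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [rank + suit for rank in RANKS for suit in SUITS]


def hidden_hands(cards_per_hand):
    """Enumerate every unordered hand of the given size from a full deck."""
    return list(itertools.combinations(DECK, cards_per_hand))


def uniform_belief(hands):
    """Assign the same probability to every hand (placeholder starting belief)."""
    p = 1.0 / len(hands)
    return {hand: p for hand in hands}


if __name__ == "__main__":
    holdem = hidden_hands(2)
    belief = uniform_belief(holdem)
    print("Hold'em hidden hands per opponent:", len(holdem))
    print("Belief entries to feed a model:", len(belief))
    # Counted with a formula rather than enumerated, since the lists grow quickly.
    print("Omaha hidden hands per opponent:", math.comb(52, 4))
    print("40 factorial (rough Stratego upper bound):", math.factorial(40))
```

Two hidden cards give a belief vector small enough to enumerate in full, four hidden cards are already about 200 times larger, and 40 factorial is far beyond anything that can be listed explicitly, which is the scaling problem described in the conversation.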
So for something like Magic the Gathering, the techniques that we used in poker would not out of the box work.
And it's still an interesting research question of like, what do you do?
Now, I should say this becomes a problem when you're doing the kinds of search techniques that we used in poker.
If you're just doing model-free RL, it's not a problem.
And my guess is that if somebody put in the effort, they could probably make a superhuman bot for Magic the Gathering now.
Yeah, there's still some unanswered research questions in that space.
Now, are they the most important unanswered research questions?
Right.
I'm inclined to say no.
I think there's like the problem is that like the techniques that we used in poker to do this kind of search stuff
were pretty limited. And like if you expand those techniques, maybe you get them to work on things like Stratego and Magic the Gathering, but they're still going to be limited. They're not going to get you like superhuman on Codeforces with language models. So I think it's more valuable to just focus on the very general reasoning techniques. And one day, as we improve those, I think we'll have a model that just out of the box one day plays Magic the Gathering at a superhuman level. And I think that's the more important and more impressive research direction.
Cool. Amazing. Yeah. Thanks so much for coming on, Noam.
Yeah, thanks for your time.
Yeah, thanks.
Thanks for having me.
