Latent Space: The AI Engineer Podcast - [State of RL/Reasoning] IMO/IOI Gold, OpenAI o3/GPT-5, and Cursor Composer — Ashvin Nair, Cursor
Episode Date: December 30, 2025

From Berkeley robotics and OpenAI’s 2017 Dota-era internship to shipping RL breakthroughs on GPT-4o, o1, and o3, and now leading model development at Cursor, Ashvin Nair has done it all. We caught up with Ashvin at NeurIPS 2025 to dig into the inside story of OpenAI’s reasoning team (spoiler: it went from a dozen people to 300+), why IOI Gold felt reachable in 2022 but somehow didn’t change the world when o1 actually achieved it, how RL doesn’t generalize beyond the training distribution (and why that means you need to bring economically useful tasks into distribution by co-designing products and models), the deeper lessons from the RL research era (2017–2022) and why most of it didn’t pan out because the community overfitted to benchmarks, how Cursor is uniquely positioned to do continual learning at scale with policy updates every two hours and product-model co-design that keeps engineers in the loop instead of context-switching into ADHD hell, and his bet that the next paradigm shift is continual learning with infinite memory, where models experience something once (a bug, a mistake, a user pattern) and never forget it, storing millions of deployment tokens in weights without overloading capacity.

We discuss:
* Ashvin’s path: Berkeley robotics PhD → OpenAI 2017 intern (Dota era) → o1/o3 reasoning team → Cursor ML lead in three months
* Why robotics people are the most grounded at NeurIPS (they work with the real world) and simulation people are the most unhinged (Lex Fridman’s take)
* The IOI Gold paradox: “If you told me we’d achieve IOI Gold in 2022, I’d assume we could all go on vacation: AI solved, no point working anymore. But life is still the same.”
* The RL research era (2017–2022) and why most of it didn’t pan out: overfitting to benchmarks, too many implicit knobs to tune, and the community rewarding complex ideas over simple ones that generalize
* Inside the o1 origin story: a dozen people, conviction from Ilya and Jakub Pachocki that RL would work, small-scale prototypes producing “surprisingly accurate reasoning traces” on math, and first-principles belief that scaled
* The reasoning team grew from ~12 to 300+ people as o1 became a product and safety, tooling, and deployment scaled up
* Why Cursor is uniquely positioned for continual learning: policy updates every two hours (online RL on tab), product and ML sitting next to each other, and the entire software engineering workflow (code, logs, debugging, DataDog) living in the product
* Composer as the start of product-model co-design: smart enough to use, fast enough to stay in the loop, and built by a 20–25 person ML team with high-taste co-founders who code daily
* The next paradigm shift: continual learning with infinite memory, meaning models that experience something once (a bug, a user mistake) and store it in weights forever, learning from millions of deployment tokens without overloading capacity (trillions of pretraining tokens = plenty of room)
* Why off-policy RL is unstable (Ashvin’s favorite interview question) and why Cursor does two-day work trials instead of whiteboard interviews
* The vision: automate software engineering as a process (not just answering prompts), co-design products so the entire workflow (write code, check logs, debug, iterate) is in-distribution for RL, and make models that never make the same mistake twice

Ashvin Nair
* Cursor: https://cursor.com
* X: https://x.com/ashvinnair_

Full Video Episode

Timestamps
00:00:00 Introduction: From Robotics to Cursor via OpenAI
00:01:58 The Robotics to LLM Agent Transition: Why Code Won
00:09:11 RL Research Winter and Academic Overfitting
00:11:45 The Scaling Era and Moving Goalposts: IOI Gold Doesn't Mean AGI
00:21:30 OpenAI's Reasoning Journey: From Codex to O1
00:20:03 The Blip: Thanksgiving 2023 and OpenAI Governance
00:22:39 RL for Reasoning: The O-Series Conviction and Scaling
00:25:47 O1 to O3: Smooth Internal Progress vs External Hype Cycles
00:33:07 Why Cursor: Co-Designing Products and Models for Real Work
00:34:14 Composer and the Future: Online Learning Every Two Hours
00:35:15 Continual Learning: The Missing Paradigm Shift
00:44:00 Hiring at Cursor and Why Off-Policy RL is Unstable

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Okay, we're here at NeurIPS.
We're recording special Latent Space coverage of the folks at NeurIPS, and we're here with Ashvin from Cursor.
Welcome.
Hi, yeah, thanks for having me.
So, I guess "Ashvin from Cursor" is like a new identity.
I didn't even know if I should say that, because you only joined Cursor three months ago.
Before that, you were at OpenAI, working on o1.
Before that, Berkeley, a PhD in RL, but focused on robotics.
Robotics, yeah.
Is it weird switching from robotics to language models?
Okay, this is kind of interesting because a lot of people have been kind of doing this.
I mean, OpenAI, yeah, you got a robot.
I actually was at OpenAI in 2017 also.
Was it on robotics?
Yeah, I was interning right before my PhD, where I worked on robotics there.
2017, is that like Japan and that's Japan over there?
He was famously OpenAI's first intern.
Oh, really? Okay, then he might have been before.
But yeah, there were like 15 interns.
It was a very different company.
It was just like robotics, Dota, and like 15 interns that summer, all having pretty exciting individual projects. Yeah, that set of interns, if you look at them now, it's kind of cool.
Yeah. Anyone from that class that you would shout out?
There's just a lot of cool papers that came out. Like Lerrel Pinto, now he's at NYU. Yeah, the person who leads reasoning at xAI, I forgot his name.
Well, he led... no. Eric? Yeah, I forgot his name, but he worked on like K-FAC and stuff, I think.
Did you know? Greg?
Not Greg, but yeah.
But yeah, it was an exciting time to be there.
But yeah, I think robotics is a pretty good fit for LLMs, because this switch ends up being pretty, you know, you kind of do similar things.
Like, you want to look at a lot of data.
It's kind of hard to get stuff working in the robotics world.
I think it kind of builds very gritty people who look at data a lot, that kind of thing.
So yeah, for whatever reason, I think that transfer, yeah, it's happening a lot, and it makes a lot of sense.
One of my NeurIPS highlights so far: I had dinner, it's like a small group dinner, with Lex Fridman yesterday.
And Lex used to be in robotics.
And he was like, my assessment of robotics people: robotics people are the best to talk to at NeurIPS because they're the most grounded, he says, because they don't have a choice.
They work with the real world.
Yeah, look at data.
And then the most unhinged, the most detached from reality, are the simulation people.
I see.
Yeah, yeah, yeah.
Yeah.
I think I agree.
Yeah.
Yeah, and I actually did a little bit of both during my PhD.
Like, I would kind of prototype ideas in sim and then get them working on real-world robots.
And, yeah, I mean, probably robotics is where you kind of feel the AGI the least, right?
Because it's just so far away from working.
Now, I think over the last year maybe, there have been demos that have been super interesting, from like Physical Intelligence and Sunday and stuff, that, yeah, I'm starting to be like, okay, this kind of feels like...
The Sunday robots themselves?
I haven't seen them.
Apparently they've been doing demos. I'm pretty keen on seeing that.
Yeah, I've seen the Physical Intelligence ones live, and yeah, it's pretty impressive.
Like, just in someone's living room, folding laundry and stuff.
Like, you know, you can just toss it in there.
Everybody must be materialized.
Yeah, yeah, yeah, yeah.
Okay, and last thing on robotics, and then we can kind of pivot to o1/o3.
OpenAI is, like, restarting a robotics team?
Is that serious?
Is that?
I actually know very little about it because, yeah, I was in a pretty different part of the org.
So, yeah, I mean, I think it's serious.
I think there's a ton of excitement around robotics right now.
I'm actually kind of curious what drives it because I don't think I fully understand.
You know, like there's been like crazy raises and stuff recently, right?
For robotics companies.
I guess my own view on it...
So when I left robotics in 2022, I thought I would actually come back to robotics.
I think my view on it now is that it feels like LLM agents are going to be like a trillion-dollar market before robotics is maybe even like a $10 billion market.
And this is just because, I mean, you know, LLM agents already create value out in the world.
Robotics, it's kind of hard to make the case that AI robotics does anything that useful yet. And then once it does something useful, you have to make the unit economics work out.
And I think that's also quite hard. Like reliability, you know, I mean, these robots have to be fixed and this kind of thing.
So I think it's kind of hard.
I would say the market is kind of efficient in that the software LLM companies are raising tens of billions.
Yeah.
And then the robotics companies are raising hundreds of millions.
I think very recently, it's been like single-digit billions.
Oh, really? Okay.
Yeah.
So I think that's maybe the surprising thing to me, is that it feels like...
It's ahead of where it's actually at.
Yeah.
Like, I would say that robotics is kind of in the GPT-1 to GPT-2 era right now.
Okay.
And I haven't worked on a robot explicitly.
What task would qualify as like, oh, that's the inflection?
It's a little bit like you know when you see it.
I thought the Sunday demos were kind of cool.
Like, maybe it's starting to get there, and the details matter a lot, where it has to be in a new scenario, like one that you haven't seen before, and maybe on like...
General OOD.
Yeah, exactly.
And I think that was kind of what GPT-2 was too, right? You start to see hints of cool generalization.
And I think that's fine. Like, you know, it doesn't have to work out of the box.
But yeah, I think at this point especially, it still feels like in robotics you're not exactly investing in a technology, probably.
You're just investing in a team.
Yeah.
Yeah, I'm not in the space whatsoever.
Sure.
That's kind of my impression.
It's actually nice when you're not in there.
Because you know as much as like most basically everyone else.
Yeah, yeah.
So we just kind of speculate.
Exactly.
I mean, people... there's a robotics team at OpenAI.
Yeah.
Back to language models. Did you join for o1, or were you like...
So I joined right before ChatGPT, in like, I think, September of '22.
Yeah.
So, yeah, actually, I was pretty burnt out from my PhD and I was like, okay, I'm going to go to this chill research lab. And then, yeah, ChatGPT happens and, you know, everything kind of blew up and a lot of stuff got kind of refocused.
But what, I guess what...
So obviously ChatGPT surprised OpenAI.
What did they tell you they were looking for you to do?
And then obviously it changed.
Yeah, I mean, so I joined on the codegen team.
The Codex.
Yeah, like OG Codex.
Exactly, yeah.
It was the team that shipped Codex.
But by the time I joined, we were more so working on the model doing tool use and these kind of things.
Yeah.
And so, very related to the chat... We were kind of like a sister team to the team that shipped ChatGPT.
Yeah.
Yeah, exactly.
So yeah, we were just working on making the models smarter at, like, programming competitions, like, yeah, how to do SFT for that, that kind of stuff.
And IOI gold felt reachable at that time?
Oh, yeah, crazy.
Like, I think, and this is something I repeat to people again and again these days: if you told me that we could have gotten IOI gold back then, I would have just assumed that we could all just go on vacation.
Like, you know, it's all over, AI solved.
Like, no point in working anymore.
We got it.
Yeah, it feels like nothing that much has changed, right?
Like, life is still the same.
Yeah.
Yeah, so I think that's like super interesting.
Yeah, I don't have a great way to explain it.
But I think that's actually like what I spent a lot of time thinking about
is like, you know, why is that the case?
Yeah.
Because, yeah, I mean, you kind of see this again and again in AI, right, with like solving chess, and then it doesn't really matter, and solving Go.
And yeah, so you keep seeing it.
But yeah, I think it surprises you every single time.
Yeah.
I think maybe, one is we keep moving the goalposts. We're very good at that. And then two is, I think our definitions of what constitutes AGI are bad. We don't actually mean what we say when we say, oh, when we have achieved this, then we have AGI. So like, clearly, "when we have achieved IOI gold with a language model, we have AGI" is wrong.
Yeah. And I think shifting the goalposts to some extent is correct. Like, we keep Goodharting whatever goalpost we have. And I think it's kind of hard to like...
To Goodhart is like too negative. It's like, I will cheat to do what you asked me to do.
But I don't think it was cheating.
It was just scaling test-time compute.
At a meta level, I think the community, not cheating, but makes a lot of implicit decisions to go after the, you know, evals and benchmarks that matter the most.
So, SWE-bench Verified for sure.
Yeah, exactly.
But, yeah, IOI, hopefully not that Goodharted.
Well, but it kind of clearly is to some extent, right? Because, you know, most programmers in the world cannot do IOI at any decent level, but we're still struggling to automate most programming jobs.
Or like, you know, there's a lot of stuff to do.
So it's like, language models are here, like junior, senior dev, and then suddenly for IOI, you're like, spike.
Exactly. And like there's something to switch about that. Yeah, okay.
I kind of saw this at a meta level also with RL research.
So yeah, I did my PhD with Sergey Levine at Berkeley from like 2017 to '22.
And that era of RL research was super interesting because RL was super hyped, right, starting from about DQN in like 2015. A lot of the methods that people were really excited about were, you know, off-policy learning, value functions, these kind of things. And somehow that stuff hasn't really panned out, I would say, and it's not exactly clear why. But in the academic literature we thought we were making a ton of progress, and I think in retrospect I have to say that we probably overfit to the benchmarks pretty heavily. And, you know, how I see this in retrospect is that we gave ourselves a lot of new knobs to tune and then implicitly kind of tuned those to fit the benchmarks.
Everyone knew that we were doing that at some level, but I think it's hard to appreciate that it's not just happening for a single paper; at kind of a meta level, for the whole community, that's happening too.
And I think the result is that, I don't know, a lot of the RL research that came out of that era, I don't think is that used, you know?
And I think it's for a similar reason, that basically you were kind of benchmark maxing.
I will flat out say there was an RL winter.
Entire startups that were founded based on that premise at the time basically gave up.
Some of them died, some of them pivoted, whatever.
Yeah, yeah.
Yeah, I think, so because I was in academia, there was still quite a lot of excitement over it.
But yeah, it still felt quite academic.
And yeah, I in that era was a little bit frustrated because I felt like, you know, one of the pitfalls of academia is that it doesn't really reward simple ideas that work, and instead tends to reward kind of mathier ideas.
Those mathier ideas also give you these implicit knobs to tune that allow you to overfit.
While, you know, the things that actually work tend to be simple ones that have fewer knobs and just generalize to many things without...
There's just less secret sauce to it, apart from just throwing a lot of compute at it.
Exactly, exactly.
But those are things that tend to like...
It's like not intellectually interesting.
Yeah, exactly.
And from an academic point of view, it's like, oh, why am I sitting in school?
Like, yeah, I think a lot of people who do PhDs are kind of wired in a way where they want to think about interesting new stuff.
Yeah.
And, yeah, like, you know, the scaling era kind of like, you know, probably stuck to that.
Scaling era.
Is the scaling era over, since we brought it up?
I think I've just been, like, paged into that from Ilya Sutskever's interview.
I don't think it's over, but there's definitely something interesting happening, like the thing I was saying about IOI and IMO. I think we'll still continue more or less on the same track. Like, clearly, these labs are releasing their new pre-trained models, and they're still doing much better than before. So I think scaling is still happening, but I think it's happening in a different way, or it's worth seriously interrogating why it is that we're not just automating all jobs right now.
I think my view is something like: RL, the way it's applied to LLMs right now, is kind of a weird, funny tool where it doesn't really generalize beyond the training distribution that much. It generalizes to some extent, and it generalizes in interesting ways, but it's very peaky, right? Like, it can kill the training distribution completely. It can be best in the world at it with not that much effort, really. But yeah, it doesn't really generalize.
So I think what we have to do is bring the world of economically useful tasks in distribution for RL, if we commit to using RL as a tool. And, you know, it might be the case that maybe there's some cool continual learning thing or something that shifts the paradigm next year or something like that. But it really feels like, if RL is the tool, then a big thing that needs to happen is... it doesn't feel like the intelligence of the models is the bottleneck. It's more like you just have products that bring the entire context of what someone wants to do into the product so that the LLM can see it. And then you do RL on top of that.
Yeah. Have you seen GDPval?
Yeah, I've seen it, yeah, yeah.
Is that basically what you're envisioning?
Yeah, I haven't looked at GDPval closely. I actually haven't seen exactly, like... roughly, yeah.
And, recap, it's 128 tasks across like any white-collar job that takes more than 5% of GDP, right? And they basically created all the context to eval on it and evaluated every model.
Famously, OpenAI has evals, too. Whoever runs that one always finds that Anthropic's are the best.
Yeah, yeah.
It's, uh, yeah, props to them for publishing it.
Yeah, for doing that, yeah. It's actual science.
I think it's good.
But I think, in the sense of generalizing beyond coding competitions to economically useful tasks, that is it.
I think that is the...
What is more important for GG6?
Yeah, what I'd like to do is kind of, like... I just haven't read the GDPval traces closely.
It's not clear to me, you know, what does the job of an accountant entail, and what kind of context needs to be in the product?
Yeah, so you can actually do it.
They have, like, PDFs.
I see, I see.
Like, so they try to go as close to source documents as possible.
I see, I see.
Yeah.
Yeah, so, I think roughly operating on this kind of thing is what I envision.
Yeah.
Because it can't be an artificial, like, oh, let me clean up this data for you.
Exactly.
To make it easy for the LLM to process.
No.
PDF into an agent, go.
Yeah, I think that's roughly the right shape of the thing.
And I guess how I imagine this being operationalized is that you'd want to co-design the product and the model.
So that, like, the product for whatever it is...
I mean, coding is kind of maybe the easiest first step, because most of the context that you care about is just your codebase.
And being able to run stuff in the terminal and that kind of stuff.
And still, we're not that close really to automating it, necessarily.
But, you know, for all the other jobs, the context is insane, right?
It's all the conversations you've had with your coworkers, your Slack messages, you know, like, for my...
So at OpenAI, I was working on kind of hyperparameter scaling research.
And I actually wrote not that much code.
Like, grid search or neural architecture search?
No, more like understanding how different... like science of deep learning in like 2020, where it's like, oh, you have to initialize the layers in a particular way to get good scaling laws.
Kind of the analog of that for RL.
The thing is, I didn't write a ton of code.
So for the LLM, you know, writing code is not the bottleneck.
It's more like, over the course of a year, I would run sweeps, look at the interaction between different hyperparameters, and kind of build up that knowledge over like a year of just different graphs.
And to do my job, the model would also need all those things in context, you know, to successfully automate my job.
And you'd kind of want a product that allows you to bring all that context in.
Did you have to build it for yourself or?
Oh, no.
I mean, no, you know, those graphs are just sitting in my head.
Yeah.
Right.
So I think it would be pretty hard to go automate that job.
But I think what you need to do is build a product that brings that context in, and then you want to do RL on top of that to teach the models to use that context.
Yeah.
Another conversation that I think has really come to a head this year is kind of the death of one model fits all.
I feel like the point of the G in AGI is, like, one model fits all.
I think OpenAI has clearly abandoned that this year.
Oh, why do you say that?
Fidji Simo writing a blog post with the title that we are no longer doing one model fits all.
Okay, interesting.
And I think Mark Chen or one of the other senior people that are not Sam also said this in a podcast.
So basically the idea was, you started with Codex.
Someone else was doing InstructGPT.
Then we launched GPT-4, 4o, I guess, o1.
And o1 was kind of supposed to be like a reasoning one model fits all.
And then we merged the 4o and o1/o3 line into 5, and now we're splitting it out into 5 and 5-Codex again.
It's like just a weird...
Well, OpenAI is very guilty.
I mean, you know, I don't think you should interpret those as scientific facts about the universe.
It's just more like OpenAI has a tendency to ship the org chart, basically.
Yeah.
Right?
The world has the tendency.
Yeah, exactly.
So I think a lot of it's related to that.
But yeah, it's what you mean by like...
Yeah, actually, I do wonder if the current reasoning paradigm is just kind of fitting itself to this peaky-in-certain-areas thing.
I don't think it's so much a matter of model capacity, though.
It's more of another kind of organizational thing: if you care a lot about coding, you probably don't have the data to do all the other stuff.
If you had all the data, you would probably benefit from just training on all of it.
And you'll get some generalization between these.
But it's hard to find one organization that cares about all these at once.
Yeah, yeah.
Yeah. So before I double-click on the o-series at OpenAI, I do like to ask OpenAI people who were there: do you have a favorite blip story?
Yeah, the blip was crazy for me. Like, yeah, it was like Thanksgiving.
Like, everyone remembers where they were, what they were...
Yeah, yeah, exactly. I was at Thanksgiving with two OpenAI friends, actually, and then one of them, on like Friday afternoon, is like, oh, Sam Altman just got fired. We were just, like, co-working together. I'm like, what? Oh, ha-ha. Like, good joke. And then, yeah, it was crazy.
And then it was just a crazy weekend of ups and downs.
Like, you know, we thought, yeah...
Like, you signed a letter?
Yeah, I did.
It's like 95% of people signed. Yeah, yeah, yeah.
Yeah, I thought, you know, like...
I bought to Microsoft or?
Well, I think maybe I had a slightly more complicated... like, I actually do think that governance feels really important to me.
Yeah.
Because it does feel like, no matter if we hit AGI in two years or 10 or whatever, it's not clear that we have a good structure for the governance of it.
Okay.
And so it is a question that I think we probably should spend more time on. And I was, during that period, just pretty willing to be like, you know what? Let's forget about the equity and stuff. I think it's good and healthy to have a conversation about how exactly the governance should work.
Okay. You care about this. Yeah? Uh-huh. Right. So now the OpenAI nonprofit has this, like, secret shadow board of members that determine when we've reached AGI.
Yeah.
Yeah.
Better?
Yeah.
I don't have a like maybe, I would say I don't have an answer.
Like, you know, like it's just, it's not big.
Above my pay grade, but like.
Yeah.
And even even back then, I was kind of like, well, I don't care.
I do care quite a lot.
When the blip happened, one of my reactions was like, well, you know, this nonprofit board stuff, actually, if it takes such surprising, maybe erratic actions, maybe you'd rather just have, you know, a thing like the Microsoft board, which is kind of like, probably all the pensions of the world.
But serious people, and also the stakeholders are kind of the whole world, because everyone is, you know, through their pensions or something, invested in it.
Like, maybe that is a bit more of a democratic way to run things than having like seven people run it.
But yeah, I don't really know.
It feels like we haven't solved governance, like...
At all, though, right? Like, forget AI. Even stuff like unhealthy food or social media, it kind of feels like whatever the capitalistic incentive is doesn't actually capture good outcomes for society, maybe.
Yeah. So about the transition into reasoning, right? You shocked me by mentioning that the reasoning team is 300 people?
It's a...
It's kind of like, you know, when o3 was kind of structured as a product, I think it just gets larger and larger, how many people worked on it.
So, yeah, I think I've lost track of the numbers, but, yeah, a lot of people contribute to the different aspects of, like, safety and whatever, evals.
Yeah, so, like, original o1, I saw the video. It's like a dozen people, you know.
Yeah, well, even then, if you look at all the contributors, it was probably more like 50, 200 people.
Okay.
Yeah.
So, I mean, let's tell that story from your point of view, figuring out what RL means there, and I guess, was this a branch of any other prior work that you wanted to credit?
Yeah, so I think, setting the scene, I guess, you know, in like 2023, people were kind of talking about, oh, are scaling laws dead, this kind of stuff.
Every year, every year.
Yeah, yeah, but I think especially that year, it felt pretty serious, you know?
Yeah, I think in general, OpenAI is really good about having conviction in something and just really, from first principles, going after it.
And I think the people who are kind of most responsible for that are probably Ilya Sutskever and Jakub Pachocki.
I think even Dota was kind of more or less the same template in some ways, right?
And that was 2017.
And so a lot of the people there have kind of this AGI-in-their-bones point of view.
And they've basically been convinced that RL would be the way to get there. So I think for a long, long time, people have been convinced that something like that should work. And it's just that it started to work once the pre-training got good enough.
Okay.
Yeah. I think human feedback is kind of a bit of a side branch because, yeah, you can't really pour that much compute into it, right? It's like, you take the model and you elicit it to be a little bit better in terms of personality. But the people there were really convinced that at some point, you know, it's not about copying the internet. Like, you can go do RL, and that's the path to getting much better intelligence. So I think it was kind of a long line of returning to RL in different ways. And then it's just that around, yeah, 2023 is when it started really clicking. And it's kind of interesting because it's not like those initial models performed way better than the existing models, because they were smaller scale. But people were very good at being like, oh, this is kind of interesting. Like, you know, the reasoning trace that you see here is not something that you've really seen be so accurate in other models, like this one. Kind of similar to how
I think a lot of people didn't really think of GPT or GPT-2 as something that was super compelling, probably. I know that I personally didn't think that much of GPT-2. I was like, okay, whatever. And then GPT-3 happened and I'm like, oh, whoa, I feel a lot of FOMO, finishing my PhD. It's kind of that, where I think it takes a bit of first-principles conviction to decide that, oh, there's something here and we should really go scale it up. And OpenAI's really good about, once you decide that something is good, then you just scale it up all the way. Yeah. Is, was there an
internal prototype pre-o1 that was like, okay, this is the thing,
we'll fund it to scale it up, right?
Like, there usually is.
Yeah, yeah, exactly.
What was the thing?
What was the demo that really sort of sold it?
Just this, you know, running RL on even a pretty small model, producing very interesting reasoning traces and getting surprisingly good scores on math.
Yeah.
In a way that we couldn't have done without a bunch more pre-training.
And then, you know, once that looks good, then more and more resources go to just scaling up that new law.
Yeah.
And, you know, things like adding tool use and this kind of stuff.
Yeah.
I think a lot of people make a lot of headlines on the large models, but I think it's very underappreciated, the minis, how well distillation works.
Any comments or just like discoveries on...
Yeah, nothing much to say there.
I was also like not super involved in the mini stuff.
I think maybe one thing,
not exactly related to that,
but like,
it seems like,
externally people are kind of very like
oh, like research seems to come in these big leaps.
Okay.
But I think internally at OpenAI, it feels very smooth.
Like you have a bunch of experiments.
Yeah.
Some of them have inconclusive results, but maybe you stack them.
Yeah, exactly.
You stack them and just like you keep scaling,
you keep like having like different runs that, you know,
get a little better each time.
Okay.
So I think that's maybe one other aspect.
That's like a little underappreciated is that like,
I don't know, like in the media,
there's just these wild swings between like,
Oh, it's so over, like, Google is winning.
Yeah, exactly.
And I think like internally at Big Labs, it's just kind of like, oh, we're just like chugging
along.
Like maybe this month is a little better than last month or something.
But it's like not as crazy up and down.
I think the question is, there used to be more of this.
And now I know there's less, which is, well, the stuff we've released, we're like, you know,
internally we're like six months ahead.
But it's like part of the reason why people at OpenAI weren't that excited about
ChatGPT's launch was because you already had GPT-4.
They're like, oh, we just put this out.
Like, we're already way ahead.
I think now people are just releasing things as they have them.
I think, yeah, especially because there's some like competitive pressure, right?
Yeah, yeah.
I think people are probably pretty worried that, like, if you let a lead linger for too long,
that will, like, grab a lot of market share.
Like, I don't know, like, Nano Banana Pro right now is probably like, you know, it's like,
pretty good.
A month.
So I would say, like, now the lead, internal to external lead time is about one to two months.
Yeah, yeah.
Which is exactly.
Tiny.
Pretty short, yeah.
Anything else on reasoning side?
I guess you can talk about,
specifically, the work on coding,
anything that surprised you or is an external misconception
on the o1/o3 side before I go to Cursor?
Well, not really.
Like, yeah, I mean, it's pretty cool.
Like, I think it felt already by like maybe early 2024.
Like, oh, wow, like this recipe like really works.
And we can see how far we take it.
And so I think, you know, it was like very steady progress.
And by that point, it was probably pretty predictable that we could, like, you know, really, like, smash, like, you know, things like IMO or IOI.
Yeah, one funny thing that kind of happened is while this was happening, I went to this conference called The Curve, which is about, like, kind of AI progress.
And Joseph Gordon-Levitt.
Yeah, I went last year.
This was before the o1 stuff was released.
Yeah.
And I, like, went to this thing where people were kind of making bets on where we would be on Epoch AI's, like,
the math, the Epoch math exam and, like, Humanity's Last Exam and stuff.
And their estimates were like, oh, we'll be at like 10, 20% in like 2027.
And I think at the time, there was like, you know, models internally that were like already
better than their estimates.
So it's like off by like, you know, two years or something.
And the interesting thing is like those are also people who are kind of like, you know,
predicting that there would be like Dyson Spheres by like 2035 or something.
Okay.
Simpsies, so the current estimate is way under.
Yeah, they're too pessimistic in the short term, too optimistic the long term?
Yeah, well, I don't know if, like, there might be Dyson spheres by like 2035.
Like, I don't really know.
But I think that is like one interesting aspect is that, yeah, I think people still seem pretty miscalibrated in different ways.
I do really appreciate how that community makes predictions, though.
Yeah, because I think most of the rest of the world just kind of like cynically says like,
I saw this the whole time.
Like, yeah.
So I do appreciate that.
Is this EA adjacent?
Yeah, I think I think it's, yeah, exactly.
It's like that.
It's like that group.
Yeah, yeah.
I like that they like to sort of register their opinions ahead of time.
Yeah.
And I think broadly, the people who've been, you know,
the capabilities predictions in that group have been broadly correct if you look, you know,
from like 2015 to 2020 or something, like where I think a lot of people kind of thought
that AI was like a sham or like, you know, not really going to be that useful for a long time.
And actually, you know, it is, it's somewhere around like a 2030-ish thing that like you will probably
reach like human level intelligence. Yeah. It's weird. So like I feel like a skeptic when I keep saying,
like everyone always predicts that AGI happens in their lifetime. And then very convenient for whoever.
And like we have a consistent view of history where you see the people in the 1800s and 1900s
making predictions. It somehow always lands in their lifetime, whatever the thing is. But like,
this time it might happen.
Almost surely, right?
Like, I'm pretty sure.
Yeah.
Yeah.
Yeah.
So it's an interesting observation, like how different are we from our predecessors in terms
of developing of technology?
Yeah.
Did the DeepSeek moment this year, also this year, crazy, change anything internally?
Not really, yeah.
I think that was, I think more so just, like, surprised that it created such a moment.
Like, it was kind of confusing, right?
It was like DeepSeek shows that
Nvidia chips are actually more useful than previously thought
and like Nvidia's stock like goes down a bunch.
Like it was kind of like...
I think it's more like, okay, well, I'll steelman
that side, which is, well, you don't need the top of the line
Nvidia chips.
You can just use the sort of previous generation
or the shackled ones they sell to China
to do an equivalent amount of work for a recent model.
I see.
Yeah, but then it was also, I guess the feeling in OpenAI was that, like, well, I think we had a better model already at the time, right?
So, and it was quite valuable.
Like, like smarter models were clearly quite valuable.
So you kind of wanted to be at the frontier.
Okay, so I wasn't quite framing this as like a race dynamics thing between labs.
It was just also more like, well, were they right?
Was their approach right?
They had R1-Zero, which is kind of a really cool branch.
So more like commentary on what we learned about RL this year.
Yeah.
Yeah.
Well, it does seem like basically a lot of the labs have kind of like converged onto some similar-ish way of doing RL.
And they're all kind of back at the same level of like frontier again.
Like even the Anthropic models like Opus 4.5.
It has this kind of like, there's this like ARC-AGI-2 plot that looks exactly like the OpenAI ones.
Right?
Like so I think everyone seems to be
converging on a pretty similar form of RL.
Yeah, it's kind of interesting.
I think people basically figured out in one way or another
to achieve more or less the same thing.
Yeah.
Let's talk about the move to Cursor.
Yeah.
Why is Cursor accumulating, enjoying so many cool RL people?
Yeah.
Yeah, I've actually kind of already talked about this so far, I guess.
So, yeah, I think from the perspective of Cursor,
it's like, you know, nice not to be so dependent on, like, external labs
for everything.
And like, I think there's also like unique opportunities to co-design the product
with the model in ways that we couldn't do unless we actually, you know, built the model
ourselves and like had access to, yeah, making it good.
Yeah.
So, yeah, that's kind of like broadly why Cursor is so excited.
Okay, I'll push back a little bit, right?
OpenAI has infinity resources.
Infinity data, has Codex.
You could have just stayed.
Yeah, yeah.
Well, actually, right around when I was leaving is when, like, I think people started actually, like, using Codex a lot.
So that kind of, like, happened right after I left.
So that's kind of funny.
So mostly people were using Cursor internally, maybe a bit of Windsurf because it's left over from the previous thing.
Sure, yeah, yeah, exactly.
So it wasn't that obvious.
But actually, I think more to the point, this thing I was saying about, like, RL is kind of a tool that doesn't really generalize that well.
So what you want to do is bring the entire like kind of test distribution inside your training distribution.
I saw the opportunity to do that at Cursor kind of like directly.
I think the Cursor folks were also just like really excited about that kind of vision.
And it's just like a small place where, you know, like the product people sit like right next to the ML people.
And I think there's a lot of potential there.
You can kind of see that.
Recently Jacob Jackson had this blog post about like online Tab, where like we're doing policy...
It's every two hours.
Exactly.
Like a policy update
every two hours
or something.
And I think that's the type of thing
that, you know,
I think it's like a little hard to do,
it's like very hard to imagine
that at OpenAI, for example,
just because like, you know,
the product is this like kind of complicated thing.
And also like the product people
and RL people are pretty like,
you know,
on like different sides of the org.
I think if you put your mind to it,
you would.
It's like, you know,
Tab is an autocomplete.
It's a smaller model.
It's, you know,
it's not as complex, I guess,
as an LLM.
Yeah, but I don't think that's really
this, like, you know, I think we, I don't think that's why Cursor was able to do it. It's actually
more about, like, just the org itself being kind of like smaller and a bit more like focused.
Yeah. Well, I mean, since you're indulging this, I think the question about continual learning,
which obviously is a big theme. It's always been a big theme, it's bigger this year.
Is, well, don't you need to curate your data? You can't just like chuck whatever your
users are doing straight in because that tends to get you towards the middle of the
distribution, when actually you want to spike it.
I guess it depends how you're thinking about continual learning.
I mean, I don't know, like, humans are quite good about using the bad data too, right?
Like, you can see someone doing something dumb and decide, like, you're not going to do it.
Filter it out, yeah.
Yeah, but, like, it's not even actually filtered out.
Like, you have, you know, presumably some kind of value function that, like, says that if you see someone touch a hot stove, like, you're not going to go, you don't need to.
You don't need to, like, it's not just filtering it out.
You're actually not going to do it, right?
You could rediscover hot stoves from first principles.
Yeah, but, like, you don't need to.
So I think there's something pretty deep there
Yeah, it seems like we're kind of like a few orders of magnitude of
Like kind of data efficiency basically away from like that kind of like
You know you do something once or like you make a mistake
Like you yeah you you introduce like a bug in your code
You're not going to do it again
But the models will happily just like keep doing it
Even within the same context but definitely you know of course across context
Yeah
So I think there's something like interesting
deep there is like maybe, yeah,
I suspect that it will be kind of like paradigm shifting
in the next like year or something,
but I have no idea like, you know, what it might be.
Yeah.
So primarily you've worked on Composer,
tab, and maybe search?
So I have, I've actually just worked on Composer.
Yeah, and that's kind of like the main focus of the company, basically,
or like the ML group is shipping a better,
shipping a better Composer.
Can you describe, I guess, the impressive,
brag a bit
about the ML group.
Yeah, yeah.
I mean, I think the ML group is great.
It's like, you know, it's just like 20, 25 people.
And, you know, I was like, honestly, like, pleasantly, like, very, very surprised at, like,
how good Composer is, like, given the size of the group.
And, you know, it's not, like, a big research lab yet.
And, yeah, I think it's, like, a really good model.
You can kind of see that in the reception.
And I think it's kind of the start of hints of, like, co-design with a product in
some ways because I think one of the reasons that people really like it is it's smart enough
that you actually want to use it and it's also fast so you kind of like stay in the loop with the
model while you use it because I think all the other smart models are kind of so slow that
you want to go kind of context switch away and come back and that sucks you know like just as like a
programmer it just sucks to kind of context switch it kind of like gives you ADHD like it's like
really terrible yeah I agree and I think yeah it's like one
step in the direction of like being able to be more sync. I think that's, like, you see the whole
company is just really, you know, full of people who want to, you know, code, even like the co-founders,
you know, like actually, the co-founders are often some of the best, like, like, high-taste
testers, which also kind of gives you a lot of like reassurance that you're going to ship good
stuff. So yeah. Any example task that like maybe Composer doesn't solve yet, but you're really
motivated to solve? Yeah, ironically, I feel like I am actually like a low-taste
tester in some ways because I don't know
like you know I just like write like slow like
machine learning code and just like think about
algorithms and stuff all day.
I think more broadly I'm
super excited about
co-designing the product so that you can
actually you know not just
right now we're getting better and better at
like answering user prompts
and I think that's why Composer 1 is like quite good
but you know what we're really
aiming for is like more like you know
automate software engineering as a process
where you like write code you
go look at Datadog, look at what's happening, then come back and like, you know, maybe
have some hypotheses about what's better, like, rerun stuff. I think that's the type of thing
that we actually want to make the model do. And I do think that Cursor is kind of like uniquely
positioned to do that in the sense of like, you know, if we can kind of, if a lot of what a
software engineer does kind of ends up in the product, I think we can use that to like get better
and better at, you know, not just writing code, but kind of like the whole job. Yeah, I think that's
inspiring. Just to double-click on just any sort of RL insights, Sasha and I have talked a lot
about like the internal tooling that you've had for all the like the cluster visualizations.
Is that helpful? Is that what every lab has? Yeah, I think the tooling at Cursor is actually
really good. I think because, you know, it's just kind of like a, people are just down to like
vibe code stuff. They like to test their own stuff. So we just have like a lot of good tooling
where you can have like an SSH session into like our own like user environment or something
and like you know see if like code runs the way that like users got it to run like this kind of thing
I think that's actually yeah quite nice I think basically one of the big lessons in ML in general
is that you want to be like really close to your data and understand your data well and yeah I think
we're like kind of yeah again kind of like uniquely positioned to do that well
especially because all internal tooling you're not buying anything yeah it's just
just like internal.
And part of it is just that we're also working on a product where you can understand
it really well because it's a code product.
Well, like, you know, if, I don't know, at OpenAI, if I was to look at like a biology
question, I have no idea, like, you know, what this is about.
Yeah.
Yeah, yeah, yeah, yeah.
Interesting.
Okay.
So I think that's a good overview of everything.
I guess other than the, we covered OpenAI and Cursor, just interesting RL work that other
people are doing, that you're like still mulling over.
it's influential to your thinking, good papers, anything like that.
Yeah, you know, unfortunately, I've kind of gotten in the habit,
especially at OpenAI, of, like, not reading that much external work
and just, like, reading people's, like, Slack posts internally.
That's, like, the main, like, way to, like, you know, like, learn new stuff.
No super inspiring recent things have popped up to me.
I do think that this, like, kind of vibe of, like, yeah, continual learning
like does feel like, I think there's something super interesting there,
And like it feels like maybe even in academia people could make like a big crack at it.
And continual learning specifically meaning kind of what Tab is doing.
Yeah, maybe what Tab is doing, but also just like kind of like in context learning,
but with like infinite memory or something so that you don't, once you experience something in context,
it should just like be in your weights and you shouldn't have to like make that same mistake again,
that kind of thing.
Why do you think there's, okay, so it should be in your weights, but there's a finite
capacity for the weights to remember things.
You will forget things if you do that too much, right?
Not really. I mean, you know, you start out by memorizing or, you know,
like learning from trillions of tokens. Yeah.
Now you're going to experience like thousands or maybe millions of tokens and somehow like,
you know, we can, and those, the million tokens are kind of into.
And you only need one epoch. Yeah, exactly.
So crazy.
Yeah. So it feels like if you could learn enough about those million tokens that you're actually
in deployment on, I don't think you should need, like, I don't think there's a risk of overloading
the capacity of your model, right? Because you can train on a trillion tokens, and it's like, fine.
Right, right. So proportionally it's a drop in the water. Yeah, exactly.
Yeah. Unless you run it for years. And, you know, at some point it's sort. Maybe, yeah.
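The proportion being described here can be put into rough numbers. This is only a back-of-the-envelope sketch: the token counts are the round figures mentioned in the conversation ("a trillion", "millions"), not measurements of any real model, and the function name is made up for illustration.

```python
def deployment_fraction(pretrain_tokens: float, deploy_tokens: float) -> float:
    """Share of all tokens seen by the model that came from deployment."""
    return deploy_tokens / (pretrain_tokens + deploy_tokens)


# Round numbers taken from the conversation, not real training figures.
PRETRAIN_TOKENS = 1e12  # "a trillion tokens" of pretraining
DEPLOY_TOKENS = 1e6     # "maybe millions of tokens" experienced in deployment

fraction = deployment_fraction(PRETRAIN_TOKENS, DEPLOY_TOKENS)
print(f"deployment share of all tokens: {fraction:.2e}")
```

Under these assumed numbers the deployment data is about one millionth of what the model has already absorbed, which is the "drop in the water" point. The ratio says nothing about how efficiently a model can actually learn from those tokens.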
Yeah. So basically, I find it very curious. I've only had one podcast on information theory of
language models. Like, what is the theoretical capacity? How much are we using? And you should probably
track that. Yeah. Yeah. Yeah. Yeah. That's a good idea. Yeah. Like, treat
the weights, if you want to store things
in weights, okay, if it is a hard drive,
what's the capacity of the hard drive,
how much can be stored in there? We know the capacity.
It is the number of bits that, you know,
the parameters
physically cannot store more than that.
Yeah, yeah, yeah. And it's, yeah,
I've heard that there's this kind of like
someone recently at Cursor, Jacob,
kind of brought up this view. I don't know if it's like a more
public view that's like, oh, there's kind of like a
hard drive view of, you know,
neural networks and kind of like a CPU view of
neural networks where, you know, is what's happening in the weights, like, yeah, memorizing stuff?
Or is it like you're, like, having some, like, few circuits that do a lot of work?
Yeah.
This kind of thing.
And, yeah, I don't know.
Yeah.
You know, I would love to, uh, yeah, there's, like, actually so many of these kind of
more sciencey questions that I would, like, love to explore some time.
But then it really kind of conflicts with, like, empirical stuff, you know?
Like, unfortunately, at any given moment in time, it doesn't seem like the most direct route for, like,
improving something, especially in the short run,
but even in the next couple of years,
is understanding some of these questions.
Yeah, I mean, I guess this is technically supposed to be the role of academia,
but it's also hard to explore those ideas there without enough compute.
But yeah, actually, I would love to, like, go at some point,
you know, and return to exploring these kind of like fundamental science ideas.
Okay.
This is a, I'm kind of springing this on you, so you can take some time.
What is a good RL interview question that if somebody can answer,
they should join Cursor immediately?
Ooh, it's a hard question.
I assume you do interviews.
Yeah, yeah.
Well, actually, at Cursor, we do like work trials, and it's like two-day work trials, and I actually think that that's like more representative.
Because you plug in and see how they behave.
Exactly.
So I actually think it's like more valuable.
This is honestly less of a thing about how you understand RL and a bit more like were you around in the like 2017 to '22 era.
But it's like, why is off-policy RL unstable?
is kind of, I think, like, a good question to, like, yeah, dive into.
I don't actually know, so I'm digging into it.
Yeah.
Cool.
Thank you.
That was a great conversation.
Do you have any sort of call to action?
Yeah, I mean, you know, we are definitely hiring at Cursor.
So, yeah, if you're interested in working on, especially, like, kind of data and rewards for code,
I think that's, like, a huge need.
Yeah, please, like, get in touch.
Yeah.
That's it?
Yeah.
Thank you.
Sweet.
Thank you.
