No Priors: Artificial Intelligence | Technology | Startups - Asimov: Building An Omniscient RL Oracle with ReflectionAI’s Misha Laskin
Episode Date: July 17, 2025

Superintelligence, at least in an academic sense, has already been achieved. But Misha Laskin thinks that the next step towards artificial superintelligence, or ASI, should look both more user- and problem-focused. ReflectionAI co-founder and CEO Misha Laskin joins Sarah Guo to introduce Asimov, their new code comprehension agent built on reinforcement learning (RL). Misha talks about creating tools and designing AI agents based on customer needs, and how that influences eval development and the scope of the agent's memory. The two also discuss the challenges in solving scaling for RL, the future of ASI, and the implications of Google's "non-acquisition" of Windsurf.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @MishaLaskin | @reflection_ai

Chapters:
00:00 – Misha Laskin Introduction
00:44 – Superintelligence vs. Super Intelligent Autonomous Systems
03:26 – Misha's Journey from Physics to AI
07:48 – Asimov Product Release
11:52 – What Differentiates Asimov from Other Agents
16:15 – Asimov's Eval Philosophy
21:52 – The Types of Queries Where Asimov Shines
24:35 – Designing a Team-Wide Memory for Asimov
28:38 – Leveraging Pre-Trained Models
32:47 – The Challenges of Solving Scaling in RL
37:21 – Training Agents in Copycat Software Environments
38:25 – When Will We See ASI?
44:27 – Thoughts on Windsurf's Non-Acquisition
48:10 – Exploring Non-RL Datasets
55:12 – Tackling Problems Beyond Engineering and Coding
57:54 – Where We're At in Deploying ASI in Different Fields
01:02:30 – Conclusion
Transcript
Hi, listeners. Welcome back to No Priors. RL is back with a vengeance, and one of the most
talent-dense new research labs has a product release, a new code comprehension agent.
Reflection AI's co-founders Misha Laskin and Ioannis Antonoglou worked together as leaders at Google
DeepMind on groundbreaking projects like AlphaGo, AlphaZero, and Gemini.
I talked to Misha about building universal superhuman agents,
the trickiness of reward modeling,
bringing all knowledge work tasks under the data distribution,
how RL for language and robotics differs,
the Windsurf non-acquisition, and the landscape from here.
Misha, welcome.
Thank you for doing this.
Yeah, thanks, Sarah, for having me.
So it's been a wild year and a half or so since you guys
started the company. Is that about right?
Roughly a year and a half, maybe a bit less, but I'd say it's
largely correct.
Well, can you just start by describing it? You've said that the
company's mission is to build superintelligent autonomous
systems, and we've talked before about why this is the
moment in time when that's possible. What is different about that from building just superintelligence, which is now a sort
of more popular, ambitious goal?
At a high level, it's fairly synonymous.
But maybe there are different ways of thinking about how to build super intelligence and
what that might look like.
I think on one end of the spectrum, there's an academic way to look at it, in which sense superintelligence has, to some extent, already been achieved. So
AlphaGo was a superintelligent system, and there were other systems built during that time
that were superintelligent in narrow domains. And I think you can go
for the goal of building a very broad superintelligence by
locking yourself up in, not really an academic lab, but
an industrial lab that is decoupled from product or customers, and
maxing out all the benchmarks that are out there and building superintelligence that
way.
I think that is one approach.
I think the other approach is to kind of think about
what is superintelligence more concretely?
How is it gonna be deployed?
What is it actually gonna look like in people's hands?
And build backwards from there.
So I would say that that approach
is more about co-designing products
and research together.
Now, the benefit
of that approach is that you're optimizing for real problems. The con is that you have
to be a lot more focused, right? Because your product defines the sort of capabilities
that you want to draw out of the system, and you have to start out a lot more focused before
expanding across other product categories and other capabilities. So on the spectrum between companies that treat superintelligence as just a research
project, and figure out what the product is once it's built, and companies that co-design
products and research together to build very powerful systems in what I would call ASI-complete
categories:
you can pick something that is maybe too small a category to draw out a superintelligence,
but as long as you pick a category that is big enough to be ASI-complete,
I think, and this is our approach at Reflection, it makes a lot more sense
to be focused and co-design those two things together, the product and the research.
I want to come back to choice of initial problem in a minute. In terms of just having the intuition
and the confidence to say, like, we can go do this as a team, we're going to recruit great people
and go build Reflection. You and your co-founder, Ioannis, were working on Gemini together in key
roles before, and previously you had been part of Pieter Abbeel's lab, and he's an amazing researcher as
well.
You described yourself to me as having, I believe the term you used was, somewhat muscled your
way into AI and deep learning from originally a physics background.
How did you decide to go work on this and end up in Pieter's lab?
Yeah, as a kid, I became really interested in physics, theoretical physics. It was
probably a byproduct of my background: I'm Russian, kind of Israeli-American, and moved around. And then
when I landed in the States, it was in a desert in Washington state, learning a new language.
And so I had a lot of time on my hands. My parents had the
Feynman lectures in their library, and so I spent a lot of time just reading
what was on the shelf, bumped into that, and got really interested in physics.
How old were you?
My interest in physics started probably around middle school, and
it really became the thing I wanted to do in high school. And the reason physics was so interesting was because it
kind of seemed like the science that was at the root of many of the things that became
impactful. So I was reading about the history of the transistor, and it was invented by
a group of theoretical physicists. I was reading about how GPS works, so it turns out you need special relativity in order to accurately account for spatial coordinates using GPS. And so I felt that
physics was kind of the root science to pursue. I went in and studied it, got my PhD in it. At
the same time, I started seeing deep learning take off and really saw AlphaGo happen.
And my sense was that I want to pursue the root science, but there is such a thing as
the root science of our time. I think physics as a field is very interesting,
but it's crystallized a lot more than a new, dynamic field that was being born out of nothing.
And AI to me felt like it was going through the moment that physics went through maybe 100
years ago. When I did problem sets in physics, the most exciting stuff that
I was working on there was basically the things that people were discovering 100 years ago.
So I saw it kind of happening in front of my eyes, and I just decided that that was the science to bet on.
And in particular, it was AlphaGo
that inspired me,
because it was just unbelievable to me
that you could train a neural network
to have such immense
reasoning capabilities, right?
This thing was superintelligent
within the realm of Go.
Yeah, I decided that I needed to get myself into the best reinforcement learning lab I could.
And Pieter's lab was that lab for me.
And then you and Ioannis were working specifically on RL at Gemini.
That's right. So Ioannis, my co-founder, was the
overall RL lead for Gemini at the time, for 1 and 1.5.
I was working very closely with him on his team.
It was a really exciting time because we went,
both of us from being reinforcement learning researchers,
to training large language models at scale.
At the end of that project, we saw what was to come.
Gemini 1 and 1.5 landed,
and it became pretty clear to us that
the next paradigm, and effectively the final paradigm that we need to have in place before
what people used to call AGI, or now I think the goalposts have shifted to ASI, is reached,
is just figuring out how to scale reinforcement learning on top of large language models.
And the first instances of that have been happening
over the last year.
I think we're still actually a lot earlier than people think.
But there is a wedge in and things have started to work.
Yeah, I definitely want to talk about what you think
is solved and unsolved here.
The entire field has clearly gotten
more focused on deep reinforcement learning
over the last 18 months.
You have this huge product launch this week with Asimov.
Can you just describe what it is?
So Asimov is the best code research agent in the world.
It's a comprehension agent, meaning
that it's really designed to feel almost like a deep research for large code bases.
The way a developer is supposed to feel when interacting with it is effectively like they
have a principal level engineer who deeply understands their organization at their fingertips.
So it's very different from the existing set of tools, which focus primarily on code generation.
Every single coding tool has some code generation and some comprehension aspect.
But we spent a lot of time with our customers
trying to understand why coding tools fall short here,
and this is enterprise-specific,
so I think the world is different with startups.
Within enterprises,
when they're adopting coding tools
and you see the impact that this is having
on their actual productivity, I
think it's much lower than people expect.
In fact, it's sometimes negative, sometimes negligible.
Did you see the recent METR report on that?
Yeah, the METR report was very close
to what I've been hearing when talking to engineering
leaders within larger organizations.
And it's not just enterprises, it's, I would say, growth stage startups.
It's any kind of engineering organization that has a sufficiently complex code base
and sufficiently large team that no one engineer can have the entire code base kind of in their
heads.
And so Reflection is one of those places as well.
We use our product actively because training large
language models is complex.
And there's the large language model code base.
There's the product code base.
Knowledge is scattered across engineers.
It's not just in the code base.
It exists in your chats and project management tools
and other places where knowledge lives.
And so what we're effectively building towards
is this kind of omniscient oracle for organizations
that you can go in, ask any question
at any level of complexity and it'll provide you
an answer at the level of what that principal-level engineer
would have given you or in the future
as the product expands to other categories, what the person who's most embedded in the organization understands.
And of course, once you have that solved, it begets much more reliable agents that act
for you as well. But I think the world today is focused on, I would say, 80% action,
20% understanding.
So 80% code generation, 20% comprehension.
The actual problem is exactly the opposite.
That when you look at what an engineer does in an
organization, 80% of their time they're spending trying
to comprehend complex systems and collaborating with
teammates.
And what is collaboration?
It's usually someone asking someone else a question about
a system that they don't know. So that I think is kind of the problem at the heart of what would prevent a super intelligence
from actually working within an organization. It's really this kind of understanding and being
able to ingest from a lot of sources of information and from the team. Once you have that, then the action part, I think,
becomes, I don't want to say trivial, but a lot easier.
Like to me, it seems like really 20% of the problem
is teaching these agents how to act,
and it's more or less solved.
That definitely squares with both my understanding
of engineering and then my experience
with coding agents personally, right?
If you think about, I don't know,
the context load time of just
trying to understand a new system, or code anyone else has written, or code your agent has
written. In the end, you get a very naive implementation that, if the agent had reasoned
through it with context of the system, it never would have made such a mistake, or a works-in-my-environment type problem.
And so I think that very much mirrors my intuitive understanding of engineering here.
That's great as problem formulation.
What makes Asimov different in terms of ability to understand better versus just generate
code?
There are a few things.
So I think this is where, and why, it is so important
to co-design research and product. Because as a researcher, you'd go in and say the
answer is entirely in the agent design or the model or something like this. And as a
product person, you would say, well, it's in these product differentiators,
like being able to draw not just from your code base, but knowledge that lives
in other sources of information,
or being able to learn from the engineering team
to offload their tribal knowledge.
So an engineer can go in and teach Asimov,
like, hey,
when we say environment jobs on our team,
we mean this specific thing,
this specific kind of job.
So now when another engineer asks a question
about environment jobs in the future,
the system just knows what they're talking about.
A lot of knowledge is stored in engineers' heads.
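As a rough illustration of that kind of taught, team-wide knowledge, here is a minimal sketch of a shared glossary that gets injected into an agent's prompt. The `TeamMemory` class, its naive keyword matching, and the prompt format are hypothetical; they are not Reflection's implementation.

```python
# Hypothetical sketch: a tiny team-wide "memory" that stores taught definitions
# and injects the relevant ones into an agent's prompt. Illustrative only.
from dataclasses import dataclass, field


@dataclass
class TeamMemory:
    # term -> explanation taught by an engineer (e.g. "environment jobs")
    glossary: dict[str, str] = field(default_factory=dict)

    def teach(self, term: str, explanation: str) -> None:
        """An engineer records tribal knowledge once, for the whole team."""
        self.glossary[term.lower()] = explanation

    def relevant_notes(self, question: str) -> list[str]:
        """Naive retrieval: surface any taught term mentioned in the question."""
        q = question.lower()
        return [f"{term}: {note}" for term, note in self.glossary.items() if term in q]


def build_prompt(memory: TeamMemory, question: str) -> str:
    notes = memory.relevant_notes(question)
    context = "\n".join(notes) or "(no team notes matched)"
    return f"Team notes:\n{context}\n\nQuestion: {question}"


memory = TeamMemory()
memory.teach("environment jobs", "Our nightly jobs that rebuild RL training environments.")
print(build_prompt(memory, "Why are environment jobs running slowly today?"))
```

A real system would use semantic retrieval and per-team permissions rather than substring matching, but the idea is the same: knowledge taught once becomes available to every teammate's queries.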
And I think you need both of these things.
You need to understand your customer really closely
and develop differentiated product,
almost independently of the models that are powering it.
But then you also need to innovate on the research
in terms of agent design and model training
to actually drive the capabilities
that you want to see out of the system.
And this becomes an evaluation problem,
which is basically at the heart of any frontier lab as well.
This is, I think, the least spoken about part
of what frontier labs do, but possibly the most important,
which is figuring out how they evaluate.
What makes Claude magically feel better at code than another model out there?
They did something right in their evaluations.
So when you look at this problem specifically, there are different capabilities that you
need to train.
And what we do is really post-train models; we really focus
on post-training today.
One of those capabilities is long context reasoning.
Now when I say long context reasoning, I actually mean kind of small models with very long contexts
that are able to go into giant code bases, sort of suck up as much information as they
can, reason over it, and output the relevant stuff, basically.
So it's almost like neural retrieval.
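A minimal sketch of what that "neural retrieval" step could look like: pack a huge code base into long-context chunks, ask a model to extract only what is relevant to the query, then reason over the merged findings. The `llm` function is a placeholder stand-in, not a real API, and the chunking heuristic is an assumption for illustration.

```python
# Hypothetical sketch of long-context "neural retrieval" over a big code base:
# split files into large chunks, ask a long-context model to extract only the
# snippets relevant to the query, then merge. `llm` is a placeholder, not a real API.
from pathlib import Path


def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a long-context model call")


def chunk_codebase(root: str, max_chars: int = 200_000) -> list[str]:
    """Greedily pack source files into chunks small enough for one model call."""
    chunks, current = [], ""
    for path in sorted(Path(root).rglob("*.py")):
        text = f"# file: {path}\n{path.read_text(errors='ignore')}\n"
        if len(current) + len(text) > max_chars and current:
            chunks.append(current)
            current = ""
        current += text
    if current:
        chunks.append(current)
    return chunks


def neural_retrieve(root: str, query: str) -> str:
    """Map over chunks extracting relevant snippets, then reason over the merge."""
    findings = [
        llm(f"Extract only code and facts relevant to: {query}\n\n{chunk}")
        for chunk in chunk_codebase(root)
    ]
    return llm(
        f"Answer the question using these findings.\n\nQuestion: {query}\n\n" + "\n".join(findings)
    )
```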
There are capabilities like tool use and multi-hop reasoning.
So this is more for, you have your agent
and it's designed with some tools.
And there are two ways of training agentic models.
One is in this very general way
where you just train it on thousands of environments
and make it like the most general agent possible. And that is almost like the pre-training
of agents. That's sort of what a frontier lab does. That's what
the new release of Kimi K2 does. And that's definitely part of it; it gives you a nice
general base to start from. But then you want to drive a capability depth-wise. Like if you really
want this reasoner that has search tools, the ability to call these long-context
reasoning models, and other tools it might want to interact with, like, oh,
when do I read from JIRA?
When do I read from another tool?
This is kind of a reasoning problem.
If you train with those specific tools in mind,
that's typically what people refer to when they say tool use.
They actually train for a specific set of tools
and really drive the capabilities for those tools.
So these are the kinds of research problems
that you need to solve in order to build the overall system
that's the best in the world.
It's not any one thing.
It's all these things combined.
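To make the tool-use piece concrete, here is a bare-bones sketch of an agent loop deciding which of a fixed set of tools to call (code search, an issue tracker, a long-context reader). The tool names, the `llm` stand-in, and the JSON action format are illustrative assumptions, not Asimov's actual design; the point is only that training against a specific, known toolset is what "tool use" usually means.

```python
# Hypothetical sketch of a tool-use loop: the model decides which of a fixed
# set of tools to call next (search code, read the issue tracker, etc.).
# The `llm` function and the JSON action protocol are illustrative only.
import json


def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a tool-trained model")


TOOLS = {
    "search_code": lambda arg: f"(code search results for {arg!r})",
    "read_issue_tracker": lambda arg: f"(issue tracker results for {arg!r})",
    "read_long_context": lambda arg: f"(long-context summary of {arg!r})",
}


def run_agent(question: str, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\nTools: {list(TOOLS)}\n"
    for _ in range(max_steps):
        step = llm(transcript + "\nReply with JSON: "
                   '{"tool": <name or "finish">, "arg": <string>}')
        decision = json.loads(step)
        if decision["tool"] == "finish":
            return decision["arg"]  # final answer
        result = TOOLS[decision["tool"]](decision["arg"])
        transcript += f"\n{decision['tool']}({decision['arg']}) -> {result}"
    return "Ran out of steps."
```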
And some examples of systems that are being trained
for a specific set of tools,
the thing that comes to mind is the Grok 4 release,
and they kind of showed a plot of their general model.
And then the model that was trained with a tool
to basically climb on Humanity's Last Exam.
And there was some big noticeable difference
between the two.
Now, that's great, but I think the downside of that
is: does Humanity's Last Exam actually
matter in any meaningful way for an end user?
And I would argue there's some weak correlation,
but the answer is most likely no.
And so you have to build the tools
and train for the things that users actually want.
I think that there's sort of no way around that.
What can you share about how you evaluate,
either like technically or philosophically,
that makes Asimov's performance great?
This is sort of why it makes sense to do something
like this as a startup.
So the only advantage that you'll ever
have as a startup over a big incumbent,
especially when there are such talented teams out there,
is kind of focus and velocity against the thing
that you're focused on.
Now, if you want to be playing
in what is arguably, I think, the biggest category in AI,
which is coding, then you need to have the talent as well
to do it, but what do you do if you don't have
the billions of dollars to pre-train models?
The only way we can win, I think, is by being very focused.
So the way I would describe what
it looks like to work on a big model within an incumbent lab is that you are one of hundreds
of evals. When you look at the model card for, let's say, the o1 paper
that came out, I think, last year, and look at the distribution
of what most people worked on on that paper, it was evals.
So you're one of many people doing all sorts of evals,
and by spreading yourself in that sense,
you get something that's general,
but it's spread fairly thin.
As a startup, and a startup that has a very focused product,
that's not being too diffuse, and that's pretty opinionated about what it is it's
building, your evals come from your customers. You know, in the startup lore,
Paul Graham would tell you to go talk to customers, like half
the time build product, half the time talk to customers. I think in the AI age,
it's: develop your evals based on what customers are saying and what they're
doing. So you have to work with your customers
to look at what prompts it is that they're trying to solve.
What general questions are they trying to unlock?
So there's very specific pain points
that we've identified, like onboarding being one of them.
Like in a big company,
it takes months to onboard an engineer.
So how do you develop evals that accelerate the onboarding
of an engineer from months to hopefully just a couple of weeks now that all the questions
that they had, they can just ask Asimov and be able to onboard much faster. So I think there's no
silver bullet other than coupling to the information coming from customers, but then being very scientific in the evals that
you develop across them. So you have these, let's say, customer needs, let's say onboarding and,
you know, a bunch of others. And then you have your system capabilities, which is, well, what do you
need in order to provide a good experience there? Well, this customer is being onboarded onto a
giant code base, like it has, you know, it might be a codebase that on its own is like 100 million tokens
or something.
Well, then you need to figure out some way to reason over that giant codebase.
So you have kind of a long context reasoning capability. Or you look at your agent
and see what's preventing it from satisfying the query from a user.
And so you kind of work backwards and reverse engineer
from what a user is asking for to
what capabilities you want to drive in your system.
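A minimal sketch of what "turn customer questions into evals" could look like in code: real onboarding questions become test cases, and a judge scores the agent's answers against a senior engineer's reference notes. `agent_answer`, `llm_judge`, and the case fields are hypothetical placeholders, not Reflection's eval harness.

```python
# Hypothetical sketch of a customer-derived eval: real onboarding questions
# become test cases, and an LLM judge scores the agent's answers against
# reference notes. `agent_answer` and `llm_judge` are placeholders.
from dataclasses import dataclass


def agent_answer(question: str) -> str:
    raise NotImplementedError("stand-in for the agent under evaluation")


def llm_judge(question: str, answer: str, reference: str) -> float:
    raise NotImplementedError("stand-in for a rubric-based judge; returns 0..1")


@dataclass
class EvalCase:
    question: str          # verbatim question a new engineer actually asked
    reference_notes: str   # what a principal engineer said the answer should cover


def run_eval(cases: list[EvalCase]) -> float:
    """Average judge score across all customer-derived cases."""
    scores = [llm_judge(c.question, agent_answer(c.question), c.reference_notes) for c in cases]
    return sum(scores) / len(scores)
```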
But the important part I think is to be able to
tweak every part of the system from the product features,
to the agent design, to the model training,
in order to build the best overall system.
And if you are capped in which parts you can change,
if you can only change the products and agent design,
then you're actually pretty limited in what you can do
because you're kind of at the mercy of what
these general third party models can do.
What I'm hearing from you is also that there is some trade
off between having to serve all different kinds of users and optimizing across those
different evals because each one of the teams that is thinking about a particular use case
or audience at a more general organization, for example, is less likely to have the ability
to work through the entire pipeline from training to product to win their use case.
So the thing that was extremely satisfying about working on Gemini is that you're driving research at the frontier,
and there's something very gratifying about that.
The downside was that you were so far removed from product that it was kind of a broken-telephone game,
with information flowing through four different people before the model got into the
customer's hands. That coupling was very loose. And I think it's very true that just because
a company might have the best model in some general set of academic benchmarks doesn't
actually mean they have the best product. And I think what we're seeing is when things
really fit together, it's usually that there's a, you know, a tight coupling between a product and a model that it's a whole system. It's not just the model alone.
Obviously, the first big example of that was ChatGPT.
ChatGPT is kind of an incredible product that was coupled with the model, and the model was post-trained for the prompts that are coming in from ChatGPT users.
There was a reason why, when I saw the first coding blog post that
ChatGPT produced for me, that was just insane.
That was an insane, magical moment, and they post-trained specifically for that.
And I think there's another example of that happening right now with Claude Code.
That's kind of tight model-to-product coupling.
And so I really think that it's important to really be able to do both at a great degree of excellence.
What is an example, as you guys open up the waitlist, that you want users to try, where it should just be obvious that the answers are better than other coding agents?
I think the kinds of queries that it tends to be better at
are, I guess, what we would call semantic queries.
So let's say, like, an example of a query where this is not
the best system to use.
It's like file level.
If you're looking at a file and there's, like,
a specific thing in that file and you're just
trying to get a quick answer to it,
you don't really need the hammer of, like, a deep research,
like, experience.
You don't need to wait, you know, tens of seconds or a minute or two to get that answer, because that should just be
delivered snappily. But if you don't know exactly what you're looking for, you don't
know the function name, you don't know something, and this is the kind of hard problem
that engineers are usually in, like there's a flaky test. You know that this test is flaky, but that's where your knowledge stops, right?
And that's when you usually go to Slack and ask some engineers, like, this test is flaky, what's
going on, does anyone know? The way we've used it is, when you're training these
models, there's a lot of infrastructure work that goes into it, and it fails in interesting ways all the time.
So you ask things like, you know, my jobs are running slowly, five times more slowly
than usual.
Why is that?
That's kind of a vague query that would be very hard to answer with existing systems,
especially since the knowledge around that query
might live not just in the code base.
So in the example that I just brought up,
when this was happening, when our environment jobs
were slowing down, it turned out that two different teams,
the infrastructure and research teams,
submitted pull requests that passed the tests.
It wasn't that they were wrong,
but they conflicted together in a way that caused,
effectively, a race condition and slowed everyone's jobs down.
These are the kinds of bugs that engineers actually spend time on.
That's where you have two or three engineers who spend a few days trying to solve one of these.
So I think these kinds of semantic queries
tend to be the place where a product like this shines. In the same way, think of the kind of query you would
ask ChatGPT when it just needs to use the browser tool. So it's like
a quick factual thing. Like you wouldn't invoke the deep research experience. But when you
wanted to compile kind of a lot of information around some more nebulous
query, I think that's where people seem to find a lot of value with deep research. So
I think a similar mindset holds here.
One thing I would do, working on a new system with a principal engineer next to me, is just
have them explain the entire system. Because without that context I can't even tell the agent what
to do.
And so I'm curious from a product perspective, like the way you have, you know, memory for
agents or even for teams is an increasingly popular idea.
There's lots of ideas about how to do it.
I think there are not many examples of collaborative memory in production in a
useful way yet, but I'm sure it is coming. Have you guys designed it in a form I can
understand too?
Yes. This is actually one of the more fun things to work on in product today. I think
it's one of the more fun features to work on at the company: how do you design a team-wide memory? Because
there are all sorts of details around, well, who can edit the memory, who can view different
parts of the memory, how do you maintain a kind of repository of this memory
for people to edit and view? You have to have a concept of authority, right? People are going to
say things that are wrong. The way it's worked with the customers we've started working with is that they typically
want to start off with a group of trusted, senior, staff-level-plus engineers
who are the gatekeepers, which is a very common notion, I think.
You have permissions, right, and ownership structure in code bases, and they basically
are the ones who populate the memory first and then expand the scope. And I think it works. It's actually a much more
complex feature to build because it touches on, yeah, org-wide permissions. There's some parts of
the code where a certain engineer should be able to edit the memory, but other engineers shouldn't.
And so it actually starts looking like a new way of versioning code, effectively,
right? It's kind of a GitHub++, because you're not versioning the code, you're
versioning the meta-knowledge around it that helps language models understand it better.
But definitely that is something we built that I think is a thing to iterate on a lot
until you get the right design, because you're effectively building a new Git from scratch.
Yeah, it's interesting.
And you're trying to design some sort of permissions into it, whereas the dominant system
today in actual version control is, at best, pull request review, right?
Like, you just try.
And somebody in the organization with the ability to review
makes a determination as to whether or not Misha should be able to make this change,
based on the content.
And I think actually, it's going to look not too dissimilar from that, right? Where if
you want to change the agent's team-wide memory, then it probably is going to look
something like a pull request where the person who really understands that system approves or, you know, edits it or something like this. I don't think it's going to look too dissimilar.
And it makes sense to me that it would look perhaps a little bit more Git-like, in that the person who knows the part of the codebase you are creating or editing knowledge
about is going to evolve over time as the codebase evolves over time and the team does
as well.
Yeah, exactly.
But this is also how it worked at Google, and I think other places as well. It was very common for different parts of the
code base to have owners, so there are these ownership files, which we have as
well. Basically, if you're on the ownership file, then the review has to go through you,
or it has to be approved by at least one of the members of the ownership file, and as people
move around teams and so forth, the ownership files themselves get updated.
So I think a pretty similar structure is probably going to hold here, but it's a lot more nuanced
than building an individual memory, which is just personal to you and
lives on your computer in your AGENTS.md file or something.
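For illustration, a sketch of that OWNERS-file style gating applied to team-memory edits: a note attached to some code path only lands if an owner of that path approves, mirroring code-review ownership. The file format, path prefixes, and names here are assumptions, not how Asimov or Google's OWNERS files actually work.

```python
# Hypothetical sketch of OWNERS-style gating for team-memory edits: a memory
# note attached to a code path can only be merged if an owner of that path
# approves, mirroring code-review ownership files. Illustrative only.
from pathlib import PurePosixPath

# path prefix -> engineers allowed to approve changes under it (assumed format)
OWNERS = {
    "training/": {"alice", "bob"},
    "product/": {"carol"},
}


def owners_for(path: str) -> set[str]:
    """Collect owners of every prefix that covers this path."""
    result = set()
    for prefix, people in OWNERS.items():
        if PurePosixPath(path).is_relative_to(prefix):
            result |= people
    return result


def can_merge_memory_edit(path: str, approvers: set[str]) -> bool:
    """An edit to memory about `path` needs at least one owner's approval."""
    return bool(owners_for(path) & approvers)


assert can_merge_memory_edit("training/envs/jobs.py", {"bob"})
assert not can_merge_memory_edit("product/ui/app.py", {"bob"})
```

As ownership files get updated when people move teams, the same update naturally re-scopes who can curate the memory about that part of the code.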
Okay, if we zoom out and place Reflection overall in context a little bit and talk about
the larger environment.
Sounds good, yeah.
You know, coding as a root problem in this era of AI research is a somewhat commonly held
belief, right?
I think a criticism of companies that went after pre-training focused on coding was that in
reality you needed language, you needed a lot
of the capabilities, who can say exactly which, but the reasoning capabilities that could be elicited
from large pre-trained models, to do code anyway, and so you had to do all of the work without the
general use. Is it specifically the availability of pre-trained models that are more capable and open source that made you feel like we can go after
superintelligent autonomous systems in coding without spending the pre-training dollars
upfront as a new lab? Or help me think about that logic a little bit more. I think that that's
roughly correct for, you know, why you can get into the game in the short term.
The bet that we made, starting a company a year and a half ago,
was that there would be pretty decent open-weight models out there. We
saw pre-training starting to more or less converge on a known paradigm:
there's a known big data set,
the internet.
Yes, there are gonna be some algorithmic innovations,
but you're basically extracting signal
from an extremely noisy data set.
And we felt like there's only so much signal
that one would be able to extract
without getting into just absurd dollars for scaling this
in terms of what you're trying to get out of it.
So what we thought would happen is that there'd be decent open-weight models. I think the quality of the open-weight
frontier has surprised me. The models are actually better than I thought they
would be. And we thought that you could just focus, you know. We're in this brief period in history right now where the RL flops are still manageable.
You can really have a best-in-class product if you're focused.
And yes, you still need a decent amount of GPUs.
But from a flop perspective, it's nowhere near where pre-training is. Like two orders of magnitude off.
Exactly.
Right.
So you can get into it and build out both the product and a research arm.
Our thought was that this was the time when you could actually start a generational frontier lab that does not need to be coupled to a big cloud provider.
Because if you do it right, you'll actually be able to generate sufficient revenues to
not have to be acquired or find some strange deal where the cloud provider kind of owns
you.
And that was kind of the model, I think, of what a lot of frontier labs looked like pre-LLMs.
I think we're already starting to see that, you know,
as more of a field-wide thing,
independently of Reflection, right?
When you look at how fast Anthropic's revenue is growing,
they're kind of in this spot where
it's a massive revenue-generating business
that's growing at an unprecedented rate.
But that was very much the ethos:
that we can come in, we don't need to pre-train.
You can get by with two orders of magnitude less compute
and really get something out there that's really good.
I think that, roughly speaking,
you won't need the amount of compute
that a frontier lab needs today if you're focused, but you'll
still need only about an order of magnitude less. So I think the capitalization requirements
are still high. There's no way of avoiding that. And asymptotically,
they're probably the same, but the idea is that at that point, you just have
a generational business
that can raise capital off of that.
I guess part of my read at this point in time is, and maybe it was always true, but especially
now, that your actual edge is in your capabilities for understanding what evals to go after and how
to design reward models.
There's perhaps less understanding and more dispersion in the field in post-training strategies versus,
as you said, more maturity in pre-training right now.
If it was a simple question of scaling RL on language models, people would be doing
it more aggressively right now.
Actually maybe that's a good question for you.
How would you describe the challenges in solving scaling here?
Why are we only able as a field to put a much smaller amount of compute to work here and
still get best-in-class results, versus the pre-training scale of GPUs right now?
I'd say that there are two categories that one would think things fall into. One is more around
the limitations of the problem structure, and the other one is, well, maybe the structure
is fine, but you need algorithmic advances to really drive the next frontier forward.
I'd say it's some mixture of both, but the biggest weight I put is on the
problem structure. So the thing that I led for Gemini
was reward models. I built out the reward models that were used to post-train Gemini 1 and 1.5.
And my thought was that if you have a reward that accurately describes the outcome of
any arbitrary task that you throw at it, then that's it.
At that point, it's just algorithmic advances,
but even the very simple RL methods we have today
will be able to get a lot out of this.
They'll only be bound by their exploration abilities.
So that's the only thing, right?
But the thing is, today we certainly are not in this world
where we have clean rewards for every task we could imagine. And so as a field we have to make various shortcuts
and compromises. So you'll have things like LLM-as-judge with different rubrics,
and that works to some extent, but a noisy or stochastic reward inevitably
gets hacked. So you need a lot of these, and there's only so much you can extract out of them. Then
you have sources that do have ground-truth rewards, but there are not many of them. And
so you have to hope that by optimizing against those, you'll get some generalization effects.
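A toy sketch of those two reward sources side by side: a verifiable, ground-truth-ish check where one exists (unit tests), and a rubric-scored LLM judge elsewhere. The `llm_judge` stand-in and the all-or-nothing weighting are assumptions for illustration, not anyone's production reward stack.

```python
# Hypothetical sketch of the reward situation described above: use a
# ground-truth verifier (e.g. unit tests) when the task has one, otherwise
# fall back to a noisier rubric-based LLM judge. Illustrative only.
import subprocess
from typing import Optional


def llm_judge(task: str, output: str, rubric: str) -> float:
    raise NotImplementedError("stand-in for a rubric-scoring judge; returns 0..1")


def tests_pass(repo_dir: str) -> bool:
    """Ground-truth-ish reward: did the candidate change make the tests pass?"""
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
    return result.returncode == 0


def reward(task: str, output: str, repo_dir: Optional[str], rubric: str) -> float:
    if repo_dir is not None:          # verifiable task: trust the tests
        return 1.0 if tests_pass(repo_dir) else 0.0
    # non-verifiable task: noisy judge score, which can be reward-hacked
    return llm_judge(task, output, rubric)
```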
And so I think the fundamental problem is the reward problem. You can either go in and say, all I'm going to focus on is rewards.
Or you can say, I'm going to take things as they are and just be more creative in the
methods that leverage the rewards that exist today.
Examples of that: basically every synthetic generation pipeline is some example
of this.
So it's a messy problem, but I think fundamentally we're in a reward-bound
world.
I don't think there's going to be any breakthrough where all of a sudden we go from
not having rewards for everything to having them, because the reward problem itself is,
at the time I thought, AGI-complete.
Now I'd say it's ASI-complete: by the time you have a neural network that can accurately verify any outcome, that is probably a superintelligence. And so then it goes back, again, to evaluations.
What are you training your reward models on? What are you evaluating against? What are the tasks that you want it to be good at? So
that's kind of how I think about it. I think it's a fundamentally reward-model or rewards-bound field.
And then there's also the algorithmic-progress side: the RL methods we have today are quite bad, I would say, at exploration and credit assignment.
The fundamental algorithms are:
take the things that work and make them happen
more frequently, and the things that don't work,
make them happen less frequently.
But they don't discern at all, along your reasoning chain,
which part of the reasoning was correct
and which part was incorrect.
And so that's why you get these reasoning chains
that are kind of garden-path, meandering.
They'll explore all sorts of things that are completely unnecessary and don't look
at all like the kind of structured thinking that a person would have. That's how the algorithm works.
There's no credit assignment step at any atomic level.
And so that I would say falls into more algorithmic progress bottlenecks.
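A schematic of that credit-assignment point: in a plain REINFORCE-style update, one scalar reward scales the gradient of every token in the sampled chain, so no individual reasoning step gets its own credit or blame. This is a generic textbook sketch in PyTorch-like code, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; it is not any lab's training recipe.

```python
# Schematic REINFORCE-style update: the whole sampled reasoning chain shares one
# scalar reward, so there is no per-step credit assignment. Generic illustration.
import torch


def reinforce_step(model, optimizer, prompt_ids, sampled_ids, reward, baseline=0.0):
    """One policy-gradient step on a sampled completion with a scalar reward."""
    full = torch.cat([prompt_ids, sampled_ids]).unsqueeze(0)
    logits = model(full).logits[0]                      # next-token logits per position
    # logits that predicted each sampled token
    completion_logits = logits[len(prompt_ids) - 1 : -1]
    log_probs = torch.log_softmax(completion_logits, dim=-1)
    chosen = log_probs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)

    advantage = reward - baseline            # one number for the whole chain
    loss = -(advantage * chosen.sum())       # every token scaled identically
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Methods that assign value per step (e.g. learned critics or process reward models) are attempts at exactly the finer-grained credit assignment described above.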
Can I ask you for a few, like hot takes quickly?
Yeah, let's go for it. What do you think of all of
these efforts, either in house with, you know, labs and vendors
or young companies just creating software environments that look
like popular software to train agents in? All right, copies of
Airbnb or Amazon or Salesforce or Excel? Personally, maybe the
take is not very hot. I'm very bullish on it because how else are
you going to... Maybe the hot take is that there's no such thing as generalization. There's
just bringing the test distribution into train.
Okay. That is an aggressive take. Wow. Yeah.
So as long as your train distribution looks something like what you actually want to evaluate
for, then users will experience it as generalization.
I think there is some generalization that happens in these models, but we probably,
as users, overestimate it because we don't actually see how they were made.
But then, yeah, if you saw, oh, the synthetic environment was actually very similar to the
thing I was asking about, then it would make sense why the model would be good at that.
Maybe six months ago, I think you said something like, I think it's possible we have my
definition of ASI in a couple of years. Do you still believe that's true?
I still do believe that's true. I think that where we will be a couple of years from
now is that there will be definitive superintelligence in some
meaningful categories of work. So for example, when I say coding, I don't mean all
of coding, but there will be a superintelligence within some slivers,
some meaningful slivers of coding, that is driving, I'd say, immense progress in the companies
that can benefit from that. And the reason
I would say the problem of ASI would have been solved by then is that
at that point, it's just a matter of operationalizing it. It just so happened
that these particular categories worked first; you might have a superintelligent
front-end developer because there's so much data distribution for that on the internet,
and it's easier to make synthetic data for that.
But at that point, you have the recipe, and it's just a matter of making economic decisions
of is it worth sinking in X amount of dollars to get the data in this category to get something
close to superintelligence there.
An example of that is what happened with reinforcement learning
before language models. Effectively, the blueprint for building superintelligent systems was
developed. It happened with the Atari games and AlphaGo; then OpenAI Five and AlphaStar were near-superintelligent
systems. And if OpenAI and DeepMind had sunk more compute into them, they would have definitely become superintelligent.
It's just that at that point, it didn't really make sense.
Economically, why would you do that?
Then this is a definitional issue.
Because I was going to ask, help me understand your view of,
one of the big criticisms of RL overall
has been lack of generalization.
That's been just a general question for this direction.
I do have friends at every large research lab,
and, I mean, tell me if you hear something
of a different tenor or just believe differently.
They believe we're going to have systems that
are much more capable than humans
in many types of knowledge work.
But they believe less in generalization.
And so in a resigned way, they're also, as you're saying, like, I guess we're just going
to bring all of it under distribution one way or another.
But that's a little bit different than my view, which is that at
some point you just have enough capability that the rest you
get for free, right?
The rest of sort of useful capability you get for free.
I think I kind of have a similar viewpoint
to the people you describe.
I think the generalization capabilities
of these things have been weaker.
First of all, it's all mind-blowing that this exists.
We went from a fundamental existential crisis
around generalization, because the feeling in reinforcement learning before language models was:
we have these systems that we can make amazing at very narrow tasks, and we have
absolutely no answer for generalization, like zero. And we went from that to things that, you know,
feel like they're generalizing. They're certainly generalizing much better than
anything we had before,
but it's likely because the training distributions are so broad.
So at least the way I think about it is more
kind of output as a user,
is the system super intelligent
in some meaningful categories of work?
And then from a research perspective,
is it obvious how to make it general
for anything that you might care about?
And at that point, again, it's just a matter of economics.
Maybe there are some categories where collecting the data is so expensive and
the return on investment is low, where it's effectively just better to have craftspeople
than superintelligent AIs.
So I think we're moving into this kind of jagged superintelligence where you have a handful of these superintelligences
for categories that matter, maybe subsumed into one model at some point, but at first
they'll probably be, again, I think there'll be a few companies that have kind of product
model coupling that is superintelligent in different categories.
I think an example of, again, starting to see the first glimpses of superintelligence, but in a way that hasn't really transferred
to anything meaningful yet is, well, we have these superintelligent test-takers now. The
AIME benchmark is completely saturated, and Codeforces and other competitive coding environments too.
The models are almost the best in the world,
and within the year, probably just the best in the world.
And yet, so we have the best competitive coding agents.
Then you go into a company and you ask them,
have these things been helpful?
And they say-
It's uneven, yeah.
Yeah, right.
So in the parts of work that are really meaningful, where you would want to see these things driving
meaningful increases in GDP?
I think the only way you will see that is if you go into a company and there's
a universal understanding that, yeah, my engineers are, as a whole, double-digit percentage points
more productive, every single one of them.
Right.
That's the kind of thing where, if it starts happening across every field, then you'll see double-digit increases in GDP. So
the kind of benchmark-maxing, and it's a bit different than benchmark-maxing
used to be, because now you have benchmark-maxing that is at least weakly correlated
to customer outcomes, but it still looks very similar to taking a board game,
training an agent on it, getting a landmark result in superintelligence,
and then making a claim that, you know, superintelligence is solved. I think the reality is that
deployment of it is half the problem, which goes back to kind of evaluating on customer problems
and building product together with the models. So you must have seen the news of the Windsurf
non-acquisition, first the deal with OpenAI falling through, and then the non-acquisition into Google DeepMind. What do
you make of it? We're seeing this verticalization basically happen across categories that are material to frontier intelligence.
And one could argue that the first verticalized category was actually search, like through
ChatGPT.
That's sort of a place where OpenAI verticalized first.
And coding has obviously emerged as another kind of frontier-level category that could,
like all these companies have aspirations of...
ASI.
Yeah, ASI.
And I think, you know, being basically trillion dollar companies or more, I don't think that
it's really the economics that are the driving factor, but it's more that if you want to
sustain frontier research, that's kind of what you have to become.
And so coding has clearly become one of these categories where verticalization is extremely
important.
And I think there are two sides of the story, one on the frontier lab side
and the other on the product side, like a startup that builds product but
does not have its intelligence in-house.
On the frontier lab side, I think this is exactly what Ioannis and I
noticed when we were working on Gemini: your model is so far away from the product that oftentimes, even having
the best model does not at all mean that you have the best product.
So there's a reason why basically startups are the places where adoption of coding tools
took off rather than the frontier labs.
And so there's a verticalization happening there and some are gonna do it successfully and some are not.
I think
we're already starting to see that, with Claude Code
really being an example of a successful verticalization.
I don't think it's guaranteed that a big lab can
buy their way to the end user,
because the fundamental problems of your research team being far away
from your product team will still be true and the company having a hundred different focus areas
will still be true. So I don't think that acquiring an asset will change that fundamentally,
but it does underscore the importance of verticalization. And then from the startup side, I think it actually puts companies that are in these
kind of critical path categories like search and coding in a pretty existential place if
they can't build their own frontier models.
Not all frontier labs will be able to verticalize correctly, but some will, maybe one will, and that's going to be enough, I think, to kind of take the thunder out from a company that's built a great user experience on top of someone else's model.
And I think some of those dynamics are probably starting to play out as well. There are some question marks around if you're on this critical path category and you don't
have your own intelligence, how do you compete when your competitor can just basically subsidize
their product a lot more than you can?
Because as a startup that's building on top of these things, to grow quickly
you're effectively subsidizing the margin that an Anthropic or Gemini or whatever is making.
Google and Anthropic and OpenAI can subsidize their products a lot more than you can.
I think that companies that don't own their intelligence or are not deeply integrated
into a customer in some way that makes them hard to remove find themselves in this pretty
existential place as it becomes clear to the frontier labs that this is a
category they need to verticalize around.
I work with a few robotics companies and so much of my lens on RL comes from that.
And I think it is like far less clear in robotics that RL will be a dominant part of the training versus imitation learning.
You'll actually appreciate this, on imitation from humans using tools, right?
Because we run this, I'm going to describe this idea that is nuts, but I think it's just
funny.
We run this grant program twice a year for amazing people using ML in different fields, it's called Embed, and one of the ideas
I had as a joke recently was: well, you just record everything, right?
Not obviously just the code base, but your Slack and all your
documentation and all your conversations, because you are a software engineering team. And I'm 100% sure that I can take that data set, if you ship something into
production to an end customer that has real issues at any scale, and sell it to a friend who's a
researcher at a lab working on this stuff. And so you have some floor value that is millions of
dollars for your couple-person company, and as a bonus, maybe the software company works.
Right?
Obviously this is like very noisy and I'm mostly joking, but I'm curious how you think
about exploring non-RL data sets that could be useful to you here.
If that company existed, right?
We would definitely pay for their data.
There we go.
So it's not an idiotic idea.
Yeah. Yeah. Especially if there's diversity. I think that'd be...
I can sell the whole set.
So is the question around how do you leverage alternative sources of data?
Yeah. The question is, and I don't want to over-analogize to robotics,
right? But within robotics, you have learning from world models, you have learning from
sim, you have learning from embodied data of different types, right? Imitation, then
you have RL. I think it's much less clear that you can use RL for a lot of robotics
today, especially some of the harder manipulation problems. And I'm curious, just
given that your team has this enormous strength in RL as a starting premise,
how you look at other types of data to create the coding agent experiences you want.
So I was actually a robotics researcher
in reinforcement learning.
Pieter Abbeel's lab is a robotics lab.
And it was a mixture.
Pieter's lab was always oriented around the intelligence problem,
with robotics being a domain where you study it.
And one of the reasons I came to lead reward models for Gemini
was because that's the question I was
studying with robotics. We had these RL algorithms for getting robots to do some very narrow
tasks, like moving blocks, and various kinds of narrow tasks in simulation. And the question
was, well, how do we get generalized manipulators, and how do we build
this into one system? And it seemed like the rewards were the bottleneck. So a lot of what I
was studying before starting, you know, getting into language models was how do we design reward
functions or models for robotics or, you know, for 3D video games like Minecraft or something like this that have,
I think, similar challenges scientifically. The challenge is that if you think that language
model rewards are hackable, vision language model rewards or other sensory signal rewards
are infinitely more hackable. They're much more short-lived as rewards.
You can think of language as just a compressed representation of the world that we have that
we kind of magically have to start with.
Whereas, if you're processing pixels or a sensory motor signal, this is raw signal that
has a lot more noise in it.
And so, if you train a neural network that is sort of trying to detect whether
this thing was manipulated correctly or this thing was moved correctly, then that thing
is just infinitely more hackable than anything you have in language models. So the same problems
blow up and become much larger. And so that's actually why I changed to language
models, because I felt that this was a fundamental
problem, but in robotics you also have the confounding factors of these noisy signals coming in.
I think that, at least in a generalizable way, that's why it's really hard to get reinforcement
learning to work with robotics.
The one place where it really does work well is when you have a clean reward signal, which
happens to be in these locomotion-like scenarios.
So there's a lot of work on building
very robust sim-to-real locomotion pipelines.
And it's because it's kind of, locomotion is just your body.
You don't have to manipulate the world around you.
And so you can actually build reward signals that are like,
oh, your quadruped is moving at this velocity
without damaging its body, that kind of thing. Maybe it's a bit of a roundabout answer to the question, but I think
these two fields are very different in the data distributions that they support. The
imitation learning data for language models is of course the internet,
all this data people have gathered on how we write and so forth. And so aside from that, when
we're generating synthetic data, the only scalable path is really reinforcement learning.
The other thing that I'll say here is that when you're collecting data for robotics,
you can do it in this tele-op way. The things that we are trying to train robots to do are very intuitive for humans as well.
I mean, actually more intuitive for humans, right?
People are master manipulators.
So you can have a lot of tele-op data collection.
For the things that we want language models to do, it's really hard to collect data of
the chain-of-thought process that goes on in a human's head when they're trying
to solve some task. And that's kind of the data that you need. And so for that reason,
I think language models favor this more synthetic-data, RL-like approach where, well,
it's easier for us to verify whether the thing was done or not than it is to actually generate all that data from a person specifically.
Maybe we just need a neural interface to get the chain of thought.
Yeah, maybe.
I mean, actually, when Ioannis and I were starting the company, we were thinking
about, well, what if
we just somehow had people speak into a microphone as they're doing tasks in order to capture that?
Just stream it.
Yeah.
And it seemed, you know, logistically very hard to pull off.
Okay.
One final question about sort of Reflection's path from here.
At what point do you, this is a decision you get to make in the future, but at what point
do you try to look at other problems beyond engineering and coding?
Do you feel like there's a level of sufficient depth where you should just go attack different
domains?
The thing that makes coding as a category special is that it's not synonymous with
software engineering.
It's just kind of how we think about the market today. The reason code is special is if you believe that the way a language model will interact
with almost any piece of software is through function calls and therefore code, then if
you build very capable reasoners, coding reasoners that are purpose-built for organizations,
so you've solved the long-context, how-do-I-reason-over-a-bunch-of-disparate-sources-of-information problem, and the system can act on pieces of software through code,
then you've built a system, the technology, that will generalize, at least
operationally across other categories of work. And so the way I think about it is more first,
just build, not trying to get too ahead of yourself,
but just first build the most depth-wise comprehension system for software engineers.
This will naturally induce more reliable coding agents. You can plug that in as an MCP to your favorite IDE or coding agent or use one of our own.
You can plug that into whatever surface area makes sense for the customer and then naturally
start seeing where you're getting pulled from there.
The reason I think this will work is because this is what we're already seeing, right? Like, how do you make the system useful
for product managers or technical support people?
And then, you know, I think moving on to things like sales
or something like this, but there are already places
where, you know, customers are pulling us
in different directions.
It's just kind of a matter of whether you engage
on that today or not.
And I think that
the risk that a startup has is that you see a lot of shiny areas where you can go, and you start going diffuse before you've really nailed a category. So I think it's really important to be
focused and not diffuse in the short term. And if you build the right thing,
what we think about as a contextual
core for an organization, in this case, an engineering organization, then you can naturally
start expanding that into adjacent areas of work in that enterprise.
Okay, last question, Misha.
Where would you characterize us as like being on the path toward deployment of these capabilities
in different fields?
I think we're a lot earlier than most people think.
That this is going to be one of those areas where the technological building blocks outpace
their deployment.
And so, yeah, within the next couple of years, the blueprint roughly for how to build ASIs
will have been set more or less. Maybe there are still some efficiency
breakthroughs that need to happen, but more or less there'll be a blueprint for how do you build
a superintelligence in a particular category. But actually going in and deploying it and building
it for specific categories of work, there is going to be a lot of product and research innovation
specific to those categories, and that will probably make this a
multi-decade thing. So I don't think that it's a couple of years from now and GDP starts growing
10 percent, you know, year over year globally. I think we're actually going to get there,
but it's going to be a kind of multi-decade endeavor. I tend to kind of see a lot of patterns now in
real-world deployment that mirror how reinforcement learning research
worked before large language models. Before large language models,
it used to be that you pick an environment, like you pick Go, you pick
Starcraft, you pick something else and you go and try to solve it with some combination
of imitation learning and reinforcement learning.
And when you look at all those projects, these were basically what were called strike teams
within DeepMind.
And each strike team, within and outside of DeepMind, was a bit of a snowflake.
The reinforcement learning methods and environment setup for Go were, at a high level,
conceptually similar, but at the detailed implementation level very different from
StarCraft, very different from Dota.
And so I think that that's sort of, we're going into every big category having a different
environment, right?
And different kinds of agents with different tools.
And that means that you'll need to, you'll have like general base models that you can
start with, but you'll need to post train things in specific ways for those categories.
And we're starting to see that already in the sense that the model that powers OpenAI's
Codex is not the o-series of models.
It's a model called Codex, which was post-trained for that environment.
The deep research models, that's a specific environment too; they're also post-trained for that environment. And I think we'll basically see more
and more that any category that has a sufficiently large business around it, that requires an
intelligence core to power it, there will be all sorts of interesting design decisions at the
research and product level of how do you actually gain the most performance out of this particular category. So I think we'll kind of see a lot more kind
of depth first players emerge over the coming decade or so.
I'm making a bet on it. And I also think, to your point about choosing
the problem for the era: at Conviction, we don't get to choose a problem for a hundred years.
We do get to choose for like this decade or so, right?
And, you know, if you actually believe
it's gonna be a very long-term endeavor
to get to the sort of productivity
and abundance you described,
but we are going to get there,
then, you know, the other thing you think about
is the path to supporting the cost of bringing anything
under distribution during a particular period.
And so, we've already backed companies in some of these areas, but
let's say in life sciences or materials science, it is more expensive to collect the types
of data you might need.
And that might be a longer endeavor
or one that you have to figure out how to fund,
or in robotics.
And so I think it's a really interesting timing question
of any of these really big categories.
But I believe coding is this era.
I think coding is this era as well.
This one I think will take longer than people thought
as well, because again, in the enterprise there are organizational problems
that are just much different than the benchmarks we have today. But I think it will be one of the
faster ones. So I don't think that's a decade out. That's within the next, you know,
say, dozens of months kind of thing. So I think the next sort of generational companies
in coding are definitely being built today.
Well, congratulations on the release, Misha, thanks.
Yeah, thank you, Sarah.
Find us on Twitter at nopriorspod.
Subscribe to our YouTube channel
if you wanna see our faces.
Follow the show on Apple podcasts, Spotify,
or wherever you listen.
That way you
get a new episode every week and sign up for emails or find transcripts for every episode
at no-priors.com