Latent Space: The AI Engineer Podcast - The RLVR Revolution — with Nathan Lambert (AI2, Interconnects.ai)
Episode Date: July 31, 2025We first had Nathan on to give us his RLHF deep dive when he was joining AI2, and now he’s back to help us catch up on the evolution to RLVR (Reinforcement Learning with Verifiable Rewards), first p...roposed in his Tulu 3 paper. While RLHF remains foundational, RLVR has emerged as a powerful approach for training models on tasks with clear success criteria and using verifiable, objective functions as reward signals—particularly useful in domains like math, code correctness, and instruction-following. Instead of relying solely on subjective human feedback, RLVR leverages deterministic signals to guide optimization, making it more scalable and potentially more reliable across many domains. However, he notes that RLVR is still rapidly evolving, especially regarding how it handles tool use and multi-step reasoning.We also discussed the Tulu model series, a family of instruction-tuned open models developed at AI2. Tulu is designed to be a reproducible, state-of-the-art post-training recipe for the open community. Unlike frontier labs like OpenAI or Anthropic, which rely on vast and often proprietary datasets, Tulu aims to distill and democratize best practices for instruction and preference tuning. We are impressed with how small eval suites, careful task selection, and transparent methodology can rival even the best proprietary models on specific benchmarks.One of the most fascinating threads is the challenge of incorporating tool use into RL frameworks. Lambert highlights that while you can prompt a model to use tools like search or code execution, getting the model to reliably learn when and how to use them through RL is much harder. This is compounded by the difficulty of designing reward functions that avoid overoptimization—where models learn to “game” the reward signal rather than solve the underlying task. This is particularly problematic in code generation, where models might reward hack unit tests by inserting pass statements instead of correct logic. As models become more agentic and are expected to plan, retrieve, and act across multiple tools, reward design becomes a critical bottleneck.Other topics covered:- The evolution from RLHF (Reinforcement Learning from Human Feedback) to RLVR (Reinforcement Learning from Verifiable Rewards)- The goals and technical architecture of the Tulu models, including the motivation to open-source post-training recipes- Challenges of tool use in RL: verifiability, reward design, and scaling across domains- Evaluation frameworks and the role of platforms like Chatbot Arena and emerging “arena”-style benchmarks- The strategic tension between hybrid reasoning models and unified reasoning models at the frontier- Planning, abstraction, and calibration in reasoning agents and why these concepts matter- The future of open-source AI models, including DeepSeek, OLMo, and the potential for an “American DeepSeek”- The importance of model personality, character tuning, and the model spec paradigm- Overoptimization in RL settings and how it manifests in different domains (control tasks, code, math)- Industry trends in inference-time scaling and model parallelismFinally, the episode closes with a vision for the future of open-source AI. Nathan has now written up his ambition to build an “American DeepSeek”—a fully open, end-to-end reasoning-capable model with transparent training data, tools, and infrastructure. He emphasizes that open-source AI is not just about weights; it’s about releasing recipes, evaluations, and methods that lower the barrier for everyone to build and understand cutting-edge systems. Full Video EpisodeTimestamps00:00 Welcome and Guest Introduction01:18 Tulu, OVR, and the RLVR Journey03:40 Industry Approaches to Post-Training and Preference Data06:08 Understanding RLVR and Its Impact06:18 Agents, Tool Use, and Training Environments10:34 Open Data, Human Feedback, and Benchmarking12:44 Chatbot Arena, Sycophancy, and Evaluation Platforms15:42 RLHF vs RLVR: Books, Algorithms, and Future Directions17:54 Frontier Models: Reasoning, Hybrid Models, and Data22:11 Search, Retrieval, and Emerging Model Capabilities29:23 Tool Use, Curriculum, and Model Training Challenges38:06 Skills, Planning, and Abstraction in Agent Models46:50 Parallelism, Verifiers, and Scaling Approaches54:33 Overoptimization and Reward Design in RL1:02:27 Open Models, Personalization, and the Model Spec1:06:50 Open Model Ecosystem and Infrastructure1:13:05 Meta, Hardware, and the Future of AI Competition1:15:42 Building an Open DeepSeek and Closing Thoughts This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Discussion (0)
Hey, everyone. Welcome to the Littance Base Podcast.
This is Alessio, partner and CTO Adensible, and I'm joined by Swiss Finder a small AI.
Hello, hello, and we're excited to welcome back, Nathan Lambert from AI2.
Welcome.
Thanks.
Fun to be here.
I feel like I also have to say interconnects and the Lex Freedom Podcast and the
and the AIU Wells Fair.
Like, you've just done a lot in the last year and a half.
Not that many.
Still stay no to plenty of things.
Yeah, yeah.
Your first episode with us was January, 2020.
before. When you just joined AI2, then you release all the almost. You joined us again at Neerrips,
where you did the open models. Well, Luca did. And you supported. And then you were more recently
here in SF for AI. First of all, I wanted to congratulate you on winning the best speaker.
Oh, yeah. Thank you. The reasoning track here. Here you go. I'm limited by emoji.
Oh, it's a nice AI generated. I look too Zen. I look so Zen and this AI generated. So we had our track host,
like take photos of you while you're speaking.
I'm going to turn it into Ghibli photos.
But this one, your eyes were closed.
It's funny.
Okay, we were trying to have Mochi, the reason named Fomsky,
joined us, but I think she's like very, getting very anxious,
very restless.
A little too crazy mochi.
Very restless. Okay. Sure.
Okay.
So you've been doing like really good work.
And honestly, like, I think one of the things
that we wanted to kind of establish was Tulu
and RLVR.
I guess.
Is that a good place to start?
Sure, it starts us in the recent journey.
I think that we can recap kind of the story of what Two, the Three was aiming to be,
and then kind of how it got folded into what the new narrative is.
What the goal is is try to do the work to compress what are complicated industry post-training recipes
into something somewhat tractable that you can modify on your own and do post-training
at a what is actual state-of-the-art level, I think.
What we do relative to Frontier Labs is that we probably have a smaller amount of tasks.
I think our post-training suite for Tulu is probably like 10 to 15 tasks,
but I would guess post-training at OpenAI at all, you have maybe hundreds of avals.
And adding more evals is more data work and more mixing work and making sure you have these things.
But on the core avals for our suite of models from, I think, 8, 70, and 4 or 5B is based on Lama at the time.
It's like it matches or beats meta on these core vowels.
I think meta has different priorities and their things for Lama 3.1, which is a great set of models at the time.
And it's just like, how do we distill what is very complicated post-training explanations or diagrams from the like of this Lama 3.1 report where they have these complex feedback diagrams with many iterations and earlier signs of that from like anthropic papers that have these multiple model variants and early like constitutional AI things for multiple years and what does that look like when you're doing.
a large-scale instruction tuning into preference tuning and what else you might add.
I think a lot of the core contributions of that before we talk about this reinforce the learning
thing. It's like we showed how to scale up preference data. It's just like the academic
community had been using this one dataset since like all the way back in the hugging face
models of like Zephyr beta is when this ultra feedback data set got popular. And still a year
later is like this state-of-the-art dataset for open preference tuning. And it's just like one of those
obvious things, it doesn't need to be the case. So it's a big trying to make more mature
recipes available to people. And I mentioned this either on one, I think I'm trying to talk
with Jordan. I mentioned the origin of the RLVR thing, which is like realistically, when you work
in the open, a lot of it is trying to match what industry has done. And we're on a different path
because our infrastructure is different. So some things that Open AI does now that works really
well for long contacts won't work that well for
Omo because we might not have enough flops
in our base model. We might not have certain
data sets for legal things. But
directionally, a lot of it is just trying to
reproduce things. And
I've long tried
to get John Schillman on the pod of
Open AI Anthropic and now
thinking machines. And at the time, he had
gone in approval to chat with me.
What he said was
confirming a lot of the things
that I had said on instruction tuning and multitask
and preference tuning. And he was like, oh, yeah, everyone
just does RL on the outputs. And that's how we got the RLVR idea and scale it into something that
is a general method. There was a lot of reasonably, or very similar works at the time, like Vine
PPO and Quiet Star on doing these math and coding domains for getting verifiable rewards.
I think the RLVR thing was about doing it in general recipes. And the naming was something that stuck.
Originally we had, I think it's especially like Costa Huang, who was a kind of lead RL engineer at AI2, who's doing some stealth startup now.
You can hear more from him now on that soon.
I think he's founding engineer of something.
And Hamish Iveson, who's still a student at UW, were leading most of the technical work on this.
And the naming was going to be RL from ground truths.
But then it's like the verifiable rewards is actually a more general notion because only like math questions have a ground truth.
where code is verifiable, precise instruction following is verifiable. So I think it's a nice
evolution of the name, which makes sense as you look at more domains, which is now why it catches
on with people. Once Jensen started using it, it was like, okay, that's set. That wasn't really our
goal, but that's setting that's where it took off? No, that was like in it being taking off because it
was after deep seek. But it's like when people like that have the acronym on the slides, and it's also
very clear of like, R like, Jeff is four letters. It's like we want to evolve that and have a similar
four letter acronym. It's not that much magic to it, but
They're definitely intention on these little things.
RLGT may not have worked as well.
I don't know why.
Yeah.
Yeah.
That's what these people like all that were definitely thinking and they made that
name change, which works, which was fun.
You did mention it.
So we'll show, you kind of mostly quoted from the Tulu paper there, but we'll show
the RLVR chart.
You did mention that you wanted to change it now.
And we'll sort of preview a little bit of the agent's discussion.
Yeah.
I think when you are introduced to RLVR, there's just a function, really, that checks if you have a string outputted from the language model.
You have a relatively simple function that's like, is this answer from the language model correct?
And there's no real environment because you're just looking at the generation.
And now I need to figure out the right way to communicate what either like multi-hop tool use looks like for this, which is something people are definitely doing.
doing, thinking, like, what is the right diagram to encapsulate how O3 is trained, which in action,
they take multiple actions because the next sequence depends on the feedback from the environment,
which is some sort of information store. So, like, when it's searching for a niche piece of
information, you can't know what the next actions are without whatever feedback from when Bing searches,
is what they say they use. That is a step that is very much happening. And then as people try to
transition to more end-to-end RL is a real strong notion of environment, which is that you're looking
from a sparse signal from this multiple generations. And that's what people want to do.
I think it's debatable whether or not people are actually doing it now. I think the deep research
blog post kind of hints that they do a bunch of small-scale RL and then poof the system works,
which I think is much more of what's happening is people train on a bunch of small things and they do
some prompting and they see that when you put these pieces together or a couple different
fine tunes of a model. So it seems like deep research has some fine tune of O3 in it.
Because you do that with some different domains of RL. It works rather than deep research being
trained on the outcome, which I think makes a lot of sense for it not working in deep research
because doing outcome-based RL for deep research would be RLHF again. Because you have to have two
humans and you're like which generated report is better. You can definitely do that and you
the whole sick-evincey thing, OpenAI, showed that they have so many different reward models
and reward signals in their post-training. But that's just one of them. And I think a lot of the
progress in making it exist is doing RL and a bunch of information retrieval and editing and
search tasks. We talked with Noam, about Noam Brown, about this deep research and kind of like
the verifiable rewards. He mentioned, obviously, that's an example of like non-verifiable thing,
having RL work on them. And in one of your recent posts, you also talked about how the
big labs of all this data that they can find long-tailed things to RL on, and then kind of when
you put them all together that fixes it. Do you feel like what we're able to verify is like
a big bottleneck that like the verifications are only done in kind of like these smaller atomic
things? And so we can not really scale that. I think my comment was on making, so in this
post as reflecting mostly on the question of what will agent progress look like relative to
modeling progress. So we've had almost three years of modeling progress and we're pretty used to
the messaging on that. And it wasn't just about being with the RL on small things, but do any post-training
to fix a weird behavior. And RL is a very data-efficient way if you can get the right signal,
but you could also just say, like, it does this weird, non-verifiable thing. Let's create 100 or
1,000 instructions to include in post-training so that the model does this types of information
extraction correctly or, like, soft extraction. It's a space that I want to flesh out more with more
examples of tasks. It's just, if you watch Claude code going, it's like, what is it doing in the
background? It's a lot of reading files and even just the compressing context. That's not,
I don't think that's really a verifiable thing, but that being messed up, like, that's a super
crucial skill for long context actions and longer tasks is just compressing well. And that's
going to take some training novelty on how do you, you can effectively modify your training data
instead of having all the multi-turn context,
you just insert the summary and you want to make the performance stay as well,
because it's also a cost-saving to have shorter contexts.
There's just a lot of new domains like that.
But do you feel like you can figure out what these things are before you release,
or do you think the labs have a big advantage because they have so much user data
that they can kind of like inspect this at inference?
I think it's mostly looking at real-world data at this point.
To the extent that there are clear benchmarks, you can use them in the open,
but I mean, we see the industry consolidated around data in different forms,
and I think that's a real important touch point for people.
I'm curious who's like still collecting reliable sources of open data that everyone uses.
There's a lot of action in the space, but hard to get traction.
Yeah.
So I think for a long time, preference data has been something where people understand
that it'd be very good to have large repositories of it.
If you want that, you can annoy me to try to release all.
for two, like, we have a final data set, but we have completions and ratings from more models.
Like, I'm talking to the student, and let's figure out how to mark this down because we just have so much completions and LM as a judge AI feedback data that we don't know how to clean.
That's one thing.
The problem is I think a lot of it is task and model specific.
So this notion of on policy to adopt an RL word for just this preference data and preference modeling, which is that you want the sequences that.
you're training this reward model on the sequences of generations to look like the model that
you're starting to fine tune. That is something that has made it hard to kind of grab off the box.
And it's, for example, like this ultra feedback that I mentioned is just has a lot of models
in it. So most of the models that people are fine tuning, there's some signal for it to improve on.
And I don't know how long that lasts. And we still don't have the answered question on how important
human is versus AI feedback. Every time I check in with people at Frontier Labs, they're like,
yeah, we still use human preference data. And I'm like, okay, I don't have access to that. And I don't
know how to measure how much it gives you, really. It might be most of the benefit is on the,
what's the right adjective to describe chatbot arena? It's like people are down on chatbot
arena, but it might be that the human data helps boost retention time and general preference a lot,
where most academics were doing multi-skill and alpaca valve type things, which it's just,
it's not as crucial to everybody's fighting in the attention economy.
You're quick, I mean, since we're there, you mentioned sick of fancy, you mentioned L and Marina.
That was one of your posts on interconnects that I really enjoyed.
Are they cooked?
Is there a future for arenas?
Like, how does this play out?
You know, they got $100 million now.
Like, what are you going to do?
I don't know what the money does for them, but I don't know.
I think the aval is still valuable, especially at the frontier people are very cynical,
but in the compression race of how much cheap, like what is the cheapest model you can have
that does pretty good at this is still so useful to a lot of people.
Like, chat is king.
Yeah, everyone chats with these things.
It's why I use, at GPD 4.5 isn't as good on chatbot arena.
I think it's higher on like yupp, which is a new competitor into this.
It's like they have like a vibe category.
which...
Sorry, Yup.
Yeah, there's like yupp.
There's something...
You can look it up.
It's a competitor, another startup.
They have like a...
All these companies have categories.
And one of their categories is vibes
and GPT 4.5 is on the top.
And I'm like, okay, there's some of these tracks.
It's a frontier model.
Yeah.
And it's just like...
That stuff intangibly is very nice.
The leaderboard is established.
People still should use it.
It's kind of a focusing function for the community
across different batches from industry to academia.
Yeah.
I'm not going to try to solve their monetization problems for them, but having clear norms and things that can be hill climb forever is very good.
Like having this idea of an Elon linking models.
You cannot saturate.
Yeah.
You just can't.
It's a great problem.
Like what is?
But you can game it.
So I think that's the issue.
Yeah.
But everyone evaluates on multiple.
Like here came out.
Like Sarah Hooker, I've never seen her so public about any of her.
Like she has great, but she doesn't really go public like that.
Yeah.
artificial analysis also has one, which I think is kind of cool.
The other thing I think is relevant to this discussion is a lot of the data actually is like single test, like a single round.
Like it's not multi-turn.
And I wonder how to create proper multi-turn arenas because you have to switch the models as a whole premise of Elam Arena.
It depends on how valuable the user data is.
If the user data keeps being equally or more valuable than the inference, there's going to be a platform to keep pushing this and
to more and more expensive things.
So they're just set up a deep research.
I mean, they're probably setting up a deep research arena
because that's the data that, I mean,
if I was opening AI working on deep research,
that's the data that I want.
And there are competitors.
And LMSSys is the entity that has the marketplace
to set it up.
Right.
I mean, it's almost like how I see scale.
It's like scale kept climbing the edge of what AI data processes is.
And because they're the name brand,
they keep climbing the incremental evaluation game,
and a lot of them have longevity.
Yeah.
Yeah. That's a network effect in some ways.
You mentioned skill, which is another hot topic, but like we'll put all the sort of hot takes at the end.
I do want to like, you know, focus, try to be technical up front, try to, you know, you're still writing the RLHF book.
Is it RLVR book now?
I can give my spiel on it. Ultimately, RVR is not mature enough, nor is it as interesting of a book.
So I'm like, on two, so those are the two fronts of why I don't want to rebrand. And there's also some personal
career strategy, but that should be independent on what is objectively a good book. Because
RLVR is going to be changing so much in the next 18 months. We've already seen it. There's all these new
algorithms, but I think there's a lot more under the hood on how you do the right pre-training
for it and what the data is, how tool use emerges. All of this stuff is core to what RLVR
will be seen as. I'm watching to see if O3 is like a niche model or becomes the path that
everybody needs to follow on its kind of different style of tool use that you see particularly
with search. And we don't know how opening I did this. And these are the things that I think is kind
of core to an RLVR book that we don't have, whereas RLHF is a more interdistance planarie in the same
way that Chatbotterina can never be saturated, RLHF can never be solved. And we kind of know
these problems of alignment and over-optimization and what the pipelines to getting data that people
are using R. And yes, I can add more RL algorithms to the book, which is nice for me to study,
but that's not really changing, it's not changing like, oh, what reward modeling is and the different
ways that people implement these today, whether it's a value function or reward model and stuff
like this. So I think the breadth on RLHF is nice. And I think I would tell a lot of academics that
I think RLHF problems are going to be foundational and kind of just have a much more steady study
rate where we're on this massive spike of RLVR, but it might just be solved. And then it just goes back to
zero academically. It's not, it's an embellishment, but there could just be a best practice for getting
100% accuracy on any problem that you want. And then it's solved to where the debate on what is
a preference is going to go on forever. Yeah, because it's verifiable. There is a right answer.
Yeah. Sorry, what do you mean by like over the next 18 months, there'll be a lot of changes? Like,
what do you foresee? Actually, let's just catch up. What's already happened in like the sort of
recent history? Yeah, so there's two categories of information that we have, which is what are
the models doing and what are the researchers doing? I think the models provide a lot of inspiration
in terms of what the, like what's actual frontier is. And that's things like 03, Gemini 2.5,
clod. These are a mix of just 03, I think, is the most scaling RL approach. And then clotted
Gemini 2.5 are very similar with hybrid reasoning models that you can turn on and off.
They rolled it out in different ways. So Gemini didn't have hybrid reasoning at launch, but
they brought it in and Claude had it at launch. One of the most important questions has got to be
is, is the O3 path of just a reasoning model or hybrid reasoning models more useful?
Do they diverge in their methods for training them? I think the NVIDIA-Lama Nebitron reasoning paper
is probably the most detailed paper on a hybrid reasoning thing.
And then DeepSeek R1 is still the canonical recipe
on a reasoning-only model.
And those are very different approaches,
and I don't know if one will win out or not.
And then there's just a lot of work on data side
and RL methods.
I think there's a list.
There's a whole list of kind of GRPO complaints
that are out there where the math doesn't make sense
for certain things.
To me, every paper I see come out
always has like some fixed the GRPO. It's kind of cool that like people are, you know,
taking variations on it. But also I don't know if deep seek's going to come up with R2 and just
blow away everyone with whatever is next. Yeah, I definitely don't think the algorithm tends to be the
most important thing. I think I had this in my engineer World Fair talk, which is kind of a snarky
of like, how do you train a reasoning model, which is like you get a starting data set, you
incrementally improve the data set, you do that until you're running out of time or your performance
starts going up. And then you try all of these switches from all the papers or you turn all the, you
do a whole bunch of binary tests of all these various algorithmic changes, and you do a grid
search and see what works.
Like, candidly, that's why I dismissed your P.O when it first came out, because it was sold as
an efficiency thing.
Yeah.
And I was like, okay, fine, like, but like, you know, I've been trained to not care about
efficiency because it's just a matter of resources.
Yeah.
The GRPO advantage estimate is very well suited to verifiable rewards.
Right.
But the other thing is kind of a intangible works better on the infrastructure type argument.
And when it came out for deep seek math, which is well before the RLVR phase.
So it was really marketed as that.
When you talk about hybrid models, how do you reconcile that with Open AI saying they want to move away from the model selector to just have a unified interface?
Do you feel like they feel pressure to like, hey, look, when I have all these different classes, we want to route them to the right thing?
Or do you think there's something else?
I would think that Open AI wants to have a model that knows how hard the pressure is.
I think that has to be the North Star for most people working on reasoning, which is the model will just spend the right amount of tokens on it.
And if you look at a compute-level discussion, see what inference time scaling means.
I think in plenty of ways, like, hybrid reasoners might just be aged out, except for niche applications, because quality is so much more important than having 100x less inference tokens.
is like you just pay for it and compute and that'll get better.
I think it's like really like that was something like Jensen said in his most recent,
I think like Straitakuri highlighted it or had the interview with him.
And it was like, yeah, everything's going to be a reasoning model because it's going to get so cheap and they're better.
And I was like that's why it's like the hybrid reasoning thing is a little bit weird.
And it's like I always just will turn reasoning on unless it's a really silly query like, oh, like what is this thing?
So it's like, okay.
In two years, that kind of tracks.
which I think O3 is also just burning money on us.
I mean, searches 80 websites for me asking what paper it is.
Like, that's a lot of tokens.
But it seems directionally, like, if that's the thing that works, that'll be the default.
Yeah.
At least in all of these high, most of the things that people that we talk to, whether it's coding
or very high-end information economy, those things, the value is there.
I wanted to double-click on something that you seem to be coming back to a lot.
you seem to assert that 03 does something very different by using search a lot,
much more than basically everyone else.
Do all models come with a search engine now?
Is that like a must have?
It depends on your use case.
If you're doing general information retrieval or understanding, yeah.
There's old papers that we can try to find the links.
I don't know if Sam Malman was talking about it,
but there's just this retro paper from Deep Mind and other architectures
that people have been pulling in the discussion again,
which is like you have a very small model
with a very big context length and a very big retrieval store,
which I'm not one to bet against the transformer architecture
and just figuring out long context and stuff like this,
but those are ideas that people are bringing back,
which is search is better.
You look at all the evals from reasoning models,
and one of the trends is that simple QA numbers all drop.
It's like deep seek R1 to the new R1, it goes down.
It's like all the new, like Quinn 2.5,
DeQuain 3, SimpleQA goes down, at least when you're evaluating these without tools.
And Simple QA is like a what is considered to be a very nice, fairly numerically robust,
like long-tail knowledge evaluation.
And all of these, the raw models, they're all going down.
Right.
It just made, like, long-tail information, just to have this search behavior makes a lot more sense.
Okay.
The kind of argument for this, just a, I have been through this journey, too, of like,
oh, why don't you make, like, a model that doesn't know anything but search, right?
You can search up anything that you want and learn just in time.
But the problem is you need to know what the search terms are.
You need some baseline intelligence to make all this work.
Yeah, that makes sense.
That's a good way to put it.
I think it's important because there's this thesis of like LMs becoming just online LLMs, like permanently.
And it hasn't been super pursued.
Like perplexity was one of the first to put it on my radar as like they were like,
we'll attach the search engine to the LM and that's what you get now.
And I think like more and more people are starting to offer it as part.
of their default services, like Gemini has like a search grounding thing as well.
I mean, it's what people say a big limitation of Anthropic is because it uses Brave Search,
which returns a bunch more like SEO slop than...
Is that proven? Because I don't know. I thought they had their own index.
Okay, so I don't have it. I haven't done detailed look, so I'm dealing with rumors.
But I think they'll all end up doing their own index. And it should, it's one of these
things that's like Google should have an advantage again. But who knows if they do?
I also hinted at this in my post, but it's like, Hamish had tried to set the
up the same student from RLVR playing with like search and an RL model. And it's very easy to get the
model to do tools if you prompted to, but it's very hard to get the like RL model to learn that the
tool is useful. And that's why it's to go through these things where it's like 80 failed tool
uses and it still gets it or like it stops or gets it on the 81st. Okay. It's just a RL behavior
that feels emergent from having a very nice way of like getting the model to learn to use the tool.
and it's not like you can't sFT this model to do this.
It just really feels like they set up the environment right
and it plugs into this deep research kind of line of work that they did
and they broke down the problem into these sub-RL tasks
and then it kind of lets it do this thing.
Interesting.
I don't want to be an open AI show all the time,
but I just think I tell people to play with O3 all the time
because it's weird.
It's excellent.
I would say like the amount of work you're imputing on
deep research team when like as far as I know it's three people did it. It was Issa and like the two
other collaborators that she had. I don't know if they did that much on top of O3. Like every
indication I've had from O'NeI is that deep research is more or less a thin wrapper over which is O3.
Yeah, it's probably like one or two small things that they're like, oh, we can make our, we can
make deep research work by adding this small amount of data to the training thing. And then it just
works. That is, that would be how I describe it. I mean, I mean, I mean,
what is it, Gwern, the anonymous person, he replied to my Q-Star post on Twitter the other day,
and he was like, why was this all wrong? And it's obviously simple things don't scale.
There's a lot of complexity, because there's a lot of other exciting things in the AI field at the time,
and Open AI kind of sends out a lot of things that confuse people. But this would fit into that,
which is deep research is a minor change from an existing RL trajectory of what was, like,
O3, probably they had already figured out that search was going to be better,
and then we're like, okay, we can repackage this.
And it's a simple thing that makes a big difference.
And most of the things are like that once you have traction.
I think trying to get the initial takeoff on the sigmoid is the hard Q-Star thing.
But then once it's like, once it's like this, a lot of things in the middle feel obvious,
which is why I have described one of the things that we work on for Olmo.
It's like a lot of it is just having motivation to do things that feel somewhat obvious,
but they're still hard.
It's hard to get different recipes or it's hard to get a full.
reasoning recipe off the ground. It's just like a huge change because you have all this inertia on
this aval suite. And then you have to figure out if you branch your recipe or do you start from like,
do we just take like open reason or zero and start from scratch, which is like, it's a whole other
headache of things. It's just hard to move these projects that are anywhere above five to 10 people
with inertia to get stuff done. But then once you're hill climbing, things can seem really obvious.
Yeah. Okay, you covered a lot there. Before my next question,
to close the brave thing.
Our friend Simon Willison wrote a post
that Anthropic added Brave Search
as one of the sub-processor in their product.
So that's where the thing came from.
Now, to what extent it gets used?
We don't know.
We don't know.
I would just kind of comment on a couple of things
that he said and then we'll go on to your question.
There's a very good post on, just on the retrospective of Q-Star,
there's a very good post that you had,
which was that I want to send people to,
which is, was 01 as IOP.
Right?
That does imply the question.
of like if one was a siop, what else could be siops now?
Yeah.
There's definitely the siops out there.
I mean, the whole inference time scaling plot is such a siop.
Why?
You put these two things next to each other with an x-axis, and it just looks like it's
easy to control.
Whenever you see an x-axis, you think it's easy to control it.
Whereas, like, for training on the left one was training.
Yes.
And training makes a lot of sense.
So if you haven't, even if you go to really old RL papers,
RL learning curves are a non-log-X axis usually.
and they look like this.
They look like these, like, whatever, like, logarithm or exponential rise.
And then if you take one of these and you make it a log excess, it's a straight line.
So, like, that side is like, oh, okay, we've seen this before with the RL.
But with inference time scaling, it being an X axis is why people are like, oh, there's a knob.
I can turn search up a lot.
Yeah.
Which is like what breeds all these weird ideas.
The core of that article is just they're taking points from within training or there's a natural variance and then you line them up.
And if you line them up, then you get this nice inference time scaling behavior, which is,
and now people, a lot of people have reproduced this plot on inference time scaling.
And it's much clearer now.
But at the time, it's like, I see why I thought it was a knob.
It's like, oh, look, they called it inference time scaling.
You control it.
I think the most interesting, well, you have a lot of interesting things in your blogs.
But one that stood out was about RL and tool use.
You said that it's easy in RL experiment to tell the model to try searching.
but then if it doesn't get results with the tool,
it's going to stop using the tool very rapidly.
Can we impact that?
So can there be a good tool that the model doesn't know how to use
and then it kind of fails and then it stops using it?
Can there be a bad tool that should be improved before giving up on it?
How should people think about designing the tool,
improving the model and kind of like where to intervene?
This is definitely on the newer side for my things that I want to work on or have worked on,
I think particularly in 2026, especially in the open sides, all the infrastructure models will
count up a lot where I want to go deeper on this in terms of deeper research style things or very
inference-heavy multiple calls. And to answer a question, that there definitely can be bad tools
and there definitely can be like the model just using them wrong. And something that I would
want to see in a model is kind of not necessarily creativity, but like an openness that it doesn't
know exactly what it will get out of all of its tools and this uncertainty to just try a few
different things, which almost seems classical RL behavior. But if you think about what a language
model does, they're always very confident. They're not necessarily confident, but they have like a path
and like a direction in their answer, whereas that's a big change in these reasoning tokens is to have
the notion of backtracking and things like that, which is some sort of like openness to the tools
having things that are unknown in it. It seems like a really nice thing for the model to have,
which is like, oh, what if I try this?
Like, what does it get?
Especially on the open model side, which is if this is going to work where people want to use open models with tools,
it's going to be because people have private data stores and stuff.
So if you were to train an open model that is going to be a good reasoner like 03,
but on private records of some sort that will never get sent to the cloud,
it needs to be thinking of, like, I can try some things with this to get a sense for it
before saying I have to give up.
And if you look at tool use right now seems much more similar to code execution,
or it's just a part of a sequential path that you need to get to, which is like, I have a plan,
and if it fails at a certain step, I might have a backup.
But it's not like this iterative of I need to fiddle with the environment in order to come up with my plan.
It's just that it's something that people probably are going to have to train into these models,
which is like you might just tell it.
Like, you don't know what is in this, but your answer might be in it, which is like a very odd prompt.
Maybe it'll help.
Yeah.
When we had Eric Schlons from Antrobic who worked on the cloud agent before cloud code, he mentioned
this spent basically like majority of the time on like the tool design to give to the model.
And then you just kind of learn how to do it.
Are you usually, well, I don't know how much you worked on actual this stuff, but are you
putting the tools one by one in the RL process?
Do you think that helps or do you usually give, is it better to give all the tools and like
the model explore?
I don't really know.
Like, we haven't gotten this to work.
I would say it would probably depend on the model in your starting point.
If your starting point is already good at tools, it can probably generalize more.
But if you're doing this weird base model RL and you have to have this kind of curriculum,
like, if you scale RL long enough, you're going to need a curriculum of things getting harder.
And like, that's pretty obvious.
So in that case, it might be tools get added when things become too hard for it to solve certain
questions, which would be, which sounds very intuitive, but, you know,
also just really hard to manage and practice because what is your automated signal and your
training run that is time to do that. That's why video games are so good because they're designed to unlock
things as you progress, but I think like with things like search, it's like, you know, if you're given
access to a small data store or you're given access to all knowledge on the internet.
It's good feedback for the Arc AGI people for the V3 benchmark is like have things where the
language model needs to learn to use new actuators in the world.
after a certain threshold.
That would be
ArchieGI 4 then.
Yeah, probably.
I don't know. They're cranking them out.
They're cranking them out.
They're actually doing a launch party, I think,
like in a couple weeks.
So I'm actually really, like,
it's fun to play Arc AGI.
I don't know if you tried.
Oh, I haven't.
It's pretty fun.
Like, these are IQ tests.
I used to be like,
oh, like they weren't that relevance.
But, like, actually,
now that we have a gradient
where, like, LMs are actually
significantly climbing them,
Now it's actually really more interesting to compare your own intelligence to the ELMS.
I'm with noam, noem on no harnesses.
No harnesses, yeah.
Yeah, I mean, harnesses are cool, but they're gonna, they're a handicap that's changing the learning dynamic substantially.
So it's good, it's good demos, but I feel like the core thrust has to be no harnesses.
I mean, it's always like, is it wrong to say that these are just inducted biases, right?
Like, they're not in the model, sure.
But like anything where you're just like looking at the results contaminates.
This is just a different task.
I think I do it or I mean I've, I think I talked with Greg about this at RKGI,
which I told him like do harness and no harness.
You just have both different categories.
It's like you're trying to be transparent and build targets for Frontier Labs.
Just do both.
Like I don't think it dilutes that much.
The no harness is going to obviously be harder.
And then you just get more bang for your buck on your benchmark.
Yeah.
It's the same dataset.
Staying on the topic of tools while we're at it,
you had a really good summary of recent work in multi-tool RL,
which had like Loop and Read Tool and Toro and all these other things.
And I think that this is just like an area that's super rich for research right now.
I just wanted to give you the space to like highlight what are your favorites?
What do you think that people should explore?
I could share what my moderate ambition, what would be fun research project things,
is you want to create some sort of competitive dynamic or a vowel, and it has to be so much narrower
than what industry is doing. So I told you this at lunch, which is like deep research, but only archive
papers. So you don't have to do a full index. You have a limited domain. You have to figure out how to
measure it or something. I think it's good for academics to work on academic tools because they have
very high domain expertise. They already know what's going. And just like figure out how to make that
something that is either very useful to users if it's going to be good enough of that or something
you can't help climb on. And I don't know if those are like brainstorming on the fly of like
take related works out of papers, just look at the text and break all the links and make an
avow which is filling in hundreds of related works with archive links. Like that's a fun deep research
style idea. See if you can do it with open models on a set data store with tools.
AI2 has gone through a lot of discussions with this,
which is if you're trying to have impact in AI right now,
as an academic, you have to level up out of papers to artifacts,
which is models, datasets, avals.
Datasets and avals are easier for people to have impact on.
And then the next thing is, like, what do people actually use?
In AI2, especially in this semantic scholar team
that's now working on, like, information agents of different types.
There's another thing that I'm, like, distance in,
so don't have all the names.
But it's, can we make open models, do that,
of thing better. It's like, can you make something that people actually care about? And then
you're, that's a whole level of impact that's much higher if you have actual users. It's hard
for academics and small institutions to do that. But if you're working on agents, like, dog feeding
is viable. It's like, can we make ourselves a good Slack summary bot that we like or something?
And just making these agents really tractable. I mean, that's one direction. Another direction
is just he'll climb on humanity's last exam with tools. I just think,
it's kind of unlikely that we're going to win as an academic and a state-of-the-art number because
they're going to start spending millions of tokens per query. It's just a lot of, it's a lot of
computer and like the getting, beating that on the flop equivalence is going to be so hard.
Unstructured thoughts is something that I'm mostly like, okay, I'll get to this. Like I have
more things to figure out on the modeling and what I call like skills level, which is just
how do you do reasoning to induce inference times,
and get high avail numbers.
And once you know you can do that,
you can take your knowledge with you
to do it in more specific domains.
There's skill and your skill acquisition, right?
They think the archa-gGI definition of AGI.
I quoted it.
It's like efficient, yeah, skill acquisition efficiency,
because it's described as three words.
Right, yeah.
Your emphasis on skills in your recent talks that you've done,
do you want to sort of reiterate that thesis
for people to pick up on?
Yeah, so I've been thinking about mostly I'm trying to
to get ahead of what OpenAI, etc., are doing probably now if it's not in their models.
And with all the agents, it seems that planning is a very critical task.
So it's kind of how do you come up with the taxonomy for different types of things you need to train into reasoning models for when it'll be a bottleneck.
And the found date, so I came up before.
And the foundational one was skills, which is what I would say that we have already done with 01 and R1, which is you do a lot of RL, you should.
show the inference time scaling works and you get really high benchmark numbers.
And then the next three are kind of what comes next.
And most of them are around planning.
So what I had is three and four of my list were abstraction and strategy,
which is trying to not use planning because planning is a word that people already use a lot.
The strategy would be the direction the model should go in and technically what are the steps of its plan.
And then abstraction is how does it break it down into things it can actually solve?
And then the fourth last thing is calibration, which is just not wasting compute and knowing when to like give up and ask the user things.
Because like overthinking is obviously a problem.
It's easy to keep getting your aval scores to go higher by using more inference time scaling.
But eventually like that's not what people want and their models.
They want a smarter training regime where the model is actually getting proportionately better for its training.
And not there's a lot of papers on overthinking and stuff like this.
which I think is, like, Open AI wants it
because they have to foot the GPU bill.
Like, if O3 just infinite loops itself
for a bunch of people, like, that's not good.
Does it actually?
I don't know, but it might.
Okay.
I mean, like, these reasoning methods
definitely can make the models
just kind of unstable and just, yeah.
So it's like, but it's also the GPT5 idea,
which is how do you get a model
that just routes the question to the right?
Maybe not necessarily a router,
but just knows if it needs to do a plan
or if it can just answer, if you look at Deepseek R1 and you ask it a hard math question,
it's not like, here's my plan of attack, it just starts.
And having a model that knows when to be like, okay, here's my plan of attack,
I might need to make myself a memory store.
I might need to take like a cloud code approach for this query.
I'm going to build a memory store and spin up some parallel searches and then come back.
Conceivably, this is all something you can train into a model because the searches or the parallel model,
could be like tools in that case.
The simple way to describe it is we have something like thinking tokens and then answer tokens,
and the model should be able to optionally have like planned tokens before thinking or before using tools.
It's like, okay, like here are the table stakes.
I need to do these things and these sorts of tasks will be harder versus easier.
It seems more tractable than some far out ideas for AI.
It's like a language model can write a good plan.
It just needs to be asked to do so.
which I would bet that ClaudeCode and Deep Research are doing this.
Like, you get a user prompt.
And first, the model is like, yeah, there's a plan tool in CloudCode.
And they first, they break it down.
And it's like, that is something they've trained into the models.
Like, deep, I don't think DeepSeek has, doesn't have it built in, but it probably could do it.
And just thinking about that interface between, like, if the model needs it to be able to do the task end to end on its own, you know, can it do that sort of thing.
I think that my challenge with this.
whole reconciling this approach with the no harnesses thing is that I think a lot of the way that
people, especially engineers, want to model it, is that the plans and the memories are tools.
And there are no special plan tokens. There are no special memory tokens. It's just context.
Or it's just, you know, whatever. Specifically for planning, because then you can do fan out to other
agents for tool calls and stuff. So it doesn't have to be sequential. But I'm just like, is this
fork in a road? Or do we have to make a real choice here as to do we outsource things to tools,
or do we keep it native within the models, tokens? I don't think it's a subjective difference.
I think mostly the planning idea is to make the point that people don't get things for free.
And the planning improvements might be kind of mundane, which is like we were prompting Claude
and its plans were bad in this way. Let's give some data where its plans are more detailed or
break things down into more steps so that it's easier for then to do it.
Yeah.
Because it's in a black box effectively.
So if it hasn't been targeted, it's unclear of what the performance will be.
Or on the like open model side, it might just be the idea of having different models for different parts of it.
Then you're really training a model to just be good at planning.
And like that's that's data that you need to come up with.
I mean, you only use that model for that one part of it.
Does it feel like plants are much more reusable and should maybe not be generated every
time. I feel like especially in coding for certain sets of tasks, you want to have similar types of
plans. So maybe it's not the right way to ask the model to regenerate a plan every time.
There should almost be like plan blueprints as like tools and then the model fills it in.
Like where do you think the balance should be? I think they're reasonable. A plan is obviously an
intermediate goal. It just seems likely that there's like failures on this kind of planning level.
I mean, the same thing goes for these rubrics that are popular,
whereas a lot of the technique that is popular for so-called rubric things,
is you have a prompt,
and you have a language model generated a rubric for that prompt,
which is a few specific things that needs to get right.
And that's conceptually very similar to making a plan for every task.
I think whether or not it's, like, grading is that you're going to have a different type of abstraction than executing.
But I think what people are seeing is that it's cheaper relative to the effectiveness.
to just generate it.
So, like, plans are not super long,
and they probably,
they're not that many tokens.
So it's probably just kind of like,
okay, we do this.
Like, putting it in my taxonomy
might be overselling it
where it just needs to be a prompt
and you just need to make sure
that your model's not too weird
at that prompting stage.
I think your taxonomy is super useful, by the way.
So skills calibration strategy, abstraction.
I feel like maybe abstraction
might be the most underrated one
or hardest to solve.
The way that you introduced it was different
than how you wrote in your blog post.
You said it was basically not to overthink.
That's calibration.
Yeah.
Abstraction is about breaking things down.
Yeah, I think both of these strategy and abstraction
make the most sense on the hardest tasks
that we don't know if the model can do them.
Right.
So if you're assigning a task to a model
that you don't know if it can implement it,
the strategy is very important
because it needs to be very specific and narrow.
Whereas if it's doing mundane code,
like deep research,
the plan is actually not that interesting of a thing.
But when you're at the frontier,
of if it can, I don't know, some GPU implementing thing.
You could buy into the Open AI and Anthropic narrative,
which is help me implement this research idea in our complex,
like, multi-distributed GPU thing.
My God.
It's like, this is a task that's hard for a human.
And for an AI to come up with the right plan to debug and do this is very narrow path.
So therefore, the strategy is pretty important of does it start with certain tests
and how does it actually build this out to complexity?
it's obvious that I need to come up with more better examples for this,
but I think as you push it, it's more natural to see that there's only a few plans that actually get it done.
And then abstraction is just important as your task becomes so big.
It's like a prompt engineering thing almost.
Yeah, and it's like you only have 100K tokens you can generate.
Like you need to make sure the model breaks it down.
So it's not just spawning a ton of infinite processes under itself, which I do agree that abstraction is an interesting one,
especially when you start to think about these models that could call in other models to do sub-task for it,
or parts that can be parallelized with multiple searches or just more compute.
I think that kind of folds into abstraction, which is just like, how do you approach a certain nugget of the problem?
And I definitely say, like, I don't have experience building this.
It just feels like if you're going to visualize AI doing the hardest software or other tasks,
it's something that humans are very good about.
So it's like, how do you come up with the research?
plan in 10 weeks.
Like there's a lot of how do you prioritize which experiments to do?
There's a lot of inductive biases that go into that.
Like a language model would not do well at that right now.
Probably memory would be helpful there.
So you can just like the way we do this in real life is we accumulate experience.
One thing I did want to dive in on was just parallelism in general.
There's one case where with 01 and sort of the sort of Q star,
ideas. There was one case where it was sort of overhyped in some sense. But now it's coming back
with O-N-Pro and Deep Think. The theory is at least, you correct me if I'm wrong, basically they
run 01 8 times and then you have a reward model, rate it, and then give you the best of the eight.
Yeah? Something like that. Something like that. Deep Think also the same. We don't know any details
beyond that. I think there's a lot of people exploring that, at least on the info provider's side,
of like, you know, how do we parallelize search and planning and all that.
And I'm worried about getting too hyped about it.
I think it makes a lot of logical sense.
And this is one of those things where MCTS also made a lot of logical sense.
And we were fooled.
Well, I don't think we're using parallel compute in a way to search over like low probability tokens.
We're using it to get robustness.
Like, O1 Pro is, it was so nice because it just had a very predictable depth to it, even on niche topic.
where like sometimes models just fail out.
Yeah, you have some numbers that they went to like from like 10 to like 95% or something.
I don't remember the exact numbers,
but that's what it feels like it doesn't feel like you turn on O3 Pro
to make it 10 times more likely to find some niche piece of information.
Like maybe it'll be a bit more likely,
but we're not getting that type of like searchy notion
of getting more breadth or depth into a tree.
So I think there's value to it where we want to use this parallel
on the whatever, either like the most important tokens that we're generating or like, okay, I know
this part is crucial. Let's just spend a bit more so that those tokens are better. But it's not a
transformative thing. The part that's potentially interesting on the transformative side is like if you can
get much better verifiers. So I think of verifiers of changing the slope of inference time scaling.
You spend more tokens at inference the better verifier you have. If you're doing parallel, it can extract a rare
occurrence. So like right now, if our verifiers are only, like, they're good at, like, human preference,
it's like, okay, we don't need to, we don't need to crank that up very much. But if we are doing
really diverse generations and your verifier is better, it'll get better, it'll do better. I think you
can look at the extreme between a reward model and an Oracle, where it's like, the Oracle is,
the more you search, eventually it works. So the slope is, is good. But a reward model is like,
there's really a capped signal out of it, at least if you're doing,
this preference type of thing. So the slope is pretty minor and it kind of has diminishing returns.
So I do think that if you could fill that with more interesting verifiers, there's potentially
more to get out of parallel compute, but I don't think it is like as transformative right now
on my outlook. It's more like parallel agents makes more sense of like if you could break down
abstraction nice, like as a throughput engine if our tasks are taking a long time rather than a like
at peak performance engine. Okay. Which are the kind of things.
with the whole agent versus model thing,
where agents are much more about, like, getting it done at all,
like being robust and being fast,
where, like, this model is one generation.
It's like, can you get the answer right?
Yeah.
I will spend a little bit more time on this and I'm happy to move on.
My pushback or counter to this is that it's a way to pull forward
a hypothetical future model that you can then distill from.
Yeah.
Which is nice.
Well, I bet people, I mean, they surely will use these for synthetic data.
It's just like the marginal gain on synthetic data.
very high. Or just like
Amanda Askell will say like better prompting
will effectively make it seem like you have the next
generation model. Or like most people
don't put effort into their prompts.
Or she had said something of those lines
in one of her enthrropic interviews.
It's just like if you can really figure out
how to kind of get into the
certain states of the model. Yeah.
Well, anyway, that's my pitch for like why this
is worth doing at all. And like, you know,
I have a science fiction story
that I want to write about quantum
models in a world where like
you could explore cheaply multiple universes, then like, you know, sort of pull forward the right one.
That would work. This sounds too science fiction-y, but I feel like in a world where we could
control quantum computing well enough to explore this and skill it up enough, it could be kind of cool.
It also could be that parallel compute is grounds for interesting types of innovation.
Like I don't know, like, what does it mean to have parallel compute with diffusion language
models that generate all their tokens at once? Like, does that meaningfully change some sort of
I don't really know. I think it would be, like, the diffusion language model would be fun if it works.
So you can have much more control over inference time scaling. I mean, like, Gemini has one. It's, like, hard to suss out what it changes.
But once we have all these knobs, I'm hopeful that it helps build some interesting types of innovation, because, like, the parallel stuff is new.
Architectures can change, we'll see. I've been using the Codex best of an thing. And I feel like most of the generations,
are like, you know, 5% different from each other.
Because you use Ruby.
No, no, no, I had a JavaScript one.
I have a JavaScript one, so it should be good at that.
I don't know if it's like just how the RL encoding works.
One thing that I've noticed,
these models always want to do if statements
when there's like a missing M variable
so that it doesn't fail when it runs.
And I feel like that to me, that's just like a symptom of the RL.
Yeah.
The code is terrible.
Like, you should not write code.
It shouldn't silently fail.
If there's missing variable, it should just raise an error.
But I feel like the URL is like pushing the code in this direction.
And then all the generation have the same pattern.
You know, I generate 14.
All of them use the if statement just in different pieces.
Yeah, that was something I will definitely get over.
That's just like the labs are trading off massive gains in performance for small
detriments and usability.
And it's like, do you ship that model?
Yeah.
Like you just ship it and deal with it later.
But I'm sure they could fake.
I'm sure that's a fixable thing.
I think to me that's the question is like, you know,
you talk about how you have gains in like pieces of the thing,
but not in the full trajectory sometimes.
Do you feel like these are examples of that?
Or do you feel like as we get better,
if we did a longer trajectory where instead of just writing this piece of code,
you have to think about how you're going to maintain it later
and like how it's going to run, that's going to fix it?
Or it's hard for me to grasp.
Yeah, the software stuff is not easy because it's almost like,
maintainability almost feels like a human preference type issue again, where somebody could look at it,
it'd be like, yeah, that's not as good. But adding the heuristic in trading seems very messy.
Yeah. So maybe it is. I don't know. There's a lot more to dig into that, I mean,
this is what Anthropics says they're doing and just what are the actual frontiers in making,
like they say they're working on code only and what does that actually mean? A bunch of it is
going to be designed tradeoffs and
how much autonomy the model
it has versus these potential
side effects from training longer
that we don't know how to get rid of.
I mean, that definitely could be the
sort of a behavior like that
is what I would say is like a simple thing
to remove where it might just be
obsessed with some code format that
fails when you revisit it
or something, even if it's
like everyone has seen it with just bypassing
test cases. I think they'll be a bit more
nuanced than that, but they could probably be
super simple. This topic has a similar semantic content address for me as over-optimization,
which is something that you've written about. It is over-optimization with a different reward
function. I know. Okay. Well, I meet that link. I want to verify that we are thinking on the same
wavelength. I just wanted to go over, again, specific topics on things that you've spent some time
thinking about. You write that there are three types of over-automization. First was RL for control. Second
was RLHF and third is RLVR.
They always happen. Obviously, RL is no stranger to reward hacking.
But maybe, do you want to elaborate on how things are evolving in terms of how we're learning as an industry?
Yeah.
So that three things breakdown is for people to put the pieces together for what has happened historically.
All of these over-optimizations are a just the model optimizer is strong enough where it can
manipulate the agent with respect to the environment or manipulate the environment.
environment in a useful, in a way that's useful to its target signal. Also, like, for context,
I think with what we're doing with language models in RL in general, is that if there's something
that can move its reward signal up, it'll move the easiest thing, the most direct things to
move that single up. So that's part of the story that I said on Sikoffancy, which is this reward model
for user feedback was probably so obvious that humans just like to like stuff that is, like,
people press that thumbs up.
Long,
and when they're...
Folling filled, bullet points.
Yeah, like, all those things
have just been really easy
for the model to extract.
So, like, once they added it,
the model changed a lot,
and the score went up a lot,
and it was easy for the RL to find that.
In control, the oldest RL,
the environment is normally a simulator
that is fixed.
There's no feedback.
So the over-optimization looks like
unphysical and nonsensical behaviors.
There's the motorboat example
going in circles.
There's, like,
an example is a project
that was middle author on
was effectively over-optimizing, like, half-cheetah, which is this Majoko thing.
Instead of running, it did carwheels off into the sunset and got like infinite numbers.
It's like obviously not the intended purpose.
It looks like a glitch.
So it's just kind of manipulating the agent interface with the environment.
RLHF is kind of a classic case where the model will just break down because the reward model is
imperfect.
So the environment is really imperfect in the RLHF case where...
It's so sparse.
It's like very artificial.
Yeah, it's a very artificial.
environment. So it makes sense that these actions, which are generated tokens, will do things,
like reduce into just repeating one token over again. It'll be like, I think one of the early
examples we had playing with us at Hugging Pace was the model would just say JavaScript.
It would be JavaScript, JavaScript, JavaScript, JavaScript. It was like some toy dataset.
And it's very obvious when you see it. It's probably harder to see when you're at the top
and making decisions on when to stop training if you're doing a lot of RLHF. But that was kind of
the phase that people have gone through. And now we're in the RLVR phase, which is,
we're giving the model reward when it does something quote-unquote right. For math, it's
a bit harder to over-optimize, I think, unless you have tools and the model learns to
search and cheat instead of learning math, which I'm sure somebody could see that out in the world,
which is like, oh, I'll just find the, you're training. It's like, the model's like, oh,
you're training me on Stanford's problem set for CS, whatever, that it's seen a thousand times.
So it's like, I'll just go get the solution manual, which I'm sure. Somebody can find an example
where that has surely happened. But on code,
and maybe information retrieval, it's easier to fudge.
So the code thing is, like, the easiest way to get a unit test and pass is just put a pass in it.
Like, that is not too surprising that a model can learn how to do that.
And there, for code, you need more reward design to think would be a nice, for like a substantial
academic work is, like, what is reward design and code for balancing this sort of, like,
understanding this over-optimization of test cases or avoiding family.
or something like this. I'm sure there's, it's not just like going to be a controlled environment because
these models are complicated, but I would guess you can reproduce that in some ways.
Just to double-click, reward design means, like, for example, giving credit, partial credit for
partially correct work. Yes, or like giving the model a slight penalty for doing the unit
test thing, if you can detect it. Yeah, for cheating. Yeah. Which is, it adds a lot of complexity to
training these models compared to math, which is just ifs answers, right? I mean, you can look at the
GRPO math and partial credit is weird in that because it's kind of normalized per batch.
I don't know if I have a whole feel ready on it for that, but it's also just, it becomes very
complicated if you're mixing domains and it's like, is partial credit in code better than partial credit
in math or all these things? It's like reward design becomes very complicated, and that's what
you're incentivizing the models to do different things.
Yeah.
Is there any literature or hypotheses about mixing these things?
So let's say you have the one for code, you have the one for math, you have whatever other
verifiers you can come up with, and individually they work?
Do they conflict?
I think part of the intuition of RLVR is that the model is good at knowing which prompt
area it is, which is why the models don't get worse on knowledge benchmarks if you're
training on just math or precise instruction following. So the model just kind of develops an
intuition for like where the different prompts are in space. So the gradient updates will be different
depending on your batches, which is partially why people will just say do big batches. So like a lot of
the model is activated and you have a less noisy signal with RL. But a lot of the intuition is that
the model just kind of handles that. And there's interesting questions on sequencing. Like do you do
large scale math and code RL to get the sequence length? And then
and add in more general stuff, which DeepSeek mentioned,
but that's one thing to go, the deep seek report is like math and code to more general
RL. There's a question on where do you do tools if you're going to do like code execution
and search within this. So I don't know if that's interweaved or if it's a second stage.
Got it. Yeah. I don't have comments there. It's just like, it's surprising how much is not known
and you just need a lot of compute for ablations. The inference, high inference length
generations definitely just like kind of breaks all infrastructure because there's just so many tokens
it's more opportunity for out of memory or other things go wrong so it's like just on a default
all of your training jobs need way more GPUs for the memory of inference sure and or just like
training but it's just it just makes it more of a pain yeah that's a cost thing um you know one of the
maybe controversial takeaways from the gnome prod which you listen to was that there's also
just wall clock time of just getting feedback from the environment, whatever that is,
especially if it's like a real world thing.
And I'm just like, yeah, I mean, there's some point at which your trading runs cannot
take longer than like a human life.
So to me, that was the wall.
He disagreed with that.
But like, that was what I meant by it.
Like at some point, long inference, you do want it to terminate within some reasonable
amount of time regardless, just as a user.
Yeah.
We have to find a way to accelerate internally within the training time faster than the passage of time in the actual universe.
Yeah, we're not, I'm not worried about that problem, but I agree with you in principle.
Right.
So I'm stretching this all too far.
I get it.
As we kind of start wrapping up, what are other interesting ideas that people should pursue?
Like in your AIE talk, you said what I'm thinking about for scaling REL, you had big multi-domain data sets, difficulty filtering, long run times.
Is there anything specific that if there's people out there that are either doing research or they want to do a company or whatever, these are like interesting things that you don't want to do, that you want other people to explore?
Most of them, I think, are not in the reasoning space, which, like, their talks have been about reasoning.
So I've been long talking about, like, character training is something that I think is under-indexed on and been advising a student.
Character level?
Like personality training.
Okay.
And how that, like, different ways of changing the personality of the model from prompting activation or fine-tuning.
Okay.
Like, data engineering.
So stuff that, like, Joanne Jeng does for Open AI.
So, like, like, how much does that matter?
What are the fundamental research things?
Hopefully, I can share more that I've been advising a student on that.
So I've been saying that for a while.
Do you like the model spec stuff that she's doing?
Yeah.
Okay.
Yeah.
That, that's, I've been a early fan of that.
I mean, that's how she finally, that's how, like, she noticed.
me as it was like the only person that covered it when they first released it. I think it was like
did. I would. I like did. Yeah. Not many people did. Okay. All right. All right. You were first.
I don't know. But like that's what she said to me. Well, we had a, you know, we had a model spec talk
closed the whole conference, right? Like, that was my sign of like, pay attention to this guys.
But it's real because of what it sends to like develop. It has a developer benefit of like
where your model's going. And then also just like regulatory. I think it is very important to like,
What is like an intentional behavior versus just like a training error?
Okay.
So I think for model transparency, it's really fantastic.
And I've said that like the model spec is much more useful than a constitution.
Because the constitution is like an intermediate training artifact that you give to the training algorithm in order to get the model that you want.
It is not necessarily like what model did we.
Like we don't write down our goals of the model in a constitution form.
By the way, have you looked at the constitution?
Not recently.
They talked about it.
They put in like Apple's like design guidelines.
Yeah.
then also like the UN like declaration of this level I've seen it.
I didn't know if they've updated it.
That's very odd.
I hope that Anthropic would write a model spec.
I'm not too optimistic, but they're the next domino to fall.
Well, so my take on that, actually I pushed for this too late because OpenEI
already approved the talk and all that.
But I was going to ask them to compare the OpenEI model spec to the CloudFor system prompt,
which is their closest thing to the model spec.
It's the system prompt is incomplete because Open AI has things in the model spec that their model
doesn't currently do.
especially when they started. It's like we want to, when they first released it, it was like,
we want the model to be able to engage on, like, sensitive subjects and maybe like even NSFW is in their
model spec, which is, they're just signaling of what they wanted to do. And they say, like,
this is very hard to implement because there's all these obvious risks to doing this. But it's like,
in an ideal model where we can solve every problem, this is what we do, which I think is good,
as I said, for many different stakeholders. So I mean, mostly my thing is like there hasn't been a good, like,
research paper on that. That's just a lot to do. It also runs into personalization and personality
or similar, which is like if open models are to win, part of it could be just like everybody
can have exactly the model they want. We're serving GVT 4.5. It's kind of its thing. You can
prompt it. But if fine tuning is more effective than prompting, everybody can have the model
that they want. So it's a good, it's like an academic problem or an open ecosystem problem where
people are fighting on the turf that it feels more likely to win, which is good. Is this someone
where you, like, as speaking as AI2, Omo, you want to win?
Or is this you're just advising a grad student on it?
I don't think it's a differentiating factor yet, but I'm very open to working on it.
I think, like, open models have a strong, you know, role play use case and, you know, like,
character, personization, all that stuff, right?
Especially because people, like, they find their wifu, they want to keep their wifu.
And, like, that's the derogatory term for it.
But, like, I would say that we've definitely discussed it.
And I want to, part of Olmo should be that it is a base model that's easy to take in directions
that you want. And we will have an opinion that is probably slightly conservative on personality.
I mean, I've gone through the open A on model spec, and it's like most of these we agree with
and, like, be conservative on anthropomorphization.
What do you disagree with?
I don't remember. I did it a couple months ago. But a lot of it is like openness or transparency,
which is like if we're training an open weight model personality, like we're not going to withhold anything.
And we have a different hierarchy.
So most of them are on like that type of information exchange rather than be kind.
Like opening eyes model stick is pretty agreeable.
And if you read through it and it's like treat the user with respect and all these things are.
I raise the kids that way.
Just read the spec.
Yeah.
It sounds kind of stupid.
But then the last thing is for people doing research, it's like wacky model routing things where you figure out like a bunch of different models to off-hugging phase to route.
to because an open model tool thing could use way more models more easily than any open AI product.
Because open AI is restricted to the open AI's models where if like maybe I don't know open routers,
like I'm going to make a product out of this, which is a router. Like open router actually does it.
And they're like our chat window knows the best model based on all this usage that we have.
Yeah.
For your query.
There's people that started the other way. Like Martian, not diamonds. I don't know who else is he would know.
There's a bunch.
There's a bunch.
Yeah.
So I don't know.
I don't know if that would work.
Hugging face should work on it.
It's like, it's a moonshot idea.
You don't know when it's.
Given your hugging face back in.
What does,
how does Huggy Face make money?
This is a very common meme question.
I think mostly like enterprise deals.
That's what they say.
Which is like,
they're doing their thing.
They're supporting their people.
I mean, look, they're great.
They're big.
They're profitable.
It's just not that obvious to most people.
I like the router idea for media models.
I feel like there's like so many.
There's like a long tail of like a,
background remover, like a style applier, like, that is actually hard to find.
On the tech side, I feel like just use the big model.
Unless you're like under some like latency or price constraint, you should just use the best model.
Even when we're doing thumbnails, I'm like, okay, I'm trying to remove a background of somebody.
And it's like I go and replicate and there's like 55 background remover.
Yeah, I just use Adobe because it's a website.
Well, but that doesn't work.
Like the Photoshop model is bad on some things.
But again, it's like, or I want to generate a diagram to like mimic something.
thing and it's like, well, which model is better diagrams?
Yeah.
You know, it's like, those are not easy to find because none of the benchmarks are right now.
Part of the argument is that if distillation works really well, we could just keep making the target for
distillation smaller and smaller, which is you have models that are very narrow.
Right.
And they're mimicking these huge models on something that's like pretty, I don't know, like,
reformatting tables.
It's like, can you do a table reformatter from markdown to Latech and a 100 million parameter model?
Like, if you get it small enough, that is really economic.
feasible because it's effectively free, it imprints and instantaneous.
My pushback is on this is just, if you're doing image editing, 4-0 should do it, do it all of it.
Well, yeah, but I think it does, like, we're just not there yet.
Like, give it five years, it'll do it, right?
So why we're kind of router at all?
You just scale up 4-0.
I guess I think it's, yeah, right?
Like, tell me where the logic is here.
Like, this is like a temporary thing.
On device.
On device.
Like the local modeling community, I think, is much smaller than people get.
give it credit for because most of the use for open models is still in APIs. It's like deep
seek API. It's convenient. It's like if there aren't that many models, somebody is going to host it
for cheaper than most people's doing it themselves. That's pretty realistic. But there is a small
community that need local. Yeah. The best outcome is if open models can compete on not just
long tail things. But that takes the most transformation. Sign note, so I
resisted by buying my own, like, buying my own GPUs, building my own cluster.
For this reason, I'm like APIs will solve most of it.
Like, people are losing money to serve me models.
Why am I, you know, having those?
Except for the fact that 4090 prices have doubled in the last year.
So actually, you made money doing local models.
How does that make you money?
Because your investment goes up.
Yeah, you can sell the card and you do that.
So as you used 4090s goes out.
Interesting.
Should have bought a 4090.
I got a 47.
Damn.
What is this?
Well, then it puts me on tail light.
Should I buy, you know, $15.90 if it ever, you know, is widely available.
GDT, they were doing the drops.
Yeah, I know.
It was crazy.
We were like running to the camper to buy it.
Any other topics before I give a closing question?
Just generally, your work, R.OVR, like, are topics of the day.
I think companies should keep considering releasing open models, mostly for PR and onboarding.
It seems like a way it's going if Open AI is releasing it.
Are you excited about that?
Do you feel like it's like a Syops again?
The Open AI model will be good.
I expected to be.
They're pretty serious.
It'll be best in class for some sized category and some subset of tasks.
That's like open AI only does things like that.
You have to give them the respect.
They won't deserve it.
Yeah.
That is a big like open wins when more people are doing it.
So like that's a win.
And yeah.
Well, I mean, hopefully they are actually open about the techniques.
just the weights. Do we think the size of the open model is, tells us anything about the hardware
that they're going to build? No. What? No. They're so secretive about this. That's like, that's why they
haven't released DVT 3.5 or anything because it's too revealing about internal stuff or plans.
Oh, okay. No, so I, I, you're talking about Stargate or what do you, what kind of hardware?
Johnny I think. No, yeah, I think that's a different phone factor. Yeah, that's it. Yeah, yeah.
I think that thing will run on the cloud. I don't think that'll run local anyways. Well, okay, we have to
talk about it. It seems like every podcast you talk about it. So apparently the news from today,
which I think you were looking at, was that it was like an ear device that they sued, they got
sued over or whatever. But like, I think the earform factor is pretty good. Like, I actually
did get there with B in terms of like where, where does this ultimately go? Like, you want something,
you want the AI to hear what you hear. And where do you hear what you hear it on the ear? Like,
that's pretty much it. I don't know if you guys have like thoughts on wearables and where that goes.
I try to be, I think it just knows too much.
That's really my...
But you want to give it context.
Yeah, I have false privacy hopes.
I think a lot of people, I mean, that's the whole thing.
It's like people don't actually care about privacy.
It's just no taking, you know?
It's just a really good memory.
I think the meta rayband form factor is good.
I don't think it's as mass market.
It's like if you get it in an AirPods-sized form factor, it's a way bigger market for obvious reasons.
But the, like, sunglasses form factor is the thing that works, I think.
I don't use them for AI, but they can fit the AI to work it.
Like, yeah, empirically, yeah, it obviously works.
Yeah.
Cool.
Well, the last question I was saving up was this whole, what is meta doing?
You know, you actually had a pretty interesting post back in, when was this?
In April, you said, Lama 4, did meta just push the panic button?
I feel like back then it didn't actually push the panic button, but now they really pushed the panic button.
That's fair.
I think the panic button at the time was the whole.
LMSSys model not being the model
they released thing along with a bunch of weirdities
about the day of the week they released.
But to be a model that claims to be open
and then not release the model that is your leading claim
is just like that is like bad execution.
Bad execution. Yeah, yeah, which is fine.
And then the recent stuff I think
mostly can be boiled down to talent is cheaper than GPUs
by a dramatic margin. And at the end of the day,
it's like, okay, if we're spending this much,
They go to the room and they're staring the mirror,
they're like, wait, it might not actually be that ridiculous
to spend this money on the top people.
It's like, might as well try.
They already spend it on VR.
Somebody was bound to do this eventually.
And it makes sense that it's like if Apple or somehow decide, like,
we're going to do this, they're going to come in and do exactly what Meadow is doing.
They need a founder mode CEO who is like, screw it, like, you know,
we'll take the L.
The thought that did occur to me is, you know,
meta, instead of spending on VR, they should spend on REL.
VR.
Well, I think the question is, like, I think a lot of, some researchers, like, most people
will take the payday and happily move to bed.
Everybody has a bribe number.
Right.
It's just a hammer really big.
Yeah.
But, like, I think some researchers are uncomfortable with the idea that this is sort of the
great man theory of research, that, like, you have to pay this much to get this level
of talents.
And the talent is definitely distributed.
Yeah.
Right.
A lot of the people that they would be paid.
paying this much, have the confidence to redo things or to just do some of the same things
and just like, whether you call it feeling the AGI or just drive to build things.
Feeling the AGI is not that different than a lot of things that have existed in Silicon
Valley lore in the past.
It's just people with the vision that are willing to execute on it and they see something
coming.
And those people make a big difference.
I think you have those people and you remove beer accuracy, getting
technical, talented researchers
is actually something that meta has a lot of
or has the ability to get
a lot of. So it's a lot of recycling,
which is very hard on individuals
and morale of an
organization. But that's
like understand the approach.
Yeah, for sure.
Cool. That's all that.
Any parting thoughts on
how you're going to build the American deep seek?
That was a nice tweet.
Yeah, mostly if I have to look
at what my
And if you were asking me, like, what my 10-year goal is, and it's like, I only will have, like, a two-to-five-year goal, where I think as models are shifting more towards agents, I think that, like, scaling is slowing. It's like, they're a side of it of a fixed cost and a fixed path to getting towards something like American Deep Seek, or mostly just, I would say it doesn't have to be American if it's fully open. You have everything and you can modify it, which is, like, there's a few things that need to fall. A lot of it is just more resources, but it's like, like, almost.
32B is if you squint like original GPT4 level and fully open. And it's like there's a few
levels that you need to go through. Like that's obviously a dense model. It needs to be taken to
sparse M-O-E and you need to scale it. You need to have a lot more GPUs and then you need to do
like large scale reasoning. That's the goal that I want to do. There's a lot like that's what I want
to do. There's a lot of complexity and navigating like how to work with AI. Like what does AI2 do
to get there. It's very hard. I think that, I mean, it's a nonprofit. It's hard to get the resources
and building a model is a lot of aligning a lot of different people. That's the deep seek story is they
have great people. Open AI has kept a lot of really good people for a long time. Anthropic
has gotten a lot of good people right now. And it's like it's a lot of incremental, hard technical
problems that you need to stack up. That's what I would like to do and make work in the next
couple of years, but it's not easy to get there.
So that's the pitch is, like, AI2's best case scenario is
AI2's going to do other things.
Like, you can't just run a nonprofit or a company that says,
our goal is in three years to have an American deep seek.
Like, no one's going to keep paying the bills on that because you have to tell a better
story.
But that's, like, what I would like to do in that.
And I'm sure AI2 will do many more interesting things along the way.
Like product stuff.
I don't know it's necessarily product, but like, what are more, like, what are cutting
edge things in AI that we can make a new architecture for certain things.
Okay.
Or like what are demos of open models working better, whether you have like private data or something,
or just far out ideas that could take you off the transformer trajectory.
I think that, like, you still need to be doing these to kind of lead an AI.
Thank you for working so hard on truly open source AI.
Yeah, it's fun.
I mean, it makes it easy to align like values with what you're doing.
Yeah.
Like, it'd be better for the world.
if more things are open and therefore,
a lot of it is just
willing it into existence.
And I take seeing, like, what opening
I does is, or is saying they're going to do
as, like, hopefully a win coming soon.
Yeah. Like, DeepSeek was the most
unexpected win that made some other
dominoes fall. Well, yeah,
I think that is the path
for her and see what it takes.
Thank you so much. That's for coming on.
