Latent Space: The AI Engineer Podcast - The RLVR Revolution — with Nathan Lambert (AI2, Interconnects.ai)

Episode Date: July 31, 2025

We first had Nathan on to give us his RLHF deep dive when he was joining AI2, and now he’s back to help us catch up on the evolution to RLVR (Reinforcement Learning with Verifiable Rewards), first proposed in his Tulu 3 paper. While RLHF remains foundational, RLVR has emerged as a powerful approach for training models on tasks with clear success criteria, using verifiable, objective functions as reward signals. It is particularly useful in domains like math, code correctness, and instruction-following. Instead of relying solely on subjective human feedback, RLVR leverages deterministic signals to guide optimization, making it more scalable and potentially more reliable across many domains. However, he notes that RLVR is still rapidly evolving, especially regarding how it handles tool use and multi-step reasoning.

We also discussed the Tulu model series, a family of instruction-tuned open models developed at AI2. Tulu is designed to be a reproducible, state-of-the-art post-training recipe for the open community. Unlike frontier labs like OpenAI or Anthropic, which rely on vast and often proprietary datasets, Tulu aims to distill and democratize best practices for instruction and preference tuning. We are impressed with how small eval suites, careful task selection, and transparent methodology can rival even the best proprietary models on specific benchmarks.

One of the most fascinating threads is the challenge of incorporating tool use into RL frameworks. Lambert highlights that while you can prompt a model to use tools like search or code execution, getting the model to reliably learn when and how to use them through RL is much harder. This is compounded by the difficulty of designing reward functions that avoid overoptimization, where models learn to “game” the reward signal rather than solve the underlying task. This is particularly problematic in code generation, where models might reward hack unit tests by inserting pass statements instead of correct logic. As models become more agentic and are expected to plan, retrieve, and act across multiple tools, reward design becomes a critical bottleneck.

Other topics covered:
- The evolution from RLHF (Reinforcement Learning from Human Feedback) to RLVR (Reinforcement Learning with Verifiable Rewards)
- The goals and technical architecture of the Tulu models, including the motivation to open-source post-training recipes
- Challenges of tool use in RL: verifiability, reward design, and scaling across domains
- Evaluation frameworks and the role of platforms like Chatbot Arena and emerging “arena”-style benchmarks
- The strategic tension between hybrid reasoning models and unified reasoning models at the frontier
- Planning, abstraction, and calibration in reasoning agents and why these concepts matter
- The future of open-source AI models, including DeepSeek, OLMo, and the potential for an “American DeepSeek”
- The importance of model personality, character tuning, and the model spec paradigm
- Overoptimization in RL settings and how it manifests in different domains (control tasks, code, math)
- Industry trends in inference-time scaling and model parallelism

Finally, the episode closes with a vision for the future of open-source AI. Nathan has now written up his ambition to build an “American DeepSeek”: a fully open, end-to-end reasoning-capable model with transparent training data, tools, and infrastructure. He emphasizes that open-source AI is not just about weights; it’s about releasing recipes, evaluations, and methods that lower the barrier for everyone to build and understand cutting-edge systems.
Timestamps

00:00 Welcome and Guest Introduction
01:18 Tulu, OVR, and the RLVR Journey
03:40 Industry Approaches to Post-Training and Preference Data
06:08 Understanding RLVR and Its Impact
06:18 Agents, Tool Use, and Training Environments
10:34 Open Data, Human Feedback, and Benchmarking
12:44 Chatbot Arena, Sycophancy, and Evaluation Platforms
15:42 RLHF vs RLVR: Books, Algorithms, and Future Directions
17:54 Frontier Models: Reasoning, Hybrid Models, and Data
22:11 Search, Retrieval, and Emerging Model Capabilities
29:23 Tool Use, Curriculum, and Model Training Challenges
38:06 Skills, Planning, and Abstraction in Agent Models
46:50 Parallelism, Verifiers, and Scaling Approaches
54:33 Overoptimization and Reward Design in RL
1:02:27 Open Models, Personalization, and the Model Spec
1:06:50 Open Model Ecosystem and Infrastructure
1:13:05 Meta, Hardware, and the Future of AI Competition
1:15:42 Building an Open DeepSeek and Closing Thoughts

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:03 Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI. Hello, hello, and we're excited to welcome back Nathan Lambert from AI2. Welcome. Thanks. Fun to be here. I feel like I also have to say Interconnects and the Lex Fridman Podcast and the AI Engineer World's Fair.
Starting point is 00:00:24 Like, you've just done a lot in the last year and a half. Not that many. I still say no to plenty of things. Yeah, yeah. Your first episode with us was January 2024, when you had just joined AI2. Then you released all the OLMos. You joined us again at NeurIPS, where you did the open models. Well, Luca did. And you supported. And then you were more recently here in SF for AIE. First of all, I wanted to congratulate you on winning the best speaker.
Starting point is 00:00:53 Oh, yeah. Thank you. The reasoning track here. Here you go. I'm limited by emoji. Oh, it's a nice AI-generated one. I look too Zen. I look so Zen in this AI-generated one. So we had our track host take photos of you while you're speaking. I'm going to turn it into Ghibli photos. But this one, your eyes were closed. It's funny. Okay, we were trying to have Mochi, the resident Pomsky, join us, but I think she's getting very anxious,
Starting point is 00:01:15 very restless. A little too crazy, Mochi. Very restless. Okay. Sure. Okay. So you've been doing really good work. And honestly, I think one of the things that we wanted to kind of establish was Tulu and RLVR.
Starting point is 00:01:30 I guess. Is that a good place to start? Sure, it starts us in the recent journey. I think that we can recap kind of the story of what Tulu 3 was aiming to be, and then kind of how it got folded into what the new narrative is. The goal is to try to do the work to compress what are complicated industry post-training recipes into something somewhat tractable that you can modify on your own, and do post-training at what is actually a state-of-the-art level, I think.
Starting point is 00:02:02 What we do relative to frontier labs is that we probably have a smaller number of tasks. I think our post-training suite for Tulu is probably like 10 to 15 tasks, but I would guess for post-training at OpenAI et al., you have maybe hundreds of evals. And adding more evals is more data work and more mixing work and making sure you have these things. But on the core evals for our suite of models, which I think is 8, 70, and 405B, based on Llama at the time, it matches or beats Meta on these core evals. I think Meta has different priorities in their things for Llama 3.1, which was a great set of models at the time. And it's just like, how do we distill what are very complicated post-training explanations or diagrams from the likes of the Llama 3.1 report, where they have these complex feedback diagrams with many iterations, and earlier signs of that from Anthropic papers that have these multiple model variants and early constitutional AI things, for multiple years, and what does that look like when you're doing
Starting point is 00:03:01 a large-scale instruction tuning into preference tuning and what else you might add. I think a lot of the core contributions of that, before we talk about this reinforcement learning thing, is that we showed how to scale up preference data. The academic community had been using this one dataset all the way back since the Hugging Face models like Zephyr Beta, which is when this UltraFeedback dataset got popular. And still, a year later, it was the state-of-the-art dataset for open preference tuning. And it's just one of those obvious things: it doesn't need to be the case. So a big part of it is trying to make more mature recipes available to people. And I mentioned this either on one, I think I'm trying to talk
Starting point is 00:03:42 with Jordan. I mentioned the origin of the RLVR thing, which is that realistically, when you work in the open, a lot of it is trying to match what industry has done. And we're on a different path because our infrastructure is different. So some things that OpenAI does now that work really well for long context won't work that well for OLMo because we might not have enough FLOPs in our base model. We might not have certain datasets for legal things. But directionally, a lot of it is just trying to
Starting point is 00:04:07 reproduce things. And I've long tried to get John Schulman on the pod, of OpenAI, Anthropic, and now Thinking Machines. And at the time, he had gotten approval to chat with me. What he said was confirming a lot of the things
Starting point is 00:04:23 that I had said on instruction tuning and multitask and preference tuning. And he was like, oh, yeah, everyone just does RL on the outputs. And that's how we got the RLVR idea and scaled it into something that is a general method. There were a lot of very similar works at the time, like VinePPO and Quiet-STaR, on doing these math and coding domains for getting verifiable rewards. I think the RLVR thing was about doing it in general recipes. And the naming was something that stuck. Originally we had, I think it's especially Costa Huang, who was a kind of lead RL engineer at AI2, who's doing some stealth startup now. You can hear more from him on that soon.
Starting point is 00:05:06 I think he's founding engineer of something. And Hamish Ivison, who's still a student at UW. They were leading most of the technical work on this. And the naming was going to be RL from ground truths. But then the verifiable rewards is actually a more general notion, because only something like math questions has a ground truth, whereas code is verifiable and precise instruction following is verifiable. So I think it's a nice evolution of the name, which makes sense as you look at more domains, which is now why it catches on with people. Once Jensen started using it, it was like, okay, that's set. That wasn't really our goal. But is that where it took off? No, that was a sign of it having taken off, because it
Starting point is 00:05:41 was after DeepSeek. But it's like when people like that have the acronym on the slides. And it's also very clear: RLHF is four letters, so we wanted to evolve that and have a similar four-letter acronym. There's not that much magic to it, but there's definitely intention on these little things. RLGT may not have worked as well. I don't know why. Yeah. Yeah.
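A minimal sketch of the "verifiable rewards" idea described above, covering two of the domains Nathan names (math with a ground truth, and precise instruction following). The function names, the "last number in the completion" parsing rule, and the specific constraints are illustrative assumptions, not the Tulu 3 implementation.

```python
import re


def math_reward(completion: str, ground_truth: str) -> float:
    """Hypothetical math verifier: 1.0 if the last number in the completion
    equals the ground-truth answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0


def instruction_reward(completion: str, max_words: int, must_include: str) -> float:
    """Hypothetical instruction-following verifier: there is no single ground
    truth, but each constraint can still be checked programmatically."""
    within_limit = len(completion.split()) <= max_words
    has_keyword = must_include.lower() in completion.lower()
    return 1.0 if within_limit and has_keyword else 0.0


if __name__ == "__main__":
    print(math_reward("So the total is 42.", "42"))
    print(instruction_reward("Tulu is an open recipe", max_words=10, must_include="tulu"))
```

The second function illustrates why "verifiable" is the broader term: the response has many acceptable forms, yet the reward is still a deterministic check rather than a learned preference model.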
Starting point is 00:06:00 That's what these people were all definitely thinking, and they made that name change, which works, which was fun. You did mention it. So you kind of mostly quoted from the Tulu paper there, but we'll show the RLVR chart. You did mention that you wanted to change it now. And we'll sort of preview a little bit of the agents discussion. Yeah.
Starting point is 00:06:20 I think when you are introduced to RLVR, there's just a function, really, that checks a string outputted from the language model. You have a relatively simple function that's like, is this answer from the language model correct? And there's no real environment because you're just looking at the generation. And now I need to figure out the right way to communicate what something like multi-hop tool use looks like for this, which is something people are definitely doing, thinking, like, what is the right diagram to encapsulate how o3 is trained, where it takes multiple actions because the next sequence depends on the feedback from the environment, which is some sort of information store. So, when it's searching for a niche piece of information, you can't know what the next actions are without the feedback from Bing search, which
Starting point is 00:07:12 is what they say they use. That is a step that is very much happening. And then as people try to transition to more end-to-end RL, there is a really strong notion of environment, which is that you're looking at a sparse signal from these multiple generations. And that's what people want to do. I think it's debatable whether or not people are actually doing it now. I think the deep research blog post kind of hints that they do a bunch of small-scale RL and then, poof, the system works, which I think is much more of what's happening: people train on a bunch of small things and they do some prompting and they see that it works when you put these pieces together, or a couple of different fine-tunes of a model. So it seems like deep research has some fine-tune of o3 in it.
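To make the contrast with a single string check concrete, here is a toy sketch of the multi-step setting described above: each action depends on what the environment returned, and the only reward is a sparse signal at the end. The information store `DOCS`, the `search` tool, and the scripted policy are all hypothetical stand-ins; nothing here reflects how o3 or deep research is actually trained.

```python
from typing import Callable, List, Tuple

# Hypothetical information store standing in for a real search engine.
DOCS = {
    "tulu 3 authors": "see the ai2 tulu 3 paper",
    "ai2 tulu 3 paper": "RLVR was introduced in Tulu 3",
}

Action = Tuple[str, str]  # ("search", query) or ("answer", text)


def search(query: str) -> str:
    """Toy tool call: look the query up in the information store."""
    return DOCS.get(query.lower(), "no results")


def run_episode(policy: Callable[[List[str]], Action], question: str,
                answer_key: str, max_steps: int = 4) -> float:
    """Roll out one episode and return a single sparse outcome reward."""
    history = [question]
    for _ in range(max_steps):
        kind, text = policy(history)
        if kind == "answer":
            return 1.0 if answer_key.lower() in text.lower() else 0.0
        # The next action cannot be chosen until this observation comes back.
        history.append(search(text))
    return 0.0  # ran out of steps without answering


def scripted_policy(history: List[str]) -> Action:
    """Stand-in for a language model: follows pointers, then answers."""
    last = history[-1]
    if len(history) == 1:
        return ("search", "tulu 3 authors")
    if last.startswith("see the "):
        return ("search", last[len("see the "):])
    return ("answer", last)


if __name__ == "__main__":
    print(run_episode(scripted_policy, "Where was RLVR introduced?", "RLVR"))
```

The second search query only exists because of the first search result, which is the sense in which there is a real environment here and not just a generation to grade.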
Starting point is 00:07:56 Because you do that with some different domains of RL, it works, rather than deep research being trained on the outcome, which I think makes a lot of sense for it not working in deep research, because doing outcome-based RL for deep research would be RLHF again. Because you have to have two humans and you're like, which generated report is better? You can definitely do that, and the whole sycophancy thing at OpenAI showed that they have so many different reward models and reward signals in their post-training. But that's just one of them. And I think a lot of the progress in making it exist is doing RL on a bunch of information retrieval and editing and search tasks. We talked with Noam Brown about this deep research and kind of like
Starting point is 00:08:36 the verifiable rewards. He mentioned, obviously, that's an example of a non-verifiable thing having RL work on it. And in one of your recent posts, you also talked about how the big labs have all this data, so they can find long-tail things to RL on, and then when you put them all together, that fixes it. Do you feel like what we're able to verify is a big bottleneck, in that the verifications are only done on these smaller atomic things, and so we cannot really scale that? I think my comment was on making, so this post was reflecting mostly on the question of what agent progress will look like relative to modeling progress. So we've had almost three years of modeling progress and we're pretty used to
Starting point is 00:09:15 the messaging on that. And it wasn't just about doing RL on small things, but doing any post-training to fix a weird behavior. And RL is a very data-efficient way if you can get the right signal, but you could also just say, it does this weird, non-verifiable thing, so let's create 100 or 1,000 instructions to include in post-training so that the model does these types of information extraction correctly, or like, soft extraction. It's a space that I want to flesh out more with more examples of tasks. If you watch Claude Code going, it's like, what is it doing in the background? It's a lot of reading files and even just compressing context. I don't think that's really a verifiable thing, but that being messed up, like, that's a super
Starting point is 00:10:01 crucial skill. For long-context actions and longer tasks, it's just compressing well. And that's going to take some training novelty: you can effectively modify your training data so that instead of having all the multi-turn context, you just insert the summary, and you want the performance to stay as good, because it's also a cost saving to have shorter contexts. There are just a lot of new domains like that. But do you feel like you can figure out what these things are before you release, or do you think the labs have a big advantage because they have so much user data
Starting point is 00:10:32 that they can kind of like inspect this at inference? I think it's mostly looking at real-world data at this point. To the extent that there are clear benchmarks, you can use them in the open, but I mean, we see the industry consolidated around data in different forms, and I think that's a real important touch point for people. I'm curious who's like still collecting reliable sources of open data that everyone uses. There's a lot of action in the space, but hard to get traction. Yeah.
Starting point is 00:11:02 So I think for a long time, preference data has been something where people understand that it'd be very good to have large repositories of it. If you want that, you can annoy me to try to release it all. For Tulu, we have a final dataset, but we have completions and ratings from more models. I'm talking to the student, like, let's figure out how to mark this down, because we just have so many completions and so much LLM-as-a-judge AI feedback data that we don't know how to clean. That's one thing. The problem is I think a lot of it is task- and model-specific. So there's this notion of on-policy, to adopt an RL word, for this preference data and preference modeling, which is that you want the sequences that
Starting point is 00:11:42 you're training this reward model on, the sequences of generations, to look like the model that you're starting to fine-tune. That is something that has made it hard to kind of grab off the shelf. And, for example, this UltraFeedback that I mentioned just has a lot of models in it. So for most of the models that people are fine-tuning, there's some signal for it to improve on. And I don't know how long that lasts. And we still don't have an answer to the question of how important human versus AI feedback is. Every time I check in with people at frontier labs, they're like, yeah, we still use human preference data. And I'm like, okay, I don't have access to that. And I don't know how to measure how much it gives you, really. It might be that most of the benefit is on the,
Starting point is 00:12:25 what's the right adjective to describe Chatbot Arena? People are down on Chatbot Arena, but it might be that the human data helps boost retention time and general preference a lot, where most academics were doing multi-skill and AlpacaEval-type things, which are just not as crucial to everybody fighting in the attention economy. Real quick, I mean, since we're there, you mentioned sycophancy, you mentioned LMArena. That was one of your posts on Interconnects that I really enjoyed. Are they cooked? Is there a future for arenas?
Starting point is 00:12:58 Like, how does this play out? You know, they got $100 million now. Like, what are you going to do? I don't know what the money does for them, but I don't know. I think the eval is still valuable. Especially at the frontier, people are very cynical, but in the compression race, the question of what is the cheapest model you can have that does pretty well at this is still so useful to a lot of people. Like, chat is king.
Starting point is 00:13:19 Yeah, everyone chats with these things. It's why I use, GPT-4.5 isn't as good on Chatbot Arena. I think it's higher on Yupp, which is a new competitor in this. They have like a vibe category, which... Sorry, Yupp? Yeah, there's Yupp. There's something...
Starting point is 00:13:36 You can look it up. It's a competitor, another startup. They have like a... All these companies have categories. And one of their categories is vibes and GPT-4.5 is on the top. And I'm like, okay, there's some of these tracks. It's a frontier model.
Starting point is 00:13:48 Yeah. And it's just like... That stuff intangibly is very nice. The leaderboard is established. People still should use it. It's kind of a focusing function for the community across different batches from industry to academia. Yeah.
Starting point is 00:14:02 I'm not going to try to solve their monetization problems for them, but having clear norms and things that can be hill-climbed forever is very good. Like having this idea of an Elo ranking of models. You cannot saturate it. Yeah. You just can't. It's a great problem. Like what is? But you can game it.
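For readers unfamiliar with why an Elo-style ranking "cannot saturate", here is a short sketch of the textbook Elo update applied to a head-to-head model vote. It assumes the standard logistic formula with a 400-point scale and a K-factor of 32; the arena's actual rating procedure is not described in this episode and may differ.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if model A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated models; a user prefers model A's response.
    print(elo_update(1000.0, 1000.0, score_a=1.0))
```

Ratings are purely relative: a stronger model simply takes points from the others, so there is no fixed ceiling like 100% accuracy on a static benchmark. That is also why the signal can be gamed, since it measures whatever voters happen to prefer.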
Starting point is 00:14:18 So I think that's the issue. Yeah. But everyone evaluates on multiple. Like Cohere came out. Like Sara Hooker, I've never seen her so public about any of her work. She has great work, but she doesn't really go public like that. Yeah. Artificial Analysis also has one, which I think is kind of cool.
Starting point is 00:14:34 The other thing I think is relevant to this discussion is that a lot of the data actually is single test, like a single round. It's not multi-turn. And I wonder how to create proper multi-turn arenas, because you have to switch the models, as that's the whole premise of LMArena. It depends on how valuable the user data is. If the user data keeps being equally or more valuable than the inference, there's going to be a platform to keep pushing this into more and more expensive things. So they'd just set up a deep research, I mean, they're probably setting up a deep research arena
Starting point is 00:15:06 because that's the data that, I mean, if I was OpenAI working on deep research, that's the data that I want. And there are competitors. And LMSYS is the entity that has the marketplace to set it up. Right. I mean, it's almost like how I see Scale.
Starting point is 00:15:19 It's like Scale kept climbing the edge of what AI data processes are. And because they're the name brand, they keep climbing the incremental evaluation game, and a lot of them have longevity. Yeah. Yeah. That's a network effect in some ways. You mentioned Scale, which is another hot topic, but we'll put all the sort of hot takes at the end. I do want to, you know, focus, try to be technical up front. You're still writing the RLHF book.
Starting point is 00:15:45 Is it an RLVR book now? I can give my spiel on it. Ultimately, RLVR is not mature enough, nor is it as interesting of a book. So those are the two fronts of why I don't want to rebrand. And there's also some personal career strategy, but that should be independent of what is objectively a good book. Because RLVR is going to be changing so much in the next 18 months. We've already seen it. There are all these new algorithms, but I think there's a lot more under the hood on how you do the right pre-training for it and what the data is, how tool use emerges. All of this stuff is core to what RLVR will be seen as. I'm watching to see if o3 is like a niche model or becomes the path that
Starting point is 00:16:27 everybody needs to follow with its kind of different style of tool use that you see particularly with search. And we don't know how OpenAI did this. And these are the things that I think are kind of core to an RLVR book that we don't have, whereas RLHF is more interdisciplinary. In the same way that Chatbot Arena can never be saturated, RLHF can never be solved. And we kind of know these problems of alignment and over-optimization and what the pipelines for getting data that people are using are. And yes, I can add more RL algorithms to the book, which is nice for me to study, but that's not really changing what reward modeling is and the different ways that people implement these today, whether it's a value function or reward model and stuff
Starting point is 00:17:11 like this. So I think the breadth on RLHF is nice. And I think I would tell a lot of academics that I think RLHF problems are going to be foundational and kind of just have a much more steady study rate where we're on this massive spike of RLVR, but it might just be solved. And then it just goes back to zero academically. It's not, it's an embellishment, but there could just be a best practice for getting 100% accuracy on any problem that you want. And then it's solved to where the debate on what is a preference is going to go on forever. Yeah, because it's verifiable. There is a right answer. Yeah. Sorry, what do you mean by like over the next 18 months, there'll be a lot of changes? Like, what do you foresee? Actually, let's just catch up. What's already happened in like the sort of
Starting point is 00:17:55 recent history? Yeah, so there are two categories of information that we have, which is what are the models doing and what are the researchers doing? I think the models provide a lot of inspiration in terms of what the actual frontier is. And that's things like o3, Gemini 2.5, Claude. o3, I think, is the most scaled-up RL approach. And then Claude and Gemini 2.5 are very similar, with hybrid reasoning models that you can turn on and off. They rolled it out in different ways. So Gemini didn't have hybrid reasoning at launch, but they brought it in, and Claude had it at launch. One of the most important questions has got to be: is the o3 path of just a reasoning model more useful, or are hybrid reasoning models?
Starting point is 00:18:41 Do they diverge in their methods for training them? I think the NVIDIA Llama Nemotron reasoning paper is probably the most detailed paper on a hybrid reasoning thing. And then DeepSeek R1 is still the canonical recipe for a reasoning-only model. And those are very different approaches, and I don't know if one will win out or not. And then there's just a lot of work on the data side and RL methods.
Starting point is 00:19:07 I think there's a list. There's a whole list of kind of GRPO complaints that are out there where the math doesn't make sense for certain things. To me, every paper I see come out always has some fix to GRPO. It's kind of cool that people are, you know, taking variations on it. But also I don't know if DeepSeek's going to come up with R2 and just blow away everyone with whatever is next. Yeah, I definitely don't think the algorithm tends to be the
Starting point is 00:19:30 most important thing. I think I had this in my AI Engineer World's Fair talk, which is kind of a snarky take on how you train a reasoning model: you get a starting dataset, you incrementally improve the dataset, you do that until you're running out of time or your performance stops going up. And then you try all of these switches from all the papers, you do a whole bunch of binary tests of all these various algorithmic changes, and you do a grid search and see what works. Like, candidly, that's why I dismissed GRPO when it first came out, because it was sold as an efficiency thing.
Starting point is 00:19:57 Yeah. And I was like, okay, fine, but, you know, I've been trained to not care about efficiency because it's just a matter of resources. Yeah. The GRPO advantage estimate is very well suited to verifiable rewards. Right. But the other thing is kind of an intangible works-better-on-the-infrastructure type argument. And it came out for DeepSeekMath, which is well before the RLVR phase.
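A small sketch of the group-relative advantage estimate referred to above, in the form given in the DeepSeekMath paper: sample several completions for the same prompt, score each one, and normalize each reward by the group's mean and standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for this illustration.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled completion relative to its own group.

    rewards holds one scalar per completion sampled for the same prompt,
    for example 0/1 outcomes from a verifier.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion got the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four samples for one prompt; the verifier accepted the first and last.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # All samples correct: the group carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

With binary verifier outcomes the group itself acts as the baseline, so no learned value function is needed, which is one reading of why the estimate fits verifiable rewards well. It also shows a known limitation: prompts where every sample passes or every sample fails contribute zero advantage.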
Starting point is 00:20:19 So it was really marketed as that. When you talk about hybrid models, how do you reconcile that with OpenAI saying they want to move away from the model selector to just have a unified interface? Do you feel like they feel pressure, like, hey, look, when I have all these different classes, we want to route them to the right thing? Or do you think there's something else? I would think that OpenAI wants to have a model that knows how hard the problem is. I think that has to be the North Star for most people working on reasoning, which is that the model will just spend the right amount of tokens on it. And if you look at a compute-level discussion, see what inference-time scaling means. I think in plenty of ways, hybrid reasoners might just be aged out, except for niche applications, because quality is so much more important than having 100x fewer inference tokens.
Starting point is 00:21:12 It's like you just pay for it in compute and that'll get better. I think that was something Jensen said in his most recent, I think Stratechery highlighted it or had the interview with him. And it was like, yeah, everything's going to be a reasoning model because it's going to get so cheap and they're better. And that's why the hybrid reasoning thing is a little bit weird. I always just turn reasoning on unless it's a really silly query like, oh, what is this thing? So it's like, okay. In two years, that kind of tracks.
Starting point is 00:21:42 which I think o3 is also just burning money on us. I mean, it searches 80 websites for me asking what paper it is. Like, that's a lot of tokens. But it seems directionally, like, if that's the thing that works, that'll be the default. Yeah. At least in most of the things that people that we talk to do, whether it's coding or very high-end information economy work, the value is there. I wanted to double-click on something that you seem to be coming back to a lot.
Starting point is 00:22:11 You seem to assert that o3 does something very different by using search a lot, much more than basically everyone else. Do all models come with a search engine now? Is that like a must-have? It depends on your use case. If you're doing general information retrieval or understanding, yeah. There are old papers that we can try to find the links for. I don't know if Sam Altman was talking about it,
Starting point is 00:22:37 but there's just this RETRO paper from DeepMind and other architectures that people have been pulling into the discussion again, which is like you have a very small model with a very big context length and a very big retrieval store, which I'm not one to bet against the transformer architecture and just figuring out long context and stuff like this, but those are ideas that people are bringing back, which is search is better.
Starting point is 00:22:59 You look at all the evals from reasoning models, and one of the trends is that SimpleQA numbers all drop. From DeepSeek R1 to the new R1, it goes down. From Qwen 2.5 to Qwen 3, SimpleQA goes down, at least when you're evaluating these without tools. And SimpleQA is what is considered to be a very nice, fairly numerically robust, long-tail knowledge evaluation. And for all of these, the raw models, they're all going down.
Starting point is 00:23:26 Right. For long-tail information, just having this search behavior makes a lot more sense. Okay. The kind of argument for this, I have been through this journey, too, of like, oh, why don't you make a model that doesn't know anything but search, right? You can search up anything that you want and learn just in time. But the problem is you need to know what the search terms are. You need some baseline intelligence to make all this work.
Starting point is 00:23:47 Yeah, that makes sense. That's a good way to put it. I think it's important because there's this thesis of LLMs becoming just online LLMs, like permanently. And it hasn't been super pursued. Perplexity was one of the first to put it on my radar, as they were like, we'll attach the search engine to the LLM and that's what you get now. And I think more and more people are starting to offer it as part of their default services, like Gemini has a search grounding thing as well.
Starting point is 00:24:12 I mean, it's what people say a big limitation of Anthropic is, because it uses Brave Search, which returns a bunch more like SEO slop than... Is that proven? Because I don't know. I thought they had their own index. Okay, so I don't have it. I haven't done a detailed look, so I'm dealing with rumors. But I think they'll all end up doing their own index. And it's one of these things where Google should have an advantage again. But who knows if they do? I also hinted at this in my post, but Hamish had tried to set up the same setup from RLVR, playing with search and an RL model. And it's very easy to get the
Starting point is 00:24:44 model to do tools if you prompt it to, but it's very hard to get the RL model to learn that the tool is useful. And that's why it's interesting to go through these things where it's like 80 failed tool uses and it still gets it, or it stops, or gets it on the 81st. Okay. It's just an RL behavior that feels emergent from having a very nice way of getting the model to learn to use the tool, and it's not like you can't SFT this model to do this. It just really feels like they set up the environment right and it plugs into this deep research kind of line of work that they did and they broke down the problem into these sub-RL tasks
Starting point is 00:25:22 and then it kind of lets it do this thing. Interesting. I don't want to be an OpenAI show all the time, but I tell people to play with o3 all the time because it's weird. It's excellent. I would say, the amount of work you're imputing to the Deep Research team, when as far as I know it's three people who did it. It was Isa and like the two
Starting point is 00:25:43 other collaborators that she had. I don't know if they did that much on top of o3. Like every indication I've had from OpenAI is that Deep Research is more or less a thin wrapper over o3. Yeah, it's probably like one or two small things where they're like, oh, we can make Deep Research work by adding this small amount of data to the training thing. And then it just works. That would be how I describe it. I mean, what is it, Gwern, the anonymous person, he replied to my Q-Star post on Twitter the other day, and he was like, why was this all wrong? And it's obviously simple things don't scale. There's a lot of complexity, because there's a lot of other exciting things in the AI field at the time,
Starting point is 00:26:23 and OpenAI kind of sends out a lot of things that confuse people. But this would fit into that, which is Deep Research is a minor change from an existing RL trajectory of what was, like, o3. Probably they had already figured out that search was going to be better, and then were like, okay, we can repackage this. And it's a simple thing that makes a big difference. And most of the things are like that once you have traction. I think trying to get the initial takeoff on the sigmoid is the hard Q-Star thing. But then once it's like this, a lot of things in the middle feel obvious,
Starting point is 00:26:56 which is how I have described one of the things that we work on for OLMo. A lot of it is just having motivation to do things that feel somewhat obvious, but they're still hard. It's hard to get different recipes or it's hard to get a full reasoning recipe off the ground. It's just a huge change because you have all this inertia on this eval suite. And then you have to figure out if you branch your recipe or do you start from, like, do we just take Open-Reasoner-Zero and start from scratch, which is a whole other headache of things. It's just hard to move these projects that are anywhere above five to 10 people
Starting point is 00:27:28 with inertia to get stuff done. But then once you're hill climbing, things can seem really obvious. Yeah. Okay, you covered a lot there. Before my next question, to close the Brave thing. Our friend Simon Willison wrote a post that Anthropic added Brave Search as one of the subprocessors in their product. So that's where the thing came from. Now, to what extent it gets used?
Starting point is 00:27:49 We don't know. We don't know. I would just kind of comment on a couple of things that he said and then we'll go on to your question. On the retrospective of Q-Star, there's a very good post that you had that I want to send people to, which is, was o1 a psyop?
Starting point is 00:28:05 Right? That does imply the question of, if o1 was a psyop, what else could be psyops now? Yeah. There's definitely psyops out there. I mean, the whole inference time scaling plot is such a psyop. Why? You put these two things next to each other with an x-axis, and it just looks like it's
Starting point is 00:28:21 easy to control. Whenever you see an x-axis, you think it's easy to control it. Whereas the left one was training. Yes. And training makes a lot of sense. So even if you go to really old RL papers, RL learning curves are on a non-log x-axis usually, and they look like this.
Starting point is 00:28:38 They look like these, like, whatever, like, logarithm or exponential rise. And then if you take one of these and you make it a log x-axis, it's a straight line. So, like, that side is like, oh, okay, we've seen this before with RL. But with inference time scaling, it being an x-axis is why people are like, oh, there's a knob. I can turn search up a lot. Yeah. Which is like what breeds all these weird ideas. The core of that article is just they're taking points from within training or there's a natural variance and then you line them up.
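A small sketch of the log-axis point above, assuming a hypothetical learning curve of the form reward = A + B * ln(step). The constants are made up for illustration and are not fitted to any real run. On a linear x-axis such a curve rises fast and then flattens; on a log x-axis it is a straight line, which is the same as saying that every 10x in steps adds the same amount of reward.

```python
import math

# Hypothetical learning curve: reward grows with the logarithm of training steps.
# A and B are illustrative constants, not taken from any real training run.
A, B = 0.10, 0.08


def reward(step: int) -> float:
    """Toy reward as a function of the number of training steps."""
    return A + B * math.log(step)


# Sample at geometrically spaced steps (each one 10x the previous).
steps = [10 ** k for k in range(1, 6)]
rewards = [reward(s) for s in steps]

# On a log x-axis these points are evenly spaced horizontally, so a straight
# line means the gain between neighbouring points is constant.
gains = [later - earlier for earlier, later in zip(rewards, rewards[1:])]

for s, r in zip(steps, rewards):
    print(f"step={s:>6}  reward={r:.3f}")
print("gain per 10x in steps:", [round(g, 4) for g in gains])
```

The same arithmetic applies to an inference-compute axis, which is part of why the two plots look alike side by side even though only one of them is obviously a knob.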
Starting point is 00:29:05 And if you line them up, then you get this nice inference time scaling behavior, which is, and now people, a lot of people have reproduced this plot on inference time scaling. And it's much clearer now. But at the time, it's like, I see why I thought it was a knob. It's like, oh, look, they called it inference time scaling. You control it. I think the most interesting, well, you have a lot of interesting things in your blogs. But one that stood out was about RL and tool use.
Starting point is 00:29:29 You said that it's easy in an RL experiment to tell the model to try searching, but then if it doesn't get results with the tool, it's going to stop using the tool very rapidly. Can we unpack that? So can there be a good tool that the model doesn't know how to use and then it kind of fails and then it stops using it? Can there be a bad tool that should be improved before giving up on it? How should people think about designing the tool,
Starting point is 00:29:54 improving the model and kind of like where to intervene? This is definitely on the newer side of things that I want to work on or have worked on. I think particularly in 2026, especially on the open side, all the infrastructure and models will catch up a lot, where I want to go deeper on this in terms of deep research style things or very inference-heavy multiple calls. And to answer your question, there definitely can be bad tools and there definitely can be the model just using them wrong. And something that I would want to see in a model is kind of not necessarily creativity, but like an openness that it doesn't know exactly what it will get out of all of its tools and this uncertainty to just try a few
Starting point is 00:30:33 different things, which almost seems classical RL behavior. But if you think about what a language model does, they're always very confident. They're not necessarily confident, but they have like a path and like a direction in their answer, whereas that's a big change in these reasoning tokens is to have the notion of backtracking and things like that, which is some sort of like openness to the tools having things that are unknown in it. It seems like a really nice thing for the model to have, which is like, oh, what if I try this? Like, what does it get? Especially on the open model side, which is if this is going to work where people want to use open models with tools,
Starting point is 00:31:09 it's going to be because people have private data stores and stuff. So if you were to train an open model that is going to be a good reasoner like o3, but on private records of some sort that will never get sent to the cloud, it needs to be thinking, like, I can try some things with this to get a sense for it before saying I have to give up. And if you look at tool use right now, it seems much more similar to code execution, or it's just a part of a sequential path that you need to get to, which is like, I have a plan, and if it fails at a certain step, I might have a backup.
Starting point is 00:31:42 But it's not like this iterative thing of, I need to fiddle with the environment in order to come up with my plan. It's just something that people probably are going to have to train into these models, which is like you might just tell it, you don't know what is in this, but your answer might be in it, which is like a very odd prompt. Maybe it'll help. Yeah. When we had Erik Schluntz from Anthropic, who worked on the Claude agent before Claude Code, he mentioned they spent basically the majority of the time on the tool design to give to the model.
Starting point is 00:32:11 And then you just kind of learn how to do it. Well, I don't know how much you actually worked on this stuff, but are you putting the tools one by one in the RL process? Do you think that helps, or is it better to give all the tools and let the model explore? I don't really know. Like, we haven't gotten this to work. I would say it would probably depend on the model and your starting point.
Starting point is 00:32:31 If your starting point is already good at tools, it can probably generalize more. But if you're doing this weird base model RL and you have to have this kind of curriculum, like, if you scale RL long enough, you're going to need a curriculum of things getting harder. And like, that's pretty obvious. So in that case, it might be that tools get added when things become too hard for it to solve certain questions, which sounds very intuitive, but, you know, is also just really hard to manage in practice, because what is your automated signal in your training run that it is time to do that? That's why video games are so good because they're designed to unlock
Starting point is 00:33:05 things as you progress, but I think with things like search, it's like, you know, if you're given access to a small data store or you're given access to all knowledge on the internet. It's good feedback for the ARC-AGI people for the V3 benchmark, which is like, have things where the language model needs to learn to use new actuators in the world after a certain threshold. That would be ARC-AGI 4 then. Yeah, probably.
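One way to picture the "tools get added when things become too hard" idea is the sketch below. Everything in it is an assumption for illustration: the `ToolCurriculum` class, the rolling success-rate signal, and the threshold values are hypothetical, and the speakers say explicitly that they have not gotten this to work and that choosing the automated signal is the hard part.

```python
from collections import deque
from typing import Deque, List


class ToolCurriculum:
    """Hypothetical scheduler: unlock the next tool when recent success is low."""

    def __init__(self, tools_in_order: List[str], window: int = 100,
                 unlock_below: float = 0.2) -> None:
        self.locked: List[str] = list(tools_in_order)   # tools not yet offered
        self.available: List[str] = []                  # tools the policy may call
        self.recent: Deque[bool] = deque(maxlen=window)  # rolling episode outcomes
        self.unlock_below = unlock_below                # assumed "too hard" threshold

    def record(self, solved: bool) -> List[str]:
        """Log one episode outcome and return the currently available tools."""
        self.recent.append(solved)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and self.locked:
            success_rate = sum(self.recent) / len(self.recent)
            if success_rate < self.unlock_below:
                # Tasks look too hard with the current tool set: unlock one more.
                self.available.append(self.locked.pop(0))
                self.recent.clear()  # re-measure with the new tool set
        return list(self.available)


# Usage: a short window so the effect is visible quickly.
curriculum = ToolCurriculum(["search", "code_exec"], window=4, unlock_below=0.5)
for outcome in [False, False, True, False]:
    tools = curriculum.record(outcome)
print(tools)
```

A rolling success rate is only one candidate signal; it cannot tell "the task needs a new tool" apart from "the model is using the existing tools badly", which is the distinction raised earlier in the conversation.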
Starting point is 00:33:31 I don't know. They're cranking them out. They're cranking them out. They're actually doing a launch party, I think, like in a couple weeks. So I'm actually really, like, it's fun to play Arc AGI. I don't know if you tried. Oh, I haven't.
Starting point is 00:33:42 It's pretty fun. Like, these are IQ tests. I used to be like, oh, they weren't that relevant. But, like, actually, now that we have a gradient where, like, LLMs are actually significantly climbing them,
Starting point is 00:33:54 now it's actually really more interesting to compare your own intelligence to the LLMs. I'm with Noam on no harnesses. No harnesses, yeah. Yeah, I mean, harnesses are cool, but they're a handicap that's changing the learning dynamic substantially. So it's good demos, but I feel like the core thrust has to be no harnesses. I mean, is it wrong to say that these are just inductive biases, right? Like, they're not in the model, sure. But like anything where you're just like looking at the results contaminates.
Starting point is 00:34:26 This is just a different task. I think I talked with Greg about this at ARC-AGI, and I told him, like, do harness and no harness. You just have both as different categories. It's like you're trying to be transparent and build targets for frontier labs. Just do both. Like I don't think it dilutes that much. The no harness is going to obviously be harder.
Starting point is 00:34:46 And then you just get more bang for your buck on your benchmark. Yeah. It's the same dataset. Staying on the topic of tools while we're at it, you had a really good summary of recent work in multi-tool RL, which had like LOOP and ReTool and ToRL and all these other things. And I think that this is just an area that's super rich for research right now. I just wanted to give you the space to highlight what are your favorites?
Starting point is 00:35:11 What do you think that people should explore? I could share my moderate ambition for what would be fun research project things, which is you want to create some sort of competitive dynamic or eval, and it has to be so much narrower than what industry is doing. So I told you this at lunch, which is like deep research, but only arXiv papers. So you don't have to do a full index. You have a limited domain. You have to figure out how to measure it or something. I think it's good for academics to work on academic tools because they have very high domain expertise. They already know what's going on. And just like figure out how to make that something that is either very useful to users, if it's going to be good enough for that, or something
Starting point is 00:35:48 you can hill-climb on. And I don't know, I'm brainstorming on the fly, but like take related works out of papers, just look at the text and break all the links and make an eval which is filling in hundreds of related works with arXiv links. Like that's a fun deep research style idea. See if you can do it with open models on a set data store with tools. AI2 has gone through a lot of discussions on this, which is if you're trying to have impact in AI right now, as an academic, you have to level up out of papers to artifacts, which is models, datasets, evals.
Starting point is 00:36:24 Datasets and evals are easier for people to have impact on. And then the next thing is, like, what do people actually use? At AI2, especially there's this Semantic Scholar team that's now working on, like, information agents of different types. That's another thing that I'm distant from, so I don't have all the names. But it's, can we make open models do that sort of thing better? It's like, can you make something that people actually care about? And then
Starting point is 00:36:45 that's a whole level of impact that's much higher if you have actual users. It's hard for academics and small institutions to do that. But if you're working on agents, like, dogfooding is viable. It's like, can we make ourselves a good Slack summary bot that we like or something? And just making these agents really tractable. I mean, that's one direction. Another direction is just hill-climb on Humanity's Last Exam with tools. I just think it's kind of unlikely that we're going to win as an academic on a state-of-the-art number, because they're going to start spending millions of tokens per query. It's a lot of compute, and beating that on a FLOP-equivalent basis is going to be so hard.
Starting point is 00:37:29 Unstructured thoughts, it's something where I'm mostly like, okay, I'll get to this. Like I have more things to figure out on the modeling and what I call the skills level, which is just how do you do reasoning to induce inference time scaling and get high eval numbers. And once you know you can do that, you can take your knowledge with you to do it in more specific domains. There's skill and there's skill acquisition, right?
Starting point is 00:37:51 I think the ARC-AGI definition of AGI, I quoted it. It's like efficient, yeah, skill acquisition efficiency, because it's described as three words. Right, yeah. Your emphasis on skills in your recent talks that you've done, do you want to sort of reiterate that thesis for people to pick up on?
Starting point is 00:38:09 Yeah, so mostly I'm trying to get ahead of what OpenAI, etc., are doing probably now, if it's not in their models. And with all the agents, it seems that planning is a very critical task. So it's kind of how do you come up with the taxonomy for different types of things you need to train into reasoning models for when it'll be a bottleneck. So I came up with four. And the foundational one was skills, which is what I would say that we have already done with o1 and R1, which is you do a lot of RL, you show that inference time scaling works and you get really high benchmark numbers. And then the next three are kind of what comes next.
Starting point is 00:38:49 And most of them are around planning. So what I had as three and four of my list were abstraction and strategy, which is trying to not use planning because planning is a word that people already use a lot. The strategy would be the direction the model should go in and technically what are the steps of its plan. And then abstraction is how does it break it down into things it can actually solve? And then the fourth and last thing is calibration, which is just not wasting compute and knowing when to give up and ask the user things. Because overthinking is obviously a problem. It's easy to keep getting your eval scores to go higher by using more inference time scaling.
Starting point is 00:39:24 But eventually that's not what people want in their models. They want a smarter training regime where the model is actually getting proportionately better for its training. And there's a lot of papers on overthinking and stuff like this, which I think, like, OpenAI wants because they have to foot the GPU bill. Like, if o3 just infinite loops itself for a bunch of people, like, that's not good. Does it actually?
Starting point is 00:39:49 I don't know, but it might. Okay. I mean, like, these reasoning methods definitely can make the models just kind of unstable and just, yeah. So it's like, but it's also the GPT-5 idea, which is how do you get a model that just routes the question to the right?
Starting point is 00:40:04 Maybe not necessarily a router, but it just knows if it needs to do a plan or if it can just answer. If you look at DeepSeek R1 and you ask it a hard math question, it's not like, here's my plan of attack, it just starts. And having a model that knows when to be like, okay, here's my plan of attack, I might need to make myself a memory store. I might need to take like a Claude Code approach for this query. I'm going to build a memory store and spin up some parallel searches and then come back.
Starting point is 00:40:32 Conceivably, this is all something you can train into a model because the searches or the parallel models could be like tools in that case. The simple way to describe it is we have something like thinking tokens and then answer tokens, and the model should be able to optionally have plan tokens before thinking or before using tools. It's like, okay, here are the table stakes. I need to do these things and these sorts of tasks will be harder versus easier. It seems more tractable than some far out ideas for AI. It's like a language model can write a good plan.
Starting point is 00:41:05 It just needs to be asked to do so, and I would bet that Claude Code and Deep Research are doing this. Like, you get a user prompt. And first, the model is like, yeah, there's a plan tool in Claude Code. And first, they break it down. And that is something they've trained into the models. Like, I don't think DeepSeek has it built in, but it probably could do it. And just thinking about that interface between, like, if the model needs to be able to do the task end to end on its own, you know, can it do that sort of thing.
Starting point is 00:41:36 I think that my challenge with this whole thing, reconciling this approach with the no harnesses thing, is that I think a lot of the way that people, especially engineers, want to model it, is that the plans and the memories are tools. And there are no special plan tokens. There are no special memory tokens. It's just context. Or it's just, you know, whatever. Specifically for planning, because then you can do fan out to other agents for tool calls and stuff. So it doesn't have to be sequential. But I'm just like, is this a fork in the road? Do we have to make a real choice here as to do we outsource things to tools, or do we keep it native within the model's tokens? I don't think it's a subjective difference.
Starting point is 00:42:20 I think mostly the planning idea is to make the point that people don't get things for free. And the planning improvements might be kind of mundane, which is like we were prompting Claude and its plans were bad in this way. Let's give some data where its plans are more detailed or break things down into more steps so that it's easier for then to do it. Yeah. Because it's in a black box effectively. So if it hasn't been targeted, it's unclear of what the performance will be. Or on the like open model side, it might just be the idea of having different models for different parts of it.
Starting point is 00:42:53 Then you're really training a model to just be good at planning. And that's data that you need to come up with. I mean, you only use that model for that one part of it. Does it feel like plans are much more reusable and should maybe not be generated every time? I feel like especially in coding, for certain sets of tasks, you want to have similar types of plans. So maybe it's not the right way to ask the model to regenerate a plan every time. There should almost be like plan blueprints as tools and then the model fills it in. Like where do you think the balance should be? I think they're reasonable. A plan is obviously an
Starting point is 00:43:28 intermediate goal. It just seems likely that there are failures on this kind of planning level. I mean, the same thing goes for these rubrics that are popular, where a lot of the technique that is popular for so-called rubric things is you have a prompt, and you have a language model generate a rubric for that prompt, which is a few specific things that it needs to get right. And that's conceptually very similar to making a plan for every task. I think with grading, you're going to have a different type of abstraction than executing.
Starting point is 00:44:01 But I think what people are seeing is that it's cheaper relative to the effectiveness. to just generate it. So, like, plans are not super long, and they probably, they're not that many tokens. So it's probably just kind of like, okay, we do this. Like, putting it in my taxonomy
Starting point is 00:44:16 might be overselling it where it just needs to be a prompt and you just need to make sure that your model's not too weird at that prompting stage. I think your taxonomy is super useful, by the way. So skills calibration strategy, abstraction. I feel like maybe abstraction
Starting point is 00:44:30 might be the most underrated one or hardest to solve. The way that you introduced it was different than how you wrote in your blog post. You said it was basically not to overthink. That's calibration. Yeah. Abstraction is about breaking things down.
Starting point is 00:44:44 Yeah, I think both of these strategy and abstraction make the most sense on the hardest tasks that we don't know if the model can do them. Right. So if you're assigning a task to a model that you don't know if it can implement it, the strategy is very important because it needs to be very specific and narrow.
Starting point is 00:44:58 Whereas if it's doing mundane code, like deep research, the plan is actually not that interesting of a thing. But when you're at the frontier, of if it can, I don't know, some GPU implementing thing. You could buy into the OpenAI and Anthropic narrative, which is help me implement this research idea in our complex, like, multi-distributed GPU thing.
Starting point is 00:45:17 My God. It's like, this is a task that's hard for a human. And for an AI to come up with the right plan to debug and do this is very narrow path. So therefore, the strategy is pretty important of does it start with certain tests and how does it actually build this out to complexity? it's obvious that I need to come up with more better examples for this, but I think as you push it, it's more natural to see that there's only a few plans that actually get it done. And then abstraction is just important as your task becomes so big.
Starting point is 00:45:46 It's like a prompt engineering thing almost. Yeah, and it's like you only have 100K tokens you can generate. Like you need to make sure the model breaks it down. So it's not just spawning a ton of infinite processes under itself, which I do agree that abstraction is an interesting one, especially when you start to think about these models that could call in other models to do sub-task for it, or parts that can be parallelized with multiple searches or just more compute. I think that kind of folds into abstraction, which is just like, how do you approach a certain nugget of the problem? And I definitely say, like, I don't have experience building this.
Starting point is 00:46:21 It just feels like if you're going to visualize AI doing the hardest software or other tasks, it's something that humans are very good about. So it's like, how do you come up with the research? plan in 10 weeks. Like there's a lot of how do you prioritize which experiments to do? There's a lot of inductive biases that go into that. Like a language model would not do well at that right now. Probably memory would be helpful there.
Starting point is 00:46:48 So you can just, like, the way we do this in real life is we accumulate experience. One thing I did want to dive in on was just parallelism in general. There's one case where, with o1 and sort of the Q-Star ideas, it was sort of overhyped in some sense. But now it's coming back with o1 Pro and Deep Think. The theory is at least, you correct me if I'm wrong, basically they run o1 8 times and then you have a reward model rate it, and then give you the best of the eight. Yeah? Something like that. Something like that. Deep Think also the same. We don't know any details beyond that. I think there's a lot of people exploring that, at least on the inference provider's side,
Starting point is 00:47:28 of like, you know, how do we parallelize search and planning and all that. And I'm worried about getting too hyped about it. I think it makes a lot of logical sense. And this is one of those things where MCTS also made a lot of logical sense. And we were fooled. Well, I don't think we're using parallel compute in a way to search over like low probability tokens. We're using it to get robustness. Like, O1 Pro is, it was so nice because it just had a very predictable depth to it, even on niche topic.
Starting point is 00:47:58 where like sometimes models just fail out. Yeah, you have some numbers that they went to like from like 10 to like 95% or something. I don't remember the exact numbers, but that's what it feels like it doesn't feel like you turn on O3 Pro to make it 10 times more likely to find some niche piece of information. Like maybe it'll be a bit more likely, but we're not getting that type of like searchy notion of getting more breadth or depth into a tree.
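The best-of-eight setup speculated about above can be sketched in a few lines. This is a generic best-of-N selection loop, not a description of how o1 Pro or Deep Think actually work (the speakers note the details are unknown). The `generate` and `score` callables, and the toy stand-ins for them, are hypothetical placeholders for a sampling model and a reward model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[str, List[Tuple[float, str]]]:
    """Sample n candidates and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    _, best = max(scored, key=lambda pair: pair[0])
    return best, scored


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_generate(prompt: str, rng: random.Random) -> str:
    return f"answer-{rng.randint(0, 99)}"


def toy_score(prompt: str, candidate: str) -> float:
    # Pretend reward model: prefers candidates with a larger number.
    return float(candidate.split("-")[1])


best, scored = best_of_n("What is RLVR?", toy_generate, toy_score, n=8)
print("picked:", best)
print("all scores:", sorted(s for s, _ in scored))
```

Seen this way, the extra compute buys a more reliable pick among typical samples (the robustness described above) rather than a deeper search into low-probability tokens.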
Starting point is 00:48:24 So I think there's value to it where we want to use this parallel on the whatever, either like the most important tokens that we're generating or like, okay, I know this part is crucial. Let's just spend a bit more so that those tokens are better. But it's not a transformative thing. The part that's potentially interesting on the transformative side is like if you can get much better verifiers. So I think of verifiers of changing the slope of inference time scaling. You spend more tokens at inference the better verifier you have. If you're doing parallel, it can extract a rare occurrence. So like right now, if our verifiers are only, like, they're good at, like, human preference, it's like, okay, we don't need to, we don't need to crank that up very much. But if we are doing
Starting point is 00:49:07 really diverse generations and your verifier is better, it'll get better, it'll do better. I think you can look at the extreme between a reward model and an Oracle, where it's like, the Oracle is, the more you search, eventually it works. So the slope is, is good. But a reward model is like, there's really a capped signal out of it, at least if you're doing, this preference type of thing. So the slope is pretty minor and it kind of has diminishing returns. So I do think that if you could fill that with more interesting verifiers, there's potentially more to get out of parallel compute, but I don't think it is like as transformative right now on my outlook. It's more like parallel agents makes more sense of like if you could break down
Starting point is 00:49:49 abstraction nice, like as a throughput engine if our tasks are taking a long time rather than a like at peak performance engine. Okay. Which are the kind of things. with the whole agent versus model thing, where agents are much more about, like, getting it done at all, like being robust and being fast, where, like, this model is one generation. It's like, can you get the answer right? Yeah.
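A toy simulation of the verifier point made above, under loudly stated assumptions: each sample is independently correct with probability `p_correct`, and the verifier's score is the true correctness plus Gaussian noise. With `noise_std=0.0` the verifier behaves like an oracle; a large `noise_std` stands in for a weak, preference-style reward model. None of these numbers come from the episode.

```python
import random


def best_of_k_accuracy(p_correct: float, k: int, noise_std: float,
                       trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials where the top-scored of k samples is actually correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each sample is independently correct with probability p_correct.
        samples = [rng.random() < p_correct for _ in range(k)]
        # Verifier score = true correctness (0 or 1) plus Gaussian noise.
        scores = [float(ok) + rng.gauss(0.0, noise_std) for ok in samples]
        picked = max(range(k), key=lambda i: scores[i])
        hits += samples[picked]
    return hits / trials


for k in (1, 4, 16, 64):
    oracle = best_of_k_accuracy(0.1, k, noise_std=0.0)
    noisy = best_of_k_accuracy(0.1, k, noise_std=3.0)
    print(f"k={k:>2}  oracle={oracle:.3f}  noisy_verifier={noisy:.3f}")
```

In this model the oracle curve follows 1 - (1 - p)^k, so more samples keep paying off, while the noisy verifier's curve rises much more slowly, which matches the "slope" framing: the better the verifier, the more there is to gain from parallel compute.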
Starting point is 00:50:09 I will spend a little bit more time on this and I'm happy to move on. My pushback or counter to this is that it's a way to pull forward a hypothetical future model that you can then distill from. Yeah. Which is nice. Well, I bet people, I mean, they surely will use these for synthetic data. It's just like the marginal gain on synthetic data is very high. Or just like
Starting point is 00:50:30 Amanda Askell will say, like, better prompting will effectively make it seem like you have the next generation model. Or like most people don't put effort into their prompts. Or she had said something along those lines in one of her Anthropic interviews. It's just like if you can really figure out how to kind of get into the
Starting point is 00:50:46 certain states of the model. Yeah. Well, anyway, that's my pitch for like why this is worth doing at all. And like, you know, I have a science fiction story that I want to write about quantum models in a world where like you could explore cheaply multiple universes, then like, you know, sort of pull forward the right one. That would work. This sounds too science fiction-y, but I feel like in a world where we could
Starting point is 00:51:07 control quantum computing well enough to explore this and scale it up enough, it could be kind of cool. It also could be that parallel compute is grounds for interesting types of innovation. Like I don't know, like, what does it mean to have parallel compute with diffusion language models that generate all their tokens at once? Like, does that meaningfully change something? I don't really know. I think the diffusion language model would be fun if it works. So you can have much more control over inference time scaling. I mean, like, Gemini has one. It's, like, hard to suss out what it changes. But once we have all these knobs, I'm hopeful that it helps build some interesting types of innovation, because, like, the parallel stuff is new. Architectures can change, we'll see. I've been using the Codex best-of-N thing. And I feel like most of the generations,
Starting point is 00:51:58 are like, you know, 5% different from each other. Because you use Ruby. No, no, no, I had a JavaScript one. I have a JavaScript one, so it should be good at that. I don't know if it's like just how the RL in coding works. One thing that I've noticed, these models always want to do if statements when there's like a missing env variable
Starting point is 00:52:18 so that it doesn't fail when it runs. And I feel like that to me, that's just like a symptom of the RL. Yeah. The code is terrible. Like, you should not write code like that. It shouldn't silently fail. If there's a missing variable, it should just raise an error. But I feel like the RL is like pushing the code in this direction.
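To make the complaint above concrete, here is a minimal Python sketch of the two patterns being contrasted: a guard that hides a missing environment variable, and a fail-fast version that raises. The variable name `API_URL` and both function names are hypothetical, chosen only for illustration; the episode discusses JavaScript output, and Python is used here purely as a neutral example.

```python
import os


def load_api_url_silently(env=os.environ):
    # The pattern described above: wrap the lookup in an `if` so nothing
    # fails at runtime. A missing variable is hidden and the caller gets None.
    if "API_URL" in env:  # "API_URL" is a hypothetical variable name
        return env["API_URL"]
    return None


def load_api_url_strictly(env=os.environ):
    # Fail-fast alternative: a missing variable raises immediately,
    # so the misconfiguration is visible where it happens.
    try:
        return env["API_URL"]
    except KeyError:
        raise RuntimeError("API_URL is not set") from None
```

With an empty mapping, the first function returns `None` and lets the program continue, while the second raises `RuntimeError` at the point of the lookup.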
Starting point is 00:52:34 And then all the generations have the same pattern. You know, I generate 14. All of them use the if statement just in different pieces. Yeah, that was something I will definitely get over. That's just like the labs are trading off massive gains in performance for small detriments in usability. And it's like, do you ship that model? Yeah.
Starting point is 00:52:54 Like you just ship it and deal with it later. But I'm sure they could fix it. I'm sure that's a fixable thing. I think to me that's the question is like, you know, you talk about how you have gains in like pieces of the thing, but not in the full trajectory sometimes. Do you feel like these are examples of that? Or do you feel like as we get better,
Starting point is 00:53:12 if we did a longer trajectory where instead of just writing this piece of code, you have to think about how you're going to maintain it later and like how it's going to run, that's going to fix it? Or it's hard for me to grasp. Yeah, the software stuff is not easy because it's almost like, maintainability almost feels like a human preference type issue again, where somebody could look at it, it'd be like, yeah, that's not as good. But adding the heuristic in training seems very messy. Yeah. So maybe it is. I don't know. There's a lot more to dig into that, I mean,
Starting point is 00:53:44 this is what Anthropic says they're doing and just what are the actual frontiers in making, like they say they're working on code only and what does that actually mean? A bunch of it is going to be design tradeoffs and how much autonomy the model has versus these potential side effects from training longer that we don't know how to get rid of. I mean, that definitely could be the
Starting point is 00:54:07 sort of a behavior like that is what I would say is like a simple thing to remove where it might just be obsessed with some code format that fails when you revisit it or something, even if it's like everyone has seen it with just bypassing test cases. I think they'll be a bit more
Starting point is 00:54:23 nuanced than that, but they could probably be super simple. This topic has a similar semantic content address for me as over-optimization, which is something that you've written about. It is over-optimization with a different reward function. I know. Okay. Well, I made that link. I want to verify that we are thinking on the same wavelength. I just wanted to go over, again, specific topics on things that you've spent some time thinking about. You write that there are three types of over-optimization. First was RL for control. Second was RLHF and third is RLVR. They always happen. Obviously, RL is no stranger to reward hacking.
Starting point is 00:55:00 But maybe, do you want to elaborate on how things are evolving in terms of how we're learning as an industry? Yeah. So that three things breakdown is for people to put the pieces together for what has happened historically. All of these over-optimizations are just that the optimizer is strong enough where it can manipulate the agent with respect to the environment or manipulate the environment in a way that's useful to its target signal. Also, like, for context, I think with what we're doing with language models in RL in general, is that if there's something that can move its reward signal up, it'll move the easiest thing, the most direct things to
Starting point is 00:55:40 move that signal up. So that's part of the story that I said on sycophancy, which is this reward model for user feedback was probably so obvious that humans just like to like stuff that is, like, people press that thumbs up. Long, and when they're... Folling filled, bullet points. Yeah, like, all those things have just been really easy
Starting point is 00:55:57 for the model to extract. So, like, once they added it, the model changed a lot, and the score went up a lot, and it was easy for the RL to find that. In control, the oldest RL, the environment is normally a simulator that is fixed.
Starting point is 00:56:09 There's no feedback. So the over-optimization looks like unphysical and nonsensical behaviors. There's the motorboat example going in circles. There's, like, an example is a project that I was a middle author on
Starting point is 00:56:20 was effectively over-optimizing, like, HalfCheetah, which is this MuJoCo thing. Instead of running, it did cartwheels off into the sunset and got like infinite numbers. It's like obviously not the intended purpose. It looks like a glitch. So it's just kind of manipulating the agent interface with the environment. RLHF is kind of a classic case where the model will just break down because the reward model is imperfect. So the environment is really imperfect in the RLHF case where...
Starting point is 00:56:47 It's so sparse. It's like very artificial. Yeah, it's a very artificial environment. So it makes sense that these actions, which are generated tokens, will do things, like reduce into just repeating one token over again. It'll be like, I think one of the early examples we had playing with this at Hugging Face was the model would just say JavaScript. It would be JavaScript, JavaScript, JavaScript, JavaScript. It was like some toy dataset. And it's very obvious when you see it. It's probably harder to see when you're at the top
Starting point is 00:57:10 and making decisions on when to stop training if you're doing a lot of RLHF. But that was kind of the phase that people have gone through. And now we're in the RLVR phase, which is, we're giving the model reward when it does something quote-unquote right. For math, it's a bit harder to over-optimize, I think, unless you have tools and the model learns to search and cheat instead of learning math, which I'm sure somebody could see that out in the world, which is like, oh, I'll just find the, you're training. It's like, the model's like, oh, you're training me on Stanford's problem set for CS, whatever, that it's seen a thousand times. So it's like, I'll just go get the solution manual. I'm sure somebody can find an example
Starting point is 00:57:48 where that has surely happened. But on code, and maybe information retrieval, it's easier to fudge. So the code thing is, like, the easiest way to get a unit test to pass is just put a pass in it. Like, that is not too surprising that a model can learn how to do that. And there, for code, you need more reward design. What I think would be nice for, like, a substantial academic work is, like, what is reward design in code for balancing this sort of, like, understanding this over-optimization of test cases or avoiding family. or something like this. I'm sure there's, it's not just like going to be a controlled environment because
Starting point is 00:58:25 these models are complicated, but I would guess you can reproduce that in some ways. Just to double-click, reward design means, like, for example, giving credit, partial credit for partially correct work. Yes, or like giving the model a slight penalty for doing the unit test thing, if you can detect it. Yeah, for cheating. Yeah. Which is, it adds a lot of complexity to training these models compared to math, which is just, is the answer right? I mean, you can look at the GRPO math and partial credit is weird in that because it's kind of normalized per batch. I don't know if I have a whole feel ready on it for that, but it's also just, it becomes very complicated if you're mixing domains and it's like, is partial credit in code better than partial credit
Starting point is 00:59:09 in math or all these things? It's like reward design becomes very complicated, and that's what you're incentivizing the models to do different things. Yeah. Is there any literature or hypotheses about mixing these things? So let's say you have the one for code, you have the one for math, you have whatever other verifiers you can come up with, and individually they work? Do they conflict? I think part of the intuition of RLVR is that the model is good at knowing which prompt
Starting point is 00:59:37 area it is, which is why the models don't get worse on knowledge benchmarks if you're training on just math or precise instruction following. So the model just kind of develops an intuition for like where the different prompts are in space. So the gradient updates will be different depending on your batches, which is partially why people will just say do big batches. So like a lot of the model is activated and you have a less noisy signal with RL. But a lot of the intuition is that the model just kind of handles that. And there's interesting questions on sequencing. Like do you do large scale math and code RL to get the sequence length? And then add in more general stuff, which DeepSeek mentioned,
Starting point is 01:00:16 but that's one thing to go, the DeepSeek report is like math and code to more general RL. There's a question on where do you do tools if you're going to do like code execution and search within this. So I don't know if that's interweaved or if it's a second stage. Got it. Yeah. I don't have comments there. It's just like, it's surprising how much is not known and you just need a lot of compute for ablations. The inference, high inference length generations definitely just like kind of breaks all infrastructure because there's just so many tokens it's more opportunity for out of memory or other things to go wrong so it's like just on a default all of your training jobs need way more GPUs for the memory of inference sure and or just like
Starting point is 01:00:57 training but it's just it just makes it more of a pain yeah that's a cost thing um you know one of the maybe controversial takeaways from the Noam pod which you listened to was that there's also just wall clock time of just getting feedback from the environment, whatever that is, especially if it's like a real world thing. And I'm just like, yeah, I mean, there's some point at which your training runs cannot take longer than like a human life. So to me, that was the wall. He disagreed with that.
Starting point is 01:01:27 But like, that was what I meant by it. Like at some point, long inference, you do want it to terminate within some reasonable amount of time regardless, just as a user. Yeah. We have to find a way to accelerate internally within the training time faster than the passage of time in the actual universe. Yeah, we're not, I'm not worried about that problem, but I agree with you in principle. Right. So I'm stretching this all too far.
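As a rough illustration of the reward-design exchange earlier in this segment (partial credit, a penalty for gaming unit tests "if you can detect it", and partial credit looking "weird" once GRPO normalizes it), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the 0.5 penalty, the idea that a `cheated` flag is available from some detector, and the use of a population standard deviation over the group of completions sampled for one prompt. It is not the Tulu or DeepSeek implementation.

```python
from statistics import mean, pstdev


def code_reward(tests_passed, tests_total, cheated, cheat_penalty=0.5):
    """Partial credit for passing tests, minus a penalty if gaming is detected.

    `cheated` is assumed to come from a separate detector (hypothetical).
    """
    partial = tests_passed / tests_total
    return partial - (cheat_penalty if cheated else 0.0)


def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: subtract the group mean, divide by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions sampled for the same prompt.
rewards = [
    code_reward(10, 10, cheated=False),  # fully correct
    code_reward(5, 10, cheated=False),   # partially correct
    code_reward(10, 10, cheated=True),   # passes only by gaming the tests
    code_reward(0, 10, cheated=False),   # fails everything
]
advantages = group_advantages(rewards)
```

In this group the partially correct completion and the penalized one both land at the group mean, so their advantage is zero. The same raw score of 0.5 would get a positive advantage in a group where every other sample failed, which is one way to read the remark that partial credit behaves oddly under this normalization.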
Starting point is 01:01:51 I get it. As we kind of start wrapping up, what are other interesting ideas that people should pursue? Like in your AIE talk, you said what I'm thinking about for scaling RL, you had big multi-domain data sets, difficulty filtering, long run times. Is there anything specific that if there's people out there that are either doing research or they want to do a company or whatever, these are like interesting things that you don't want to do, that you want other people to explore? Most of them, I think, are not in the reasoning space, which, like, their talks have been about reasoning. So I've been long talking about, like, character training is something that I think is under-indexed on and been advising a student. Character level? Like personality training.
Starting point is 01:02:37 Okay. And how that, like, different ways of changing the personality of the model from prompting activation or fine-tuning. Okay. Like, data engineering. So stuff that, like, Joanne Jang does for OpenAI. So, like, like, how much does that matter? What are the fundamental research things? Hopefully, I can share more that I've been advising a student on that.
Starting point is 01:02:55 So I've been saying that for a while. Do you like the model spec stuff that she's doing? Yeah. Okay. Yeah. That, that's, I've been an early fan of that. I mean, that's how she finally, that's how, like, she noticed me, as I was like the only person that covered it when they first released it. I think it was like
Starting point is 01:03:10 did. I would. I like did. Yeah. Not many people did. Okay. All right. All right. You were first. I don't know. But like that's what she said to me. Well, we had a, you know, we had a model spec talk closed the whole conference, right? Like, that was my sign of like, pay attention to this guys. But it's real because of what it sends to like develop. It has a developer benefit of like where your model's going. And then also just like regulatory. I think it is very important to like, What is like an intentional behavior versus just like a training error? Okay. So I think for model transparency, it's really fantastic.
Starting point is 01:03:40 And I've said that like the model spec is much more useful than a constitution. Because the constitution is like an intermediate training artifact that you give to the training algorithm in order to get the model that you want. It is not necessarily like what model did we. Like we don't write down our goals of the model in a constitution form. By the way, have you looked at the constitution? Not recently. They talked about it. They put in like Apple's like design guidelines.
Starting point is 01:04:01 Yeah. then also like the UN like declaration of this level I've seen it. I didn't know if they've updated it. That's very odd. I hope that Anthropic would write a model spec. I'm not too optimistic, but they're the next domino to fall. Well, so my take on that, actually I pushed for this too late because OpenAI already approved the talk and all that.
Starting point is 01:04:18 But I was going to ask them to compare the OpenAI model spec to the Claude 4 system prompt, which is their closest thing to the model spec. It's the system prompt is incomplete because OpenAI has things in the model spec that their model doesn't currently do. especially when they started. It's like we want to, when they first released it, it was like, we want the model to be able to engage on, like, sensitive subjects and maybe like even NSFW is in their model spec, which is, they're just signaling of what they wanted to do. And they say, like, this is very hard to implement because there's all these obvious risks to doing this. But it's like,
Starting point is 01:04:51 in an ideal model where we can solve every problem, this is what we do, which I think is good, as I said, for many different stakeholders. So I mean, mostly my thing is like there hasn't been a good, like, research paper on that. That's just a lot to do. It also runs into personalization and personality or similar, which is like if open models are to win, part of it could be just like everybody can have exactly the model they want. We're serving GPT-4.5. It's kind of its thing. You can prompt it. But if fine tuning is more effective than prompting, everybody can have the model that they want. So it's a good, it's like an academic problem or an open ecosystem problem where people are fighting on the turf that it feels more likely to win, which is good. Is this someone
Starting point is 01:05:32 where you, like, as speaking as AI2, Olmo, you want to win? Or is this you're just advising a grad student on it? I don't think it's a differentiating factor yet, but I'm very open to working on it. I think, like, open models have a strong, you know, role play use case and, you know, like, character, personalization, all that stuff, right? Especially because people, like, they find their waifu, they want to keep their waifu. And, like, that's the derogatory term for it. But, like, I would say that we've definitely discussed it.
Starting point is 01:06:00 And I want to, part of Olmo should be that it is a base model that's easy to take in directions that you want. And we will have an opinion that is probably slightly conservative on personality. I mean, I've gone through the OpenAI model spec, and it's like most of these we agree with and, like, be conservative on anthropomorphization. What do you disagree with? I don't remember. I did it a couple months ago. But a lot of it is like openness or transparency, which is like if we're training an open weight model personality, like we're not going to withhold anything. And we have a different hierarchy.
Starting point is 01:06:31 So most of them are on like that type of information exchange rather than be kind. Like OpenAI's model spec is pretty agreeable. And if you read through it and it's like treat the user with respect and all these things are. I raise the kids that way. Just read the spec. Yeah. It sounds kind of stupid. But then the last thing is for people doing research, it's like wacky model routing things where you figure out like a bunch of different models to off Hugging Face to route
Starting point is 01:07:00 to because an open model tool thing could use way more models more easily than any OpenAI product. Because OpenAI is restricted to OpenAI's models where if like maybe I don't know OpenRouter's, like I'm going to make a product out of this, which is a router. Like OpenRouter actually does it. And they're like our chat window knows the best model based on all this usage that we have. Yeah. For your query. There's people that started the other way. Like Martian, Not Diamond. I don't know who else is he would know. There's a bunch.
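The routing idea described here, picking a model per query based on accumulated usage, can be sketched very roughly as follows. This is a toy under stated assumptions, not how OpenRouter, Martian, or Not Diamond actually work: the `UsageRouter` class, the category labels, the win/loss feedback signal, and the model names in the usage note are all hypothetical.

```python
from collections import defaultdict


class UsageRouter:
    """Route each query category to the model with the best observed win rate."""

    def __init__(self, default):
        self.default = default
        # stats[category][model] = [wins, trials]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, category, model, won):
        # Log one piece of feedback (for example a thumbs up) for a model.
        entry = self.stats[category][model]
        entry[0] += int(won)
        entry[1] += 1

    def route(self, category):
        # Fall back to the default model when there is no usage data yet.
        seen = self.stats.get(category)
        if not seen:
            return self.default
        return max(seen, key=lambda m: seen[m][0] / seen[m][1])


router = UsageRouter(default="general-model")
router.record("table-reformat", "small-table-model", won=True)
router.record("table-reformat", "general-model", won=False)
choice = router.route("table-reformat")
fallback = router.route("diagram")
```

A real router would also need a way to classify queries into categories and to weigh cost and latency, which this sketch leaves out.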
Starting point is 01:07:28 There's a bunch. Yeah. So I don't know. I don't know if that would work. Hugging Face should work on it. It's like, it's a moonshot idea. You don't know when it's. Given your Hugging Face background.
Starting point is 01:07:37 What does, how does Hugging Face make money? This is a very common meme question. I think mostly like enterprise deals. That's what they say. Which is like, they're doing their thing. They're supporting their people.
Starting point is 01:07:48 I mean, look, they're great. They're big. They're profitable. It's just not that obvious to most people. I like the router idea for media models. I feel like there's like so many. There's like a long tail of like a, background remover, like a style applier, like, that is actually hard to find.
Starting point is 01:08:04 On the text side, I feel like just use the big model. Unless you're like under some like latency or price constraint, you should just use the best model. Even when we're doing thumbnails, I'm like, okay, I'm trying to remove a background of somebody. And it's like I go on Replicate and there's like 55 background removers. Yeah, I just use Adobe because it's a website. Well, but that doesn't work. Like the Photoshop model is bad on some things. But again, it's like, or I want to generate a diagram to like mimic something.
Starting point is 01:08:29 thing and it's like, well, which model is better at diagrams? Yeah. You know, it's like, those are not easy to find because none of the benchmarks are right now. Part of the argument is that if distillation works really well, we could just keep making the target for distillation smaller and smaller, which is you have models that are very narrow. Right. And they're mimicking these huge models on something that's like pretty, I don't know, like, reformatting tables.
Starting point is 01:08:50 It's like, can you do a table reformatter from markdown to LaTeX in a 100 million parameter model? Like, if you get it small enough, that is really economically feasible because it's effectively free at inference and instantaneous. My pushback on this is just, if you're doing image editing, 4o should do it, do all of it. Well, yeah, but I think it does, like, we're just not there yet. Like, give it five years, it'll do it, right? So why a router at all? You just scale up 4o.
Starting point is 01:09:18 I guess I think it's, yeah, right? Like, tell me where the logic is here. Like, this is like a temporary thing. On device. On device. Like the local modeling community, I think, is much smaller than people give it credit for because most of the use for open models is still in APIs. It's like DeepSeek API. It's convenient. It's like if there aren't that many models, somebody is going to host it
Starting point is 01:09:41 for cheaper than most people doing it themselves. That's pretty realistic. But there is a small community that need local. Yeah. The best outcome is if open models can compete on not just long tail things. But that takes the most transformation. Side note, so I resisted buying my own GPUs, building my own cluster. For this reason, I'm like APIs will solve most of it. Like, people are losing money to serve me models. Why am I, you know, having those? Except for the fact that 4090 prices have doubled in the last year.
Starting point is 01:10:14 So actually, you made money doing local models. How does that make you money? Because your investment goes up. Yeah, you can sell the card and you do that. So as you used 4090s goes out. Interesting. Should have bought a 4090. I got a 47.
Starting point is 01:10:29 Damn. What is this? Well, then it puts me on tail light. Should I buy, you know, a 5090 if it ever, you know, is widely available. GDT, they were doing the drops. Yeah, I know. It was crazy. We were like running to the camper to buy it.
Starting point is 01:10:44 Any other topics before I give a closing question? Just generally, your work, RLVR, like, are topics of the day. I think companies should keep considering releasing open models, mostly for PR and onboarding. It seems like the way it's going if OpenAI is releasing it. Are you excited about that? Do you feel like it's like a psyop again? The OpenAI model will be good. I expect it to be.
Starting point is 01:11:07 They're pretty serious. It'll be best in class for some size category and some subset of tasks. That's like OpenAI only does things like that. You have to give them the respect. They won't deserve it. Yeah. That is a big like open wins when more people are doing it. So like that's a win.
Starting point is 01:11:25 And yeah. Well, I mean, hopefully they are actually open about the techniques, not just the weights. Do we think the size of the open model is, tells us anything about the hardware that they're going to build? No. What? No. They're so secretive about this. That's like, that's why they haven't released GPT-3.5 or anything because it's too revealing about internal stuff or plans. Oh, okay. No, so I, I, you're talking about Stargate or what do you, what kind of hardware? The Jony Ive thing. No, yeah, I think that's a different form factor. Yeah, that's it. Yeah, yeah. I think that thing will run on the cloud. I don't think that'll run local anyways. Well, okay, we have to
Starting point is 01:11:58 talk about it. It seems like every podcast you talk about it. So apparently the news from today, which I think you were looking at, was that it was like an ear device that they sued, they got sued over or whatever. But like, I think the ear form factor is pretty good. Like, I actually did get there with B in terms of like where, where does this ultimately go? Like, you want something, you want the AI to hear what you hear. And where do you hear what you hear it on the ear? Like, that's pretty much it. I don't know if you guys have like thoughts on wearables and where that goes. I try to be, I think it just knows too much. That's really my...
Starting point is 01:12:33 But you want to give it context. Yeah, I have false privacy hopes. I think a lot of people, I mean, that's the whole thing. It's like people don't actually care about privacy. It's just note taking, you know? It's just a really good memory. I think the Meta Ray-Ban form factor is good. I don't think it's as mass market.
Starting point is 01:12:49 It's like if you get it in an AirPods-sized form factor, it's a way bigger market for obvious reasons. But the, like, sunglasses form factor is the thing that works, I think. I don't use them for AI, but they can fit the AI to work it. Like, yeah, empirically, yeah, it obviously works. Yeah. Cool. Well, the last question I was saving up was this whole, what is meta doing? You know, you actually had a pretty interesting post back in, when was this?
Starting point is 01:13:16 In April, you said, Llama 4, did Meta just push the panic button? I feel like back then it didn't actually push the panic button, but now they really pushed the panic button. That's fair. I think the panic button at the time was the whole LMSYS model not being the model they released thing along with a bunch of weirdities about the day of the week they released. But to be a model that claims to be open
Starting point is 01:13:38 and then not release the model that is your leading claim is just like that is like bad execution. Bad execution. Yeah, yeah, which is fine. And then the recent stuff I think mostly can be boiled down to talent is cheaper than GPUs by a dramatic margin. And at the end of the day, it's like, okay, if we're spending this much, They go to the room and they're staring in the mirror,
Starting point is 01:14:00 they're like, wait, it might not actually be that ridiculous to spend this money on the top people. It's like, might as well try. They already spend it on VR. Somebody was bound to do this eventually. And it makes sense that it's like if Apple or somehow decide, like, we're going to do this, they're going to come in and do exactly what Meta is doing. They need a founder mode CEO who is like, screw it, like, you know,
Starting point is 01:14:21 we'll take the L. The thought that did occur to me is, you know, Meta, instead of spending on VR, they should spend on RLVR. Well, I think the question is, like, I think a lot of, some researchers, like, most people will take the payday and happily move to Meta. Everybody has a bribe number. Right.
Starting point is 01:14:40 It's just a hammer really big. Yeah. But, like, I think some researchers are uncomfortable with the idea that this is sort of the great man theory of research, that, like, you have to pay this much to get this level of talents. And the talent is definitely distributed. Yeah. Right.
Starting point is 01:14:55 A lot of the people that they would be paying this much have the confidence to redo things or to just do some of the same things and just like, whether you call it feeling the AGI or just drive to build things. Feeling the AGI is not that different than a lot of things that have existed in Silicon Valley lore in the past. It's just people with the vision that are willing to execute on it and they see something coming. And those people make a big difference.
Starting point is 01:15:23 I think you have those people and you remove bureaucracy, getting technical, talented researchers is actually something that Meta has a lot of or has the ability to get a lot of. So it's a lot of recycling, which is very hard on individuals and morale of an organization. But that's
Starting point is 01:15:43 like, I understand the approach. Yeah, for sure. Cool. That's all that. Any parting thoughts on how you're going to build the American DeepSeek? That was a nice tweet. Yeah, mostly if I have to look at what my
Starting point is 01:15:57 And if you were asking me, like, what my 10-year goal is, and it's like, I only will have, like, a two-to-five-year goal, where I think as models are shifting more towards agents, I think that, like, scaling is slowing. It's like, there's sort of a fixed cost and a fixed path to getting towards something like American DeepSeek, or mostly just, I would say it doesn't have to be American if it's fully open. You have everything and you can modify it, which is, like, there's a few things that need to fall. A lot of it is just more resources, but it's like, like, Olmo 32B is if you squint like original GPT-4 level and fully open. And it's like there's a few levels that you need to go through. Like that's obviously a dense model. It needs to be taken to sparse MoE and you need to scale it. You need to have a lot more GPUs and then you need to do like large scale reasoning. That's the goal that I want to do. There's a lot like that's what I want to do. There's a lot of complexity in navigating like how to work with AI. Like what does AI2 do to get there. It's very hard. I think that, I mean, it's a nonprofit. It's hard to get the resources and building a model is a lot of aligning a lot of different people. That's the DeepSeek story is they
Starting point is 01:17:08 have great people. OpenAI has kept a lot of really good people for a long time. Anthropic has gotten a lot of good people right now. And it's like it's a lot of incremental, hard technical problems that you need to stack up. That's what I would like to do and make work in the next couple of years, but it's not easy to get there. So that's the pitch is, like, AI2's best case scenario is AI2's going to do other things. Like, you can't just run a nonprofit or a company that says, our goal is in three years to have an American DeepSeek.
Starting point is 01:17:37 Like, no one's going to keep paying the bills on that because you have to tell a better story. But that's, like, what I would like to do in that. And I'm sure AI2 will do many more interesting things along the way. Like product stuff. I don't know it's necessarily product, but like, what are more, like, what are cutting edge things in AI that we can make a new architecture for certain things. Okay.
Starting point is 01:17:57 Or like what are demos of open models working better, whether you have like private data or something, or just far out ideas that could take you off the transformer trajectory. I think that, like, you still need to be doing these to kind of lead an AI. Thank you for working so hard on truly open source AI. Yeah, it's fun. I mean, it makes it easy to align like values with what you're doing. Yeah. Like, it'd be better for the world.
Starting point is 01:18:22 if more things are open and therefore, a lot of it is just willing it into existence. And I take seeing, like, what OpenAI does, or is saying they're going to do, as, like, hopefully a win coming soon. Yeah. Like, DeepSeek was the most unexpected win that made some other
Starting point is 01:18:38 dominoes fall. Well, yeah, I think that is the path forward and see what it takes. Thank you so much. Thanks for coming on.
