No Priors: Artificial Intelligence | Technology | Startups - Asimov: Building An Omniscient RL Oracle with ReflectionAI’s Misha Laskin
Episode Date: July 17, 2025

Superintelligence, at least in an academic sense, has already been achieved. But Misha Laskin thinks that the next step towards artificial superintelligence, or ASI, should look both more user- and problem-focused. ReflectionAI co-founder and CEO Misha Laskin joins Sarah Guo to introduce Asimov, their new code comprehension agent built on reinforcement learning (RL). Misha talks about creating tools and designing AI agents based on customer needs, and how that influences eval development and the scope of the agent's memory. The two also discuss the challenges in solving scaling for RL, the future of ASI, and the implications of Google's "non-acquisition" of Windsurf.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @MishaLaskin | @reflection_ai

Chapters:
00:00 – Misha Laskin Introduction
00:44 – Superintelligence vs. Super Intelligent Autonomous Systems
03:26 – Misha's Journey from Physics to AI
07:48 – Asimov Product Release
11:52 – What Differentiates Asimov from Other Agents
16:15 – Asimov's Eval Philosophy
21:52 – The Types of Queries Where Asimov Shines
24:35 – Designing a Team-Wide Memory for Asimov
28:38 – Leveraging Pre-Trained Models
32:47 – The Challenges of Solving Scaling in RL
37:21 – Training Agents in Copycat Software Environments
38:25 – When Will We See ASI?
44:27 – Thoughts on Windsurf's Non-Acquisition
48:10 – Exploring Non-RL Datasets
55:12 – Tackling Problems Beyond Engineering and Coding
57:54 – Where We're At in Deploying ASI in Different Fields
01:02:30 – Conclusion
Transcript
Hi, listeners. Welcome back to No Priors. RL is back with a vengeance, and one of the most
talent-dense new research labs has a product release, a new code comprehension agent.
Reflection AI's co-founders Misha Laskin and Ioannis Antonoglou worked together as leaders at Google
DeepMind on groundbreaking projects like AlphaGo, AlphaZero, and Gemini.
I talked to Misha about building universal superhuman agents,
the trickiness of reward modeling,
bringing all knowledge work tasks under the data distribution,
how RL for language and robotics differs,
the Windsurf non-acquisition, and the landscape from here.
Misha, welcome.
Thank you for doing this.
Yeah, thanks, Sarah, for having me.
So it's been a wild year and a half or so since you guys
started the company. Is that about right?
Roughly a year and a half, maybe a bit less, but I'd say it's
largely correct.
Well, can you just start by describing it? You've said that the
company's mission is to build superintelligent autonomous
systems, and we've talked before about why this is the
moment in time when that's possible. What is different about that from building just superintelligence, which is now a sort
of more popular, ambitious goal?
At a high level, it's fairly synonymous.
But maybe there are different ways of thinking about how to build super intelligence and
what that might look like.
I think on one end of the spectrum, there's an academic way to look at it, in which sense superintelligence has, to some extent, already been achieved. So
AlphaGo was a superintelligent system, and there were other systems built during that time
that were superintelligent in narrow domains. And I think you can go
for the goal of building a very broad superintelligence by
locking yourself up in, not really an academic lab, but
an industrial lab that is decoupled from product or customers, and
maxing out all the benchmarks that are out there and building superintelligence that
way.
I think that is one approach.
I think the other approach is to kind of think about
what is superintelligence more concretely?
How is it gonna be deployed?
What is it actually gonna look like in people's hands?
And build backwards from there.
So I would say that that approach
is more about co-designing products
and research together.
Now, the benefit
of that approach is that you're optimizing for real problems. The con is that you have
to be a lot more focused, right? Because your product defines the sort of capabilities
that you want to draw out of the system, and you have to start out a lot more focused before
expanding across other product categories and other capabilities. So on the spectrum between companies that treat superintelligence as just a research
project, and figure out what the product is once it's built, and companies that co-design
products and research together to build very powerful systems in what I would call ASI-complete
categories:
you can pick something that is maybe too small a category to draw out a superintelligence,
but as long as you pick a category that is big enough to be ASI-complete,
I think, and this is our approach at Reflection, it makes a lot more sense
to be focused and co-design those two things together, the product and the research.
I want to come back to choice of initial problem in a minute. In terms of just having the intuition
and the confidence to say, like, we can go do this as a team, we're going to recruit great people
and go build Reflection. You and your co-founder, Ioannis, were working on Gemini together in key
roles before, and previously you had been part of Pieter Abbeel's lab, and he's an amazing researcher as
well.
You described yourself to me as having, I believe the term you used was, somewhat muscled your
way into AI and deep learning from originally a physics background.
How did you decide to go work on this and end up in Pieter's lab?
Yeah, as a kid, I became really interested in physics, theoretical physics. It was
probably a byproduct of my background: I'm Russian, kind of Israeli-American, and moved around. And then
when I landed in the States, it was in a desert in Washington state, learning a new language.
And so I had a lot of time on my hands. My parents had the
Feynman lectures in their library, and so I spent a lot of time just reading
what was on the shelf, bumped into that, and got really interested in physics.
How old were you?
My interest in physics started probably around middle school, and
it really became the thing I wanted to do in high school. And the reason physics was so interesting was because it
kind of seemed like the science that was at the root of many of the things that became
impactful. So I was reading about the history of the transistor, and it was invented by
a group of theoretical physicists. I was reading about how GPS works, so it turns out you need special relativity in order to accurately account for spatial coordinates using GPS. And so I felt that
physics was kind of the root science to pursue. I went in and studied it, got my PhD in it. At
the same time, I started seeing deep learning take off and really saw AlphaGo happen.
And my sense was that I want to pursue the root science, but there is such a thing as
the root science of our time. I think physics as a field is very interesting,
but it's crystallized a lot more than a new, dynamic field that was being born out of nothing.
And AI to me felt like it was going through the moment that physics went through maybe 100
years ago. When I did problem sets in physics, the most exciting stuff that
I was working on there was basically the things that people were discovering 100 years ago.
So I saw it kind of happening in front of my eyes, and I just decided that that was the science to bet on.
And in particular, it was AlphaGo
that inspired me,
because it was just unbelievable to me
that you could train a neural network
to have such immense
reasoning capabilities, right?
This thing was superintelligent
within the realm of Go.
Yeah, I decided that I needed to get myself into the best reinforcement learning lab I could.
And Pieter's lab was that lab for me.
And then you and Ioannis were working specifically on RL at Gemini.
That's right. So Ioannis, my co-founder, was the
overall RL lead for Gemini at the time, for 1 and 1.5.
I was working very closely with him on his team.
It was a really exciting time because we went,
both of us from being reinforcement learning researchers,
to training large language models at scale.
At the end of that project, we saw what was to come.
Gemini 1 and 1.5 landed,
and it became pretty clear to us that
the next paradigm, and effectively the final paradigm that we need to have in place before
what people used to call AGI, or now I think the goalposts have shifted to ASI, is reached,
is just figuring out how to scale reinforcement learning on top of large language models.
And the first instances of that have been happening
over the last year.
I think we're still actually a lot earlier than people think.
But there is a wedge in and things have started to work.
Yeah, I definitely want to talk about what you think
is solved and unsolved here.
The entire field has clearly gotten
more focused on deep reinforcement learning
over the last 18 months.
You have this huge product launch this week with Asimov.
Can you just describe what it is?
So Asimov is the best code research agent in the world.
It's a comprehension agent, meaning
that it's really designed to feel almost like a deep research for large code bases.
The way a developer is supposed to feel when interacting with it is effectively like they
have a principal level engineer who deeply understands their organization at their fingertips.
So it's very different from the existing set of tools, which focus primarily on code generation.
Every single coding tool has some code generation and some comprehension aspect.
But we spent a lot of time with our customers
trying to understand why coding tools fall short here,
and this is enterprise-specific,
so I think the world is different with startups.
Within enterprises,
when they're adopting coding tools
and you see the impact that this is having
on their actual productivity, I
think it's much lower than people expect.
In fact, it's sometimes negative, sometimes negligible.
Did you see the recent METR report on that?
Yeah, the METR report was very close
to what I've been hearing when talking to engineering
leaders within larger organizations.
And it's not just enterprises, it's, I would say, growth stage startups.
It's any kind of engineering organization that has a sufficiently complex code base
and sufficiently large team that no one engineer can have the entire code base kind of in their
heads.
And so Reflection is one of those places as well.
We use our product actively because training large
language models is complex.
And there's the large language model code base.
There's the product code base.
Knowledge is scattered across engineers.
It's not just in the code base.
It exists in your chats and project management tools
and other places where knowledge lives.
And so what we're effectively building towards
is this kind of omniscient oracle for organizations
that you can go in, ask any question
at any level of complexity and it'll provide you
an answer at the level of what that principal-level engineer
would have given you or in the future
as the product expands to other categories, what the person who's most embedded in the organization understands.
And of course, once you have that solved, it begets much more reliable agents that act
for you as well. But I think the world today is focused on, I would say, 80% action,
20% understanding.
So 80% code generation, 20% comprehension.
The actual problem is exactly the opposite.
That when you look at what an engineer does in an
organization, 80% of their time they're spending trying
to comprehend complex systems and collaborating with
teammates.
And what is collaboration?
It's usually someone asking someone else a question about
a system that they don't know. So that I think is kind of the problem at the heart of what would prevent a super intelligence
from actually working within an organization. It's really this kind of understanding and being
able to ingest from a lot of sources of information and from the team. Once you have that, then the action part, I think,
becomes, I don't want to say trivial, but a lot easier.
Like to me, it seems like really 20% of the problem
is teaching these agents how to act,
and it's more or less solved.
That definitely squares with both my understanding
of engineering and then my experience
with coding agents personally, right?
If you think about, I don't know,
the context load time of just
trying to understand a new system, or code anyone else has written, or code your agent has
written. In the end, you get a very naive implementation that, if the agent had reasoned
through it with context of the system, it never would have made such a mistake, or a works-in-my-environment type problem.
And so I think that very much mirrors my intuitive understanding of engineering here.
That's great as problem formulation.
What makes Asimov different in terms of ability to understand better versus just generate
code?
There are a few things.
So I think this is where, and why, it is so important
to co-design research and product. Because as a researcher, you'd go in and say the
answer is entirely in the agent design or the model or something like this. And as a
product person, you would say, well, it's in these product differentiators,
like being able to draw not just from your code base, but knowledge that lives
in other sources of information,
or being able to learn from the engineering team
to offload their tribal knowledge.
So an engineer can go in and teach Asimov,
like, hey,
when we say environment jobs on our team,
we mean this specific thing,
this specific kind of job.
So now when another engineer asks a question
about environment jobs in the future,
the system just knows what they're talking about.
A lot of knowledge is stored in engineers' heads.
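As a rough illustration of that kind of taught, team-wide knowledge, here is a minimal sketch of a shared glossary that gets injected into an agent's prompt. The `TeamMemory` class, its naive keyword matching, and the prompt format are hypothetical; they are not Reflection's implementation.

```python
# Hypothetical sketch: a tiny team-wide "memory" that stores taught definitions
# and injects the relevant ones into an agent's prompt. Illustrative only.
from dataclasses import dataclass, field


@dataclass
class TeamMemory:
    # term -> explanation taught by an engineer (e.g. "environment jobs")
    glossary: dict[str, str] = field(default_factory=dict)

    def teach(self, term: str, explanation: str) -> None:
        """An engineer records tribal knowledge once, for the whole team."""
        self.glossary[term.lower()] = explanation

    def relevant_notes(self, question: str) -> list[str]:
        """Naive retrieval: surface any taught term mentioned in the question."""
        q = question.lower()
        return [f"{term}: {note}" for term, note in self.glossary.items() if term in q]


def build_prompt(memory: TeamMemory, question: str) -> str:
    notes = memory.relevant_notes(question)
    context = "\n".join(notes) or "(no team notes matched)"
    return f"Team notes:\n{context}\n\nQuestion: {question}"


memory = TeamMemory()
memory.teach("environment jobs", "Our nightly jobs that rebuild RL training environments.")
print(build_prompt(memory, "Why are environment jobs running slowly today?"))
```

A real system would use semantic retrieval and per-team permissions rather than substring matching, but the idea is the same: knowledge taught once becomes available to every teammate's queries.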
And I think you need both of these things.
You need to understand your customer really closely
and develop differentiated product,
almost independently of the models that are powering it.
But then you also need to innovate on the research
in terms of agent design and model training
to actually drive the capabilities
that you want to see out of the system.
And this becomes an evaluation problem,
which is basically at the heart of any frontier lab as well.
This is, I think, the least spoken about part
of what frontier labs do, but possibly the most important,
which is figuring out how they evaluate.
What makes Claude magically feel better at code than another model out there?
They did something right in their evaluations.
So when you look at this problem specifically, there are different capabilities that you
need to train.
And what we do is really post-train models; we really focus
on post-training today.
One of those capabilities is long context reasoning.
Now when I say long context reasoning, I actually mean kind of small models with very long contexts
that are able to go into giant code bases, sort of suck up as much information as they
can, reason over it, and output the relevant stuff, basically.
So it's almost like neural retrieval.
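A minimal sketch of what that "neural retrieval" step could look like: pack a huge code base into long-context chunks, ask a model to extract only what is relevant to the query, then reason over the merged findings. The `llm` function is a placeholder stand-in, not a real API, and the chunking heuristic is an assumption for illustration.

```python
# Hypothetical sketch of long-context "neural retrieval" over a big code base:
# split files into large chunks, ask a long-context model to extract only the
# snippets relevant to the query, then merge. `llm` is a placeholder, not a real API.
from pathlib import Path


def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a long-context model call")


def chunk_codebase(root: str, max_chars: int = 200_000) -> list[str]:
    """Greedily pack source files into chunks small enough for one model call."""
    chunks, current = [], ""
    for path in sorted(Path(root).rglob("*.py")):
        text = f"# file: {path}\n{path.read_text(errors='ignore')}\n"
        if len(current) + len(text) > max_chars and current:
            chunks.append(current)
            current = ""
        current += text
    if current:
        chunks.append(current)
    return chunks


def neural_retrieve(root: str, query: str) -> str:
    """Map over chunks extracting relevant snippets, then reason over the merge."""
    findings = [
        llm(f"Extract only code and facts relevant to: {query}\n\n{chunk}")
        for chunk in chunk_codebase(root)
    ]
    return llm(
        f"Answer the question using these findings.\n\nQuestion: {query}\n\n" + "\n".join(findings)
    )
```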
There are capabilities like tool use and multi-hop reasoning.
So this is more for, you have your agent
and it's designed with some tools.
And there are two ways of training agentic models.
One is in this very general way
where you just train it on thousands of environments
and make it like the most general agent possible. And that is almost like the pre-training
of agents. That's sort of what a frontier lab does. That's what
the new release of Kimi K2 does. And that's definitely part of it; it gives you a nice
general base to start from. But then you want to drive a capability depth-wise. Like if you really
want this reasoner that has search tools, the ability to call these long-context
reasoning models, and other tools it might want to interact with, like, oh,
when do I read from JIRA?
When do I read from another tool?
This is kind of a reasoning problem.
If you train with those specific tools in mind,
that's typically what people refer to when they say tool use.
They actually train for a specific set of tools
and really drive the capabilities for those tools.
So these are the kinds of research problems
that you need to solve in order to build the overall system
that's the best in the world.
It's not any one thing.
It's all these things combined.
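To make the tool-use piece concrete, here is a bare-bones sketch of an agent loop deciding which of a fixed set of tools to call (code search, an issue tracker, a long-context reader). The tool names, the `llm` stand-in, and the JSON action format are illustrative assumptions, not Asimov's actual design; the point is only that training against a specific, known toolset is what "tool use" usually means.

```python
# Hypothetical sketch of a tool-use loop: the model decides which of a fixed
# set of tools to call next (search code, read the issue tracker, etc.).
# The `llm` function and the JSON action protocol are illustrative only.
import json


def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a tool-trained model")


TOOLS = {
    "search_code": lambda arg: f"(code search results for {arg!r})",
    "read_issue_tracker": lambda arg: f"(issue tracker results for {arg!r})",
    "read_long_context": lambda arg: f"(long-context summary of {arg!r})",
}


def run_agent(question: str, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\nTools: {list(TOOLS)}\n"
    for _ in range(max_steps):
        step = llm(transcript + "\nReply with JSON: "
                   '{"tool": <name or "finish">, "arg": <string>}')
        decision = json.loads(step)
        if decision["tool"] == "finish":
            return decision["arg"]  # final answer
        result = TOOLS[decision["tool"]](decision["arg"])
        transcript += f"\n{decision['tool']}({decision['arg']}) -> {result}"
    return "Ran out of steps."
```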
And some examples of systems that are being trained
for a specific set of tools,
the thing that comes to mind is the Grok 4 release,
and they kind of showed a plot of their general model.
And then the model that was trained with a tool
to basically climb on Humanity's Last Exam.
And there was some big noticeable difference
between the two.
Now, that's great, but I think the downside of that
is: does Humanity's Last Exam actually
matter in any meaningful way for an end user?
And I would argue there's some weak correlation,
but the answer is most likely no.
And so you have to build the tools
and train for the things that users actually want.
I think that there's sort of no way around that.
What can you share about how you evaluate,
either like technically or philosophically,
that makes Asimov's performance great?
This is sort of why it makes sense to do something
like this as a startup.
So the only advantage that you'll ever
have as a startup over a big incumbent,
especially when there are such talented teams out there,
is kind of focus and velocity against the thing
that you're focused on.
Now, if you want to be playing
in what is arguably, I think, the biggest category in AI,
which is coding, then you need to have the talent as well
to do it, but what do you do if you don't have
the billions of dollars to pre-train models?
The only way we can win, I think, is by being very focused.
So the way I would describe what
it looks like to work on a big model within an incumbent lab is that you are one of hundreds
of evals. When you look at the model card for, let's say, the o1 paper
that came out, I think, last year, and look at the distribution
of what most people worked on on that paper, it was evals.
So you're one of many people doing all sorts of evals,
and by spreading yourself in that sense,
you get something that's general,
but it's spread fairly thin.
As a startup, and a startup that has a very focused product,
that's not being too diffuse, and that's pretty opinionated about what it is it's
building, your evals come from your customers. You know, in the startup lore,
Paul Graham would tell you to go talk to customers, like half
the time build product, half the time talk to customers. I think in the AI age,
it's: develop your evals based on what customers are saying and what they're
doing. So you have to work with your customers
to look at what prompts it is that they're trying to solve.
What general questions are they trying to unlock?
So there's very specific pain points
that we've identified, like onboarding being one of them.
Like in a big company,
it takes months to onboard an engineer.
So how do you develop evals that accelerate the onboarding
of an engineer from months to hopefully just a couple of weeks now that all the questions
that they had, they can just ask Asimov and be able to onboard much faster. So I think there's no
silver bullet other than coupling to the information coming from customers, but then being very scientific in the evals that
you develop across them. So you have these, let's say, customer needs, let's say onboarding and,
you know, a bunch of others. And then you have your system capabilities, which is, well, what do you
need in order to provide a good experience there? Well, this customer is being onboarded onto a
giant code base, like it has, you know, it might be a codebase that on its own is like 100 million tokens
or something.
Well, then you need to figure out some way to reason over that giant codebase.
So you have kind of a long context reasoning capability. Or you look at your agent
and see what's preventing it from satisfying the query from a user.
And so you kind of work backwards and reverse engineer
from what a user is asking for to
what capabilities you want to drive in your system.
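A minimal sketch of what "turn customer questions into evals" could look like in code: real onboarding questions become test cases, and a judge scores the agent's answers against a senior engineer's reference notes. `agent_answer`, `llm_judge`, and the case fields are hypothetical placeholders, not Reflection's eval harness.

```python
# Hypothetical sketch of a customer-derived eval: real onboarding questions
# become test cases, and an LLM judge scores the agent's answers against
# reference notes. `agent_answer` and `llm_judge` are placeholders.
from dataclasses import dataclass


def agent_answer(question: str) -> str:
    raise NotImplementedError("stand-in for the agent under evaluation")


def llm_judge(question: str, answer: str, reference: str) -> float:
    raise NotImplementedError("stand-in for a rubric-based judge; returns 0..1")


@dataclass
class EvalCase:
    question: str          # verbatim question a new engineer actually asked
    reference_notes: str   # what a principal engineer said the answer should cover


def run_eval(cases: list[EvalCase]) -> float:
    """Average judge score across all customer-derived cases."""
    scores = [llm_judge(c.question, agent_answer(c.question), c.reference_notes) for c in cases]
    return sum(scores) / len(scores)
```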
But the important part I think is to be able to
tweak every part of the system from the product features,
to the agent design, to the model training,
in order to build the best overall system.
And if you are capped in which parts you can change,
if you can only change the products and agent design,
then you're actually pretty limited in what you can do
because you're kind of at the mercy of what
these general third party models can do.
What I'm hearing from you is also that there is some trade
off between having to serve all different kinds of users and optimizing across those
different evals because each one of the teams that is thinking about a particular use case
or audience at a more general organization, for example, is less likely to have the ability
to work through the entire pipeline from training to product to win their use case.
So the thing that was extremely satisfying about working on Gemini is that you're driving research at the frontier,
and there's something very gratifying about that.
The downside was that you were so far removed from product that it was kind of a broken-telephone game,
with information flowing through four different people before the model got into the
customer's hands. That coupling was very loose. And I think it's very true that just because
a company might have the best model in some general set of academic benchmarks doesn't
actually mean they have the best product. And I think what we're seeing is when things
really fit together, it's usually that there's a, you know, a tight coupling between a product and a model that it's a whole system. It's not just the model alone.
Obviously, the first big example of that was ChatGPT.
ChatGPT is kind of an incredible product that was coupled with the model, and the model was post-trained for the prompts that are coming in from ChatGPT users.
There was a reason why, when I saw the first coding blog post that
ChatGPT produced for me, that was just insane.
That was an insane, magical moment, and they post-trained specifically for that.
And I think there's another example of that happening right now with Claude Code.
That's kind of tight model-to-product coupling.
And so I really think that it's important to really be able to do both at a great degree of excellence.
What is an example, as you guys open up the waitlist, that you want users to try, where it should just be obvious that the answers are better than other coding agents?
I think the kinds of queries that it tends to be better at
are, I guess, what we would call semantic queries.
So let's say, like, an example of a query where this is not
the best system to use.
It's like file level.
If you're looking at a file and there's, like,
a specific thing in that file and you're just
trying to get a quick answer to it,
you don't really need the hammer of, like, a deep research,
like, experience.
You don't need to wait, you know, tens of seconds or a minute or two to get that answer, because that should just be
delivered snappily. But if you don't know exactly what you're looking for, you don't
know the function name, you don't know something, and this is the kind of hard problem
that engineers are usually in, like there's a flaky test. You know that this test is flaky, but that's where your knowledge stops, right?
And that's when you usually go to Slack and ask some engineers, like, this test is flaky, what's
going on, does anyone know? The way we've used it is, when you're training these
models, there's a lot of infrastructure work that goes into it, and it fails in interesting ways all the time.
So you ask things like, you know, my jobs are running slowly, five times more slowly
than usual.
Why is that?
That's kind of a vague query that would be very hard to answer with existing systems,
especially since the knowledge around that query
might live not just in the code base.
So in the example that I just brought up,
when this was happening, when our environment jobs
were slowing down, it turned out that two different teams,
the infrastructure and research teams,
submitted pull requests that passed the tests.
It wasn't that they were wrong,
but they conflicted together in a way that caused,
effectively, a race condition and slowed everyone's jobs down.
These are the kinds of bugs that engineers actually spend time on.
That's where you have two or three engineers who spend a few days trying to solve one of these.
So I think these kinds of semantic queries
tend to be the place where a product like this shines. In the same way, think of the kind of query you would
ask ChatGPT when it just needs to use the browser tool. So it's like
a quick factual thing. Like you wouldn't invoke the deep research experience. But when you
wanted to compile kind of a lot of information around some more nebulous
query, I think that's where people seem to find a lot of value with deep research. So
I think a similar mindset holds here.
One thing I would do, working on a new system with a principal engineer next to me, is just
have them explain the entire system. Because without that context I can't even tell the agent what
to do.
And so I'm curious from a product perspective, like the way you have, you know, memory for
agents or even for teams is an increasingly popular idea.
There's lots of ideas about how to do it.
I think there are not many examples of collaborative memory in production in a
useful way yet, but I'm sure it is coming. Have you guys designed it in a form I can
understand too?
Yes. This is actually one of the more fun things to work on in product today. I think
it's one of the more fun features to work on at the company: how do you design a team-wide memory? Because
there are all sorts of details around, well, who can edit the memory, who can view different
parts of the memory, how do you maintain a kind of repository of this memory
for people to edit and view? You have to have a concept of authority, right? People are going to
say things that are wrong. The way it's worked with the customers we've started working with is that they typically
want to start off with a group of trusted, senior, staff-level-plus engineers
who are the gatekeepers, which is a very common notion, I think.
You have permissions, right, and ownership structure in code bases, and they basically
are the ones who populate the memory first and then expand the scope. And I think it works. It's actually a much more
complex feature to build because it touches on, yeah, org-wide permissions. There's some parts of
the code where a certain engineer should be able to edit the memory, but other engineers shouldn't.
And so it actually starts looking like a new way of versioning code, effectively,
right? It's kind of a GitHub++, because you're not versioning the code, you're
versioning the meta-knowledge around it that helps language models understand it better.
But definitely that is something we built that I think is a thing to iterate on a lot
until you get the right design, because you're effectively building a new Git from scratch.
Yeah, it's interesting.
And you're trying to design some sort of permissions into it, whereas the dominant system
today in actual version control is, at best, pull request review, right?
Like, you just try.
And somebody in the organization with the ability to review
makes a determination as to whether or not Misha should be able to make this change,
based on the content.
And I think actually, it's going to look not too dissimilar from that, right? Where if
you want to change the agent's team-wide memory, then it probably is going to look
something like a pull request where the person who really understands that system approves or, you know, edits it or something like this. I don't think it's going to look too dissimilar.
And it makes sense to me that it would look perhaps a little bit more Git-like, in that the person who knows the part of the codebase you are creating or editing knowledge
about is going to evolve over time as the codebase evolves over time and the team does
as well.
Yeah, exactly.
But this is also how it worked at Google, and I think other places as well. It was very common for different parts of the
code base to have owners, so there are these ownership files, which we have as
well. Basically, if you're on the ownership file, then the review has to go through you,
or it has to be approved by at least one of the members of the ownership file, and as people
move around teams and so forth, the ownership files themselves get updated.
So I think a pretty similar structure is probably going to hold here, but it's a lot more nuanced
than building an individual memory, which is just personal to you and
lives on your computer in your AGENTS.md file or something.
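For illustration, a sketch of that OWNERS-file style gating applied to team-memory edits: a note attached to some code path only lands if an owner of that path approves, mirroring code-review ownership. The file format, path prefixes, and names here are assumptions, not how Asimov or Google's OWNERS files actually work.

```python
# Hypothetical sketch of OWNERS-style gating for team-memory edits: a memory
# note attached to a code path can only be merged if an owner of that path
# approves, mirroring code-review ownership files. Illustrative only.
from pathlib import PurePosixPath

# path prefix -> engineers allowed to approve changes under it (assumed format)
OWNERS = {
    "training/": {"alice", "bob"},
    "product/": {"carol"},
}


def owners_for(path: str) -> set[str]:
    """Collect owners of every prefix that covers this path."""
    result = set()
    for prefix, people in OWNERS.items():
        if PurePosixPath(path).is_relative_to(prefix):
            result |= people
    return result


def can_merge_memory_edit(path: str, approvers: set[str]) -> bool:
    """An edit to memory about `path` needs at least one owner's approval."""
    return bool(owners_for(path) & approvers)


assert can_merge_memory_edit("training/envs/jobs.py", {"bob"})
assert not can_merge_memory_edit("product/ui/app.py", {"bob"})
```

As ownership files get updated when people move teams, the same update naturally re-scopes who can curate the memory about that part of the code.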
Okay, if we zoom out and place Reflection overall in context a little bit and talk about
the larger environment.
Sounds good, yeah.
You know, coding as a root problem in this era of AI research is a somewhat commonly held
belief, right?
I think a criticism of companies that went after pre-training focused on coding was that in
reality you needed language, you needed a lot
of the capabilities, who can say exactly which, but the reasoning capabilities that could be elicited
from large pre-trained models, to do code anyway, and so you had to do all of the work without the
general use. Is it specifically the availability of pre-trained models that are more capable and open source that made you feel like we can go after
superintelligent autonomous systems in coding without spending the pre-training dollars
upfront as a new lab? Or help me think about that logic a little bit more. I think that that's
roughly correct for, you know, why you can get into the game in the short term.
The bet that we made, starting a company a year and a half ago,
was that there would be pretty decent open-weight models out there. We
saw pre-training starting to more or less converge on a known paradigm:
there's a known big data set,
the internet.
Yes, there are gonna be some algorithmic innovations,
but you're basically extracting signal
from an extremely noisy data set.
And we felt like there's only so much signal
that one would be able to extract
without getting into just absurd dollars for scaling this
in terms of what you're trying to get out of it.
So what we thought would happen is that there'd be decent open-weight models. I think the quality of the open-weight
frontier has surprised me. The models are actually better than I thought they
would be. And we thought that you could just focus, you know. We're in this brief period in history right now where the RL flops are still manageable.
You can really have a best-in-class product if you're focused.
And yes, you still need a decent amount of GPUs.
But from a flop perspective, it's nowhere near where pre-training is. Like two orders of magnitude off.
Exactly.
Right.
So you can get into it and build out both the product and a research arm.
Our thought was that this was the time when you could actually start a generational frontier lab that does not need to be coupled to a big cloud provider.
Because if you do it right, you'll actually be able to generate sufficient revenues to
not have to be acquired or find some strange deal where the cloud provider kind of owns
you.
And that was kind of the model, I think, of what a lot of frontier labs looked like pre-LLMs.
I think we're already starting to see that, you know,
as more of a field-wide thing,
independently of Reflection, right?
When you look at how fast Anthropic's revenue is growing,
they're kind of in this spot where
it's a massive revenue-generating business
that's growing at an unprecedented rate.
But that was very much the ethos:
that we can come in, we don't need to pre-train.
You can get by with two orders of magnitude less compute
and really get something out there that's really good.
I think that, roughly speaking,
you won't need the amount of compute
that a frontier lab needs today if you're focused, but you'll
still need only about an order of magnitude less. So I think the capitalization requirements
are still high. There's no way of avoiding that. And asymptotically,
they're probably the same, but the idea is that at that point, you just have
a generational business
that can raise capital off of that.
I guess part of my read at this point in time is, and maybe it was always true, but especially
now, that your actual edge is in your capabilities for understanding what evals to go after and how
to design reward models.
There's perhaps less understanding and more dispersion in the field in post-training strategies versus,
as you said, more maturity in pre-training right now.
If it was a simple question of scaling RL on language models, people would be doing
it more aggressively right now.
Actually maybe that's a good question for you.
How would you describe the challenges in solving scaling here?
Why are we only able as a field to put a much smaller amount of compute to work here and
still get best-in-class results, versus the pre-training scale of GPUs right now?
I'd say that there are two categories that one would think things fall into. One is more around
the limitations of the problem structure, and the other one is, well, maybe the structure
is fine, but you need algorithmic advances to really drive the next frontier forward.
I'd say it's some mixture of both, but the biggest weight I put is on the
problem structure. So the thing that I led for Gemini
was reward models. I built out the reward models that were used to post-train Gemini 1 and 1.5.
And my thought was that if you have a reward that accurately describes the outcome of
any arbitrary task that you throw at it, then that's it.
At that point, it's just algorithmic advances,
but even the very simple RL methods we have today
will be able to get a lot out of this.
They'll only be bound by their exploration abilities.
So that's the only thing, right?
But the thing is, today we certainly are not in this world
where we have clean rewards for every task we could imagine. And so as a field we have to make various shortcuts
and compromises. So you'll have things like LLM-as-judge with different rubrics,
and that works to some extent, but a noisy or stochastic reward inevitably
gets hacked. So you need a lot of these, and there's only so much you can extract out of them. Then
you have sources that do have ground-truth rewards, but there are not many of them. And
so you have to hope that by optimizing against those, you'll get some generalization effects.
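A toy sketch of those two reward sources side by side: a verifiable, ground-truth-ish check where one exists (unit tests), and a rubric-scored LLM judge elsewhere. The `llm_judge` stand-in and the all-or-nothing weighting are assumptions for illustration, not anyone's production reward stack.

```python
# Hypothetical sketch of the reward situation described above: use a
# ground-truth verifier (e.g. unit tests) when the task has one, otherwise
# fall back to a noisier rubric-based LLM judge. Illustrative only.
import subprocess
from typing import Optional


def llm_judge(task: str, output: str, rubric: str) -> float:
    raise NotImplementedError("stand-in for a rubric-scoring judge; returns 0..1")


def tests_pass(repo_dir: str) -> bool:
    """Ground-truth-ish reward: did the candidate change make the tests pass?"""
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
    return result.returncode == 0


def reward(task: str, output: str, repo_dir: Optional[str], rubric: str) -> float:
    if repo_dir is not None:          # verifiable task: trust the tests
        return 1.0 if tests_pass(repo_dir) else 0.0
    # non-verifiable task: noisy judge score, which can be reward-hacked
    return llm_judge(task, output, rubric)
```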
And so I think the fundamental problem is the reward problem. You can either go in and say, all I'm going to focus on is rewards.
Or you can say, I'm going to take things as they are and just be more creative in the
methods that leverage the rewards that exist today.
Examples of that: basically every synthetic generation pipeline is some example
of this.
So it's a messy problem, but I think fundamentally we're in a reward-bound
world.
I don't think there's going to be any breakthrough where all of a sudden we go from
not having rewards for everything to having them, because the reward problem itself is,
at the time I thought, AGI-complete.
Now I'd say it's ASI-complete: by the time you have a neural network that can accurately verify any outcome, that is probably a superintelligence. And so then it goes back, again, to evaluations.
What are you training your reward models on? What are you evaluating against? What are the tasks that you want it to be good at? So
that's kind of how I think about it. I think it's a fundamentally reward-model or rewards-bound field.
And then there's also the algorithmic-progress side: the RL methods we have today are quite bad, I would say, at exploration and credit assignment.
The fundamental algorithms are:
take the things that work and make them happen
more frequently, and the things that don't work,
make them happen less frequently.
But they don't discern at all, along your reasoning chain,
which part of the reasoning was correct
and which part was incorrect.
And so that's why you get these reasoning chains
that are kind of garden-path, meandering.
They'll explore all sorts of things that are completely unnecessary and don't look
at all like the kind of structured thinking that a person would have. That's how the algorithm works.
There's no credit assignment step at any atomic level.
And so that I would say falls into more algorithmic progress bottlenecks.
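A schematic of that credit-assignment point: in a plain REINFORCE-style update, one scalar reward scales the gradient of every token in the sampled chain, so no individual reasoning step gets its own credit or blame. This is a generic textbook sketch in PyTorch-like code, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; it is not any lab's training recipe.

```python
# Schematic REINFORCE-style update: the whole sampled reasoning chain shares one
# scalar reward, so there is no per-step credit assignment. Generic illustration.
import torch


def reinforce_step(model, optimizer, prompt_ids, sampled_ids, reward, baseline=0.0):
    """One policy-gradient step on a sampled completion with a scalar reward."""
    full = torch.cat([prompt_ids, sampled_ids]).unsqueeze(0)
    logits = model(full).logits[0]                      # next-token logits per position
    # logits that predicted each sampled token
    completion_logits = logits[len(prompt_ids) - 1 : -1]
    log_probs = torch.log_softmax(completion_logits, dim=-1)
    chosen = log_probs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)

    advantage = reward - baseline            # one number for the whole chain
    loss = -(advantage * chosen.sum())       # every token scaled identically
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Methods that assign value per step (e.g. learned critics or process reward models) are attempts at exactly the finer-grained credit assignment described above.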
Can I ask you for a few, like hot takes quickly?
Yeah, let's go for it. What do you think of all of
these efforts, either in house with, you know, labs and vendors
or young companies just creating software environments that look
like popular software to train agents in? All right, copies of
Airbnb or Amazon or Salesforce or Excel? Personally, maybe the
take is not very hot. I'm very bullish on it because how else are
you going to... Maybe the hot take is that there's no such thing as generalization. There's
just bringing the test distribution into train.
Okay. That is an aggressive take. Wow. Yeah.
So as long as your train distribution looks something like what you actually want to evaluate
for, then users will experience it as generalization.
I think there is some generalization that happens in these models, but we probably,
as users, overestimate it because we don't actually see how they were made.
But then, yeah, if you saw, oh, the synthetic environment was actually very similar to the
thing I was asking about, then it would make sense why the model would be good at that.
Maybe six months ago, I think you said something like, I think it's possible we have my
definition of ASI in a couple of years. Do you still believe that's true?
I still do believe that's true. I think that where we will be a couple of years from
now is that there will be definitive superintelligence in some
meaningful categories of work. So for example, when I say coding, I don't mean all
of coding, but there will be a superintelligence within some slivers,
some meaningful slivers of coding, that is driving, I'd say, immense progress in the companies
that can benefit from that. And the reason
I would say the problem of ASI would have been solved by then is that
at that point, it's just a matter of operationalizing it. It just so happened
that these particular categories worked first; you might have a superintelligent
front-end developer because there's so much data distribution for that on the internet,
and it's easier to make synthetic data for that.
But at that point, you have the recipe, and it's just a matter of making economic decisions
of is it worth sinking in X amount of dollars to get the data in this category to get something
close to superintelligence there.
An example of that is what happened with reinforcement learning
before language models. Effectively, the blueprint for building superintelligent systems was
developed. It happened with the Atari games and AlphaGo; then OpenAI Five and AlphaStar were near-superintelligent
systems. And if OpenAI and DeepMind had sunk more compute into them, they would have definitely become superintelligent.
It's just that at that point, it didn't really make sense.
Economically, why would you do that?
Then this is a definitional issue.
Because I was going to ask, help me understand your view of,
one of the big criticisms of RL overall
has been lack of generalization.
That's been just a general question for this direction.
I do have friends at every large research lab,
and, I mean, tell me if you hear something
of a different tenor or just believe differently.
They believe we're going to have systems that
are much more capable than humans
in many types of knowledge work.
But they believe less in generalization.
And so in a resigned way, they're also, as you're saying, like, I guess we're just going
to bring all of it under distribution one way or another.
But that's a little bit different than my view, which is that at
some point you just have enough capability that the rest you
get for free, right?
The rest of sort of useful capability you get for free.
I think I kind of have a similar viewpoint
to the people you describe.
I think the generalization capabilities
of these things have been weaker.
First of all, it's all mind-blowing that this exists.
We went from a fundamental existential crisis
around generalization, because the feeling in reinforcement learning before language models was:
we have these systems that we can make amazing at very narrow tasks, and we have
absolutely no answer for generalization, like zero. And we went from that to things that, you know,
feel like they're generalizing. They're certainly generalizing much better than
anything we had before,
but it's likely because the training distributions are so broad.
So at least the way I think about it is more
kind of output as a user,
is the system super intelligent
in some meaningful categories of work?
And then from a research perspective,
is it obvious how to make it general
for anything that you might care about?
And at that point, again, it's just a matter of economics.
Maybe there are some categories where collecting the data is so expensive and
the return on investment is low, where it's effectively just better to have craftspeople
than superintelligent AIs.
So I think we're moving into this kind of jagged superintelligence where you have a handful of these superintelligences
for categories that matter, maybe subsumed into one model at some point, but at first
they'll probably be, again, I think there'll be a few companies that have kind of product
model coupling that is superintelligent in different categories.
I think an example of, again, starting to see the first glimpses of superintelligence, but in a way that hasn't really transferred
to anything meaningful yet is, well, we have these superintelligent test-takers now. The
AIME benchmark is completely saturated, and Codeforces and other competitive coding environments too.
The models are almost the best in the world,
and within the year, probably just the best in the world.
And yet, so we have the best competitive coding agents.
Then you go into a company and you ask them,
have these things been helpful?
And they say-
It's uneven, yeah.
Yeah, right.
So in the parts of work that are really meaningful, where you would want to see these things driving
meaningful increases in GDP?
I think the only way you will see that is if you go into a company and there's
a universal understanding that, yeah, my engineers are, as a whole, double-digit percentage points
more productive, every single one of them.
Right.
That's the kind of thing where, if it starts happening across every field, then you'll see double-digit increases in GDP. So
the kind of benchmark-maxing, and it's a bit different than benchmark-maxing
used to be, because now you have benchmark-maxing that is at least weakly correlated
to customer outcomes, but it still looks very similar to taking a board game,
training an agent on it, getting a landmark result in superintelligence,
and then making a claim that, you know, superintelligence is solved. I think the reality is that
deployment of it is half the problem, which goes back to kind of evaluating on customer problems
and building product together with the models. So you must have seen the news of the Windsurf
non-acquisition, first the deal with OpenAI falling through, and then the non-acquisition into Google DeepMind. What do
you make of it? We're seeing this verticalization basically happen across categories that are material to frontier intelligence.
And one could argue that the first verticalized category was actually search, like through
ChatGPT.
That's sort of a place where OpenAI verticalized first.
And coding has obviously emerged as another kind of frontier-level category that could,
like all these companies have aspirations of...
ASI.
Yeah, ASI.
And I think, you know, being basically trillion dollar companies or more, I don't think that
it's really the economics that are the driving factor, but it's more that if you want to
sustain frontier research, that's kind of what you have to become.
And so coding has clearly become one of these categories where verticalization is extremely
important.
And I think there are two sides of the story, one on the frontier lab side
and the other on the product side, like a startup that builds product but
does not have its intelligence in-house.
On the frontier lab side, I think this is exactly what Ioannis and I
noticed when we were working on Gemini: your model is so far away from the product that oftentimes, even having
the best model does not at all mean that you have the best product.
So there's a reason why basically startups are the places where adoption of coding tools
took off rather than the frontier labs.
And so there's a verticalization happening there and some are gonna do it successfully and some are not.
I think
we're already starting to see that, with Claude Code
really being an example of a successful verticalization.
I don't think it's guaranteed that a big lab can
buy their way to the end user,
because the fundamental problems of your research team being far away
from your product team will still be true and the company having a hundred different focus areas
will still be true. So I don't think that acquiring an asset will change that fundamentally,
but it does underscore the importance of verticalization. And then from the startup side, I think it actually puts companies that are in these
kind of critical path categories like search and coding in a pretty existential place if
they can't build their own frontier models.
Not all frontier labs will be able to verticalize correctly, but some will, maybe one will, and that's going to be enough, I think, to kind of take the thunder out from a company that's built a great user experience on top of someone else's model.
And I think some of those dynamics are probably starting to play out as well. There are some question marks around if you're on this critical path category and you don't
have your own intelligence, how do you compete when your competitor can just basically subsidize
their product a lot more than you can?
Because as a startup that's building on top of these things, to grow quickly
you're effectively subsidizing the margin that an Anthropic or Gemini or whatever is making.
Google and Anthropic and OpenAI can subsidize their products a lot more than you can.
I think that companies that don't own their intelligence or are not deeply integrated
into a customer in some way that makes them hard to remove find themselves in this pretty
existential place as it becomes clear to the frontier labs that this is a
category they need to verticalize around.
I work with a few robotics companies and so much of my lens on RL comes from that.
And I think it is like far less clear in robotics that RL will be a dominant part of the training versus imitation learning.
You'll actually appreciate this, on imitation from humans using tools, right?
Because we run this, I'm going to describe this idea that is nuts, but I think it's just
funny.
We run this grant program twice a year for amazing people using ML in different fields, it's called Embed, and one of the ideas
I had as a joke recently was: well, you just record everything, right?
Not obviously just the code base, but your Slack and all your
documentation and all your conversations, because you are a software engineering team. And I'm 100% sure that I can take that data set, if you ship something into
production to an end customer that has real issues at any scale, and sell it to a friend who's a
researcher at a lab working on this stuff. And so you have some floor value that is millions of
dollars for your couple-person company, and as a bonus, maybe the software company works.
Right?
Obviously this is like very noisy and I'm mostly joking, but I'm curious how you think
about exploring non-RL data sets that could be useful to you here.
If that company existed, right?
We would definitely pay for their data.
There we go.
So it's not an idiotic idea.
Yeah. Yeah. Especially if there's diversity. I think that'd be...
I can sell the whole set.
So is the question around how do you leverage alternative sources of data?
Yeah. The question is, and I don't want to over-analogize to robotics,
right? But within robotics, you have learning from world models, you have learning from
sim, you have learning from embodied data of different types, right? Imitation, then
you have RL. I think it's much less clear that you can use RL for a lot of robotics
today, especially some of the harder manipulation problems. And I'm curious, just
given that your team has this enormous strength in RL as a starting premise,
how you look at other types of data to create the coding agent experiences you want.
So I was actually a robotics researcher
in reinforcement learning.
Pieter Abbeel's lab is a robotics lab.
And it was a mixture.
Pieter's lab was always oriented around the intelligence problem,
with robotics being a domain where you study it.
And one of the reasons I came to lead reward models for Gemini
was because that's the question I was
studying with robotics. We had these RL algorithms for getting robots to do some very narrow
tasks, like moving blocks, and various kinds of narrow tasks in simulation. And the question
was, well, how do we get generalized manipulators, and how do we build
this into one system? And it seemed like the rewards were the bottleneck. So a lot of what I
was studying before starting, you know, getting into language models was how do we design reward
functions or models for robotics or, you know, for 3D video games like Minecraft or something like this that have,
I think, similar challenges scientifically. The challenge is that if you think that language
model rewards are hackable, vision language model rewards or other sensory signal rewards
are infinitely more hackable. They're much more short-lived as rewards.
You can think of language as just a compressed representation of the world that we have that
we kind of magically have to start with.
Whereas, if you're processing pixels or a sensory motor signal, this is raw signal that
has a lot more noise in it.
And so, if you train a neural network that is sort of trying to detect whether
this thing was manipulated correctly or this thing was moved correctly, then that thing
is just infinitely more hackable than anything you have in language models. So the same problems
blow up and become much larger. And so that's actually why I changed to language
models, because I felt that this was a fundamental
problem, but in robotics you also have the confounding factors of these noisy signals coming in.
I think that, at least in a generalizable way, that's why it's really hard to get reinforcement
learning to work with robotics.
The one place where it really does work well is when you have a clean reward signal, which
happens to be in these locomotion-like scenarios.
So there's a lot of work on building
very robust sim-to-real locomotion pipelines.
And it's because it's kind of, locomotion is just your body.
You don't have to manipulate the world around you.
And so you can actually build reward signals that are like,
oh, your quadruped is moving at this velocity
without damaging its body, that kind of thing. Maybe it's a bit of a roundabout answer to the question, but I think
these two fields are very different in the data distributions that they support. The
imitation learning data for language models is of course the internet,
all this data people have gathered on how we write and so forth. And so aside from that, when
we're generating synthetic data, the only scalable path is really reinforcement learning.
The other thing that I'll say here is that when you're collecting data for robotics,
you can do it in this tele-op way. The things that we are trying to train robots to do are very intuitive for humans as well.
I mean, actually more intuitive for humans, right?
People are master manipulators.
So you can have a lot of tele-op data collection.
For the things that we want language models to do, it's really hard to collect data of
the chain-of-thought process that goes on in a human's head when they're trying
to solve some task. And that's kind of the data that you need. And so for that reason,
I think language models favor this more synthetic-data, RL-like approach where, well,
it's easier for us to verify whether the thing was done or not than it is to actually generate all that data from a person specifically.
Maybe we just need a neural interface to get the chain of thought.
Yeah, maybe.
I mean, actually, when Ioannis and I were starting the company, we were thinking
about, well, what if
we just somehow had people speak into a microphone as they're doing tasks in order to capture that?
Just stream it.
Yeah.
And it seemed, you know, logistically very hard to pull off.
Okay.
One final question about sort of Reflection's path from here.
At what point do you, this is a decision you get to make in the future, but at what point
do you try to look at other problems beyond engineering and coding?
Do you feel like there's a level of sufficient depth where you should just go attack different
domains?
The thing that makes coding as a category special is that it's not synonymous with
software engineering.
It's just kind of how we think about the market today. The reason code is special is if you believe that the way a language model will interact
with almost any piece of software is through function calls and therefore code, then if
you build very capable reasoners, coding reasoners that are purpose-built for organizations,
so you've solved the long-context, how-do-I-reason-over-a-bunch-of-disparate-sources-of-information problem, and the system can act on pieces of software through code,
then you've built a system, the technology, that will generalize, at least
operationally across other categories of work. And so the way I think about it is more first,
just build, not trying to get too ahead of yourself,
but just first build the most depth-wise comprehension system for software engineers.
This will naturally induce more reliable coding agents. You can plug that in as an MCP to your favorite IDE or coding agent or use one of our own.
You can plug that into whatever surface area makes sense for the customer and then naturally
start seeing where you're getting pulled from there.
The reason I think this will work is because this is what we're already seeing, right? Like, how do you make the system useful
for product managers or technical support people?
And then, you know, I think moving on to things like sales
or something like this, but there are already places
where, you know, customers are pulling us
in different directions.
It's just kind of a matter of whether you engage
on that today or not.
And I think that
the risk that a startup has is that you see a lot of shiny areas where you can go, and you start going diffuse before you've really nailed a category. So I think it's really important to be
focused and not diffuse in the short term. And if you build the right thing,
what we think about as a contextual
core for an organization, in this case, an engineering organization, then you can naturally
start expanding that into adjacent areas of work in that enterprise.
Okay, last question, Misha.
Where would you characterize us as like being on the path toward deployment of these capabilities
in different fields?
I think we're a lot earlier than most people think.
That this is going to be one of those areas where the technological building blocks outpace
their deployment.
And so, yeah, within the next couple of years, the blueprint roughly for how to build ASIs
will have been set more or less. Maybe there are still some efficiency
breakthroughs that need to happen, but more or less there'll be a blueprint for how do you build
a superintelligence in a particular category. But actually going in and deploying it and building
it for specific categories of work, there is going to be a lot of product and research innovation
specific to those categories, and that will probably make this a
multi-decade thing. So I don't think that it's a couple of years from now and GDP starts growing
10 percent, you know, year over year globally. I think we're actually going to get there,
but it's going to be a kind of multi-decade endeavor. I tend to kind of see a lot of patterns now in
real-world deployment that mirror how reinforcement learning research
worked before large language models. Before large language models,
it used to be that you pick an environment, like you pick Go, you pick
Starcraft, you pick something else and you go and try to solve it with some combination
of imitation learning and reinforcement learning.
And when you look at all those projects, these were basically what were called strike teams
within DeepMind.
And each strike team, within and outside of DeepMind, was a bit of a snowflake.
The reinforcement learning methods and environment setup for Go were, at a high level,
conceptually similar, but at the detailed implementation level very different from
StarCraft, very different from Dota.
And so I think that that's sort of, we're going into every big category having a different
environment, right?
And different kinds of agents with different tools.
And that means that you'll need to, you'll have like general base models that you can
start with, but you'll need to post train things in specific ways for those categories.
And we're starting to see that already in the sense that the model that powers OpenAI's
Codex is not the o-series of models.
It's a model called Codex, which was post-trained for that environment.
The deep research models, that's a specific environment too; they're also post-trained for that environment. And I think we'll basically see more
and more that any category that has a sufficiently large business around it, that requires an
intelligence core to power it, there will be all sorts of interesting design decisions at the
research and product level of how do you actually gain the most performance out of this particular category. So I think we'll kind of see a lot more kind
of depth first players emerge over the coming decade or so.
I'm making a bet on it. And I also think, to your point about choosing
the problem for the era: at Conviction, we don't get to choose a problem for a hundred years.
We do get to choose for like this decade or so, right?
And, you know, if you actually believe
it's gonna be a very long-term endeavor
to get to the sort of productivity
and abundance you described,
but we are going to get there,
then, you know, the other thing you think about
is the path to supporting the cost of bringing anything
under distribution during a particular period.
And so, we've already backed companies in some of these areas, but
let's say in life sciences or materials science, it is more expensive to collect the types
of data you might need.
And that might be a longer endeavor
or one that you have to figure out how to fund,
or in robotics.
And so I think it's a really interesting timing question
of any of these really big categories.
But I believe coding is this era.
I think coding is this era as well.
This one I think will take longer than people thought
as well, because again, in the enterprise there are organizational problems
that are just much different than the benchmarks we have today. But I think it will be one of the
faster ones. So I don't think that's a decade out. That's within the next, you know,
say, dozens of months kind of thing. So I think the next sort of generational companies
in coding are definitely being built today.
Well, congratulations on the release, Misha, thanks.
Yeah, thank you, Sarah.
Find us on Twitter at nopriorspod.
Subscribe to our YouTube channel
if you wanna see our faces.
Follow the show on Apple podcasts, Spotify,
or wherever you listen.
That way you
get a new episode every week and sign up for emails or find transcripts for every episode
at no-priors.com