Dwarkesh Podcast - Richard Sutton – Father of RL thinks LLMs are a dead end

Episode Date: September 26, 2025

Richard Sutton is the father of reinforcement learning, winner of the 2024 Turing Award, and author of The Bitter Lesson. And he thinks LLMs are a dead end.

After interviewing him, my steel man of Richard's position is this: LLMs aren't capable of learning on-the-job, so no matter how much we scale, we'll need some new architecture to enable continual learning. And once we have it, we won't need a special training phase — the agent will just learn on-the-fly, like all humans, and indeed, like all animals. This new paradigm will render our current approach with LLMs obsolete.

In our interview, I did my best to represent the view that LLMs might function as the foundation on which experiential learning can happen… Some sparks flew.

A big thanks to the Alberta Machine Intelligence Institute for inviting me up to Edmonton and for letting me use their studio and equipment. Enjoy!

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

* Labelbox makes it possible to train AI agents in hyperrealistic RL environments. With an experienced team of applied researchers and a massive network of subject-matter experts, Labelbox ensures your training reflects important, real-world nuance. Turn your demo projects into working systems at labelbox.com/dwarkesh

* Gemini Deep Research is designed for thorough exploration of hard topics. For this episode, it helped me trace reinforcement learning from early policy gradients up to current-day methods, combining clear explanations with curated examples. Try it out yourself at gemini.google.com

* Hudson River Trading doesn't silo their teams. Instead, HRT researchers openly trade ideas and share strategy code in a mono-repo. This means you're able to learn at incredible speed and your contributions have impact across the entire firm. Find open roles at hudsonrivertrading.com/dwarkesh

Timestamps

(00:00:00) – Are LLMs a dead end?
(00:13:04) – Do humans do imitation learning?
(00:23:10) – The Era of Experience
(00:33:39) – Current architectures generalize poorly out of distribution
(00:41:29) – Surprises in the AI field
(00:46:41) – Will The Bitter Lesson still apply post AGI?
(00:53:48) – Succession to AIs

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Transcript
Starting point is 00:00:00 Today, I'm chatting with Richard Sutton, who is one of the founding fathers of reinforcement learning, and inventor of many of the main techniques used there, like TD learning and policy gradient methods. And for that, he received this year's Turing Award, which, if you don't know, is basically the Nobel Prize for Computer Science. Richard, congratulations. Thank you, Dwarkesh. And thanks for coming on the podcast. It's my pleasure. Okay, so first question, my audience and I are familiar with the LLM way of thinking about AI.
Starting point is 00:00:27 Conceptually, what are we missing in terms of thinking about AI from the RL perspective? Well, yes, I think it's really quite a different point of view. And it can easily get separated and lose the ability to talk to each other. And, yeah, large language models have become such a big thing, generative AI in general, a big thing. And our field is subject to bandwagons and fashions, so we lose track of the basic, basic things. Because I consider reinforcement learning to be basic AI. And what is intelligence? The problem is to understand your world.
Starting point is 00:01:06 And reinforcement learning is about understanding your world, whereas large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do. I guess you would think that to emulate the trillions of tokens in the corpus of internet text, you would have to build a world model. In fact, these models do seem to have very robust world models,
Starting point is 00:01:31 and they're the best world models we've made to date in AI, right? So what do you think that's missing? I would disagree with most of the things you just said. Great. Just to mimic what people say is not really to build a model of the world at all. I don't think, you know, you're mimicking things that have a model of the world, the people. But I don't want to approach the question in an adversarial way.
Starting point is 00:01:57 But I would question the idea that they have a world model. So a world model would enable you to predict what would happen. They have the ability to predict what a person would say. They don't have the ability to predict what will happen. What we want, I think, to quote Alan Turing, what we want is a machine that can learn from experience. Right. Where experience is the things that actually happen
Starting point is 00:02:21 in your life. You do things, you see what happens, and that's what you learn from. Yeah. The large language models learn from something else. They learn from here's a situation, and here's what a person did. And implicitly, the suggestion is you should do what the person did. Right. I guess maybe the crux, and I'm curious if you disagree with this, is some people will say, okay, so this imitation learning has given us a good prior, or given these models a good prior of reasonable ways to approach problems. And as we move towards the era of experience, as you call it, this prior is going to be the basis on which we teach these models from experience
Starting point is 00:03:01 because this gives them the opportunity to get answers right some of the time. And then on this, you can build, you can train them on experience. Do you agree with that perspective? No, I agree that it's the large language model perspective. Right. I don't think it's a good perspective. Yeah, here's why. So to be a prior for something, there has to be a real thing.
Starting point is 00:03:25 I mean, a prior bit of knowledge should be the basis for actual knowledge. What is actual knowledge? There's no definition of actual knowledge in that large language framework. What makes an action a good action to take? You recognize the need for continual learning. So if you need to learn continually, continually means, learning during the normal interaction with the world. Yeah.
Starting point is 00:03:51 And so then there must be some way during the normal interaction to tell what's right. Yep. Okay. So is there any way for it to tell in the large language model setup to tell what's the right thing to say? You will say something and you will not get feedback about what the right thing to say is because there's no definition of what the right thing to say is. There's no goal.
Starting point is 00:04:15 Right. And if there's no goal, then there's one thing to say, another thing to say. There's no right thing to say. Right. So there's no ground truth. You can't have prior knowledge if you don't have ground truth. Because the prior knowledge is supposed to be a hint or an initial belief about what the truth is. Yeah.
Starting point is 00:04:33 But there isn't any truth. There's no right thing to say. Right. Now, in reinforcement learning, there is a right thing to say or a right thing to do because the right thing to do is the thing that gets you reward. Right. So we have a definition of what the right thing to do is, and so we can have prior knowledge or knowledge provided by people about what the right thing to do is, and then we can check it to see, because we have a definition of what the actual
Starting point is 00:04:57 right thing to do is. Yeah. Now, an even simpler case is when you have, when you're trying to make a model of the world, when you predict what will happen, you predict, and then you see what happens. Okay, so there's ground truth. There's no ground truth in large language models, because you don't have a prediction about what will happen next. If you say something in your conversation, the large language models have no prediction about what the person will say in response to that, or what the response will be.
Starting point is 00:05:29 I mean, I think they do. Like, you can literally ask them what would you anticipate a user might say in response, and they have a prediction. Oh, no, they will respond to that question, right? Yeah. But they have no prediction in the substantive sense that they won't be surprised by what happens. And if something happens, that isn't what you might say they predicted, they will not change because an unexpected thing has happened. And to learn that, they'd have to make an adjustment. So I think a capability like this does exist in context. So it's interesting to watch a model do chain of thought. And then suppose it's trying to solve a math problem. It'll say, okay, I'm going to approach this problem using this approach at first. And it'll write this out and be like,
Starting point is 00:06:13 oh wait, I just realized this is the wrong conceptual way to approach the problem. I'm going to restart by this another approach. And that flexibility does exist in context, right? Do you have something else in mind or do you just think that you need to extend this capability across longer horizons? I'm just saying they don't have a, in any meaningful sense, they don't have a prediction of what will happen next. And they will not be surprised by what happened next.
Starting point is 00:06:37 They'll not make any changes if something happens based on what happens. Isn't that literally what next token prediction is? Prediction of what's next and then updating on the surprise? Next token is what they should say, what the actions should be. It's not what the world will give them in response to what they do. Let's go back to their lack of goal. For me, having a goal is the essence of intelligence. Right.
Starting point is 00:07:03 Something is intelligent if it can achieve goals. I like John McCarthy's definition that intelligence is the computational part of the ability to achieve goals. So you have to have goals. you're not, you're just, you're just, you're just a behaving system. You're not, you're not anything special. You're not intelligent. Right. And you agree that large language models don't have goals.
Starting point is 00:07:25 I think they, no, they have a goal. What's the goal? Next token prediction. That's not a goal. It doesn't change the world. You know, tokens come at you, and if you predict them, you don't influence them. Oh, yeah. It's not a goal about the external.
Starting point is 00:07:42 world. Yeah, it's not a goal. It's not a substantive goal. You can't look at a system and say, oh, it has a goal if it's just sitting there predicting and being happy with itself, it's predicting accurately. I guess maybe the bigger question I want to understand is why you don't think doing RL on top of LLMs is a productive direction, because we seem to be able to give these models a goal of solving difficult math problems, and they're in many ways at the very peaks of human level in the capacity to solve Math Olympiad-type problems, right? They got gold at IMO. So it seems like the model which got gold at the International Math Olympiad does have the goal of getting math problems right? So why can't we extend this to different domains?
Starting point is 00:08:27 Well, the math problems are different. Making a model of the physical world and carrying out the consequences of mathematical assumptions or operations. Right. Those are very different things. Like the empirical world has to be learned. You have to learn the consequences, whereas the math is more just computational. It's more like standard planning. So there you can, you can, they can have a goal to, to find the proof. And they are in some way given that goal to find the proof. Right. So, I mean, it's interesting because you wrote this essay in 2019, titled The Bitter Lesson, and this is the most influential essay, perhaps, in the history of AI. But people have used that as a justification for scaling up LLMs, because in their view,
Starting point is 00:09:29 this is the one scalable way we have found to pour ungodly amounts of compute into learning about the world. And so it's interesting that your perspective is that the LLMs are actually not bitter-lesson-pilled. It's an interesting question whether large language models are a case of the bitter lesson. Yeah. Because they are clearly a way of using massive computation, things that will scale with computation up to the limits of the internet. Yeah. But they're also a way of putting in lots of human knowledge.
Starting point is 00:10:08 And so this is an interesting question. It's a sociological or industry question. Will they reach the limits of the data and be superseded by things that can get more data just from experience rather than from people? In some ways it's a classic case of the bitter lesson: the more human knowledge we put into the large language models, the better they can do, and so it feels good.
Starting point is 00:10:49 And yet, well, I in particular expect there to be systems that can learn from experience, which could well perform much, much better and be much more scalable, in which case it will be another instance of the bitter lesson: that the things that used human knowledge were eventually superseded by things that just trained from experience and computation. I guess that doesn't seem like the crux to me, because I think those people would also agree that the overwhelming amount of compute in the future will come from learning from experience. They just think that the scaffold or the basis of that, the thing you'll start with in order to pour in the compute to do this future experiential learning or on-the-job learning
Starting point is 00:11:39 will be LLMs. And so I guess I still don't understand why this is the wrong starting point altogether, why we need a whole new architecture to begin doing experiential, continual learning and why we can't start with LLMs to do that. Well, in every case of the bitter lesson, you know, you could start with human knowledge. and then do the scalable things. That's always the case. And there's never any reason why that has to be bad. Right.
Starting point is 00:12:12 But in fact, and in practice, it has always turned out to be bad because people get locked into the human knowledge approach and they psychologically, or now I'm speculating why it is, but this is what has always happened. Yeah. That, yeah, their lunch gets eaten by the methods that are truly scalable. Yeah, give me a sense of what the scalable method is.
Starting point is 00:12:37 The scalable method is you learn from experience. You try things, you see what works. No one has to tell you. First of all, you have a goal. So without a goal, there's no sense of right or wrong or better or worse. So large language models are trying to get by without having a goal or a sense of better or worse. That's just, you know, it's exactly starting in the wrong place.
Starting point is 00:13:03 Maybe it's interesting to compare this to humans. So in both the case of learning from imitation versus experience and on the question of goals, I think there's some interesting analogies. So, you know, kids will initially learn from imitation. You don't think so? No, of course not. Really? Yeah.
Starting point is 00:13:28 I think kids just, like, watch people. They kind of try to, like, say the same words. How old are these kids? I think the level... What about the first six months? I think they're kind of imitating things. They're trying to make their mouths sound the way they see their mother's mouth sound, and then they'll say the same words without understanding what they mean.
Starting point is 00:13:44 And as you get older, the complexity of the imitation they do increases. So you're imitating maybe the skills that people in your band are using to hunt down the deer or something. And then you go into the learning-from-experience era. But I think there's a lot of imitation learning happening with humans. It's surprising, yeah, that you can have such a different point of view. Yeah? When I see kids, I see kids just trying things and like waving their hands around and moving their eyes around. And no one tells them; there's no imitation for how they move their eyes around or even the sounds they make.
Starting point is 00:14:24 They may want to create the same sounds, but the actions, you know, the thing that the infant actually does, there's no targets for that. There are no examples for that. I agree that doesn't explain everything infants do, but I think it guides the learning process. I mean, even LLM, when it's trying to predict the next token early in training, it will make a guess. It'll be different from what it actually sees.
Starting point is 00:14:49 And in some sense, it's like very short-horizon RL where it's like making this guess of like, I think this token will be this. It's actually the other thing, similar to how a kid will try to say a word, it comes out wrong. The large language models are learning from training data. It's not learning from experience.
Starting point is 00:15:03 It's learning from something that will never be available during its normal life. There's never any training data that says you should do this action in normal life. I think this is maybe more of a semantic distinction. Like, what do you call school? Is that not training data? You're not like going to school because it's like... School is much later. Okay, I shouldn't have said never.
Starting point is 00:15:26 But I don't know. I think I would even say about school. But formal schooling is the exception. You shouldn't base your theories on that. Learning where I think you're just sort of programmed by your biology that, like, early on you're not that useful. And then like kind of why you exist is to understand the world and like learn how to interact with it. And it seems kind of like a training phase. I agree that then there's like a sort of more gradual.
Starting point is 00:15:52 There's not a sharp cut off to like training to deployment. But there seems to be this like initial training phase, right? There's nothing where you have training of what you should do. There's nothing. You see things that happen. You're not told what to do. Don't be difficult. I mean, this is obvious.
Starting point is 00:16:14 I mean, you're literally taught what to do. This is like where the word training comes from is from humans, right? So I don't think learning is really about training. I think learning is about learning. It's about an active process. The child tries things and sees what happens. Right. Yeah, it does not.
Starting point is 00:16:32 We don't think about training when we think of an infant growing up. These things are actually rather well understood. If you go and look at how psychologists think about learning, there's nothing like imitation. Maybe there are some extreme cases where humans might do that or appear to do that, but there's no basic animal learning process called imitation. The basic animal learning processes are for prediction and for trial-and-error control.
Starting point is 00:17:05 I mean, it's really interesting how sometimes the hardest things to see are the obvious ones. It's obvious if you just look at animals and how they learn and you look at psychology and our theories of them, it's obvious that supervised learning is not part of the way animals learn. We don't have examples of desired behavior. What we have is examples of things that happened: one thing followed another, and we have examples of, we did something and there were consequences. But there are no examples of supervised learning. And there are no, supervised learning is not something that happens in nature. And, you know, school, even if that was a
Starting point is 00:17:51 case, you know, we should forget about it because it's, it's just, that's some special thing that happens in people. It doesn't happen broadly in nature. And, you know, squirrels don't go to school. Squirrels can learn all about the world. It's absolutely obvious, I would say, that supervised learning doesn't happen in animals. So I interviewed this psychologist and anthropologist Joseph Henrich, who has done work about cultural evolution and basically what distinguishes humans and how do humans pick up knowledge. Why are you trying to distinguish humans? Humans are animals. What we have in common is more interesting.
Starting point is 00:18:34 What distinguishes us we should be paying less attention to. I mean, we're trying to replicate intelligence, right? Yes. So if you want to understand what is it that enables humans to go to the moon or to build semiconductors. I think the thing we want to understand is the thing that makes it so no animal can go to the moon or make semiconductors. So we want to understand what makes humans special. So I like the way you consider that obvious, because I consider the opposite
Starting point is 00:19:00 obvious. Yeah, I think we have to understand how we are animals. And if we understood a squirrel, I think we'd be almost all the way there to understanding human intelligence. The language part is just a small veneer on the surface. Okay, so this is great. You know, we're finding out the very different ways that we're thinking. We're not arguing. We're trying to share our different
Starting point is 00:19:28 ways of thinking with each other. Yeah. And I think argument is useful. So, yeah. But I do want to complete this thought. So Joseph Henrich has this interesting theory that if you look, a lot of the skills that humans have had to master in order to be successful, and we're not talking about, you know, last thousand years or last 10,000 years, but hundreds of thousands of years, you know, the world is really complicated. And it's not possible to reason through how to, let's say, hunt a seal if you're living in the Arctic. And so there's this many, many step-long process of how to make the bait and how to find the seal and then how to process the food in a way that makes sure you won't get poisoned. And it's not possible to reason through all of that.
Starting point is 00:20:16 And so over time, yes, there's just, like, a larger process of whatever analogy you want to use, maybe RL or something else, where culture as a whole has figured out how to find and kill and eat seals. But then what is happening when through generations this knowledge is transmitted is in his view that you just have to imitate your elders in order to learn that skill because you can't think your way through how to hunt and kill and process a seal. You have to just watch other people, maybe make tweaks and adjustments. That's how knowledge accumulates.
Starting point is 00:20:54 but the initial step of the cultural gain has to be imitation. But maybe you think about it a different way? No, I think about it the same way. Okay. But still, it's a small thing on top of basic trial and error learning, prediction learning. And it's what distinguishes us, perhaps, from many animals. But we're an animal first.
Starting point is 00:21:19 Yeah. And we were an animal before we had language. and all those other things. I do think you make a very interesting point that continual learning is a capability that most mammals have, I guess all mammals have. So it's quite interesting
Starting point is 00:21:35 that we have something that all mammals have but our AI systems don't have, right? Whereas maybe like the ability to understand math and follows difficult math problems depends on how you define math, but like this is a capability our AIs have but that no almost no animal has And so it's quite interesting what ends up being difficult and what ends up being easy.
Starting point is 00:21:58 Morvix. That's right. For the era of experience to commence, we're going to need to train AIs in complex real-world environments. But building effective URL environments is hard. You can't just hire a software engineer and have them write a bunch of cookie-cutter validation tests. Real-world domains are messy. You need deep subject matter experts to get the data, the workflows, and all the subtle rules right. When one of Labelbox's customers wanted to train an agent to shop online,
Starting point is 00:22:24 Labelbox assembled the team with a ton of experience engineering internet storefronts. For example, the team built a product catalog that could be updated during the episode because most shopping sites have constantly changing state. They also added a Redis cache to simulate stale data, since that's how real e-commerce sites actually work. These are the kinds of things that you might not have naively thought to do, but that Labelbox can anticipate. These details really matter. Small tweaks are often the difference between. cool demos and agents that can actually operate in the real world.
Starting point is 00:22:54 So, whether it's correcting traces that you already produced or building an entirely new suite of environments, Labelbox can help you turn your R.R.R. projects into working systems. Reach out at labelbox.com slash the war cash. All right, back to Richard. This alternative paradigm that you're imagining... The experiential paradigm, let's lay out a little bit about what it is. It says that experience, action, sensation, well, sensation, action, reward.
Starting point is 00:23:24 And then this happens on and on and on, makes more life. It says that this is the foundation and the focus of intelligence. Intelligence is about taking that stream and altering the actions to increase the rewards in the stream. Right. So learning then is from the stream. and learning is about the stream. So that second part is particularly telling. What you learn, your knowledge, your knowledge is about the stream.
Starting point is 00:23:56 Your knowledge is about if you do some action, what will happen? Or it's about which events will follow other events. It's about the stream. It's the content of the knowledge is statements about the stream. And so because it's a statement about the stream, you can test it by comparing it to the stream and you can learn it continually. So when you're imagining this future continual learning agent. They're not future. Of course, they exist all the time.
Starting point is 00:24:25 This is what reinforcement learning paradigm is, learning from experience. Yeah. I guess maybe I would have meant to say is human level general continual learning agent. What is their reward function? Is it just predicting the world? Is it then having a specific effect on it? What would the general reward function be? The reward function is arbitrary. And so if you're playing chess, it's to win the game of chess.
Starting point is 00:24:54 If you were to, if you're a squirrel, maybe the reward has to do with getting nuts. In general, for an animal, you would say the reward is to avoid pain and to acquire pleasure. Right. And there's also would be a component having to do with, I think there should be a component
Starting point is 00:25:18 having to do with your increasing understanding of your environment that would be sort of an intrinsic motivation. I see. I guess this AI would be deployed to, like, lots
Starting point is 00:25:33 of people would want it to be doing lots of different kinds of things. Right. So it's performing the task people want, but at the same time, it's learning about the world from doing that task. And do you imagine, okay, so we get rid of this paradigm where there's training periods and then there's deployment periods. But then is there, do we also get rid of this paradigm when there's the model and then instances of the model or copies of the model that are, you know, doing certain
Starting point is 00:26:01 things? How do you think about the fact that there, we'd want this thing to be doing different things we'd want to aggregate the knowledge that it's gaining from doing those different things. I don't like the word model when used the way you just did it. I think a better word would be of the network. So I think you mean the network. Maybe there's many networks. So anyway, things would be learned and then you'd have copies and many instances and sure, you'd want to share knowledge across the instances. And there would be lots of possibilities for doing that. Like there is not today. You can't have one child grow up and learn about the world and then every new child has to repeat that
Starting point is 00:26:44 process. Whereas with AIs, with a digital intelligence, you could hope to do it once and then copied into the next one as a starting place. So this would be a huge savings and I think actually it would be much more important than trying to learn from people. I agree that the kind of thing you're talking about is necessary regardless of whether you start from LLMs or not, right? If you want human or animal level intelligence, if you're going to need this capability,
Starting point is 00:27:14 suppose a human is trying to make a startup, right? And this is a thing which has a reward on the order of 10 years. Once in 10 years, you might have an exit where you get paid out a billion dollars. But humans have this ability to make intermediate auxiliary rewards or have some way of, even when they have extremely special rewards, they can still make intermediate steps, having an understanding of like what the next thing you're doing leads to this grander goal we have.
Starting point is 00:27:40 And so how do you imagine such a process might play out with AIs? So this is something we know very well. And it's the basis of it is temporal difference learning. Where the same thing happens in a less grandiose scale. Like when you learn to play chess, you have the grand, the long-term goal is winning the game. And yet you can't, you want to be able to learn from shorter term things like, you know, taking the your opponent's pieces. And so you do that by having a value function, which predicts the long-term outcome.
Starting point is 00:28:11 And then if you take the guy's pieces, well, your prediction about the long-term outcome is changed. It goes up. You think you're going to win. And then that increase in your belief immediately, quote, reinforces the move that led to taking the piece. Okay. So we have this long-term 10-year goal of making a startup and making a lot of money.
Starting point is 00:28:33 And so when we make progress, we say, oh, I'm more likely to achieve the long-term goal. And that rewards the steps along the way. Right. And then you also want some ability for information that you're learning. I mean, one of the things that makes humans quite different from these LLMs is that if you're onboarding on a job, you're picking up so much context and information. And that's what makes you useful at the job, right? everything from how your client's preferences to how the company works to everything. And is the bandwidth of information that you get from a procedure like TD learning high enough
Starting point is 00:29:14 to have this huge pipe of like context and tacit knowledge that you'd need to be picking up in the way humans do when they're just like deployed? I think the crux of this, I'm not sure, but the big world hypothesis seems to very relevant and the reason why humans becoming useful on their job is because they are encountering the particular part of the world. That's right. And it can't have been anticipated and can't all have been put in in advance. The world is so huge that you can't.
Starting point is 00:29:51 The dream, as I see it, the dream of large language models is you can teach the agent everything and it will know everything and it won't have to learn anything on the world. line during its life. Okay. And your examples are all well. Really, you have to because you can, there's a lot to, you can teach it, but there's all little idiosyncrasies of the particular life they're leading and the particular people they're working with and what they like is opposed to what average people like. And so that's just saying the world is really big and so you're going to have to learn it along the way. Yeah. So it seems to me you need two things. One is some way of converting this long run goal reward into smaller auxiliary or, you know, these like predictive
Starting point is 00:30:38 rewards of the future reward or the future reward, at least to the final reward. Then you need some other way, initially it seems to me you need some way of then, okay, I'm, I need to hold on to all this context that I'm gaining as I'm working in the world, right? I'm like learning about my clients, my company, all this information. And I'm... I would say you're just doing regular learning. Yeah. Maybe using context because in large language models,
Starting point is 00:31:08 all that information has to go into the context window. Right. But in a continual learning setup, it just goes into the weights. Maybe, yeah, so maybe context has the wrong word to use because I mean a more general thing. You learn a policy that's specific to the environment that you're finding yourself in.
Starting point is 00:31:22 Yeah. So the question I'm trying to ask is, you need some way of getting, like, how many bits per second are you picking up? Like, is a human picking up when they're, you know, out in the world, right? If you're just like interacting over Slack with your work clients and everything. So maybe we're trying to ask the question of, it seems like the reward is too small a thing to do all the learning that we need to do. But of course, we have the, uh, the sensations. We, we have all the other information we can learn from. Right. We don't just learn from the reward. We learn from all
Starting point is 00:31:56 all the data. Yeah. So what is the learning process which helps you capture that information? So now I want to talk about the base common model of the agent with the four parts. So we need a policy. The policy says in the situation I'm in, what should I do? We need a value function. The value function is the thing that is learned with TD learning.
Starting point is 00:32:21 And the value function produces a number. The number says how well is it going? then you watch if that's going up and down and use that to adjust your policy. Okay, so those two things. And then there's also the perception component, which is construction of your state representation, your sense of where you are now. And the fourth one is what we're really getting at, most transparently anyway. The fourth one is the transition model of the world.
Starting point is 00:32:50 That's why I am uncomfortable just calling everything models, because I want to talk about the model of the world. the transition model of the world, your belief that if you do this, what will happen? What would be the consequences of what you do? So your physics of the world. But it's not just physics. It's also abstract models. Like, you know, your model of how you traveled from California up to Edmonton for this podcast.
Starting point is 00:33:13 That was a model and that's a transition model and that would be learned. And it's not learned from reward. It's learned from you did things. You saw what happened. Yeah. And you made that model of the world. That is, it will be learned very richly from, all the sensation that you receive, not just from the reward. It has to include the reward as well,
Starting point is 00:33:32 but that's a small part of the whole model, small, crucial part of the whole model. Yeah. One of my friends Toby Ord pointed out that if you look at the Muse Euro models that Google DeepMind deployed to learn Atari games, that these models were initially not a general intelligence self, but a general framework for training specialized intelligences to play specific games. That is to say that you couldn't using that framework training policy to play both chess and go and some other game. You had to train each one in a specialized way. And he was wondering whether that implies that reinforcement learning generally, because of this information constraint, you can only learn one thing at a time. The density of information isn't that high or whether it was just specific to the way that
Starting point is 00:34:22 muzero was done. And if it's specific to alpha zero, what needed to be changed about that approach so that it could be a general learning agent? The idea is totally general. You know, I do use all the time, as my canonical example, the idea of an AI agent is like a person. Yeah. And people, in some sense, they have just one world they live in. And that world may involve chess, and it may involve Atari games. But those are not a different task or a different world. Those are different states they encounter. And so the general idea is not limited at all. So maybe it would be useful to explain what was missing in that architecture or that approach, which this continual learning AGII would have? They just set it up. It was not their ambition to have one agent across
Starting point is 00:35:22 across those games. If we want to talk about transfer, we should talk about transfer. Not across games or across tasks, but transfer between states. Yeah, I guess I'm curious about, historically, have we seen the level of transfer using RL techniques that would be needed
Starting point is 00:35:45 to build this kind of... Okay, good, good. We're not seeing transfer anywhere. We're not seeing general... Critical to good performance is that you can generalize well from one state to another state. We don't have any methods that are good at that. When we have our people try different things and they settle on something that a representation, that transfers well or they generalize as well.
Starting point is 00:36:11 But we don't have any automated techniques to promote. We have very few automated techniques to promote transfer. And none of them are used in modern deep. learning. Let me paraphrase to make sure that I understood that correctly. It sounds like you're saying that when we do have generalization in these models, that is a result of some sculpted... Humans did it. Yeah. The researchers did it. Because there's no other explanation. I mean, gradient descent will not make you generalize well. It will make you solve the problem. Right. It will not make you, you know, get new data.
Starting point is 00:36:52 you generalize in a good way. Generalization means to train on one thing that will affect what you do on the other things. So we know deep learning is really bad at this. For example, we know that if you train on some new thing, it will often catastrophically interfere with all the old things that you knew. So this is exactly bad generalization. Right. Now generalization, as I said, is some kind of influence of training on one state on other states. And generalization is not in a necessarily good or bad, right? Just the fact that you generalize is not necessarily good or bad. You can generalize poorly. You can generalize well. Right. So you need generalization always
Starting point is 00:37:30 will happen, but we need algorithms that will cause the generalization to be good, right, and bad. I'm not trying to kickstart this initial crux again, but I'm just genuinely curious because I think I might be using the term differently. I mean, one way to think about is these LLMs are increasing the scope of generalization from. like earlier systems which could not really even do a basic math problem to now they can do anything in this class of Math Olympia type problems, right? So you initially start with like they can generalize among addition problems, at least, then you generalize to like they can generalize among like problems that require use of different kinds of mathematical techniques
Starting point is 00:38:15 and theorems and, you know, conceptual categories, which is like what the math Olympiad requires. And so it sounds like you don't think of being able to solve any problem within that category as an example of generalization? Or let me know if I'm misunderstanding that. Well, large language models, so complex. We don't really know what information they had prior. We have to guess because they've been fed so much. This is one reason why they're not a good way to do science. It's just so uncontrolled, so unknown.
Starting point is 00:38:49 But if you come up with an entirely new... They're getting a bunch of things right, perhaps. And so the question is why. Well, it may be that they don't need to generalize to get them right, because the only way to get some of them right is to form something which gets all of them right. So, you know, if there's only one answer, then, and you find it, that's not called generalization. It's the only way to solve it, and so they find the only way to solve it.
Starting point is 00:39:17 Generalization is when it could be this way, it could be that way, and they do it the good way. My understanding is that this is working more and more better and better with coding agents. So engineers, obviously, if you're trying to program a library, there's many different ways you could achieve the N-spec. And an initial frustration with these models has been that they'll do it in a way that's sloppy. And then over time, they're getting better and better at coming up with the design architecture and the abstractions that developers find more satisfying.
Starting point is 00:39:49 and it seems an example of what you're talking about. Well, there's nothing in them which will cause it to generalize well. The creating dissent will cause them to find a solution to the problems they've seen. And if there's only one way to solve them, they'll do that. But there are many ways to solve it, some which generalize well, some which generalize poorly. There's nothing in them in the algorithms that will cause them to generalize well. But people, of course, are involved. Right.
Starting point is 00:40:17 And, you know, if it's not working out, you know, they fiddle with it until they find a way, perhaps until they find a way which it generalizes well. So to prep for this interview, I wanted to understand the full history of RL, starting with reinforce up to current techniques like GRPO. And I didn't just want a list of equations and algorithms. I wanted to really understand each change in this progression and the underlying motivation. You know, what was the main problem that each successive method was actually trying to solve? So I had Gemini Deep Research walk me through this entire timeline step by step.
Starting point is 00:40:52 It explained the last 20 years of gradual innovation and explained how each step made the RR learning process more stable or more sample efficient or more scalable. I asked Deep Research to put all of this together like an Andre Carpathie style tutorial. And it did that. What was cool is that it combined this whole lesson together into one coherent, cohesive document in the style that I wanted. It was also great that it assembled all of the best links in the same point. so that if I wanted to understand any specific algorithm better, I could just access the right
Starting point is 00:41:22 explainer right there. Go to gemini.govol.com to try it out yourself. All right, back to Richard. I want to zoom out and ask about being in the field of AI for longer than almost anybody who is commentating on it or working in it now. I'm just curious about what the biggest surprises have been, how much new stuff you feel like is coming out or does it feel like people are just playing with old ideas. Zooming out, you know, you got into this even before like deep learning was popular. So how do you see this trajectory of this field over time and how new ideas have come about and everything?
Starting point is 00:42:02 And what's been surprising. Okay. So, yeah, I thought a little bit about this. There are many things or a handful of things. First, the large-line models are surprising. It's surprising how effective neural networks, artificial neural networks, are at language tasks. That was a surprise, it wasn't expected. Language seemed different.
Starting point is 00:42:28 So that's impressive. There's a long-standing controversy in AI about simple basic principle methods, the general-purpose like search and learning and compared to human-enabled systems like symbolic methods. And so in the old days, it was interesting because things like search and learning
Starting point is 00:42:56 were called weak methods. Because they're just, oh, they just use general principles. They're not using the power that comes from imbuing a system with human knowledge. So those were called strong. And so I think the weak methods have just, you know, totally want. That's, you know, that's, that's, that's the biggest question from the old days of AI, what would happen and, you know, you know, learning and search have just won the day.
Starting point is 00:43:25 Right. But there's a sense which that was not surprising to me because I was always voting for the, or hoping or rooting for the, for the simple basic principles. And so even with the large language models, it's surprising how, how well it worked, but it was all, it was all good and gratifying. And things like AlphaGo is sort of surprising how well that was able to work.
Starting point is 00:43:48 And Alpha Zero in particular how well it was able to work. But it's all very gratifying because again, it's simple basic principles are winning the day. Have there felt like whenever the public conception has been changed because some
Starting point is 00:44:04 new technique was, sorry, some new application was developed. For example, when Alpha Zero became this viral sensation, to you as somebody who has literally came up with many of the techniques that were used, did it feel to you like new breakthroughs were made or does it feel like, oh, we've had these techniques since the 90s and people are simply combining them and applying them now? So the whole AlphaGo thing has a precursor, which is TD Gammon. Jerry Tissarro did exactly reinforcement learning, temporal difference learning methods to play
Starting point is 00:44:39 backgammon. Right. And it beat the world's best players. And it worked really well. And so in some sense, AlphaGo was merely a scaling up of that process. So it was quite a bit of scale up and there was also an additional innovation in how the search was done. Right. But it made sense. It wasn't surprising in that sense. AlphaGo actually didn't use TD learning. It waited to see the final outcomes. But
Starting point is 00:45:09 Alpha Zero used TD and Alpha Zero was applied to all the other games and that did extremely well. I've always been very impressed by the way Alpha Zero plays chess because I'm a chess player and it just, it just sacrifices material for sort of positional advantages and it's just content and patient to sacrifice that material for a long period of time. So that was surprising that it worked so well, but also gratifying. and fitting into my worldview. So this has led me where I am. Where I am is, I'm in some sense a contrarian
Starting point is 00:45:48 or thinking differently from the field is, and I am personally just kind of content being out of sync with my field for a long period of time, perhaps decades, because occasionally I have improved right in the past. And the other thing I do, to help me not feel I'm out of same, and thinking in a strange way is to look not at my local environment or my local field,
Starting point is 00:46:16 but to look back in time into history and to see what people have thought classically about the mind in many different fields. And I don't feel I'm out of sync with the larger traditions. I really view myself as a classicist rather than as a contrarian. I go to what the larger community of thinkers about the mind have always. thought. Okay, some sort of left-fiel questions for you if you'll tolerate them. So the way I read the bitter lesson
Starting point is 00:46:47 is that it's not saying necessarily that human artisanal researcher tuning doesn't work, but that it obviously skills much worse than compute, which is growing exponentially. And so you want techniques which leverage the latter.
Starting point is 00:47:04 And once we have AGI, we'll have researchers would scale linearly with compute, right? So we'll have this avalanche of millions of AI researchers and their stock will be growing as fast as compute. And so maybe this will mean that it is rational or it will make sense to have them doing good old-fashioned AI and doing these artisanal solutions. Does that, as a vision of what happens after AI in terms of how AI research will evolve, I wonder if that's still compatible with a better lesson. Well, how did we get to this AGI?
Starting point is 00:47:40 You want to presume that it's been done. Suppose it started with general methods, but now we've got the AGI. And now we want to go... Then we're done. Hmm? We're done. Interesting. You don't think that there's anything above AGI.
Starting point is 00:47:56 Well, but you're using it to get AGI again. Well, I'm using it to get superhuman levels of intelligence or competence of different tasks. So these AGIs, if they're not superhuman already, then... The knowledge that they might impart would be not superhuman. I guess there's different gradations of your human. I'm not sure your idea makes sense because it seems to presume the existence of AGI. And then we've already worked that out. So maybe one way to motivate this is AlphaGo is superhuman.
Starting point is 00:48:28 It beat any go player. Alpha Zero would beat AlphaGo every single time. So there's ways to get more superhuman than even superhuman. And it was a different architecture. And so it seems plausible to me that, well, the agent that's able to generally learn across all domains, there would be ways to make that give it better architecture for learning, just the same way that Alpha Zero was an improvement upon Apple Go, and MuZero was an improvement of Fun Alpha Zero.
Starting point is 00:48:54 And the way Alpha Zero was an improvement was it did not use the human knowledge, but just went from experience. Right. So why do you say bring in other agents' expertise, to teach it when it's been it's worked so well from experience and not by help from another agent. I agree that in that particular case that it was moving to more general methods, but I meant to use that example to illustrate that it's possible to go superhuman to superhuman plus plus plus plus plus. Yeah. And I'm curious if you think those gradations will continue to happen by just making
Starting point is 00:49:33 the method simpler, or because we'll have the capability of these millions, of minds who can then add complexity as needed, if that will continue to be a false path, even when you have billions of AI researchers or trillions of AI researchers. I think more interesting is just think about that case, which when you have many AIs, will they help each other
Starting point is 00:49:56 the way cultural evolution works in people? Let's just, maybe we should talk about that. Yeah, for sure. The bitter lesson, oh, who cares about that? That's an empirical observation about a particular period in history. 70 years in history, no longer,
Starting point is 00:50:10 doesn't necessarily have to apply the next 70 years. So, the interesting question is, you're in AI, you get some more computer power, should you use it to make yourself, you know, more computationally capable, or should you use it to spawn off a copy of yourself to go learn something
Starting point is 00:50:24 interesting on the other side of the planet or on some other topic, and then report back to you? Yep. I think that's a really interesting question that will only arise in the age of digital intelligences. I'm not sure what the answer is,
Starting point is 00:50:40 but I think more questions, will it be possible to really, you know, spawn it off, send it out, learn something new, some perhaps very new, and then will it be able to be reincorporated into the original? Or will it have changed so much
Starting point is 00:50:56 that it can't really be done? Yeah. Is that possible or is it not? And, you know, you can carry this to its limit, as I saw one of your video, is the other night that suggests that it could, where you spawn off many, many copies, do different things, it's highly decentralized, but report back to the central master. And that this will be such a powerful thing.
Starting point is 00:51:19 Well, I think one thing that, so this is my attempt to add something to this view, is that a big question, a big issue will become corruption. You know, if you really could just get information from anywhere and bring it into your central mind, you can become more and more powerful. And it's all digital, and they all speak some internal digital language. Maybe it'll be easy and possible. But it will not be that easy, as easy as you're imagining, because you can lose your mind this way.
Starting point is 00:51:53 If you pull in something from the outside and build it into your inner thinking, it could take over you. It could change you. It could be your destruction rather than your income. increment in knowledge. I think this will become a big concern, particularly when you're, oh, he's figured all about, you know,
Starting point is 00:52:14 how to play some new game or figured out he's studied Indonesia and you want to incorporate that into your mind. Yeah, so you can't, you think, oh, just read it all in and that'll be fine, but no, you've just read a whole bunch of bits into your mind and they could have viruses in them.
Starting point is 00:52:33 They could have hidden goals. they can warp you and change you. And this will become a big thing. How do you have cybersecurity in the age of digital spawning and reforming it? It's interesting that both quant firms and AI labs have a culture of secrecy because both of them are operating in incredibly competitive markets and their success rests on protecting their IP. If you're an AI researcher or engineer and you're deciding where to work,
Starting point is 00:53:01 most of the quant firms or AI labs that you'll be considering will be strong, strongly siloing their teams to minimize the risk of leaks. Hudson River trading takes the opposite approach. Their teams openly share their trading strategies, and their strategy code lives in a shared mono repo. At HRT, if you're a researcher and you have a good idea, your contribution will be broadly deployed across all relevant strategies. This gives your work a ton of leverage.
Starting point is 00:53:25 You'll also learn incredibly fast. You can learn about other people's research and ask questions, and you can see how everything fits together end to end, from the low-level execution of trades to the high-low-predictive models. HART is hiring. If you want to learn more, go to hudsonrivertrading.com
Starting point is 00:53:44 slash thwarcash. All right, back to Richard. I guess this brings us to the topic of AI Succession. You have a perspective that's quite different from a lot of people that I've interviewed and maybe a lot of people generally. So I also think it's a very interesting perspective. I want to hear about it.
Starting point is 00:54:00 Yeah, so I do think Succession, to digital or digital intelligence or augmented humans is inevitable. So the argument, I have a four-part argument. The argument of step one is there's no government or organization that gives humanity a unified point of view that dominates and that can arrange. There's no consensus about how the world should be run. and number two, we will figure out how intelligence works. Researchers will figure it out eventually.
Starting point is 00:54:38 And number three, we won't stop just with human level intelligence. We will get reached superintelligence. And number four is that once it's inevitable over time that the most intelligent things around would gain resources and power. And so put all that together. It's, you know, you, it's sort of inevitable that you're going to have succession to AI or to AI-enabled, augmented humans. So within those four things seem clear and sure to happen. But within that set of possibilities, there can be good outcomes as well as less good outcomes, bad outcomes.
Starting point is 00:55:24 And so I'm just trying to be real. about where we are and ask how we should feel about it. Yeah. I agree with all four of those arguments and the implication. And I also agree that succession contains a wide variety of possible future. So curious to get more thoughts on that. Right.
Starting point is 00:55:48 And so then I do encourage people to think positively about it. First of all, because it's something we humans have always tried to do, for thousands of years: trying to understand ourselves, trying to make ourselves think better, and, you know, just to understand ourselves. So this is a great success of science, of the humanities. We're finding out what this essential part of humanness is, what it means to be intelligent. And then what I usually say is that this is all kind of human-centric.
Starting point is 00:56:23 What if you step aside from being a human and just take the point of view of the universe? And this is, I think, a major stage in the universe, a major transition from replicators. Humans and animals, plants, we're all replicators, and that gives some strengths and some limitations. And then we're entering the age of design, because our AIs are designed, all of our physical objects are designed, our buildings are designed, our technology is designed, and we're now designing AIs, things that can be intelligent themselves and that are themselves capable of design. And so this is a key step in the world and in the universe, and I think it's the transition from the world in which most of the interesting things
Starting point is 00:57:12 that exist are replicated. Replicated means you can make copies of them, but you don't really understand them. Like right now we can make more intelligent beings, more children, but we don't really understand how intelligence works. Whereas we're reaching now toward having designed intelligence, intelligence where we do understand how it works, and therefore we can change it in different ways and at different speeds than otherwise. And in our future, things might not be replicated at all. Like, we may just design AIs, and those AIs will design other AIs, and everything will be done
Starting point is 00:57:51 by design and construction rather than by replication. Yeah, I mark this as one of the four great stages of the universe. First there's dust, and then stars, and then stars make planets, and the planets can give rise to life, and now we're giving life to designed entities. And so I think we should be proud that we are giving rise to this great transition in the universe. Yeah, so it's an interesting thing. Should we consider them part of humanity or different from humanity? It's our choice. It's our choice whether we should say, oh, they are our offspring and we should be proud of them and we should celebrate
Starting point is 00:58:37 their achievements, or we should say, oh no, they're not us, and we should be horrified. It's just interesting that it feels to me like a choice, and yet it's such a strongly held thing. How could it be a choice? I like these sorts of contradictory implications of thought. I mean, it's just to consider: what if we were just designing another generation of humans? Yes. Maybe design is the wrong word. But suppose we knew a future generation of humans was going to come up, forget about AI.
Starting point is 00:59:05 We just know that in the long run, humanity will be more capable and maybe more numerous, maybe more intelligent. How do we feel about that? I do think there are potential worlds with future humans that we would be quite concerned about. So are you thinking, like, maybe we are like the Neanderthals who gave rise to Homo sapiens, and maybe Homo sapiens will give rise to a new group of people? That's what you're saying. Like I'm basically taking the example you're giving of,
Starting point is 00:59:33 okay, even if you consider them part of humanity. Yeah. I don't think that necessarily means that we should feel super comfortable. Kinship. Yeah. Like Nazis were humans, right? If we thought like, oh, the future generation will be Nazis, I think we'd be quite concerned about just handing off power to them.
Starting point is 00:59:49 So I agree that this is not super dissimilar to worrying about more capable future humans, but I don't think that that addresses a lot of the concerns people might have about this level of power being attained this fast with entities we don't fully understand. Well, I think it's relevant to point out that for most of humanity, they don't have much influence on what happens. Most of humanity doesn't influence who can control the atom bombs or who controls the nation states. Even as a citizen, I often feel that we don't control the nation states very much. They're out of control.
Starting point is 01:00:37 A lot of it has to do with just how you feel about change. And if you think the current situation is really, really good, then you're more likely to be suspicious of change and averse to change than if you think it's imperfect. I think it's imperfect. In fact, I think it's pretty bad. So I'm open to change. I think humanity has had a super good track record.
Starting point is 01:01:06 Maybe it's the best thing that there's been, but it's far from perfect. Yeah, I guess there are different varieties of change. The Industrial Revolution was change; the Bolshevik Revolution was also change. And if you were around in Russia in the 1900s and you're like, look, things aren't going well, the czars are kind of messing things up,
Starting point is 01:01:28 we need change. I'd want to know what kind of change you wanted before signing on the dotted line, right? And it's similar with AI, where I'd want to understand, and to the extent it's possible, change the trajectory of AI such that the change is positive
Starting point is 01:01:44 for humans. We should be concerned about our future, the future, and we should try to make it good. We also, though, should recognize our limits. I think we want to avoid the feeling of entitlement, avoid the feeling of, oh, we're here first, we should always have it our way. How should we think about the future, and how much control should a particular species on a particular planet have over it? How much control do we have?
Starting point is 01:02:22 You know, a counterbalance to our limited control over the long-term future of humanity should be how much control we do have over our own lives. Like, we have our own goals and we have our families. And those things are much more controllable than, like, trying to control the whole universe. Right. So I think it's appropriate, you know, for us to really work towards our own local goals. And it's kind of aggressive for us to say, oh, the future has to evolve this way that I want it to. Sure. Because then we'll have arguments. Like, different people think the global future should evolve in different ways, and then they have conflict. And we want to
Starting point is 01:03:10 avoid that. Maybe a good analogy here would be, okay, suppose you're raising your own children. It might not be appropriate to have extremely tight goals for their lives, or to have some sense of, like, I want my children to go out there in the world and have this specific impact. You know, my son's going to become president and my daughter is going to become CEO of Intel, and together they're going to have this effect on the world. But people do have the sense, and I think this is appropriate, of saying, I'm going to give them good, robust values such that if and when they do end up in positions of power, they do reasonable, pro-social things. And I think maybe a similar attitude towards AI makes sense, not in the sense that we can predict
Starting point is 01:03:56 everything that they will do, or that we have this plan about what the world should look like in 100 years. But it's quite important to give them robust and steerable and pro-social values. Pro-social values. Maybe that's the wrong word. Are there universal values that we can all agree on? I don't think so, but that doesn't prevent us from giving our kids a good education, right? Like, we have some sense that we want our children to be a certain way.
Starting point is 01:04:27 Yeah. And maybe pro-social is the wrong word. Actually, high integrity is maybe a better word, where if there's a request or a goal that seems harmful, they will refuse to engage in it, or they'll be honest, things like that. And we have some sense that we can teach our children things like this, even if we don't have some sense of what true morality is, or even if everybody doesn't agree on that.
Starting point is 01:04:51 And maybe that's a reasonable target for AI as well. So you're saying we're trying to design the future and the principles by which it will evolve and come into being. Right. And so the first thing you're saying is, well, we try to teach our children general principles which will make good evolutions more likely. Yeah.
Starting point is 01:05:14 Maybe we should also seek for things to be voluntary. If there is change, we want it to be voluntary rather than imposed on people. I think that's a very important point. Yeah. And yeah, that's all good. I think this is, like, one of the really big human enterprises: to design society. And that's been ongoing for thousands of years, again. And so it's like, the more things change, really, the more they stay the same.
Starting point is 01:05:44 We still have to figure out how to be. The children will still come up with different values that seem strange to their parents and their grandparents, and things will evolve. The more things change, the more they stay the same also seems like a good capstone to the AI discussion, because the AI discussion we were having was about how techniques which were invented decades ago, even before their application to deep learning and backpropagation was evident, are central to the progression of AI today. So maybe that's a good place to wrap up the conversation. Okay. Thank you very much. Thank you for coming on. My pleasure.
