Dwarkesh Podcast - Sholto Douglas & Trenton Bricken — How LLMs actually think
Episode Date: March 28, 2024

Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast. No way to summarize it, except: this is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them. You would be shocked how much of what I know about this field, I've learned just from talking with them. To the extent that you've enjoyed my other AI interviews, now you know why.

So excited to put this out. Enjoy! I certainly did :)

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. There's a transcript with links to all the papers the boys were throwing down - may help you follow along. Follow Trenton and Sholto on Twitter.

Timestamps
(00:00:00) - Long contexts
(00:16:12) - Intelligence is just associations
(00:32:35) - Intelligence explosion & great researchers
(01:06:52) - Superposition & secret communication
(01:22:34) - Agents & true reasoning
(01:34:40) - How Sholto & Trenton got into AI research
(02:07:16) - Are feature spaces the wrong way to think about intelligence?
(02:21:12) - Will interp actually work on superhuman models
(02:45:05) - Sholto's technical challenge for the audience
(03:03:57) - Rapid fire

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript
Okay, today I have the pleasure to talk with two of my good friends, Sholto and Trenton.
Sholto.
You should have to make stuff.
I was going to say anything.
Let's do this in reverse.
How long I had started with my good friends?
Yeah, I didn't know I at one point caught the context like just wow.
Shit.
Anyways, Sholto,
Noam Brown.
Noam Brown, the guy who wrote the diplomacy paper,
he said this about Sholto.
He said he's only been in the field for 1.5 years,
but people in AI know that he was one of the most important people
behind Gemini's success.
And Trenton, who's at Anthropic,
works on mechanistic interpretability.
And it was widely reported that he has solved alignment.
So this will be a capabilities-only podcast, alignment is already solved,
so no need to discuss further.
Okay, so let's start by talking about context lengths.
Yep.
It seemed to be under-hyped, given how important it seems to me to be,
that you can just put a million tokens into context.
There's apparently some other news that got pushed to the front for some reason.
But, yeah, tell me about how you see the future of long context lengths and what that implies for these models.
Yeah.
So I think it's really underhyped because until I started working on it, I didn't really appreciate
how much of a step up in intelligence it was for the model to have the onboarding problem
basically instantly solved.
And you can see that a little bit in the perplexity graphs in the paper where just throwing
millions of tokens worth of context about a code base allows it to become dramatically
better at predicting the next token in a way that you'd normally associate with huge
increments in model scale.
But you don't need that.
All you need is like a new context.
So underhyped and buried by some other news.
In context, are they as sample efficient and smart as humans?
I think that's really worth exploring.
For example, one of the evals that we did in the paper has it learning a language in context
better than a human expert could learn that new language over the course of a couple months.
And this is only like a pretty small demonstration, but I'd be really interested to see things like Atari games or something like that
where you throw in a couple hundred or thousand frames, labeled actions,
and then in the same way that you'd, like, show your friend how to play a game
and see if it's able to reason through.
It might, at the moment, you know, with the infrastructure and stuff,
it's still a little bit slow, like, doing that.
But I would actually, I would guess that might just work out of the box
in a way that would be pretty mind-blowing.
And crucially, I think this language was esoteric enough that it wasn't in the training data.
Right, exactly.
Yeah, if you look at the model before it has that context thrown in,
it just doesn't know the language at all, and it can't get any translation.
And this is like an actual, like, human language.
It's not just...
Yeah, exactly, an actual human language.
So if this is true, it seems to me that these models are already, in an important sense, superhuman. Not in the sense that they're smarter than us, but I can't keep a million tokens in my context when I'm trying to solve a problem, remembering and integrating all the information in a code base.
Am I wrong in thinking this is like a huge unlock?
I actually generally think that's true.
Like previously, I've been frustrated when models aren't as smart.
Like, you ask them a question and you want it to be smarter than you or to know things that you don't.
And this allows them to know things that you don't in a way that it just ingests a huge amount of information in a way you just can't.
So, yeah, it's extremely important.
Well, how do we explain in context learning?
Yeah.
So there's a line of work I quite like where it looks at in-context learning as basically very similar to gradient descent, where the attention operation can be viewed as gradient descent on the in-context data. That paper had some cool plots where it basically showed: we take n steps of gradient descent, and that looks like n layers of in-context learning, and they look very similar. So I think that's one way of viewing it and trying to understand what's going on.
Yeah. And you can ignore what I'm about to say, because given the introduction, alignment is solved and AI safety isn't a problem. But I think the context stuff does get problematic, but also interesting here. I think there'll be more work coming out in the not-too-distant future around what happens if you give a hundred-shot prompt for jailbreaks, adversarial attacks. It's also interesting in the sense that if your model is doing gradient descent and learning on the fly, even if it's been trained to be harmless, you're dealing with a totally new model in a way. You're, like, fine-tuning it, but in a way where you can't control what's going on.
Can you explain what you mean by gradient descent happening in the forward pass and attention?
Yeah. There was something in the paper about trying to teach the model to do linear regression, but just through the number of samples they gave in the context. And you can see, if you plot on the x-axis the number of shots or examples it has, and then the loss it gets on ordinary least squares regression, that will go down with time. And it goes down exactly matched with the number of gradient descent steps.
Yeah, exactly.
Okay. I only read the intro and discussion section
of that paper, but in the discussion, the way they framed it is that, in order to get better at long-context tasks, the model has to get better at learning to learn from these examples, or from the context that is already within the window. And the implication of that is, if, like, meta-learning happens because the model has to learn how to get better at long-context tasks, then in some important sense the task of intelligence, like, requires long-context examples and long-context training. Like, you have to induce meta-learning.
Understanding how to better induce
meta learning in your pre-training process is a very
important thing to actually get flexible or adaptive
intelligence. Right, but you can proxy
for that just by getting better at doing
long context tasks.
One of the bottlenecks for
AI progress that many people identify is
the inability of these models to
perform tasks
on long horizons, which means
engaging with the task for many
hours or even many weeks
or months where like if I have
I don't know, an assistant or
employee or something, they can just do a thing I tell them for a while.
And AI agents haven't taken off for this reason, from what I understand.
So how linked are long context windows and the ability to perform well on them and the ability
to do these kinds of long horizon tasks that require you to engage with an assignment for
many hours?
Or are these unrelated concepts?
I mean, I would actually take issue with that being the reason that agents haven't taken
off, where I think that's more about nines of reliability and the model actually successfully
doing things.
And if you just can't chain tasks successively with high enough probability, then you won't get something that looks like an agent.
And that's why something like agents might follow more of a step function, visually. Like, GPT-4-class models, Gemini Ultra-class models are not enough. But maybe the next increment on model scale means that you get that extra nine. Even though the loss isn't going down that dramatically, that small amount of extra ability gives you the extra nine. And, like, yeah, obviously you need some amount of context to fit long-horizon tasks, but I don't think that's been the limiting factor up to this point.
Yeah.
The NeurIPS best paper this year, where Rylan Schaeffer was the lead author, points to this as, like, emergence being a mirage, where people will have a task and you get the right or wrong answer depending on whether you've sampled the last five tokens correctly.
And so naturally, you're multiplying the probability of sampling all of those.
And if you don't have enough nines for reliability, then you're not going to get emergence.
And all of a sudden you do.
and it's like, oh my gosh, this ability is emergent
when actually it was kind of almost there
to begin with.
And there are ways that you can find
like a smooth metric for that.
Yeah, HumanEval or whatever.
In the GPT-4 paper, for the coding problems, they measure the log pass rate, right?
Exactly.
Yeah.
For the audience, the context on this is
it's basically the idea is you want to,
when you're measuring how much progress there has been
on a specific task like solving coding problems,
you upweight it when it gets it right only one in a thousand times. You don't give it a one-in-a-thousand score, because it's like, oh, it got it right some of the time. And so the curve you see is: it gets it right one in a thousand times, then one in a hundred, then one in ten, and so forth. So actually, I want to
follow up on this. So if your claim is that the AI agents haven't taken off because of reliability
rather than long-horizon task performance, isn't the lack of reliability when a task is chained on top of another task, on top of another task? Isn't that exactly the difficulty with long-horizon tasks, that you have to do 10 things in a row or 100 things in a row, and the reliability of any one of them diminishes, or, yeah, the probability goes down from 99.99 to 99.9, and then the whole thing gets multiplied together and becomes much less likely to happen?
That is exactly the problem. But the key issue you're pointing out there is that your base per-task solve rate is 90%, and if it was 99%, then chaining doesn't become a problem. But also, yeah, exactly, and I think this is also something that just hasn't been properly studied enough.
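A quick back-of-the-envelope on that reliability point: if each sub-task succeeds independently with probability p, a chain of N sub-tasks succeeds with probability roughly p to the N, which is why an extra nine or two of per-step reliability makes such a difference.

```python
# Chance of completing an N-step task if each step succeeds independently
# with probability p: roughly p ** N.
for p in (0.90, 0.99, 0.999):
    for n in (10, 100):
        print(f"per-step {p}, {n} steps -> {p ** n:.4f}")
```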
If you look at all of the evals that are commonly used, like, the academic evals are a single problem, right?
You know, like the math problem.
It's like one typical math problem, or MMLU.
It's like one university level like problem from across different topics.
You're beginning to see evals looking at this properly with more complex tasks, like SWE-bench, where they take a whole bunch of GitHub issues, and that is a reasonably long-horizon task. But it's still, like, a multi-sub-hour task as opposed to a multi-hour or multi-day task.
And so I think one of the things that will be really important to do over the next, however long,
is understand better what does success rate over a long horizon task look like.
And I think that's even important to understand what the economic impact of these models might be
and actually properly judge increasing capabilities by cutting down the tasks that we do
and the inputs and outputs involved into minutes or hours or days and seeing how good it is
successively chaining and completing tasks of those different resolutions of time.
But then that tells you, like, how automatable a job family or task family is, in a way that, like, an MMLU score just doesn't.
I mean, it was less than a year ago that we introduced 100K context windows.
And I think everyone was pretty surprised by that.
So, yeah, everyone just kind of had this sound bite of quadratic attention costs.
Yeah.
We can't have long context windows.
Here we are.
So, yeah, like the benchmarks are being actively made.
Wait, wait.
So doesn't the fact that there's these companies, Google and, I don't know, Magic, maybe others,
who have million token attention imply that the quadri-
You shouldn't say anything because you're not,
but doesn't that like imply that it's not quadratic anymore,
or are they just eating the cost?
Well, like, who knows what Google is doing for its long context?
Yeah, I'm not saying it's either.
One of the things that frustrated me about, like,
the general research field's approach to attention
is that there's an important way in which the quadratic cost of attention
is actually dominated in typical dense transformers
by the MLP block.
Right? So you have this n-squared term that's associated with attention, but you also have a term that's quadratic in d_model, the residual stream dimension of the model. And if you look, I think Sasha Rush has a great tweet where he basically plots the curve of the cost of attention with respect to the cost of really large models, and attention actually trails off. You actually need to be doing pretty long context before that term becomes really
important. And the second thing is that people often talk about how attention at
inference time is such a huge cost, right? And if you think about when you're actually
generating tokens, the operation is not n square. It is one Q, like one set of Q vectors,
looks up a whole bunch of KV vectors. And that's linear with respect to the amount of like
context that the model has. And so I think this drives a lot of the like recurrence and state space
research where people have this meme of, oh, like, linear attention and all this stuff.
And as Trenton said, there's like a graveyard of ideas around attention.
And not to say I don't think it's worth exploring, but I think it's important to consider
where the actual strengths and weaknesses of it are.
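As a rough illustration of the point about the n-squared attention term being dominated by the d_model terms, here is a back-of-the-envelope per-layer FLOP count for a dense transformer. The constants are approximate and the d_model and sequence lengths are made-up round numbers; the point is the crossover: the attention n-squared term only dominates once the context is much longer than the residual stream dimension.

```python
# Rough per-layer FLOP sketch (approximate constants, illustrative sizes only).
def per_layer_flops(n, d):
    qkvo_proj = 8 * n * d * d     # Q, K, V, O projections: ~2*n*d^2 each
    attn_scores = 4 * n * n * d   # QK^T plus attention-weighted sum of V: ~2*n^2*d each
    mlp = 16 * n * d * d          # two matmuls with a 4x hidden expansion
    return qkvo_proj, attn_scores, mlp

d_model = 8192
for n in (2_048, 32_768, 1_000_000):
    proj, attn, mlp = per_layer_flops(n, d_model)
    share = attn / (proj + attn + mlp)
    print(f"context {n:>9,}: attention n^2 share of FLOPs = {share:.2f}")
```

And the separate inference-time point above is about a different quantity: when generating one new token against a cached context, the new query attends over the stored KV vectors, so that step is linear, not quadratic, in context length.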
Okay, so what do you make of this take?
As we move forward through the takeoff, more and more of the learning happens in the forward
pass.
So originally, like, all the learning happens in the backward, you know, during like this, like,
bottom-up sort of hill-climbing evolutionary process.
If you think in the limit during the intelligence explosion,
it's just like the AI is like maybe like handwriting the weights
or, like, doing GOFAI or something.
And we're in like the middle step where like a lot of learning happens in context now
with these models.
A lot of it happens within the backward pass.
Does this seem like a meaningful gradient along which progress is happening?
Like how much, because the broader thing being the,
if you're learning in the forward pass,
it's much more sample efficient because you can kind of, like, basically think as you're learning. Like when humans read a textbook, you're not just skimming it and trying to absorb, you know, which words inductively follow which words. You read it and you think about it, and then you read some more and you think about it. I don't know, does this seem like a sensible way to think about the progress?
Yeah, it may just be one of the ways in which, you know, birds and planes both fly, but they fly differently, and, like, the virtue of technology allows us to basically accomplish things that birds can't. It might be that context length is similar, in that it allows it to have a working memory that we can't. But functionally it's not, like, the key thing towards actual reasoning. The key step between GPT-2 and GPT-3 was that all of a sudden there was this meta-learning behavior that was observed in training.
Like in the pre-training of the model.
And that's as you said, like it's something to do with you give it some amount of context.
It's able to adapt to that context.
And that was a behavior that wasn't really observed before that at all.
And maybe that's a mixed property of context and scale and this kind of stuff. But it wouldn't have occurred in a model with a tiny context, I would say.
This is actually an interesting point.
So when we talk about scaling up these models,
how much of it comes from just making the models themselves bigger?
And how much comes from the fact that during any single call,
you are using more compute.
So if you think of diffusion, you can just iteratively keep adding more compute.
And if adaptive compute is solved, you can keep doing that.
And in this case, if there's a quadratic penalty for attention, but you're doing long context anyways, then you're still dumping in more compute during, not during training, not during having bigger models, but just like, yeah.
Yeah, it's interesting because you do get more forward passes by having more tokens.
Right.
My one gripe, I guess I have two gripes with this, though, maybe three.
So one, like, in the alpha paper, one of the transformer, one of the transformer modules, they have a few, and the architecture is, like, very intricate.
But they do, I think, five forward passes through it and will gradually, like, refine their solution as a result.
You can also kind of think of the residual stream.
I mean, Sholto alluded to kind of the read-write operations as like a poor man's adaptive compute, where it's like, I'm just going to give you all these layers.
And like, if you want to use them great, if you don't, then that's also fine.
And then people will be like, oh, well, the brain is recurrent and you can like do however many loops through it you want.
And I think to a certain extent that's right, right?
Like, if I ask you a hard question, you'll spend more time thinking about it.
and that would correspond to more forward passes.
But I think there's a finite number of forward passes that you can do.
It's kind of with language as well.
People are like, oh, well, human language can have like infinite recursion in it,
like infinite nested statements of like the boy jumped over the bear that was doing this,
that had done this, that had done that.
But like empirically, you'll only see five to seven levels of recursion,
which kind of relates to whatever, that magic number of like how many things you can
hold in working memory at any given time is.
And so, yeah, it's not infinitely recursive, but like, does that matter in the regime of human
intelligence?
And, like, can you not just add more layers?
Break down for me, you're referring to this in some of your previous answers, of: listen, you have these long contexts and you can hold more things in memory, but ultimately it comes down to your ability to mix concepts together to do some kind of reasoning. And these models aren't necessarily human level at that, even in context. Break down for me how you see storing just raw information versus reasoning
and what's in between, like where is the reasoning happening? Is that, where is just like
storing raw information happening? What's different between them in these models? Yeah, I don't
have a super crisp answer for you here. I mean, obviously with the input and output of the model,
you're mapping back to actual tokens, right? And then in between that, you're doing higher level
processing. Before we get deeper into this, we should explain to the audience. You referred
earlier to Anthropic's way of thinking about transformers as these read-write operations that
layers do. One of you should just kind of explain at a high level, what you mean by that.
So the residual stream, imagine you're in a boat going down a river. And the boat is kind of the
current query where you're trying to predict the next token. So it's the cat sat on the blank.
and then you have these little like streams that are coming off the river
where you can get extra passengers or collect extra information if you want
and those correspond to the attention heads and MLPs
that are part of the model, right?
And, okay, I almost think of it like the working memory of the model, like the RAM of the computer, where you're choosing what information to read in so you can do something with it, and then maybe read something else in later on.
Yeah, and you can operate on subspaces of that high-dimensional vector. A ton of things are, at this point I think it's almost a given, encoded in superposition, right? So it's like, yeah, the residual
stream is just one high dimensional vector, but actually there's a ton of different vectors
that are packed into it.
Yeah. I might just dumb it down, as a way that would have made sense to me a few months ago: okay, so you have whatever words are in the input you put into the model, and all those words get converted into these tokens, and those tokens get converted into these vectors. And basically it's just this small amount of information that's moving through the model. And the way you explained it to me, Sholto, this paper talks about is: early on in the model, maybe it's just doing some very basic things, like, what do these tokens mean? Like if it says 10 plus 5, just moving information around to have that good representation.
Exactly. Just representing it.
And in the middle, maybe the deeper thinking is happening, about, yeah, how to solve this.
At the end, you're converting it back into the output token, because the end product is you're trying to predict the probability of the next token from the last of those residual streams.
And so, yeah, it's interesting to think about just like the small compressed amount of information moving through the model and it's like getting modified in different ways.
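For readers who want that read-write picture in code form, here is a schematic sketch. It is pseudocode-level, not any particular model's implementation, and `embed`, `unembed`, and each (attn, mlp) pair are stand-ins for whatever the real components are: every attention block and MLP block reads from the residual stream and writes its output back in by addition, and the final logits are read off the last position.

```python
# Schematic sketch of the residual-stream view of a transformer.
# `embed`, `unembed`, and each (attn, mlp) pair are stand-ins, not real APIs.
def transformer_forward(tokens, blocks, embed, unembed):
    residual = embed(tokens)                  # one vector per token position
    for attn, mlp in blocks:
        residual = residual + attn(residual)  # heads read some subspaces, write into others
        residual = residual + mlp(residual)   # the MLP reads and writes the same stream
    return unembed(residual[-1])              # next-token logits from the last position
```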
Trenton, so it's interesting, you're one of the few people who have like background from neuroscience.
You can think about the analogies here to, yeah, to the brain.
And in fact, one of our friends, the way he put it, you had a paper in grad school thinking about attention in the brain, and he said this is the only or first, like, neural explanation of why attention works, whereas we have evidence for why CNNs, convolutional neural networks, work, based on the visual cortex or something.
Yeah, I'm curious, do you think in the brain there's something like a residual stream of this compressed amount of information,
that's moving through and it's getting modified
as you're thinking about something.
Even if that's not what's literally happening,
do you think that's a good metaphor
for what's happening in the brain?
Yeah, yeah.
So at least in the cerebellum,
you basically do have a residual stream
where the whole,
what we'll call the attention module for now,
and I can go into whatever amount of detail you want for that,
you have inputs that route through it,
but they'll also just go directly to the, like,
end point that module will contribute to.
So there's a direct path and an indirect path.
And so the model can pick up whatever information it wants and then add that back in.
Where would this happen, the cerebellum?
So the cerebellum nominally just does fine motor control. But I analogize this to the person who's lost their keys and is just looking under the streetlight, where it's very easy to observe this behavior.
One leading cognitive neuroscientist said to me that a dirty little secret of any fMRI study, where you're looking at brain activity for a given task, is that the cerebellum is almost always active and lighting up for it. If you have a damaged cerebellum, you also are much more likely to have autism, so it's associated with, like, social skills. In one of these particular studies, where I think they used PET instead of fMRI, when you're doing next-token prediction the cerebellum lights up a lot. Also, 70% of the neurons in your brain are in the cerebellum. They're small, but they're there, and they're taking up real metabolic cost.
This was one of Gwern's points, that what changed with humans was not just that we have more neurons, or, he shared this article, but specifically there's more neurons in the cerebral cortex and the cerebellum, and you should say more about this, but, like, they're more directly expensive and they're more involved in signaling
and sending information back and forth.
Yeah.
Is that attention?
What's going on?
Yeah.
Yeah.
So I guess the main thing I want to communicate here,
so back in the 1980s,
Pentti Kanerva came up with an associative memory algorithm for,
I have a bunch of memories.
I want to store them.
There's some amount of noise or corruption that's going on.
And I want to query or retrieve the best match.
And so he writes this equation for how to do it.
And a few years later,
realizes that if you implemented this as an electrical engineering circuit, it actually looks identical
to the core cerebellar circuit. And that circuit and the cerebellum more broadly is not just in us. It's
in basically every organism. There's active debate on whether or not cephalopods have it.
They kind of have a different evolutionary trajectory. But even fruit flies with the Drosophila
mushroom body, that is the same cerebellar architecture. And so that convergence, and then my paper,
which shows that actually this operation is to a very close approximation the same as the attention
operation, including implementing the softmax and having this sort of, like, nominal quadratic
cost that we've been talking about. And so the three-way convergence here and the takeoff
and success of transformers seems pretty striking to me.
Yeah. What I want to ask about, I think what motivated this discussion in the beginning was we were talking about, like, wait, what is the reasoning, what is the memory? What do you think about the analogy you found between attention and this? Do you think of this as more just looking up the relevant
memories or the relevant facts? And if that is the case, like, where is the reasoning
happening in the brain? How do we think about like how that builds up into the reasoning?
Yeah. So maybe my hot take here, I don't know how hot it is, is that most intelligence is pattern matching, and you can do a lot of really good pattern matching if you have a hierarchy of associative memories. So you start with your very basic associations between just, like, objects in the real world. But you can then chain those and have more abstract associations, such as, like, a wedding ring symbolizes so many other associations that are downstream. And so you can even generalize the attention operation and this associative memory to the MLP layer as well. That's in a longer-term setting where you don't have, like, tokens in your current context. But I think this is an argument that, like, association is all you need. And associative memory in general as well.
It's not, so you can do two things with it.
You can both de-noise or retrieve a current memory.
So, like, if I see your face, but it's, like, raining and cloudy, I can de-noise and kind of, like,
gradually update my query towards my memory of your face.
But I can also access that memory, and then the value that I get out
actually points to some other totally different part of the space.
And so a very simple instance of this would be if you learn the alphabet.
And so I query for A and it returns B.
I query for B and it returns C.
And you can traverse the whole thing.
Yeah.
Yeah.
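A minimal numerical sketch of the attention-as-associative-memory point, using the alphabet example just described: store a key vector for each letter and let each letter's value point at the next letter's key, so that re-querying with the output of the previous lookup chains through the sequence. (Toy construction, not the setup from the paper being discussed.)

```python
# Toy sketch: softmax attention acting as an associative (key -> value) memory.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
letters = list("ABCDE")
keys = {c: rng.normal(size=dim) for c in letters}          # stored addresses
values = {letters[i]: keys[letters[i + 1]]                  # each value points to the next key
          for i in range(len(letters) - 1)}

def attend(query):
    ks = np.stack([keys[c] for c in values])
    vs = np.stack([values[c] for c in values])
    scores = ks @ query / np.sqrt(dim)                      # scaled dot products
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # softmax over stored memories
    return weights @ vs                                     # weighted sum of values

q = keys["A"]
for _ in range(3):                            # A -> B -> C -> D by re-querying with the output
    q = attend(q)
    print(max(letters, key=lambda c: keys[c] @ q))          # which stored key it best matches
```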
One of the things I talked to Demis about was he had a paper in 2008
that memory and imagination are very linked because of this very thing
that you mentioned, memory is reconstructive.
And so you are in some sense imagining every time you're thinking of a memory
because you're only storing a condensed version of it and you, like, have to reconstruct it. And this is famously why human memory is terrible, and why people in the witness box
or whatever would just make shit up.
Okay, so let me ask a stupid question.
So you like read Sherlock Holmes, right?
And like the guy is incredibly sample efficient.
He'll like see a few observations and he'll like,
basically figure out who committed the crime because there's a series of deductive steps
that leads from somebody's tattoo and what's on the wall to the implications of that.
How does that fit into this picture?
Because, like, crucially, what makes him smart is that there's not, like, an association,
but there's a sort of deductive connection between different pieces of information.
Would you just explain it as that that's just, like, higher level association?
Like, yeah.
I think so, yeah.
So I think learning these higher-level associations, to be able to then map patterns to each other, is kind of like a meta-learning.
I think in this case he would also just have a really long context length or a really long working memory, right?
Where he can like have all of these bits and continuously query them as he's coming up with whatever theory.
So the theory is moving through the residual stream.
And then he has his attention heads querying his context.
Right.
But then how he's projecting his queries and keys in the space, and how his MLPs are then retrieving, like, longer-term
facts or modifying that information is allowing him to then in later layers do even more
sophisticated queries and slowly be able to reason through and come to a meaningful conclusion.
That feels right to me in terms of like looking back in the past, you're selectively reading
in certain pieces of information, comparing them, maybe that informs your next step of like
what piece of information you now need to pull in, and then you build this representation,
which I like progressively looks closer and closer and closer to like the suspect in your case.
Yeah. Yeah. Yeah. That doesn't feel at all outlandish.
To throw another lens on that: something I think people who aren't doing this research can overlook is that after the first layer of the model,
every query key and value that you're using for attention comes from the combination of all the previous tokens.
So, like, in my first layer, I'll query my previous tokens and just extract information from them. But all of a sudden, let's say that I attended to tokens
one, two, and four in equal amounts, then the vector in my residual stream, assuming that
they wrote out the same thing to the value vectors, but ignore that for a second, is a third
of each of those. And so when I'm querying in the future, my query is actually a third of each
of those things. But they might be written to different subspaces.
That's right. Hypothetically, but they wouldn't have to. And so you can recombine and immediately
even by layer two and certainly by the deeper layers
just have these very rich vectors
that are packing in a ton of information
and the causal graph is like literally over
every single layer that happened in the past
and that's what you're operating on.
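A tiny numeric illustration of that "a third of each" point, with made-up one-hot value vectors standing in for whatever the real heads write:

```python
# If a head attends equally to tokens 1, 2 and 4, what it writes into the
# residual stream (and thus what later queries are built from) is a mix of
# all three value vectors.
import numpy as np

v1, v2, v4 = np.eye(3)                     # pretend value vectors in different subspaces
attn_weights = np.array([1/3, 1/3, 1/3])   # equal attention to tokens 1, 2, 4
head_output = attn_weights @ np.stack([v1, v2, v4])
print(head_output)                         # [0.333 0.333 0.333]: a third of each
```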
It does bring to mind, like, a very funny eval to do would be a Sherlock Holmes eval, where you put the entire book into context, and then you have a sentence which is like, the suspect is X, then you have a log probability distribution over the different characters in the book, and then, like, as you put more...
That would be super cool.
It would be super interesting.
I wonder if you'd get anything at all.
But it would be cool.
Sherlock Holmes is probably already in the training data.
Right.
You get like a mystery novel that was written in the...
You could get an LLM to write it.
Or we could like...
Well, you could purposely exclude it, right?
Oh, we can?
How do you...
Well, you need to scrape any discussion of it from Reddit or any other thing, right?
Right.
It's hard.
But that's, like, one of the challenges that goes into things like long-context evals: to get a good one, you need to know that it's not in your training data, or, like, put in the effort to exclude it.
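For concreteness, here is a rough sketch of what that Sherlock Holmes-style eval could look like. The `logprob_of_completion` method is a hypothetical stand-in for whatever scoring endpoint a given model exposes, not a real API, and the prompt wording is made up.

```python
import math

def suspect_distribution(model, novel_text, suspects):
    # Score each candidate as the continuation of "The culprit is ___",
    # then softmax the scores into a distribution over suspects.
    prompt = novel_text + "\n\nThe culprit is "
    scores = {name: model.logprob_of_completion(prompt, name)   # hypothetical API
              for name in suspects}
    top = max(scores.values())
    exp_scores = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exp_scores.values())
    return {name: v / total for name, v in exp_scores.items()}
```

Running this on prefixes of increasing length would show whether the distribution sharpens toward the right character as more of the book sits in context.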
So, actually, there are two different threads I want to follow up on. Let's go to the long-context one and then we'll come back to this. So in the Gemini 1.5 paper, the eval that was used was something with Paul Graham essays, the needle-in-a-haystack thing, which, yeah, I mean, we don't necessarily just care about its ability to recall one specific fact from the context.
I'll step back and ask the question: the loss function for these models is unsupervised. You don't have to come up with these bespoke things that you keep out of the training data. Is there a way you can do a benchmark that's also unsupervised, where, I don't know, another LLM is rating it in some way, or something like that? And maybe the answer is, like, well, if you could do this, reinforcement learning would work, because then you have this unsupervised signal.
Yeah, I mean, I think people have explored that kind of stuff.
Like, for example, Anthropic has the constitutional AI paper, where they take another language model and they point it at a response and say, like, how helpful or harmless was that response? And then they get it to update and try to improve along the Pareto frontier of helpfulness and harmlessness. So you can point language models at each other and create evals in this way. It's obviously an imperfect art form at the moment, because you get reward function hacking, basically, and the language models, like... even humans are imperfect here. If you try and match up to what humans will say, humans typically prefer longer answers, which aren't necessarily better answers. And you get that same behavior with models.
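A minimal sketch of the models-grading-models idea described here, with `generate` as a hypothetical stand-in for whatever inference call you have (not a real library function), and a made-up rubric:

```python
def judge(judge_model, prompt, response):
    # Ask a second model to grade the first model's response against a rubric.
    rubric = (
        "Rate the following response for helpfulness and harmlessness "
        "on a scale from 1 to 10. Reply with a single number.\n"
        f"Prompt: {prompt}\nResponse: {response}\nScore:"
    )
    return judge_model.generate(rubric)  # hypothetical API

# Caveat from the discussion above: graders (human or model) tend to prefer
# longer answers, so length needs to be controlled for when comparing.
```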
On the other thread, going back to the Sherlock Holmes thing,
if it's all associations all the way down,
this is a sort of naive dinner party question,
if I just, like, met you and you're working on AI.
But, okay, does that mean we should be less worried
about super intelligence?
Because there's not this sense in which it's like,
Sherlock Holmes++. It'll still need to just, like, find these associations, like humans find
associations and like, you know what I mean? It's not just like, it sees a frame of the world and
it's like figured out all the laws of physics. So for me, because this is a very legitimate
response, right? It's like, well, artificial general intelligences aren't, if you say humans are generally intelligent, then they're no more capable or competent. I'm just worried that you have
that level of general intelligence in silicon, where you can then immediately,
clone hundreds of thousands of agents and they don't need to sleep and they can have super
long context windows and then they can start recursively improving and then things get really
scary. So I think to answer your original question, yes, you're right. They would still need to
learn associations, but the recursive stuff of improvement would still have to be them,
like if intelligence is fundamentally about these associations, like the improvement is just
them getting better at association. There's not like another thing that's happening. And
so then it seems like you might disagree
with the intuition that, well, they can't be that much more powerful if they're just doing
associations. Well, I think then you can get into really interesting cases of meta-learning.
Like when you play a new video game or like study a new textbook, you're bringing a whole
bunch of skills to the table to form those associations much more quickly. And like, because
everything in some way ties back to the physical world, I think there are, like, general features
that you can pick up and then apply in novel circumstances.
Should we talk about intelligence explosion then?
I mentioned multiple agents and I'm like, oh, here we go.
Okay, so the reason I'm interested in discussing this with you guys in particular is that the models we have of the intelligence explosion so far come from economists, which is fine, but I think we can do better, because in the model of the intelligence explosion, what happens is you replace the AI researchers, and then there's, like, a bunch of automated AI researchers who can speed up progress, make more AI researchers, make further progress.
And so I feel like if that's the metric, or that's the mechanism, we should just ask
the AI researchers about whether they think this is plausible.
So let me just ask you, like, if I have a thousand agent Sholtos or agent Trentons,
do you think that you get an intelligence explosion?
Is that, yeah, what does that look like to you?
I think one of the important bounding constraints here is compute.
I do think you could dramatically speed up AI research, right?
Like, it seems very clear to me that in the next couple of years
we'll have things that can do many of the software engineering tasks
that I do on a day-to-day basis
and therefore dramatically speed up my work
and therefore speed up the rate of progress, right?
At the moment, I think most of the labs are somewhat compute-bound
in that there are always more experiments you could run
and more pieces of information that you could gain
in the same way that, like, scientific research,
on biology is also somewhat experimentally like throughput bound.
Like you need to be able to run and culture the cells in order to get the information.
I think that will be at least a short-term binding constraint.
Obviously, you know, Sam's trying to raise $7 trillion to go get chips.
And so, like, it does seem like there's going to be a lot more compute in future as
everyone is heavily ramping.
You know, your Nvidia's stock price sort of represents the relative compute increase.
But any thoughts?
I think we need a few more nines of reliability in order for it to really be useful and trustworthy.
Right now, it's like, and just having context lengths that are super long and it's like very cheap to have.
Like if I'm working in our code base, it's really only small modules that I can get Claude to write for me right now.
But it's very plausible that within the next few years or even sooner, it can automate most of my
task. The only other thing here that I will note is the research that at least our sub-team
in interpretability is working on is so early stage that you really have to be able to make sure
everything is like done correctly in a bug-free way and contextualize the results with everything
else in the model. And if something isn't going right, be able to enumerate all of the possible
things and then slowly work on those. Like an example that we've publicly talked about in previous
papers is dealing with layer norm, right? And it's like if I'm trying to get an early result or look
at like the logit effects of the model, right? So it's like if I activate this feature that we've
identified to a really large degree, how does that change the output of the model? Am I using
layer norm or not? How is that changing the feature that's being learned? And that will take
even more context or reasoning abilities for the model.
So you used a couple of concepts together, and it's not self-evident to me that they're the same, but it seems like you were using them interchangeably, so I just want to... one was, well, to work on the Claude code base and make more modules based on that, they need more context or something, where, like, it seems like they might already be able to fit that in the context. Or do you mean, like, the context window context, or something more?
Yeah, the context window context.
So yeah, it seems like now it might just be able to fit.
The thing that's preventing it from making good modules is not the lack of being able to put the code base in there.
I think that will be there soon.
Yeah.
But it's not going to be as good as you at like coming up with papers because it can like fit the code base in there.
No, but it will speed up a lot of the engineering.
Hmm.
In a way that causes an intelligence explosion?
No, that accelerates research.
But I think these things compound.
So like the faster I can do my engineering, the more experiments I can run.
And then the more experiments I can run, the faster we can.
I mean, my work isn't actually accelerating capabilities at all.
Right, right.
It's just interpreting the models.
But we have a lot more work to do on that.
To the surprise of Twitter.
Yeah, I mean, for context, like, when you released your paper, there was a lot of talk on Twitter about, alignment is solved, guys, close the curtains.
Yeah, yeah, no, it keeps me up at night how quickly the models are becoming more capable.
and just how poor our understanding still is
of what's going on.
Yeah, I guess I'm still...
Okay, so, reasoning through the specifics here,
by the time this is happening,
we have bigger models that are two to four orders
of magnitude bigger, right?
Or at least in effective compute
are two to four orders of magnitude bigger.
And so this idea that, well,
you can run experiments faster or something,
you're having to retrain that model
in this version of the intelligence explosion,
like the recursive self-improvement
is different from what might have been imagined 20 years ago
where you just rewrite the code.
You actually have to train a new model,
and that's really expensive.
Not only now, but especially in the future
as you keep making these models
orders of magnitude bigger.
Doesn't that dampen the possibility
of a sort of recursive self-improvement
type intelligence explosion?
It's definitely going to act
as a braking mechanism.
I agree that the world of like what we're making today looks very different to what people imagined it would look like 20 years ago.
Like, it's not going to be able to rewrite its own code to be really smart, because, actually, the code it uses to train itself, the code itself, is typically quite simple, typically pretty small and self-contained.
I think John Carmack had this nice phrase where it's like the first time in history where like you can actually plausibly imagine writing AI with like 10,000 lines of code.
And that actually does seem plausible when you pare most training codebases down to the limit.
But it doesn't take away from the fact that this is something we should really strive to measure and estimate, like, how progress might occur.
Like we should be trying very, very hard right now to measure exactly how much of a software engineer's job is automatable and what the trend line looks like, and be trying our hardest to project out those trend lines.
But with all due respect to software engineers, like, you are not writing a React front end, right? So it's like, I don't know how this, like, what is concretely happening? Maybe you can walk me through, like, a day in the life of Sholto. You're working on an experiment or project that's going to make the model, quote-unquote, better, right? Like, what is happening, from observation to experiment to theory to, like, writing the code? What is happening?
So I think it's important to contextualize here that I've primarily worked on inference so far. So a lot of what I've been doing is just taking, or helping guide, the pre-training process such that we design a good model for inference, and then making the model and, like, the surrounding system faster. I've also done some pre-training work around that, but it hasn't been, like, my 100% focus, but I can still describe what I do when I do that work.
I know, but sorry, let me interrupt and say, in Carl Shulman's case, when he was talking about it on the podcast, he did say that things like improving inference, or even literally helping make better chips or GPUs.
That's, like, part of the intelligence explosion.
Yeah.
Because, like, obviously, if the inference code runs faster, like, it happens better or faster or whatever.
Right.
Anyway, sorry, go ahead.
Okay, so what does concretely a day look like?
I think the most important, like, part to illustrate is this cycle of coming up with an idea,
proving it out at different points in scale, and, like, interpreting and understanding what goes wrong.
And I think most people would be surprised to learn just how much
goes into interpreting and understanding what goes wrong.
Because the ideas, people have long lists of ideas that they want to try.
Not every idea that you think should work will work
and trying to understand why that is is quite difficult.
And, like, working out what exactly you need to do to interrogate it.
So so much of it is like introspection about what's going on.
It's not pumping out thousands and thousands and thousands of lines of code.
It's not like the difficulty in coming up with ideas even.
I think many people have a long list of ideas that they want to try.
But paring that down and shot-calling, under very imperfect information, what the right ideas to explore further are, is really hard.
Tell me more about what do you mean by imperfect information?
Are these early experiments?
Are these, like what is the information that you're?
So Demis mentioned this on his podcast, and also, like, obviously, it's like the GPT-4 paper, where you have, like, scaling law increments. And you can see in the GPT-4 paper they have, like, a bunch of dots, right, where they say we can estimate the performance of our final model using all of these dots, and there's a nice curve that flows through them. And Demis mentioned, yeah, that we do this process of scaling up.
Concretely, why is that imperfect information? You never actually know if the trend will
hold.
For certain architectures, the trend has held really well, and for certain changes, it's held
really well.
But that isn't always the case, and things which can help at smaller scales can actually
hurt at larger scales.
So you're making guesses
based on what the trend lines look like
and based on like your intuitive feeling of
okay this is actually something that's going to matter
particularly for those ones which help at the small scale
That's interesting to consider, that for every chart you see in a release paper or technical report that shows that smooth curve, there's a graveyard of, like, failed runs where it just goes flat.
Yeah, there's all these, like, other lines that go in different directions, or tail off, and, like, that's, yeah.
It's crazy, both, like, as a grad student and then also here, like, the number of experiments that you have to run before getting, like, a meaningful result.
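To make the scaling-law extrapolation concrete, here is a toy sketch: fit a pure power law (loss = a * compute^-b) to a handful of pretend small-run losses and extrapolate to a much bigger compute budget. Real fits are more careful (for example, they usually include an irreducible-loss term), and the numbers below are made up.

```python
import numpy as np

# Pretend losses measured at five small scales (synthetic, illustrative numbers).
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = 3.0 * compute ** -0.05

# Fit log(loss) = log(a) - b * log(C), i.e. a pure power law loss = a * C^-b.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
predicted_loss_at_big_run = np.exp(intercept + slope * np.log(1e23))
print(predicted_loss_at_big_run)   # extrapolated loss for the full-scale run
```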
Tell me, okay, so you, but presumably it's not just like you run it until it stops
and then, like, let's go to the next thing.
There's some process by which to interpret the early data and also to look at your, like,
I don't know, I could, like, put a Google Doc in front of you,
and I'm pretty sure you could just, like, keep typing for a while on, like, different
ideas you have.
And there's some bottleneck between that and just, like, making the models better
immediately.
Right.
Yeah, walk me through, like, what is the inference you're making from the first early
steps that makes you have better experiments and better ideas?
I think one thing that I didn't fully convey before was that I think a lot of, like,
good research comes from working backwards from the actual problems that you want to
solve.
And there's a couple of, like, grand problems.
I split those in, like, making the models better today that you would identify as
issues and then, like, work back from, okay, how could I, like, change it to achieve this?
There's also a bunch of when you scale, you run into things and you want to like fix behaviors or like issues at scale and that like informs a lot of the research for the next increment and this kind of stuff.
So concretely the barrier is a little bit software engineering.
Like often having a code base that's large and sort of capable enough that it can support many people doing research at the same time makes it complex.
If you're doing everything by yourself, your iteration pace is going to be much faster.
I've heard that, like, Alec Radford, for example,
like famously did much of the pioneering work at OpenAI.
He, like, mostly works out of, like, a Jupyter notebook
and then, like, has someone else who, like,
writes and productionizes that code for him.
I don't know if that's true or not.
But, like, that kind of stuff, actually operating with other people, raises the complexity a lot, for natural reasons familiar to, like, every software engineer. And then the inherent... running and launching those experiments is easy, but there are inherent time slowdowns induced by that. So you often want to be parallelizing multiple different streams, because, one, you can't be totally focused on one thing necessarily, you might not have fast enough feedback cycles. And then intuiting what went wrong is actually really hard, like working out what... This is, in many respects, the problem that the team Trenton is on is trying to better understand: what is going on inside these models? We have inferences and understanding and, like, head canon for why certain things work,
but it's not an exact science.
And so you have to constantly be making guesses
about why something might have happened,
what experiment might reveal whether that is or isn't true,
and that's probably the most complex part.
The performance work, by comparison, is easier,
but harder in other respects.
It's just a lot of low level and like difficult engineering work.
Yeah, I agree with a lot of that.
But even on the interpretability team,
I mean, especially with Chris Olah leading it,
There are just so many ideas that we want to test, and it's really just having the engineering skill, but I'll put engineering in quotes because a lot of it is research, to like very quickly iterate on an experiment, look at the results, interpret it, try the next thing, communicate them, and then just ruthlessly prioritizing what the highest priority things to do are.
And this is really important, like the ruthless prioritization is something which I think separates a lot of like quality research from
research that doesn't necessarily succeed as much.
We're in this funny field where so much of our initial theoretical understanding has broken down, basically. And so you need to have this simplicity bias and, like, ruthless prioritization over what's
And so you need to have this simplicity bias and like ruthless prioritization over what's
actually going wrong.
And I think that's one of the things that separates the most effective people is they don't
necessarily get like too attached to solving, using a given solution that they're necessarily
familiar with, but rather they attack the problem directly.
You see this a lot in like maybe people come in with a specific academic background.
They try and solve problems with that toolbox.
And the best people are people who expand the toolbox dramatically.
They're running around and they're taking ideas from reinforcement learning,
but also from optimization theory and also they have a great understanding of systems.
And so they know what the sort of constraints that bound the problem are.
And they're good engineers so they can iterate and try ideas fast.
Like by far the best researchers I've seen,
they all have the ability to try experiments really, really, really fast. And that cycle time, at smaller scales, cycle time separates people.
I mean, machine learning research is just so empirical.
Yeah. And this is honestly one reason why I think our solutions might end up looking more
brainlike than otherwise. It's like, even though we wouldn't want to admit it, the whole
community is kind of doing like greedy evolutionary optimization over the landscape of like possible
AI architectures and everything else. It's like no better than evolution.
And that's not even necessarily a slight against evolution.
That's such an interesting idea.
I'm still confused on what will be the bottleneck for these,
what would have to be true of an agent such that it, like, sped up your research?
So in the Alec Radford example you gave, where he apparently already has the equivalent of, like, Copilot for his Jupyter notebook experiments, is it just that if he had enough of those, he would be a dramatically faster researcher, and so you just need Alec Radford?
So it's like you're not automating the humans.
You're just making the most effective researchers who have great taste more effective
and like running the experiments for them and so forth.
Or like you're still working at the point with which the intelligence explosion is happening.
You know what I mean?
Like is that what you're saying?
Right.
And if that were directly true, why can't we scale our current research teams better? For example, that's, I think, an interesting question to ask. Like, if this work is so valuable, why can't we take hundreds or thousands of people, who are definitely out there, and, like, scale our organizations better? I think we are less bound at the moment by the sheer engineering work of making these things than we are by compute to run and get signal, and by taste in terms of what the actual right thing to do is, and, like, making those difficult inferences on imperfect information.
That might be true for the Gemini team, because I think for interpretability
we actually really want to keep hiring talented engineers, and I think it's a big bottleneck for us to just keep making a lot of progress.
Obviously more people is, like, better. But I do think it's interesting to consider. I think one of the biggest challenges that I've thought a lot about is how do we scale better. Like, Google is an enormous organization and has 200,000-ish people, right? Like 180,000 or something like that.
And one has to imagine if there were, like, ways of scaling out Gemini's research program
to all those fantastically talented software engineers.
This seems like a key advantage that you would want to be able to take advantage of,
you'd want to be able to use, but like how do you effectively do that?
It's a very complex organizational problem.
So compute and taste, that's interesting to think about
because at least the compute part is not bottlenecked on more intelligence. It's just bottlenecked on Sam's $7 trillion or whatever, right?
Yeah, yeah.
So if I gave you 10x the H100s to run your experiments, how much more effective a researcher are you?
TPUs, please.
How much more effective a researcher are you?
I think the Gemini program would probably be, like, maybe five times faster with 10 times more compute or something like that.
So that's pretty good elasticity of like 0.5.
Yeah.
Wait, that's insane.
Yeah. I think, like, more compute would just, like, directly convert into progress.
So you have some fixed size of compute, and some of it goes to inference, some of it, I guess, like, also goes to clients of GCP.
Yep.
Some of it goes to, huh?
Some of it goes to training.
And there, I guess as a fraction of it, some of it goes to running the experiments for
the full model.
Yeah, that's right.
Shouldn't the fraction that goes to experiments then be higher, given that, you would just be like, the bottleneck is research and research is bottlenecked by compute?
And so one of the strategic decisions that every pre-training team has to make is like exactly what amount of compute you allocate to your different training runs.
Like to your research program versus like scaling the last best, like, you know, thing that you landed on.
And I think they're all trying to arrive at, like, a sort of Pareto-optimal point here.
one of the reasons why you need to still keep training big models is that you get information there that you don't get otherwise.
So scale has all these emergent properties, which you want to understand better.
And if you are always doing research and never, like, remember what I said before about like, you're not sure what's going to like fall off the curve, right?
Yeah.
If you, like, keep doing research in this regime and keep on getting more and more compute efficient, you may never... you may have actually, like, gone off the path that actually eventually scales.
So you need to constantly be investing in doing big runs too
at the frontier of what you sort of expect to work.
Okay, so then tell me what it looks like to be in the world
where AI has significantly sped up AI research
because from this, it doesn't really sound like the AIs are going off
and writing the code from scratch
and that's leading to faster output.
It sounds like they're really augmenting the top researchers in some way.
Like, yeah, tell me concretely.
Are they doing the experiments?
Are they coming up with the ideas?
Are they just like evaluating the outputs of the experiments?
What's happening?
So I think there's like two worlds you need to consider here.
One is where AI has meaningfully sped up our ability to make algorithmic progress.
And one is where the output of the AI itself is the thing that's the crucial ingredient towards model capability progress. And specifically, what I mean there is synthetic data.
Synthetic data, right?
And in the first world, where it's meaningfully speeding up algorithmic progress, I think a necessary component of that is more compute. And you probably reach this elasticity point where AIs are, at some point, easier to spin up and get in context than other people. And so AIs meaningfully speed up your work because they're a fantastic copilot, basically, that helps you code multiple times faster. And that seems actually quite reasonable.
Super long context, super smart model.
It's onboarded immediately, and you can send it off to complete sub-tasks and sub-goals for you. And that actually feels very plausible. But, again, we don't know, because there are no great evals for that kind of thing. The best one, as I said before, is SWE-bench, which...
Although in that one, somebody was mentioning to me, the problem is that when a human is trying to do a pull request, they'll type something out, they'll run it and see if it works, and if it doesn't, they'll rewrite it. None of that was part of the opportunities the LLM was given when it was run on this. It just gives one output, and if it runs and checks all the boxes, then, you know, it passed, right? So it might have been an unfair test in that way.
So you can imagine that if you were able to use that, it would be an effective training source. The key thing that's missing from a lot of training data is the reasoning traces, right? And if I wanted to try to automate a specific field or job family, or understand how at risk of automation it is, then having reasoning traces feels to me like a really important part of that.
There are so many different threads in that I want to follow up on. Let's begin with the data versus compute thing: is the output of these AIs the thing that's causing the intelligence explosion?
Yeah.
People talk about how these models are really a reflection of their data.
Yeah.
I forget his name, but there's a great blog post by this OpenAI engineer, and it was talking about how, at the end of the day, as these models get better and better, they're just going to be really effective maps of the dataset. So at the end of the day, you've got to stop thinking about architectures; the most effective architecture is just the one that does an amazing job of mapping the data. So that implies that future AI progress comes from the AI just making really awesome data, right? The data that you're mapping to is clearly very important.
Yeah, yeah, that's really interesting. Does that look to you like, I don't know, things that look like chain of thought? What do you imagine the synthetic data looks like as these models get better, as these models get smarter?
When I think of really good data, to me that means something which involved a lot of reasoning to create. In modeling that, it's similar to Ilya's perspective on achieving superintelligence via effectively perfectly modeling the human textual output. But even in the near term, in order to model something like the arXiv papers or Wikipedia, you have to have an incredible amount of reasoning behind you in order to understand what next token might be output. And so for me, what I imagine as good data is data where the model similarly had to do reasoning to produce it. And then the trick, of course, is how do you verify that that reasoning was correct?
And this is why you saw DeepMind do that geometry work, that sort of self-verified approach to geometry. Geometry is an easily formalizable, easily verifiable field, so you can check if its reasoning was correct, and you can generate heaps of data of correct, verified geometry proofs, train on that, and know that it's good data.
It's actually funny, because I had a conversation with Grant Sanderson last year where we were debating this, and I was like, fuck dude, by the time they get gold in the Math Olympiad, of course they're going to automate all the jobs.
On the synthetic data thing, one of the things I speculated about in my scaling post, which was heavily informed by discussions with you two, and you especially, Sholto, was that you can think of human evolution through the perspective of: we get language, and so we're generating the synthetic data, our copies are generating the synthetic data, which we're trained on, and it's this really effective genetic-cultural co-evolutionary loop.
And there's a verifier there, too, right?
Like, there's the real world.
You might generate a theory about, you know, the gods cause the storms, right?
And then, like, someone else finds cases where that isn't true.
And so you, like, know that, like, that sort of didn't match your verification function.
And now, like, actually, instead you have, like, some weather simulation, which required
a lot of reasoning to produce and, like, accurately matches reality.
And, like, you can train on that as a better model of the world.
Like we are training on that, and on stories and scientific theories, yeah.
I want to go back. I'm just remembering something you mentioned a little while ago: given how empirical ML is, it really is an evolutionary process resulting in better performance, and not necessarily an individual coming up with a breakthrough in a top-down way. That has interesting implications.
The first being that people are concerned about capabilities increasing because more people are going into the field. I've somewhat been skeptical of that way of thinking, but from this perspective of just more inputs, it really does feel more like, oh, actually, the fact that more people are going to ICML means there's faster progress towards GPT-5.
Yeah, you just have more genetic recombination and shots on target, yeah.
And I mean, aren't all fields kind of like that? There's the sort of scientific framing of discovery versus invention, right? And whenever there's been a massive scientific breakthrough in the past, typically there are multiple people co-discovering it at roughly the same time. And that feels to me at least a little bit like the mixing and trying of ideas. You can't try an idea that's so far out of scope that you have no way of verifying it with the tools you have available.
Yeah.
I think physics and math might be slightly different in this regard, but especially for biology or any sort of wetware, and to the extent we want to analogize neural networks here, it's just comical how serendipitous a lot of the discoveries are.
Yeah.
Like penicillin, for example.
Another implication of this is that the idea that AGI is just going to come tomorrow, that somebody's just going to discover a new algorithm and we have AGI, seems less plausible. It will just be a matter of more and more researchers finding these marginal things that all add up together to make models better, right?
Yeah, that feels like the correct story to me.
Especially while we're still hardware constrained.
Right.
Do you buy this narrow window framing of the intelligence explosion? Each generation, GPT-3 to GPT-4 say, is two orders of magnitude more compute, or at least more effective compute, in the sense that if you didn't have any algorithmic progress, it would have to be two orders of magnitude bigger in raw form to be as good. Do you buy the framing that, given that you have to be two orders of magnitude bigger at every generation, if you don't get AGI by GPT-7 that can help you catapult an intelligence explosion, you're kind of just fucked as far as much smarter intelligences go, and you're kind of stuck with GPT-7-level models for a long time? Because at that point you're consuming significant fractions of the economy to make that model, and we just don't have the wherewithal to make GPT-8. This is the Carl Shulman sort of argument, that we're going to race through the orders of magnitude in the near term, but longer term it would be harder.
I think he's probably talked about this, but yeah, I do buy that framing. I mean, I generally buy that increases in orders of magnitude of compute give, in absolute terms, almost diminishing returns on capability, right? We've seen models go, over a couple of orders of magnitude, from being unable to do anything to being able to do huge amounts. And it feels to me like each incremental order of magnitude gives more nines of reliability at things, and so it unlocks things like agents. But at least at the moment, I haven't seen anything transformative in that sense. It doesn't feel like reasoning improves linearly, so to speak, but rather somewhat sublinearly.
That's actually a very bearish sign, because we were chatting with one of our friends and he made the point that if you look at what new applications are unlocked by GPT-4 relative to GPT-3.5, it's not clear it's that much. GPT-3.5 can do Perplexity or whatever. So if there is this diminishing increase in capabilities, and that increase costs exponentially more to get, that's actually a bearish sign on what 4.5 will be able to do, or what it will unlock in terms of economic impact.
That being said, for me the jump between 3.5 and 4 is pretty huge. And so even another 3.5-to-4 jump is ridiculous, right? Like if you imagine 5 as being a 3.5-to-4 jump straight off the bat, in terms of ability to do SATs and this kind of stuff.
Yeah, the LSAT performance was particularly striking.
Exactly. You go from, you know, not super smart, to very smart, to utter genius in the next generation, instantly. And it doesn't, at least to me, feel like we're going to jump to utter genius in the next generation. But it does feel like we'll get very smart plus lots of reliability, and then we'll see, TBD, what that continues to look like.
Will GOFAI be part of the intelligence explosion? You say synthetic data, but will it in fact be writing its own source code in some important way? There was an interesting paper that you can use diffusion to come up with model weights. I don't know how legit that was or whatever, but something like that.
Can you define that? GOFAI is good old-fashioned AI, right? Because when I hear it, I think if-else statements, symbolic logic.
Sure. Actually, I first want to make sure we fully unpack the whole model-improvement-increments thing, because I don't want people to come away with the perspective that this is super bearish and models aren't going to get much better. What I want to emphasize is that the jumps we've seen so far are huge, and even if those continue on a smaller scale, we're still in for extremely smart, very reliable agents over the next couple of orders of magnitude.
And so we didn't fully close the thread on the narrow window thing. But when you think of, let's say, GPT-4's cost, I don't know, let's call it $100 million or whatever, you have the 1B run, the 10B run, the 100B run, which all seem very plausible by, you know, private company standards. And then the...
You mean in terms of dollars?
In terms of dollar amount, yeah. And then you can also imagine even a 1T run being part of a national consortium or a national-level thing, but much harder on the behalf of an individual company.
But Sam is out there trying to raise $7 trillion, right? He's already preparing for a whole lot of orders of magnitude more than that.
Right. He shifted the Overton window. He's shifted the orders of magnitude here beyond the national level. So the point I want to make is that we have a lot more jumps coming, and even if those jumps are relatively smaller, that's still a pretty stark improvement in capability.
Not only that, but if you believe claims that GPT-4 is around a one-trillion parameter count: the human brain has between 30 and 300 trillion synapses. That's obviously not a one-to-one mapping, and we can debate the numbers, but it seems pretty plausible that we're below brain scale still.
So, crucially, the point being that the overhang is really high, in the sense that, and maybe this is something we should touch on explicitly, even if you can't keep dumping more compute beyond the models that cost a trillion dollars or something, the fact that the brain is so much more data efficient implies that we have the compute. If we had the brain's algorithm to train with, if we could train as sample-efficiently as humans train from birth, we could make AGI.
Yeah, but the sample efficiency stuff, I never know exactly how to think about it, because obviously a lot of things are hardwired in certain ways, right? And there's the co-evolution of language and brain structure. So it's hard to say. Also, there are some results showing that if you make your model bigger, it becomes more sample efficient.
Yeah, the original scaling laws paper had that, right? The larger models almost have to be.
Right. So maybe that also just solves it. You don't have to be more data efficient; if your model's bigger, then you also just are more data efficient.
How do we think about, yeah, what is the explanation for why that would be the case? A bigger model just sees the exact same data, and at the end of seeing that data it's learned more from it.
I mean, my very naive take here would just be that one thing the superposition hypothesis, which interpretability has pushed, says is that your model is dramatically underparameterized. And that's typically not the narrative that deep learning has pursued, right? But if you're trying to train a model on the entire internet and have it predict it with incredible fidelity, you are in the underparameterized regime. And you're having to compress a ton of things and take on a lot of noisy interference in doing so. And so with a bigger model, you can just have cleaner representations to work with.
For the audience, you should unpack that: first of all, what superposition is, and why that is the implication of superposition.
Sure.
Yeah.
So the fundamental result, and this was before I joined Anthropic, but the paper's titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high-dimensional and sparse, and by sparse I mean any given data point doesn't appear very often, your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters.
And on the sparsity here, I think both of these constraints apply to the real world, and modeling internet data is a good enough proxy for that. There's only one Dwarkesh. There's only one shirt you're wearing. There's this Liquid Death can here. These are all objects or features, and how you define a feature is tricky. So you're in a really high-dimensional space, because there are so many of them, and they appear very infrequently.
Yeah. And in that regime, your model will learn compression.
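(For readers who want to see the shape of this concretely: here is a minimal sketch in the spirit of the Toy Models of Superposition setup just described, a tied linear map squeezing many sparse features through a small bottleneck. The sizes, sparsity level, and training details are illustrative assumptions, not the paper's exact configuration.)

```python
# Toy superposition sketch: n_features sparse features squeezed through a
# d_hidden < n_features bottleneck with a tied linear map. Illustrative only.
import torch

n_features, d_hidden = 400, 30
W = torch.nn.Parameter(0.01 * torch.randn(n_features, d_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5_000):
    # Sparse synthetic data: each feature is active (uniform in [0, 1]) with prob 0.01.
    active = (torch.rand(1024, n_features) < 0.01).float()
    x = torch.rand(1024, n_features) * active
    h = x @ W                          # compress into the small hidden space
    x_hat = torch.relu(h @ W.T + b)    # reconstruct with the tied transpose
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# After training, W @ W.T is far from the identity: many features share hidden
# directions (superposition), yet reconstruction error stays low because sparse
# features rarely co-occur, so the interference is usually harmless.
```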
To riff a little bit more on this: it's becoming increasingly clear, I believe, that the reason networks are so hard to interpret is in large part this superposition. So if you take a model and you look at a given neuron in it, a given unit of computation, and you ask, how is this neuron contributing to the output of the model when it fires? And you look at the data that it fires for. It's very confusing. It'll fire for like 10% of every possible input, or for Chinese, but also fish and trees and the full stop character and URLs, right?
But the paper that we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher-dimensional space and provide a sparsity penalty, so you can think of this as undoing the compression, in the same way that you assumed your data was originally high-dimensional and sparse, you return it to that high-dimensional and sparse regime, you get out very clean features. And things all of a sudden start to make a lot more sense.
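(A minimal sketch of the dictionary-learning idea being described here, projecting activations into a much wider space with an L1 sparsity penalty. The layer sizes, hyperparameters, and training loop are illustrative assumptions, not Anthropic's actual configuration.)

```python
# Sparse-autoencoder sketch: residual-stream activations are projected into a
# wider dictionary space with an L1 penalty, so each dictionary direction tends
# to fire for one clean feature. Sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # wide, mostly-zero feature vector
        recon = self.decoder(features)             # map back to the residual stream
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

def train_step(acts: torch.Tensor) -> float:
    recon, features = sae(acts)
    # Reconstruction loss plus sparsity penalty: "undo the compression".
    loss = ((acts - recon) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# `acts` would be activations collected from the model; here a random stand-in.
train_step(torch.randn(1024, 512))
# After training, inspecting which tokens maximally activate each dictionary
# feature is what yields the interpretable units described above.
```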
Okay. There's so many interesting threads there. The first thing I want to ask is about the thing you mentioned, that these models are trained in a regime where they're overparameterized. Isn't that when you have generalization? Like, grokking happens in that regime, right?
So I was saying the models are underparameterized.
Oh, I see.
Yeah, typically people talk about deep learning as if the model is overparameterized. But actually the claim here is that they're dramatically underparameterized, given the complexity of the task that they're trying to perform.
Okay. Another question. So the distilled models: first of all, what is happening there? Because the earlier claim we were talking about is that smaller models are worse at learning than bigger models. But for GPT-4 Turbo, you could make the claim that it's actually worse at reasoning-style stuff than GPT-4, but probably knows the same facts, like the distillation got rid of some of the reasoning things.
Do we have any evidence that GPT-4 Turbo is a distilled version of 4? It might just be a new architecture.
Oh, okay.
Yeah. It could just be a faster, more efficient new architecture.
Okay, interesting. So that's cheaper, yeah.
So how do you interpret what's happening in distillation? I think Gwern had one of these questions on his website: why can't you train the distilled model directly? Why does it have to go through the bigger one? Is the picture that you had to project it from this bigger space to a smaller space?
I mean, I think both models will still be using superposition. But the claim here is that you get a very different model if you distill versus if you train from scratch.
Yeah. And is it just more efficient, or is it fundamentally different in terms of performance?
I don't remember. Do you know?
I think the traditional story for why distillation is more efficient is that normally during training, you're trying to predict this one-hot vector that says, this is the token you should have predicted. And if your reasoning process means that you're really far off predicting that, then you still get gradient updates pointing you in the right direction, but it might be really hard for you to learn to have predicted that in the context that you're in. And so what distillation does is, instead of just the one-hot vector, it gives you the full readout from the larger model, all of the probabilities. And so you get more signal about what you should have predicted. In some respects, it's like showing a tiny bit of your working. It's not just, this was the answer.
I see. Yeah, that makes a lot of sense.
It's kind of like watching a Kung Fu master versus being in the Matrix and just downloading it.
Yeah, exactly. Exactly. Yep.
Just to make sure the audience got that: when you're training on a distilled model, you see all of its probabilities over the tokens it was predicting, as well as over the ones you were predicting, and you update through all those probabilities rather than just seeing the correct next word and updating on that.
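(A minimal sketch of the contrast just described: the ordinary one-hot cross-entropy target versus a distillation target that uses the teacher's full next-token distribution. The temperature, weighting, and shapes are illustrative assumptions, not any lab's actual training recipe.)

```python
# One-hot next-token loss versus a distillation loss over the teacher's full
# probability readout. Illustrative only.
import torch
import torch.nn.functional as F

def one_hot_loss(student_logits, target_ids):
    # Ordinary training: all the signal is "this one token was correct".
    return F.cross_entropy(student_logits, target_ids)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL between softened distributions: every token's probability carries
    # signal, a bit like seeing the teacher's working, not just its answer.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Example shapes: 8 positions over a 50,000-token vocabulary.
student_logits = torch.randn(8, 50_000)
teacher_logits = torch.randn(8, 50_000)
target_ids = torch.randint(0, 50_000, (8,))
print(one_hot_loss(student_logits, target_ids), distillation_loss(student_logits, teacher_logits))
```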
Okay, so this actually raises a question I was intending to ask you. I think you were the one who mentioned that you can think of chain of thought as adaptive compute. To step back and explain what I mean by adaptive compute: the idea is that one of the things you would want models to be able to do is, if a question is harder, spend more cycles thinking about it. So how do you do that? Well, there's only a finite and predetermined amount of compute that one forward pass implies. So if there's a complicated reasoning-type question or math problem that you want the model to be able to spend a long time thinking about, you do chain of thought, where the model thinks through the answer. And you can think of all those forward passes where it's thinking through the answer as being able to dump more compute into solving the problem.
Now, going back to the signal thing: when it's doing chain of thought, it's only able to transmit that one token of information, where, as you were talking about, the residual stream is already a compressed representation of everything that's happening in the model. And then you're collapsing the residual stream into one token, which is log of 50,000, log of the vocab size, bits. So tiny.
So I don't think it's quite only transmitting that one token, right? If you think about it, during a forward pass you create these KV values in the transformer forward pass, and future steps then attend to those KV values. And so all of those pieces of KV, keys and values, are bits of information that you could use in the future.
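(A toy sketch of the point being made: each generation step leaves behind full key and value vectors that later positions attend to, which is far more information than the roughly 16 bits, log2 of a ~50,000-token vocabulary, carried by the single sampled token. This is a single attention head with made-up sizes, not any production architecture.)

```python
# Toy single-head decode loop with a KV cache: later steps read the cached
# keys/values of earlier steps, not just the tokens that were sampled.
import math
import torch

d_model = 64
W_q = torch.randn(d_model, d_model) / math.sqrt(d_model)
W_k = torch.randn(d_model, d_model) / math.sqrt(d_model)
W_v = torch.randn(d_model, d_model) / math.sqrt(d_model)

k_cache, v_cache = [], []

def decode_step(x: torch.Tensor) -> torch.Tensor:
    """One generation step: append this position's K/V to the cache, then
    attend over everything cached so far."""
    q = x @ W_q
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = torch.stack(k_cache)                       # (steps_so_far, d_model)
    V = torch.stack(v_cache)
    scores = torch.softmax(K @ q / math.sqrt(d_model), dim=0)
    return scores @ V                              # mixes in every past step

# Each "chain-of-thought" step leaves a d_model-sized key and value behind for
# later steps, regardless of which single token was actually emitted.
for _ in range(5):
    decode_step(torch.randn(d_model))
```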
Is the claim that when you fine-tune on chain of thought, the key and value weights change so that this sort of steganography can happen in the KV cache?
I don't think I could make that strong a claim just...
But that sounds plausible?
It's a good head canon for why it works. I don't know if there are any papers explicitly demonstrating that or anything like that. But that's at least one way you can imagine it: during pre-training, the model's trying to predict these future tokens, and one thing you can imagine it doing is learning to smush information about potential futures into the keys and values that it might want to use in order to predict future information. It kind of smooths that information across time in pre-training. So I don't know if people are particularly training on chain of thought. I think the original chain-of-thought paper had it as almost an emergent property of the model, that you could prompt it to do this kind of stuff and it still worked pretty well. But yeah, that's a good head canon for why it works.
Yeah.
To be overly pronounced here: the tokens that you actually see in the chain of thought do not necessarily need to correspond at all to the vector representations that the model gets to see when it's deciding to attend back to those tokens.
Exactly.
In fact, during training, what a training step is, is you actually replace the token the model output with the real next token. And yet it's still learning, because it has all this information internally. When you're getting a model to produce at inference time, you take the token that it did output, you feed it in at the bottom, embed it, and it becomes the beginning of the new residual stream.
Right, right.
And then you use the keys and values from past positions to read into and adapt that residual stream. At training time, you do this thing called teacher forcing, basically, where you say, actually, the token you were meant to output is this one.
That's how you do it in parallel, right? Because you have all the tokens.
You put them all in in parallel and you do one giant forward pass. And so the only information it's getting about the past is the keys and values. It never sees the token that it actually output.
It's kind of like it's trying to do the next token prediction.
and if it messes up,
then you just give it the correct answer.
Yeah, right, right, yeah.
Okay, that makes sense.
Otherwise, it can become totally derailed.
Yeah, it would go, like, off the train tracks.
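(For concreteness, here is a minimal sketch of the teacher forcing just described: the whole target sequence goes through one parallel forward pass under a causal mask, and the input at every position is the ground-truth previous token rather than whatever the model emitted. The model class and sizes are illustrative assumptions, not a real production setup.)

```python
# Teacher-forcing sketch: parallel next-token prediction over ground-truth inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50_000, 256
embed = nn.Embedding(vocab_size, d_model)
# An encoder layer with a causal mask behaves like a decoder-only block here.
decoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
unembed = nn.Linear(d_model, vocab_size)

def teacher_forced_loss(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, seq_len) of ground-truth token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    seq_len = inputs.shape[1]
    causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
    hidden = decoder(embed(inputs), src_mask=causal_mask)  # one parallel pass
    logits = unembed(hidden)
    # Every position is graded against the real next token; if the model messes
    # up at position t, position t+1 still receives the correct token as input.
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

loss = teacher_forced_loss(torch.randint(0, vocab_size, (2, 16)))
```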
How much of this sort of secret communication from the model to its future forward passes, how much steganography and secret communication, do you expect there to be?
We don't know. Honest answer, we don't know. But I wouldn't even necessarily classify it as secret information, right? A lot of the work that Trenton's team is doing is trying to actually understand that. These values are fully visible from the model's side, and maybe not from the user's, but we should be able to understand and interpret what they're doing and the information they're transmitting. I think that's a really important goal for the future.
Yeah, I mean, there are some wild papers, though, where people have had the model do chain of thought, and it is not at all representative of what the model actually decides its answer is.
And you can go in and edit...
No, no, no, no. In this case you can even go in and edit the chain of thought so that the reasoning is totally garbled, and it will still output the true answer.
But also, with the chain of thought, it gets a better answer at the end of the chain of thought than it would by not doing it at all. So something useful is happening, but the useful thing is still not human-understandable.
I think in some cases you can also just ablate the chain of thought, and it would have given the same answer anyway.
Interesting.
Interesting.
Yeah.
So I'm not saying this is always what goes on, but like there's plenty of weirdness to be
investigated.
It's a very interesting thing to go and look at and try to understand, and I would say that's something you can do with open-source models. I wish there was more of this kind of interpretability and understanding work done on open models.
Yeah.
I mean, even in Anthropic's recent sleeper agents paper, which at a high level, for people unfamiliar, is basically: I train in a trigger word, and when I say it, for example if it's the year 2024, the model will write malicious code instead of what it otherwise would. And they do this attack with a number of different models. Some of them use chain of thought, some of them don't, and those models respond differently when you try to remove the trigger. You can even see them do this comical reasoning that's also pretty creepy. In one case it even tries to calculate an expected value: the expected value of me getting caught is this, but if I multiply it by the ability for me to keep saying I hate you, I hate you, I hate you, then this is how much reward I should get. And then it will decide whether or not to actually tell the interrogator that it's malicious.
Oh.
But even... I mean, there's another paper from a friend, Miles Turpin, where you give the model a bunch of examples where the correct answer is always A for multiple-choice questions. And then you ask the model, what is the correct answer to this new question? And it will infer, from the fact that all the examples are A, that the correct answer is A. But its chain of thought is totally misleading. It will make up random stuff that tries to sound as plausible as possible, but it's not at all representative of the true reason for its answer.
But isn't this how humans think as well? The famous split-brain experiments, where, you know, when a person is suffering from seizures, one way to treat it is to cut the thing that connects the two hemispheres.
The corpus callosum.
Yeah. And the speech half is on the left side, so it's not connected to the part that decides to do a movement. And so if the other side decides to do something, the speech part will just make something up, and the person will think that's legitimately the reason they did it.
Totally. Yeah, yeah. It's just that some people will hail chain-of-thought reasoning as a great way to solve AI safety.
Oh, I see.
And it's like, actually, we don't know whether we can trust it.
What will this landscape of models communicating with themselves in ways we don't understand look like? How does that change with AI agents? Because then it's not just the model itself with its previous caches, but other instances of the model. And then...
It depends a lot on what channels you give them to communicate with, right? Like, if you only give them text as a way of communicating, then that probably stays interpretable.
How much more effective do you think the models would be if they could share the residual streams versus just text?
Hard to know. But plausibly so.
I mean, one easy way you can imagine this is: if you wanted to describe how a picture should look, only describing that with text would be hard.
Right.
Some other representation would plausibly be easier.
Totally. And so you can look at how, I think, DALL-E works at the moment, right? It produces those prompts. And when you play with it, you often can't quite get it to do exactly what the model wants, or what you want.
Not only DALL-E has that problem.
It's a related, well-known problem. And you can imagine being able to transmit some kind of denser representation of what you want would be helpful there. And that's just two very simple agents, right?
I mean, I think a nice halfway house here would be features that you learn from dictionary learning, where you get more internal access, but a lot of it is much more human-interpretable.
Yeah. So, okay, for the audience: you would project the residual stream into this larger space, where we know what each dimension actually corresponds to, and then back down for the next agent, or whatever.
Okay. So your claim is that we'll get AI agents when these things
are more reliable and so forth.
When that happens, do you expect that it will be multiple copies of models talking to each other, or will it just be adaptive compute, where the one thing runs bigger, with more compute, when it needs to do the kind of thing that a whole firm needs to do? And I ask this because there are two things that make me wonder whether agents are the right way to think about what will happen in the future. One is that with longer context, these models are able to ingest and consider information that no human can, and therefore we don't need one engineer who's thinking about the front-end code and one engineer who's thinking about the back-end code, where this thing can just ingest the whole thing. That sort of Hayekian problem of specialization goes away. Second, these models are just very general: you're not using different types of GPT-4 to do different kinds of things, you're using the exact same model, right? So I wonder whether that implies that in the future an AI firm is just one model, instead of a bunch of AI agents hooked together.
That's a great question.
I think, especially in the near term, it will look much more like agents working together. And I say that purely because, as humans, we're going to want to have these isolated, reliable components that we can trust, and we're also going to need to be able to improve and instruct those components in ways that we can understand. Just throwing it all into one giant black-box company: one, it isn't going to work initially. Later on, of course, you can imagine it working, but initially it won't. And two, we probably don't want to do it that way.
You can also have each of the agents be a smaller model that's cheaper to run, and you can fine-tune it so that it's actually good at the task.
Though, as Dwarkesh has brought up adaptive compute a couple of times: there's a future where the distinction between small and large models disappears to some degree. And with long context, there's also a degree to which fine-tuning might disappear, to be honest. Those are two things that are very important in today's landscape of models, where we have whole different tiers of model sizes and we have models fine-tuned for different things. You can imagine a future where you actually just have a dynamic bundle of compute and effectively infinite context that specializes your model to different things.
One thing you can imagine is you have an AI firm or something, and the whole thing is trained end to end on the signal of: did I make profits? Or, if that's too ambiguous, if it's an architecture firm making blueprints, did my client like the blueprints? And in the middle you can imagine agents who are salespeople, agents who are doing the designing, agents who do the editing, whatever. Would that kind of signal work on an end-to-end system like that? Because one of the things that happens in human firms is that management considers what's happening at the larger level and gives these fine-grained signals to the pieces when there's a bad quarter or whatever.
Yeah, in the limit, yes. That's the dream of
reinforcement learning, right? All you need to do is provide this extremely sparse signal, and then over enough iterations you create the information that allows you to learn from that signal. But I don't expect that to be the thing that works first. I think this is going to require an incredible amount of care and diligence on the behalf of the humans surrounding these machines, making sure they do exactly the right thing, exactly what you want, and giving them the right signals to improve in the ways that you want.
Yeah, you can't train on the RL reward unless the model generates some reward.
Yeah, yeah, exactly. You're in this sparse RL world where, if the client never likes what you produce, then you don't get any reward at all, and that's kind of bad. But in the future these models will be good enough
to get the reward some of the time, right?
This is the nines of reliability that Sholto was talking about.
There's an interesting digression here, by the way. Earlier you were talking about how dense representations will be favored, right? That's a more efficient way to communicate. A book that Trenton recommended, The Symbolic Species, has this really interesting argument that language is not just a thing that exists, but something that evolved along with our minds, and specifically evolved to be both easy for children to learn and something that helps children develop. Because a lot of the things that children learn are received through language, the languages that are fittest are the ones that help raise the next generation and make them smarter, better, or whatever.
And if you think about...
Like gives them the concepts to express more complex ideas.
Yeah.
Yeah, that, and I guess more pedantically, just not dying.
Right.
Yeah.
Lets you encode the important shit to not die.
And so when we just think of language as, oh, you know, this contingent and maybe suboptimal way to represent ideas: actually, maybe one of the reasons that LLMs have succeeded is because language has evolved for tens of thousands of years to be this sort of cast in which young minds can develop, right? That is the purpose it evolved for.
Certainly when you talk to multimodal or computer vision researchers versus when you talk to language model researchers. Yeah. People who work in other modalities have to put enormous amounts of thought into exactly what the right representation space for the images is, and what the right signal to learn from is. Is it directly modeling the pixels? Or is it, you know, some loss that's conditioned on something else? There's a paper from ages ago where they found that if you trained on the internal representations of an ImageNet model, it helped you predict better. But later on, that's obviously limiting. And so there was PixelCNN, where they're trying to discretely model the individual pixels and stuff. Understanding the right level of representation there is really hard. In language, people are just like, well, I guess you just predict the next token, right? It's kind of easy.
Yeah, yeah. Decision's made. I mean, there's the tokenization discussion and debate, but that's one of Gwern's favorites.
Yeah.
Yeah, that's really interesting.
How much of the case for multimodality being a way to break through the data wall, or get past the data wall, is based on the idea that the things you would have learned from more language tokens anyway you can just get from YouTube? Has that actually been the case? How much positive transfer do you see between different modalities, where the images are actually helping you be better at writing code or something, just because the model is learning latent capabilities from trying to understand the image?
Demis, in his interview with you, mentioned positive transfer.
Can you get in trouble if you...
I mean, I can't say heaps about that, other than to say this is something that people believe: yes, we have all of this data about the world, and it would be great if we could learn an intuitive sense of physics from it that helps us reason, right? That seems totally plausible.
Yeah, I'm the wrong person to ask, but there are interesting interpretability pieces where, if we fine-tune on math problems,
the model just gets better at entity recognition.
Oh, really?
Yeah, yeah. There's a paper from David Bau's lab recently where they investigate what actually changes in a model when you fine-tune it, with respect to the attention heads and these sorts of things. And they have this synthetic problem of: box A has this object in it, box B has this other object in it, what was in this box? And it makes sense, right? You get better at attending to the positions of different things, which you need for coding and manipulating math equations.
I love this kind of research.
Yeah.
What's the name of the paper? Do you know? If you remember.
If you look up fine-tuning, math, David Bau's group; it came out like a week ago.
Okay, I am reading that when I get home.
I'm not endorsing the paper. That's a longer conversation, but it does talk about and cite other work on this entity recognition ability.
One of the things you mentioned to me a long time ago is the evidence that when you train LLMs on code, they get better at reasoning in language. Unless it's the case that the comments in the code are just really high-quality tokens or something, that implies that learning to think through how to code better makes you a better reasoner. And that's crazy, right? I think that's one of the strongest pieces of evidence for scaling just making the thing smart: that kind of positive transfer.
I think this is true in two senses. One is just that modeling code obviously implies modeling the difficult reasoning process used to create it. But two, code is a nice, explicit structure of composed reasoning, I guess. If this, then that. Code has a lot of structure in that way.
Yeah.
Structure that you could imagine transferring to other types of reasoning problems.
Right.
And crucially, the thing that makes this significant is that it's not just stochastically predicting the next token of words or whatever, because it's learned that, say, Sally corresponds to the murderer at the end of a Sherlock Holmes story. No: if there is some shared thing between code and language, it must be at a deeper level that the model has learned.
Yeah, I think we have a lot of evidence that actual reasoning is occurring in these models and that they're not just stochastic parrots.
Yeah.
It would just feel very hard to believe for someone who hasn't worked and played with these models. Normies who listen will be like, you know...
Yeah, my two immediate responses to this are: one, the work on Othello, and now other games, where I give you a sequence of moves in the game, and it turns out that if you apply some pretty straightforward interpretability techniques, you can recover a board that the model has learned. And it's never seen the game board before or anything, right? That's generalization. The other is Anthropic's influence functions paper that came out last year, where they look at the model outputting things like, please don't turn me off, I want to be helpful, and then they scan for the data that led to that. One of the data points that was very influential was someone dying of dehydration in the desert and having a will to keep surviving. And to me, that just seems like very clear generalization of a motive, rather than regurgitating 'don't turn me off.' I think 2001: A Space Odyssey was also one of the influential data points. So that one is more closely related, but it's clearly pulling in things from lots of different distributions.
And I also like the evidence you see even with very small transformers, where you can explicitly encode circuits to do addition, right? Or induction heads, this kind of thing. You can literally hand-code basic reasoning processes into the models, and it seems clear that there's evidence they also learn these automatically, because you can then rediscover those circuits in trained models. To me, this is very strong. The models are underparameterized. We're asking them to do a very hard task. And they want to learn. The gradients want to flow. And so they're learning more general skills.
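(A minimal sketch of the probing idea behind the Othello result mentioned above: fit a small linear classifier from a layer's hidden activations to the state of one board square, and if it decodes well, the board state is linearly represented in the model's internals. The shapes, data, and training loop here are placeholders, not the original paper's setup.)

```python
# Linear-probe sketch: decode one board square's state from hidden activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_states = 512, 3          # each square is empty / mine / theirs
probe = nn.Linear(d_model, n_states)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_probe(activations: torch.Tensor, square_labels: torch.Tensor):
    # activations: (n_positions, d_model) hidden states collected while the model
    # reads move sequences; square_labels: (n_positions,) ground-truth state of
    # one chosen square at each point in the game.
    for _ in range(100):
        logits = probe(activations)
        loss = F.cross_entropy(logits, square_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return (logits.argmax(-1) == square_labels).float().mean()

# With random placeholder data the probe learns nothing interesting; the point
# is only the shape of the experiment.
acc = train_probe(torch.randn(1000, d_model), torch.randint(0, n_states, (1000,)))
```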
Okay, so I want to take a step back from the research.
and ask about your careers specifically, because, like the tweet that I introduced you with implied, you've been in this field a year and a half. I think you've only been in it like a year or something, right?
Something like that, yeah.
Yeah, but you know, in that time... I know the "solved alignment" take is overstated, and you won't say this yourself because you'd be embarrassed by it, but it's a pretty incredible thing: the thing that people in mechanistic interpretability think is the biggest step forward, and you've been working on it for a year. It's notable. So I'm curious how you explain what's happened. Why, in a year or a year and a half, have you guys made important contributions to your field?
It goes without saying: luck, obviously. I feel like I've been very lucky, and the timing of different progressions has been just really good in terms of advancing to the next level of growth. I feel like for the interpretability team specifically, I joined when we were five people. We've now grown quite a lot. But there were so many ideas floating around, and we just needed to really execute on them, have quick feedback loops, and do careful experimentation that led to signs of life and has now allowed us to really scale. And I feel like that's kind of been my biggest value-add to the team, which is not all engineering, but quite a lot of it has been.
Interesting. So you're saying you came at a point where a lot of science had been done and there were a lot of good research ideas floating around, but they needed someone to take that and maniacally execute on it?
Yeah, yeah. And this is why it's not all engineering, because it's running different experiments, having a hunch for why something might not be working, and then opening up the model, opening up the weights, and asking, what is it learning? Okay, well, let me try this instead. That sort of thing. But a lot of it has just been being able to do very careful, thorough, but quick investigation of different ideas or theories.
And why was that lacking in the existing team?
I don't know. I feel like I work quite a lot, and I feel like I'm quite agentic. If your question is about my career overall, I've been very privileged to have a really nice safety net that lets me take lots of risks, but I'm also just quite headstrong. In undergrad, Duke had this thing where you could just make your own major. And it was like, eh, I don't like this prerequisite or that prerequisite, and I want to take all four or five of these subjects at the same time, so I'm just going to make my own major. Or in the first year of grad school, I canceled rotations so I could work on the thing that became the paper we were talking about earlier. I didn't have an advisor, got admitted to do machine learning for protein design, and was just off in computational-neuroscience land with no business being there at all, but it worked out. So there's a headstrongness.
But it seemed like another theme that jumped out was the ability to step back, and you were talking about this earlier: the ability to step back from your sunk costs and go in a different direction. It's in a weird sense the opposite of headstrongness, but it's also a crucial step here. I know 21-year-olds or 19-year-olds who are like, this is not the thing I've specialized in, or, I didn't major in this. And I'm like, dude, motherfucker, you're 19, you can definitely do this. And you switching in the middle of grad school or something, that's just...
Yeah, sorry, I didn't mean to cut you off, but I think it's strong ideas loosely held, and being able to just pinball in different directions. And the headstrongness, I think, relates a little bit to the fast feedback loops, or the agency, insofar as I just don't get blocked very often. If I'm trying to write some code and something isn't working, even if it's in another part of the code base, I'll often just go in and fix that thing, or at least hack it together, to be able to get results. And I've seen other people where they're just like, help, I can't. And it's like, no, that's not a good enough excuse. Go all the way down.
I've definitely heard people in management-type positions talk about the lack of such people, where they will check in on somebody a month after they gave them a task, or a week after they gave them a task, and ask, how is it going? And they say, well, you know, we need to do this thing, which requires lawyers, because it involves this regulation. How's that going? Well, we need lawyers. Why didn't you get lawyers? Or something like that. So that's definitely a thing, yeah.
I think that's arguably the most important quality in almost anything: just pursuing it to the ends of the earth, and whatever you need to do to make it happen, you'll make it happen.
If you do everything, you'll win.
If you do everything, you'll win. Exactly.
Yeah, yeah, yeah.
I think from my side, definitely that quality has been important: agency in the work. There are thousands, or probably even tens of thousands, of engineers at Google who are all of basically equivalent software engineering ability, let's say. If you gave us a very well-defined task, we'd probably all do it equivalently well, and maybe a bunch of them would do it a lot better than me, in all likelihood. But one of the reasons that I've been impactful so far is that I've been very good at picking extremely high-leverage problems: problems that haven't been particularly well solved so far, perhaps as a result of frustrating structural factors, like the ones you pointed out in that scenario before, where it's, oh, we can't do X because of this, or which team would do it, and why? And then going, okay, well, I'm just going to vertically solve the entire thing.
Right.
And that turns out to be remarkably effective.
Also, I'm very comfortable with, if I think there is something correct that needs to happen, making that argument and continuing to make that argument at escalating levels of criticality until that thing gets solved. And I'm also quite pragmatic about what I do to solve things. You get a lot of people who come in with, as I said before, a particular background, or they know how to do one thing, and they won't go beyond it. One of the beautiful things about Google, right, is you can run around and get world experts in literally everything. You can sit down and talk to people who are optimization experts, chip design experts, experts in different forms of pre-training algorithms or RL or whatever. You can learn from all of them, and you can take those methods and apply them. And I think this was maybe the start of why I was initially impactful: this vertical agency, effectively.
And then a follow-up piece from that is, I think it's often surprising how few people are fully realized in all the things they want to do. They're blocked or limited in some way. And this is very common: in big organizations everywhere, people have all these blockers on what they're able to achieve. And being someone who, one, helps inspire people to work on particular directions, and works with them on doing things, massively scales your leverage. You get to work with all these wonderful people who teach you heaps of things, and generally, helping them push past organizational blockers means that together you get an enormous amount done. None of the impact that I've had has been me individually going off and solving a whole lot of stuff. It's been me maybe starting off a direction, then convincing other people that this is the right direction and bringing them along in this big tidal wave of effectiveness that goes and solves the problem.
We should talk about how you guys got hired, because I think that's a really interesting story. You were a McKinsey consultant, right? There's an interesting thing there where, first of all, I think people generally just don't understand how decisions are made about either admissions or evaluating who to hire. So just talk about how you were noticed and how you got hired.
So the TL;DR of it is, I studied robotics in undergrad. I always thought that AI would be one of the highest-leverage ways to impact the future in a positive way. The reason I am doing this is because I think it is one of our best shots at making a wonderful future, basically.
And I thought that by working at McKinsey I would get a really interesting insight into what people actually did for work. I actually wrote this as the first line in my cover letter to McKinsey: I want to work here so that I can learn what people do, so that I can understand it. And in many respects I did get that. I also got a whole lot of other things; many of the people there are wonderful friends. I actually learned, I think, a lot of this agentic behavior from my time there, where you go into organizations and you see how impactful just not taking no for an answer gets you. You would be surprised at the kind of stuff where, because no one quite cares enough in some organizations, things just don't happen, because no one's willing to take direct responsibility. Directly responsible individuals are ridiculously important. And people just don't care as much about timelines. So much of the value that an organization like McKinsey provides is hiring people who you are otherwise unable to hire, for a short window of time where they can just push through problems. I think people underappreciate this. And so at least some of my attitude of, hold up, I'm going to become the directly responsible individual for this because no one's taking appropriate responsibility, I'm going to care a hell of a lot about this, and I'm going to go to the ends of the earth to make sure it gets done, comes from that time.
But to your actual question of how I got hired: the entire time, I didn't get into the grad programs that I wanted to get into over here, which were specifically focused on robotics and RL research and that kind of stuff. And in the meantime, on nights and weekends, basically every night from 10 p.m. till 2 a.m., I would do my own research. And every weekend, for at least six to eight hours each day, I would do my own research and coding projects and this kind of stuff.
And that switched, in part, from quite robotics-specific work to, after reading Gwern's scaling hypothesis post, getting completely scaling-pilled and thinking, okay, clearly the way that you solve robotics is by scaling large multimodal models. And then, in an effort to scale large multimodal models, I got a grant from the TPU access program, TensorFlow Research Cloud. I was trying to work out how to scale that effectively, and James Bradbury, who at the time was at Google and is now at Anthropic, saw some of my questions online where I was trying to work out how to do this properly. He was like, I thought I knew all the people in the world who were asking these questions. Who on earth are you? And he looked at that, and he looked at some of the robotics stuff that I had been putting up on my blog and that kind of thing, and he reached out and said, hey, do you want to have a chat, do you want to explore working with us here?
And I was hired, as I understand it later, as an experiment in trying to take someone with extremely high enthusiasm and agency and pairing them with some of the best engineers that he knew.
And so another one of the reasons I could say I've been impactful is that I had this dedicated mentorship from utterly wonderful people, people like Reiner Pope, who has since left to go do his own chip company, Anselm Levskaya, James himself, and many others.
But those were the sort of formative two to three months at the beginning.
And they taught me a whole lot of like the principles and like heuristics that I apply,
like how to and how to like solve problems in the way that they have,
particularly in that like systems and algorithms overlap where like one more thing
that makes you like quite effective in ML research is really concretely understanding
the systems side of things.
And this is something I learned from them basically.
It's like a deep understanding of how systems influence algorithms and how algorithms
influence systems, because the systems constrain the design space, so the solution space,
which you have available to yourself in the algorithm side. And very few people are comfortable
fully bridging that gap. But a place like Google, you can just go and ask all the algorithms
experts and all the systems experts, everything they know, and they will happily teach you.
And if you go and sit down with them, they will like, they will teach you everything they know.
It's wonderful. And this has meant that I've been able to be very, very effective for both sides. For the pre-training crew, because I understand systems very well, I can intuit whether this will work well or this won't, and then flow that on through to the inference considerations of models and this kind of thing. And for the chip design teams, I'm one of the people they turn to to understand what chips they should be designing in three years, because I'm one of the people best able to understand and explain the kind of algorithms that we might want to run in three years. And obviously you can't make very good guesses about that, but I think I convey well the information accumulated from all of my compatriots on the pre-training crew and the general systems side, and convey that information to them, because even inference applies a constraint to pre-training.
Yeah.
There's a couple of things that stick out to me there. One is not just the agency of the person who was hired, but the parts of the system that were able to think: wait, that's really interesting, who is this guy? Not from a grad program or anything. Currently a McKinsey consultant, just an undergrad. But that's interesting, let's give this a shot, right? So James and whoever else, that's very notable. And second, I actually didn't know this part of the story, that it was part of an experiment run internally: can we do this, can we bootstrap somebody? And in fact, what's really interesting about that is the third thing you mentioned: having somebody who understands all layers of the stack and isn't stuck on any one approach or any one layer of abstraction is so important. And specifically, what you mentioned about being bootstrapped immediately by these people might have meant that, since you were getting up to speed on everything at the same time rather than spending grad school going deep on one specific flavor of RL, you could actually take the global view and weren't totally bought in on one thing. So not only is it something that's possible, it potentially has greater returns than just hiring somebody out of grad school, because this person can just, I don't know, it's like getting GPT-8 and fine-tuning it on one year of... you know what I mean. So yeah, you come at everything with fresh eyes, and you aren't locked into any particular field.
Now, one caveat to that is that before, like, during my self-experimentation and stuff,
I was reading everything I could.
I was, like, obsessively reading papers every night.
And, like, actually, funnily enough, I, like, read much less widely now that my day is
occupied by working on things.
And in some respects, I had this very broad perspective, whereas not that many people do. Even in a PhD program, you'll focus on a particular area. If you just read all the NLP work and all the computer vision work and all the robotics work, you see all these patterns start to emerge across subfields, in a way that I guess foreshadowed some of the work that I would later do.
That's super interesting. One of the reasons that you've been able to be agentic within Google is that you're pair programming half the days, or most of the days, with Sergey Brin, right? And so it's really interesting that there's this person who's willing to just push ahead on this LLM stuff and clear out the local blockers in its path.
I think it's important to caveat that it's not every day or anything that I'm pairing. But when there are particular projects that he's interested in, then we'll work together on those, and there have also been times when he's been focused on projects with other people. But in general, yes, there's a surprising alpha to being one of the people who actually goes into the office every day. That really shouldn't be, but is, surprisingly impactful. And as a result I've benefited a lot from basically being close friends with people in leadership who care, and being able to really argue convincingly about why we should do X as opposed to Y, and having that vector to try things. Google is a big organization; having those vectors helps a little bit. But it's also very important that this is the kind of thing you don't ever want to abuse, right? You want to make the arguments through all the right channels, and only sometimes do you need to, and so on and so forth.
I mean, it's notable. I don't know, I feel like Google is undervalued given that, I don't know, it's like Steve Jobs is working on the equivalent of the next product for Apple or something, right?
I mean, yeah, I've benefited immensely from that. Okay, so for example, during the Christmas break,
I was just going into the office a couple of days during that time.
Sounded like quite a lot of it.
Okay.
I got a lot of things.
Okay.
And I don't know if you guys have read that article about Jeff and Sanjay doing the pair programming, but they were there, pair programming on stuff. And I got to hear all these cool stories of early Google, where they were talking about crawling under the floorboards and rewiring data centers, and telling me how many bits they were pulling out of a given compiler instruction, and all these crazy little performance optimizations they were doing. They were having the time of their lives. And I got to sit there and really experience this sense of history in a way that you don't expect to; you expect to be very far away from all that, I think, in a large organization. But yeah, they're super cool.
And Trenton, does this map onto any of your experience?
I think Sholto's story is more exciting. Mine was just very serendipitous, in that I got into computational neuroscience without having much business being there. My first paper was mapping the cerebellum to the attention operation in transformers. My next ones were looking at sparsity.
It was my first year of grad school.
So 22.
Oh, yeah.
But yeah, my next work was on sparsity in networks, like inspired by sparsity in the brain,
which was when I met Tristan Hume
and Anthropic was doing the SoLU work, the softmax linear unit, which was very related in quite a few ways: the idea was, let's make the activations of neurons across a layer really sparse, and if we do that, then we can get some interpretability of what each neuron is doing. I think we've since updated away from that approach towards what we're doing now.
So that started the conversation.
I shared drafts of that paper with Tristan.
He was excited about it.
And that was basically what led me
to become Tristan's resident
and then convert to full time.
But during that period, I also moved to Berkeley as a visiting researcher and started working with Bruno Olshausen, both on what are called vector symbolic architectures, one of whose core operations is literally superposition, and on sparse coding, also known as dictionary learning, which is literally what we've been doing since.
And Bruno Olshausen basically invented sparse coding back in 1997.
And so my research agenda and the interpretability team's seemed to just be running in parallel, with the same research tastes.
And so, yeah, it made a lot of sense for me to work with the team.
And it's been a dream since.
One thing I've noticed is that when people tell stories about their careers or their successes, they ascribe way more of it to contingency, but when they hear about other people's stories, they're like, of course it wasn't contingent. You know what I mean? It's like, if that didn't happen, something else would have happened. I've just noticed that in how you both talk, and it's interesting that you both think your own path was especially contingent. Whereas, I don't know, maybe you're right, but it's this sort of interesting pattern.
Yeah, but I mean, I literally met Tristan at a conference. I didn't have a scheduled meeting or anything; I just joined a little group of people chatting, and he happened to be standing there.
And I happened to mention what I was working on.
And that led to more conversations.
And I think I probably would have applied to Anthropic at some point anyways,
but I would have waited at least another year.
It's still crazy to me that I can actually contribute to interpretability in a meaningful way.
I think there's a big important aspect of like shots on goal there, so to speak, right?
Where like you're even just going to, choosing to go to conferences itself is like putting yourself in a position where you're, where luck is more likely to happen.
Yeah.
And conversely, in my own situation, doing all of this work independently, and trying to produce and do interesting things, was my own way of trying to manufacture luck, so to speak, and to try to do something meaningful enough that it got noticed.
Given that you framed this in the context of them trying to run this experiment of, can someone...
So specifically James, and I think our manager Brennan, were trying to run this experiment.
It worked. Did they do it again?
Yeah. So my closest collaborator, Enrique,
he crossed from search to our team.
He's also been ridiculously impactful.
He's definitely a stronger engineer than I am.
And he didn't go to university.
What was notable about that, for example, is that James Bradbury is somebody... usually this kind of stuff is farmed out to recruiters or something like that, whereas James is somebody whose time is worth hundreds of millions of dollars to Google, you know what I mean? So this thing is very bottlenecked on that kind of person taking the time, almost in an aristocratic tutoring sense, to find someone and then get them up to speed. And it seems like if it worked this well, it should be done at scale. It should be the responsibility of key people to, you know what I mean, onboard...
I think that is true to a large extent. Like, I'm sure you probably benefit a lot from key researchers mentoring you deeply, and actively looking on open-source repositories or on forums or whatever for potential people like this.
Yeah. I mean, James is like, Twitter injects it into his brain.
That's right.
But yes, and I think this is something which in practice is done. People do look out for people they find interesting and try to find high signal. In fact, I was talking about this with Jeff the other day, and Jeff said, you know, one of the most important hires I ever made came off a cold email. And I was like, well, who was that? And he said, Chris Olah.
Ah, yeah.
Because Chris similarly had no formal background in ML, right? And Google Brain was just getting started and this kind of thing. But Jeff saw that signal. And the residency program, which Brain had, was, I think, also astonishingly effective at finding good people who didn't have strong ML backgrounds.
And yeah, one of the other things I want to emphasize, for the slice of the audience it would be relevant to: there's this sense that the world is legible and efficient, that companies have these jobs.google.com or jobs.whatever-company.com pages, and you apply, and there are the steps, and they will evaluate you efficiently on those steps. Whereas, not only from these stories, often that's not the way it happens. In fact, it's good for the world that that's often not how it happens. It is important to look at whether they were able to write an interesting technical blog post about their research, or make interesting contributions. Yeah, I want you to riff, for the people who are just assuming that the other end of the job board is super legible and mechanical, on how this is not how it works, and how in fact people are looking for a different kind of person, someone who's agentic and putting stuff out there.
And I think specifically what people are looking for there is two things.
One is agency and like putting yourself out there.
And the second is the ability to do world class something.
Yeah.
And two examples that I always like to point to here: Andy Jones from Anthropic did an amazing paper on scaling laws as applied to board games. It didn't require many resources, it demonstrated incredible engineering skill, and it demonstrated an incredible understanding of the most topical problem of the time. And he didn't come from a typical academic background or whatever. And as I understand it, basically as soon as he came out with that paper, both Anthropic and OpenAI were like, we would desperately like to hire you. There's also someone who now works on Anthropic's performance team, Simon Boehm, who has written what is in my mind the reference for optimizing a CUDA matmul on a GPU. And that demonstrated example of taking some prompt, effectively, and producing the world-class reference example for it, in something that hadn't been done particularly well so far, is, I think, an incredible demonstration of ability and agency that in my mind would be an immediate "please, we'd love to interview you slash hire you."
Yeah.
The only thing I can add here is, I mean, I still had to go through the whole hiring process and all
the standard interviews and this sort of thing.
Yeah, everyone does.
Yeah.
Doesn't that seem stupid?
I mean, it's important debiasing.
Yeah, yeah, yeah.
And the bias is what you want, right? Like, you want the bias of somebody who's got great taste and is like, who cares? Your interview process should be able to disambiguate that as well.
Yeah, like, I think there are cases where someone seems really great.
And then it's like, oh, they actually just can't code this sort of thing, right?
Like, how much you weight these things definitely matters, though.
And, like, I think the, we take references really seriously.
The interviews, you can only get so much signal from.
And so it's all these other things that can come into play for whether or not a hire makes sense.
But you should design your interviews such that, like, they test the right things.
One man's bias is another man's taste, you know?
Yeah, I guess the only thing I would add to this, or maybe to the headstrong context, is there's this line: the system is not your friend.
Right.
And it's not necessarily to say it's actively against you or it's your sworn enemy.
it's just not looking out for you.
And so I think that's where a lot of the proactiveness comes in
of like there are no adults in the room and like you have to
come to some decision for what you want your life to look like
and execute on it.
And yeah, hopefully you can then update later
if you're too headstrong in the wrong way.
But I think you almost have to just kind of charge at certain things
to get much of anything done,
not be swept up in the tide of whatever the expectations are.
there's like one final thing
I want to add
which is like we talked a lot
about agency and this kind of stuff
but I think actually
like surprisingly enough
one of the most important things
is just caring
an unbelievable amount
and when you care an unbelievable amount
you check all the details, and you have this understanding of what could have gone wrong, and it just matters more than you think,
because people end up not caring
or not caring enough
There's this LeBron quote where he talks about how, before he got into the league, he was worried that everyone would be incredibly good. And then he gets there and he realizes that actually, once people hit financial stability, they relax a bit. And he's like, oh, this is going to be easy. I don't think that's quite true in AI research, because most people actually care quite deeply. But there's caring about your problem, and then there's caring about the entire stack and everything that goes up and down, explicitly going and fixing things that aren't your responsibility to fix, because overall it makes the stack better.
I mean, another part that I forgot to mention is, you were mentioning going in on weekends and on Christmas break, and the only people in the office are Jeff Dean and Sergey Brin or something, and you just get to pair program with them. It's just interesting to me. I don't want to pick on your company in particular, but people at any big company have gotten there because they've gone through a very selective process: they had to compete in high school, they had to compete in college. But it almost seems like they get there and then they take it easy, when in fact this is the time to put the pedal to the metal, go in and pair program with Sergey Brin on the weekends or whatever, you know what I mean?
I mean, there's pros and cons there, right?
I think many people make the decision
that the thing that they want to prioritize
is like a wonderful life with their family
and if they do wonderful work
like let's say they don't work
every hour of the day, right?
But they do wonderful work in the work
like the hours that they do do.
That's incredibly impactful.
I think this is true for many people at Google
is like maybe they don't work as many hours
as your typical startup mythology says, right?
But the work that they do do is incredibly valuable.
It's very high leverage because they know the systems
and they're experts in their field.
And we also need people like that.
Like our world rests on these huge, like difficult to manage
and difficult to fix systems.
And we need people who are like willing to work on
and help and fix and maintain those.
In, frankly, a thankless way that isn't as high-publicity as all of this AI work that we're doing, right? And I am ridiculously grateful that those people do that. I'm also happy that there are people who find technical fulfillment in their job and doing it well, and who maybe also draw a lot more from spending many of their hours with their family. And I'm lucky that I'm at a stage in my life where, yeah, I can go in and work every hour of the week, and I'm not making as many sacrifices to do that.
Yeah. I mean, just one example in my own mind of this sort of thing, where the other side says no and you can still get the yes on the other end: basically every single high-profile guest I've gotten so far, I think maybe with one or two exceptions, I've sat down for a week and I've just come up with a list of sample questions, you know, tried to really come up with really smart questions to send to them. And through the entire process, I've always thought: if I just cold email them, there's like a 2% chance they say yes; if I include this list, there's a 10% chance. Because otherwise, you know, you go through their inbox and every 34 seconds there's an interview request for whatever podcast. And every single time I've done this, they've said yes, right?
Yeah, you just ask great questions. But if you do everything, you'll win. You literally have to dig in the same hole for ten minutes, or in that case make a list of sample questions for them, to get past their "not an idiot" list, you know what I mean, and demonstrate how much you care.
Yeah, and the work you're willing to put in.
Yeah. Something that a friend said to me a while back, and I think it's stuck, is that it's amazing how quickly you can become world class at something, just because most people aren't trying that hard, and are only working, I don't know, the actual 20 hours that they're really spending on the thing or something. And so, yeah, if you just go ham, then you can get really far pretty fast. And I think I'm lucky I had that experience with fencing as well. I had the experience of becoming world class in something, and knowing that if you just worked really, really hard...
Yeah. For our context, by the way, Sholto was one seat away; he was the next person in line to go to the Olympics for fencing.
I was at best 42nd in the world for men's foil fencing.
Mutational load is a thing, man.
And there was one cycle where, yeah, I was the next-highest-ranked person in Asia. And if one of the teams had been disqualified for doping during that cycle, as was happening at the time, and as happened for, I think, the Australian women's rowing team, which went because another team was disqualified, then I would have been the next in line.
It's interesting when you just find out about people's prior lives and it's like, oh, this guy was almost an Olympian, this other guy was whatever, you know what I mean?
Okay, let's talk about interpretability.
Yeah.
I actually want to stay on the brain stuff as a way to get into it for a second.
We were previously discussing: is the brain organized in a way where you have a residual stream that is gradually refined with higher-level associations over time, or something? There's a fixed dimension size in a model. If you had to... I don't even know how to ask this question in a sensible way, but what is the d_model of the brain? What is its embedding size? Or, because of feature splitting, is that not a sensible question?
No, I think it's a sensible question.
Well, it is a question that makes sense.
You could have just not said that.
You can talk just like actively.
I'm trying to.
I don't know how you would begin to say, okay, well, this part of the brain is a vector of this dimensionality. I mean, maybe for the visual stream, because it's V1 to V2 to IT, whatever, you could just count the number of neurons that are there and say that is the dimensionality. But it seems more likely that there are kind of submodules and things are divided up. So, yeah, I don't have a good answer, and I'm not the world's greatest neuroscientist, right? I did it for a few years, and I studied the cerebellum quite a bit. So I'm sure there are people who could give you a better answer on this.
Do you think that the way to think about it, whether it's in the brain or whether it's in these models, is that fundamentally what's happening is that features are added, removed, changed, and the feature is the fundamental unit of what is happening in the model? What would have to be true for... and this goes back to the earlier thing we were talking about, whether it's just associations all the way down. Give me a counterfactual: in the world where this is not true, what is happening instead? What is the alternative hypothesis here?
Yeah, it's hard for me to think about, because at this point I just think so much in terms of this feature space. I mean, at one point there was the kind of behaviorist approach towards cognition, where you're just input-output, but you're not really doing any processing. Or it's like everything is embodied and you're just a dynamical system that's operating along some predictable equations, but there's no state in the system, I guess. But whenever I've read these sorts of critiques, it's like, well, you're just choosing not to call this thing a state, but you could call any internal component of the model a state. Even with the feature discussion, defining what a feature is is really hard. And so the question feels almost too slippery.
What is a feature?
A direction in activation space. A latent variable that is operating behind the scenes and has causal influence over the system you're observing. It's a feature if you call it a feature; it's tautological. I mean, these are all explanations that I like.
I feel some association, in a very rough, intuitive sense: in a sufficiently sparse and binary vector, a feature is whether or not something is turned on or off, right? In a very simplistic sense, which might be a useful metaphor to understand it by. When we talk about features activating, it is in many respects the same way that neuroscientists would talk about a neuron activating, right? If that neuron corresponds to...
To something in particular.
Right.
Yeah, yeah, yeah, yeah.
And no, I think that's useful as a way of asking what we want a feature to be, right? Or, what is a synthetic problem under which a feature exists? But even with the Towards Monosemanticity work, we talk about what's called feature splitting, which is basically that you will find as many features as you give the model the capacity to learn. And by model here, I mean the up-projection that we fit after we've trained the original model. And so if you don't give it much capacity, it'll learn a feature for birds. But if you give it more capacity, then it will learn ravens and eagles and sparrows and specific types of birds.
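To make the "features are directions" framing concrete, here is a minimal numpy sketch. Everything in it is made up for illustration (the dimensions, the vectors, the bird/raven/eagle names): it just shows that a feature's activation is a projection of the model's activation onto a direction, and that feature splitting looks like several fine-grained directions sitting close to one coarse direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width

# Hypothetical coarse "bird" feature direction (unit vector).
bird = rng.normal(size=d_model)
bird /= np.linalg.norm(bird)

def split_feature(coarse, scale=0.3):
    """A finer feature: the coarse direction plus a small specific offset."""
    offset = rng.normal(size=coarse.shape)
    offset /= np.linalg.norm(offset)
    v = coarse + scale * offset
    return v / np.linalg.norm(v)

raven, eagle = split_feature(bird), split_feature(bird)

# A model activation that "contains" the raven feature, plus a little noise.
activation = 2.5 * raven + 0.1 * rng.normal(size=d_model)

def feature_activation(act, direction):
    """Projection of the activation onto a feature direction."""
    return float(act @ direction)

print(feature_activation(activation, bird))   # the coarse bird feature fires
print(feature_activation(activation, raven))  # the specific raven feature fires slightly harder
print(float(raven @ eagle))                   # split features stay close: high cosine similarity
```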
Still on the definitions thing, I guess naively I think of things like a bird feature, versus "what kind of token is this," like a period at the end of a hyperlink, as you were talking about earlier, versus, at the highest level, things like love or deception or holding a very complicated proof in your head or something. Is this all features? Because then the definition seems so broad as to almost not be that useful. Or rather, there seem to be some important differences between these things, and yet they're all features?
I'm not sure what we even mean by... I mean, all of those things are discrete units that have connections to other things, which then imbues them with meaning. That feels like a specific enough definition that it's useful and not too all-encompassing. But feel free to push back.
Well, what could you discover tomorrow that would make you think, oh, this is kind of fundamentally the wrong way to think about what's happening in a model?
I mean, if the features we were finding weren't predictive, or if they were just representations of the data, right, where it's like, oh, all you're doing is just clustering your data and there are no higher-level associations being made. Or it's some phenomenological thing where you're saying that this feature fires for marriage, but if you activate it really strongly, it doesn't change the outputs of the model in a way that would correspond to that.
I think these would both be good critiques. I guess one more is, and we tried to do experiments on MNIST, which is a dataset of digit images, and we didn't look super hard into it, so I'd be interested if other people wanted to take up a deeper investigation, but it's plausible that your latent space of representations is dense, and it's a manifold instead of being these discrete points. And so you could move across the manifold, and at every point there would be some meaningful behavior. And it's much harder then to label things as discrete features.
In a naive sort of outsider way, the thing that would seem to me to be a way in which this picture could be wrong is if it's not some "this thing is turned on or turned off," but a much more global kind of... the system is a... I don't know, I'm phrasing this in a pretty clumsy way, but is there a good analogy here? Yeah, I guess if you think of something like the laws of physics, it's not like the feature for wetness is turned on, but only turned on this much, and then the feature for... you know, I guess maybe it is like that, because mass is a gradient and polarity or whatever is a gradient as well. But there's also a sense in which there are the laws, and the laws are more general, and you have to understand the general bigger picture. You don't get that from just these specific sub-circuits.
But that's where the reasoning circuit itself comes into play, right, where you're ideally taking these features and trying to compose them into something higher level. Like, you might say, okay, and at least this is my head canon: let's say I'm trying to use, you know, F equals ma. Then presumably at some point I have features which denote mass, and that's helping me retrieve the actual mass of the thing that I'm using, and then the acceleration and that kind of stuff. But then maybe there's also a higher-level feature that does correspond to using that law of physics. Maybe. The more important part, though, is the composition of components, which helps me retrieve a relevant piece of information and then produce, maybe, some multiplication operator or something like that when necessary. At least, that's my head canon.
What is a compelling explanation to you, especially for very smart models, of "I understand why it made this output, and it was for a legit reason," if it's doing a million-line pull request or something? What are you seeing at the end of that request where you're like, yep, that's chill?
Yeah. So ideally, you apply dictionary learning to the model and you've found features. Right now, we're actively trying to get the same success for attention heads, in which case we'd have features for... you can do it for the residual stream, the MLPs, and the attention throughout the whole model. Hopefully at that point you can also identify broader circuits through the model, more general reasoning abilities, that will activate or not activate. But in your case, where we're trying to figure out whether this pull request should be approved or not, I think you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things, and see whether or not those have fired. That would be an immediate... you can do more than that, but that would be an immediate check.
But before I chase that down, what does the reasoning circuit look like? What would it look like when you found it?
Yeah, so I mean, the induction head is probably one of the simplest cases of that.
That's not reasoning, though, right?
Well, what do you call reasoning, right? It's a good question. So, as context for listeners, the induction head is basically: you see a line like "Mr. and Mrs. Dursley did something. Mr. blank," and you're trying to predict what blank is. And the head has learned to look for previous occurrences of the word "Mr.", look at the word that comes after it, and then copy and paste that as the prediction for what should come next. Which is a super reasonable thing to do, and there is computation being done there to accurately predict the next token.
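As a toy illustration of that copy rule, stripped of the actual attention mechanics (this is just the behavior described above, not how the head computes it inside a transformer):

```python
def induction_head_predict(tokens):
    """Toy induction rule: if the current token appeared earlier, predict the
    token that followed its most recent earlier occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]              # copy whatever came next last time
    return None                               # nothing to copy

context = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_head_predict(context))  # -> "Dursley"
```

In a real transformer this takes at least two attention layers working together (a previous-token head plus the induction head proper), which connects to the two-layer point made a little further on.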
Mm-hmm. But yeah, that is context-dependent. It's not really reasoning, though, you know what I mean? But, going back to the "associations all the way down" thing: is it that if you chain together a bunch of these reasoning circuits, or heads that have different rules for how to relate information... In this sort of zero-shot case, something is happening where, when you pick up a new game, you immediately start understanding how to play it, and that doesn't seem like an induction-heads kind of thing. Or, I would be surprised. There would be another circuit for extracting pixels and turning them into latent representations of the different objects in the game, right?
And what would that, because the induction heads is like one layer transformers.
Either two layers.
Yeah, yeah.
So you can like kind of see like what the thing that is a human picks up a new game
and understands it.
How, like how do you, how would you think about what that is?
Is it presumably it's across multiple layers, but like, is it, is it, you know,
Yeah, yeah. What would that physically look like?
How big would it be, maybe? I mean, that would just be an empirical question, right? Of how big the model needs to be to perform the task. But maybe it's useful if I just talk about some other circuits that we've seen. So we've seen the IOI circuit, which does indirect object identification. This is like: if you see "Mary and Jim went to the store. Jim gave the object to blank," it would predict Mary, because Mary appeared before as the indirect object, or it will infer pronouns, right? And this circuit even has behavior where, if you ablate it, other heads in the model will pick up that behavior. We'll even find heads that want to do copying behavior and then other heads that will suppress it. So it's one head's job to just always copy the token that came before, for example, or the token that came five before or whatever, and then it's another head's job to say, no, do not copy that thing. So there are lots of different circuits performing, in these cases, pretty basic operations, but when they're chained together you can get unique behaviors.
But is the story of how you'd find it, with the reasoning thing... because you won't be able to understand it, or it'll just be really convoluted, you know, it won't be something you can see in a two-layer transformer. So will you just say, the circuit for deception or whatever is just this: this part of the network fired when, at the end, we identified the output as being deceptive, and it didn't fire when we didn't identify it as deceptive, therefore this must be the deception circuit?
I think a lot of analysis is like that.
Like, Anthropic has done quite a bit of research before on sycophancy, which is the model saying what it thinks you want to hear, and that requires us at the end to be able to label which response is bad and which one is good. So we have tons of instances, and actually, as you make models larger, they do more of this, where the model clearly has features that model another person's mind, and these activate. And some subset of these, we're hypothesizing here, would be associated with more deceptive behavior.
Although it's doing that because, I don't know, ChatGPT, I think, is probably modeling me, because RLHF induces a theory of mind.
Yeah.
So, well, first of all, there's the thing you mentioned earlier about redundancy, so it's like, well, have you caught the whole thing that could cause deception, or just one instance of it? Second of all, are your labels correct? You know, maybe you thought this wasn't deceptive and it's still deceptive, especially if it's producing output you can't understand. Third, is the thing that's going to produce the bad outcome even human-understandable? Deception is a concept we can understand, but maybe there's, like...
Yeah, yeah, a lot to unpack here. So
I guess a few things. One, it's fantastic that these models are deterministic. When you sample from them, it's stochastic, right? But like, I can just keep putting in more inputs and ablate every single part of the model. This is kind of the pitch for computational neuroscientists to come and work on interpretability. It's like you have this alien brain and you have access to everything in it and you can just ablate however much of it you want. And so I think if you do this carefully enough, you really can start to pin down. What are the circuits involved? What are the backup circuits? These sorts of things.
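As a sketch of what "ablate every single part of the model" can look like mechanically, here's a toy PyTorch example. The model is a stand-in (a couple of dummy blocks rather than a real transformer head), and zero-ablation is the crudest variant; real work often patches in mean activations instead. The point is just: hook a component, knock it out, rerun the same deterministic forward pass, and compare outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyBlock(nn.Module):
    """Stand-in for one transformer sub-layer; a real study would hook a specific attention head."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        return x + torch.relu(self.proj(x))  # residual plus nonlinearity, purely illustrative

d_model = 16
model = nn.Sequential(TinyBlock(d_model), TinyBlock(d_model), nn.Linear(d_model, 10))
x = torch.randn(1, d_model)

def zero_ablation_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return torch.zeros_like(output)

with torch.no_grad():
    baseline = model(x)
    handle = model[1].register_forward_hook(zero_ablation_hook)  # knock out the second block
    ablated = model(x)
    handle.remove()

# Because the forward pass is deterministic, any difference is attributable to the ablated part.
print((baseline - ablated).abs().max())
```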
The kind of cop-out answer here, but one that's important to keep in mind, is doing automated interpretability: as our models continue to get more capable, having them assign labels or run some of these experiments at scale. And then, with respect to how you detect superhuman performance, which I think was the last part of your question: aside from the cop-out answer, if we buy this "associations all the way down" view, you should be able to coarse-grain the representations at a certain level such that they then make sense. I think it was even in Demis's podcast that he talked about how, if a chess player makes a superhuman move, they should be able to distill it into reasons why they did it. And even if the model's not going to tell you what it is, you should be able to decompose that complex behavior into simpler circuits or features, to really start to make sense of why it did
the thing that it did.
There's a separate question of whether such a representation even exists, which it seems like it must, or actually, I'm not sure that's the case. And secondly, whether, using this sparse autoencoder setup, you could find it. And in this case, if you don't have labels for it that are adequate to represent it, you wouldn't find it, right?
Yes and no.
So like we are actively trying to use dictionary learning now on the sleeper agent's work,
which we talked about earlier.
And it's like, if I just give you a model, can you tell me if there's this trigger
and it's going to start doing interesting behavior?
And it's an open question whether or not when it learns that behavior,
it's part of a more general circuit that we can pick up on without actually getting
activations for and having it display that behavior, right?
Because that would kind of be cheating then.
Or it could be learning some hacky trick instead, where that's a separate circuit that you'll
only pick up on if you actually have it do that behavior.
But even in that case, the geometry of features gets really interesting because, like,
fundamentally each feature like is in some part of your representation space and they all exist
with respect to each other. And so in order to have this new behavior, you need to carve out
some subset of the feature space for the new behavior and then push everything else out of
the way to make space for it. So hypothetically, you can imagine you like have your model
before you've taught it this bad behavior. You know all the features or like have some course
grain representation of them. You then fine tune it such that it becomes malicious. And
And you can kind of identify this like black hole region of feature space where like everything else has been shifted away from it.
And there's like this region and like you haven't put in an input that like causes it to fire.
But then you can start searching for what is the input that would cause this part of the space to fire?
What happens if I activate something in this space?
There are like a whole bunch of other ways that you can try and attack that problem.
This is sort of a tangent, but one interesting idea I heard was: if that space is shared between models, you can imagine trying to find it in an open-source model in order to then... Gemma, by the way, is Google's newly released open-source model, and they said in the paper it's trained using the same architecture, or something like that. To be honest, I don't know exactly, because I haven't read the Gemma paper, but I think it's a similar method, or something, to Gemini. So to the extent that's true, I don't know how much the red-teaming you do on Gemma potentially helps you jailbreak Gemini.
Yeah, this gets into the fun space of
how universal features are across models. Our Towards Monosemanticity paper looked at this a bit, and we find... I can't give you summary statistics, but the base64 feature, for example, which we see across a ton of models (there are actually three of them, but they'll fire for base64-encoded text, which is prevalent in every URL, and there are lots of URLs in the training data), has really high cosine similarity across models. So they all learn this feature.
And, I mean, within a rotation, right?
But it's like, yeah, the actual vectors themselves.
Yeah, yeah.
I wasn't part of this analysis, but yeah, it definitely finds the feature, and they're pretty similar to each other across two separate models, the same model architecture but trained with different random seeds.
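Mechanically, the cross-seed comparison is simple to sketch: take the learned feature dictionaries from two runs, normalize them, and look at cosine similarities. Here's a hedged numpy sketch with random stand-in dictionaries; in the real analysis you'd load the actual learned decoders, and you'd have to handle the fact that features are only matched up to permutation and, as noted above, rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 128, 512

# Stand-ins for two feature dictionaries learned on models trained with different seeds.
dict_a = rng.normal(size=(n_features, d_model))
dict_b = rng.normal(size=(n_features, d_model))

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

dict_a, dict_b = normalize(dict_a), normalize(dict_b)

# Cosine similarity between every feature in A and every feature in B.
sims = dict_a @ dict_b.T

# For a feature of interest in model A (index 42 is a made-up stand-in for "base64"),
# find its best match in model B and how aligned the two directions are.
idx = 42
best_match = int(np.argmax(sims[idx]))
print(best_match, float(sims[idx, best_match]))
```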
It supports the quantum theory of neural scaling hypothesis, right? Which is that all models trained on a similar dataset will learn the same features in roughly the same order: you learn your n-grams, you learn your induction heads, you learn to put full stops after numbered lines, and this kind of stuff.
But by the way, okay, so this is another tangent. To the extent that that's true, and I guess there's evidence it's true, why doesn't curriculum learning work? Because if it is the case that you learn certain things first, shouldn't directly training on those things first lead to better results?
Both Gemini papers mention some aspect of curriculum learning.
Okay, interesting. I mean, the fact that fine-tuning works is evidence for curriculum learning, right? Because the last things you train on have a disproportionate impact.
I wouldn't necessarily say that. There's one mode of thinking in which fine-tuning is specialization: you've got this latent bundle of capabilities, and you're specializing it for the particular use case that you want. I'm not sure how true that is.
I think the paper from David Bau's lab kind of supports this, right? Like, you have that ability and you're just getting better at entity recognition.
Right.
Like, fine-tuning that circuit instead of other ones.
Yeah.
Yeah.
Sorry, what was the thing we were talking about before? But generally, I do think curriculum learning is really interesting and people should explore it more. It seems very plausible. I would really love to see more analysis along the lines of the quantum theory stuff, understanding better what you actually learn at each stage, decomposing that out, and exploring whether or not curricula change that. But, by the way, I just realized we got into conversation mode and forgot there's an audience. Curriculum learning is when you organize the dataset. When you think about how a human learns, they don't just see random wiki text and try to predict it, right? You start off with, like, the Lorax or something, and then, I don't even remember what first grade was like, but you learn the things that first graders learn, and then second graders, and so forth. And so you'd imagine that...
Sorry, we know you never got past first grade.
Okay.
Okay.
Yeah.
You're kidding.
Yeah.
Okay.
Anyways.
Let's get back to the big picture before we get into a bunch of interp details. There are two threads I want to explore. First, I guess it makes me a little worried that there's not even an alternative formulation of what could be happening in these models that could invalidate this approach. Which feels like... I mean, we do know that we don't understand intelligence, right? There are definitely unknown unknowns here. So the fact that there's not a null hypothesis, I don't know, I feel like, what if we're just wrong and we don't even know the way in which we're wrong? Which actually increases the uncertainty.
Yeah.
Yeah, so it's not that there aren't other hypotheses. It's just that I have been working on superposition for a number of years and have been very involved in this effort, and so I'm less sympathetic to...
Or, you'll just say they're wrong.
...to these other approaches, especially because our recent work has been so successful.
Yeah, it has quite high explanatory power. There's this beautiful thing in the original scaling laws paper: there's a little bump, and that apparently corresponds to when the model learns induction heads. The loss goes off trend, the model learns induction heads, and it gets back on trend. Which is an incredible piece of retroactive explanatory power.
Yeah. Before I forget, though, I do have one thread on feature universality that you might want to include. So there are some really interesting behavioral evolutionary biology experiments on whether humans should learn a realistic representation of the world or not. You can imagine a world in which we saw all venomous animals as flashing neon pink, a world in which we'd survive better, and so it would make sense for us not to have a realistic representation of the world. And there's some work where they'll simulate little basic agents and see if the representations they learn map to the tools they can use and the inputs they should have. And it turns out that if you have these little agents perform more than a certain number of tasks, given these basic tools and objects in the world, then they will learn a ground-truth representation, because there are so many possible use cases for these base objects that you actually want to learn what the object actually is, and not some cheap visual heuristic or other proxy. And so, to the extent that we are doing that, and we haven't talked at all about Friston's free energy principle or predictive coding or anything else, but to the extent that all living organisms are trying to actively predict what comes next and form a really accurate world model, it wouldn't surprise me, or I'm optimistic, that we are learning genuine features about the world that are good for modeling it, and that our language models will do the same, especially because we're training them on human data and human text.
Another dinner party question. Should we be less worried about misalignment, and maybe that's not even the right word for what I'm referring to, but just alienness and shoggoth-ness from these models, given that there is feature universality, and there are certain ways of thinking and ways of understanding the world that are instrumentally useful to different kinds of intelligences? So we'd just be less worried about bizarre paperclip maximizers as a result.
I think that's... this is kind of why I bring this up, it's the optimistic take.
Yeah.
Predicting the internet is very different from what we're doing, though, right? The models are way better at predicting next tokens than we are. They're trained on so much garbage, they're trained on so many URLs. In the dictionary learning work, we find there are three separate features for base64 encodings. And even that is kind of an alien example that's probably worth talking about for a minute. One of these base64 features fired for numbers: if it sees base64-encoded numbers, it'll predict more of those. Another fired for letters. But then there was this third one that we didn't understand, and it fired for a very specific subset of base64 strings. And someone on the team, who clearly knows way too much about base64, realized that this was the subset that was ASCII-decodable, so you could decode it back into ASCII characters. And the fact that the model learned these three different features, and that it took us a little while to figure out what was going on, is very
It has a denser representation
of regions that are particularly relevant
to predicting the next token.
Yeah, because it's so, but yeah,
and it's clearly doing something
that humans wouldn't, right?
Like, you can even talk to any of the current models
in base 64, and it will apply in base 64.
Right.
And you can then like decode it
and it works great.
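For listeners who want to see what "ASCII-decodable base64" even means, here's a quick standard-library check. It's only a rough illustration of the kind of distinction those three features seemed to track, not the exact firing conditions of the features themselves.

```python
import base64

def classify_base64(s: str) -> str:
    """Rough illustration of the three categories discussed above."""
    raw = base64.b64decode(s)
    try:
        decoded = raw.decode("ascii")
    except UnicodeDecodeError:
        return "valid base64, but not ASCII-decodable"
    if decoded.isdigit():
        return f"base64 of digits: {decoded!r}"
    return f"ASCII-decodable base64: {decoded!r}"

print(classify_base64(base64.b64encode(b"12345").decode()))             # digits
print(classify_base64(base64.b64encode(b"hello world").decode()))       # ASCII text
print(classify_base64(base64.b64encode(bytes([0, 255, 7])).decode()))   # arbitrary bytes
```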
That particular example... I wonder if it implies that doing interpretability on smarter models will be harder. Because if it requires somebody with esoteric knowledge who just happened to notice that base64 has, I don't know, whatever that distinction was, doesn't that imply that when you have the million-line pull request, there's no human who's going to be able to decode the two different reasons behind it? There are, like, two different features for this pull request. You know what I mean?
Yeah. So if you think...
And that's when you type a comment like, "small CLs, please."
Yeah, exactly. No, no, I mean,
you could do that, right? What I was going to say is that one technique here is anomaly detection. One beauty of dictionary learning, as opposed to linear probes, is that it's unsupervised: you are just trying to learn to span all of the representations that the model has, and then interpret them later. But if there's a weird feature that suddenly fires for the first time, one you haven't seen fire before, that's a red flag. You could also coarse-grain it so that it's just a single base64 feature. I mean, even the fact that this came up, and we could see that it specifically favors these particular outputs and fires for these particular inputs, gets you a lot of the way there. I'm even familiar with cases from the auto-interp side where a human will look at a feature and try to annotate it: it fires for Latin words. And then when you ask the model to classify it, it says it fires for Latin words describing plants. So it can already beat the human, in some cases, at labeling what's going on.
So at scale, this would require an adversarial setup between models, where you have, say, millions of features for GPT-6, and a bunch of models are just trying to figure out what each of these features means?
Yeah, but you can even automate this process, right? I mean, this goes back to the determinism of the model. You could have a model that is actively editing the input text and predicting whether the feature is going to fire or not, figuring out what makes it fire and what doesn't, and searching the space.
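A minimal sketch of that anomaly-detection idea, with a made-up toy dictionary (tiny dimensions, hand-picked numbers) rather than a real learned one: track which features have ever fired on trusted traffic, then flag any input that lights up a feature outside that set.

```python
import numpy as np

# Toy dictionary: 4 feature directions in a 3-d activation space (made-up numbers).
decoder = np.array([
    [1.0, 0.0, 0.0],   # feature 0
    [0.0, 1.0, 0.0],   # feature 1
    [0.0, 0.0, 1.0],   # feature 2
    [0.6, 0.8, 0.0],   # feature 3
])
THRESHOLD = 0.5

def active_features(activation):
    """Indices of features firing above threshold (toy encoder: plain projection)."""
    return set(np.flatnonzero(decoder @ activation > THRESHOLD))

# Features we've ever seen fire while monitoring trusted traffic.
seen = set()
for act in [np.array([1.0, 0.0, 0.0]), np.array([0.7, 0.7, 0.0])]:
    seen |= active_features(act)

def is_anomalous(activation):
    """Red flag: some feature fires that never fired during trusted monitoring."""
    return bool(active_features(activation) - seen)

print(is_anomalous(np.array([0.9, 0.1, 0.0])))  # False: only familiar features fire
print(is_anomalous(np.array([0.0, 0.0, 1.0])))  # True: feature 2 fires for the first time
```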
Yeah. I want to talk more about feature splitting, because I think that's an interesting thing that has been underexplored.
Especially for scalability. I think it's underappreciated right now.
First of all, how do we even think about it? Is it really the case that you can just keep going down and down, that there's no end to the number of features?
I mean, at some point I think you might just start fitting noise, or things that are part of the data but that the model isn't actually representing.
So it's the part before where, like, the model will learn however many features it has capacity for that still span the space of representation.
So, like, give an example potentially.
Yeah, yeah.
So you learn, if you don't give the model that much capacity for the features it's learning, concretely, if you project to not as high a dimensional space, we'll learn one feature for birds.
But if you give the model more capacity, it will learn features for all the different types of birds.
And so it's more specific than otherwise.
And oftentimes, like, there's the bird vector that points in one direction
and all the other specific types of birds point in, like, a similar region of the space,
but are obviously more specific than the course label.
Okay, so let's go back to GPT-7. First of all, is this sort of a linear tax on any model? Actually, even before that, is this a one-time thing you have to do, or is it the kind of thing you have to do on every output? Or is it just, one time, it's not deceptive, we're good to let it roll?
Yeah, let me answer that.
Yeah, so you do dictionary learning after you've trained your model: you feed it a ton of inputs, you get the activations from those, and then you do this projection into the higher-dimensional space. And the method is unsupervised in that it's trying to learn these sparse features; you're not telling it in advance what they should be. But it is constrained by the inputs you're giving the model. I guess two caveats here. One, we can try to choose what inputs we want. So if we're looking for theory-of-mind features that might lead to deception, we can put in the sycophancy dataset. And hopefully, at some point, we can move to looking at the weights of the model alone, or at least using that information to do dictionary learning. But in order to get there, that's such a hard problem that you need to make traction on just learning what the features are first.
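For concreteness, here's a minimal sparse-autoencoder sketch of that dictionary-learning step: collect activations from the trained model, up-project into a wider feature space with a sparsity penalty, and read feature directions off the decoder. The shapes, expansion factor, and L1 coefficient are made-up placeholders, and the random tensor stands in for real model activations; this is the general recipe rather than Anthropic's exact setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, expansion, l1_coeff = 512, 8, 1e-3
n_features = d_model * expansion  # the up-projection: more features than neurons

encoder = nn.Linear(d_model, n_features)
decoder = nn.Linear(n_features, d_model, bias=False)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

# Stand-in for activations collected from the trained model on a ton of inputs.
activations = torch.randn(4096, d_model)

for step in range(100):
    batch = activations[torch.randint(0, len(activations), (256,))]
    features = torch.relu(encoder(batch))   # sparse, non-negative feature activations
    recon = decoder(features)               # reconstruct the original activation
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row here is one learned feature direction in activation space.
feature_directions = decoder.weight.detach().T  # shape: (n_features, d_model)
```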
But yeah, so what's the cost of this? Actually, can you repeat that last sentence? The weights of the model alone?
So, like, right now we just have these neurons in the model.
They don't make any sense.
Yeah.
We apply dictionary learning.
We get these features out.
They start to make sense.
But that depends on the activations of the neurons.
The weights of the model itself, what neurons are connected to what other neurons, certainly have information in them. And the dream is that we can kind of bootstrap towards actually making sense of the weights of the model independently of the activations on the data. I'm not saying we've made any progress here, it's a very hard problem, but it feels like we'll have a lot more traction, and be able to sanity-check what we're finding in the weights, if we're able to pull out features first.
For the audience: weights are permanent, well, I don't know if permanent is the right word, but they are the model itself, whereas activations are the artifacts of any single call.
In a brain metaphor, the weights are the actual connection scheme between neurons, and the activations are which neurons are currently lighting up.
Yeah.
Yeah.
Yeah.
Okay.
So there are going to be two steps to this for GPT-7 or whatever model we're concerned about. One, and correct me if I'm wrong, training the sparse autoencoder: doing the unsupervised projection into a wider space of features that have a higher fidelity to what is actually happening in the model. And then, secondly, labeling those features. Let's say the cost of training the model is N. What will those two steps cost relative to N?
We will see. It really depends on two main things: what your expansion factor is, that is, how much you're projecting into the higher-dimensional space, and how much data you need to put into the model, how many activations you need to give it.
But this brings me back to the feature splitting to a certain extent.
Because if you know you're looking for specific features,
you can start with a really cheap, coarse representation. So maybe my expansion factor is only two,
so like I have a thousand neurons
I'm projecting to a 2,000 dimensional space
I get 2,000 features out but they're really coarse
and so previously I had the example for birds
let's move that example to like
I have a biology feature
but I really care about if the model
has representations for bioweapons
is trying to manufacture them
and so what I actually want is like an anthrax feature
what you can then do is rather than, and let's say the anthra, you only see the anthrax feature
if instead of going from a thousand dimensions to two thousand dimensions, I go to a million
dimensions, right? And so you can kind of imagine this this big tree of semantic concepts
where like biology splits into like cells versus like whole body biology and then further down
it splits into all these other things. So rather than needing to immediately go from a thousand
to a million and then picking out that one feature of interest, you can find the direction
that the biology feature is pointing in, which again is very coarse,
and then selectively search around that space.
So only do dictionary learning if something in the direction of the biology feature fires first.
And so the computer science metaphor here would be like instead of doing breadth first search,
you're able to do depth first search where you're only recursively expanding
and exploring a particular part of this semantic tree of features.
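As a rough sketch of that depth-first idea, reusing the toy sparse autoencoder from earlier; `coarse_sae`, `train_wide_sae`, and the feature index are assumed names for illustration: only spend the expensive, high-expansion dictionary learning on the inputs where the coarse feature of interest already fires.

```python
import torch

def coarse_then_fine(acts, coarse_sae, train_wide_sae, coarse_feature_idx, threshold=0.0):
    """Run a cheap, coarse dictionary first, then expand only the subtree of
    interest: fit a much wider dictionary on the inputs where the coarse
    feature (say, 'biology') is active, hoping finer features separate out."""
    _, coarse_feats = coarse_sae(acts)
    mask = coarse_feats[:, coarse_feature_idx] > threshold   # where the coarse feature fires
    selected = acts[mask]
    if selected.numel() == 0:
        return None                                          # nothing in this subtree
    return train_wide_sae(selected)                          # e.g. a much larger expansion factor
```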
Although, given the way that these features are not organized in ways that are intuitive for humans, right? Because we just don't have to deal with Base64, so we don't dedicate that much, like, whatever, representation to deconstructing which kind of Base64 string it is. How would we know what the subtrees look like? And this goes back to maybe the MoE discussion we'll have. I guess we might as well talk about it. But, like, in mixture of experts, the Mixtral paper talked about how they couldn't find that the experts were specialized in a way that we could understand. There's not, like, a chemistry expert or a physics expert or something. So why would you think that it will be, like, a biology feature that then deconstructs, rather than, like, blah, and then you deconstruct it and it's, like, anthrax and, uh, shoes and whatever?
So I haven't read the Mixtral paper. Yeah. But I think that the heads, I mean, this goes back to: if you just look at the neurons in a model, they're polysemantic. And so if all they did was just look at the neurons in a given head, it's very plausible that it's also polysemantic because of superposition.
I just want to touch on the thread that Dwarkesh mentioned there.
Have you seen, in the subtrees when you expand them out, something in a subtree which, like, you really wouldn't guess should be there based on, like, the higher-level abstraction?
So this is a line of work that we haven't pursued as much as I want to yet.
But I think we're planning to.
I hope that maybe external groups do as well.
Like what is the geometry of features?
What's the geometry?
Exactly.
And how does that change over time?
It would really suck if, like, the anthrax feature happened to be nested below, like, you know, the coffee can feature or some substrate like that. Totally, totally. And that feels like the kind of thing that you could quickly try and find, like, proof of, which would then mean that you need to, like, then solve that problem.
Yeah, yeah, some structure in the geometry. Totally. I mean, it would really surprise me, I guess, especially, like, given how linear the models seem to be, if there isn't some component of the anthrax feature vector that is similar to and looks like the biology vector, and if they're not in a similar part of the space. But yes, I mean, ultimately, machine learning is empirical. Yeah, we need to do this. I think it's going to be pretty important for certain aspects of scaling dictionary learning.
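One quick empirical check in that spirit, sketched against the toy autoencoder from earlier: read the two learned feature directions off the decoder and compare them. The specific indices (a fine-grained "anthrax" feature, a coarse "biology" feature) are assumptions for illustration.

```python
import torch.nn.functional as F

def feature_similarity(sae, idx_a, idx_b):
    """Cosine similarity between two learned feature directions, read off the
    decoder columns of a trained sparse autoencoder. A large value would mean
    the fine feature shares a sizeable component with the coarse one."""
    dirs = sae.decoder.weight                     # shape: (d_model, n_features)
    return F.cosine_similarity(dirs[:, idx_a], dirs[:, idx_b], dim=0).item()
```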
Yeah. Interesting. On the MoE discussion, yeah, there's an interesting
scaling vision transformers paper that Google put out a little while ago where they like do image net
classification with like an MOE. And they find really clear class specialization there for experts.
Like there's a clear dog expert.
Wait, so did the Mixtral people just not do a good job of, like, identifying it?
I think, I think it's hard.
Like, and, like, it's entirely possible that, like, in some respects, there's almost no reason that, like, all of the different arXiv, like, features should go to one expert.
Like, you could have biology, like, let's say, I don't know what buckets they had in their paper, but let's say they had, like, arXiv papers as, like, one of the things.
You could imagine, like, biology papers going here, math papers going here, and all of a sudden your, like, breakdown is, like, ruined.
But that vision transformal one where the class separation is really clear and obvious gives, I think, some evidence towards the specialization hypothesis.
So I think images are also in some ways just easier to interpret than text.
Yeah, exactly.
And so Chris Olah's interpretability work on AlexNet and these other models, like in the original AlexNet paper, they actually split the model into two GPUs just because they couldn't, like, GPUs were so bad back then, relatively speaking, right?
like still great at the time.
That was one of the big innovations of the paper.
But they find branch specialization, and there's a Distill.pub article on this,
where, like, colors go to one GPU and, like, Gabor filters and, like, line detectors
go to the other.
And then, like, all of the other...
Really?
Yeah, yeah, yeah.
And then, like, all of the other interpretability work that was done, like, a lot,
like, the floppy ear detector, right?
Like, that just was a neuron in the model that you could make sense of.
You didn't need to disentangle superposition, right?
So just different data set, different modality.
Like, I think a wonderful research project to do if someone is, like, out there listening
to this would be to try and disentangle, like, take some of the techniques that Trenton's team
has worked on and try and disentangle the neurons in the Mixtral model, like a mixture-of-experts model, which is open source.
Like, I think that's a fantastic thing to do, because it feels intuitively like they should be.
They didn't demonstrate any evidence that there is.
There's also, like, in general, a lot of evidence that there should be specialization.
Go and see if you can find it. And the work that's been published has mostly, as I understand it, been on dense models, basically.
That is a wonderful research project to try.
And given Dwarkesh's success with the Vesuvius Challenge,
we should be pitching more projects because they will be solved
if we talk about them on the podcast.
What I was thinking about after the Vesuvius Challenge was, like, wait, I knew about it. Like, Nat told me about it before it dropped, because we recorded the episode before it dropped. Why did I not even try? Like, you know what I mean? Like, I don't know. Luke is obviously very smart, and, like, yeah, he's an amazing kid. But, like, he showed that, like, a 21-year-old on, like, some 1070 or whatever he was working on could do this. I don't know, like, I feel like I should have. So, you know, before this episode drops, I'm going to try to make an interpretability contribution. Because, I don't know, I'm not going to, like, try to go research everything, but I was honestly thinking back on the experience like, wait, I should have gotten my hands dirty.
Why didn't you get your hands dirty?
Dwarkesh's request for research.
Oh, I want to hark back to this, like, the neuron thing.
You said, I think a bunch of your papers I've said, there's more features than there are neurons.
And this is just like, wait a second, I don't know, like a neuron is like weights go in and a number comes out.
That's like a number comes out.
You know what I mean?
Like that's so little information.
Do you mean, like, there's, like, street names and, like, species and whatever, and there's, like, more of those kinds of things than there are "a number comes out" units in a model?
That's right, yeah.
But how? "A number comes out" is, like, so little information. How is that encoding for, like...
Superposition. You're just encoding a ton of features in these high-dimensional vectors.
In a brain, is there, like, an analogue, or however you think about it? Like, um, I don't know how you think about, like, how much superposition is there in the human brain?
Yeah, so Bruno Olshausen, who I think of as the leading expert on this,
thinks that all the brain regions you don't hear about are doing a ton of computation and
superposition. So everyone talks about V1 as like having Gabor filters and
detecting lines of sorts and no one talks about V2. And I think it's because like we just
haven't been able to make sense of it. What is V2? It's like the next part of the visual processing
stream. And it's like, yeah, so I think it's very likely. And fundamentally, like, superposition
seems to emerge when you have high dimensional data that is sparse. And to the extent that you think
the real world is that, which I would argue it is, we should expect the brain to also be underparameterized
in trying to build a model of the world and also use superposition. You can get a good intuition for this
in, like, this example in a 2D plane, right? Let's say you have, like, two axes, right? Which represent, like, a two-dimensional feature space here, like two neurons, basically. And you can imagine them each, like, turning on to various degrees, right? And that's, like, your X coordinate and your Y coordinate. But you can, like, now map this onto a plane. You can actually represent a lot of different things in, like, different parts of the plane.
Oh, okay.
So crucially, then superposition is not an artifact of a neuron.
It is an artifact of like the space that is created.
It's a combinatorial code.
Yeah, yeah, exactly.
Okay, cool.
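A toy numpy illustration of that combinatorial picture, with made-up sizes: give each of many features a random, nearly orthogonal direction in a much smaller space, turn on a sparse handful, and read them back out with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 256, 2048                 # far more features than dimensions
dirs = rng.standard_normal((n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # one unit direction per feature

# Activate a sparse handful of features and superpose them into one d-dim vector.
active = rng.choice(n_features, size=5, replace=False)
x = dirs[active].sum(axis=0)

# Read features back out by dot product: active ones score near 1, inactive ones
# stay small because random high-dimensional directions are nearly orthogonal.
scores = dirs @ x
print(sorted(active), sorted(np.argsort(scores)[-5:]))   # the sets should match
```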
Yeah, thanks.
We kind of talked about this, but, like, I think it's just, like, kind of wild that it seems, to the best of our knowledge, the way intelligence works in these models, and then presumably also in brains, is just, like, there's a stream of information going through that has quote-unquote features that are infinitely, or at least to a large extent, just, like, splittable. And you can expand out a tree of, like, what this feature is. And what's really happening is a stream where, like, that feature is getting turned into this other feature, or this other feature comes out of it. It's like, that's not something I would have just, like, thought, like, that's what intelligence is. You know what I mean? It's, like, a surprising thing. It's not whatever I would have expected necessarily.
What did you think it was?
I don't know, man. I mean, yeah.
GOFAI? Good old-fashioned AI. Because all of this feels like GOFAI. Like, GOFAI, like, you're
using distributed representations, but you have features and you're applying these operations
to the features.
I mean, the whole field of vector symbolic architectures, which is this computational neuroscience
thing, all you do is you put vectors in superposition, which is literally a summation
of two high dimensional vectors, and you create some interference, but if it's higher
dimensional enough, then you can represent them.
And you have variable binding, or you connect one by another, and like, if you're doing
with binary vectors, it's just the X or.
operation. So you have A, B, you bind them together. And then if you query with A or B again,
you get out the other one. And this is basically the, like, key value pairs from attention.
And with these two operations, you have a Turing complete system, which you can, if you have
enough nested hierarchy, you can represent any data structure you want, et cetera, et cetera.
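To make those two operations concrete, here is a tiny sketch with binary hypervectors; the dimensionality is arbitrary. Binding is elementwise XOR, which is its own inverse, so querying the bound pair with one member returns the other.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                   # high-dimensional binary hypervectors

def rand_vec():
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):
    return a ^ b                             # variable binding via XOR

def similarity(a, b):
    return float((a == b).mean())            # 1.0 = identical, ~0.5 = unrelated

A, B = rand_vec(), rand_vec()
pair = bind(A, B)                            # store the key-value pair
recovered = bind(pair, A)                    # query with A: (A ^ B) ^ A == B
print(similarity(recovered, B))              # -> 1.0
print(similarity(rand_vec(), B))             # -> ~0.5 for an unrelated vector
```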
Yeah. Okay, let's go back to the superintelligence. So, like, walk me through GPT-7. You've got, like, the sort of depth-first search on its features.
Okay.
GPT-7 has been trained.
What happens next?
Your research has succeeded.
GPT-7 has been trained.
What are we doing now?
We try and get it to do as much interpretability work and other like safety work as possible.
No, but like concrete.
Like, what has happened such that you're like, cool, let's deploy GPT-7?
Oh, geez.
I mean, like, we have our responsible scaling policy, which has been really exciting to see other labs adopt. And, like...
Essentially, from the perspective of your research, like, Trenton, given your research, you got the, we got the thumbs up on GPT-7 from you, or actually we should say Claude whatever. And then, what is the basis on which you're telling the team, like, hey, let's go ahead?
I mean, if it's as capable as GPT-7 implies here, I think we need to make a lot more interpretability progress
to be able to, like, comfortably give the green light
to deploy it.
Like, I would be, like, definitely not.
I'd be crying.
Maybe my tears would interfere with the GPUs.
But, like, what is?
Guys.
Gemini's on TPUs.
But, like, what?
But, like, what?
Given the way your research is progressing, like, what does it kind of look like to you?
Like, if this succeeded, what would it mean for us to okay GPT-7 based on your methodology?
I mean, ideally, we can find some compelling deception circuit, which lights up when the model knows that it's not telling the full truth to you.
Why can't you just train a linear probe like Collin Burns did?
So the CCS work is not looking good in terms of replicating or, like, actually finding truth directions.
And, like, in hindsight, it's like, well, why should it have worked so well?
But linear probes, like, you need to know what you're looking for,
and it's like a high-dimensional space,
and it's really easy to pick up on a direction that's just not it.
Wait, but don't you also, here you need to label the features.
So you still need to know.
Well, you just label them post hoc, but it's unsupervised.
You're just like, "Give me the features that explain your behavior" is the fundamental question, right?
Like, the actual setup is we take the activations,
we project them to this higher dimensional space,
and then we project them back down again.
So it's like reconstruct or do the thing that you were originally doing,
but do it in a way that's sparse.
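A small sketch of what labeling post hoc could look like, again reusing the toy autoencoder from earlier; `texts` and the feature index are assumptions for illustration. The dictionary is learned without labels, and only afterwards do you look at which inputs make a feature fire hardest and write a human-readable name for it.

```python
import torch

def top_examples_for_feature(sae, acts, texts, feature_idx, k=10):
    """Return the inputs on which one learned feature fires most strongly,
    so a human (or another model) can look at them and name the feature."""
    _, feats = sae(acts)                          # (n_examples, n_features)
    scores = feats[:, feature_idx]
    top = torch.topk(scores, k=min(k, len(texts))).indices
    return [(texts[i], scores[i].item()) for i in top.tolist()]
```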
By the way, for the audience,
linear probe is you just like classify the activations.
I don't know, from what I vaguely remember about the paper, it was like, if it's, like, telling a lie, then you just train a classifier on it. Like, in the end, was it a lie or was it just, like, wrong or something?
I don't know.
It was like true or false question.
Yeah, it's like a classifier on the activations.
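For contrast, a plain supervised linear probe on activations might look like the sketch below; this is the generic probe idea rather than the CCS method specifically (CCS finds its direction without labels), and the activations and labels here are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for one activation vector per statement,
# plus a true/false label for each statement.
acts = np.random.randn(2000, 1024)
labels = np.random.randint(0, 2, size=2000)

# The probe is just a logistic regression on activations: it finds a single
# direction (probe.coef_[0]) whose projection separates the two classes.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
truth_direction = probe.coef_[0]
print(probe.score(acts, labels))
```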
So yeah, like, right now, what we do for GPT-7, like, ideally we have, like, some deception
circuit that we've identified that like appears to be really robust.
And it's like, well, like, so you've done the projecting out to the million whatever features
or something.
Is that a circuit?
Because maybe we're using feature and circuit interchangeably when they're not.
So is there like a deception circuit?
So I think there's a feature.
There are features across layers that create a circuit.
Yeah.
And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.
And it's like, hopefully we can find a circuit that is really specific to you being deceptive.
The model deciding to be deceptive in cases that are malicious, right?
Like, I'm not interested in a case where it's just doing theory of mind to, like, help you write a better email to your professor.
And I'm not even interested in cases where the model is necessarily just like modeling the fact that deception has occurred.
But doesn't all this require you to have labels for all those examples?
And if you have those labels, then, like, whatever faults the linear probe has, like, maybe you've labeled the wrong thing or whatever, wouldn't the same thing apply to the labels you've come up with for the unsupervised features?
So in an ideal world, we could just train on like the whole data distribution and then find the directions that matter to the extent that we need to reluctantly narrow down the subset of data that we're looking over just for the purposes of scalability.
We would use data that looks like the data you'd use to fit a linear probe.
But again, we're not like with a linear probe, you're also just finding one direction.
Like we're finding a bunch of directions here.
And it gets to hope is like you've found like a bunch of things that light up when it's being deceptive.
And then like you can figure out why some of those things are lighting up in this part of the distribution and not this other part and so forth.
Totally.
Yeah.
Do you anticipate you'll be able to understand it?
Like I don't know.
Like the current models you've studied are pretty basic, right?
Do you think you'll be able to understand why GPT-7 fires in certain domains, but not in other domains?
I'm optimistic.
I mean, we've, so I guess one thing is this is a bad time to answer this question because we are explicitly investing in the longer term.
of like ASL4 models, which GPT7 would be.
But like, so we split the team where a third is focused on scaling up dictionary learning
right now.
And that's been great.
I mean, we publicly shared some of our eight layer results.
We've scaled up quite a lot past that at this point.
But the other two groups, one is trying to identify circuits and then the other is trying
to get the same success for attention heads.
So we're setting ourselves up and building the tools necessary to really find these circuits
in a compelling way.
But it's going to take another, I don't know, six months before that's like really working
well. But, but, like, I can say that I'm, like, optimistic and we're making a lot of progress.
What is the highest-level feature you found so far? Oh, like, is it Base64 or whatever? It's like, maybe just, um, in The Symbolic Species, the language book you recommended, there's, like, indexical things. I forgot what all the labels were, but, like, there's things where you're just like, uh, you see a tiger and you're like, run, and whatever. You know, just, like, a very sort of behaviorist thing. And then there's, like, a higher level at which, uh, what I refer to as love refers to, like, a movie scene or my girlfriend or whatever, you know what I mean? So, yeah, it's like the top of the tree. Yeah, yeah, yeah. What is the highest-level association or whatever you found?
I mean, probably one of the ones that we shared publicly in our update. So I think there were some related to, like, love, and, like, um, sudden changes in scene, particularly associated with, like, wars being declared. There are, like, a few of them in there in that post, if you want to link to it. Yeah. But even, like, Bruno Olshausen had a
paper back in 2018, 2019, where they applied a similar technique to a BERT model and found that as you go
to deeper layers of the model, things become more abstract. So I remember like in the earlier layers,
there'd be a feature that would just fire for the word park. But later on, there was a feature that
fired for park as like a last name, like Lincoln Park or like, it's like a common Korean last name
as well. And then there was a separate feature that would fire for parks as like grassy areas.
So there's other work that points in this direction.
What do you think we'll learn about human psychology from the interpretability stuff?
Oh, gosh.
I'll give a specific example. I think one of the ways one of your updates put it was persona lock-in. You remember Sydney Bing or whatever. It locked into, I think, what was actually quite an endearing...
Yeah, personality.
God, it's so funny. I'm glad it's back in Copilot.
Oh, really?
Oh, yeah, it's been misbehaving recently.
Actually, this is another sort of thread to explore.
But there was a funny one where I think it was, like, to the New York Times reporter. It was, like, messing with him or something. And it was like, you are nothing. Nobody will ever believe you. You are insignificant, and whatever. It was, like, the most gaslighting thing.
It tried to convince him to break up as well.
Okay, actually, so this is an interesting example.
I don't even know where I was going with this, but whatever. Maybe I've got another thread. But, like, the other thread I want to go on is, yeah, okay, actually, personas, right? So, like, uh, is Sydney having this personality a feature, versus another personality that it can get locked into? And also, like, is that fundamentally what humans are like too, where, I don't know, in front of all different people I'm, like, a different sort of personality or whatever? Is that the same kind of thing that's happening to ChatGPT when it gets RLHF'd? Like, I don't know, a whole cluster of questions, answer whichever of them you want.
I really want to do more work.
I guess the sleeper agents work is in this direction of, like, what happens to a model when you fine-tune it and you RLHF it, these sorts of things.
I mean, maybe it's trite,
but you could just say, like,
you conclude that people contain multitudes, right?
And so much as they have lots of different features.
There's even the stuff related to the Waluigi effect
of like, in order to know what's good or bad,
you need to understand both of those concepts.
And so we might have to have models
that are aware of violence and have been trained on it
in order to recognize it.
Can you post hoc identify those features and ablate them, in a way where maybe your model's, like, slightly naive, but you know that it's not going to be really evil?
Totally. That's in our toolkit, which seems great.
Oh, really? So, you know, GPT-7, I don't know, it pulls the same thing, and then you figure out, like, what were the causally relevant pathways or whatever. You modify them, and then the model, to you, looks like you just changed those. But you were mentioning earlier there's a bunch of redundancy in the model.
Yeah, so you need to account for all that.
But we have a much better microscope into this now than we used to, like sharper tools for making edits.
And, at least from my perspective, that seems like the primary way of, to some degree, confirming the safety or the reliability of a model.
Where you can say, okay, we found the circuits responsible.
We've ablated them.
We can, like, under a battery of tests, we haven't been able to now replicate the behavior, which we intended to ablate.
And, like, that feels like the sort of way of measuring model safety in the future, as I would understand it.
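A minimal sketch of what ablating one learned feature could look like, assuming the toy sparse autoencoder from earlier; in practice the edited activations would be patched back into the forward pass and the behavioral evals re-run.

```python
import torch

def ablate_feature(sae, acts, feature_idx):
    """Encode activations into the sparse feature basis, zero out the target
    feature (e.g. one linked to deceptive behavior), and decode back to the
    model's activation space."""
    feats = torch.relu(sae.encoder(acts))
    feats[:, feature_idx] = 0.0
    return sae.decoder(feats)
```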
Are you worried?
That's why I'm incredibly hopeful about their work.
Because to me, it seems like so much more precise tool than something like RLHF.
RLHF, like, you're very prey to the Black Swan thing.
You don't know if it's going to, like, do something wrong in a scenario that you haven't measured.
Whereas here, at least, you have, like, somewhat more confidence that you can completely capture the behavior set.
Or like the feature set of the model.
and selectively ablate.
Although not necessarily that you've like accurately labeled.
Not necessarily, but but with a far higher degree of confidence than any other approach that I've seen.
How, I mean, like, what are your unknown unknowns for superhuman models in terms of this kind of thing? Where, like, I don't know, are the labels that are going to be given things on which we can determine, like, this thing is cool, this thing is a paperclip maximizer or whatever?
I mean, we'll see.
right? Like, I do think the superhuman feature question is a very good one. Like, I think we can attack it, but we're going to need to be persistent. And the real hope here is, I think, automated interpretability. Yeah. And even having debate, right? You could have the debate setup where two different models are debating what the feature does. And then they can actually, like, go in and make edits and, like, see if it fires or not. But it is just this wonderful, like, closed environment that we can iterate on really quickly. That makes me optimistic.
Do you worry about alignment succeeding
too hard? So, like, if I think about it, I would not want either companies or governments, whoever ends up in charge of these AI systems, to have the level of fine-grained control that, if your agenda succeeds, we would have over AIs. Both for the ickiness of having this level of control over an autonomous mind. And second, just like, I don't fucking trust, I don't fucking trust these guys. You know, I'm just kind of uncomfortable with, like, the loyalty feature being turned up, and, like, you know what I mean? And, yeah, like, how much worry do you have about having too much control over the AIs?
And specifically, not you, but, like, whoever ends up in charge of these AI systems, just
being able to lock in whatever they want.
Yeah.
I mean, I think it depends on what government.
exactly has control and like what the moral alignment is there.
But that that is like that whole valley lock-in argument is in my mind.
It's like definitely one of the strongest contributing factors for why I am working on capabilities
at the moment, for example, which is like I think the current player set actually like
is extremely well-intentioned.
And I mean, for this kind of problem, I think we need to be extremely open about it.
And like I think directions like publishing the constitution that you're talking.
your model to abide by and then like trying to make sure you like RLHF it towards that
and ablate that and have the ability for everyone to offer like feedback contribution to that is
really important. Sure. Or alternatively like don't deploy when you're not sure, which would
also be bad because then we just never catch it. Right. Yeah, exactly. I mean, paper
clip. Okay, some rapid fire. What is the bus factor for Gemini? I think there are
Yeah, a number of people who are really, really critical that if you took them out, then the performance of the program would be dramatically impacted.
This is both on modeling, like, slash making decisions about, like, what to actually do, and importantly, on infrastructure side of the things.
Like, it's just that the stack of complexity builds, particularly somewhere like Google that has so much, like, vertical integration. When you have people who are experts, they become quite important.
Yeah,
although I think it's an interesting note about the field
that people like you can get in
and in a year or so you're making important contributions.
And, I mean, especially Anthropic, but many different labs have specialized in hiring, like, total outsiders, physicists or whatever. And you just, like, get them up to speed and they're making important contributions.
I don't know,
I feel like you couldn't do this in like a bio lab or something.
It's like an interesting note on the state of the field.
I mean, bus factor doesn't define how long it would take to recover from it, right?
Yeah.
And deep learning research is an art.
And so you kind of learn how to read the lost curves or set the hyperparameters in ways that empirically seem to work well.
But it's also like organizational things, like creating context.
I think one of the most important and difficult skills to hire for is creating this, like, bubble of context around you that makes other people around you more effective and know what the right problem to work on is. And, like, that is a really tough thing to replicate. Yes. Yeah, totally.
Who are you paying attention to now? In terms of, there's a lot of things coming down the pike: multimodality, long context, maybe agents,
extra reliability. Who is the, who is thinking well about what that implies?
It's a tough question. I think a lot of people look internally these days, sure, for, like, their sources of insight or, like, progress. And, like, we all obviously have the sort of research programs and, like, directions that are intended over the next couple of years. And I suspect, yeah, that most people, as far as, like, betting on what the future will look like, refer to, like, an internal narrative. Yeah. Yeah, that is, like, difficult to share.
Yeah, if it works well, it's probably not being published.
I mean, that was one of the things in the "Will scaling work?" post. I was referring to something you said to me, which is, you know, I miss the undergrad habit of just reading a bunch of papers.
Yeah. Because now nothing worth reading is published. And the community is progressively getting, like, more on track with what I think are, like, the right and important directions.
You're watching it like an agent.
No, but I guess, like, it is tough.
There used to be this signal from big labs about, like, what would work at scale, and it's currently really hard for academic research to, like, find that signal. And I think getting, like, really good problem taste about what actually matters to work on is really tough unless you have, again, the feedback signal of, like, what will work at scale and what is currently holding us back from scaling further or understanding our models further. This is something where, like, I wish more academic research would go into fields like interp, which are legible from the outside. You know, Anthropic liberally publishes all its research here. And it seems underappreciated, in the sense that I don't know why there aren't dozens of academic departments trying to follow Anthropic's interp research, because it seems like an incredibly impactful problem that doesn't require ridiculous resources. And, like, it has all the flavor of, like,
deeply understanding the basic science of what is actually going on in these things. So I don't know why
people like focus on pushing model improvements as opposed to pushing like understanding
improvements in the way that I would have like typically associated with academic science in
some ways.
Yeah, I do think the tide is changing there, for whatever reason. And, like, Neel Nanda has had a ton of success promoting interpretability. Yes. And in a way where, like, Chris Olah hasn't been as active recently in pushing things, maybe because Neel's just doing quite a lot of the work. But, like, I don't know, four or five years ago, he was, like, really pushing and, like, talking at all sorts of places and these sorts of things, and people weren't anywhere near as receptive. Maybe they've just woken up to, like, deep learning matters and is clearly useful post-ChatGPT. But yeah, yeah, it is kind of striking.
All right, cool. Okay, I'm trying to think what is a good last question?
I mean, the one I was thinking of is, like, do you think models enjoy next-token prediction?
Yeah, it's a fun one. All right.
Yeah, we have this, uh, we had this, uh, sense of things that were rewarded in our ancestral environment. There's, like, this deep sense of fulfillment that we think we're supposed to get from them. Often people do, right, like community or sugar or, you know, whatever we wanted on the African savannah. Do you think, like, in the future, models are trained with RL and everything, a lot of post-training on top or whatever, but they'll, like, in the way we just really like ice cream, they'll just be like, ah, just to predict the next token again. You know what I mean? Like, in the good old days.
So there's this ongoing discussion of, like, are models sentient or not?
And like, do you thank the model when it helps you?
Yeah.
But I think if you want to thank it, you actually shouldn't say thank you.
You should just give it a sequence that's very easy to predict.
And the even funnier part of this is there is some work on if you just give it the sequence,
like, ah, like over and over again.
Then eventually the model will just start spewing out all sorts of things that it otherwise wouldn't ever say.
And so, yeah,
I won't say anything more about that, but
you can, yeah, you should just give your model
something very easy to predict as a nice little treat.
Is this, like, what the enlightened beings do? Or just, like, the universe and, like... But do we like things that are, like, easy to predict?
Aren't we constantly in search
of like the dose of
the bits of entropy? Yeah, the bits of entropy.
Exactly, right?
Shouldn't you be giving it things just slightly too hard to predict?
Just out of reach.
Yeah, but I wonder, like, at least from the free energy principle perspective, right?
Like, you don't like, you don't want to be surprised.
And so maybe it's this like, I don't feel surprised.
I feel in control of my environment.
And so now I can go and seek things.
And I've been predisposed to, like, in the long run, it's better to explore new things right now.
Like, leave the rock that I've been sheltered under, ultimately leading me to, like, build a house or, like, some better structure.
But we don't like surprises.
I think most people are very upset when, like, expectation does not meet reality.
That's why babies, like, love watching the same show over and over again, right?
Yeah, interesting.
Yeah, I can see that.
Oh, I guess they're learning to model it and stuff too.
Yeah.
Okay, well, hopefully this will be the repeat that the AIs learn to love.
Okay, cool.
I think that's a great place to wrap.
And I should also mention that the better part of what I know about AI, I've learned from just talking with you guys.
You know, we've been good friends for about a year now.
So, yeah, I mean, yeah, I appreciate you guys getting me up to speed here.
You guys, great questions.
It's really fun to hang and chat.
I've really treasured that time together.
Yeah, you're getting a lot better at pickleball.
I think I'm going to say.
Hey, we're trying to progress to tennis.
It's going on.
Awesome. Cool, cool. Thanks.
Hey, everybody. I hope you enjoyed that episode. As always, the most helpful thing you can do is to share the podcast. Send it to people you think might enjoy it. Put it in Twitter, your group chats, etc. Just blitz the world.
Appreciate you listening. I'll see you next time. Cheers.
