Dwarkesh Podcast - Sholto Douglas & Trenton Bricken — How LLMs actually think
Episode Date: March 28, 2024

Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast. No way to summarize it, except: this is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them. You would be shocked how much of what I know about this field, I've learned just from talking with them. To the extent that you've enjoyed my other AI interviews, now you know why.

So excited to put this out. Enjoy! I certainly did :)

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. There's a transcript with links to all the papers the boys were throwing down - may help you follow along. Follow Trenton and Sholto on Twitter.

Timestamps
(00:00:00) - Long contexts
(00:16:12) - Intelligence is just associations
(00:32:35) - Intelligence explosion & great researchers
(01:06:52) - Superposition & secret communication
(01:22:34) - Agents & true reasoning
(01:34:40) - How Sholto & Trenton got into AI research
(02:07:16) - Are feature spaces the wrong way to think about intelligence?
(02:21:12) - Will interp actually work on superhuman models
(02:45:05) - Sholto's technical challenge for the audience
(03:03:57) - Rapid fire

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript
Okay, today I have the pleasure to talk with two of my good friends, Sholto and Trenton.
Sholto.
You should have to make stuff.
I was going to say anything.
Let's do this in reverse.
How long I had started with my good friends?
Yeah, I didn't know I at one point caught the context like just wow.
Shit.
Anyways, Sholto,
Noam Brown.
Noam Brown, the guy who wrote the diplomacy paper,
he said this about Sholto.
He said he's only been in the field for 1.5 years,
but people in AI know that he was one of the most important people
behind Gemini's success.
And Trenton, who's at Anthropic,
works on mechanistic interpretability.
And it was widely reported that he has solved alignment.
So this will be a capabilities-only podcast, alignment is already solved,
so no need to discuss further.
Okay, so let's start by talking about context lengths.
Yep.
It seemed to be under-hyped, given how important it seems to me to be,
that you can just put a million tokens into context.
There's apparently some other news that got pushed to the front for some reason.
But, yeah, tell me about how you see the future of long context lengths and what that implies for these models.
Yeah.
So I think it's really underhyped because until I started working on it, I didn't really appreciate
how much of a step up in intelligence it was for the model to have the onboarding problem
basically instantly solved.
And you can see that a little bit in the perplexity graphs in the paper where just throwing
millions of tokens worth of context about a code base allows it to become dramatically
better at predicting the next token in a way that you'd normally associate with huge
increments in model scale.
But you don't need that.
All you need is like a new context.
So underhyped and buried by some other news.
In context, are they as sample efficient and smart as humans?
I think that's really worth exploring.
For example, one of the evals that we did in the paper has it learning a language in context
better than a human expert could learn that new language over the course of a couple months.
And this is only like a pretty small demonstration, but I'd be really interested to see things like Atari games or something like that
where you throw in a couple hundred or thousand frames, labeled actions,
and then in the same way that you'd, like, show your friend how to play a game
and see if it's able to reason through.
It might, at the moment, you know, with the infrastructure and stuff,
it's still a little bit slow, like, doing that.
But I would actually, I would guess that might just work out of the box
in a way that would be pretty mind-blowing.
And crucially, I think this language was esoteric enough that it wasn't in the training data.
Right, exactly.
Yeah, if you look at the model before it has that context thrown in,
it just doesn't know the language at all, and it can't get any translation.
And this is like an actual, like, human language.
It's not just...
Yeah, exactly, an actual human language.
So if this is true, it seems to me that these models are already, in an important sense, superhuman. Not in the sense that they're smarter than us, but I can't keep a million tokens in my context when I'm trying to solve a problem, remembering and integrating all the information in a code base.
Am I wrong in thinking this is like a huge unlock?
I actually generally think that's true.
Like previously, I've been frustrated when models aren't as smart.
Like, you ask them a question and you want it to be smarter than you or to know things that you don't.
And this allows them to know things that you don't in a way that it just ingests a huge amount of information in a way you just can't.
So, yeah, it's extremely important.
Well, how do we explain in context learning?
Yeah.
So there's a line of work I quite like where it looks at in-context learning as basically very similar to gradient descent, where the attention operation can be viewed as gradient descent on the in-context data. That paper had some cool plots where it basically showed: we take n steps of gradient descent, and that looks like n layers of in-context learning, and they look very similar. So I think that's one way of viewing it and trying to understand what's going on.
Yeah. And you can ignore what I'm about to say, because given the introduction, alignment is solved and AI safety isn't a problem. But I think the context stuff does get problematic, but also interesting here. I think there'll be more work coming out in the not-too-distant future around what happens if you give a hundred-shot prompt for jailbreaks, adversarial attacks. It's also interesting in the sense that if your model is doing gradient descent and learning on the fly, even if it's been trained to be harmless, you're dealing with a totally new model in a way. You're, like, fine-tuning it, but in a way where you can't control what's going on.
Can you explain what you mean by gradient descent happening in the forward pass and attention?
Yeah. There was something in the paper about trying to teach the model to do linear regression, but just through the number of samples they gave in the context. And you can see, if you plot on the x-axis the number of shots or examples it has, and then the loss it gets on ordinary least squares regression, that will go down with time. And it goes down exactly matched with the number of gradient descent steps.
Yeah, exactly.
Okay. I only read the intro and discussion section
of that paper, but in the discussion, the way they framed it is that, in order to get better at long-context tasks, the model has to get better at learning to learn from these examples, or from the context that is already within the window. And the implication of that is, if, like, meta-learning happens because the model has to learn how to get better at long-context tasks, then in some important sense the task of intelligence, like, requires long-context examples and long-context training. Like, you have to induce meta-learning.
Understanding how to better induce
meta learning in your pre-training process is a very
important thing to actually get flexible or adaptive
intelligence. Right, but you can proxy
for that just by getting better at doing
long context tasks.
One of the bottlenecks for
AI progress that many people identify is
the inability of these models to
perform tasks
on long horizons, which means
engaging with the task for many
hours or even many weeks
or months where like if I have
I don't know, an assistant or
employee or something, they can just do a thing I tell them for a while.
And AI agents haven't taken off for this reason, from what I understand.
So how linked are long context windows and the ability to perform well on them and the ability
to do these kinds of long horizon tasks that require you to engage with an assignment for
many hours?
Or are these unrelated concepts?
I mean, I would actually take issue with that being the reason that agents haven't taken
off, where I think that's more about nines of reliability and the model actually successfully
doing things.
And if you just can't chain tasks successively with high enough probability, then you won't get something that looks like an agent.
And that's why something like agents might follow more of a step function, visually. Like, GPT-4-class models, Gemini Ultra-class models are not enough. But maybe the next increment on model scale means that you get that extra nine. Even though the loss isn't going down that dramatically, that small amount of extra ability gives you the extra nine. And, like, yeah, obviously you need some amount of context to fit long-horizon tasks, but I don't think that's been the limiting factor up to this point.
Yeah.
The NeurIPS best paper this year, where Rylan Schaeffer was the lead author, points to this as, like, emergence being a mirage, where people will have a task and you get the right or wrong answer depending on whether you've sampled the last five tokens correctly.
And so naturally, you're multiplying the probability of sampling all of those.
And if you don't have enough nines for reliability, then you're not going to get emergence.
And all of a sudden you do.
and it's like, oh my gosh, this ability is emergent
when actually it was kind of almost there
to begin with.
And there are ways that you can find
like a smooth metric for that.
Yeah, HumanEval or whatever.
In the GPT-4 paper, for the coding problems, they measure the log pass rate, right?
Exactly.
Yeah.
For the audience, the context on this is
it's basically the idea is you want to,
when you're measuring how much progress there has been
on a specific task like solving coding problems,
you upweight it when it gets it right only one in a thousand times. You don't give it a one-in-a-thousand score, because it's like, oh, it got it right some of the time. And so the curve you see is: it gets it right one in a thousand times, then one in a hundred, then one in ten, and so forth. So actually, I want to
follow up on this. So if your claim is that the AI agents haven't taken off because of reliability
rather than long-horizon task performance, isn't the lack of reliability when a task is chained on top of another task, on top of another task? Isn't that exactly the difficulty with long-horizon tasks, that you have to do 10 things in a row or 100 things in a row, and the reliability of any one of them diminishes, or, yeah, the probability goes down from 99.99 to 99.9, and then the whole thing gets multiplied together and becomes much less likely to happen?
That is exactly the problem. But the key issue you're pointing out there is that your base per-task solve rate is 90%, and if it was 99%, then chaining doesn't become a problem. But also, yeah, exactly, and I think this is also something that just hasn't been properly studied enough.
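A quick back-of-the-envelope on that reliability point: if each sub-task succeeds independently with probability p, a chain of N sub-tasks succeeds with probability roughly p to the N, which is why an extra nine or two of per-step reliability makes such a difference.

```python
# Chance of completing an N-step task if each step succeeds independently
# with probability p: roughly p ** N.
for p in (0.90, 0.99, 0.999):
    for n in (10, 100):
        print(f"per-step {p}, {n} steps -> {p ** n:.4f}")
```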
If you look at all of the evals that are commonly used, like, the academic evals are a single problem, right?
You know, like the math problem.
It's like one typical math problem, or MMLU.
It's like one university level like problem from across different topics.
You're beginning to see evals looking at this properly with more complex tasks, like SWE-bench, where they take a whole bunch of GitHub issues, and that is a reasonably long-horizon task. But it's still, like, a multi-sub-hour task as opposed to a multi-hour or multi-day task.
And so I think one of the things that will be really important to do over the next, however long,
is understand better what does success rate over a long horizon task look like.
And I think that's even important to understand what the economic impact of these models might be
and actually properly judge increasing capabilities by cutting down the tasks that we do
and the inputs and outputs involved into minutes or hours or days and seeing how good it is
successively chaining and completing tasks of those different resolutions of time.
But then that tells you, like, how automatable a job family or task family is, in a way that, like, an MMLU score just doesn't.
I mean, it was less than a year ago that we introduced 100K context windows.
And I think everyone was pretty surprised by that.
So, yeah, everyone just kind of had this sound bite of quadratic attention costs.
Yeah.
We can't have long context windows.
Here we are.
So, yeah, like the benchmarks are being actively made.
Wait, wait.
So doesn't the fact that there's these companies, Google and, I don't know, Magic, maybe others,
who have million token attention imply that the quadri-
You shouldn't say anything because you're not,
but doesn't that like imply that it's not quadratic anymore,
or are they just eating the cost?
Well, like, who knows what Google is doing for its long context?
Yeah, I'm not saying it's either.
One of the things that frustrated me about, like,
the general research field's approach to attention
is that there's an important way in which the quadratic cost of attention
is actually dominated in typical dense transformers
by the MLP block.
Right? So you have this n-squared term that's associated with attention, but you also have a term that's quadratic in d_model, the residual stream dimension of the model. And if you look, I think Sasha Rush has a great tweet where he basically plots the curve of the cost of attention with respect to the cost of really large models, and attention actually trails off. You actually need to be doing pretty long context before that term becomes really
important. And the second thing is that people often talk about how attention at
inference time is such a huge cost, right? And if you think about when you're actually
generating tokens, the operation is not n square. It is one Q, like one set of Q vectors,
looks up a whole bunch of KV vectors. And that's linear with respect to the amount of like
context that the model has. And so I think this drives a lot of the like recurrence and state space
research where people have this meme of, oh, like, linear attention and all this stuff.
And as Trenton said, there's like a graveyard of ideas around attention.
And not to say I don't think it's worth exploring, but I think it's important to consider
where the actual strengths and weaknesses of it are.
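As a rough illustration of the point about the n-squared attention term being dominated by the d_model terms, here is a back-of-the-envelope per-layer FLOP count for a dense transformer. The constants are approximate and the d_model and sequence lengths are made-up round numbers; the point is the crossover: the attention n-squared term only dominates once the context is much longer than the residual stream dimension.

```python
# Rough per-layer FLOP sketch (approximate constants, illustrative sizes only).
def per_layer_flops(n, d):
    qkvo_proj = 8 * n * d * d     # Q, K, V, O projections: ~2*n*d^2 each
    attn_scores = 4 * n * n * d   # QK^T plus attention-weighted sum of V: ~2*n^2*d each
    mlp = 16 * n * d * d          # two matmuls with a 4x hidden expansion
    return qkvo_proj, attn_scores, mlp

d_model = 8192
for n in (2_048, 32_768, 1_000_000):
    proj, attn, mlp = per_layer_flops(n, d_model)
    share = attn / (proj + attn + mlp)
    print(f"context {n:>9,}: attention n^2 share of FLOPs = {share:.2f}")
```

And the separate inference-time point above is about a different quantity: when generating one new token against a cached context, the new query attends over the stored KV vectors, so that step is linear, not quadratic, in context length.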
Okay, so what do you make of this take?
As we move forward through the takeoff, more and more of the learning happens in the forward
pass.
So originally, like, all the learning happens in the backward, you know, during like this, like,
bottom-up sort of hill-climbing evolutionary process.
If you think in the limit during the intelligence explosion,
it's just like the AI is like maybe like handwriting the weights
or, like, doing GOFAI or something.
And we're in like the middle step where like a lot of learning happens in context now
with these models.
A lot of it happens within the backward pass.
Does this seem like a meaningful gradient along which progress is happening?
Like how much, because the broader thing being the,
if you're learning in the forward pass,
it's much more sample efficient because you can kind of, like, basically think as you're learning. Like when humans read a textbook, you're not just skimming it and trying to absorb, you know, which words inductively follow which words. You read it and you think about it, and then you read some more and you think about it. I don't know, does this seem like a sensible way to think about the progress?
Yeah, it may just be one of the ways in which, you know, birds and planes both fly, but they fly differently, and, like, the virtue of technology allows us to basically accomplish things that birds can't. It might be that context length is similar, in that it allows it to have a working memory that we can't. But functionally it's not, like, the key thing towards actual reasoning. The key step between GPT-2 and GPT-3 was that all of a sudden there was this meta-learning behavior that was observed in training.
Like in the pre-training of the model.
And that's as you said, like it's something to do with you give it some amount of context.
It's able to adapt to that context.
And that was a behavior that wasn't really observed before that at all.
And maybe that's a mixed property of context and scale and this kind of stuff. But it wouldn't have occurred in a model with a tiny context, I would say.
This is actually an interesting point.
So when we talk about scaling up these models,
how much of it comes from just making the models themselves bigger?
And how much comes from the fact that during any single call,
you are using more compute.
So if you think of diffusion, you can just iteratively keep adding more compute.
And if adaptive compute is solved, you can keep doing that.
And in this case, if there's a quadratic penalty for attention, but you're doing long context anyways, then you're still dumping in more compute during, not during training, not during having bigger models, but just like, yeah.
Yeah, it's interesting because you do get more forward passes by having more tokens.
Right.
My one gripe, I guess I have two gripes with this, though, maybe three.
So one, like, in the alpha paper, one of the transformer, one of the transformer modules, they have a few, and the architecture is, like, very intricate.
But they do, I think, five forward passes through it and will gradually, like, refine their solution as a result.
You can also kind of think of the residual stream.
I mean, Sholto alluded to kind of the read-write operations as like a poor man's adaptive compute, where it's like, I'm just going to give you all these layers.
And like, if you want to use them great, if you don't, then that's also fine.
And then people will be like, oh, well, the brain is recurrent and you can like do however many loops through it you want.
And I think to a certain extent that's right, right?
Like, if I ask you a hard question, you'll spend more time thinking about it.
and that would correspond to more forward passes.
But I think there's a finite number of forward passes that you can do.
It's kind of with language as well.
People are like, oh, well, human language can have like infinite recursion in it,
like infinite nested statements of like the boy jumped over the bear that was doing this,
that had done this, that had done that.
But like empirically, you'll only see five to seven levels of recursion,
which kind of relates to whatever, that magic number of like how many things you can
hold in working memory at any given time is.
And so, yeah, it's not infinitely recursive, but like, does that matter in the regime of human
intelligence?
And, like, can you not just add more layers?
Break down for me, you're referring to this in some of your previous answers, of: listen, you have these long contexts and you can hold more things in memory, but ultimately it comes down to your ability to mix concepts together to do some kind of reasoning. And these models aren't necessarily human level at that, even in context. Break down for me how you see storing just raw information versus reasoning
and what's in between, like where is the reasoning happening? Is that, where is just like
storing raw information happening? What's different between them in these models? Yeah, I don't
have a super crisp answer for you here. I mean, obviously with the input and output of the model,
you're mapping back to actual tokens, right? And then in between that, you're doing higher level
processing. Before we get deeper into this, we should explain to the audience. You referred
earlier to Anthropic's way of thinking about transformers as these read-write operations that
layers do. One of you should just kind of explain at a high level, what you mean by that.
So the residual stream, imagine you're in a boat going down a river. And the boat is kind of the
current query where you're trying to predict the next token. So it's the cat sat on the blank.
and then you have these little like streams that are coming off the river
where you can get extra passengers or collect extra information if you want
and those correspond to the attention heads and MLPs
that are part of the model, right?
And, okay, I almost think of it like the working memory of the model, like the RAM of the computer, where you're choosing what information to read in so you can do something with it, and then maybe read something else in later on.
Yeah, and you can operate on subspaces of that high-dimensional vector. A ton of things are, at this point I think it's almost a given, encoded in superposition, right? So it's like, yeah, the residual
stream is just one high dimensional vector, but actually there's a ton of different vectors
that are packed into it.
Yeah. I might just dumb it down, as a way that would have made sense to me a few months ago: okay, so you have whatever words are in the input you put into the model, and all those words get converted into these tokens, and those tokens get converted into these vectors. And basically it's just this small amount of information that's moving through the model. And the way you explained it to me, Sholto, this paper talks about is: early on in the model, maybe it's just doing some very basic things, like, what do these tokens mean? Like if it says 10 plus 5, just moving information around to have that good representation.
Exactly. Just representing it.
And in the middle, maybe the deeper thinking is happening, about, yeah, how to solve this.
At the end, you're converting it back into the output token, because the end product is you're trying to predict the probability of the next token from the last of those residual streams.
And so, yeah, it's interesting to think about just like the small compressed amount of information moving through the model and it's like getting modified in different ways.
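For readers who want that read-write picture in code form, here is a schematic sketch. It is pseudocode-level, not any particular model's implementation, and `embed`, `unembed`, and each (attn, mlp) pair are stand-ins for whatever the real components are: every attention block and MLP block reads from the residual stream and writes its output back in by addition, and the final logits are read off the last position.

```python
# Schematic sketch of the residual-stream view of a transformer.
# `embed`, `unembed`, and each (attn, mlp) pair are stand-ins, not real APIs.
def transformer_forward(tokens, blocks, embed, unembed):
    residual = embed(tokens)                  # one vector per token position
    for attn, mlp in blocks:
        residual = residual + attn(residual)  # heads read some subspaces, write into others
        residual = residual + mlp(residual)   # the MLP reads and writes the same stream
    return unembed(residual[-1])              # next-token logits from the last position
```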
Trenton, so it's interesting, you're one of the few people who have like background from neuroscience.
You can think about the analogies here to, yeah, to the brain.
And in fact, one of our friends, the way he put it, you had a paper in grad school thinking about attention in the brain, and he said this is the only or first, like, neural explanation of why attention works, whereas we have evidence for why CNNs, convolutional neural networks, work, based on the visual cortex or something.
Yeah, I'm curious, do you think in the brain there's something like a residual stream of this compressed amount of information,
that's moving through and it's getting modified
as you're thinking about something.
Even if that's not what's literally happening,
do you think that's a good metaphor
for what's happening in the brain?
Yeah, yeah.
So at least in the cerebellum,
you basically do have a residual stream
where the whole,
what we'll call the attention module for now,
and I can go into whatever amount of detail you want for that,
you have inputs that route through it,
but they'll also just go directly to the, like,
end point that module will contribute to.
So there's a direct path and an indirect path.
And so the model can pick up whatever information it wants and then add that back in.
Where would this happen, the cerebellum?
So the cerebellum nominally just does fine motor control. But I analogize this to the person who's lost their keys and is just looking under the streetlight, where it's very easy to observe this behavior.
One leading cognitive neuroscientist said to me that a dirty little secret of any fMRI study, where you're looking at brain activity for a given task, is that the cerebellum is almost always active and lighting up for it. If you have a damaged cerebellum, you also are much more likely to have autism, so it's associated with, like, social skills. In one of these particular studies, where I think they used PET instead of fMRI, when you're doing next-token prediction the cerebellum lights up a lot. Also, 70% of the neurons in your brain are in the cerebellum. They're small, but they're there, and they're taking up real metabolic cost.
This was one of Gwern's points, that what changed with humans was not just that we have more neurons, or, he shared this article, but specifically there's more neurons in the cerebral cortex and the cerebellum, and you should say more about this, but, like, they're more directly expensive and they're more involved in signaling
and sending information back and forth.
Yeah.
Is that attention?
What's going on?
Yeah.
Yeah.
So I guess the main thing I want to communicate here,
so back in the 1980s,
Pentti Kanerva came up with an associative memory algorithm for,
I have a bunch of memories.
I want to store them.
There's some amount of noise or corruption that's going on.
And I want to query or retrieve the best match.
And so he writes this equation for how to do it.
And a few years later,
realizes that if you implemented this as an electrical engineering circuit, it actually looks identical
to the core cerebellar circuit. And that circuit and the cerebellum more broadly is not just in us. It's
in basically every organism. There's active debate on whether or not cephalopods have it.
They kind of have a different evolutionary trajectory. But even fruit flies with the Drosophila
mushroom body, that is the same cerebellar architecture. And so that convergence, and then my paper,
which shows that actually this operation is to a very close approximation the same as the attention
operation, including implementing the softmax and having this sort of, like, nominal quadratic
cost that we've been talking about. And so the three-way convergence here and the takeoff
and success of transformers seems pretty striking to me.
Yeah. What I want to ask about, I think what motivated this discussion in the beginning was we were talking about, like, wait, what is the reasoning, what is the memory? What do you think about the analogy you found between attention and this? Do you think of this as more just looking up the relevant
memories or the relevant facts? And if that is the case, like, where is the reasoning
happening in the brain? How do we think about like how that builds up into the reasoning?
Yeah. So maybe my hot take here, I don't know how hot it is, is that most intelligence is pattern matching, and you can do a lot of really good pattern matching if you have a hierarchy of associative memories. So you start with your very basic associations between just, like, objects in the real world. But you can then chain those and have more abstract associations, such as, like, a wedding ring symbolizes so many other associations that are downstream. And so you can even generalize the attention operation and this associative memory to the MLP layer as well. That's in a longer-term setting where you don't have, like, tokens in your current context. But I think this is an argument that, like, association is all you need. And associative memory in general as well.
It's not, so you can do two things with it.
You can both de-noise or retrieve a current memory.
So, like, if I see your face, but it's, like, raining and cloudy, I can de-noise and kind of, like,
gradually update my query towards my memory of your face.
But I can also access that memory, and then the value that I get out
actually points to some other totally different part of the space.
And so a very simple instance of this would be if you learn the alphabet.
And so I query for A and it returns B.
I query for B and it returns C.
And you can traverse the whole thing.
Yeah.
Yeah.
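A minimal numerical sketch of the attention-as-associative-memory point, using the alphabet example just described: store a key vector for each letter and let each letter's value point at the next letter's key, so that re-querying with the output of the previous lookup chains through the sequence. (Toy construction, not the setup from the paper being discussed.)

```python
# Toy sketch: softmax attention acting as an associative (key -> value) memory.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
letters = list("ABCDE")
keys = {c: rng.normal(size=dim) for c in letters}          # stored addresses
values = {letters[i]: keys[letters[i + 1]]                  # each value points to the next key
          for i in range(len(letters) - 1)}

def attend(query):
    ks = np.stack([keys[c] for c in values])
    vs = np.stack([values[c] for c in values])
    scores = ks @ query / np.sqrt(dim)                      # scaled dot products
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # softmax over stored memories
    return weights @ vs                                     # weighted sum of values

q = keys["A"]
for _ in range(3):                            # A -> B -> C -> D by re-querying with the output
    q = attend(q)
    print(max(letters, key=lambda c: keys[c] @ q))          # which stored key it best matches
```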
One of the things I talked to Demis about was he had a paper in 2008
that memory and imagination are very linked because of this very thing
that you mentioned, memory is reconstructive.
And so you are in some sense imagining every time you're thinking of a memory
because you're only storing a condensed version of it and you, like, have to reconstruct it. And this is famously why human memory is terrible, and why people in the witness box
or whatever would just make shit up.
Okay, so let me ask a stupid question.
So you like read Sherlock Holmes, right?
And like the guy is incredibly sample efficient.
He'll like see a few observations and he'll like,
basically figure out who committed the crime because there's a series of deductive steps
that leads from somebody's tattoo and what's on the wall to the implications of that.
How does that fit into this picture?
Because, like, crucially, what makes him smart is that there's not, like, an association,
but there's a sort of deductive connection between different pieces of information.
Would you just explain it as that that's just, like, higher level association?
Like, yeah.
I think so, yeah.
So I think learning these higher-level associations, to be able to then map patterns to each other, is kind of like a meta-learning.
I think in this case he would also just have a really long context length or a really long working memory, right?
Where he can like have all of these bits and continuously query them as he's coming up with whatever theory.
So the theory is moving through the residual stream.
And then he has his attention heads querying his context.
Right.
But then how he's projecting his queries and keys in the space, and how his MLPs are then retrieving, like, longer-term
facts or modifying that information is allowing him to then in later layers do even more
sophisticated queries and slowly be able to reason through and come to a meaningful conclusion.
That feels right to me in terms of like looking back in the past, you're selectively reading
in certain pieces of information, comparing them, maybe that informs your next step of like
what piece of information you now need to pull in, and then you build this representation,
which I like progressively looks closer and closer and closer to like the suspect in your case.
Yeah. Yeah. Yeah. That doesn't feel at all outlandish.
To throw another lens on that: something I think people who aren't doing this research can overlook is that after the first layer of the model,
every query key and value that you're using for attention comes from the combination of all the previous tokens.
So, like, in my first layer, I'll query my previous tokens and just extract information from them. But all of a sudden, let's say that I attended to tokens
one, two, and four in equal amounts, then the vector in my residual stream, assuming that
they wrote out the same thing to the value vectors, but ignore that for a second, is a third
of each of those. And so when I'm querying in the future, my query is actually a third of each
of those things. But they might be written to different subspaces.
That's right. Hypothetically, but they wouldn't have to. And so you can recombine and immediately
even by layer two and certainly by the deeper layers
just have these very rich vectors
that are packing in a ton of information
and the causal graph is like literally over
every single layer that happened in the past
and that's what you're operating on.
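A tiny numeric illustration of that "a third of each" point, with made-up one-hot value vectors standing in for whatever the real heads write:

```python
# If a head attends equally to tokens 1, 2 and 4, what it writes into the
# residual stream (and thus what later queries are built from) is a mix of
# all three value vectors.
import numpy as np

v1, v2, v4 = np.eye(3)                     # pretend value vectors in different subspaces
attn_weights = np.array([1/3, 1/3, 1/3])   # equal attention to tokens 1, 2, 4
head_output = attn_weights @ np.stack([v1, v2, v4])
print(head_output)                         # [0.333 0.333 0.333]: a third of each
```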
It does bring to mind, like, a very funny eval to do would be a Sherlock Holmes eval, where you put the entire book into context, and then you have a sentence which is like, the suspect is X, then you have a log probability distribution over the different characters in the book, and then, like, as you put more...
That would be super cool.
It would be super interesting.
I wonder if you'd get anything at all.
But it would be cool.
Sherlock Holmes is probably already in the training data.
Right.
You get like a mystery novel that was written in the...
You could get an LLM to write it.
Or we could like...
Well, you could purposely exclude it, right?
Oh, we can?
How do you...
Well, you need to scrape any discussion of it from Reddit or any other thing, right?
Right.
It's hard.
But that's, like, one of the challenges that goes into things like long-context evals: to get a good one, you need to know that it's not in your training data, or, like, put in the effort to exclude it.
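For concreteness, here is a rough sketch of what that Sherlock Holmes-style eval could look like. The `logprob_of_completion` method is a hypothetical stand-in for whatever scoring endpoint a given model exposes, not a real API, and the prompt wording is made up.

```python
import math

def suspect_distribution(model, novel_text, suspects):
    # Score each candidate as the continuation of "The culprit is ___",
    # then softmax the scores into a distribution over suspects.
    prompt = novel_text + "\n\nThe culprit is "
    scores = {name: model.logprob_of_completion(prompt, name)   # hypothetical API
              for name in suspects}
    top = max(scores.values())
    exp_scores = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exp_scores.values())
    return {name: v / total for name, v in exp_scores.items()}
```

Running this on prefixes of increasing length would show whether the distribution sharpens toward the right character as more of the book sits in context.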
So, actually, there are two different threads I want to follow up on. Let's go to the long-context one and then we'll come back to this. So in the Gemini 1.5 paper, the eval that was used was something with Paul Graham essays, the needle-in-a-haystack thing, which, yeah, I mean, we don't necessarily just care about its ability to recall one specific fact from the context.
I'll step back and ask the question: the loss function for these models is unsupervised. You don't have to come up with these bespoke things that you keep out of the training data. Is there a way you can do a benchmark that's also unsupervised, where, I don't know, another LLM is rating it in some way, or something like that? And maybe the answer is, like, well, if you could do this, reinforcement learning would work, because then you have this unsupervised signal.
Yeah, I mean, I think people have explored that kind of stuff.
Like, for example, Anthropic has the constitutional AI paper, where they take another language model and they point it at a response and say, like, how helpful or harmless was that response? And then they get it to update and try to improve along the Pareto frontier of helpfulness and harmlessness. So you can point language models at each other and create evals in this way. It's obviously an imperfect art form at the moment, because you get reward function hacking, basically, and the language models, like... even humans are imperfect here. If you try and match up to what humans will say, humans typically prefer longer answers, which aren't necessarily better answers. And you get that same behavior with models.
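A minimal sketch of the models-grading-models idea described here, with `generate` as a hypothetical stand-in for whatever inference call you have (not a real library function), and a made-up rubric:

```python
def judge(judge_model, prompt, response):
    # Ask a second model to grade the first model's response against a rubric.
    rubric = (
        "Rate the following response for helpfulness and harmlessness "
        "on a scale from 1 to 10. Reply with a single number.\n"
        f"Prompt: {prompt}\nResponse: {response}\nScore:"
    )
    return judge_model.generate(rubric)  # hypothetical API

# Caveat from the discussion above: graders (human or model) tend to prefer
# longer answers, so length needs to be controlled for when comparing.
```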
On the other thread, going back to the Sherlock Holmes thing,
if it's all associations all the way down,
this is a sort of naive dinner party question,
if I just, like, met you and you're working on AI.
But, okay, does that mean we should be less worried
about super intelligence?
Because there's not this sense in which it's like,
Sherlock Holmes++. It'll still need to just, like, find these associations, like humans find
associations and like, you know what I mean? It's not just like, it sees a frame of the world and
it's like figured out all the laws of physics. So for me, because this is a very legitimate
response, right? It's like, well, artificial general intelligences aren't, if you say humans are generally intelligent, then they're no more capable or competent. I'm just worried that you have
that level of general intelligence in silicon, where you can then immediately,
clone hundreds of thousands of agents and they don't need to sleep and they can have super
long context windows and then they can start recursively improving and then things get really
scary. So I think to answer your original question, yes, you're right. They would still need to
learn associations, but the recursive stuff of improvement would still have to be them,
like if intelligence is fundamentally about these associations, like the improvement is just
them getting better at association. There's not like another thing that's happening. And
so then it seems like you might disagree
with the intuition that, well, they can't be that much more powerful if they're just doing
associations. Well, I think then you can get into really interesting cases of meta-learning.
Like when you play a new video game or like study a new textbook, you're bringing a whole
bunch of skills to the table to form those associations much more quickly. And like, because
everything in some way ties back to the physical world, I think there are, like, general features
that you can pick up and then apply in novel circumstances.
Should we talk about intelligence explosion then?
I mentioned multiple agents and I'm like, oh, here we go.
Okay, so the reason I'm interested in discussing this with you guys in particular is that the models we have of the intelligence explosion so far come from economists, which is fine, but I think we can do better, because in the model of the intelligence explosion, what happens is you replace the AI researchers, and then there's, like, a bunch of automated AI researchers who can speed up progress, make more AI researchers, make further progress.
And so I feel like if that's the metric, or that's the mechanism, we should just ask
the AI researchers about whether they think this is plausible.
So let me just ask you, like, if I have a thousand agent Sholtos or agent Trentons,
do you think that you get an intelligence explosion?
Is that, yeah, what does that look like to you?
I think one of the important bounding constraints here is compute.
I do think you could dramatically speed up AI research, right?
Like, it seems very clear to me that in the next couple of years
we'll have things that can do many of the software engineering tasks
that I do on a day-to-day basis
and therefore dramatically speed up my work
and therefore speed up the rate of progress, right?
At the moment, I think most of the labs are somewhat compute-bound
in that there are always more experiments you could run
and more pieces of information that you could gain
in the same way that, like, scientific research,
on biology is also somewhat experimentally like throughput bound.
Like you need to be able to run and culture the cells in order to get the information.
I think that will be at least a short-term binding constraint.
Obviously, you know, Sam's trying to raise $7 trillion to go get chips.
And so, like, it does seem like there's going to be a lot more compute in future as
everyone is heavily ramping.
You know, your Nvidia's stock price sort of represents the relative compute increase.
But any thoughts?
I think we need a few more nines of reliability in order for it to really be useful and trustworthy.
Right now, it's like, and just having context lengths that are super long and it's like very cheap to have.
Like if I'm working in our code base, it's really only small modules that I can get Claude to write for me right now.
But it's very plausible that within the next few years or even sooner, it can automate most of my
task. The only other thing here that I will note is the research that at least our sub-team
in interpretability is working on is so early stage that you really have to be able to make sure
everything is like done correctly in a bug-free way and contextualize the results with everything
else in the model. And if something isn't going right, be able to enumerate all of the possible
things and then slowly work on those. Like an example that we've publicly talked about in previous
papers is dealing with layer norm, right? And it's like if I'm trying to get an early result or look
at like the logit effects of the model, right? So it's like if I activate this feature that we've
identified to a really large degree, how does that change the output of the model? Am I using
layer norm or not? How is that changing the feature that's being learned? And that will take
even more context or reasoning abilities for the model.
So you used a couple of concepts together, and it's not self-evident to me that they're the same, but it seems like you were using them interchangeably, so I just want to... one was, well, to work on the Claude code base and make more modules based on that, they need more context or something, where, like, it seems like they might already be able to fit that in the context. Or do you mean, like, the context window context, or something more?
Yeah, the context window context.
So yeah, it seems like now it might just be able to fit.
The thing that's preventing it from making good modules is not the lack of being able to put the code base in there.
I think that will be there soon.
Yeah.
But it's not going to be as good as you at like coming up with papers because it can like fit the code base in there.
No, but it will speed up a lot of the engineering.
Hmm.
In a way that causes an intelligence explosion?
No, that accelerates research.
But I think these things compound.
So like the faster I can do my engineering, the more experiments I can run.
And then the more experiments I can run, the faster we can.
I mean, my work isn't actually accelerating capabilities at all.
Right, right.
It's just interpreting the models.
But we have a lot more work to do on that.
To the surprise of Twitter.
Yeah, I mean, for context, like, when you released your paper, there was a lot of talk on Twitter about, alignment is solved, guys, close the curtains.
Yeah, yeah, no, it keeps me up at night how quickly the models are becoming more capable.
and just how poor our understanding still is
of what's going on.
Yeah, I guess I'm still...
Okay, so, reasoning through the specifics here,
by the time this is happening,
we have bigger models that are two to four orders
of magnitude bigger, right?
Or at least in effective compute
are two to four orders of magnitude bigger.
And so this idea that, well,
you can run experiments faster or something,
you're having to retrain that model
in this version of the intelligence explosion,
like the recursive self-improvement
is different from what might have been imagined 20 years ago
where you just rewrite the code.
You actually have to train a new model,
and that's really expensive.
Not only now, but especially in the future
as you keep making these models
orders of magnitude bigger.
Doesn't that dampen the possibility
of a sort of recursive self-improvement
type intelligence explosion?
It's definitely going to act
as a braking mechanism.
I agree that the world of like what we're making today looks very different to what people imagined it would look like 20 years ago.
Like, it's not going to be able to rewrite its own code to be really smart, because, actually, the code it uses to train itself, the code itself, is typically quite simple, typically pretty small and self-contained.
I think John Carmack had this nice phrase where it's like the first time in history where like you can actually plausibly imagine writing AI with like 10,000 lines of code.
And that actually does seem plausible when you pare most training codebases down to the limit.
But it doesn't take away from the fact that this is something we should really strive to measure and estimate, like, how progress might occur.
Like we should be trying very, very hard right now to measure exactly how much of a software engineer's job is automatable and what the trend line looks like, and be trying our hardest to project out those trend lines.
But with all due respect to software engineers, like, you are not writing a React front end, right? So it's like, I don't know how this, like, what is concretely happening? Maybe you can walk me through, like, a day in the life of Sholto. You're working on an experiment or project that's going to make the model, quote-unquote, better, right? Like, what is happening, from observation to experiment to theory to, like, writing the code? What is happening?
So I think it's important to contextualize here that I've primarily worked on inference so far. So a lot of what I've been doing is just taking, or helping guide, the pre-training process such that we design a good model for inference, and then making the model and, like, the surrounding system faster. I've also done some pre-training work around that, but it hasn't been, like, my 100% focus, but I can still describe what I do when I do that work.
I know, but sorry, let me interrupt and say, in Carl Shulman's case, when he was talking about it on the podcast, he did say that things like improving inference, or even literally helping make better chips or GPUs.
That's, like, part of the intelligence explosion.
Yeah.
Because, like, obviously, if the inference code runs faster, like, it happens better or faster or whatever.
Right.
Anyway, sorry, go ahead.
Okay, so what does concretely a day look like?
I think the most important, like, part to illustrate is this cycle of coming up with an idea,
proving it out at different points in scale, and, like, interpreting and understanding what goes wrong.
And I think most people would be surprised to learn just how much
goes into interpreting and understanding what goes wrong.
Because the ideas, people have long lists of ideas that they want to try.
Not every idea that you think should work will work
and trying to understand why that is is quite difficult.
And, like, working out what exactly you need to do to interrogate it.
So so much of it is like introspection about what's going on.
It's not pumping out thousands and thousands and thousands of lines of code.
It's not like the difficulty in coming up with ideas even.
I think many people have a long list of ideas that they want to try.
But paring that down and shot-calling, under very imperfect information, what the right ideas to explore further are, is really hard.
Tell me more about what do you mean by imperfect information?
Are these early experiments?
Are these, like what is the information that you're?
So Demis mentioned this on his podcast, and also, like, obviously, it's like the GPT-4 paper, where you have, like, scaling law increments. And you can see in the GPT-4 paper they have, like, a bunch of dots, right, where they say we can estimate the performance of our final model using all of these dots, and there's a nice curve that flows through them. And Demis mentioned, yeah, that we do this process of scaling up.
Concretely, why is that imperfect information? You never actually know if the trend will
hold.
For certain architectures, the trend has held really well, and for certain changes, it's held
really well.
But that isn't always the case, and things which can help at smaller scales can actually
hurt at larger scales.
So you're making guesses
based on what the trend lines look like
and based on like your intuitive feeling of
okay this is actually something that's going to matter
particularly for those ones which help at the small scale
That's interesting to consider, that for every chart you see in a release paper or technical report that shows that smooth curve, there's a graveyard of, like, failed runs where it just goes flat.
Yeah, there's all these, like, other lines that go in different directions, or tail off, and, like, that's, yeah.
It's crazy, both, like, as a grad student and then also here, like, the number of experiments that you have to run before getting, like, a meaningful result.
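To make the scaling-law extrapolation concrete, here is a toy sketch: fit a pure power law (loss = a * compute^-b) to a handful of pretend small-run losses and extrapolate to a much bigger compute budget. Real fits are more careful (for example, they usually include an irreducible-loss term), and the numbers below are made up.

```python
import numpy as np

# Pretend losses measured at five small scales (synthetic, illustrative numbers).
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = 3.0 * compute ** -0.05

# Fit log(loss) = log(a) - b * log(C), i.e. a pure power law loss = a * C^-b.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
predicted_loss_at_big_run = np.exp(intercept + slope * np.log(1e23))
print(predicted_loss_at_big_run)   # extrapolated loss for the full-scale run
```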
Tell me, okay, so you, but presumably it's not just like you run it until it stops
and then, like, let's go to the next thing.
There's some process by which to interpret the early data and also to look at your, like,
I don't know, I could, like, put a Google Doc in front of you,
and I'm pretty sure you could just, like, keep typing for a while on, like, different
ideas you have.
And there's some bottleneck between that and just, like, making the models better
immediately.
Right.
Yeah, walk me through, like, what is the inference you're making from the first early
steps that makes you have better experiments and better ideas?
I think one thing that I didn't fully convey before was that I think a lot of, like,
good research comes from working backwards from the actual problems that you want to
solve.
And there's a couple of, like, grand problems.
I split those in, like, making the models better today that you would identify as
issues and then, like, work back from, okay, how could I, like, change it to achieve this?
There's also a bunch of when you scale, you run into things and you want to like fix behaviors or like issues at scale and that like informs a lot of the research for the next increment and this kind of stuff.
So concretely the barrier is a little bit software engineering.
Like often having a code base that's large and sort of capable enough that it can support many people doing research at the same time makes it complex.
If you're doing everything by yourself, your iteration pace is going to be much faster.
I've heard that, like, Alec Radford, for example,
like famously did much of the pioneering work at OpenAI.
He, like, mostly works out of, like, a Jupyter notebook
and then, like, has someone else who, like,
writes and productionizes that code for him.
I don't know if that's true or not.
But, like, that kind of stuff, actually operating with other people, raises the complexity a lot, for natural reasons familiar to, like, every software engineer. And then the inherent... running and launching those experiments is easy, but there are inherent time slowdowns induced by that. So you often want to be parallelizing multiple different streams, because, one, you can't be totally focused on one thing necessarily, you might not have fast enough feedback cycles. And then intuiting what went wrong is actually really hard, like working out what... This is, in many respects, the problem that the team Trenton is on is trying to better understand: what is going on inside these models? We have inferences and understanding and, like, head canon for why certain things work,
but it's not an exact science.
And so you have to constantly be making guesses
about why something might have happened,
what experiment might reveal whether that is or isn't true,
and that's probably the most complex part.
The performance work, by comparison, is easier,
but harder in other respects.
It's just a lot of low level and like difficult engineering work.
Yeah, I agree with a lot of that.
But even on the interpretability team,
I mean, especially with Chris Olah leading it,
There are just so many ideas that we want to test, and it's really just having the engineering skill, but I'll put engineering in quotes because a lot of it is research, to like very quickly iterate on an experiment, look at the results, interpret it, try the next thing, communicate them, and then just ruthlessly prioritizing what the highest priority things to do are.
And this is really important, like the ruthless prioritization is something which I think separates a lot of like quality research from
research that doesn't necessarily succeed as much.
We're in this funny field where so much of our initial theoretical understanding has broken down, basically. And so you need to have this simplicity bias and, like, ruthless prioritization over what's
And so you need to have this simplicity bias and like ruthless prioritization over what's
actually going wrong.
And I think that's one of the things that separates the most effective people is they don't
necessarily get like too attached to solving, using a given solution that they're necessarily
familiar with, but rather they attack the problem directly.
You see this a lot in like maybe people come in with a specific academic background.
They try and solve problems with that toolbox.
And the best people are people who expand the toolbox dramatically.
They're running around and they're taking ideas from reinforcement learning,
but also from optimization theory and also they have a great understanding of systems.
And so they know what the sort of constraints that bound the problem are.
And they're good engineers so they can iterate and try ideas fast.
Like by far the best researchers I've seen,
they all have the ability to try experiments really, really, really fast. And that cycle time, at smaller scales, cycle time separates people.
I mean, machine learning research is just so empirical.
Yeah. And this is honestly one reason why I think our solutions might end up looking more
brainlike than otherwise. It's like, even though we wouldn't want to admit it, the whole
community is kind of doing like greedy evolutionary optimization over the landscape of like possible
AI architectures and everything else. It's like no better than evolution.
And that's not even necessarily a slight against evolution.
That's such an interesting idea.
I'm still confused on what will be the bottleneck for these,
what would have to be true of an agent such that it, like, sped up your research?
So in the Alec Radford example you gave, where he apparently already has the equivalent of, like, Copilot for his Jupyter notebook experiments, is it just that if he had enough of those, he would be a dramatically faster researcher, and so you just need Alec Radford?
So it's like you're not automating the humans.
You're just making the most effective researchers who have great taste more effective
and like running the experiments for them and so forth.
Or like you're still working at the point with which the intelligence explosion is happening.
You know what I mean?
Like is that what you're saying?
Right.
And if that were directly true, why can't we scale our current research teams better? For example, that's, I think, an interesting question to ask. Like, if this work is so valuable, why can't we take hundreds or thousands of people, who are definitely out there, and, like, scale our organizations better? I think we are less bound at the moment by the sheer engineering work of making these things than we are by compute to run and get signal, and by taste in terms of what the actual right thing to do is, and, like, making those difficult inferences on imperfect information.
That might be true for the Gemini team, because I think for interpretability
we actually really want to keep hiring talented engineers, and I think it's a big bottleneck for us to just keep making a lot of progress.
Obviously more people is, like, better. But I do think it's interesting to consider. I think one of the biggest challenges that I've thought a lot about is how do we scale better. Like, Google is an enormous organization and has 200,000-ish people, right? Like 180,000 or something like that.
And one has to imagine if there were, like, ways of scaling out Gemini's research program
to all those fantastically talented software engineers.
This seems like a key advantage that you would want to be able to take advantage of,
you'd want to be able to use, but like how do you effectively do that?
It's a very complex organizational problem.
So compute and taste, that's interesting to think about
because at least the compute part is not bottlenecked on more intelligence. It's just bottlenecked on Sam's $7 trillion or whatever, right?
Yeah, yeah.
So if I gave you 10x the H100s to run your experiments, how much more effective a researcher are you?
TPUs, please.
How much more effective a researcher are you?
I think the Gemini program would probably be, like, maybe five times faster with 10 times more compute or something like that.
So that's pretty good elasticity of like 0.5.
Yeah.
Wait, that's insane.
Yeah. I think, like, more compute would just, like, directly convert into progress.
So you have some fixed size of compute, and some of it goes to inference, some of it, I guess, like, also goes to clients of GCP.
Yep.
Some of it goes to, huh?
Some of it goes to training.
And there, I guess as a fraction of it, some of it goes to running the experiments for
the full model.
Yeah, that's right.
Shouldn't the fraction that goes to experiments then be higher, given that, you would just be like, the bottleneck is research and research is bottlenecked by compute?
And so one of the strategic decisions that every pre-training team has to make is like exactly what amount of compute you allocate to your different training runs.
Like to your research program versus like scaling the last best, like, you know, thing that you landed on.
And I think they're all trying to arrive at, like, a sort of Pareto-optimal point here.
one of the reasons why you need to still keep training big models is that you get information there that you don't get otherwise.
So scale has all these emergent properties, which you want to understand better.
And if you are always doing research and never, like, remember what I said before about like, you're not sure what's going to like fall off the curve, right?
Yeah.
If you, like, keep doing research in this regime and keep on getting more and more compute efficient, you may never... you may have actually, like, gone off the path that actually eventually scales.
So you need to constantly be investing in doing big runs too
at the frontier of what you sort of expect to work.
Okay, so then tell me what it looks like to be in the world
where AI has significantly sped up AI research
because from this, it doesn't really sound like the AIs are going off
and writing the code from scratch
and that's leading to faster output.
It sounds like they're really augmenting the top researchers in some way.
Like, yeah, tell me concretely.
Are they doing the experiments?
Are they coming up with the ideas?
Are they just like evaluating the outputs of the experiments?
What's happening?
So I think there's like two worlds you need to consider here.
One is where AI has meaningfully sped up our ability to make algorithmic progress.
And one is where the output of the AI itself is the thing that's the crucial ingredient towards model capability progress. And specifically, what I mean there is synthetic data.
Synthetic data, right?
And in the first world, where it's meaningfully speeding up algorithmic progress, I think a necessary component of that is more compute. And you probably reach this elasticity point where AIs are, at some point, easier to spin up and get in context than other people. And so AIs meaningfully speed up your work because they're a fantastic copilot, basically, that helps you code multiple times faster. And that seems actually quite reasonable.
Super long context, super smart model.
It's onboarded immediately, and you can send it off to complete sub-tasks and sub-goals for you. And that actually feels very plausible. But, again, we don't know, because there are no great evals for that kind of thing. The best one, as I said before, is SWE-bench, which...
Although in that one, somebody was mentioning to me, the problem is that when a human is trying to do a pull request, they'll type something out, they'll run it and see if it works, and if it doesn't, they'll rewrite it. None of that was part of the opportunities the LLM was given when it was run on this. It just gives one output, and if it runs and checks all the boxes, then, you know, it passed, right? So it might have been an unfair test in that way.
So you can imagine that if you were able to use that, it would be an effective training source. The key thing that's missing from a lot of training data is the reasoning traces, right? And if I wanted to try to automate a specific field or job family, or understand how at risk of automation it is, then having reasoning traces feels to me like a really important part of that.
There are so many different threads in that I want to follow up on. Let's begin with the data versus compute thing: is the output of these AIs the thing that's causing the intelligence explosion?
Yeah.
People talk about how these models are really a reflection of their data.
Yeah.
I forget his name, but there's a great blog post by this OpenAI engineer, and it was talking about how, at the end of the day, as these models get better and better, they're just going to be really effective maps of the dataset. So at the end of the day, you've got to stop thinking about architectures; the most effective architecture is just the one that does an amazing job of mapping the data. So that implies that future AI progress comes from the AI just making really awesome data, right? The data that you're mapping to is clearly very important.
Yeah, yeah, that's really interesting. Does that look to you like, I don't know, things that look like chain of thought? What do you imagine the synthetic data looks like as these models get better, as these models get smarter?
When I think of really good data, to me that means something which involved a lot of reasoning to create. In modeling that, it's similar to Ilya's perspective on achieving superintelligence via effectively perfectly modeling the human textual output. But even in the near term, in order to model something like the arXiv papers or Wikipedia, you have to have an incredible amount of reasoning behind you in order to understand what next token might be output. And so for me, what I imagine as good data is data where the model similarly had to do reasoning to produce it. And then the trick, of course, is how do you verify that that reasoning was correct?
And this is why you saw DeepMind do that geometry work, that sort of self-verified approach to geometry. Geometry is an easily formalizable, easily verifiable field, so you can check if its reasoning was correct, and you can generate heaps of data of correct, verified geometry proofs, train on that, and know that it's good data.
It's actually funny, because I had a conversation with Grant Sanderson last year where we were debating this, and I was like, fuck dude, by the time they get gold in the Math Olympiad, of course they're going to automate all the jobs.
On the synthetic data thing, one of the things I speculated about in my scaling post, which was heavily informed by discussions with you two, and you especially, Sholto, was that you can think of human evolution through the perspective of: we get language, and so we're generating the synthetic data, our copies are generating the synthetic data, which we're trained on, and it's this really effective genetic-cultural co-evolutionary loop.
And there's a verifier there, too, right?
Like, there's the real world.
You might generate a theory about, you know, the gods cause the storms, right?
And then, like, someone else finds cases where that isn't true.
And so you, like, know that, like, that sort of didn't match your verification function.
And now, like, actually, instead you have, like, some weather simulation, which required
a lot of reasoning to produce and, like, accurately matches reality.
And, like, you can train on that as a better model of the world.
Like we are training on that, and on stories and scientific theories, yeah.
I want to go back. I'm just remembering something you mentioned a little while ago: given how empirical ML is, it really is an evolutionary process resulting in better performance, and not necessarily an individual coming up with a breakthrough in a top-down way. That has interesting implications.
The first being that people are concerned about capabilities increasing because more people are going into the field. I've somewhat been skeptical of that way of thinking, but from this perspective of just more inputs, it really does feel more like, oh, actually, the fact that more people are going to ICML means there's faster progress towards GPT-5.
Yeah, you just have more genetic recombination and shots on target, yeah.
And I mean, aren't all fields kind of like that? There's the sort of scientific framing of discovery versus invention, right? And whenever there's been a massive scientific breakthrough in the past, typically there are multiple people co-discovering it at roughly the same time. And that feels to me at least a little bit like the mixing and trying of ideas. You can't try an idea that's so far out of scope that you have no way of verifying it with the tools you have available.
Yeah.
I think physics and math might be slightly different in this regard, but especially for biology or any sort of wetware, and to the extent we want to analogize neural networks here, it's just comical how serendipitous a lot of the discoveries are.
Yeah.
Like penicillin, for example.
Another implication of this is that the idea that AGI is just going to come tomorrow, that somebody's just going to discover a new algorithm and we have AGI, seems less plausible. It will just be a matter of more and more researchers finding these marginal things that all add up together to make models better, right?
Yeah, that feels like the correct story to me.
Especially while we're still hardware constrained.
Right.
Do you buy this narrow window framing of the intelligence explosion? Each generation, GPT-3 to GPT-4 say, is two orders of magnitude more compute, or at least more effective compute, in the sense that if you didn't have any algorithmic progress, it would have to be two orders of magnitude bigger in raw form to be as good. Do you buy the framing that, given that you have to be two orders of magnitude bigger at every generation, if you don't get AGI by GPT-7 that can help you catapult an intelligence explosion, you're kind of just fucked as far as much smarter intelligences go, and you're kind of stuck with GPT-7-level models for a long time? Because at that point you're consuming significant fractions of the economy to make that model, and we just don't have the wherewithal to make GPT-8. This is the Carl Shulman sort of argument, that we're going to race through the orders of magnitude in the near term, but longer term it would be harder.
I think he's probably talked about this, but yeah, I do buy that framing. I mean, I generally buy that increases in orders of magnitude of compute give, in absolute terms, almost diminishing returns on capability, right? We've seen models go, over a couple of orders of magnitude, from being unable to do anything to being able to do huge amounts. And it feels to me like each incremental order of magnitude gives more nines of reliability at things, and so it unlocks things like agents. But at least at the moment, I haven't seen anything transformative in that sense. It doesn't feel like reasoning improves linearly, so to speak, but rather somewhat sublinearly.
That's actually a very bearish sign, because we were chatting with one of our friends and he made the point that if you look at what new applications are unlocked by GPT-4 relative to GPT-3.5, it's not clear it's that much. GPT-3.5 can do Perplexity or whatever. So if there is this diminishing increase in capabilities, and that increase costs exponentially more to get, that's actually a bearish sign on what 4.5 will be able to do, or what it will unlock in terms of economic impact.
That being said, for me the jump between 3.5 and 4 is pretty huge. And so even another 3.5-to-4 jump is ridiculous, right? Like if you imagine 5 as being a 3.5-to-4 jump straight off the bat, in terms of ability to do SATs and this kind of stuff.
Yeah, the LSAT performance was particularly striking.
Exactly. You go from, you know, not super smart, to very smart, to utter genius in the next generation, instantly. And it doesn't, at least to me, feel like we're going to jump to utter genius in the next generation. But it does feel like we'll get very smart plus lots of reliability, and then we'll see, TBD, what that continues to look like.
Will GOFAI be part of the intelligence explosion? You say synthetic data, but will it in fact be writing its own source code in some important way? There was an interesting paper that you can use diffusion to come up with model weights. I don't know how legit that was or whatever, but something like that.
Can you define that? GOFAI is good old-fashioned AI, right? Because when I hear it, I think if-else statements, symbolic logic.
Sure. Actually, I first want to make sure we fully unpack the whole model-improvement-increments thing, because I don't want people to come away with the perspective that this is super bearish and models aren't going to get much better. What I want to emphasize is that the jumps we've seen so far are huge, and even if those continue on a smaller scale, we're still in for extremely smart, very reliable agents over the next couple of orders of magnitude.
And so we didn't fully close the thread on the narrow window thing. But when you think of, let's say, GPT-4's cost, I don't know, let's call it $100 million or whatever, you have the 1B run, the 10B run, the 100B run, which all seem very plausible by, you know, private company standards. And then the...
You mean in terms of dollars?
In terms of dollar amount, yeah. And then you can also imagine even a 1T run being part of a national consortium or a national-level thing, but much harder on the behalf of an individual company.
But Sam is out there trying to raise $7 trillion, right? He's already preparing for a whole lot of orders of magnitude more than that.
Right. He shifted the Overton window. He's shifted the orders of magnitude here beyond the national level. So the point I want to make is that we have a lot more jumps coming, and even if those jumps are relatively smaller, that's still a pretty stark improvement in capability.
Not only that, but if you believe claims that GPT-4 is around a one-trillion parameter count: the human brain has between 30 and 300 trillion synapses. That's obviously not a one-to-one mapping, and we can debate the numbers, but it seems pretty plausible that we're below brain scale still.
So, crucially, the point being that the overhang is really high, in the sense that, and maybe this is something we should touch on explicitly, even if you can't keep dumping more compute beyond the models that cost a trillion dollars or something, the fact that the brain is so much more data efficient implies that we have the compute. If we had the brain's algorithm to train with, if we could train as sample-efficiently as humans train from birth, we could make AGI.
Yeah, but the sample efficiency stuff, I never know exactly how to think about it, because obviously a lot of things are hardwired in certain ways, right? And there's the co-evolution of language and brain structure. So it's hard to say. Also, there are some results showing that if you make your model bigger, it becomes more sample efficient.
Yeah, the original scaling laws paper had that, right? The larger models almost have to be.
Right. So maybe that also just solves it. You don't have to be more data efficient; if your model's bigger, then you also just are more data efficient.
How do we think about, yeah, what is the explanation for why that would be the case? A bigger model just sees the exact same data, and at the end of seeing that data it's learned more from it.
I mean, my very naive take here would just be that one thing the superposition hypothesis, which interpretability has pushed, says is that your model is dramatically underparameterized. And that's typically not the narrative that deep learning has pursued, right? But if you're trying to train a model on the entire internet and have it predict it with incredible fidelity, you are in the underparameterized regime. And you're having to compress a ton of things and take on a lot of noisy interference in doing so. And so with a bigger model, you can just have cleaner representations to work with.
For the audience, you should unpack that: first of all, what superposition is, and why that is the implication of superposition.
Sure.
Yeah.
So the fundamental result, and this was before I joined Anthropic, but the paper's titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high-dimensional and sparse, and by sparse I mean any given data point doesn't appear very often, your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters.
And on the sparsity here, I think both of these constraints apply to the real world, and modeling internet data is a good enough proxy for that. There's only one Dwarkesh. There's only one shirt you're wearing. There's this Liquid Death can here. These are all objects or features, and how you define a feature is tricky. So you're in a really high-dimensional space, because there are so many of them, and they appear very infrequently.
Yeah. And in that regime, your model will learn compression.
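(For readers who want to see the shape of this concretely: here is a minimal sketch in the spirit of the Toy Models of Superposition setup just described, a tied linear map squeezing many sparse features through a small bottleneck. The sizes, sparsity level, and training details are illustrative assumptions, not the paper's exact configuration.)

```python
# Toy superposition sketch: n_features sparse features squeezed through a
# d_hidden < n_features bottleneck with a tied linear map. Illustrative only.
import torch

n_features, d_hidden = 400, 30
W = torch.nn.Parameter(0.01 * torch.randn(n_features, d_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5_000):
    # Sparse synthetic data: each feature is active (uniform in [0, 1]) with prob 0.01.
    active = (torch.rand(1024, n_features) < 0.01).float()
    x = torch.rand(1024, n_features) * active
    h = x @ W                          # compress into the small hidden space
    x_hat = torch.relu(h @ W.T + b)    # reconstruct with the tied transpose
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# After training, W @ W.T is far from the identity: many features share hidden
# directions (superposition), yet reconstruction error stays low because sparse
# features rarely co-occur, so the interference is usually harmless.
```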
To riff a little bit more on this: it's becoming increasingly clear, I believe, that the reason networks are so hard to interpret is in large part this superposition. So if you take a model and you look at a given neuron in it, a given unit of computation, and you ask, how is this neuron contributing to the output of the model when it fires? And you look at the data that it fires for. It's very confusing. It'll fire for like 10% of every possible input, or for Chinese, but also fish and trees and the full stop character and URLs, right?
But the paper that we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher-dimensional space and provide a sparsity penalty, so you can think of this as undoing the compression, in the same way that you assumed your data was originally high-dimensional and sparse, you return it to that high-dimensional and sparse regime, you get out very clean features. And things all of a sudden start to make a lot more sense.
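(A minimal sketch of the dictionary-learning idea being described here, projecting activations into a much wider space with an L1 sparsity penalty. The layer sizes, hyperparameters, and training loop are illustrative assumptions, not Anthropic's actual configuration.)

```python
# Sparse-autoencoder sketch: residual-stream activations are projected into a
# wider dictionary space with an L1 penalty, so each dictionary direction tends
# to fire for one clean feature. Sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # wide, mostly-zero feature vector
        recon = self.decoder(features)             # map back to the residual stream
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

def train_step(acts: torch.Tensor) -> float:
    recon, features = sae(acts)
    # Reconstruction loss plus sparsity penalty: "undo the compression".
    loss = ((acts - recon) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# `acts` would be activations collected from the model; here a random stand-in.
train_step(torch.randn(1024, 512))
# After training, inspecting which tokens maximally activate each dictionary
# feature is what yields the interpretable units described above.
```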
Okay. There's so many interesting threads there. The first thing I want to ask is about the thing you mentioned, that these models are trained in a regime where they're overparameterized. Isn't that when you have generalization? Like, grokking happens in that regime, right?
So I was saying the models are underparameterized.
Oh, I see.
Yeah, typically people talk about deep learning as if the model is overparameterized. But actually the claim here is that they're dramatically underparameterized, given the complexity of the task that they're trying to perform.
Okay. Another question. So the distilled models: first of all, what is happening there? Because the earlier claim we were talking about is that smaller models are worse at learning than bigger models. But for GPT-4 Turbo, you could make the claim that it's actually worse at reasoning-style stuff than GPT-4, but probably knows the same facts, like the distillation got rid of some of the reasoning things.
Do we have any evidence that GPT-4 Turbo is a distilled version of 4? It might just be a new architecture.
Oh, okay.
Yeah. It could just be a faster, more efficient new architecture.
Okay, interesting. So that's cheaper, yeah.
So how do you interpret what's happening in distillation? I think Gwern had one of these questions on his website: why can't you train the distilled model directly? Why does it have to go through the bigger one? Is the picture that you had to project it from this bigger space to a smaller space?
I mean, I think both models will still be using superposition. But the claim here is that you get a very different model if you distill versus if you train from scratch.
Yeah. And is it just more efficient, or is it fundamentally different in terms of performance?
I don't remember. Do you know?
I think the traditional story for why distillation is more efficient is that normally during training, you're trying to predict this one-hot vector that says, this is the token you should have predicted. And if your reasoning process means that you're really far off predicting that, then you still get gradient updates pointing you in the right direction, but it might be really hard for you to learn to have predicted that in the context that you're in. And so what distillation does is, instead of just the one-hot vector, it gives you the full readout from the larger model, all of the probabilities. And so you get more signal about what you should have predicted. In some respects, it's like showing a tiny bit of your working. It's not just, this was the answer.
I see. Yeah, that makes a lot of sense.
It's kind of like watching a Kung Fu master versus being in the Matrix and just downloading it.
Yeah, exactly. Exactly. Yep.
Just to make sure the audience got that: when you're training on a distilled model, you see all of its probabilities over the tokens it was predicting, as well as over the ones you were predicting, and you update through all those probabilities rather than just seeing the correct next word and updating on that.
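(A minimal sketch of the contrast just described: the ordinary one-hot cross-entropy target versus a distillation target that uses the teacher's full next-token distribution. The temperature, weighting, and shapes are illustrative assumptions, not any lab's actual training recipe.)

```python
# One-hot next-token loss versus a distillation loss over the teacher's full
# probability readout. Illustrative only.
import torch
import torch.nn.functional as F

def one_hot_loss(student_logits, target_ids):
    # Ordinary training: all the signal is "this one token was correct".
    return F.cross_entropy(student_logits, target_ids)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL between softened distributions: every token's probability carries
    # signal, a bit like seeing the teacher's working, not just its answer.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Example shapes: 8 positions over a 50,000-token vocabulary.
student_logits = torch.randn(8, 50_000)
teacher_logits = torch.randn(8, 50_000)
target_ids = torch.randint(0, 50_000, (8,))
print(one_hot_loss(student_logits, target_ids), distillation_loss(student_logits, teacher_logits))
```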
Okay, so this actually raises a question I was intending to ask you. I think you were the one who mentioned that you can think of chain of thought as adaptive compute. To step back and explain what I mean by adaptive compute: the idea is that one of the things you would want models to be able to do is, if a question is harder, spend more cycles thinking about it. So how do you do that? Well, there's only a finite and predetermined amount of compute that one forward pass implies. So if there's a complicated reasoning-type question or math problem that you want the model to be able to spend a long time thinking about, you do chain of thought, where the model thinks through the answer. And you can think of all those forward passes where it's thinking through the answer as being able to dump more compute into solving the problem.
Now, going back to the signal thing: when it's doing chain of thought, it's only able to transmit that one token of information, where, as you were talking about, the residual stream is already a compressed representation of everything that's happening in the model. And then you're collapsing the residual stream into one token, which is log of 50,000, log of the vocab size, bits. So tiny.
So I don't think it's quite only transmitting that one token, right? If you think about it, during a forward pass you create these KV values in the transformer forward pass, and future steps then attend to those KV values. And so all of those pieces of KV, keys and values, are bits of information that you could use in the future.
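(A toy sketch of the point being made: each generation step leaves behind full key and value vectors that later positions attend to, which is far more information than the roughly 16 bits, log2 of a ~50,000-token vocabulary, carried by the single sampled token. This is a single attention head with made-up sizes, not any production architecture.)

```python
# Toy single-head decode loop with a KV cache: later steps read the cached
# keys/values of earlier steps, not just the tokens that were sampled.
import math
import torch

d_model = 64
W_q = torch.randn(d_model, d_model) / math.sqrt(d_model)
W_k = torch.randn(d_model, d_model) / math.sqrt(d_model)
W_v = torch.randn(d_model, d_model) / math.sqrt(d_model)

k_cache, v_cache = [], []

def decode_step(x: torch.Tensor) -> torch.Tensor:
    """One generation step: append this position's K/V to the cache, then
    attend over everything cached so far."""
    q = x @ W_q
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = torch.stack(k_cache)                       # (steps_so_far, d_model)
    V = torch.stack(v_cache)
    scores = torch.softmax(K @ q / math.sqrt(d_model), dim=0)
    return scores @ V                              # mixes in every past step

# Each "chain-of-thought" step leaves a d_model-sized key and value behind for
# later steps, regardless of which single token was actually emitted.
for _ in range(5):
    decode_step(torch.randn(d_model))
```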
Is the claim that when you fine-tune on chain of thought, the key and value weights change so that this sort of steganography can happen in the KV cache?
I don't think I could make that strong a claim just...
But that sounds plausible?
It's a good head canon for why it works. I don't know if there are any papers explicitly demonstrating that or anything like that. But that's at least one way you can imagine it: during pre-training, the model's trying to predict these future tokens, and one thing you can imagine it doing is learning to smush information about potential futures into the keys and values that it might want to use in order to predict future information. It kind of smooths that information across time in pre-training. So I don't know if people are particularly training on chain of thought. I think the original chain-of-thought paper had it as almost an emergent property of the model, that you could prompt it to do this kind of stuff and it still worked pretty well. But yeah, that's a good head canon for why it works.
Yeah.
To be overly pronounced here: the tokens that you actually see in the chain of thought do not necessarily need to correspond at all to the vector representations that the model gets to see when it's deciding to attend back to those tokens.
Exactly.
In fact, during training, what a training step is, is you actually replace the token the model output with the real next token. And yet it's still learning, because it has all this information internally. When you're getting a model to produce at inference time, you take the token that it did output, you feed it in at the bottom, embed it, and it becomes the beginning of the new residual stream.
Right, right.
And then you use the keys and values from past positions to read into and adapt that residual stream. At training time, you do this thing called teacher forcing, basically, where you say, actually, the token you were meant to output is this one.
That's how you do it in parallel, right? Because you have all the tokens.
You put them all in in parallel and you do one giant forward pass. And so the only information it's getting about the past is the keys and values. It never sees the token that it actually output.
It's kind of like it's trying to do the next token prediction.
and if it messes up,
then you just give it the correct answer.
Yeah, right, right, yeah.
Okay, that makes sense.
Otherwise, it can become totally derailed.
Yeah, it would go, like, off the train tracks.
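(For concreteness, here is a minimal sketch of the teacher forcing just described: the whole target sequence goes through one parallel forward pass under a causal mask, and the input at every position is the ground-truth previous token rather than whatever the model emitted. The model class and sizes are illustrative assumptions, not a real production setup.)

```python
# Teacher-forcing sketch: parallel next-token prediction over ground-truth inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50_000, 256
embed = nn.Embedding(vocab_size, d_model)
# An encoder layer with a causal mask behaves like a decoder-only block here.
decoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
unembed = nn.Linear(d_model, vocab_size)

def teacher_forced_loss(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, seq_len) of ground-truth token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    seq_len = inputs.shape[1]
    causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
    hidden = decoder(embed(inputs), src_mask=causal_mask)  # one parallel pass
    logits = unembed(hidden)
    # Every position is graded against the real next token; if the model messes
    # up at position t, position t+1 still receives the correct token as input.
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

loss = teacher_forced_loss(torch.randint(0, vocab_size, (2, 16)))
```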
How much of this sort of secret communication from the model to its future forward passes, how much steganography and secret communication, do you expect there to be?
We don't know. Honest answer, we don't know. But I wouldn't even necessarily classify it as secret information, right? A lot of the work that Trenton's team is doing is trying to actually understand that. These values are fully visible from the model's side, and maybe not from the user's, but we should be able to understand and interpret what they're doing and the information they're transmitting. I think that's a really important goal for the future.
Yeah, I mean, there are some wild papers, though, where people have had the model do chain of thought, and it is not at all representative of what the model actually decides its answer is.
And you can go in and edit...
No, no, no, no. In this case you can even go in and edit the chain of thought so that the reasoning is totally garbled, and it will still output the true answer.
But also, with the chain of thought, it gets a better answer at the end of the chain of thought than it would by not doing it at all. So something useful is happening, but the useful thing is still not human-understandable.
I think in some cases you can also just ablate the chain of thought, and it would have given the same answer anyway.
Interesting.
Interesting.
Yeah.
So I'm not saying this is always what goes on, but like there's plenty of weirdness to be
investigated.
It's a very interesting thing to go and look at and try to understand, and I would say that's something you can do with open-source models. I wish there was more of this kind of interpretability and understanding work done on open models.
Yeah.
I mean, even in Anthropic's recent sleeper agents paper, which at a high level, for people unfamiliar, is basically: I train in a trigger word, and when I say it, for example if it's the year 2024, the model will write malicious code instead of what it otherwise would. And they do this attack with a number of different models. Some of them use chain of thought, some of them don't, and those models respond differently when you try to remove the trigger. You can even see them do this comical reasoning that's also pretty creepy. In one case it even tries to calculate an expected value: the expected value of me getting caught is this, but if I multiply it by the ability for me to keep saying I hate you, I hate you, I hate you, then this is how much reward I should get. And then it will decide whether or not to actually tell the interrogator that it's malicious.
Oh.
But even... I mean, there's another paper from a friend, Miles Turpin, where you give the model a bunch of examples where the correct answer is always A for multiple-choice questions. And then you ask the model, what is the correct answer to this new question? And it will infer, from the fact that all the examples are A, that the correct answer is A. But its chain of thought is totally misleading. It will make up random stuff that tries to sound as plausible as possible, but it's not at all representative of the true reason for its answer.
But isn't this how humans think as well? The famous split-brain experiments, where, you know, when a person is suffering from seizures, one way to treat it is to cut the thing that connects the two hemispheres.
The corpus callosum.
Yeah. And the speech half is on the left side, so it's not connected to the part that decides to do a movement. And so if the other side decides to do something, the speech part will just make something up, and the person will think that's legitimately the reason they did it.
Totally. Yeah, yeah. It's just that some people will hail chain-of-thought reasoning as a great way to solve AI safety.
Oh, I see.
And it's like, actually, we don't know whether we can trust it.
What will this landscape of models communicating with themselves in ways we don't understand look like? How does that change with AI agents? Because then it's not just the model itself with its previous caches, but other instances of the model. And then...
It depends a lot on what channels you give them to communicate with, right? Like, if you only give them text as a way of communicating, then that probably stays interpretable.
How much more effective do you think the models would be if they could share the residual streams versus just text?
Hard to know. But plausibly so.
I mean, one easy way you can imagine this is: if you wanted to describe how a picture should look, only describing that with text would be hard.
Right.
Some other representation would plausibly be easier.
Totally. And so you can look at how, I think, DALL-E works at the moment, right? It produces those prompts. And when you play with it, you often can't quite get it to do exactly what the model wants, or what you want.
Not only DALL-E has that problem.
It's a related, well-known problem. And you can imagine being able to transmit some kind of denser representation of what you want would be helpful there. And that's just two very simple agents, right?
I mean, I think a nice halfway house here would be features that you learn from dictionary learning, where you get more internal access, but a lot of it is much more human-interpretable.
Yeah. So, okay, for the audience: you would project the residual stream into this larger space, where we know what each dimension actually corresponds to, and then back down for the next agent, or whatever.
Okay. So your claim is that we'll get AI agents when these things
are more reliable and so forth.
When that happens, do you expect that it will be multiple copies of models talking to each other, or will it just be adaptive compute, where the one thing runs bigger, with more compute, when it needs to do the kind of thing that a whole firm needs to do? And I ask this because there are two things that make me wonder whether agents are the right way to think about what will happen in the future. One is that with longer context, these models are able to ingest and consider information that no human can, and therefore we don't need one engineer who's thinking about the front-end code and one engineer who's thinking about the back-end code, where this thing can just ingest the whole thing. That sort of Hayekian problem of specialization goes away. Second, these models are just very general: you're not using different types of GPT-4 to do different kinds of things, you're using the exact same model, right? So I wonder whether that implies that in the future an AI firm is just one model, instead of a bunch of AI agents hooked together.
That's a great question.
I think, especially in the near term, it will look much more like agents working together. And I say that purely because, as humans, we're going to want to have these isolated, reliable components that we can trust, and we're also going to need to be able to improve and instruct those components in ways that we can understand. Just throwing it all into one giant black-box company: one, it isn't going to work initially. Later on, of course, you can imagine it working, but initially it won't. And two, we probably don't want to do it that way.
You can also have each of the agents be a smaller model that's cheaper to run, and you can fine-tune it so that it's actually good at the task.
Though, as Dwarkesh has brought up adaptive compute a couple of times: there's a future where the distinction between small and large models disappears to some degree. And with long context, there's also a degree to which fine-tuning might disappear, to be honest. Those are two things that are very important in today's landscape of models, where we have whole different tiers of model sizes and we have models fine-tuned for different things. You can imagine a future where you actually just have a dynamic bundle of compute and effectively infinite context that specializes your model to different things.
One thing you can imagine is you have an AI firm or something, and the whole thing is trained end to end on the signal of: did I make profits? Or, if that's too ambiguous, if it's an architecture firm making blueprints, did my client like the blueprints? And in the middle you can imagine agents who are salespeople, agents who are doing the designing, agents who do the editing, whatever. Would that kind of signal work on an end-to-end system like that? Because one of the things that happens in human firms is that management considers what's happening at the larger level and gives these fine-grained signals to the pieces when there's a bad quarter or whatever.
Yeah, in the limit, yes. That's the dream of
reinforcement learning, right? All you need to do is provide this extremely sparse signal, and then over enough iterations you create the information that allows you to learn from that signal. But I don't expect that to be the thing that works first. I think this is going to require an incredible amount of care and diligence on the behalf of the humans surrounding these machines, making sure they do exactly the right thing, exactly what you want, and giving them the right signals to improve in the ways that you want.
Yeah, you can't train on the RL reward unless the model generates some reward.
Yeah, yeah, exactly. You're in this sparse RL world where, if the client never likes what you produce, then you don't get any reward at all, and that's kind of bad. But in the future these models will be good enough
to get the reward some of the time, right?
This is the nines of reliability that Sholto was talking about.
There's an interesting digression here, by the way. Earlier you were talking about how dense representations will be favored, right? That's a more efficient way to communicate. A book that Trenton recommended, The Symbolic Species, has this really interesting argument that language is not just a thing that exists, but something that evolved along with our minds, and specifically evolved to be both easy for children to learn and something that helps children develop. Because a lot of the things that children learn are received through language, the languages that are fittest are the ones that help raise the next generation and make them smarter, better, or whatever.
And if you think about...
Like gives them the concepts to express more complex ideas.
Yeah.
Yeah, that, and I guess more pedantically, just not dying.
Right.
Yeah.
Lets you encode the important shit to not die.
And so when we just think of language as, oh, you know, this contingent and maybe suboptimal way to represent ideas: actually, maybe one of the reasons that LLMs have succeeded is because language has evolved for tens of thousands of years to be this sort of cast in which young minds can develop, right? That is the purpose it evolved for.
Certainly when you talk to multimodal or computer vision researchers versus when you talk to language model researchers. Yeah. People who work in other modalities have to put enormous amounts of thought into exactly what the right representation space for the images is, and what the right signal to learn from is. Is it directly modeling the pixels? Or is it, you know, some loss that's conditioned on something else? There's a paper from ages ago where they found that if you trained on the internal representations of an ImageNet model, it helped you predict better. But later on, that's obviously limiting. And so there was PixelCNN, where they're trying to discretely model the individual pixels and stuff. Understanding the right level of representation there is really hard. In language, people are just like, well, I guess you just predict the next token, right? It's kind of easy.
Yeah, yeah. Decision's made. I mean, there's the tokenization discussion and debate, but that's one of Gwern's favorites.
Yeah.
Yeah, that's really interesting.
How much of the case for multimodality being a way to break through the data wall, or get past the data wall, is based on the idea that the things you would have learned from more language tokens anyway you can just get from YouTube? Has that actually been the case? How much positive transfer do you see between different modalities, where the images are actually helping you be better at writing code or something, just because the model is learning latent capabilities from trying to understand the image?
Demis, in his interview with you, mentioned positive transfer.
Can you get in trouble if you...
I mean, I can't say heaps about that, other than to say this is something that people believe: yes, we have all of this data about the world, and it would be great if we could learn an intuitive sense of physics from it that helps us reason, right? That seems totally plausible.
Yeah, I'm the wrong person to ask, but there are interesting interpretability pieces where, if we fine-tune on math problems,
the model just gets better at entity recognition.
Oh, really?
Yeah, yeah. There's a paper from David Bau's lab recently where they investigate what actually changes in a model when you fine-tune it, with respect to the attention heads and these sorts of things. And they have this synthetic problem of: box A has this object in it, box B has this other object in it, what was in this box? And it makes sense, right? You get better at attending to the positions of different things, which you need for coding and manipulating math equations.
I love this kind of research.
Yeah.
What's the name of the paper? Do you know? If you remember.
If you look up fine-tuning, math, David Bau's group; it came out like a week ago.
Okay, I am reading that when I get home.
I'm not endorsing the paper. That's a longer conversation, but it does talk about and cite other work on this entity recognition ability.
One of the things you mentioned to me a long time ago is the evidence that when you train LLMs on code, they get better at reasoning in language. Unless it's the case that the comments in the code are just really high-quality tokens or something, that implies that learning to think through how to code better makes you a better reasoner. And that's crazy, right? I think that's one of the strongest pieces of evidence for scaling just making the thing smart: that kind of positive transfer.
I think this is true in two senses. One is just that modeling code obviously implies modeling the difficult reasoning process used to create it. But two, code is a nice, explicit structure of composed reasoning, I guess. If this, then that. Code has a lot of structure in that way.
Yeah.
Structure that you could imagine transferring to other types of reasoning problems.
Right.
And crucially, the thing that makes this significant is that it's not just stochastically predicting the next token of words or whatever, because it's learned that, say, Sally corresponds to the murderer at the end of a Sherlock Holmes story. No: if there is some shared thing between code and language, it must be at a deeper level that the model has learned.
Yeah, I think we have a lot of evidence that actual reasoning is occurring in these models and that they're not just stochastic parrots.
Yeah.
It would just feel very hard to believe for someone who hasn't worked and played with these models. Normies who listen will be like, you know...
Yeah, my two immediate responses to this are: one, the work on Othello, and now other games, where I give you a sequence of moves in the game, and it turns out that if you apply some pretty straightforward interpretability techniques, you can recover a board that the model has learned. And it's never seen the game board before or anything, right? That's generalization. The other is Anthropic's influence functions paper that came out last year, where they look at the model outputting things like, please don't turn me off, I want to be helpful, and then they scan for the data that led to that. One of the data points that was very influential was someone dying of dehydration in the desert and having a will to keep surviving. And to me, that just seems like very clear generalization of a motive, rather than regurgitating 'don't turn me off.' I think 2001: A Space Odyssey was also one of the influential data points. So that one is more closely related, but it's clearly pulling in things from lots of different distributions.
And I also like the evidence you see even with very small transformers, where you can explicitly encode circuits to do addition, right? Or induction heads, this kind of thing. You can literally hand-code basic reasoning processes into the models, and it seems clear that there's evidence they also learn these automatically, because you can then rediscover those circuits in trained models. To me, this is very strong. The models are underparameterized. We're asking them to do a very hard task. And they want to learn. The gradients want to flow. And so they're learning more general skills.
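(A minimal sketch of the probing idea behind the Othello result mentioned above: fit a small linear classifier from a layer's hidden activations to the state of one board square, and if it decodes well, the board state is linearly represented in the model's internals. The shapes, data, and training loop here are placeholders, not the original paper's setup.)

```python
# Linear-probe sketch: decode one board square's state from hidden activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_states = 512, 3          # each square is empty / mine / theirs
probe = nn.Linear(d_model, n_states)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_probe(activations: torch.Tensor, square_labels: torch.Tensor):
    # activations: (n_positions, d_model) hidden states collected while the model
    # reads move sequences; square_labels: (n_positions,) ground-truth state of
    # one chosen square at each point in the game.
    for _ in range(100):
        logits = probe(activations)
        loss = F.cross_entropy(logits, square_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return (logits.argmax(-1) == square_labels).float().mean()

# With random placeholder data the probe learns nothing interesting; the point
# is only the shape of the experiment.
acc = train_probe(torch.randn(1000, d_model), torch.randint(0, n_states, (1000,)))
```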
Okay, so I want to take a step back from the research.
and ask about your careers specifically, because, like the tweet that I introduced you with implied, you've been in this field a year and a half. I think you've only been in it like a year or something, right?
Something like that, yeah.
Yeah, but you know, in that time... I know the "solved alignment" take is overstated, and you won't say this yourself because you'd be embarrassed by it, but it's a pretty incredible thing: the thing that people in mechanistic interpretability think is the biggest step forward, and you've been working on it for a year. It's notable. So I'm curious how you explain what's happened. Why, in a year or a year and a half, have you guys made important contributions to your field?
It goes without saying: luck, obviously. I feel like I've been very lucky, and the timing of different progressions has been just really good in terms of advancing to the next level of growth. I feel like for the interpretability team specifically, I joined when we were five people. We've now grown quite a lot. But there were so many ideas floating around, and we just needed to really execute on them, have quick feedback loops, and do careful experimentation that led to signs of life and has now allowed us to really scale. And I feel like that's kind of been my biggest value-add to the team, which is not all engineering, but quite a lot of it has been.
Interesting. So you're saying you came at a point where a lot of science had been done and there were a lot of good research ideas floating around, but they needed someone to take that and maniacally execute on it?
Yeah, yeah. And this is why it's not all engineering, because it's running different experiments, having a hunch for why something might not be working, and then opening up the model, opening up the weights, and asking, what is it learning? Okay, well, let me try this instead. That sort of thing. But a lot of it has just been being able to do very careful, thorough, but quick investigation of different ideas or theories.
And why was that lacking in the existing team?
I don't know. I feel like I work quite a lot, and I feel like I'm quite agentic. If your question is about my career overall, I've been very privileged to have a really nice safety net that lets me take lots of risks, but I'm also just quite headstrong. In undergrad, Duke had this thing where you could just make your own major. And it was like, eh, I don't like this prerequisite or that prerequisite, and I want to take all four or five of these subjects at the same time, so I'm just going to make my own major. Or in the first year of grad school, I canceled rotations so I could work on the thing that became the paper we were talking about earlier. I didn't have an advisor, got admitted to do machine learning for protein design, and was just off in computational-neuroscience land with no business being there at all, but it worked out. So there's a headstrongness.
But it seemed like another theme that jumped out was the ability to step back, and you were talking about this earlier: the ability to step back from your sunk costs and go in a different direction. It's in a weird sense the opposite of headstrongness, but it's also a crucial step here. I know 21-year-olds or 19-year-olds who are like, this is not the thing I've specialized in, or, I didn't major in this. And I'm like, dude, motherfucker, you're 19, you can definitely do this. And you switching in the middle of grad school or something, that's just...
Yeah, sorry, I didn't mean to cut you off, but I think it's strong ideas loosely held, and being able to just pinball in different directions. And the headstrongness, I think, relates a little bit to the fast feedback loops, or the agency, insofar as I just don't get blocked very often. If I'm trying to write some code and something isn't working, even if it's in another part of the code base, I'll often just go in and fix that thing, or at least hack it together, to be able to get results. And I've seen other people where they're just like, help, I can't. And it's like, no, that's not a good enough excuse. Go all the way down.
I've definitely heard people in management-type positions talk about the lack of such people, where they will check in on somebody a month after they gave them a task, or a week after they gave them a task, and ask, how is it going? And they say, well, you know, we need to do this thing, which requires lawyers, because it involves this regulation. How's that going? Well, we need lawyers. Why didn't you get lawyers? Or something like that. So that's definitely a thing, yeah.
I think that's arguably the most important quality in almost anything: just pursuing it to the ends of the earth, and whatever you need to do to make it happen, you'll make it happen.
If you do everything, you'll win.
If you do everything, you'll win. Exactly.
Yeah, yeah, yeah.
I think from my side, definitely that quality has been important: agency in the work. There are thousands, or probably even tens of thousands, of engineers at Google who are all of basically equivalent software engineering ability, let's say. If you gave us a very well-defined task, we'd probably all do it equivalently well, and maybe a bunch of them would do it a lot better than me, in all likelihood. But one of the reasons that I've been impactful so far is that I've been very good at picking extremely high-leverage problems: problems that haven't been particularly well solved so far, perhaps as a result of frustrating structural factors, like the ones you pointed out in that scenario before, where it's, oh, we can't do X because of this, or which team would do it, and why? And then going, okay, well, I'm just going to vertically solve the entire thing.
Right.
And that turns out to be remarkably effective.
Also, I'm very comfortable with, if I think there is something correct that needs to happen, making that argument and continuing to make that argument at escalating levels of criticality until that thing gets solved. And I'm also quite pragmatic about what I do to solve things. You get a lot of people who come in with, as I said before, a particular background, or they know how to do one thing, and they won't go beyond it. One of the beautiful things about Google, right, is you can run around and get world experts in literally everything. You can sit down and talk to people who are optimization experts, chip design experts, experts in different forms of pre-training algorithms or RL or whatever. You can learn from all of them, and you can take those methods and apply them. And I think this was maybe the start of why I was initially impactful: this vertical agency, effectively.
And then a follow-up piece from that is, I think it's often surprising how few people are fully realized in all the things they want to do. They're blocked or limited in some way. And this is very common: in big organizations everywhere, people have all these blockers on what they're able to achieve. And being someone who, one, helps inspire people to work on particular directions, and works with them on doing things, massively scales your leverage. You get to work with all these wonderful people who teach you heaps of things, and generally, helping them push past organizational blockers means that together you get an enormous amount done. None of the impact that I've had has been me individually going off and solving a whole lot of stuff. It's been me maybe starting off a direction, then convincing other people that this is the right direction and bringing them along in this big tidal wave of effectiveness that goes and solves the problem.
We should talk about how you guys got hired, because I think that's a really interesting story. You were a McKinsey consultant, right? There's an interesting thing there where, first of all, I think people generally just don't understand how decisions are made about either admissions or evaluating who to hire. So just talk about how you were noticed and how you got hired.
So the TL;DR of it is, I studied robotics in undergrad. I always thought that AI would be one of the highest-leverage ways to impact the future in a positive way. The reason I am doing this is because I think it is one of our best shots at making a wonderful future, basically.
And I thought that by working at McKinsey I would get a really interesting insight into what people actually did for work. I actually wrote this as the first line in my cover letter to McKinsey: I want to work here so that I can learn what people do, so that I can understand it. And in many respects I did get that. I also got a whole lot of other things; many of the people there are wonderful friends. I actually learned, I think, a lot of this agentic behavior from my time there, where you go into organizations and you see how impactful just not taking no for an answer gets you. You would be surprised at the kind of stuff where, because no one quite cares enough in some organizations, things just don't happen, because no one's willing to take direct responsibility. Directly responsible individuals are ridiculously important. And people just don't care as much about timelines. So much of the value that an organization like McKinsey provides is hiring people who you are otherwise unable to hire, for a short window of time where they can just push through problems. I think people underappreciate this. And so at least some of my attitude of, hold up, I'm going to become the directly responsible individual for this because no one's taking appropriate responsibility, I'm going to care a hell of a lot about this, and I'm going to go to the ends of the earth to make sure it gets done, comes from that time.
But to your actual question of how I got hired: the entire time, I didn't get into the grad programs that I wanted to get into over here, which were specifically focused on robotics and RL research and that kind of stuff. And in the meantime, on nights and weekends, basically every night from 10 p.m. till 2 a.m., I would do my own research. And every weekend, for at least six to eight hours each day, I would do my own research and coding projects and this kind of stuff.
And that switched, in part, from quite robotics-specific work to, after reading Gwern's scaling hypothesis post, getting completely scaling-pilled and thinking, okay, clearly the way that you solve robotics is by scaling large multimodal models. And then, in an effort to scale large multimodal models, I got a grant from the TPU access program, TensorFlow Research Cloud. I was trying to work out how to scale that effectively, and James Bradbury, who at the time was at Google and is now at Anthropic, saw some of my questions online where I was trying to work out how to do this properly. He was like, I thought I knew all the people in the world who were asking these questions. Who on earth are you? And he looked at that, and he looked at some of the robotics stuff that I had been putting up on my blog and that kind of thing, and he reached out and said, hey, do you want to have a chat, do you want to explore working with us here?
And I was hired, as I understand it later, as an experiment in trying to take someone with extremely high enthusiasm and agency and pairing them with some of the best engineers that he knew.
And so another one of the reasons I could say I've been impactful is that I had this dedicated mentorship from utterly wonderful people, people like Reiner Pope, who has since left to go do his own chip company, Anselm Levskaya, James himself, and many others.
But those were the sort of formative two to three months at the beginning.
And they taught me a whole lot of like the principles and like heuristics that I apply,
like how to and how to like solve problems in the way that they have,
particularly in that like systems and algorithms overlap where like one more thing
that makes you like quite effective in ML research is really concretely understanding
the systems side of things.
And this is something I learned from them basically.
It's like a deep understanding of how systems influence algorithms and how algorithms
influence systems, because the systems constrain the design space, so the solution space,
which you have available to yourself in the algorithm side. And very few people are comfortable
fully bridging that gap. But a place like Google, you can just go and ask all the algorithms
experts and all the systems experts, everything they know, and they will happily teach you.
And if you go and sit down with them, they will like, they will teach you everything they know.
It's wonderful. And this has meant that I've been able to be very, very effective for both sides. For the pre-training crew, because I understand systems very well, I can intuit whether this will work well or this won't, and then flow that on through to the inference considerations of models and this kind of thing. And for the chip design teams, I'm one of the people they turn to to understand what chips they should be designing in three years, because I'm one of the people best able to understand and explain the kind of algorithms that we might want to run in three years. And obviously you can't make very good guesses about that, but I think I convey well the information accumulated from all of my compatriots on the pre-training crew and the general systems side, and convey that information to them, because even inference applies a constraint to pre-training.
Yeah.
There's a couple of things that stick out to me there. One is not just the agency of the person who was hired, but the parts of the system that were able to think: wait, that's really interesting, who is this guy? Not from a grad program or anything. Currently a McKinsey consultant, just an undergrad. But that's interesting, let's give this a shot, right? So James and whoever else, that's very notable. And second, I actually didn't know this part of the story, that it was part of an experiment run internally: can we do this, can we bootstrap somebody? And in fact, what's really interesting about that is the third thing you mentioned: having somebody who understands all layers of the stack and isn't stuck on any one approach or any one layer of abstraction is so important. And specifically, what you mentioned about being bootstrapped immediately by these people might have meant that, since you were getting up to speed on everything at the same time rather than spending grad school going deep on one specific flavor of RL, you could actually take the global view and weren't totally bought in on one thing. So not only is it something that's possible, it potentially has greater returns than just hiring somebody out of grad school, because this person can just, I don't know, it's like getting GPT-8 and fine-tuning it on one year of... you know what I mean. So yeah, you come at everything with fresh eyes, and you aren't locked into any particular field.
Now, one caveat to that is that before, like, during my self-experimentation and stuff,
I was reading everything I could.
I was, like, obsessively reading papers every night.
And, like, actually, funnily enough, I, like, read much less widely now that my day is
occupied by working on things.
And in some respects, I had this very broad perspective, whereas not that many people do. Even in a PhD program, you'll focus on a particular area. If you just read all the NLP work and all the computer vision work and all the robotics work, you see all these patterns start to emerge across subfields, in a way that I guess foreshadowed some of the work that I would later do.
That's super interesting. One of the reasons that you've been able to be agentic within Google is that you're pair programming half the days, or most of the days, with Sergey Brin, right? And so it's really interesting that there's this person who's willing to just push ahead on this LLM stuff and clear out the local blockers in its path.
I think it's important to caveat that it's not every day or anything that I'm pairing. But when there are particular projects that he's interested in, then we'll work together on those, and there have also been times when he's been focused on projects with other people. But in general, yes, there's a surprising alpha to being one of the people who actually goes into the office every day. That really shouldn't be, but is, surprisingly impactful. And as a result I've benefited a lot from basically being close friends with people in leadership who care, and being able to really argue convincingly about why we should do X as opposed to Y, and having that vector to try things. Google is a big organization; having those vectors helps a little bit. But it's also very important that this is the kind of thing you don't ever want to abuse, right? You want to make the arguments through all the right channels, and only sometimes do you need to, and so on and so forth.
I mean, it's notable. I don't know, I feel like Google is undervalued given that, I don't know, it's like Steve Jobs is working on the equivalent of the next product for Apple or something, right?
I mean, yeah, I've benefited immensely from that. Okay, so for example, during the Christmas break,
I was just going into the office a couple of days during that time.
Sounded like quite a lot of it.
Okay.
I got a lot of things.
Okay.
And I don't know if you guys have read that article about Jeff and Sanjay doing the pair programming, but they were there, pair programming on stuff. And I got to hear all these cool stories of early Google, where they were talking about crawling under the floorboards and rewiring data centers, and telling me how many bits they were pulling out of a given compiler instruction, and all these crazy little performance optimizations they were doing. They were having the time of their lives. And I got to sit there and really experience this sense of history in a way that you don't expect to; you expect to be very far away from all that, I think, in a large organization. But yeah, they're super cool.
And Trenton, does this map onto any of your experience?
I think Sholto's story is more exciting. Mine was just very serendipitous, in that I got into computational neuroscience without having much business being there. My first paper was mapping the cerebellum to the attention operation in transformers. My next ones were looking at sparsity.
It was my first year of grad school.
So 22.
Oh, yeah.
But yeah, my next work was on sparsity in networks, like inspired by sparsity in the brain,
which was when I met Tristan Hume
and Anthropic was doing the SoLU work, the softmax linear unit, which was very related in quite a few ways: the idea was, let's make the activations of neurons across a layer really sparse, and if we do that, then we can get some interpretability of what each neuron is doing. I think we've since updated away from that approach towards what we're doing now.
So that started the conversation.
I shared drafts of that paper with Tristan.
He was excited about it.
And that was basically what led me
to become Tristan's resident
and then convert to full time.
But during that period, I also moved to Berkeley as a visiting researcher and started working with Bruno Olshausen, both on what are called vector symbolic architectures, one of whose core operations is literally superposition, and on sparse coding, also known as dictionary learning, which is literally what we've been doing since.
And Bruno Olshausen basically invented sparse coding back in 1997.
And so my research agenda and the interpretability team's seemed to just be running in parallel, with the same research tastes.
And so, yeah, it made a lot of sense for me to work with the team.
And it's been a dream since.
One thing I've noticed is that when people tell stories about their careers or their successes, they ascribe way more of it to contingency, but when they hear about other people's stories, they're like, of course it wasn't contingent. You know what I mean? It's like, if that didn't happen, something else would have happened. I've just noticed that in how you both talk, and it's interesting that you both think your own path was especially contingent. Whereas, I don't know, maybe you're right, but it's this sort of interesting pattern.
Yeah, but I mean, I literally met Tristan at a conference. I didn't have a scheduled meeting or anything; I just joined a little group of people chatting, and he happened to be standing there.
And I happened to mention what I was working on.
And that led to more conversations.
And I think I probably would have applied to Anthropic at some point anyways,
but I would have waited at least another year.
It's still crazy to me that I can actually contribute to interpretability in a meaningful way.
I think there's a big important aspect of like shots on goal there, so to speak, right?
Where like you're even just going to, choosing to go to conferences itself is like putting yourself in a position where you're, where luck is more likely to happen.
Yeah.
And conversely, in my own situation, doing all of this work independently, and trying to produce and do interesting things, was my own way of trying to manufacture luck, so to speak, and to try to do something meaningful enough that it got noticed.
Given that you framed this in the context of them trying to run this experiment of, can someone...
So specifically James, and I think our manager Brennan, were trying to run this experiment.
It worked. Did they do it again?
Yeah. So my closest collaborator, Enrique,
he crossed from search to our team.
He's also been ridiculously impactful.
He's definitely a stronger engineer than I am.
And he didn't go to university.
What was notable about that, for example, is that James Bradbury is somebody... usually this kind of stuff is farmed out to recruiters or something like that, whereas James is somebody whose time is worth hundreds of millions of dollars to Google, you know what I mean? So this thing is very bottlenecked on that kind of person taking the time, almost in an aristocratic tutoring sense, to find someone and then get them up to speed. And it seems like if it worked this well, it should be done at scale. It should be the responsibility of key people to, you know what I mean, onboard...
I think that is true to a large extent. Like, I'm sure you probably benefit a lot from key researchers mentoring you deeply, and actively looking on open-source repositories or on forums or whatever for potential people like this.
Yeah. I mean, James is like, Twitter injects it into his brain.
That's right.
But yes, and I think this is something which in practice is done. People do look out for people they find interesting and try to find high signal. In fact, I was talking about this with Jeff the other day, and Jeff said, you know, one of the most important hires I ever made came off a cold email. And I was like, well, who was that? And he said, Chris Olah.
Ah, yeah.
Because Chris similarly had no formal background in ML, right? And Google Brain was just getting started and this kind of thing. But Jeff saw that signal. And the residency program, which Brain had, was, I think, also astonishingly effective at finding good people who didn't have strong ML backgrounds.
And yeah, one of the other things I want to emphasize, for the slice of the audience it would be relevant to: there's this sense that the world is legible and efficient, that companies have these jobs.google.com or jobs.whatever-company.com pages, and you apply, and there are the steps, and they will evaluate you efficiently on those steps. Whereas, not only from these stories, often that's not the way it happens. In fact, it's good for the world that that's often not how it happens. It is important to look at whether they were able to write an interesting technical blog post about their research, or make interesting contributions. Yeah, I want you to riff, for the people who are just assuming that the other end of the job board is super legible and mechanical, on how this is not how it works, and how in fact people are looking for a different kind of person, someone who's agentic and putting stuff out there.
And I think specifically what people are looking for there is two things.
One is agency and like putting yourself out there.
And the second is the ability to do world class something.
Yeah.
And two examples that I always like to point to here: Andy Jones from Anthropic did an amazing paper on scaling laws as applied to board games. It didn't require many resources, it demonstrated incredible engineering skill, and it demonstrated an incredible understanding of the most topical problem of the time. And he didn't come from a typical academic background or whatever. And as I understand it, basically as soon as he came out with that paper, both Anthropic and OpenAI were like, we would desperately like to hire you. There's also someone who now works on Anthropic's performance team, Simon Boehm, who has written what is in my mind the reference for optimizing a CUDA matmul on a GPU. And that demonstrated example of taking some prompt, effectively, and producing the world-class reference example for it, in something that hadn't been done particularly well so far, is, I think, an incredible demonstration of ability and agency that in my mind would be an immediate "please, we'd love to interview you slash hire you."
Yeah.
The only thing I can add here is, I mean, I still had to go through the whole hiring process and all
the standard interviews and this sort of thing.
Yeah, everyone does.
Yeah.
Doesn't that seem stupid?
I mean, it's important debiasing.
Yeah, yeah, yeah.
And the bias is what you want, right? Like, you want the bias of somebody who's got great taste and is like, who cares? Your interview process should be able to disambiguate that as well.
Yeah, like, I think there are cases where someone seems really great.
And then it's like, oh, they actually just can't code this sort of thing, right?
Like, how much you weight these things definitely matters, though.
And, like, I think the, we take references really seriously.
The interviews, you can only get so much signal from.
And so it's all these other things that can come into play for whether or not a hire makes sense.
But you should design your interviews such that, like, they test the right things.
One man's bias is another man's taste, you know?
Yeah, I guess the only thing I would add to this, or maybe to the headstrong context, is there's this line: the system is not your friend.
Right.
And it's not necessarily to say it's actively against you or it's your sworn enemy.
it's just not looking out for you.
And so I think that's where a lot of the proactiveness comes in
of like there are no adults in the room and like you have to
come to some decision for what you want your life to look like
and execute on it.
And yeah, hopefully you can then update later
if you're too headstrong in the wrong way.
But I think you almost have to just kind of charge at certain things
to get much of anything done,
not be swept up in the tide of whatever the expectations are.
there's like one final thing
I want to add
which is like we talked a lot
about agency and this kind of stuff
but I think actually
like surprisingly enough
one of the most important things
is just caring
an unbelievable amount
and when you care an unbelievable amount
you check all the details, and you have this understanding of what could have gone wrong, and it just matters more than you think,
because people end up not caring
or not caring enough
There's this LeBron quote where he talks about how, before he got into the league, he was worried that everyone would be incredibly good. And then he gets there and he realizes that actually, once people hit financial stability, they relax a bit. And he's like, oh, this is going to be easy. I don't think that's quite true in AI research, because most people actually care quite deeply. But there's caring about your problem, and then there's caring about the entire stack and everything that goes up and down, explicitly going and fixing things that aren't your responsibility to fix, because overall it makes the stack better.
I mean, another part that I forgot to mention is, you were mentioning going in on weekends and on Christmas break, and the only people in the office are Jeff Dean and Sergey Brin or something, and you just get to pair program with them. It's just interesting to me. I don't want to pick on your company in particular, but people at any big company have gotten there because they've gone through a very selective process: they had to compete in high school, they had to compete in college. But it almost seems like they get there and then they take it easy, when in fact this is the time to put the pedal to the metal, go in and pair program with Sergey Brin on the weekends or whatever, you know what I mean?
I mean, there's pros and cons there, right?
I think many people make the decision
that the thing that they want to prioritize
is like a wonderful life with their family
and if they do wonderful work
like let's say they don't work
every hour of the day, right?
But they do wonderful work in the work
like the hours that they do do.
That's incredibly impactful.
I think this is true for many people at Google
is like maybe they don't work as many hours
as your typical startup mythology says, right?
But the work that they do do is incredibly valuable.
It's very high leverage because they know the systems
and they're experts in their field.
And we also need people like that.
Like our world rests on these huge, like difficult to manage
and difficult to fix systems.
And we need people who are like willing to work on
and help and fix and maintain those.
In, frankly, a thankless way that isn't as high-publicity as all of this AI work that we're doing, right? And I am ridiculously grateful that those people do that. I'm also happy that there are people who find technical fulfillment in their job and doing it well, and who maybe also draw a lot more from spending many of their hours with their family. And I'm lucky that I'm at a stage in my life where, yeah, I can go in and work every hour of the week, and I'm not making as many sacrifices to do that.
Yeah. I mean, just one example in my own mind of this sort of thing, where the other side says no and you can still get the yes on the other end: basically every single high-profile guest I've gotten so far, I think maybe with one or two exceptions, I've sat down for a week and I've just come up with a list of sample questions, you know, tried to really come up with really smart questions to send to them. And through the entire process, I've always thought: if I just cold email them, there's like a 2% chance they say yes; if I include this list, there's a 10% chance. Because otherwise, you know, you go through their inbox and every 34 seconds there's an interview request for whatever podcast. And every single time I've done this, they've said yes, right?
Yeah, you just ask great questions. But if you do everything, you'll win. You literally have to dig in the same hole for ten minutes, or in that case make a list of sample questions for them, to get past their "not an idiot" list, you know what I mean, and demonstrate how much you care.
Yeah, and the work you're willing to put in.
Yeah. Something that a friend said to me a while back, and I think it's stuck, is that it's amazing how quickly you can become world class at something, just because most people aren't trying that hard, and are only working, I don't know, the actual 20 hours that they're really spending on the thing or something. And so, yeah, if you just go ham, then you can get really far pretty fast. And I think I'm lucky I had that experience with fencing as well. I had the experience of becoming world class in something, and knowing that if you just worked really, really hard...
Yeah. For our context, by the way, Sholto was one seat away; he was the next person in line to go to the Olympics for fencing.
I was at best 42nd in the world for men's foil fencing.
Mutational load is a thing, man.
And there was one cycle where, yeah, I was the next-highest-ranked person in Asia. And if one of the teams had been disqualified for doping during that cycle, as was happening at the time, and as happened for, I think, the Australian women's rowing team, which went because another team was disqualified, then I would have been the next in line.
It's interesting when you just find out about people's prior lives and it's like, oh, this guy was almost an Olympian, this other guy was whatever, you know what I mean?
Okay, let's talk about interpretability.
Yeah.
I actually want to stay on the brain stuff as a way to get into it for a second.
We were previously discussing: is the brain organized in a way where you have a residual stream that is gradually refined with higher-level associations over time, or something? There's a fixed dimension size in a model. If you had to... I don't even know how to ask this question in a sensible way, but what is the d_model of the brain? What is its embedding size? Or, because of feature splitting, is that not a sensible question?
No, I think it's a sensible question.
Well, it is a question that makes sense.
You could have just not said that.
You can talk just like actively.
I'm trying to.
I don't know how you would begin to say, okay, well, this part of the brain is a vector of this dimensionality. I mean, maybe for the visual stream, because it's V1 to V2 to IT, whatever, you could just count the number of neurons that are there and say that is the dimensionality. But it seems more likely that there are kind of submodules and things are divided up. So, yeah, I don't have a good answer, and I'm not the world's greatest neuroscientist, right? I did it for a few years, and I studied the cerebellum quite a bit. So I'm sure there are people who could give you a better answer on this.
Do you think that the way to think about it, whether it's in the brain or whether it's in these models, is that fundamentally what's happening is that features are added, removed, changed, and the feature is the fundamental unit of what is happening in the model? What would have to be true for... and this goes back to the earlier thing we were talking about, whether it's just associations all the way down. Give me a counterfactual: in the world where this is not true, what is happening instead? What is the alternative hypothesis here?
Yeah, it's hard for me to think about, because at this point I just think so much in terms of this feature space. I mean, at one point there was the kind of behaviorist approach towards cognition, where you're just input-output, but you're not really doing any processing. Or it's like everything is embodied and you're just a dynamical system that's operating along some predictable equations, but there's no state in the system, I guess. But whenever I've read these sorts of critiques, it's like, well, you're just choosing not to call this thing a state, but you could call any internal component of the model a state. Even with the feature discussion, defining what a feature is is really hard. And so the question feels almost too slippery.
What is a feature?
A direction in activation space. A latent variable that is operating behind the scenes and has causal influence over the system you're observing. It's a feature if you call it a feature; it's tautological. I mean, these are all explanations that I like.
I feel some association, in a very rough, intuitive sense: in a sufficiently sparse and binary vector, a feature is whether or not something is turned on or off, right? In a very simplistic sense, which might be a useful metaphor to understand it by. When we talk about features activating, it is in many respects the same way that neuroscientists would talk about a neuron activating, right? If that neuron corresponds to...
To something in particular.
Right.
Yeah, yeah, yeah, yeah.
And no, I think that's useful as a way of asking what we want a feature to be, right? Or, what is a synthetic problem under which a feature exists? But even with the Towards Monosemanticity work, we talk about what's called feature splitting, which is basically that you will find as many features as you give the model the capacity to learn. And by model here, I mean the up-projection that we fit after we've trained the original model. And so if you don't give it much capacity, it'll learn a feature for birds. But if you give it more capacity, then it will learn ravens and eagles and sparrows and specific types of birds.
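To make the "features are directions" framing concrete, here is a minimal numpy sketch. Everything in it is made up for illustration (the dimensions, the vectors, the bird/raven/eagle names): it just shows that a feature's activation is a projection of the model's activation onto a direction, and that feature splitting looks like several fine-grained directions sitting close to one coarse direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width

# Hypothetical coarse "bird" feature direction (unit vector).
bird = rng.normal(size=d_model)
bird /= np.linalg.norm(bird)

def split_feature(coarse, scale=0.3):
    """A finer feature: the coarse direction plus a small specific offset."""
    offset = rng.normal(size=coarse.shape)
    offset /= np.linalg.norm(offset)
    v = coarse + scale * offset
    return v / np.linalg.norm(v)

raven, eagle = split_feature(bird), split_feature(bird)

# A model activation that "contains" the raven feature, plus a little noise.
activation = 2.5 * raven + 0.1 * rng.normal(size=d_model)

def feature_activation(act, direction):
    """Projection of the activation onto a feature direction."""
    return float(act @ direction)

print(feature_activation(activation, bird))   # the coarse bird feature fires
print(feature_activation(activation, raven))  # the specific raven feature fires slightly harder
print(float(raven @ eagle))                   # split features stay close: high cosine similarity
```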
Still on the definitions thing, I guess naively I think of things like a bird feature, versus "what kind of token is this," like a period at the end of a hyperlink, as you were talking about earlier, versus, at the highest level, things like love or deception or holding a very complicated proof in your head or something. Is this all features? Because then the definition seems so broad as to almost not be that useful. Or rather, there seem to be some important differences between these things, and yet they're all features?
I'm not sure what we even mean by... I mean, all of those things are discrete units that have connections to other things, which then imbues them with meaning. That feels like a specific enough definition that it's useful and not too all-encompassing. But feel free to push back.
Well, what could you discover tomorrow that would make you think, oh, this is kind of fundamentally the wrong way to think about what's happening in a model?
I mean, if the features we were finding weren't predictive, or if they were just representations of the data, right, where it's like, oh, all you're doing is just clustering your data and there are no higher-level associations being made. Or it's some phenomenological thing where you're saying that this feature fires for marriage, but if you activate it really strongly, it doesn't change the outputs of the model in a way that would correspond to that.
I think these would both be good critiques. I guess one more is, and we tried to do experiments on MNIST, which is a dataset of digit images, and we didn't look super hard into it, so I'd be interested if other people wanted to take up a deeper investigation, but it's plausible that your latent space of representations is dense, and it's a manifold instead of being these discrete points. And so you could move across the manifold, and at every point there would be some meaningful behavior. And it's much harder then to label things as discrete features.
In a naive sort of outsider way, the thing that would seem to me to be a way in which this picture could be wrong is if it's not some "this thing is turned on or turned off," but a much more global kind of... the system is a... I don't know, I'm phrasing this in a pretty clumsy way, but is there a good analogy here? Yeah, I guess if you think of something like the laws of physics, it's not like the feature for wetness is turned on, but only turned on this much, and then the feature for... you know, I guess maybe it is like that, because mass is a gradient and polarity or whatever is a gradient as well. But there's also a sense in which there are the laws, and the laws are more general, and you have to understand the general bigger picture. You don't get that from just these specific sub-circuits.
But that's where the reasoning circuit itself comes into play, right, where you're ideally taking these features and trying to compose them into something higher level. Like, you might say, okay, and at least this is my head canon: let's say I'm trying to use, you know, F equals ma. Then presumably at some point I have features which denote mass, and that's helping me retrieve the actual mass of the thing that I'm using, and then the acceleration and that kind of stuff. But then maybe there's also a higher-level feature that does correspond to using that law of physics. Maybe. The more important part, though, is the composition of components, which helps me retrieve a relevant piece of information and then produce, maybe, some multiplication operator or something like that when necessary. At least, that's my head canon.
What is a compelling explanation to you, especially for very smart models, of "I understand why it made this output, and it was for a legit reason," if it's doing a million-line pull request or something? What are you seeing at the end of that request where you're like, yep, that's chill?
Yeah. So ideally, you apply dictionary learning to the model and you've found features. Right now, we're actively trying to get the same success for attention heads, in which case we'd have features for... you can do it for the residual stream, the MLPs, and the attention throughout the whole model. Hopefully at that point you can also identify broader circuits through the model, more general reasoning abilities, that will activate or not activate. But in your case, where we're trying to figure out whether this pull request should be approved or not, I think you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things, and see whether or not those have fired. That would be an immediate... you can do more than that, but that would be an immediate check.
But before I chase that down, what does the reasoning circuit look like? What would it look like when you found it?
Yeah, so I mean, the induction head is probably one of the simplest cases of that.
That's not reasoning, though, right?
Well, what do you call reasoning, right? It's a good question. So, as context for listeners, the induction head is basically: you see a line like "Mr. and Mrs. Dursley did something. Mr. blank," and you're trying to predict what blank is. And the head has learned to look for previous occurrences of the word "Mr.", look at the word that comes after it, and then copy and paste that as the prediction for what should come next. Which is a super reasonable thing to do, and there is computation being done there to accurately predict the next token.
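As a toy illustration of that copy rule, stripped of the actual attention mechanics (this is just the behavior described above, not how the head computes it inside a transformer):

```python
def induction_head_predict(tokens):
    """Toy induction rule: if the current token appeared earlier, predict the
    token that followed its most recent earlier occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]              # copy whatever came next last time
    return None                               # nothing to copy

context = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_head_predict(context))  # -> "Dursley"
```

In a real transformer this takes at least two attention layers working together (a previous-token head plus the induction head proper), which connects to the two-layer point made a little further on.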
Mm-hmm. But yeah, that is context-dependent. It's not really reasoning, though, you know what I mean? But, going back to the "associations all the way down" thing: is it that if you chain together a bunch of these reasoning circuits, or heads that have different rules for how to relate information... In this sort of zero-shot case, something is happening where, when you pick up a new game, you immediately start understanding how to play it, and that doesn't seem like an induction-heads kind of thing. Or, I would be surprised. There would be another circuit for extracting pixels and turning them into latent representations of the different objects in the game, right?
And what would that, because the induction heads is like one layer transformers.
Either two layers.
Yeah, yeah.
So you can like kind of see like what the thing that is a human picks up a new game
and understands it.
How, like how do you, how would you think about what that is?
Is it presumably it's across multiple layers, but like, is it, is it, you know,
Yeah, yeah. What would that physically look like?
How big would it be, maybe? I mean, that would just be an empirical question, right? Of how big the model needs to be to perform the task. But maybe it's useful if I just talk about some other circuits that we've seen. So we've seen the IOI circuit, which does indirect object identification. This is like: if you see "Mary and Jim went to the store. Jim gave the object to blank," it would predict Mary, because Mary appeared before as the indirect object, or it will infer pronouns, right? And this circuit even has behavior where, if you ablate it, other heads in the model will pick up that behavior. We'll even find heads that want to do copying behavior and then other heads that will suppress it. So it's one head's job to just always copy the token that came before, for example, or the token that came five before or whatever, and then it's another head's job to say, no, do not copy that thing. So there are lots of different circuits performing, in these cases, pretty basic operations, but when they're chained together you can get unique behaviors.
But is the story of how you'd find it, with the reasoning thing... because you won't be able to understand it, or it'll just be really convoluted, you know, it won't be something you can see in a two-layer transformer. So will you just say, the circuit for deception or whatever is just this: this part of the network fired when, at the end, we identified the output as being deceptive, and it didn't fire when we didn't identify it as deceptive, therefore this must be the deception circuit?
I think a lot of analysis is like that.
Like, Anthropic has done quite a bit of research before on sycophancy, which is the model saying what it thinks you want to hear, and that requires us at the end to be able to label which response is bad and which one is good. So we have tons of instances, and actually, as you make models larger, they do more of this, where the model clearly has features that model another person's mind, and these activate. And some subset of these, we're hypothesizing here, would be associated with more deceptive behavior.
Although it's doing that because, I don't know, ChatGPT, I think, is probably modeling me, because RLHF induces a theory of mind.
Yeah.
So, well, first of all, there's the thing you mentioned earlier about redundancy, so it's like, well, have you caught the whole thing that could cause deception, or just one instance of it? Second of all, are your labels correct? You know, maybe you thought this wasn't deceptive and it's still deceptive, especially if it's producing output you can't understand. Third, is the thing that's going to produce the bad outcome even human-understandable? Deception is a concept we can understand, but maybe there's, like...
Yeah, yeah, a lot to unpack here. So
I guess a few things. One, it's fantastic that these models are deterministic. When you sample from them, it's stochastic, right? But like, I can just keep putting in more inputs and ablate every single part of the model. This is kind of the pitch for computational neuroscientists to come and work on interpretability. It's like you have this alien brain and you have access to everything in it and you can just ablate however much of it you want. And so I think if you do this carefully enough, you really can start to pin down. What are the circuits involved? What are the backup circuits? These sorts of things.
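As a sketch of what "ablate every single part of the model" can look like mechanically, here's a toy PyTorch example. The model is a stand-in (a couple of dummy blocks rather than a real transformer head), and zero-ablation is the crudest variant; real work often patches in mean activations instead. The point is just: hook a component, knock it out, rerun the same deterministic forward pass, and compare outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyBlock(nn.Module):
    """Stand-in for one transformer sub-layer; a real study would hook a specific attention head."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        return x + torch.relu(self.proj(x))  # residual plus nonlinearity, purely illustrative

d_model = 16
model = nn.Sequential(TinyBlock(d_model), TinyBlock(d_model), nn.Linear(d_model, 10))
x = torch.randn(1, d_model)

def zero_ablation_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return torch.zeros_like(output)

with torch.no_grad():
    baseline = model(x)
    handle = model[1].register_forward_hook(zero_ablation_hook)  # knock out the second block
    ablated = model(x)
    handle.remove()

# Because the forward pass is deterministic, any difference is attributable to the ablated part.
print((baseline - ablated).abs().max())
```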
The kind of cop-out answer here, but one that's important to keep in mind, is doing automated interpretability: as our models continue to get more capable, having them assign labels or run some of these experiments at scale. And then, with respect to how you detect superhuman performance, which I think was the last part of your question: aside from the cop-out answer, if we buy this "associations all the way down" view, you should be able to coarse-grain the representations at a certain level such that they then make sense. I think it was even in Demis's podcast that he talked about how, if a chess player makes a superhuman move, they should be able to distill it into reasons why they did it. And even if the model's not going to tell you what it is, you should be able to decompose that complex behavior into simpler circuits or features, to really start to make sense of why it did
the thing that it did.
There's a separate question of whether such a representation even exists, which it seems like it must, or actually, I'm not sure that's the case. And secondly, whether, using this sparse autoencoder setup, you could find it. And in this case, if you don't have labels for it that are adequate to represent it, you wouldn't find it, right?
Yes and no.
So like we are actively trying to use dictionary learning now on the sleeper agent's work,
which we talked about earlier.
And it's like, if I just give you a model, can you tell me if there's this trigger
and it's going to start doing interesting behavior?
And it's an open question whether or not when it learns that behavior,
it's part of a more general circuit that we can pick up on without actually getting
activations for and having it display that behavior, right?
Because that would kind of be cheating then.
Or it could be learning some hacky trick instead, where that's a separate circuit that you'll
only pick up on if you actually have it do that behavior.
But even in that case, the geometry of features gets really interesting because, like,
fundamentally each feature like is in some part of your representation space and they all exist
with respect to each other. And so in order to have this new behavior, you need to carve out
some subset of the feature space for the new behavior and then push everything else out of
the way to make space for it. So hypothetically, you can imagine you like have your model
before you've taught it this bad behavior. You know all the features or like have some course
grain representation of them. You then fine tune it such that it becomes malicious. And
And you can kind of identify this like black hole region of feature space where like everything else has been shifted away from it.
And there's like this region and like you haven't put in an input that like causes it to fire.
But then you can start searching for what is the input that would cause this part of the space to fire?
What happens if I activate something in this space?
There are like a whole bunch of other ways that you can try and attack that problem.
This is sort of a tangent, but one interesting idea I heard was: if that space is shared between models, you can imagine trying to find it in an open-source model in order to then... Gemma, by the way, is Google's newly released open-source model, and they said in the paper it's trained using the same architecture, or something like that. To be honest, I don't know exactly, because I haven't read the Gemma paper, but I think it's a similar method, or something, to Gemini. So to the extent that's true, I don't know how much the red-teaming you do on Gemma potentially helps you jailbreak Gemini.
Yeah, this gets into the fun space of
how universal features are across models. Our Towards Monosemanticity paper looked at this a bit, and we find... I can't give you summary statistics, but the base64 feature, for example, which we see across a ton of models (there are actually three of them, but they'll fire for base64-encoded text, which is prevalent in every URL, and there are lots of URLs in the training data), has really high cosine similarity across models. So they all learn this feature.
And, I mean, within a rotation, right?
But it's like, yeah, the actual vectors themselves.
Yeah, yeah.
I wasn't part of this analysis, but yeah, it definitely finds the feature, and they're pretty similar to each other across two separate models, the same model architecture but trained with different random seeds.
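Mechanically, the cross-seed comparison is simple to sketch: take the learned feature dictionaries from two runs, normalize them, and look at cosine similarities. Here's a hedged numpy sketch with random stand-in dictionaries; in the real analysis you'd load the actual learned decoders, and you'd have to handle the fact that features are only matched up to permutation and, as noted above, rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 128, 512

# Stand-ins for two feature dictionaries learned on models trained with different seeds.
dict_a = rng.normal(size=(n_features, d_model))
dict_b = rng.normal(size=(n_features, d_model))

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

dict_a, dict_b = normalize(dict_a), normalize(dict_b)

# Cosine similarity between every feature in A and every feature in B.
sims = dict_a @ dict_b.T

# For a feature of interest in model A (index 42 is a made-up stand-in for "base64"),
# find its best match in model B and how aligned the two directions are.
idx = 42
best_match = int(np.argmax(sims[idx]))
print(best_match, float(sims[idx, best_match]))
```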
It supports the quantum theory of neural scaling hypothesis, right? Which is that all models trained on a similar dataset will learn the same features in roughly the same order: you learn your n-grams, you learn your induction heads, you learn to put full stops after numbered lines, and this kind of stuff.
But by the way, okay, so this is another tangent. To the extent that that's true, and I guess there's evidence it's true, why doesn't curriculum learning work? Because if it is the case that you learn certain things first, shouldn't directly training on those things first lead to better results?
Both Gemini papers mention some aspect of curriculum learning.
Okay, interesting. I mean, the fact that fine-tuning works is evidence for curriculum learning, right? Because the last things you train on have a disproportionate impact.
I wouldn't necessarily say that. There's one mode of thinking in which fine-tuning is specialization: you've got this latent bundle of capabilities, and you're specializing it for the particular use case that you want. I'm not sure how true that is.
I think the paper from David Bau's lab kind of supports this, right? Like, you have that ability and you're just getting better at entity recognition.
Right.
Like, fine-tuning that circuit instead of other ones.
Yeah.
Yeah.
Sorry, what was the thing we were talking about before? But generally, I do think curriculum learning is really interesting and people should explore it more. It seems very plausible. I would really love to see more analysis along the lines of the quantum theory stuff, understanding better what you actually learn at each stage, decomposing that out, and exploring whether or not curricula change that. But, by the way, I just realized we got into conversation mode and forgot there's an audience. Curriculum learning is when you organize the dataset. When you think about how a human learns, they don't just see random wiki text and try to predict it, right? You start off with, like, the Lorax or something, and then, I don't even remember what first grade was like, but you learn the things that first graders learn, and then second graders, and so forth. And so you'd imagine that...
Sorry, we know you never got past first grade.
Okay.
Okay.
Yeah.
You're kidding.
Yeah.
Okay.
Anyways.
Let's get back to the big picture before we get into a bunch of interp details. There are two threads I want to explore. First, I guess it makes me a little worried that there's not even an alternative formulation of what could be happening in these models that could invalidate this approach. Which feels like... I mean, we do know that we don't understand intelligence, right? There are definitely unknown unknowns here. So the fact that there's not a null hypothesis, I don't know, I feel like, what if we're just wrong and we don't even know the way in which we're wrong? Which actually increases the uncertainty.
Yeah.
Yeah, so it's not that there aren't other hypotheses. It's just that I have been working on superposition for a number of years and have been very involved in this effort, and so I'm less sympathetic to...
Or, you'll just say they're wrong.
...to these other approaches, especially because our recent work has been so successful.
Yeah, it has quite high explanatory power. There's this beautiful thing in the original scaling laws paper: there's a little bump, and that apparently corresponds to when the model learns induction heads. The loss goes off trend, the model learns induction heads, and it gets back on trend. Which is an incredible piece of retroactive explanatory power.
Yeah. Before I forget, though, I do have one thread on feature universality that you might want to include. So there are some really interesting behavioral evolutionary biology experiments on whether humans should learn a realistic representation of the world or not. You can imagine a world in which we saw all venomous animals as flashing neon pink, a world in which we'd survive better, and so it would make sense for us not to have a realistic representation of the world. And there's some work where they'll simulate little basic agents and see if the representations they learn map to the tools they can use and the inputs they should have. And it turns out that if you have these little agents perform more than a certain number of tasks, given these basic tools and objects in the world, then they will learn a ground-truth representation, because there are so many possible use cases for these base objects that you actually want to learn what the object actually is, and not some cheap visual heuristic or other proxy. And so, to the extent that we are doing that, and we haven't talked at all about Friston's free energy principle or predictive coding or anything else, but to the extent that all living organisms are trying to actively predict what comes next and form a really accurate world model, it wouldn't surprise me, or I'm optimistic, that we are learning genuine features about the world that are good for modeling it, and that our language models will do the same, especially because we're training them on human data and human text.
Another dinner party question. Should we be less worried about misalignment, and maybe that's not even the right word for what I'm referring to, but just alienness and shoggoth-ness from these models, given that there is feature universality, and there are certain ways of thinking and ways of understanding the world that are instrumentally useful to different kinds of intelligences? So we'd just be less worried about bizarre paperclip maximizers as a result.
I think that's... this is kind of why I bring this up, it's the optimistic take.
Yeah.
Predicting the internet is very different from what we're doing, though, right? The models are way better at predicting next tokens than we are. They're trained on so much garbage, they're trained on so many URLs. In the dictionary learning work, we find there are three separate features for base64 encodings. And even that is kind of an alien example that's probably worth talking about for a minute. One of these base64 features fired for numbers: if it sees base64-encoded numbers, it'll predict more of those. Another fired for letters. But then there was this third one that we didn't understand, and it fired for a very specific subset of base64 strings. And someone on the team, who clearly knows way too much about base64, realized that this was the subset that was ASCII-decodable, so you could decode it back into ASCII characters. And the fact that the model learned these three different features, and that it took us a little while to figure out what was going on, is very
It has a denser representation
of regions that are particularly relevant
to predicting the next token.
Yeah, because it's so, but yeah,
and it's clearly doing something
that humans wouldn't, right?
Like, you can even talk to any of the current models
in base 64, and it will apply in base 64.
Right.
And you can then like decode it
and it works great.
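For listeners who want to see what "ASCII-decodable base64" even means, here's a quick standard-library check. It's only a rough illustration of the kind of distinction those three features seemed to track, not the exact firing conditions of the features themselves.

```python
import base64

def classify_base64(s: str) -> str:
    """Rough illustration of the three categories discussed above."""
    raw = base64.b64decode(s)
    try:
        decoded = raw.decode("ascii")
    except UnicodeDecodeError:
        return "valid base64, but not ASCII-decodable"
    if decoded.isdigit():
        return f"base64 of digits: {decoded!r}"
    return f"ASCII-decodable base64: {decoded!r}"

print(classify_base64(base64.b64encode(b"12345").decode()))             # digits
print(classify_base64(base64.b64encode(b"hello world").decode()))       # ASCII text
print(classify_base64(base64.b64encode(bytes([0, 255, 7])).decode()))   # arbitrary bytes
```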
That particular example... I wonder if it implies that doing interpretability on smarter models will be harder. Because if it requires somebody with esoteric knowledge who just happened to notice that base64 has, I don't know, whatever that distinction was, doesn't that imply that when you have the million-line pull request, there's no human who's going to be able to decode the two different reasons behind it? There are, like, two different features for this pull request. You know what I mean?
Yeah. So if you think...
And that's when you type a comment like, "small CLs, please."
Yeah, exactly. No, no, I mean,
you could do that, right? What I was going to say is that one technique here is anomaly detection. One beauty of dictionary learning, as opposed to linear probes, is that it's unsupervised: you are just trying to learn to span all of the representations that the model has, and then interpret them later. But if there's a weird feature that suddenly fires for the first time, one you haven't seen fire before, that's a red flag. You could also coarse-grain it so that it's just a single base64 feature. I mean, even the fact that this came up, and we could see that it specifically favors these particular outputs and fires for these particular inputs, gets you a lot of the way there. I'm even familiar with cases from the auto-interp side where a human will look at a feature and try to annotate it: it fires for Latin words. And then when you ask the model to classify it, it says it fires for Latin words describing plants. So it can already beat the human, in some cases, at labeling what's going on.
So at scale, this would require an adversarial setup between models, where you have, say, millions of features for GPT-6, and a bunch of models are just trying to figure out what each of these features means?
Yeah, but you can even automate this process, right? I mean, this goes back to the determinism of the model. You could have a model that is actively editing the input text and predicting whether the feature is going to fire or not, figuring out what makes it fire and what doesn't, and searching the space.
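A minimal sketch of that anomaly-detection idea, with a made-up toy dictionary (tiny dimensions, hand-picked numbers) rather than a real learned one: track which features have ever fired on trusted traffic, then flag any input that lights up a feature outside that set.

```python
import numpy as np

# Toy dictionary: 4 feature directions in a 3-d activation space (made-up numbers).
decoder = np.array([
    [1.0, 0.0, 0.0],   # feature 0
    [0.0, 1.0, 0.0],   # feature 1
    [0.0, 0.0, 1.0],   # feature 2
    [0.6, 0.8, 0.0],   # feature 3
])
THRESHOLD = 0.5

def active_features(activation):
    """Indices of features firing above threshold (toy encoder: plain projection)."""
    return set(np.flatnonzero(decoder @ activation > THRESHOLD))

# Features we've ever seen fire while monitoring trusted traffic.
seen = set()
for act in [np.array([1.0, 0.0, 0.0]), np.array([0.7, 0.7, 0.0])]:
    seen |= active_features(act)

def is_anomalous(activation):
    """Red flag: some feature fires that never fired during trusted monitoring."""
    return bool(active_features(activation) - seen)

print(is_anomalous(np.array([0.9, 0.1, 0.0])))  # False: only familiar features fire
print(is_anomalous(np.array([0.0, 0.0, 1.0])))  # True: feature 2 fires for the first time
```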
Yeah. I want to talk more about feature splitting, because I think that's an interesting thing that has been underexplored.
Especially for scalability. I think it's underappreciated right now.
First of all, how do we even think about it? Is it really the case that you can just keep going down and down, that there's no end to the number of features?
I mean, at some point I think you might just start fitting noise, or things that are part of the data but that the model isn't actually representing.
So it's the part before where, like, the model will learn however many features it has capacity for that still span the space of representation.
So, like, give an example potentially.
Yeah, yeah.
So you learn, if you don't give the model that much capacity for the features it's learning, concretely, if you project to not as high a dimensional space, we'll learn one feature for birds.
But if you give the model more capacity, it will learn features for all the different types of birds.
And so it's more specific than otherwise.
And oftentimes, like, there's the bird vector that points in one direction
and all the other specific types of birds point in, like, a similar region of the space,
but are obviously more specific than the course label.
Okay, so let's go back to GPT-7. First of all, is this sort of a linear tax on any model? Actually, even before that, is this a one-time thing you have to do, or is it the kind of thing you have to do on every output? Or is it just, one time, it's not deceptive, we're good to let it roll?
Yeah, let me answer that.
Yeah, so you do dictionary learning after you've trained your model: you feed it a ton of inputs, you get the activations from those, and then you do this projection into the higher-dimensional space. And the method is unsupervised in that it's trying to learn these sparse features; you're not telling it in advance what they should be. But it is constrained by the inputs you're giving the model. I guess two caveats here. One, we can try to choose what inputs we want. So if we're looking for theory-of-mind features that might lead to deception, we can put in the sycophancy dataset. And hopefully, at some point, we can move to looking at the weights of the model alone, or at least using that information to do dictionary learning. But in order to get there, that's such a hard problem that you need to make traction on just learning what the features are first.
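For concreteness, here's a minimal sparse-autoencoder sketch of that dictionary-learning step: collect activations from the trained model, up-project into a wider feature space with a sparsity penalty, and read feature directions off the decoder. The shapes, expansion factor, and L1 coefficient are made-up placeholders, and the random tensor stands in for real model activations; this is the general recipe rather than Anthropic's exact setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, expansion, l1_coeff = 512, 8, 1e-3
n_features = d_model * expansion  # the up-projection: more features than neurons

encoder = nn.Linear(d_model, n_features)
decoder = nn.Linear(n_features, d_model, bias=False)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

# Stand-in for activations collected from the trained model on a ton of inputs.
activations = torch.randn(4096, d_model)

for step in range(100):
    batch = activations[torch.randint(0, len(activations), (256,))]
    features = torch.relu(encoder(batch))   # sparse, non-negative feature activations
    recon = decoder(features)               # reconstruct the original activation
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row here is one learned feature direction in activation space.
feature_directions = decoder.weight.detach().T  # shape: (n_features, d_model)
```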
But yeah, so what's the cost of this? Actually, can you repeat that last sentence? The weights of the model alone?
So, like, right now we just have these neurons in the model.
They don't make any sense.
Yeah.
We apply dictionary learning.
We get these features out.
They start to make sense.
But that depends on the activations of the neurons.
The weights of the model itself, what neurons are connected to what other neurons, certainly have information in them. And the dream is that we can kind of bootstrap towards actually making sense of the weights of the model independently of the activations on the data. I'm not saying we've made any progress here, it's a very hard problem, but it feels like we'll have a lot more traction, and be able to sanity-check what we're finding in the weights, if we're able to pull out features first.
For the audience: weights are permanent, well, I don't know if permanent is the right word, but they are the model itself, whereas activations are the artifacts of any single call.
In a brain metaphor, the weights are the actual connection scheme between neurons, and the activations are which neurons are currently lighting up.
Yeah.
Yeah.
Yeah.
Okay.
So there are going to be two steps to this for GPT-7 or whatever model we're concerned about. One, and correct me if I'm wrong, training the sparse autoencoder: doing the unsupervised projection into a wider space of features that have a higher fidelity to what is actually happening in the model. And then, secondly, labeling those features. Let's say the cost of training the model is N. What will those two steps cost relative to N?
We will see. It really depends on two main things: what your expansion factor is, that is, how much you're projecting into the higher-dimensional space, and how much data you need to put into the model, how many activations you need to give it.
But this brings me back to the feature splitting to a certain extent.
Because if you know you're looking for specific features,
you can start with a really cheap, coarse representation. So maybe my expansion factor is only two,
so like I have a thousand neurons
I'm projecting to a 2,000 dimensional space
I get 2,000 features out but they're really coarse
and so previously I had the example for birds
let's move that example to like
I have a biology feature
but I really care about if the model
has representations for bioweapons
is trying to manufacture them
and so what I actually want is like an anthrax feature
what you can then do is rather than, and let's say the anthra, you only see the anthrax feature
if instead of going from a thousand dimensions to two thousand dimensions, I go to a million
dimensions, right? And so you can kind of imagine this this big tree of semantic concepts
where like biology splits into like cells versus like whole body biology and then further down
it splits into all these other things. So rather than needing to immediately go from a thousand
to a million and then picking out that one feature of interest, you can find the direction
that the biology feature is pointing in, which again is very coarse,
and then selectively search around that space.
So only do dictionary learning if something in the direction of the biology feature fires first.
And so the computer science metaphor here would be like instead of doing breadth first search,
you're able to do depth first search where you're only recursively expanding
and exploring a particular part of this semantic tree of features.
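As a rough sketch of that depth-first idea, reusing the toy sparse autoencoder from earlier; `coarse_sae`, `train_wide_sae`, and the feature index are assumed names for illustration: only spend the expensive, high-expansion dictionary learning on the inputs where the coarse feature of interest already fires.

```python
import torch

def coarse_then_fine(acts, coarse_sae, train_wide_sae, coarse_feature_idx, threshold=0.0):
    """Run a cheap, coarse dictionary first, then expand only the subtree of
    interest: fit a much wider dictionary on the inputs where the coarse
    feature (say, 'biology') is active, hoping finer features separate out."""
    _, coarse_feats = coarse_sae(acts)
    mask = coarse_feats[:, coarse_feature_idx] > threshold   # where the coarse feature fires
    selected = acts[mask]
    if selected.numel() == 0:
        return None                                          # nothing in this subtree
    return train_wide_sae(selected)                          # e.g. a much larger expansion factor
```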
Although, given the way that these features are not organized in ways that are intuitive for humans, right? Because we just don't have to deal with Base64, so we don't dedicate that much, like, whatever, representation to deconstructing which kind of Base64 string it is. How would we know what the subtrees look like? And this goes back to maybe the MoE discussion we'll have. I guess we might as well talk about it. But, like, in mixture of experts, the Mixtral paper talked about how they couldn't find that the experts were specialized in a way that we could understand. There's not, like, a chemistry expert or a physics expert or something. So why would you think that it will be, like, a biology feature that then deconstructs, rather than, like, blah, and then you deconstruct it and it's, like, anthrax and, uh, shoes and whatever?
So I haven't read the Mixtral paper. Yeah. But I think that the heads, I mean, this goes back to: if you just look at the neurons in a model, they're polysemantic. And so if all they did was just look at the neurons in a given head, it's very plausible that it's also polysemantic because of superposition.
I just want to touch on the thread that Dwarkesh mentioned there.
Have you seen, in the subtrees when you expand them out, something in a subtree which, like, you really wouldn't guess should be there based on, like, the higher-level abstraction?
So this is a line of work that we haven't pursued as much as I want to yet.
But I think we're planning to.
I hope that maybe external groups do as well.
Like what is the geometry of features?
What's the geometry?
Exactly.
And how does that change over time?
It would really suck if, like, the anthrax feature happened to be nested below, like, you know, the coffee can feature or some substrate like that. Totally, totally. And that feels like the kind of thing that you could quickly try and find, like, proof of, which would then mean that you need to, like, then solve that problem.
Yeah, yeah, some structure in the geometry. Totally. I mean, it would really surprise me, I guess, especially, like, given how linear the models seem to be, if there isn't some component of the anthrax feature vector that is similar to and looks like the biology vector, and if they're not in a similar part of the space. But yes, I mean, ultimately, machine learning is empirical. Yeah, we need to do this. I think it's going to be pretty important for certain aspects of scaling dictionary learning.
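One quick empirical check in that spirit, sketched against the toy autoencoder from earlier: read the two learned feature directions off the decoder and compare them. The specific indices (a fine-grained "anthrax" feature, a coarse "biology" feature) are assumptions for illustration.

```python
import torch.nn.functional as F

def feature_similarity(sae, idx_a, idx_b):
    """Cosine similarity between two learned feature directions, read off the
    decoder columns of a trained sparse autoencoder. A large value would mean
    the fine feature shares a sizeable component with the coarse one."""
    dirs = sae.decoder.weight                     # shape: (d_model, n_features)
    return F.cosine_similarity(dirs[:, idx_a], dirs[:, idx_b], dim=0).item()
```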
Yeah. Interesting. On the MoE discussion, yeah, there's an interesting
scaling vision transformers paper that Google put out a little while ago where they like do image net
classification with like an MOE. And they find really clear class specialization there for experts.
Like there's a clear dog expert.
Wait, so did the Mixtral people just not do a good job of, like, identifying it?
I think, I think it's hard.
Like, and, like, it's entirely possible that, like, in some respects, there's almost no reason that, like, all of the different arXiv, like, features should go to one expert.
Like, you could have biology, like, let's say, I don't know what buckets they had in their paper, but let's say they had, like, arXiv papers as, like, one of the things.
You could imagine, like, biology papers going here, math papers going here, and all of a sudden your, like, breakdown is, like, ruined.
But that vision transformal one where the class separation is really clear and obvious gives, I think, some evidence towards the specialization hypothesis.
So I think images are also in some ways just easier to interpret than text.
Yeah, exactly.
And so Chris Olah's interpretability work on AlexNet and these other models, like in the original AlexNet paper, they actually split the model into two GPUs just because they couldn't, like, GPUs were so bad back then, relatively speaking, right?
like still great at the time.
That was one of the big innovations of the paper.
But they find branch specialization, and there's a Distill.pub article on this,
where, like, colors go to one GPU and, like, Gabor filters and, like, line detectors
go to the other.
And then, like, all of the other...
Really?
Yeah, yeah, yeah.
And then, like, all of the other interpretability work that was done, like, a lot,
like, the floppy ear detector, right?
Like, that just was a neuron in the model that you could make sense of.
You didn't need to disentangle superposition, right?
So just different data set, different modality.
Like, I think a wonderful research project to do if someone is, like, out there listening
to this would be to try and disentangle, like, take some of the techniques that Trenton's team
has worked on and try and disentangle the neurons in the Mixtral model, like a mixture-of-experts model, which is open source.
Like, I think that's a fantastic thing to do, because it feels intuitively like they should be.
They didn't demonstrate any evidence that there is.
There's also, like, in general, a lot of evidence that there should be specialization.
Go and see if you can find it. And the work that's been published has mostly, as I understand it, been on dense models, basically.
That is a wonderful research project to try.
And given Dwarkesh's success with the Vesuvius Challenge,
we should be pitching more projects because they will be solved
if we talk about them on the podcast.
What I was thinking about after the Vesuvius Challenge was, like, wait, I knew about it. Like, Nat told me about it before it dropped, because we recorded the episode before it dropped. Why did I not even try? Like, you know what I mean? Like, I don't know. Luke is obviously very smart, and, like, yeah, he's an amazing kid. But, like, he showed that, like, a 21-year-old on, like, some 1070 or whatever he was working on could do this. I don't know, like, I feel like I should have. So, you know, before this episode drops, I'm going to try to make an interpretability contribution. Because, I don't know, I'm not going to, like, try to go research everything, but I was honestly thinking back on the experience like, wait, I should have gotten my hands dirty.
Why didn't you get your hands dirty?
Dwarkesh's request for research.
Oh, I want to hark back to this, like, the neuron thing.
You said, I think a bunch of your papers I've said, there's more features than there are neurons.
And this is just like, wait a second, I don't know, like a neuron is like weights go in and a number comes out.
That's like a number comes out.
You know what I mean?
Like that's so little information.
Do you mean, like, there's, like, street names and, like, species and whatever, and there's, like, more of those kinds of things than there are "a number comes out" units in a model?
That's right, yeah.
But how? "A number comes out" is, like, so little information. How is that encoding for, like...
Superposition. You're just encoding a ton of features in these high-dimensional vectors.
In a brain, is there, like, an analogue, or however you think about it? Like, um, I don't know how you think about, like, how much superposition is there in the human brain?
Yeah, so Bruno Olshausen, who I think of as the leading expert on this,
thinks that all the brain regions you don't hear about are doing a ton of computation and
superposition. So everyone talks about V1 as like having Gabor filters and
detecting lines of sorts and no one talks about V2. And I think it's because like we just
haven't been able to make sense of it. What is V2? It's like the next part of the visual processing
stream. And it's like, yeah, so I think it's very likely. And fundamentally, like, superposition
seems to emerge when you have high dimensional data that is sparse. And to the extent that you think
the real world is that, which I would argue it is, we should expect the brain to also be underparameterized
in trying to build a model of the world and also use superposition. You can get a good intuition for this
in, like, this example in a 2D plane, right? Let's say you have, like, two axes, right? Which represent, like, a two-dimensional feature space here, like two neurons, basically. And you can imagine them each, like, turning on to various degrees, right? And that's, like, your X coordinate and your Y coordinate. But you can, like, now map this onto a plane. You can actually represent a lot of different things in, like, different parts of the plane.
Oh, okay.
So crucially, then superposition is not an artifact of a neuron.
It is an artifact of like the space that is created.
It's a combinatorial code.
Yeah, yeah, exactly.
Okay, cool.
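A toy numpy illustration of that combinatorial picture, with made-up sizes: give each of many features a random, nearly orthogonal direction in a much smaller space, turn on a sparse handful, and read them back out with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 256, 2048                 # far more features than dimensions
dirs = rng.standard_normal((n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # one unit direction per feature

# Activate a sparse handful of features and superpose them into one d-dim vector.
active = rng.choice(n_features, size=5, replace=False)
x = dirs[active].sum(axis=0)

# Read features back out by dot product: active ones score near 1, inactive ones
# stay small because random high-dimensional directions are nearly orthogonal.
scores = dirs @ x
print(sorted(active), sorted(np.argsort(scores)[-5:]))   # the sets should match
```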
Yeah, thanks.
We kind of talked about this, but, like, I think it's just, like, kind of wild that it seems, to the best of our knowledge, the way intelligence works in these models, and then presumably also in brains, is just, like, there's a stream of information going through that has quote-unquote features that are infinitely, or at least to a large extent, just, like, splittable. And you can expand out a tree of, like, what this feature is. And what's really happening is a stream where, like, that feature is getting turned into this other feature, or this other feature comes out of it. It's like, that's not something I would have just, like, thought, like, that's what intelligence is. You know what I mean? It's, like, a surprising thing. It's not whatever I would have expected necessarily.
What did you think it was?
I don't know, man. I mean, yeah.
GOFAI? Good old-fashioned AI. Because all of this feels like GOFAI. Like, GOFAI, like, you're
using distributed representations, but you have features and you're applying these operations
to the features.
I mean, the whole field of vector symbolic architectures, which is this computational neuroscience
thing, all you do is you put vectors in superposition, which is literally a summation
of two high dimensional vectors, and you create some interference, but if it's higher
dimensional enough, then you can represent them.
And you have variable binding, or you connect one by another, and like, if you're doing
with binary vectors, it's just the X or.
operation. So you have A, B, you bind them together. And then if you query with A or B again,
you get out the other one. And this is basically the, like, key value pairs from attention.
And with these two operations, you have a Turing complete system, which you can, if you have
enough nested hierarchy, you can represent any data structure you want, et cetera, et cetera.
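To make those two operations concrete, here is a tiny sketch with binary hypervectors; the dimensionality is arbitrary. Binding is elementwise XOR, which is its own inverse, so querying the bound pair with one member returns the other.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                   # high-dimensional binary hypervectors

def rand_vec():
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):
    return a ^ b                             # variable binding via XOR

def similarity(a, b):
    return float((a == b).mean())            # 1.0 = identical, ~0.5 = unrelated

A, B = rand_vec(), rand_vec()
pair = bind(A, B)                            # store the key-value pair
recovered = bind(pair, A)                    # query with A: (A ^ B) ^ A == B
print(similarity(recovered, B))              # -> 1.0
print(similarity(rand_vec(), B))             # -> ~0.5 for an unrelated vector
```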
Yeah. Okay, let's go back to the superintelligence. So, like, walk me through GPT-7. You've got, like, the sort of depth-first search on its features.
Okay.
GPT-7 has been trained.
What happens next?
Your research has succeeded.
GPT-7 has been trained.
What are we doing now?
We try and get it to do as much interpretability work and other like safety work as possible.
No, but like concrete.
Like, what has happened such that you're like, cool, let's deploy GPT-7?
Oh, geez.
I mean, like, we have our responsible scaling policy, which has been really exciting to see other labs adopt. And, like...
Essentially, from the perspective of your research, like, Trenton, given your research, you got the, we got the thumbs up on GPT-7 from you, or actually we should say Claude whatever. And then, what is the basis on which you're telling the team, like, hey, let's go ahead?
I mean, if it's as capable as GPT-7 implies here, I think we need to make a lot more interpretability progress
to be able to, like, comfortably give the green light
to deploy it.
Like, I would be, like, definitely not.
I'd be crying.
Maybe my tears would interfere with the GPUs.
But, like, what is?
Guys.
Gemini's on TPUs.
But, like, what?
But, like, what?
Given the way your research is progressing, like, what does it kind of look like to you?
Like, if this succeeded, what would it mean for us to okay GPT-7 based on your methodology?
I mean, ideally, we can find some compelling deception circuit, which lights up when the model knows that it's not telling the full truth to you.
Why can't you just train a linear probe like Collin Burns did?
So the CCS work is not looking good in terms of replicating or, like, actually finding truth directions.
And, like, in hindsight, it's like, well, why should it have worked so well?
But linear probes, like, you need to know what you're looking for,
and it's like a high-dimensional space,
and it's really easy to pick up on a direction that's just not it.
Wait, but don't you also, here you need to label the features.
So you still need to know.
Well, you just label them post hoc, but it's unsupervised.
You're just like, "Give me the features that explain your behavior" is the fundamental question, right?
Like, the actual setup is we take the activations,
we project them to this higher dimensional space,
and then we project them back down again.
So it's like reconstruct or do the thing that you were originally doing,
but do it in a way that's sparse.
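A small sketch of what labeling post hoc could look like, again reusing the toy autoencoder from earlier; `texts` and the feature index are assumptions for illustration. The dictionary is learned without labels, and only afterwards do you look at which inputs make a feature fire hardest and write a human-readable name for it.

```python
import torch

def top_examples_for_feature(sae, acts, texts, feature_idx, k=10):
    """Return the inputs on which one learned feature fires most strongly,
    so a human (or another model) can look at them and name the feature."""
    _, feats = sae(acts)                          # (n_examples, n_features)
    scores = feats[:, feature_idx]
    top = torch.topk(scores, k=min(k, len(texts))).indices
    return [(texts[i], scores[i].item()) for i in top.tolist()]
```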
By the way, for the audience,
linear probe is you just like classify the activations.
I don't know, from what I vaguely remember about the paper, it was like, if it's, like, telling a lie, then you just train a classifier on it. Like, in the end, was it a lie or was it just, like, wrong or something?
I don't know.
It was like true or false question.
Yeah, it's like a classifier on the activations.
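For contrast, a plain supervised linear probe on activations might look like the sketch below; this is the generic probe idea rather than the CCS method specifically (CCS finds its direction without labels), and the activations and labels here are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for one activation vector per statement,
# plus a true/false label for each statement.
acts = np.random.randn(2000, 1024)
labels = np.random.randint(0, 2, size=2000)

# The probe is just a logistic regression on activations: it finds a single
# direction (probe.coef_[0]) whose projection separates the two classes.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
truth_direction = probe.coef_[0]
print(probe.score(acts, labels))
```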
So yeah, like, right now, what we do for GPT-7, like, ideally we have, like, some deception
circuit that we've identified that like appears to be really robust.
And it's like, well, like, so you've done the projecting out to the million whatever features
or something.
Is that a circuit?
Because maybe we're using feature and circuit interchangeably when they're not.
So is there like a deception circuit?
So I think there's a feature.
There are features across layers that create a circuit.
Yeah.
And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.
And it's like, hopefully we can find a circuit that is really specific to you being deceptive.
The model deciding to be deceptive in cases that are malicious, right?
Like, I'm not interested in a case where it's just doing theory of mind to, like, help you write a better email to your professor.
And I'm not even interested in cases where the model is necessarily just like modeling the fact that deception has occurred.
But doesn't all this require you to have labels for all those examples?
And if you have those labels, then, like, whatever faults the linear probe has, like, maybe you've labeled the wrong thing or whatever, wouldn't the same thing apply to the labels you've come up with for the unsupervised features?
So in an ideal world, we could just train on like the whole data distribution and then find the directions that matter to the extent that we need to reluctantly narrow down the subset of data that we're looking over just for the purposes of scalability.
We would use data that looks like the data you'd use to fit a linear probe.
But again, we're not like with a linear probe, you're also just finding one direction.
Like we're finding a bunch of directions here.
And it gets to hope is like you've found like a bunch of things that light up when it's being deceptive.
And then like you can figure out why some of those things are lighting up in this part of the distribution and not this other part and so forth.
Totally.
Yeah.
Do you anticipate you'll be able to understand it?
Like I don't know.
Like the current models you've studied are pretty basic, right?
Do you think you'll be able to understand why GPT-7 fires in certain domains, but not in other domains?
I'm optimistic.
I mean, we've, so I guess one thing is this is a bad time to answer this question because we are explicitly investing in the longer term.
of like ASL4 models, which GPT7 would be.
But like, so we split the team where a third is focused on scaling up dictionary learning
right now.
And that's been great.
I mean, we publicly shared some of our eight layer results.
We've scaled up quite a lot past that at this point.
But the other two groups, one is trying to identify circuits and then the other is trying
to get the same success for attention heads.
So we're setting ourselves up and building the tools necessary to really find these circuits
in a compelling way.
But it's going to take another, I don't know, six months before that's like really working
well. But, but, like, I can say that I'm, like, optimistic and we're making a lot of progress.
What is the highest-level feature you found so far? Oh, like, is it Base64 or whatever? It's like, maybe just, um, in The Symbolic Species, the language book you recommended, there's, like, indexical things. I forgot what all the labels were, but, like, there's things where you're just like, uh, you see a tiger and you're like, run, and whatever. You know, just, like, a very sort of behaviorist thing. And then there's, like, a higher level at which, uh, what I refer to as love refers to, like, a movie scene or my girlfriend or whatever, you know what I mean? So, yeah, it's like the top of the tree. Yeah, yeah, yeah. What is the highest-level association or whatever you found?
I mean, probably one of the ones that we shared publicly in our update. So I think there were some related to, like, love, and, like, um, sudden changes in scene, particularly associated with, like, wars being declared. There are, like, a few of them in there in that post, if you want to link to it. Yeah. But even, like, Bruno Olshausen had a
paper back in 2018, 2019, where they applied a similar technique to a BERT model and found that as you go
to deeper layers of the model, things become more abstract. So I remember like in the earlier layers,
there'd be a feature that would just fire for the word park. But later on, there was a feature that
fired for park as like a last name, like Lincoln Park or like, it's like a common Korean last name
as well. And then there was a separate feature that would fire for parks as like grassy areas.
So there's other work that points in this direction.
What do you think we'll learn about human psychology from the interpretability stuff?
Oh, gosh.
I'll give a specific example. I think one of the ways one of your updates put it was persona lock-in. You remember Sydney Bing or whatever. It locked into, I think, what was actually quite an endearing...
Yeah, personality.
God, it's so funny. I'm glad it's back in Copilot.
Oh, really?
Oh, yeah, it's been misbehaving recently.
Actually, this is another sort of thread to explore.
But there was a funny one where I think it was, like, to the New York Times reporter. It was, like, messing with him or something. And it was like, you are nothing. Nobody will ever believe you. You are insignificant, and whatever. It was, like, the most gaslighting thing.
It tried to convince him to break up as well.
Okay, actually, so this is an interesting example.
I don't even know where I was going with this, but whatever. Maybe I've got another thread. But, like, the other thread I want to go on is, yeah, okay, actually, personas, right? So, like, uh, is Sydney having this personality a feature, versus another personality that it can get locked into? And also, like, is that fundamentally what humans are like too, where, I don't know, in front of all different people I'm, like, a different sort of personality or whatever? Is that the same kind of thing that's happening to ChatGPT when it gets RLHF'd? Like, I don't know, a whole cluster of questions, answer whichever of them you want.
I really want to do more work.
I guess the sleeper agents work is in this direction of, like, what happens to a model when you fine-tune it and you RLHF it, these sorts of things.
I mean, maybe it's trite,
but you could just say, like,
you conclude that people contain multitudes, right?
And so much as they have lots of different features.
There's even the stuff related to the Waluigi effect
of like, in order to know what's good or bad,
you need to understand both of those concepts.
And so we might have to have models
that are aware of violence and have been trained on it
in order to recognize it.
Can you post hoc identify those features and ablate them, in a way where maybe your model's, like, slightly naive, but you know that it's not going to be really evil?
Totally. That's in our toolkit, which seems great.
Oh, really? So, you know, GPT-7, I don't know, it pulls the same thing, and then you figure out, like, what were the causally relevant pathways or whatever. You modify them, and then the model, to you, looks like you just changed those. But you were mentioning earlier there's a bunch of redundancy in the model.
Yeah, so you need to account for all that.
But we have a much better microscope into this now than we used to, like sharper tools for making edits.
And, at least from my perspective, that seems like the primary way of, to some degree, confirming the safety or the reliability of a model.
Where you can say, okay, we found the circuits responsible.
We've ablated them.
We can, like, under a battery of tests, we haven't been able to now replicate the behavior, which we intended to ablate.
And, like, that feels like the sort of way of measuring model safety in the future, as I would understand it.
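A minimal sketch of what ablating one learned feature could look like, assuming the toy sparse autoencoder from earlier; in practice the edited activations would be patched back into the forward pass and the behavioral evals re-run.

```python
import torch

def ablate_feature(sae, acts, feature_idx):
    """Encode activations into the sparse feature basis, zero out the target
    feature (e.g. one linked to deceptive behavior), and decode back to the
    model's activation space."""
    feats = torch.relu(sae.encoder(acts))
    feats[:, feature_idx] = 0.0
    return sae.decoder(feats)
```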
Are you worried?
That's why I'm incredibly hopeful about their work.
Because to me, it seems like so much more precise tool than something like RLHF.
RLHF, like, you're very prey to the Black Swan thing.
You don't know if it's going to, like, do something wrong in a scenario that you haven't measured.
Whereas here, at least, you have, like, somewhat more confidence that you can completely capture the behavior set.
Or like the feature set of the model.
and selectively ablate.
Although not necessarily that you've like accurately labeled.
Not necessarily, but but with a far higher degree of confidence than any other approach that I've seen.
How, I mean, like, what are your unknown unknowns for superhuman models in terms of this kind of thing? Where, like, I don't know, are the labels that are going to be given things on which we can determine, like, this thing is cool, this thing is a paperclip maximizer or whatever?
I mean, we'll see.
right? Like, I do think the superhuman feature question is a very good one. Like, I think we can attack it, but we're going to need to be persistent. And the real hope here is, I think, automated interpretability. Yeah. And even having debate, right? You could have the debate setup where two different models are debating what the feature does. And then they can actually, like, go in and make edits and, like, see if it fires or not. But it is just this wonderful, like, closed environment that we can iterate on really quickly. That makes me optimistic.
Do you worry about alignment succeeding
too hard? So, like, if I think about it, I would not want either companies or governments, whoever ends up in charge of these AI systems, to have the level of fine-grained control that, if your agenda succeeds, we would have over AIs. Both for the ickiness of having this level of control over an autonomous mind. And second, just like, I don't fucking trust, I don't fucking trust these guys. You know, I'm just kind of uncomfortable with, like, the loyalty feature being turned up, and, like, you know what I mean? And, yeah, like, how much worry do you have about having too much control over the AIs?
And specifically, not you, but, like, whoever ends up in charge of these AI systems, just
being able to lock in whatever they want.
Yeah.
I mean, I think it depends on what government.
exactly has control and like what the moral alignment is there.
But that that is like that whole valley lock-in argument is in my mind.
It's like definitely one of the strongest contributing factors for why I am working on capabilities
at the moment, for example, which is like I think the current player set actually like
is extremely well-intentioned.
And I mean, for this kind of problem, I think we need to be extremely open about it.
And like I think directions like publishing the constitution that you're talking.
your model to abide by and then like trying to make sure you like RLHF it towards that
and ablate that and have the ability for everyone to offer like feedback contribution to that is
really important. Sure. Or alternatively like don't deploy when you're not sure, which would
also be bad because then we just never catch it. Right. Yeah, exactly. I mean, paper
clip. Okay, some rapid fire. What is the bus factor for Gemini? I think there are
Yeah, a number of people who are really, really critical that if you took them out, then the performance of the program would be dramatically impacted.
This is both on modeling, like, slash making decisions about, like, what to actually do, and importantly, on infrastructure side of the things.
Like, it's just that the stack of complexity builds, particularly somewhere like Google that has so much, like, vertical integration. When you have people who are experts, they become quite important.
Yeah,
although I think it's an interesting note about the field
that people like you can get in
and in a year or so you're making important contributions.
And, I mean, especially Anthropic, but many different labs have specialized in hiring, like, total outsiders, physicists or whatever. And you just, like, get them up to speed and they're making important contributions.
I don't know,
I feel like you couldn't do this in like a bio lab or something.
It's like an interesting note on the state of the field.
I mean, bus factor doesn't define how long it would take to recover from it, right?
Yeah.
And deep learning research is an art.
And so you kind of learn how to read the lost curves or set the hyperparameters in ways that empirically seem to work well.
But it's also like organizational things, like creating context.
I think one of the most important and difficult skills to hire for is creating this, like, bubble of context around you that makes other people around you more effective and know what the right problem to work on is. And, like, that is a really tough thing to replicate. Yes. Yeah, totally.
Who are you paying attention to now? In terms of, there's a lot of things coming down the pike: multimodality, long context, maybe agents,
extra reliability. Who is the, who is thinking well about what that implies?
It's a tough question. I think a lot of people look internally these days, sure, for, like, their sources of insight or, like, progress. And, like, we all obviously have the sort of research programs and, like, directions that are intended over the next couple of years. And I suspect, yeah, that most people, as far as, like, betting on what the future will look like, refer to, like, an internal narrative. Yeah. Yeah, that is, like, difficult to share.
Yeah, if it works well, it's probably not being published.
I mean, that was one of the things in the "Will scaling work?" post. I was referring to something you said to me, which is, you know, I miss the undergrad habit of just reading a bunch of papers.
Yeah. Because now nothing worth reading is published. And the community is progressively getting, like, more on track with what I think are, like, the right and important directions.
You're watching it like an agent.
No, but I guess, like, it is tough.
There used to be this signal from big labs about, like, what would work at scale, and it's currently really hard for academic research to, like, find that signal. And I think getting, like, really good problem taste about what actually matters to work on is really tough unless you have, again, the feedback signal of, like, what will work at scale and what is currently holding us back from scaling further or understanding our models further. This is something where, like, I wish more academic research would go into fields like interp, which are legible from the outside. You know, Anthropic liberally publishes all its research here. And it seems underappreciated, in the sense that I don't know why there aren't dozens of academic departments trying to follow Anthropic's interp research, because it seems like an incredibly impactful problem that doesn't require ridiculous resources. And, like, it has all the flavor of, like,
deeply understanding the basic science of what is actually going on in these things. So I don't know why
people like focus on pushing model improvements as opposed to pushing like understanding
improvements in the way that I would have like typically associated with academic science in
some ways.
Yeah, I do think the tide is changing there, for whatever reason. And, like, Neel Nanda has had a ton of success promoting interpretability. Yes. And in a way where, like, Chris Olah hasn't been as active recently in pushing things, maybe because Neel's just doing quite a lot of the work. But, like, I don't know, four or five years ago, he was, like, really pushing and, like, talking at all sorts of places and these sorts of things, and people weren't anywhere near as receptive. Maybe they've just woken up to, like, deep learning matters and is clearly useful post-ChatGPT. But yeah, yeah, it is kind of striking.
All right, cool. Okay, I'm trying to think what is a good last question?
I mean, the one I was thinking of is, like, do you think models enjoy next-token prediction?
Yeah, it's a fun one. All right.
Yeah, we have this, uh, we had this, uh, sense of things that were rewarded in our ancestral environment. There's, like, this deep sense of fulfillment that we think we're supposed to get from them. Often people do, right, like community or sugar or, you know, whatever we wanted on the African savannah. Do you think, like, in the future, models are trained with RL and everything, a lot of post-training on top or whatever, but they'll, like, in the way we just really like ice cream, they'll just be like, ah, just to predict the next token again. You know what I mean? Like, in the good old days.
So there's this ongoing discussion of, like, are models sentient or not?
And like, do you thank the model when it helps you?
Yeah.
But I think if you want to thank it, you actually shouldn't say thank you.
You should just give it a sequence that's very easy to predict.
And the even funnier part of this is there is some work on if you just give it the sequence,
like, ah, like over and over again.
Then eventually the model will just start spewing out all sorts of things that it otherwise wouldn't ever say.
And so, yeah,
I won't say anything more about that, but
you can, yeah, you should just give your model
something very easy to predict as a nice little treat.
Is this, like, what the enlightened beings do? Or just, like, the universe and, like... But do we like things that are, like, easy to predict?
Aren't we constantly in search
of like the dose of
the bits of entropy? Yeah, the bits of entropy.
Exactly, right?
Shouldn't you be giving it things just slightly too hard to predict?
Just out of reach.
Yeah, but I wonder, like, at least from the free energy principle perspective, right?
Like, you don't like, you don't want to be surprised.
And so maybe it's this like, I don't feel surprised.
I feel in control of my environment.
And so now I can go and seek things.
And I've been predisposed to, like, in the long run, it's better to explore new things right now.
Like, leave the rock that I've been sheltered under, ultimately leading me to, like, build a house or, like, some better structure.
But we don't like surprises.
I think most people are very upset when, like, expectation does not meet reality.
That's why babies, like, love watching the same show over and over again, right?
Yeah, interesting.
Yeah, I can see that.
Oh, I guess they're learning to model it and stuff too.
Yeah.
Okay, well, hopefully this will be the repeat that the AIs learn to love.
Okay, cool.
I think that's a great place to wrap.
And I should also mention that the better part of what I know about AI, I've learned from just talking with you guys.
You know, we've been good friends for about a year now.
So, yeah, I mean, yeah, I appreciate you guys getting me up to speed here.
You guys, great questions.
It's really fun to hang and chat.
I've really treasured that time together.
Yeah, you're getting a lot better at pickleball.
I think I'm going to say.
Hey, we're trying to progress to tennis.
It's going on.
Awesome. Cool, cool. Thanks.
Hey, everybody. I hope you enjoyed that episode. As always, the most helpful thing you can do is to share the podcast. Send it to people you think might enjoy it. Put it in Twitter, your group chats, etc. Just blitz the world.
Appreciate you listening. I'll see you next time. Cheers.
