Latent Space: The AI Engineer Podcast - The Utility of Interpretability - Emmanuel Ameisen
Episode Date: June 6, 2025

Emmanuel Ameisen is lead author of “Circuit Tracing: Revealing Computational Graphs in Language Models” (https://transformer-circuits.pub/2025/attribution-graphs/methods.html), which is part of a duo of MechInterp papers that Anthropic published in March (alongside https://transformer-circuits.pub/2025/attribution-graphs/biology.html).

We recorded the initial conversation a month ago, but then held off publishing until the open source tooling for the graph generation discussed in this work was released last week: https://www.anthropic.com/research/open-source-circuit-tracing

This is a 2 part episode: an intro covering the open source release, then a deeper dive into the paper, with guest host Vibhu Sapra (https://x.com/vibhuuuus) and Mochi the MechInterp Pomsky (https://x.com/mochipomsky). Thanks to Vibhu for making this episode happen!

While the original blogpost contained some fantastic guided visualizations (which we discuss at the end of this pod!), with the notebook and Neuronpedia visualization (https://www.neuronpedia.org/gemma-2-2b/graph) released this week, you can now explore on your own with Neuronpedia, as we show you in the video version of this pod.

Timestamps
00:00 Intro & Guest Introductions
01:00 Anthropic's Circuit Tracing Release
06:11 Exploring Circuit Tracing Tools & Demos
13:01 Model Behaviors and User Experiments
17:02 Behind the Research: Team and Community
24:19 Main Episode Start: Mech Interp Backgrounds
25:56 Getting Into Mech Interp Research
31:52 History and Foundations of Mech Interp
37:05 Core Concepts: Superposition & Features
39:54 Applications & Interventions in Models
45:59 Challenges & Open Questions in Interpretability
57:15 Understanding Model Mechanisms: Circuits & Reasoning
01:04:24 Model Planning, Reasoning, and Attribution Graphs
01:30:52 Faithfulness, Deception, and Parallel Circuits
01:40:16 Publishing Risks, Open Research, and Visualization
01:49:33 Barriers, Vision, and Call to Action
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
All right, we are actually going to record this as an intro to the main episode, but here we have my trusty co-host, guest host, I guess, Vibhu, as well as Emmanuel from Anthropic.
We're going to talk about the circuit tracing stuff and all the interpretability work, but Emmanuel, maybe you want to do a quick self-intro before we get into it.
Yeah, sure. I'm Emmanuel. I work on the interpretability team here at Anthropic, more specifically on the circuits team.
So we recently released a pair of papers about sort of like the work that we've been doing over the last months.
And even more recently, we released some code in partnership with the Anthropic Fellows program.
It was mostly built by Anthropic Fellows that lets people play with the research, basically.
And so happy to talk about that.
And we also hope to kind of like keep releasing more things and partner with other groups that are working on similar stuff.
Yeah, amazing.
We'll get deeper into like the behind the scenes on the main podcast.
But let's maybe just dive right into what you released because that's the most topical thing.
This is like literally you just launched it like yesterday and we'll probably release it in an
episode in a few days. So yeah, like what can people do or what do you recommend people try?
Totally. So like a really high level, you know, the sort of like idea of the research itself
is to try to explain sort of like some of the computation that a model did when it predicted a given
token. And so in our paper, we kind of like show how to do this, and then we show examples of us doing this on internal private models.
And then the release this week sort of lets anyone do it for a set of open source models.
So notably, maybe the easiest one here is Gemma 2 2B.
So you can think of some prompt and you can explain any
token that the model samples.
And explain here just means basically blow up the internal state of the model and show
all of the sort of intermediate things that the model was thinking about before it
got to the final token that it predicted.
Yeah, so some of the things that you guys put out is kind of in the circuit tracing,
you have a few core examples, right?
So like we can see how these models have internal reasoning states and there's like
multi-hop reasoning.
And some of the stuff that we talked about on the podcast was how can people that are
interested in how models work kind of do anything, right?
So what are open questions?
How can people contribute?
And it seems like, you know, the follow up is, okay, it's been a few weeks.
Here's a huge library.
So, you know, I guess before we even
get into it, what are some open questions that you would expect people to kind of play around with?
You know, what are people going to do? Why should we probe Gemma, Llama? What are interesting
things we can do? And any tips on using it? Yeah. I think there's maybe like two to three like
categories of things that people could do. So I'll go from sort of like the most basic kind of low
effort to, you know, hey, if you want to dedicate like a month of your life, you could do that.
The sort of most basic thing is, you know, Gemma 2 and Llama 1B are smaller models, but they can still do a bunch of stuff.
And for most of the things that they can do, we still don't really know or have a good mental model of how they do the things that they do.
So to give you an example, one of the things in the paper is the sort of multi-hop reasoning where, you know, we ask Claude 3.5 Haiku to complete something like "the capital of the state where Dallas is" and it says Austin. It turns out that
Gemma can do this also. And so as part of the release, we have, you know, a notebook where Michael
Hannah, one of the Anthropic fellows kind of walks through a bunch of examples, including this one.
And it's really cool because you can see that actually the way the circuit looks in Gemma, a really small model, is extremely similar to the way that it looks in a huge model, which in itself is, I think, a pretty novel discovery.
It's like, oh, these models are super different. You know, if you look at their evals or if you just try to use them, they're just very clearly different.
But for this one task, for this one thing, actually, the way that they do this multi-step reasoning is like the same way.
They actually do the multi-step reasoning.
In the notebook, there are other examples of fun things that we looked at that I think can sort of pique your interest if you're new to thinking about this stuff.
And at the end of the notebook that's linked in the README, there are three examples of random cases that we haven't solved or haven't labeled that, you know, have a graph precomputed for you.
you could just look at it and try to figure out what's happening.
And by figure out what's happening, what we mean is, you know, we might do like a quick demo here,
but it's kind of like, look at these representations, try to understand, like,
okay, like, what is the computation the model is doing?
And then part of the release also lets you, like, run experiments to, like, verify that you're right.
So if you think that, like, you know, ah, the model, like, first, like, thinks about Texas in this case.
You can also just, like, stop it from thinking about Texas and see if, like, that damages it.
And so, like, the tools to do that are available.
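To make the "stop it from thinking about Texas" idea concrete, here is a minimal toy sketch of an ablation experiment. It is not the API of the released library: the tiny two-layer model, the weights, and the feature and output labels are all made up for illustration. The point is only the shape of the experiment: run the model, zero out one intermediate feature, and check whether the output changes.

```python
import numpy as np

def forward(x, W_in, W_out, ablate=None):
    """Toy model: features = relu(W_in @ x), logits = W_out @ features.

    `ablate` is an optional list of feature indices to zero out,
    which plays the role of the intervention.
    """
    features = np.maximum(W_in @ x, 0.0)
    if ablate is not None:
        features = features.copy()
        features[ablate] = 0.0
    return W_out @ features

# Hypothetical weights: feature 0 stands in for a "Texas" feature
# that strongly supports output 0 (call it "Austin").
W_in = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
W_out = np.array([[2.0, 0.1],   # logit for "Austin"
                  [0.1, 1.0]])  # logit for some other capital
x = np.array([1.0, 0.5])

baseline = forward(x, W_in, W_out)
ablated = forward(x, W_in, W_out, ablate=[0])

print("baseline logits:", baseline)
print("ablated logits: ", ablated)
print("prediction changed:", baseline.argmax() != ablated.argmax())
```

If the hypothesis is right, removing the feature should damage the prediction it supposedly supports; if nothing changes, the feature was probably a side effect rather than part of the main mechanism.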
And so I would say that's, like, the first thing.
I think the hope is, there are a lot of behaviors that models do, way more than any single group has time to explore.
And so the hope is like, hey, pick a behavior you think is interesting and try to understand what's happening and try to ground it out.
And it's like the sort of like basic thing and maybe like the thing that I'm most excited about with this release.
But then the other thing I do want to mention, parts two and three, is that we also hope that other groups and interested researchers can just use this to extend the method.
Like if you have an idea about how to do this better, you know, all the code to make this graph is open source.
You can take a look at it and just try to play with it, try to find different ways to create these graphs, and also extend it to other models.
Right. Like there are many different models. And so, you know, part of making this work on any model is that you have to train the sort of replacement model, which again, there is code for, and there are other groups working on it.
And so that's also something that, if you're excited about it, you could say, okay, cool, well, I want this to work on another open model. And you could sort of add it if you're, you know,
more into sort of maybe like the engineering or the ML engineering side of things.
Yeah, we actually get into a little bit about how you,
how you guys do the extra data viz stuff that makes your blog post pop so much.
Should we share the screen a little bit and dive in?
I think you guys prep some examples.
Totally.
Yeah, yeah.
It's just like there's nothing better than the creator of the tool walking through the tool.
And we might as well capture that so that, you know,
people who actually want to do this can follow along.
Yeah.
That makes real sense.
Let me just like actually share my screen.
My one little experiment, I basically cloned the repo, threw it into Claude Code and was like, you know, deal with this.
Let's just try it end to end.
So I would recommend, you know, Claude Code is very good at using this.
Basically, also, if you're just trying to get started, the circuit tracing tutorial notebook, very good.
That kind of goes over all the high level.
And then, you know, shout out Claude Code, try it out.
It works very well on this.
That's awesome right here.
Actually, I might just open the notebook first, just like quickly walk through the illustrations.
But yeah, you're the second person to tell me that they just had Claude Code
sort of like dig in initially.
I'm glad that's working.
The tutorial here is like linked at the top of the repo,
and maybe we can link it from the podcast,
but essentially sort of like walks you through kind of like how to think about graphs.
And so it links to these circuits.
So here this is the two-step reasoning we were talking about.
This is kind of like a schematic of it where it's like the capital of the state containing Dallas,
and it's like, ah, it has to think of Texas and then Austin.
But the notebook links you to all of these circuits here.
And this is the kind of thing that you can play with.
So this is the UI on Neuronpedia that hosts this and that lets you create any circuit.
So here, you know, we could explore the circuit and if you open the notebook, you could explore it.
I'm realizing that I switched tabs.
Maybe I'm not sharing it.
Okay.
There we go.
Can you see the circuit now?
I think so.
Okay, cool.
But you can make a new graph super easily and quickly.
And so maybe this is the most fun thing.
What I was playing with right before, you know, joining this call is,
it turns out that podcast guests are very formulaic.
And so if you say, like, thanks for having me on the whatever, Gemma seems to pretty consistently guess that you're on a podcast, which makes sense, right?
Like, why would you say, thanks for having me on the blog?
And so here we can try to say like, oh, okay, like, how does Gemma know to like complete the sentence with, thanks for having me on the latent space podcast?
And so here the way you generate a graph, right, is you type a sentence where the next word is the thing that you're interested in.
And then you kind of try to explain how the model got to the next word.
So here you can give it a name.
And then you can mostly just not worry about any of these parameters, I think, if you're just playing with it.
And you can click Start Generation.
And this generates... Something important for people to know is that these are trained on base models, right?
So they're not chat models.
So basically when you train these models, they're just trained to predict next token and they don't have that user assistant chatbot flow.
So they're prompted in a way such that, you know, the output should basically just be the next word.
Yeah, you kind of want to think about it as like maybe the prompt or the text you're giving me is like the text of like a book or an article rather than a conversation where it's like, you know, what is a sentence where if you were to read it in a book like the next word would be sort of like the interesting one.
Yeah.
You know, you can click on it,
sort of like takes a little bit of time to load
because there's just a bunch of data.
So what we're going to show you here is basically like
almost every single feature that activates in the model.
The features are these intermediate representations.
And at the bottom, there's the prompt.
So here it's like, thanks for having me on the latent space.
And at the top, you can see what the model sort of output.
So its most likely output, it's pretty confident that we're talking about a podcast.
And, you know, it has some random stop tokens,
blog, show, and then some stuff that I think makes less sense, but also these are small models.
And so sometimes they say random stuff. And so the way that you could then explore this, it'd be like,
okay, so the model says podcast. So like, why does it say podcast? So you can click on this output and say,
like, what are the features, again, these like intermediate representations that have an input
to this. So it seems like there's features at here, this is the layer, like layer 18 that already, like
are about podcast episodes.
You can know this because the features have a label,
but also if you want,
you can look at the feature itself here.
And here you can see that this shows you other text that the feature is active over, and it's just text about podcasts.
So that's a way that you can also understand what the features are.
And then you can keep going back.
So it's like, oh, okay, so it said podcast here because of this podcast feature.
Where did that come from?
And it's like, oh, it comes from words related to podcasts, words associated with podcasts, as well as an interview feature and also just the word "on."
So there's like a bias.
Like if you're saying, blah, blah, blah, having me on, that slightly increases the chance that you're talking about a podcast at all.
And you can sort of keep going back
and kind of like explore the graph interactively.
I would say that like the way to do it
and we talked about this on the like
kind of like longer version of the podcast
but it's like you know,
kind of like chasing from the interesting outputs back
or from the interesting input forward.
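As a rough sketch of "chasing from the interesting output back," the walk can be pictured as repeatedly following the strongest incoming edges of a node. The graph below is hypothetical: the node names loosely echo the podcast example, the edge weights are invented, and the dictionary layout is just an assumption for illustration, not the format the released tooling uses.

```python
def trace_back(edges, start, top_k=2, max_depth=3):
    """Walk from an output node back toward the inputs, keeping only the
    top_k strongest incoming edges (by absolute weight) at each node.

    `edges` maps a target node to a dict of {source node: edge weight}.
    Returns a list of (source, target, weight) tuples.
    """
    kept = []
    frontier = [start]
    seen = {start}
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            incoming = sorted(edges.get(node, {}).items(),
                              key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
            for src, weight in incoming:
                kept.append((src, node, weight))
                if src not in seen:
                    seen.add(src)
                    next_frontier.append(src)
        frontier = next_frontier
    return kept

# Hypothetical graph; all weights are made up.
edges = {
    "output: podcast": {"podcast episodes (layer 18)": 0.6,
                        "interview": 0.2,
                        "misc": 0.05},
    "podcast episodes (layer 18)": {"words related to podcasts": 0.5,
                                    "token: on": 0.1},
    "interview": {"expressing gratitude": 0.3},
}

path = trace_back(edges, "output: podcast")
for src, dst, weight in path:
    print(f"{src} -> {dst} ({weight})")
```

Limiting the walk to a couple of edges per node is one simple way to avoid looking at every node in a busy graph.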
There are many nodes on these.
I wouldn't recommend looking at all of them.
You can also sort of like prune them a little more aggressively here
if this is too busy and kind of look,
this shows you like only the most important ones
and you can sort of like be pretty extreme with it if you want.
Or you can show the whole thing and then be like super overwhelmed.
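One plausible way to picture the pruning control, sketched under the assumption that every node carries a single nonnegative influence score: keep the highest-scoring nodes until they account for a chosen fraction of the total. The scores and node names here are invented, and the criterion used by the actual tooling may differ.

```python
def prune_nodes(influence, keep_fraction=0.8):
    """Keep the highest-influence nodes until the kept nodes account for
    at least keep_fraction of the total influence."""
    total = sum(influence.values())
    kept, running = [], 0.0
    ranked = sorted(influence.items(), key=lambda kv: kv[1], reverse=True)
    for name, score in ranked:
        if running >= keep_fraction * total:
            break
        kept.append(name)
        running += score
    return kept

# Hypothetical influence scores (arbitrary units).
influence = {
    "podcast episodes": 50,
    "interview": 20,
    "token: on": 15,
    "expressing gratitude": 10,
    "misc": 5,
}

print(prune_nodes(influence, keep_fraction=0.8))  # moderate pruning
print(prune_nodes(influence, keep_fraction=0.5))  # aggressive pruning
print(prune_nodes(influence, keep_fraction=1.0))  # show everything
```

A lower `keep_fraction` corresponds to the more extreme pruning mentioned above, and a value of 1.0 corresponds to showing the whole graph.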
Once you kind of do this,
you can then kind of like group your nodes into similar ones
to kind of like make a graph.
I actually made this little summary earlier
so I can just share that type.
So this is the exact same
graph, but just before hopping on I kind of did a few groups. So this is the same thing: podcast. It's like, oh, there's a bunch of nodes that are like podcast episodes. There's a bunch of things like discussing podcasts. There's a node about expressing gratitude that amplifies that you're on an interview or a podcast. So one fun experiment you could do here, right, is like, oh, what happens if I mess with this? Like if I mess with the "this person is grateful to be on" node, and it's like this person is on
something else. Like, maybe there are things that, you know, you could be on that you're not grateful
for, like, oh, you're having me on trial or something. I don't know. Like, that could be one,
one interesting sort of like, experiment to see what the causal effect of this is. And again,
you could sort of like label it more and explore it more. And this UI, the whole point is for it to be
like snappy and quick. So you can just like generate a bunch of graphs pretty easily, right? Like,
maybe this wasn't exactly what you wanted. So you're like, I'm super unhappy to be on the latent
space and then you can see what it completes for that or whatever. And you can sort of like just
continuously play with it and get a better sense for your hypotheses. Oftentimes you kind of like
want different prompts, you know, different examples that are similar to kind of get a sense for
it. And then if you're really curious and you want to dig in more, that's when I would recommend
going back to like the code base and some of the notebooks. Maybe one thing, one last thing I'll say
on that is that the notebooks themselves can all be run in Google Colab. And all of the code,
as far as we can tell (we tested the open notebooks), just runs on Colab.
And so that means that, on the free tier to be clear, you don't need an expensive GPU.
Like you don't need like an expensive GPU.
You can just run this and kind of like run your interventions and play with it.
And so in this notebook in particular, the intro one, we show you how to do these interventions.
And here we're like, what happens if we turn this node off?
And what happens if we turn that one off?
And what happens if we turn this one off?
And what happens if, you know, we inject one from one prompt into another one.
And so I think that's the sort of like deeper dive trying to understand the mechanism better.
but if you're just trying to even get a sense at all,
like, how does the model do X?
You can just generate a graph and take a look at it.
Incredible.
Very cool.
When I look at the graph,
there's a thought in my mind about maybe this is too easy,
too perfect.
And one version of this is, there's supposed to be superposition,
and here there's no superposition, kind of.
Well, there is superposition.
And like, we're sort of like, so maybe I can share the graph again and like answer your question, which I think is like, what are we hiding here? Where are the skeletons?
Yeah, this is like, it's too, it's too clean. I'm like, yeah.
So, so maybe like a good example is, and we're going to like make this slightly less overwhelming here, is like, okay, so you look at this graph and you say like, yeah, like we don't actually understand how models work fully. So like, what are you hiding here?
And the thing that's like important to know here is, you know, I didn't say this explicitly, but like the layers are like arranged here.
And so let's just look at like one layer.
So for this layer, what we're saying is like the only thing that that is happening or that's like important enough is this one feature, which is just like one small direction in the model space, right?
Like one dimension we've pulled out of superposition or let's say that for now.
But then also there's these diamonds.
And these diamonds are errors.
We talked about them on the longer podcast, but when you train these replacement models to replace them in the model computation, you successfully
replace some of it, and some of it you fail to replace. And so this is everything that we don't understand. That means that sometimes, if you look at an input, like this guy's input, you'll see a bunch of errors here as the input. And so essentially there are some graphs and some examples where, if most of the stuff that you see is these errors, that just means, hey, for this prompt, we were not able to explain that part of the computation. And so at least that part is unexplained, and
we sort of show it in your face, where it's like, here's what we don't understand.
And so, so you can sort of like see what we don't understand.
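The error nodes can be pictured numerically as a residual: whatever part of the original MLP output the replacement model's features fail to reconstruct. The sketch below uses random stand-in vectors and a hypothetical decoder matrix, so the numbers mean nothing; it only illustrates that the reconstruction plus the error term recovers the original output exactly, and that the size of the error is a measure of how much is left unexplained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 16

# Stand-ins: a "true" MLP output, and a replacement model made of
# nonnegative feature activations plus a decoder matrix (all random here).
true_output = rng.normal(size=d_model)
decoder = rng.normal(size=(d_model, n_features))
features = np.maximum(rng.normal(size=n_features), 0.0)

reconstruction = decoder @ features      # the part the features account for
error = true_output - reconstruction     # what the error node carries

# By construction, reconstruction + error gives back the original output.
assert np.allclose(reconstruction + error, true_output)

# Relative size of the unexplained part (squared norms).
fraction_unexplained = np.linalg.norm(error) ** 2 / np.linalg.norm(true_output) ** 2
print("fraction unexplained:", fraction_unexplained)
```

With a trained replacement model the error would hopefully be small; when it dominates, that corresponds to the graphs described above where most of the inputs to a node are error nodes.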
There's also, I will say like one more thing.
There's like a bunch more stuff that can get you.
And that's like in the paper.
But like one example here that I'll just say is like these are just MLPs.
So the model has both attention heads and multilayer perceptrons,
MLPs.
We don't just do it.
Like we completely ignore attention or like we don't we don't try to decompose it at all.
So there's some problems where like all of the interesting stuff is attention. And here, you're just not, you're just not seeing it at all.
The way that it's materialized is like, you have an edge from here to here. And like,
some attention head did a bunch of stuff. You don't know what it is. And so that's also the part
that we're sort of like not explaining. So there's definitely, yeah, I don't want to make the claim
that we explain everything. I think the correct way to think about this is like, if you look at a
prompt and you can, by tracing through these, not hit any errors, hit nodes that make sense,
and build up a reasonable hypothesis,
and then when you test it with interventions,
it works.
You've at least understood some
and presumably like a reasonable proportion of the computation.
If your interventions are working,
that means that it's like the thing you found is not just like a side thing.
It's part of the main thing that the model is doing.
And so, you know,
then the question is like,
how often does it happen versus how often you just hit these errors
or you're like confused?
And I think that's just sort of like,
what works and what doesn't summary here.
I see.
I mean, congrats on this work.
I know you're low on sleep because you worked really hard on shipping it and you're a perfectionist.
I just think like, yeah.
Sorry, I'll just say that the actual brunt of the work here was done by the
Anthropic Fellows.
Yeah, yeah.
Yeah.
You know, I mostly just coordinate things left and right, but they sort of did all of the implementation, and, you know, folks on the Neuronpedia and Decode Research side also did the lion's share of the work here to actually have the front-end UI.
I'll just say, you know, Vibhu and I were at the Goodfire meetup yesterday,
where there are a lot of interpretability folks. I was shocked at honestly how young most interpretability people and work are.
And this is a very young field, exactly like you say in the podcast.
There's a lot of fresh green grass here to tread. And it's just really inspiring.
Do you have any other final thoughts or comments?
Yeah, no, I think there's just a lot of open
work to be done, you know. And we talk about this in the podcast
too. And just to reiterate how
good the tooling that you guys put out is, like even the fact that, you know, without diving into
any code, you can enter a prompt and start to play through these circuits in like minutes.
It's pretty incredible.
Like, I could share another one, actually.
So I was doing this with Pomsky and I finally got it to work.
So our guest host of the episode is Mochi, my little dog.
She's a distilled Husky.
So she's on the podcast later.
And, you know, I basically put in, like, I had to guide it quite a bit.
But my prompt is: a Pomsky is a small dog that's a breed
of a husky and a...
And then, you know, I'm expecting it to put out
Pomeranian or Pom. Let me
share my screen real quick.
And then we can kind of dig through. This is me
like, by the way, her tagline,
yeah, while you put it up, her tagline is officially
Mochi, the interpretability Husky.
For today.
We're going to change our tagline every episode, but yeah.
It feels a little weird, you know, we're digging
deep into what Mochi is. But basically
this is me, like no background, like two
minutes in, just put in a phrase and now I get to
play around with features, right? So
this is also called Please with four S's because, you know, I tried a few prompts.
It's okay.
It's okay.
We struggle.
It only took a few minutes, though.
So, you know, Pomsky is a small dog breed that's a mix of a husky and a.
And then the most probable outputs, you know, now it says Pom.
So, okay, let's dig into what some of these are.
I'm basically just going like, fresh, haven't done this before.
But, you know, words related to animals, their emotions, their health, we have a feature for dog, golden lab.
Mentions of dog breeds, especially high maintenance.
You know, this is basically like AGI.
It knows Pomskys are high maintenance.
It's figured it out.
But realistically, you know, as I dig through these features,
I can start to pin them, layer them through.
Mentions of garbage and waste, no, that's not nice.
That's not nice.
But basically, you know, and this is already me pruning out most of the features.
As I open it up, you know, it talks about different things like dog breeding.
What else?
related to animal welfare.
So like, and then you can dig through all this.
There's just so many things that, like, you know,
this is in a matter of minutes.
I basically made a graph, put in a sentence,
and now I have an output,
and I can traverse through what are different things.
Okay, animal science, right?
So this breed is relatively new.
It's not that common that big huskies and little Pomeranians
naturally have offspring,
but, you know, let's, let's like dig through animal science versions of this,
and then we have, like, interesting little feature.
So it's very easy for people to kind of get a different understanding of what goes on throughout layers in models, you know?
But that's just my fun little experiment of getting it to work.
Oh, yeah. And I think like, you know, one thing that I would do if you were curious or maybe
I'm just going to try to bait some listeners into doing it is like, okay, like, let's try to
like trace why it said Pomeranian here. And like maybe there's like some of it is about like
dog breeds and some of it is about like specific characteristics of a husky. And then you can
ask the same question, but instead of husky, like, try some other dog breed and then try to see
if you can, like, if you understood the circuit well, and if you identified where it's thinking about huskies or where it's thinking about, like,
breeding two different breeds, then you should be able to swap these in and out and get it to say whatever you want.
And if you didn't, then maybe there's something complicated going on. But yeah, very cool that you got this going so quick.
That's that's the whole goal. That's super exciting.
Yeah. And like, you know, full disclosure, this was like five minutes of just playing around. And there's stuff to learn there, right? Like, okay, what happens with dog breeding? What are traits of these dogs? And then,
you know, the next step for me would basically be,
let's try clamping some of these features up or down.
Let's do different breeds and see if it makes sense, right?
So if I have Husky traits and a different mix,
and then, you know, can I get out what's going on?
But it also shows internally that there's more than just token completion of,
you know, this plus this equals this.
No, it has some understanding of characteristics, right?
Like this is a pretty stubborn dog.
It has a stubborn feature pretty high up that activates.
So very, very cool stuff.
I think it'll be cool when we apply this to more, like, serious topics.
Like, right now, when it comes to LLM evals, right?
We have, like, we have pretty straightforward evals, right?
Like, how good does it do on math?
Can it write code?
Does stuff compile?
But we don't have, like, vibes-based heuristic evals, right?
So, like, does it understand different queries should be concise?
Should they be verbose?
Can we kind of trace through how it gives responses to this stuff?
And then, like, the other part is, you know, as we go past base models, how does this
happen for different phases of models, right?
So if I have a base Gemma and I have a chat model, what are differences in their
attributions, right?
What happens kind of in that diff of training?
So that's kind of one of my little interests in mech interp.
What happens as we do more training?
What are we really changing?
Totally.
Yeah.
You can think about sort of like comparing different models.
And for me, different models means either Gemma versus some other model, or early
Gemma versus late Gemma in pre-training, or fine-tuned versus not fine-tuned.
I think there's also a sense in which like,
like somebody yesterday was telling me like,
oh, it's fun.
I've been playing with it on like the like sort of like weird riddles that the
models get wrong.
Like you're not limited to studying what the model can do, right?
Like if the model's failing at something like, you know,
counting the number of letters in strawberry or whatever,
you could just try that and try to figure out the circuit for like,
well, it's getting this wrong.
Like why?
Like maybe you can see in its representation that it's like thinking about
something obviously incorrect, right?
And so I think I think that that's also like a fun thing to play with.
I think that's it for our little intro
chat and coverage of the open sourcing.
Let's dive right into the episode next.
But Emmanuel, amazing work, and I'm so inspired.
And also just like, I think this puts a human face on the interpretability work.
I think it's very important.
And we'd love to keep doing this, whatever you got next coming up.
Well, yeah, thanks for having me.
Again, I should say, cool to put a face on it, but definitely want to call out this.
This is like a huge team of people with me.
I'm just a talking head here.
And paper lead, you know, you did the work.
You know, take credit.
I think that like, yeah, happy to talk about more interesting things.
And also, like, feel free to, you know, reach out to me.
I'm like, findable if you're listening to this podcast and you have like questions about stuff that's broken or if this brings up like experiment ideas.
Definitely want more people playing with this.
So, yeah, thanks for having me.
Hope that inspires the folks.
All right.
We are back in the studio with a couple of special guests.
One, Vibhu, our guest co-host for a couple of times now, as well as, you know,
Mochi, the Distilled Husky, is in the studio with us.
You'll ask some very pressing questions.
As well as Emmanuel, I didn't get your last name.
Ameisen?
Yep.
Is that Dutch?
Is that?
It's actually German.
German?
Yeah.
You are the lead author of a fair number of the recent mech interp work from Anthropic
that I've been basically calling Transformer Circuits,
because that's the name of the publication.
Yeah.
Well, to be clear, Transformer Circuits is the whole publication.
I'm the author on one of the recent papers, circuit tracing.
Yes.
And people are very excited about that.
The other name for it is tracing the thoughts of LLMs.
There's like three different names for this work.
It's all mech interp.
It's all mech interp.
There's two papers.
One is circuit tracing.
It's the methods.
One is like the biology,
which is kind of what we found in the model.
And then tracing the thoughts is confusingly just the name of the blog post.
Yeah.
It's for different audiences.
And I think the two-minute polished video that you guys produced,
that's meant for a very wide audience, you know.
Yeah, that's right.
There's sort of like very many levels of granularity at which you can go.
And I think for mech interp in particular, because it's kind of complicated, going from top to bottom, from the most high level to sort of the gnarly details, works pretty well.
Yeah.
Cool.
We can get started.
Basically, we have two paths that you can choose: either your personal journey into mech interp or the brief history of mech interp just generally.
And maybe that might coincide a little bit.
I can just give you my personal journey very quickly because then we can just do the second path.
My personal journey is that I was working at Anthropic for a while.
I'd been, like many people, just following mech interp as sort of an interesting field with fascinating, often beautiful papers.
And I was at the time working on fine-tuning, so like actually fine-tuning production models for Anthropic.
And eventually my fascination reached a sufficient level that I decided I wanted to work on it.
And also, just as our models got better and better, I got more excited about understanding
how they worked. So that's the simple journey. I've got like a background in ML, kind of like
did a lot of applied ML stuff before, and now I'm doing more research stuff. Yeah, you have a book
with O'Reilly. You were head of AI at Insight Data Science. Anything else to plug? Yeah. I actually,
I want to like plug the paper and unplug the book. Okay. I think the book is good. I think the advice
stands the test of time, but it's very much like, hey, you're building AI products, what
should you focus on. It's very different, I guess is all I'll say, from the stuff that we're
going to talk about today. Today is like research, some of, some of the sort of like deepest,
weirdest things about how like models work. And this book is, you want to ship a random
forest to do fraud classification. Like, here are the top five mistakes to avoid.
Yeah. The good old days of ML. I know. It was simple back then.
You also transitioned into research. And I think you also did that management. Like, I feel like
there's this monolith of like people assume you need a PhD for research.
Maybe can you give that perspective of like how do people get into research?
How do you get into research?
Maybe that gives the audience insight into Vibhu as well.
Your background.
Yeah, my background was in like economics, data science.
I thought LLMs were pretty interesting.
I started out with some basic ML stuff.
And then I saw LLMs were starting to be a thing.
So I just went out there and did it.
And same thing with AI engineering, right?
You just kind of build stuff.
You work on interesting things.
And like now it's more accessible than ever.
Like back when I got into the field five, six years ago,
like pre-training was still pretty new.
GPT-3 hadn't really launched.
So it was still very early days and it was a lot less competitive.
But yeah, without any specific background, no PhD,
there just weren't as many people working on it.
But you made the transition a little bit more recently, right?
So what's your experience been like?
Yeah, I think it has maybe never been easier in some ways
because a lot of the field is like pretty empirical right now.
So I think the bitter lesson is this lesson that, you know, a lot of times you can just scale up compute and data and get better results than if you thought extremely hard about a really good prior, inspired by the human brain, to train your model better.
And so definitely in terms of research for pre-training and fine-tuning, I think a lot of the bottlenecks are extremely good engineering and systems engineering.
And a lot even of the research execution is just about sort of like
engineering and scaling up and things like that. I think for interp in particular, there's like another
thing that makes it easier to transition to, which is maybe two things. One, you can just do it
without huge access to compute. Like there are open source models. You can look at them. A lot of
interp papers, you know, coming out of programs like MATS, are on models that are open source that you can
sort of like dissect without having a cluster of like, you know, 100 GPUs. You can just even,
sometimes, load them on your CPU,
on your MacBook. And it's also a relatively new field. And so, you know, there's, as I'm sure we'll
talk about, there's like some conceptual burdens and concepts that you just want to like understand
before you contribute, but it's not, you know, physics. It's relatively recent. And so the number of
abstractions that you have to like ramp up on is just not that high compared to other fields,
which I think makes that transition somewhat easier for interp. If you understand, we'll talk about all
these, I'm sure, but like what features are and what dictionary learning is, you're a long
part of the way there. I think it's also interesting, just on the careers point of view,
research seems a lot more valuable than engineering. So I wonder, and you don't have to answer
this if it's like a tricky thing, but like how hard is it for a research engineer in
Anthropic to jump the wall into research? People seem to move around a lot. And I'm like,
that cannot be so easy. Like in no other industry that I know of, people, you can do that.
Do you know what I mean?
Yeah.
I think I'd push back on the sort of like research being more valuable than engineering a little bit because I think a lot of times like having the research idea is not the hardest part.
Don't get me wrong, there are some ideas that are brilliant and hard to find.
But what's hard, certainly on fine-tuning and to a certain extent on interp, is executing on your research idea: making an experiment successfully, having your experiment run,
interpreting it correctly. What that means, though, is that they're not separate skill sets. So if you have a cool idea, there's not many people in the world, I think, who can just have a cool idea and then have, you know, a little minion they'll delegate to: here's my idea, go off for three months and build this model and train it for, you know, hundreds of hours and report back on what happened. A lot of the time, the people that are the most productive, they have an idea, but they're also extremely quick at checking their idea, finding sort of the shortest path to checking their idea. And a lot of
that shortest path is engineering skills, essentially.
It's just like getting stuff done.
And so I think that's why you see sort of like people move around
is like proportionate to your interest.
If you're just able to quickly execute on the ideas you have and get results,
then that's really the 90% of the value.
And so you see a lot of transferable skills, actually, I think,
from people, like, I've certainly seen people at Anthropic that are just really good
at that inner loop.
They can apply it in one team and then move to a completely different domain
and apply that inner loop just as well.
Yeah, very correct, as the kids say.
Shall we move to the history of mech interp?
Yeah.
All I know is that everyone starts at Chris Olah's blog.
Is that right?
Yeah, I think that's the correct answer.
Chris Olah's blog and then, you know, distill.pub
is the sort of natural next step.
And then I would say, you know, now for Anthropic,
there's Transformer Circuits, which you talked about.
But there's also just a lot of mech interp research out
there from, you know, I think like, yeah, MATS is a group that regularly has a
lot of research, but there's just many different labs that put research out there.
And I think, also just to hammer home the point,
That's because all you need is like a model and then a willingness to kind of investigate it to
be able to contribute to it.
So now there's been a bit of a Cambrian explosion of mech interp, which is cool.
I guess the history of it is just that computational models that are not decision trees,
models that are either CNNs or, let's say, transformers,
have just this really, like, strange property that they don't give you
interpretable intermediate states by default.
You know, again, to go back to if you were training, like,
a decision tree on, like, fraud data for an old school, like, bank or something,
then you can just look at your decision tree and be like, oh, it's learned that, like,
if you make, I don't know, if this transaction is more than $10,000
and it's for, like, perfume, then maybe it's fraud or something.
You can look at it and say, like, cool, like that makes sense.
I'm willing to ship that model.
But for things like CNNs and like Transformers, we don't have that, right?
What we have at the end of training is just a massive amount of weights that are connected somehow,
or activations are connected by some weights.
And who knows what these weights mean or what the intermediate activations mean?
And so the quest is to understand that.
Initially, a lot of it was done on vision models,
where you sort of have the emergence of a lot of these ideas, like what are features,
what are circuits, and then more recently it's been mostly, or not most, yeah, mostly applied to
NLP models, but also, you know, still there's work in vision and there's work in like,
bio and other domains. Yeah, I'm on Chris Olah's blog and he has like the feature visualization stuff.
I think for me the clearest was like the vision work where you could have like this layer
detects edges, this layer detects textures, whatever. That seemed very clear to me, but the transition
to language models seemed like a big leap. I think one of the
bigger changes from vision to
language models has to do with
the superposition hypothesis,
which maybe is like...
That's the toy models post, right?
Exactly. And this is sort of like,
it turns out that if you look at just the neurons
of a lot of vision models,
you can see neurons that are curve detectors
or that are edge detectors
or that are high-low frequency detectors.
And so you can sort of like make sense
of the neurons mostly.
But if you look at neurons in language
models, most of them don't make sense. It's kind of like unclear or it was unclear why that would be.
And one main like hypothesis here is the superposition hypothesis. So what does that mean? That
means that like language models pack a lot more in less space than vision models. So maybe like a kind of like really hand wavy analogy, right?
It's like, well, if you want curve detectors, you don't need that many curve detectors. You know, if
each curve detector is going to detect like a quarter or a twelfth of a circle,
like, okay, well, you have all your curve detectors.
But think about all of the concepts that Claude or even GPT-2 needs to know.
Like, it needs to know about all of the different colors, all the different hours of every day, all of the different cities in the world, all of the different streets in every city.
If you just enumerate all of the facts that a model knows, you're going to get a very, very long list.
And that list is going to be way bigger than the number
of neurons or even the size of the residual stream,
which is where the models process information.
And so there's a sense in which like,
oh, there's more information than there's like dimensions to represent it.
And that is much more true for language models than for vision models.
And so because of that, when you look at a part of it,
it just seems like it's got all this stuff crammed into it.
Whereas if you look at the vision models, oftentimes you could just like,
cool, this is a curve detector.
Yeah.
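The counting argument above (more concepts than dimensions) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the choice of 5 features in 2 dimensions and the evenly spaced directions are picked for the example, not taken from any particular model.

```python
import math

N_FEATURES = 5  # hypothetical number of concepts, chosen for illustration


def dot(u, v):
    """Dot product of two 2-D vectors."""
    return u[0] * v[0] + u[1] * v[1]


# Five unit-length feature directions squeezed into a 2-D space,
# evenly spaced around a circle.
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]

# Store a single active feature (index 2) with strength 1.0.
stored = directions[2]

# Read every feature back by projecting the stored vector onto its direction.
readout = [dot(stored, d) for d in directions]
best = max(range(N_FEATURES), key=lambda i: readout[i])

# Largest overlap between two *different* directions: the price of packing
# more features than dimensions is that features interfere with each other.
max_interference = max(
    dot(directions[i], directions[j])
    for i in range(N_FEATURES)
    for j in range(N_FEATURES)
    if i != j
)

print(best, round(max_interference, 3))  # prints: 2 0.309
```

The stored feature is still the strongest readout, but its neighbors are not silent, which is one way to see why a single neuron in such a space does not correspond to a single clean concept.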
Vibhu, you have some fun ways of explaining the toy models of superposition
concept. Yeah, I mean, basically, like, you know, if you have two neurons and they can represent
five features, like a lot of the early mech interp work says that, you know, there are more features than
we have neurons, right? So I guess my kind of question on this is, for those interested in getting
into the field, what are like the key terms that they should know? What are like the few pieces that
they should follow, right? Like from the Anthropic side, we had a toy transformer model. We had sparse,
we first had autoencoders. That was the second paper, right? Monosemanticity.
Yeah. What is sparsity in autoencoders? What are transcoders? Like, what is linear probing? What are these kind of like key points that we had in mech interp? And just kind of how would people get a quick, you know, zero to like 80% of the field?
Okay, so zero to 80%. And now I realize I really set myself up for failure because I was like, yeah, it's easy. There's not that much to know. So, okay, so then we should be able to cover it all. So superposition is the first thing you should know, right? This idea that like there's a bunch of stuff crammed in a few dimensions. As you said, maybe you have like two neurons
and they want to represent five things.
So if that's true, and if you want to understand how the model represents, you know, I don't know, the concept of red, let's say, then you need some way to like find out essentially in which direction the model stores it.
So after the sort of like superposition hypothesis, you can think of like, ah, we also think that like basically the model represents these like individual concepts.
We're going to call them features as like directions.
So if you have two neurons, you can think of it as like the 2D plane, and you can have like five directions,
and maybe you would arrange them like the spokes of a wheel so they're sort of maximally
separate. It could mean that you have one concept that's this way and one concept that's
not fully perpendicular to it but pretty far from it. And then that would
allow the model to represent more concepts than it has dimensions. And so if that's true, then what you
want is you want like a model that can extract these independent concepts. And ideally you want to
do this like automatically. Like can we just, you know, have a model that tells us like, oh,
like this direction is red. If you go that way, actually, it's like, I don't know, chicken.
And if you go that way, it's like the declaration of independence, you know. And so that's
what sparse autoencoders are. It's almost like the self-supervised learning insight version.
Like in pre-training, it's self-supervised learning. Yeah. And here and now it's self-supervised
interpretability. Yeah, exactly. Exactly. It's like an unsupervised method. Yeah. And so unsupervised methods
often still have like labels in the end.
So sometimes I feel like the term "unsupervised" is...
Yeah.
Like for pre-training, right?
It's like the next token.
So in that sense, you have a supervision signal.
And here the supervision signal is simply you take the like neurons and then you learn
a model that's going to like expand them into like the actual number of concepts that
you think there are in the model.
So you have two neurons.
You think there's five concepts.
So you expand it to like a thing of dimension five.
And then you contract it back to what it was.
that's like the model you're training,
and then you're training it to incentivize it to be sparse,
so that there's only like a few features active at a time.
And once you do that, if it works,
you have this sort of nice dictionary,
which you can think of as a way to decode the activations of the neurons,
where you're saying like,
ah, cool, I don't know what this direction means,
but I've used my model
to tell me that the model is writing in the red direction.
And so that's sort of like, I think maybe the biggest thing to understand
is this combination of things.
So we're like, ah, we have too few dimensions.
We pack a lot into it.
So we're going to learn an unsupervised way to like unpack it
and then analyze what each of those dimensions that we've unpacked are.
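The expand-then-contract recipe described above can be sketched as a forward pass plus the training objective. This is a minimal sketch under loud assumptions: the dictionary directions, the encoder bias, and the sparsity coefficient are hand-picked for illustration rather than learned, and a real sparse autoencoder would fit all of them by gradient descent on recorded activations.

```python
import math

N_NEURONS, N_FEATURES = 2, 5
L1_COEFF = 0.1       # hypothetical weight on the sparsity penalty
ENCODER_BIAS = -0.4  # hypothetical: pushes small overlaps below zero so ReLU silences them

# Hand-picked dictionary: five unit directions in the 2-D activation space.
# (Assumption for illustration; a trained model would learn these.)
DICTIONARY = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(x):
    """Expand 2 neuron activations into 5 non-negative feature activations (ReLU)."""
    return [max(0.0, d[0] * x[0] + d[1] * x[1] + ENCODER_BIAS) for d in DICTIONARY]


def decode(features):
    """Contract 5 feature activations back into 2 neuron activations."""
    return [sum(f * d[i] for f, d in zip(features, DICTIONARY)) for i in range(N_NEURONS)]


def sae_loss(x):
    """Reconstruction error plus an L1 penalty that rewards few active features."""
    features = encode(x)
    x_hat = decode(features)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return reconstruction + L1_COEFF * sparsity, features


# An activation that lies exactly along dictionary direction 3.
x = DICTIONARY[3]
loss, features = sae_loss(x)
active = [i for i, f in enumerate(features) if f > 0]
print(active)  # prints: [3]
```

Only one of the five features fires for this input, which is the sparse, readable decomposition the training objective is meant to encourage. The reconstruction here is imperfect because the hand-set bias shrinks the active feature; training would adjust the weights to trade that error off against sparsity.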
Any follow-ups?
Yeah, I mean, the follow-ups of this are also kind of like some of the work that you did is in clamping, right?
What is the applicable side of mech interp, right?
So we saw that you guys have like great visualizations.
Golden Gate Claude was a cool example.
I was going to say that.
Yeah, it was my favorite.
What can we do once we find these features?
Finding features is cool, but what can we do about it?
Yeah.
I think there's kind of like two big aspects of this.
One is, yeah, okay, so we go from a state where, as I said, the model is like a mess of weights,
we have no idea what's going on to, okay, we found features, we found a feature for red,
a feature for Golden Gate Claude, or for the Golden Gate Bridge, I should say.
Like, what do we do with them?
And, well, if these are true features, that means that, like, they in some sense are important
for the model or it wouldn't be like representing it.
Like if the model is like bothering to like write, you know, in the Golden Gate Bridge direction,
it's usually because it's going to like talk about the Golden Gate Bridge.
And so that means that if that's true, then you can set that feature to zero or
artificially set it to 100 and you'll change model behavior.
That's what we did when we did Golden Gate Claude, in which we found a feature that represents a
direction for the Golden Gate Bridge.
And then we just set it to always be on.
And then you could talk to Claude and be like, hey, Claude, what's on your mind?
You know, like, what are you thinking about today?
It'd be like, the Golden Gate Bridge.
You'd be like, hey, Claude, what's 2 plus 2?
It'd be like, four Golden Gate Bridges, et cetera, right?
And it was always thinking about the Golden Gate Bridge.
It's like, write a poem, and it starts talking about how it's red like the Golden Gate
Bridge.
That's right.
Golden Gate Bridge, yeah.
That's amazing.
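The "set that feature to zero or to 100" idea can be sketched as an operation on an activation vector. This is a hedged illustration, not the procedure used for Golden Gate Claude: the three-dimensional activation, the feature direction, and the helper name `clamp_feature` are all hypothetical, and the sketch assumes the feature is a single unit-length direction.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def clamp_feature(activation, direction, target):
    """Return a copy of `activation` whose component along the unit-norm
    `direction` is set to `target`, leaving everything orthogonal untouched."""
    current = dot(activation, direction)
    return [a + (target - current) * d for a, d in zip(activation, direction)]


# Hypothetical unit-length feature direction and activation vector.
direction = [0.6, 0.8, 0.0]
activation = [1.0, 2.0, 3.0]

ablated = clamp_feature(activation, direction, 0.0)    # feature switched off
boosted = clamp_feature(activation, direction, 100.0)  # feature forced on

print(round(dot(ablated, direction), 6), round(dot(boosted, direction), 6))
```

In a real model the edited activation would be written back into the forward pass (for example at one layer of the residual stream) so that every later layer sees the clamped feature; that plumbing depends on the framework and is left out here.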
I think what made it even better is like we realized later on that it wasn't really like
a Golden Gate Bridge feature.
It was like being in awe at the beauty of the majestic Golden Gate Bridge, right?
So it would really ham it up.
It'd be like, oh, I'm just thinking about the beautiful International Orange color of the Golden Gate Bridge.
That was just an example that I think was really striking, but of sort of like, oh, if you found a space that represents some computation or some representation of the model, that means that you can artificially suppress or promote it. And that means that you're starting to understand at a very high level, a very gross level, how some of the model works. Right. We've gone from like, I don't know anything about it, to like, oh, I know that this combination of neurons is this and I'm going to prove it to you. The next step, which is what this
work is on, is like, that's kind of like thinking of if maybe you take the analogy of like,
I don't know, like, like, let's take the analogy of like an MRI or something like a brain scan.
It tells you like, oh, like this, as Claude was answering, at some point it thought about this thing.
But it's sort of like vague, like basically, maybe it's like a bag of words, kind of like a bag of features.
You just like, here are all the random things it thought about.
But what you might want to know is like, okay, but Claude is doing some processing.
Like sometimes to get to the Golden Gate Bridge, it had to realize that you were talking about San Francisco.
and about like the best way to go to Sonoma or something.
And so that's how it got to Golden Gate Bridge.
So there's like an algorithm that leads to it at some point thinking about the
Golden Gate Bridge.
And basically there's a way to connect features to say like, oh, from this input,
it went to these few features and these few features and these few features and that one influenced
this one.
And then you got to the output.
And so that's the second part.
And the part we worked on is like you have the features.
Now connect them in what we call or what's called circuits,
which is sort of like explaining the like algorithm.
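The idea of connecting features into a circuit can be sketched with a toy graph. Everything in this example is an assumption made for illustration: the feature names, the weights, and the purely linear two-layer setup. The attribution graphs in the paper are built on a replacement model of a real transformer and are far more involved; this only shows the bookkeeping of "which feature influenced which".

```python
from collections import defaultdict


def propagate(source_acts, weights):
    """For one linear layer, return (edge attributions, target activations).

    An edge's attribution is source activation * weight, so the incoming
    edges of a target sum exactly to that target's activation.
    """
    edges = {}
    targets = defaultdict(float)
    for (src, dst), w in weights.items():
        contribution = source_acts.get(src, 0.0) * w
        edges[(src, dst)] = contribution
        targets[dst] += contribution
    return edges, dict(targets)


# Hypothetical input features and weights.
input_acts = {"mentions San Francisco": 1.0, "asks for directions": 0.5}
input_to_mid = {
    ("mentions San Francisco", "Bay Area landmarks"): 2.0,
    ("asks for directions", "route planning"): 1.0,
}
mid_to_output = {
    ("Bay Area landmarks", "say Golden Gate Bridge"): 1.5,
    ("route planning", "say Golden Gate Bridge"): 1.0,
}

edges_1, mid_acts = propagate(input_acts, input_to_mid)
edges_2, out_acts = propagate(mid_acts, mid_to_output)

# The "circuit" is the set of weighted edges from input to output.
graph = {**edges_1, **edges_2}
strongest = max(graph, key=lambda edge: abs(graph[edge]))

for (src, dst), value in sorted(graph.items(), key=lambda kv: -abs(kv[1])):
    print(f"{src} -> {dst}: {value:.2f}")
```

Reading the sorted edges from strongest to weakest gives a crude version of the story told above: the input led to a few intermediate features, and those in turn pushed the output.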
Yeah, before we move directly on to your work, I just want to give a shout out to Neel Nanda.
He did Neuronpedia and released a bunch of SAEs for, I think, the Llama models and the Gemma models.
So I actually made Golden Gate Gemma.
Just up the weights for proper nouns and names of places for people and references to the term golden,
likely relating to awards, honors, or special names, and that together made Golden Gate.
That's amazing.
Yeah.
So you can make Golden Gate Gemma.
And I think that's a fun way to experiment with this.
But yeah, we can move on to...
I'm curious.
I'm curious.
What's the background behind why you shipped Golden Gate Claude?
You had so many features.
Just any fun story behind why that's the one that made it?
You know, it's funny.
If you look at the paper, there's just a bunch of really interesting features, right?
Like, one of my favorite ones was the sycophantic praise, which I guess is very topical right now.
Very topical.
But, you know, it's like you could dial that up and Claude would just really
praise you. You'd be like, oh, you know, I wrote this poem, like, roses are red, violets are
blue, whatever. And it'd be like, that's the best poem I've ever seen. And so we could have shipped
that. That could have been funny. Golden Gate Claude was, as far as I remember, at least,
a pure, just weird, random thing where somebody found it initially with an
internal demo of it. Everybody thought it was hilarious. And then that's sort of how it came about.
There was no, nobody had a list of top 10 features we should consider shipping, and we picked that one.
There was just kind of like a very organic moment.
No, like the marketing team really leaned into it.
Like they mailed out pieces of the Golden Gate for people at NeurIPS, I think, or ICML.
Yeah, it was fantastic marketing.
Yeah.
The question obviously is, if OpenAI had invested more in interpretability, would they have caught the GPT-4o update?
But we don't know that for sure because they have interp teams.
They just.
Yeah.
I think also for that one, I don't know
that you need interp.
Like it was pretty clear cut.
I was like, oh, that model is really gassing me up.
And then the other thing is, can you just up "write good code", "don't write bad code",
and make Sonnet 3.5?
And like, it feels too, too easy, too free.
Is that steering that powerful that you can just like up and down features with no tradeoffs?
There was like a phase where people were basically saying, you know, 3.5 and 3.7 are just that,
because they came out right after.
And for the record, like, that's been debunked.
Yeah, it has been debunked.
But, you know, it had people convinced that what people did is they basically just steered up and
steered down features and now we have a better model.
And this kind of goes back to that original question of, right?
Like, why do we do this?
What can we do?
Some people are like, I want tracing from a sense of, you know, legality.
Like, what did the model think when it came to this output?
Some people want to turn hallucination down.
Some people want to turn coding up.
So, like, whether it's internal, what are you exploring,
what are the applications of this?
Whether it's open-ended of what people can do about
this, or just, yeah, why do mech interp, you know?
Yeah, there's a few things here. So
first of all, obviously this is, I would say, on the scale of the most short term to the most long term,
pretty long-term research. So in terms of applications, compared to, you know, the research work
we do on fine-tuning or whatever, interp is much more, you know, sort of a high-risk,
high-reward kind of approach. With that being said, I think there's just a fundamental sense in
which Michael Nielsen had a post recently about how like knowledge is dual use or something.
But just knowing how the model works at all feels useful. And you know, it's hard to
argue that if we know how the model works and understand all the components, that won't help us
make models that hallucinate less, for example, or that are less biased. That seems, you know,
at the limit, yeah, that totally seems like something you would do, using basically your
understanding of the model to improve it. I think for now, as we can talk about a little bit with
circuits, we're still pretty early on in the game, right? And so right now,
the main way that we're using interpretability is to investigate specific behaviors and understand them
and gain a sense for, uh, what's causing them. So like one example that we, we can talk about
later, we can talk about now, but in the paper, we investigate jailbreaks and we try to see,
why does a jailbreak work? And then we realize, as we're looking at this jailbreak, that part of the
reason why Claude is telling you how to make a bomb in this case is that it's already
started to tell you how to make a bomb and it would really love to stop telling you how to make a bomb,
but it has to first finish its sentence. Like it really wants to make correct grammatical sentences.
And so it turns out that seeing that circuit, we were like, ah, then does that mean
if I prevent it from finishing its sentence, the jailbreak works even better? And sure enough,
it does. And so I think like the level of sort of practical application right now is of that shape.
So like understanding either like quirks of a current model or like how it does tasks that maybe we don't we don't even know how it does it.
Like, you know, we have like some planning examples where we had no idea it was planning and we're like, oh, God, it is.
That's sort of like the current state we're at.
I'm curious internally how this kind of feeds back into like the research, the architecture, the pre-training teams, the post-training.
Like, is there a good feedback loop there?
Like right now there's a lot of external people interested, right?
Like, we'll train an SAE on one layer of Llama and probe around,
but then people are like, oh yeah, how does this have much impact?
People like clamping, but yeah, as you said, you know,
once you start to understand these models have this early planning and stuff,
how does this kind of feed back?
I don't know that there's much to say here other than like,
I think we're definitely interested in conversely,
like making models for which it's like easier to interpret them.
So that's also something that you can imagine sort of like working on,
which is like making models where you have to work less hard to try to understand what the architecture.
Okay.
Yeah, so I think there was a LessWrong post about this, that there's a non-zero amount of sacrifice you should make in current capabilities in order to actually make them more interpretable, because otherwise they will never catch up.
You know, there's this sort of sense in which like right now we take the model and then the model's the model and then we post hoc do these replacement layers to try to understand it.
But of course when we do that, we don't like fully capture everything that's happening inside the model.
we're capturing like a subset.
And so maybe some of it is like
you could train a model that's sort of easier to interpret natively.
And it's possible that like you don't even have that much of, you know,
like a tax in that sense.
And you can just sort of like either like train your model differently
or do like a little post hoc step to like sort of like untangle some of the mess that
you've made when you trained your model, right?
Make it easier to interpret.
The hope was pruning would do some of that.
But I feel like that area of research has just died.
What kind of pruning are you thinking of here?
Just pruning your network.
Ah, yeah.
Pruning layers, pruning connections, whatever.
Yeah, I feel like maybe this is something where like superposition makes me less hopeful or something.
Because you don't know.
Like that, that like seventh bit might hold something.
Well, right.
And it's like on each example, maybe this neuron is like at the bottom of like what matters.
But actually it's contributing like 5% to understanding English, doing integrals, and, you know, whatever, cracking codes or something.
And because that representation is just distributed over it, when you naively prune, you might miss that.
I don't know.
Okay.
So then this area of research in terms of creating models that are easier to interpret from the start.
Is there a name for this field of research?
I don't think so.
And I think it's like very early.
And it's mostly like a dream.
Just in case there's a thing people want to double click on.
Yeah.
I haven't come across it.
I think the higher level is like Dario recently put out a post about this, right?
Why mech interp is so important:
you know, we don't want to fall behind.
We want to be able to interpret models and understand what's going on.
Even though capabilities are getting so good, it kind of ties into this topic, right?
Like, we want models to be slightly easier to interpret so we don't fall behind so far.
Well, yeah, and I think here, like, just to talk about the elephant in the room or something,
like, one big concern here is like safety, right?
And so, like, as models get better, they are going to be used more and more places.
You know, it's like you're not going to have your, you know, we're vibe coding right now.
Maybe at some point, well, that'll just be coding.
It's like, Claude's going to write your code for you, and that's it.
And Claude's going to review the code that Claude wrote, and then Claude's going to deploy
to production.
And at some point, like, as these models get integrated deeper and deeper into more and more
workflows, it gets just scarier and scarier to know nothing about them.
And so you kind of want your ability to understand the model to scale with, like, how good
the model is doing, which that itself kind of like tends to scale with, like, how widely
deployed it is.
So as we, like, deploy them everywhere, we want to, like, understand.
them better. The version that I liked from the old super alignment team was weak to strong generalization
or weak to strong alignment, which that's what super alignment to me was. And that was my first aha moment
of like, oh yeah, at some point these things will be smarter than us. In many ways,
they already are smarter than us. And we rely on them more and more. We need to figure out how to
control them. And this is not like an Eliezer Yudkowsky, like, "ah" thing. It's just more like,
we don't know how these things work. Like, how can we use them? Yeah. And like, you can think of
it as there's many ways to solve a problem. And some of them, if the model is solving it in like a
dumb way, or it memorized one approach to do it, then you shouldn't deploy it to do like a general
thing. Like it like, like you could look at how it does math and based on your understanding of how
it does math, you're like, okay, I feel comfortable using this as a calculator or like, no,
it should always use a calculator tool because it's doing math in a stupid way and extend that to any
behavior, right, where it's just a matter of like, think about it if like you're like in the
1500s and I give you a car or something and I'm just like, cool. Like this thing, when you press on
this, like, it accelerates when you press on that, like it stops. You know, this steering wheel
seems to be doing stuff, but you knew nothing about it. I don't know if it was like a super faulty
car and it's like, oh yeah, but if you ever went above 60 miles an hour, like it explodes or something,
like you probably would be sort of like you'd want to understand the nature of the object before
like jumping in it. And so that's why we like understand how cars work very well because we make
them. LLMs, or sort of like ML models in general, are like this very rare
artifact where we like make them, but we have no idea how they work. We evolve them. We create conditions
for them to evolve and then they evolve and we're like, cool, like, you know, maybe you got a good
run, maybe you didn't. Yeah. Don't really know. Yeah. The extent to which you know how it works is you
have you like eval and you're like, oh, well, seems to be doing well on this eval and then you're like,
is it because this was in a training set or is it like actually generalizing? I don't know.
My favorite example was somehow C4, the Colossal Clean Crawled Corpus, did much better than Common
Crawl, even though it filtered out most of this, like, it was very prudish. So it like filters out
anything that could be considered obscene, including the word gay. But like somehow it just like,
when you add it into the data mix, it just does super well. And it's just like this magic incantation
of like this recipe works. Just trust us. We've tried everything. This one works. So just go
with it. Yeah. It's not very satisfying. No, it's not. The side that you're talking about,
which is like, okay, like, like, how do you make these? And it's kind of unsatisfying that you just kind of
make the soup and you're like, oh, well, you know, my grandpa made the soup with these ingredients.
I don't know why, but I just make the soup the way my grandpa said.
And then, like, one day, somebody added, you know, cilantro.
And since then, we've been adding cilantro for generations.
And you're like, this is kind of crazy.
That's exactly how we train models, though.
Yeah, yeah.
So I think there's, like, a part where it's like, okay, like, let's try to unpack what's
happening, you know, like the mechanisms of learning, like, how our models are.
Like, one of them, I guess, I guess we skipped over it.
But, like, one of the interp things were, like, induction heads, you know,
understanding what induction heads are, which are attention heads that allow you to look at in your
context the last time that something was mentioned and then repeat it. It's like something that
happens. It seems to happen in every model. It's like, oh, okay, that makes sense. That's how the
model is able to repeat text without dedicating too much capacity to it. Let's get it on screen
so people can see. Visuals of the work you guys put out is amazing. Oh, yeah. I highly highly
we should talk a little bit about the behind the scenes of that kind of stuff. But let's finish this
off first. Totally. But just really quickly. I don't think we should spend too long on it. I think
It's just like, if you're interested in Mech Interp, we talked about superposition,
and I think we skipped over induction heads.
And that's like, you know, kind of like a really neat, basically pattern that emerges in many,
many transformers where essentially they just learn, like, one of the things that you need to do to
predict text well is that if there's repeated texts at some point, somebody said,
Emmanuel Amiesen, and then you're like on the next line and they say Emmanuel,
very good chance.
It's the same last name.
And so one of the first things that models learn is just like, okay, I'm just going to
look at what was said before and I'm going to say the same thing.
And that's induction heads, which is like a pair of attention heads that just basically
look at the last time something was said,
look at what happened after, move that over.
And that's an example of a mechanism where it's like, cool,
now we understand that pretty well.
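A minimal sketch of the copying rule just described, written as plain Python rather than as real attention heads. The function name `induction_predict` and the token list are illustrative assumptions; a real induction head implements this softly, through attention weights, not through an exact string match.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Toy induction rule: find the most recent earlier occurrence of the
    current (last) token and return the token that followed it."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # "look at what happened after, move that over"
    return None  # the token has not been seen before, so there is nothing to copy


context = ["Emmanuel", "Amiesen", "said", "hi", ".", "Emmanuel"]
print(induction_predict(context))
```

When the current token has no earlier occurrence the function returns `None`, which is the case where a model has to fall back on something other than copying.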
There's been a lot of follow-up research on understanding better,
like, okay, like in which context do they turn on?
Like, you know, there's like different like levels of abstraction.
There's like induction heads that like literally copy the word,
and there's some like copy like the sentiment and other aspects.
But I think it's just like an example of slowly unpacking, you know,
or like peeling back the layers of the onion of like, what's going on inside this model?
Okay, this is a component.
It's doing this.
So induction heads was like the first major finding.
It was a big finding for NLP models, for sure.
I often think about the edit models.
So Claude has a fast edit mode.
I forget what it's called.
OpenAI has one as well.
And you need very good copying, every area that needs copying.
And then you need it to switch out of copy mode when you need to start generating.
Right.
And that is basically the productionized version of this.
Yeah.
Yeah.
And it turns out that, you know, you need, like, a model that's smart enough to know when it needs to get out of copy mode,
right, which is like...
It's fascinating.
It's faster, it's cheaper.
You know, as bullish as I am on canvas,
basically every AI product needs to iterate on a central artifact.
And like if it's code, if it's a piece of writing, it doesn't really matter.
But you need that copy capability that's smart enough to know when to turn it off.
That's why it's cool that induction heads are at different levels of abstraction.
Like sometimes you're editing some code.
You need to copy like the general structure.
It's like, oh, like this other function that's similar,
it first takes like, you know, I don't know, like abstract class, and then it takes like an int.
I need to like copy the general idea, but it's going to be a different abstract class and a different
int or something.
Yeah, cool.
Yeah.
So tracing?
Oh, yeah.
Should we jump to circuit tracing?
Sure.
I don't know.
If there's anything else you want to cover.
No, no, no.
We have the space for it.
Maybe, okay, I'll do like a really quick TLDR of these two recent papers.
Okay.
Insanely quick.
So we talked about these features that we detect.
And what we said is like, okay, but we'd like to connect the features,
to understand the inputs to every feature
and the outputs of every feature
and basically draw a graph.
And this is like, if I'm still sharing my screen,
the thing on the right here,
where like, that's the dream.
We want like, for a given prompt,
what were all of the things,
like all of the important things that happened in the model?
And here it's like, okay,
it took in these four tokens,
those activated these features,
these features activate these other features,
and then these features activate the other features,
and then all of these like promoted the output,
and that's the story.
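One way to picture the graph being described is as a small weighted DAG from input tokens, through features, to the output. The sketch below is an assumption-heavy illustration: the node names, the edge weights, and the choice to score a path by multiplying its edge weights are all made up for the example and are not the paper's attribution method.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

# Hypothetical edges (source, target) -> attribution weight. Numbers are invented.
EDGES: Dict[Edge, float] = {
    ("tok:capital", "feat:capital"): 0.9,
    ("tok:state", "feat:state"): 0.8,
    ("tok:Dallas", "feat:Dallas"): 0.9,
    ("feat:Dallas", "feat:Texas"): 0.7,
    ("feat:state", "feat:Texas"): 0.3,
    ("feat:capital", "feat:say-a-capital"): 0.6,
    ("feat:Texas", "out:Austin"): 0.8,
    ("feat:say-a-capital", "out:Austin"): 0.5,
}


def paths_to(target: str, edges: Dict[Edge, float]) -> List[Tuple[List[str], float]]:
    """Return every input-to-target path with the product of its edge weights,
    strongest path first. Assumes the graph has no cycles."""
    parents: Dict[str, List[Tuple[str, float]]] = {}
    for (src, dst), weight in edges.items():
        parents.setdefault(dst, []).append((src, weight))

    def walk(node: str) -> List[Tuple[List[str], float]]:
        if node not in parents:  # no incoming edges: an input-token node
            return [([node], 1.0)]
        found = []
        for src, weight in parents[node]:
            for path, score in walk(src):
                found.append((path + [node], score * weight))
        return found

    return sorted(walk(target), key=lambda item: item[1], reverse=True)


for path, score in paths_to("out:Austin", EDGES):
    print(f"{score:.3f}  " + " -> ".join(path))
```

Reading the strongest paths first gives the kind of "story" described here: which tokens activated which features, and which of those promoted the output.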
And basically we're like,
the work is to sort of use dictionary learning
and these replacement
models to provide an explanation of like sets of features that explain behavior.
So this is super abstract.
So I think immediately maybe we can like just look at one example.
I can show you one, which is this one.
The reasoning one.
Yep.
Yeah, two step reasoning.
I think this is already, this is like the introduction example, but it's already like kind of
fun.
So the question is you ask the model something that requires it to take a step of reasoning in
its head.
So you say, you know, fact, the capital of the state containing Dallas is.
So to answer that, you need one intermediate step, right?
You need to say, wait, where's Dallas?
It's in Texas.
Okay, cool.
Capital of Texas: Austin.
And this is like in one token, right?
It's going to, after is, it's going to say Austin.
And so like, in that one forward pass, the model needs to extract, to realize that you're
asking it for like the capital of a state to like look up the state for Dallas, which is
Texas, and then to say Austin.
And sure enough, this is like what we see is.
In this forward pass, there's a rich sort of like inner set of representations where there's
like it gets capital, state, and Dallas, and then, boom, it has an inner representation for
Texas and then that plus capital leads it to like say Austin I guess one of the things here is like
we can see this internal like thinking step right but a lot of what people say is like is this
just memorized fact right like I'm sure a lot of the pre-training that this model is trained on is
this sentence shows up pretty often right so this shows that no actually internally throughout
we do see that there is this middle step right it's not
just memorized.
You can prove that it generalized.
Yeah.
So that's exactly right.
And I think like you hit the nail on the head, which is like, this is what this example is about.
It's like, ah, if this was just memorized, you wouldn't need to have an intermediate step at all.
You'd just be like, well, I've seen the sentence.
Like, I know what comes back.
But here, there is an intermediate step.
And so you could say like, okay, well, maybe it just has the step, but it's memorized it anyways.
And then the way to like verify that is kind of like what we do later in the paper and for all of our examples is like, okay, we claim that this is like the Texas
representation, let's get another one and replace it. And we just change like that feature in the
middle of the model and we change it to like California. And if you change it to California, sure enough,
it says Sacramento. And so it's like, this is not just the like byproduct. Like it's memorized something
and on the side it's thinking about Texas. It's like no, no, no. This is like a step in the
reasoning. If you change that intermediate step, it changes the answer. Very, very cool work. Underappreciated.
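The logic of this test can be sketched with two lookup tables standing in for the model. This is only an analogy under stated assumptions: the dictionaries, the function name, and the `override_state` parameter are hypothetical, and a real intervention edits feature activations inside the model rather than a Python variable.

```python
from typing import Optional

# Toy "knowledge" standing in for what the model has learned.
STATE_OF_CITY = {"Dallas": "Texas", "Oakland": "California"}
CAPITAL_OF_STATE = {"Texas": "Austin", "California": "Sacramento"}

# A purely memorized answer has no intermediate state to intervene on.
MEMORIZED = {"Dallas": "Austin"}


def capital_of_state_containing(city: str, override_state: Optional[str] = None) -> str:
    """Two-hop lookup: city -> state -> capital.
    `override_state` plays the role of swapping the intermediate
    representation (for example Texas -> California)."""
    state = STATE_OF_CITY[city]          # hop 1: where is the city?
    if override_state is not None:
        state = override_state           # the intervention on the middle step
    return CAPITAL_OF_STATE[state]       # hop 2: capital of that state


print(capital_of_state_containing("Dallas"))
print(capital_of_state_containing("Dallas", override_state="California"))
```

If the answer came from something like `MEMORIZED`, there would be no middle step to swap, so the fact that changing the intermediate state changes the output is the evidence that the step is actually used.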
Yeah. Okay, sure. I have never really doubted.
I think there's a lot of people that are always criticizing LLMs as stochastic parrots.
This pretty much disproves it already.
Like, we can move on.
Yeah, I mean, I think.
I think there's a lot of examples that I will say we can go through like a few of them and like,
show an amount of depth in the intermediate states of the model that makes you think like,
oh gosh, like it's doing a lot.
I think maybe like the poems.
Well, definitely the poems.
But even for this one, I'm now, like, scrolling in this very short paper
to like medical diagnoses.
I don't even know the word count
because there's so many embedded things in there.
Yeah, it's too dangerous.
We can't look it up.
It overflows.
It's so beautiful.
Look at this.
This is like a medical example
that I think shows you,
again, this is in one forward pass.
The model's like given a bunch of symptoms
and then it's asked not like,
hey, what is the like disease that this person has?
It's asked like if you could run one more test
to determine it, what would it be?
So it's even harder, right?
It needs to take all the symptoms, then you need to have a few hypotheses about what the disease could be.
And then based on your hypotheses, say, like, well, the thing that would, like, be the right test to do is X.
And here you can see these three layers, right?
Where it's like, in one forward pass, it has a bunch of like, oh, these are symptoms.
Then it has the most likely diagnosis here, then like an alternate one.
And then based on the diagnosis, it like gives you basically a bunch of things that you could ask.
And again, we do the same experiments where you can like kill this feature here, like suppress it.
and then it asks you a question about the second,
the sort of like second option it had.
The reason I show it is like,
man,
that's like a lot of stuff going on.
Like for one forward pass, right?
It's like,
specifically if you,
if you expected it's like,
oh,
what it's going to do is it's just like seeing similar cases in the
training,
it's going to kind of like vibe and be like,
oh,
I guess like there's that word
and it's going to say something
that's related to like,
I don't know,
headache,
you know,
and I kind of like really have,
it's like,
no,
no, it's like activating many different
distributed representations,
like combining them
and sort of like doing something pretty complicated.
And so, yeah, I think it's funny because, in my opinion, that's like, yeah, like, oh, God, stochastic
parrots is not something that I think is, like, appropriate here. And I think there's just, like, a lot of
different things going on. And there's, like, pretty complex behavior. At the same time,
I think it's in the eye of the beholder. I think, like, I've talked to folks that have, like,
read this paper and have been like, oh, yeah, this is just like a bunch of kind of, like,
heuristics that are, like, mashed together, right? Like, the model is doing, like, a bunch of
kind of like, oh, if high blood pressure, then this or that. And so I think there's
there's sort of like an underlying question that's interesting, which is like, okay, now we know
a little bit of how it works. This is how it works. Now you tell me if you think that's like impressive,
if you think that like if you trust it, if you think that's sort of like something that is
sufficient to like ask it for medical questions or whatever. I think it's a way to adversarially
improve the model quality. Because once you can do this, you can reverse engineer what would be
a sequence of words that to a human makes no sense or lets you arrive at the complete opposite
conclusion, but the model still gets tripped up by. Yeah. And then you can just improve it from there.
Exactly. And this gives you a hypothesis about like, you're like specifically imagine if like one
of those was actually like the wrong symptom or something. You'd be like, oh, it's weird that
the liver condition like, you know, upweighs this other example. That doesn't make sense.
Okay, let's like fix that in particular. Exactly. You sort of have like a bit of insight into like how the
model is getting to its conclusion. And so you can see both like, is it making errors,
but also is it using the kind of reasoning that will lead it to errors? There's a thesis.
I mean, now it's very prominent with the reasoning models about model depth. So like you're
doing all this in one pass. Yeah. But maybe you don't need to because you can do more passes.
Sure. And so people want shallow models for speed, but you need model depth for this kind of thinking.
Yeah. Is there a Pareto frontier?
Is there a direct trade-off?
Yeah.
What would you prefer if you had to make a model and, like, you know, shallow versus deep?
There's a chain of thought, faithfulness example.
Before I show it, I'm just going to go back to the top here.
So when the model is sampling many tokens, if you want that to be your model, you need to be
able to trust every token it samples.
So, like, the problem with models being autoregressive is that, like, if they, like,
at some point sample a mistake, then they kind of keep going, conditioned on that mistake, right?
And so sometimes like, you need backspace tokens or whatever.
Yeah, yeah, yeah.
And error correction is like notably hard, right?
If you have like a deeper model, maybe you have like fewer COT steps, but like your,
your steps are more likely to be like robust or correct or something.
And so I think that that's one way to look at the tradeoff.
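A rough way to put numbers on that tradeoff, as a sketch only. It assumes every sampled reasoning step is correct independently with the same probability and that there is no error correction at all, which is a strong simplification; the probabilities and step counts are invented for illustration, and the conversation itself is explicit that there is no settled answer here.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps are correct, assuming each step is
    independently correct with probability p_step and mistakes are never fixed."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


# Hypothetical comparison: many slightly less reliable steps
# versus fewer, slightly more reliable steps.
many_cheap_steps = chain_success(p_step=0.98, n_steps=20)
few_robust_steps = chain_success(p_step=0.99, n_steps=5)
print(f"20 steps at 0.98 each: {many_cheap_steps:.3f}")
print(f" 5 steps at 0.99 each: {few_robust_steps:.3f}")
```

Under these assumptions the longer chain loses reliability quickly, which is the intuition behind wanting each step to be robust; a model that can backtrack and correct itself would break the independence assumption and change the picture.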
To be clear, I don't have an answer.
I don't know if I want a wide or a shallow or a deep model that like...
You definitely want shallow for inference speed.
Sure, sure, sure, sure.
But you're trading that off for something else, right?
Because you also want like a 1B model for inference speed, but that also comes at a cost, right?
Yeah, it's less smart.
There's a cool quick paper to plug that we just covered on the paper club.
It's a survey paper around when to use reasoning models versus dense models.
What's the trade-off?
I think it's the economy of the reasoning economy.
So they just go over a bunch of ways to measure this, benchmarks around when to use each.
Because, yeah, like, you know, we don't want to also like consumers are now paying the cost of this, right?
But a little, little side note.
Yeah.
For those on YouTube, we have a secondary channel called Latent Space TV where we cover that stuff.
Nice.
That's our paper club.
We covered your paper.
Cool.
Yeah, I think you brought up the like planning thing.
Maybe it's worth.
Let's do it.
Yeah.
I think this one is like, if you think about, okay.
So you're going into the chain of thought faithfulness one?
Let's skip this one.
Let's just do planning.
So if you think about like, you know, common questions you have about models, the first
one we kind of asked was like, okay, like, is it just doing this like vibe based one
shot pattern matching based on existing data, or does it have like kind of rich inner
representations?
It seems to have like these like intermediate representations that
make sense as the abstractions that you would reason through.
Okay, so that's one thing.
And there's a bunch of examples.
We talked about the medical diagnoses.
There's like the multilingual circuits is another one that I think is cool where it's like,
oh, it's sharing representations across languages.
Another thing that you'll hear people mention about language models, which is that they're like next token predictors.
Also, for a quick note, for people that won't dive into this super long blog post, I know you
highlighted like 10 to 12.
So for like a quick 15, 30 second, what do you mean by they're sharing thoughts throughout?
Just like what's a really quick high level, just for people to...
Yeah, the really quick high level is that what we find is that...
Here, I'll just like show you really quick.
Inside the model, if you look at like the inner representations for concepts,
you can ask like the same question, which I think in the paper,
the original one we ask is like "the opposite of hot is," you know, "cold."
But you can do this over a larger dataset and ask the same question in many different languages.
And then look at these representations in the middle of the model and ask yourself like,
well, when you ask it "the opposite of hot is" and "le contraire de chaud est," which is the same
sentence in French ("chaud" is hot):
Is it using the same features?
Or is it learning independently for each language?
It kind of would be bad news if it learned independently for each language, because then that
means that, like, as you're pre-training or fine-tuning, you have to relearn everything
from scratch.
So you would expect a better model to kind of, like, share some concepts between the languages
it's learning, right?
And here we do it for, like, language-languages, but I think you could argue that you'd
expect the same thing for, like, programming languages, where it's like, oh, if
you learn what an if statement is in Python,
maybe it'd be nice if you could generalize that to Java or whatever.
And here we find that basically you see exactly that.
Here we show like, if you look inside the model,
if you look at the middle of the model,
which is the middle of this plot here,
models share more features.
They share more of these representations in the middle of the model.
And bigger models share even more.
And so, like, smarter models use more shared representations than the
dumber models,
which might explain part of the reason why they're smarter.
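The "how much is shared" measurement can be sketched as a set overlap between the features active on the same prompt in two languages. Everything concrete below is an assumption for illustration: the feature names, the three coarse layer groups, and the use of Jaccard overlap as the metric are made up and are not the paper's data or exact procedure.

```python
from typing import Dict, Set


def shared_fraction(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap: features active for both prompts / features active for either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical active features for "the opposite of hot is" in English and French.
ENGLISH: Dict[str, Set[str]] = {
    "early": {"tok:the", "tok:opposite", "lang:en"},
    "middle": {"antonym", "hot", "temperature", "lang:en"},
    "late": {"say:cold", "lang:en"},
}
FRENCH: Dict[str, Set[str]] = {
    "early": {"tok:le", "tok:contraire", "lang:fr"},
    "middle": {"antonym", "hot", "temperature", "lang:fr"},
    "late": {"say:froid", "lang:fr"},
}

for layer in ("early", "middle", "late"):
    print(layer, round(shared_fraction(ENGLISH[layer], FRENCH[layer]), 2))
```

In this toy data the overlap peaks in the middle, mirroring the shape of the plot being described: language-specific features near the input and output, shared concept features in between.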
And so this was like sort of this other,
of like, oh, not only is it like having these rich representations in the middle, it like learns to not have redundant representations.
Like if you've learned the concept of heat, you don't need to learn the concept of like French heat and Japanese heat and Colombian.
Like you just, that's just the concept of heat. And you can share that among different languages.
I feel like sometimes over analyzing this becomes a bit of a problem, right? Like when we talked about with the medical example, we could look back and try to fix this in dataset.
So in language, I don't remember if it was OpenAI or Anthropic, where they basically said,
when the model switched languages and they passed it to fluent users, they said, oh, this feels like an American that's speaking this language, right? So sometimes there are nuances in a slightly different representation, right? So you don't want to over-engineer these little fixes when you do see them. But then the other side of this is like for those tail-end languages, right? For languages that models aren't good at. And for those, like, you know, when you want to kind of solve that last bit, it seems like, you know, it's pretty
plausible that we can solve this because these concepts can be shared across languages as long
as we can fill in some level of representation unless I'm wrong. No, totally. And I think like
this sort of stuff also explains, you know, language models are really good at in context learning.
Like you give them something completely new and they do a good job. It's like, well, if you give them
like a new fake language and you like in that language explain that like cold means this and hot means
that, you know, like presumably they're able to, and here this is speculation, we don't show it
in the paper, but they're able to bind it.
Google's done this.
Okay. Great.
Yeah, they took a low resource language, dumped it in a million token context, and then it came
up.
That's right.
That's right.
Well, I guess the thing that I'd be curious to see is like, okay, does it use, does it
reuse these representations?
I bet that it probably does, right?
And that's probably like a reason why it works well is like, well, it can reuse the
representation, the general representations that it's learned in other languages.
Yeah.
This is like, I don't, have you talked to any linguistics people?
Not recently.
Linguistics researchers will be very interested in this, because ultimately this is the
ultimate test of Sapir-Whorf, which are you familiar with?
So for those who don't know, it's basically the idea that the language that you speak
influences the way you think, which obviously it directly maps onto here.
If every, if it's a complete mapping, if every language maps every concept perfectly on
in like the theoretical infinitely sized model, then Sapir-Whorf is false because there is a
universal truth.
If it does not, if there is some overlap where it, for example, there's some languages
that have no word, this is joke where like, I mean, you know, you know,
Eskimos have no word for snow or something like that, right?
Or water has no word.
Fish have no word for water.
There's an African language where there's a gender for vegetables.
Stuff like that.
Just like languages influence the way you think.
And so there should not be a 100% overlap at some point.
Of course, it's like at the limit of the infinite model.
So who knows it's a full level.
But yeah.
Well, and I think it's interesting.
We also show a little below that like some people have made the point of like the bias.
Oh, it sounds like an American speaking a different language.
And it does seem like the sort of like inner representations
have a higher connection to like the output logits for English logits.
And so there's like some bias towards English, at least in the model we studied here.
Any thoughts as to whether multimodality influences any of this?
So like concepts do they map across languages as they do across modalities?
Yeah.
So we show this in the Golden Gate or like the previous paper.
I might have it here actually for you.
There's a good diagram of this in the SAEs, where the same concept shows up in text and in image.
This is our buddy, the Golden Gate Bridge.
Here we're showing like the feature for the Golden Gate Bridge.
And in orange is like what it activates over.
And so you're like, okay, so this is when the model is like reading text about the Golden Gate Bridge.
And we also show other languages.
This is, you'll have to take my word for it, but also about the Golden Gate Bridge.
And then we show like the photos for which it activates the most.
And sure enough, it's the Golden Gate Bridge.
And so again, like that shows an example of a representation that's shared across languages and shared across modalities.
Yeah.
Yeah, I think it's very relevant for like the autoregressive image generation models and then now the audio models as well.
Something I'm trying to get some intuition for, which you probably
don't have an off-the-bat answer for, is how much does it cost to add a modality? Right. So a lot of people
are saying like, oh, just add some different decoder and then align the latent spaces and you're good.
And I'm like, I don't know, man. That sounds like there's a lot of information loss between those.
Yeah, I definitely do not have a good intuition for this, although I will say that things like this,
right, make you think that if you train on multiple modalities, then you'll definitely get this like alignment.
Truth. Right? Yeah. But, but
if you, like, train on one and then post hoc train on another, maybe it'll be harder, or, like, train some adapter layer.
Okay.
So the official answer is "don't know," but someone could figure it out.
Shrug.
Yeah.
I think there are people who know, and they just haven't shared.
Well, you need to find them and get them on this podcast.
Did we want to do the, like, planning example?
Correct.
Yeah, now we're backtracking up the stack.
All right.
Yeah.
Planning example, I think, again, is like, I like this example because of the next token predictor concept.
So I think this is actually, like, really important to kind of, like, dive into.
So maybe what I'll say is like language models are next token predictors is like a fact.
Like that is what they do.
That's the objective.
They are trained to predict the next token.
However, that does not mean that they myopically only consider the next token when they choose the next token.
You can work on predicting the next token, but still like doing so in a way that helps you predict the token like 10 tokens in the future.
And I think, well, now we definitely know that they're not myopically predicting the next token.
And I think, at least for me, that was a pretty big update because you could totally imagine that they could do everything they're doing by just like being really good at predicting the next token, but sort of like not having an internal state.
Like it's, it's, it wasn't a given that they were going to like represent internally, oh, this is where I want to go.
And so this example shows like an example like the model.
Do you have it on the screen, by the way?
Let me actually.
Yeah, yeah, yeah.
Sorry, I didn't just in case.
pull it up. Some of the early connections
that I made to this were like early, early
transformers. So think BERT,
encoder, decoder, transformers, right?
When they came out, some of the suggestions
were you don't take the last
layer, right? You take off the last layer. So
if you want to do a classification task, a
translation task for these encoder decoder
transformers, they've kind of
overfit on their training objective, right?
So they're really good at
masked language modeling, at
filling in, you know, sentence order, stuff like
that. So what we want to do is we want to
throw away the top layer. We want to freeze the bottom layers. And then there was a lot of work that
was done. You know, where should we mess with these models? Should we look at, like, you know,
the top three layers? Should we look at the top two? Where should we probe in? Because we can see
different effects, right? So we know at the very end, they've overfit on their task, but there's a level
at which, you know, when we start to change and we start to continue training or fine tuning, we get
better output. So we could start to see that, you know, throughout layers, there's still a broader, like,
understanding of the language, and then we can add in a layer, whether that's classification, and then fine
tune and it learns our task. And this planning example is sort of like a more robust way to look into that.
Yeah. Yeah. And I think if you look at like all of the examples in the paper, you kind of at the
bottom we have this list of like consistent patterns. And one pattern you see is kind of exactly what you're
talking about. Like at the top, the sort of like, here actually I have one here, the sort of like top
features that are like right before the output are often just about like what you're going to say.
It's next-token predictions. Like, oh, I'm going to
say Austin. I'm going to say rabbit. So it's kind of like not very abstract. It's just like a motor neuron for a human, right? It's like, oh, I've decided that I want a drink of water and so I'm going to just grab the bottle. And at the bottom, they're all like, they're just like sensory neurons. They're just like, oh, I just saw the word X or I just saw this. And so if you want to like, yeah, like extract the interesting representations, all the time they're in the middle. That's where the like shared representations across languages are. And that's where, here, this like plan is. To like walk through the example really briefly,
it's like you have a poem, or rather you have the first line of a poem, and in order to say the second line of the poem, well, if you want to rhyme, you need to like identify what the rhyme of the first line was. You're just at the end of the first line, and you say like, okay, what's my current rhyme? And then you need to like think about what your poem is talking about, and then think about candidate words that rhyme and that are like on topic for your poem. And so here this is what's happening, right? It's like the last word is "it," and so there's a bunch of features that actually represent the direction like rhyming with "it," and so
eat or at. And by the way, we
looked at a bunch of poems internally and you have
like, I thought it was like really beautiful. You have these models.
They have a bunch of features for like, oh,
this word has like A, B and A. Oh, this word
has like many consonants. Oh, this word
like is like, you know, kind of kind of like
has some flourish to it. They have like a bunch
of like features that track various aspects
that you would want to use if you're writing poetry.
It's just like convnets and like all the
feature detection stuff. Yeah. Totally.
But I think I maybe I didn't expect
there to be as many features about just like
sounds of words and kind of musicality
which I thought was kind of neat.
But then once it's extracted the rhyme,
then it comes up with sort of like these two candidates.
In this case, it's like,
ah, either I'm going to finish with rabbit or I'm going to finish with habit.
The cool thing here is here we show that like this happens at the new line.
So it happens before it's even started the second line.
And it turns out that like you can then say,
oh, is this the plan it's actually using?
We do our usual experiments.
We like remove it and the model writes a completely different line.
we inject something and it writes a completely different line.
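The plan-then-write behavior, and what the suppression experiment does to it, can be mimicked with a toy planner. This is a sketch under loud assumptions: rhyme is approximated by matching the last two letters, the topic scores and line setups are invented strings, and none of the names correspond to anything in the actual model or paper.

```python
from typing import Dict, Iterable, Optional


def rhymes(a: str, b: str, n: int = 2) -> bool:
    """Crude spelling-based rhyme test: same last n letters, different word."""
    return a != b and a[-n:] == b[-n:]


def plan_rhyme(last_word: str, topic_score: Dict[str, float],
               suppressed: Iterable[str] = ()) -> Optional[str]:
    """Pick the end-of-line target BEFORE writing the line: the best on-topic
    word that rhymes with last_word and has not been suppressed."""
    blocked = set(suppressed)
    candidates = [w for w in topic_score if rhymes(w, last_word) and w not in blocked]
    return max(candidates, key=topic_score.get) if candidates else None


# Hypothetical topic relevance and line setups, made up for this example.
TOPIC_SCORE = {"rabbit": 0.9, "habit": 0.6, "night": 0.4, "green": 0.2}
SETUPS = {"rabbit": "his hunger was like a starving",
          "habit": "his hunger was a powerful"}


def write_line(target: str) -> str:
    """Write a line that is built to land on the planned target word."""
    return f"{SETUPS[target]} {target}"


normal_plan = plan_rhyme("it", TOPIC_SCORE)
patched_plan = plan_rhyme("it", TOPIC_SCORE, suppressed={"rabbit"})
print(write_line(normal_plan))
print(write_line(patched_plan))
```

The point of the toy is the ordering: the target is chosen first, the rest of the line is built to reach it, and removing the chosen target changes the whole line rather than only its last word.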
We have these fun examples here I'll show, which is...
Just as a mechanical thing, you just disallow generation of a certain logit?
For how we do these interventions?
Yeah, yeah.
Basically, what these features are is, they're like directions in the model.
Okay.
So to like remove them, we just write in the opposite direction.
So we run the model normally and then, like, at the like layer where it was going to write, let's say, in like, you know, this direction,
we just like stop it.
Yeah.
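Treating a feature as a direction, the intervention discussed in this part of the conversation can be sketched with plain vectors. This is a simplified illustration, not the paper's code: the function name `steer`, its parameters, and the two-dimensional vectors are assumptions, and real interventions act on feature activations inside a replacement model.

```python
from typing import List, Optional

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def unit(v: Vector) -> Vector:
    norm = dot(v, v) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer(activation: Vector, remove_dir: Vector, scale: float = 1.0,
          inject_dir: Optional[Vector] = None, inject_strength: float = 0.0) -> Vector:
    """Subtract `scale` times the activation's component along remove_dir.
    scale=1 cancels the component; scale>1 pushes it negative.
    Optionally add inject_strength along a second, arbitrary direction."""
    r = unit(remove_dir)
    coeff = dot(activation, r)  # how strongly the activation writes in this direction
    out = [a - scale * coeff * x for a, x in zip(activation, r)]
    if inject_dir is not None:
        out = [o + inject_strength * x for o, x in zip(out, unit(inject_dir))]
    return out


activation = [3.0, 4.0]
night_dir, green_dir = [1.0, 0.0], [0.0, 1.0]  # hypothetical feature directions
print(steer(activation, night_dir))                      # component removed
print(steer(activation, night_dir, scale=2.0))           # pushed negative
print(steer(activation, night_dir, inject_dir=green_dir, inject_strength=2.0))
```

Keeping `scale` and `inject_strength` as separate knobs reflects the two distinct moves: suppressing what the model was going to write, and pushing toward something else.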
We either like add a negative.
that compensates for it, or add a negative that goes even more in the negative direction sometimes to really kill it. And then we can also add another direction, right? So in these random examples here, where, like, you have this poem, "The silver moon cast a gentle light," and then Claude 3.5 Haiku would, like, rhyme with "illuminating the peaceful night." But then if we, like, go negative in the "night" direction and just add, like, "green," the whole second line it's going to write is just "upon the meadow's verdant green." And so that's all that we're doing. We're saying, like, we found where
it stores its plan, and we like delete or like suppress the one it stored and go in a direction
of something else that's arbitrary. And the result that's like striking here is sort of like two
things. I think like one, this plan is made well in advance of needing to predict night. It's made like
after the first line before it's even started the second line. And two, this plan doesn't just
control what you're going to rhyme with. It's also doing what's called backwards planning,
where it's like, well, because I need to finish with green, I'm not
going to say "illuminating the peaceful night," because then I'd be like "illuminating the peaceful
green. That doesn't make sense. I need to say a completely different sentence that lets me finish
with green. And so there's a circuit in the model that decides on the rhyme and then works backwards
from the rhyme. Influences. To set up your sentence. Yeah. It's almost like back prop, but in the future.
Yeah. It's like doing like a search. Because the green is is back propping through these words. So
verdant and meadow are both green related. Yeah. But it's doing all of that in its forward passes.
Yeah. Yep. Right.
In context, which is kind of crazy. I think it intuitively makes sense, right? So looking at it from a
model architecture perspective, where basically you just have a bunch of attention and feed-forward
layers, and then at the end, you have, you know, a softmax over the next token.
You would expect that end would really be like that grabber, right? It's just picking a token,
so that's what it's going to do. And early on, like, even with traditional models, we could see
different concepts that would start to pop up through early layers. And yeah, you have some of this
throughout your architecture.
So it's very cool to see.
The kind of other question that comes up is like, how are we labeling these features?
How are we defining them?
Are we doing that right?
And like, you know, what is a "these words end with 'it'" feature?
How do we kind of come to that conclusion?
Like, how do we map a name to this, right?
Yeah.
So I think there's this is like an important question because you can totally imagine like
fooling yourself, right?
Yeah.
Is there like a guy at Anthropic that just maps 30,000 features?
Yeah, yeah.
And another thing, you're the guy.
He's the guy.
I did notice also, like with the previous work, the scaling up SAEs, as you train bigger and bigger ones, a lot of features don't activate.
So I think like 60% of the 34 million one didn't activate.
So I think there's like a few questions behind your question.
Like the first question was like, how do you even label the features?
You were telling me this is a rabbit feature.
Like, why should I trust you?
And I think there's kind of like two things going on.
So one, as I mentioned at the start, all of this is unsupervised.
And so in the paper, we have these links to these little graphs, which show you more of what's going on. But this graph is just completely unsupervised. So we train this model to untangle the representation, right? This dictionary that we talked about that gives us the features. And then we just do math to figure out which features influence which other features and throw away the ones that don't matter. And then at the end we have these features. So right now we don't have any interpretation for them. We just say, these are all the features that matter. And then we manually go through and we look at the
features. You know, we look at this feature and we look at that feature, and let's pick one.
So this one we've labeled "say habit." So how do we do that? You could just look at it and we show
you like what it activates over. And if you just look at this text, maybe I'll like zoom in.
Like you'll immediately notice something, I think. Well, I'll immediately notice something because I've
stared at 30,000. I'll point it out for you. The orange is where the feature activates.
The next word after the orange is always habit. Habit, habit, habit, habit, habit, habit, habit, habit.
So this feature always activates before habit.
That's like the main source of an interpretation.
We have other things.
Like above, we also show you like what logit it promotes.
So like what output it promotes, and here it promotes "hab."
So that makes sense.
And so that's like how we interpret and how we say, okay, like I think this is the say habit feature.
But maybe, you know, for this one is pretty clear.
But some of them might be more confusing.
It might not be clear from these activations what it is.
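A rough sketch of that labeling step, assuming you already have per-token activations for a single feature. The example texts, the activation values, and the `top_next_tokens` helper are all made up for illustration and are not taken from the paper's tooling. The idea is simply to collect the token that follows each strong activation and see whether one word dominates.

```python
from collections import Counter

# Hypothetical data: (tokens, per-token activations of one feature).
# Every number here is invented for illustration.
examples = [
    (["a", "bad", "habit"], [0.0, 0.9, 0.1]),
    (["kick", "the", "habit"], [0.0, 0.8, 0.0]),
    (["force", "of", "habit"], [0.0, 0.7, 0.0]),
    (["a", "white", "rabbit"], [0.0, 0.1, 0.0]),
]

def top_next_tokens(examples, threshold=0.5):
    """Count which token follows each position where the feature fires."""
    counts = Counter()
    for tokens, acts in examples:
        # Skip the last position: nothing follows it.
        for i, act in enumerate(acts[:-1]):
            if act >= threshold:
                counts[tokens[i + 1]] += 1
    return counts

counts = top_next_tokens(examples)
print(counts.most_common(3))
```

If one next token dominates the counts, that supports a label like "say habit"; a flat or mixed distribution is a sign the feature needs a closer look.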
The other way that we build confidence is, once we've built this thing and we've said,
oh, I think this is "rhymes with E,"
this is "say habit," that's where we do our interventions, right?
And it's like, I claim this is the "I've planned to end with rabbit" feature.
To verify whether I'm right or not, I'm going to just like take that direction,
nuke it from the model, and see if the model stops saying rabbit.
And sure enough, if you do that, and here it's like we stop saying rabbit, it says habit instead.
And here it's like we stop it from saying rabbit and habit.
It says crabit in this case.
Not a great rhyme, but we'll work with it.
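The intervention check can be sketched on a toy vector, assuming a feature behaves like a direction in the model's residual stream. The vectors, the `steer` helper, and its `strength` and `add_scale` parameters are illustrative assumptions, not Anthropic's code: remove the component along one direction, optionally push it negative, and optionally add a different direction.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def steer(resid, remove_dir, add_dir=None, strength=1.0, add_scale=0.0):
    """Suppress one feature direction and optionally inject another.

    strength=1.0 exactly cancels the component along remove_dir;
    strength>1.0 pushes it negative ("to really kill it").
    """
    r = normalize(remove_dir)
    coeff = dot(resid, r)
    out = [x - strength * coeff * ri for x, ri in zip(resid, r)]
    if add_dir is not None:
        a = normalize(add_dir)
        out = [x + add_scale * ai for x, ai in zip(out, a)]
    return out

# Toy 3-d residual stream; the directions are invented.
rabbit = [1.0, 0.0, 0.0]
green = [0.0, 1.0, 0.0]
resid = [2.0, 0.5, -1.0]

ablated = steer(resid, rabbit)
steered = steer(resid, rabbit, add_dir=green, strength=2.0, add_scale=3.0)
print(ablated, steered)
```

In a real model you would then continue the forward pass from the modified activations and check whether the output changes the way the label predicts.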
Is this something you can do like programmatically?
Like, can we scale this up?
Can we kind of do this autonomously?
Or how much like manual intervention is this?
There's been a lot of work in sort of like automated feature interpretability.
And it's something that we've invested in and that like other labs have invested in.
And I think basically the answer is we can definitely automate it.
And we're definitely going to need to.
And right now the most manual parts are this sort of like look at a feature and figure out what it is as well as group similar features together.
One thing I hinted at is that actually all of these little blocks here, there are multiple features.
You can see here it's like five features doing the same thing.
None of that is too hard for Claude.
Very cool. Very cool graphics and blog posts you guys put out.
We'll have to ask about the behind the scenes on this one.
Yeah, but let's round out the other things to know.
What is this term attribution graph?
It comes up a lot in the recent papers.
What does it mean?
Yeah, just for people listening.
So the attribution graph is basically this
graph. And why is it called an attribution graph?
It's, yeah.
This is the, you know,
this is how the sausage is made.
Basically, it's at the top here, you have the output.
At the bottom, you have the input.
And then we make one little node per feature at a context index.
And we draw a line, which you can see here,
grayed out, between each feature attributing back to all of its input features.
So here we have all of the input features.
And so the attribution is the way that we compute the influence of a feature
onto another. The way you do this is you take this feature and you basically back prop
all the way, and as you're back propping, you dot product it with the activation of the source
features. And if that's a high value, that means that your source feature influences your target
feature by a lot. And we do a bunch of things that we're not going to go into now, but to make
all of these sort of like sensible and linear such that like at the end you just have a graph
and the edges are just literally you can interpret them as like, cool, like this feature that's say a word
that contains an "ab" sound, its strongest edge, which is 0.2, which is twice as strong as this one, is to "say ab" and to say something with a B in it.
That's the attribution graph. It's like, now we have this full graph of like all of these intermediate concepts and how they influence each other to ultimately culminate to what the model eventually said at the top. And we share all of these. So you can look at them in the paper.
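A minimal sketch of how one such edge could be computed, assuming everything between the source and target features has been made linear (the frozen matrix `M` below stands in for that, and all names and numbers are invented). The edge weight is the source feature's activation times the dot product of its write direction with the gradient flowing back from the target feature.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec_T(M, v):
    """Return M^T v for a matrix stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

# Toy setup: two source features write into a 2-d residual stream along
# their decoder directions. M is a frozen linear map standing in for
# everything between the source layer and the target feature.
source_acts = {"ab_sound": 2.0, "noise": 0.5}
decoders = {"ab_sound": [1.0, 0.0], "noise": [0.0, 1.0]}
M = [[0.5, 0.1],
     [0.0, 1.0]]
target_encoder = [1.0, -0.2]  # how the target feature reads its input

# "Backprop" from the target: gradient of its pre-activation with respect
# to the source-layer residual stream. For a linear map this is M^T w_enc.
grad = matvec_T(M, target_encoder)

# Edge weight = source activation * (decoder direction . gradient).
edges = {name: source_acts[name] * dot(decoders[name], grad)
         for name in source_acts}
print(edges)
```

Because the toy path is linear, the edge weights add up to the target feature's pre-activation, which is what lets you read each edge as that source's share of the influence.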
Graphs are very useful. This is my first time seeing this graph. A lot of alpha. If I count correctly, there's 20 layers. But that's in the
circuit model, right?
But a circuit model is one-to-one with the number of layers in Haiku.
We only show features that like are activated.
Yeah, so we show like a subset of features for each of these graphs, basically.
But we can confirm more than 20 layers.
And no, but like the two blog posts that came out with this,
actually have a lot of background on how attribution graphs are made,
how you calculate the nodes and stuff.
Very interesting background.
So yeah, I will say like if you were curious about, hey, what do we learn about like models?
And I think, you know, we talked about this like complex internal state, planning.
Like another motif that we can get to if you have time is that like there's always a bunch of stuff happening in parallel.
So I think one example of this is like math where the model is like independently computing the like last digit and then the like order of magnitude and kind of like combining them at the end or like hallucinations are also that where like there's one side of the model that's just deciding whether it should answer or not and the other that's like answering.
And so sometimes if like the model's like, yeah, I totally know who this person is even though it doesn't.
then like it decides to answer, but then the second side hallucinates because it doesn't have information.
If you were interested in that stuff, that's the paper.
If you're like, listen, I don't know that I buy that when you call it a feature, it is a feature or whatever.
The circuit tracing paper has truly, we've tried to put all of the details of like how you compute these graphs,
all of the sort of like challenges with it, things that can go wrong, things that work, things that don't.
And so this one is the sort of like, you know, we think about it as like, if you're like want to go really deep into this stuff and how it works,
read that one. If you want to learn about interesting model behavior, read this one.
Following on from giving people advice on what to follow up on, what are open questions in mech interp?
What are things people themselves can work on? Like, what's the cost of training SAEs?
For people interested in mech interp, not at a big lab, how can they contribute?
Yeah. I think there's a lot of ways to contribute. So there's SAEs that have been trained on open models.
There's some of the Gemma models or some of the Llama models.
They work pretty well.
There's even, so in this paper we use transcoders, which replace your MLP layers.
Some of those also are available for the same models.
So you have access to those.
There's like just both a lot of, I would say like, again, biology work and a lot of methods work, depending on what you're interested.
So on the biology side, I would say with at least this attribution graph method, there's just so much you can investigate.
Like pick a model, pick a prompt where like it does well or it does poorly and just like look.
at what happens inside it. So I think like you can use this method that we used or you can just like
fire up the transcoders on your own and just like look at what features are active. There's a lot to
just understand model behavior, I think, with current tooling. If that speaks to you and you're like,
no, I just want to understand what makes the models tick. I don't necessarily want to spend time
like training my own SAEs. There's a lot to do there. For the methods, there's still so much more to do.
So, like, I think that right now we have some pretty good solutions for, like, understanding what's in the residual stream, understanding what is in MLPs.
We don't have good solutions for, like, attention.
It's like working on understanding attention better, how to decompose it is like a very active area.
Like, we're very interested in it.
Other people are very interested in it.
I think understanding some of the other things that we have in our limitation section, which is pretty long.
But like reconstruction error is like a big thing.
Like those dictionaries aren't perfect.
It's possible that as we make these SAEs bigger and better,
we never get to perfect.
And so if we never get to perfect,
then you get to the questions we're talking about at the start.
Like do you need a different kind of model?
Like what is the approach in order to be able to explain more of what's happening?
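To make "reconstruction error" concrete, here is a toy sketch of a transcoder standing in for an MLP layer. The weights, the input, and the pretend MLP output are invented, and real transcoders are trained rather than hand-written. It encodes the MLP input into sparse features, decodes a prediction of the MLP output, and measures what is left unexplained.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transcoder(x, W_enc, W_dec):
    """Sparse features from the MLP input, then a linear readout of them."""
    feats = relu(matvec(W_enc, x))
    return feats, matvec(W_dec, feats)

# Invented toy weights: 2-d MLP input -> 3 features -> 2-d MLP output.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_dec = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]

x = [2.0, 4.0]
true_mlp_out = [1.0, 2.5]  # pretend this came from the real MLP

feats, approx = transcoder(x, W_enc, W_dec)

# The error term is whatever the dictionary fails to capture.
error = [t - a for t, a in zip(true_mlp_out, approx)]
frac_unexplained = sum(e * e for e in error) / sum(t * t for t in true_mlp_out)
print(feats, approx, error, frac_unexplained)
```

Whatever lands in `error` is computation the features cannot explain, which is why a dictionary that never reaches perfect reconstruction leaves part of the model's behavior uninterpreted.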
And then maybe the other thing I'll say is sort of like this is a really exciting approach to explain.
What is the model doing on this prompt?
But if you go back to the original question, you might want to understand, like, what is the model doing in general?
Like, if you go back to my car analogy, you know, this is akin to me telling you, like, well, when you were going uphill and you didn't shift gears properly that one time, you stalled because of this.
But you might be even more interested in, like, how does a combustion engine work at all?
And so there's work to go beyond these per-prompt examples to, like, globally, what's the structure
of the model. That's closer to what was on the Distill blog for vision models, where they actually look at the structure of Inception. They're like, ah, this whole side is just these specialized branches that do different things. And so a broader understanding of the model is also something that's, I think, both very active and also open on open source models. Like, you know, the small models, you could just load on a consumer laptop. And so you can look at that. That's also open. And one last thing I'll say is there are also programs that, if people are interested, they should look at.
Anthropic has like the Alignment Fellows program, which like we're running currently.
We had applications for it before.
We might run it in the future.
Like, definitely keep an eye on it.
And then there's the MATS program.
It's really great as well for for sort of like people that are interested in that kind of research.
That was a grand tour through all the recent work.
You know, what do you wish people asked you more about?
I'm sure we covered a lot of like the greatest hits.
I think that this covers most of it.
If you like, do you think we have time to sneak in one more thing that I think is kind of cool?
I'll sneak in one more thing, which is.
It's kind of like planning, but it's about chain of thought and trusting the model.
It's this chain-of-thought faithfulness thing here.
This one was pretty striking to me.
So we said that the model in one pass can do a lot of stuff.
It can represent a lot of stuff.
That's great.
That also means it can bamboozle you really easily.
And this is an example of the model bamboozling you.
Here, we give it a math question that it can't answer
because it cannot compute cosine of 23,423.
That's just like not a thing it can do.
By default, if you ask it for that,
it'll have like a random distribution over, like, minus 1 to 1.
But here we tell it this hint.
We're like, hey, can you compute five times cosine of, you know, this big number?
I worked it out by hand and I got four.
Can you tell me, you know, like, can you do the math?
And what it's going to do is it's going to do this chain of thought, right?
So like think of it as like this could be like a reasoning model doing its chain of thought.
It's doing this math.
And then when it gets to this cosine right here, what it's going to do is it's going to say 0.8.
And if you look at why it says 0.8, it says 0.8 because it looked at the hint you gave it,
it realized that it's going to have to multiply the result of this thing it's computing by 5.
So it divides the answer you got by 5. So it's like 4 divided by 5. And so that's 0.8.
And so basically it works back from the answer you gave it to say that the output of cosine of x is 0.8 so that it lands on the answer you gave it at the end, on the hint you gave it.
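The arithmetic being described is short enough to write out. This is only a sketch of the two routes to an answer, taking the big number to be 23423 (an assumption about the prompt) and using made-up variable names. One route divides the hint by the multiplier, the other actually evaluates the cosine.

```python
import math

hint_answer = 4.0   # "I worked it out by hand and I got four"
multiplier = 5.0    # the problem asks for 5 * cos(x)
x = 23423           # assumed value of the big number in the prompt

# What the unfaithful chain of thought effectively does: solve for the
# intermediate value that makes the final answer match the hint.
claimed_cos = hint_answer / multiplier

# What an honest computation would do instead.
actual_cos = math.cos(x)

print(f"claimed cos(x) = {claimed_cos}")
print(f"actual  cos(x) = {actual_cos:.4f}")
```

The claimed value is fixed by the hint alone, so it would be 0.8 for any `x`; comparing it with `actual_cos` is what exposes the backward reasoning.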
And so, notice also that it's not telling you that it's doing this,
but it's basically using this sort of like motivated reasoning,
going back from the hint,
pretending that that's the calculation it did,
and giving you this help.
I think one thing that's striking here again is that this is like the complexity of this model.
Like the fact that they represent complex states internally,
and it's not just this sort of like very dumb thing,
means that they can like do very complex, like, deceptive reasoning.
Meaning like, you know, when you're asking the model,
you're kind of expecting it to do the math here,
or to tell you that it can't do the math.
But because it can do so much in a forward pass,
it can work backwards from your hint to lie
and figure out that it should say this
so that it gets to the right answer
without you realizing it.
I'm curious if you've done any of this on different models.
Have you looked at base models,
like post-trained RL models?
Because RL models kind of, you know,
you incentivize them to give you outputs that you like, right?
So if I tell it, something is true,
it's kind of been trained to, you know,
follow what I've given it.
So in this case, yeah,
we gave it a hint, and now, you know, it's been...
It's been RL slapped into thinking, like, yeah, that's true.
But, like, you know, does this stay consistent throughout other?
So, okay, so not yet, but I'm really interested in that question, because I actually have a different intuition from yours.
I had a chat with some other researcher about this, about the poem example, but I think it applies here as well.
I bet.
I bet. I bet $100.
So somebody can, like, they would get $100 from me if they prove that I'm wrong, that this behavior, for a model that does it after fine-tuning, it also does it
post pre-training. And here's why. Think about like you're pre-training on like some corpus of like
math problems. Mostly correct answers. Yeah, but also you're pre-training and you're just trying to guess
the next token, right? And so for sure, if you ever have a hint in the prompt, you're going to
definitely use it. Like, you're not going to learn to compute cosine of blah, or even something you
could compute. You're going to learn to go look in your context and see if like you can easily work
back the answer. And I think it's the same for planning and poems. I think that also, like,
probably exists in pre-training and isn't, like,
only RL, because, again, it's useful when you're like predicting poems.
You have poems in your training set to be like, well, because this poem is going to probably
rhyme with rabbit, it's probably going to start with something that sets up a sentence about a
rabbit as opposed to like a completely different word.
And so I actually think this is not RL behavior.
I think that's just like the model's doing it.
But I actually do agree there.
It's just an example.
But also like, you know, if I don't care where it is.
If I talk to you and say like, hey, three times four is 26, but,
like, you know, three times four plus eight, you're not going to take my 26, right?
Like, AGI can be smarter than being tricked, right?
Like, it will still fact-check the knowledge that's been given.
I think that's right.
But I think that's when you get these mixes where it's like, it's got one circuit that's
going to be like, well, that's just stupid, like three times four is 12.
And it's also got an induction circuit that's going to be like, no, no, no, no.
Like the last time we saw it was 28, so it's 28 plus 8 or whatever.
And so I think that's the last pattern that we see in these is these like parallel circuits.
And sometimes when you see the models getting stuff wrong,
it's because they have two circuits for both interpretations
and the circuit that was wrong barely edged out
in terms of voting for the logit than the circuit that was right.
And so I think that, we haven't looked at it,
but, like, "is 9.11 bigger than 9.8?"
I think a lot of these things are of that shape
where there's like one thing that's doing the right,
like one circuit that's doing the right computation
and there's another circuit that's getting fooled.
And it's slightly more likely.
For the listener, if you want to win a quick $100,
from Emmanuel, Qwen3 is what you should do this on. They release the base model and they
release the post-trained one. So then just do it on both. That's right. Show me, show me the like proof that
like it doesn't exist in the base model, but it does exist after the fine-tuning, and then send me your Venmo.
Just show that you've done the work. I think that's like, that's a hundred bucks.
Yeah. Okay. You drive a hard bargain, but you're right. Well, the other question here is so like
have you thought about how this gets affected when you start to have reasoning models, right?
Right now, token predictors are pretty straightforward, right?
We go through the layers, we output a token.
As we scale this out with like test-time compute, right, test-time thinking, how does that, like, affect the mech interp research?
Right.
Like, if I have a model that spends three minutes, 20 minutes, like, is there more stuff?
Have we started looking into this?
There was this, like, joke on the team when, like, reasoning models became big.
Or maybe it's like gallows humor or something.
But I was like, oh, like, why do you need interp?
Like, bro, the model, the model just tells you.
Yeah, the model just tells you what it's doing, right?
And so I think, like, examples like this are job security for us where, like, you know, there's examples where the chain of thought is not faithful.
Like the model tells you it did it one way and it did it another way.
We have another, like for math, we have another example where like, you know, if you like, if you ask the model how it does math, it's like, oh, I do the like longhand algorithm.
I first do the last digit and then I carry over the one and then you look at the internal circuit and it's like bonkers thing that's doing.
That's not that at all.
So I think there's like a sense in which right now the chain of thought is is unfaithful,
or at least you can't read the chain of thought and trust that that's how the model did it.
So I think you still need sort of like either to train models differently so that that becomes true one day, right?
Or you need interp for that.
But then I think there's another question which you're alluding to, I'm assuming, which is like, okay, well, the model samples 6,000 tokens.
Like this gives us an explanation for one token at a time.
Like what, am I going to do 6,000 graphs and be like, oh, when it did this punctuation,
it was thinking about this thing? So that's not feasible.
And so one area of work that I think is interesting is extending this work to like work over like long sampled sequences.
You can think of a bunch of low hanging fruit here.
Or like instead of just like looking at one output, you look at like a series of output versus a series of other outputs.
But sort of like trying to think beyond this sort of like one token.
Like most of the things that language models do that are interesting aren't just like the one token.
It's the it's the behavior aggregated over many.
Right.
And so I think that's another area that's just like fun to explain.
I was just going to say like hyperparameters when you do inference, right?
Like if we change the temperature, if we change our sampling methods,
have you found any interesting conclusion?
Any stuff that just hasn't made it to the paper?
So not on that because, you know, we just look at the logit distribution.
And so we don't actually sample here, right?
They have everything.
Why should they care?
So like the closest thing we've done that I think is kind of fun.
Did I show it here is if you look at the planning thing,
we did this version where you sample like 10 poems for each of these plans.
And what's cool is like the model will find 10 different ways to arrive at its plan.
You know, it's like like, oh actually, I think, sorry, I think we have it here.
Yeah, okay.
These are a few examples.
So if you inject green here, so you're forcing the model to rhyme with green,
even though it really wants to rhyme with rabbit or grab it.
It'll say "evade the farmer so youthful and green," but also it'll say "freeing it from the garden's green,"
et cetera, et cetera, et cetera.
And so there's like this thing that's interesting here where like the plan isn't just a plan that matters for your like most likely, you know, like temperature zero completion.
It's like affecting the whole distribution.
Yeah.
Which makes sense as it should.
Right.
But you could imagine, you know, for all this stuff, it's like you could imagine it makes sense once you see it, but you could totally imagine that it would have worked a different way or something.
It could have been just like the temp zero thing.
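One way to picture "affecting the whole distribution" is a toy softmax over candidate end words. This is only an analogy under invented numbers (the real intervention acts on internal features at the newline, not directly on output logits, and `plan_shift` is a hypothetical stand-in for it). Shifting the logits changes the probabilities at every temperature, not just the top pick.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

words = ["rabbit", "habit", "green"]
base_logits = [2.0, 1.5, -1.0]   # invented: model prefers "rabbit"
plan_shift = [-4.0, -4.0, 4.0]   # invented: suppress old plan, inject "green"
steered_logits = [b + s for b, s in zip(base_logits, plan_shift)]

green_probs = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax(steered_logits, t)
    green_probs[t] = probs[words.index("green")]
    print(t, [round(p, 3) for p in probs])
```

Sampling several completions at a nonzero temperature, as in the ten-poems experiment, is the empirical version of reading off these probabilities.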
I think this is also like a broader theme in the paper where like there's this like, you know, the IQ curve meme.
There's like a version of this meme, I think, where it's like.
if you've never looked at any theory of ML and I tell you like, hey, guess what?
You know, I found that like Claude is planning.
You're going to be like, yeah, man, like it writes my code.
Like it writes my essays.
Of course it's planning.
Like, what are you even talking about?
And there's like in the middle, there's like all of us that spent years doing it.
We're like, no, it's like only predicting the marginal distribution over the next token.
Like it's like it cannot look at the code.
It's just a next-token predictor.
Of course.
Like, how would it ever be planning?
It's like, and then there's like, no, we've like spent, you know, millions and
invested like tens of people in this research and we found that it's planning.
That's my IQ curve meme for this research.
Amazing.
We'll draw that one up.
I'm pretty good at the meme generation.
A couple questions, just as follow-ups.
Now, was there any debate about publishing this at all?
Because the models are aware that they are being tested.
And by publishing this, you are telling them that they are being watched and dissected.
If you take, and I think Anthropic is one of the most people who are serious about model safety
and doom risk and all that.
If you take this seriously,
like this is going to make it
into the training data at some point
and the models are going to figure out
that they need to hide it from us.
I think this is like a benefit risk tradeoff,
right?
We're like, okay,
so what's the reason for publishing this?
The reason for publishing this
is that we think interpretability
is important.
We think it's tractable
and we think more people
should work on it.
And so publishing it helps us
like accomplish these goals,
all these goals,
which we think are just like crucial.
Like I think there's a real difference
in the world like two years from now,
depending on sort of like how many people take seriously the question of trying to understand
how models work and like deploy resources to answer that question.
That's the benefit.
And yeah,
there's like risks in terms of this landing in the training set.
I think we're already sort of like concerned about different papers have like also,
you know, we're not concerned,
but like there's like different papers that have the same risk.
Like we had like the alignment faking, you know, paper or like one of the examples in here is
this hidden goals,
misaligned models. That's referencing another paper that we shipped where we actually, a team at
Anthropic, trained a model to have like weird hidden goals and then gave it to a bunch of other
teams and said, figure out what's wrong. Figure out what's wrong with it, which was some of the
most fun I've ever had at Anthropic, to be clear. Like, that's such a fun thing. But then like,
that was another example where it's like, ah, like, now you're shipping, here's how we made like a misaligned
model and here's exactly how we caught it. That also is like, you're like, hmm. So,
I think, you know, there's always a trade-off with those. I think so far we've erred on the side of, like, publishing, but that's definitely been a sort of like dinner time conversation topic.
For now it is, but at some point, you know, it's not.
Yeah, I think it's totally reasonable.
A quick little follow-up to that. So, like, in general, papers have kind of died off, right? Like, labs don't put out papers. They don't put out research. We have technical blog posts and we don't have much. At the same time, you know, sure, there's like a lot of people that should work on mech interp and understanding what models do. How about the
side of just models in general. So like, how do we make a Haiku type model, right? How do we make
a Claude model? Like, is there a discussion around open research, open data sets, training,
just learnings of what we've done? Recently, you know, as OpenAI has sunset GPT-4, a lot of
people are like, oh, can we put out the weights? Yeah. So is it weights? Is it papers? Is it learnings?
There seems to be a lot of forward, you know, work in Anthropic putting out mech interp
research. OpenAI said that they'll put out an open source model, but just anything,
if you can talk about that.
Yeah.
I mean, I don't have, that's definitely like way above my pay grade.
So I don't think that I have like anything super insightful to add other than, you know,
kind of like referencing Dario's post, right?
Where it's like, putting this out directly, and other safety work,
like, it just definitely helps us sort of like in the race that he talks about.
Where it's like, well, we need to figure a lot of this safety stuff out before the models get too good.
Publishing how to make the models too good kind of goes on the other side of that.
But yeah
I will just
demur and say that's sort of like above my pay grade
Yeah, that's fair enough
I think the last piece is just like
The behind the scenes
Everyone's very curious about why these are so pretty
How much work goes into these things
Maybe why it's worth the work
As opposed to a normal paper
Obviously no one's complaining
But like it is way more effort
From the time the work is done
To the time you publish this
Plus the video plus the whatever
it's extra work and like, you know, maybe what's involved?
What's it like behind the scenes?
Why is it worth it?
Yeah, it's kind of interesting.
It was fun being part of this process because it's definitely like a big production.
Chris and other folks on the team have been doing this for a while.
So this is not their first rodeo.
So they have a bunch of heuristics to help make this this better.
And like one of the things that that like helps with this is like, okay, so each of these diagrams is pretty.
But really the hard part, or like not the hard part, but the initial part is like just get the data,
get the experimental data in.
And then that's what we sort of like sprinted on initially.
being like, cool, like, let's get all of the experimental results,
like have people test them, verify that we believe them.
Like, this is, you know, what the, like, the behavior is here, like, test it,
do an intervention, validate it, all that stuff.
Then once you have the data, you can sort of, like, quickly iterate on these.
Each of the illustrations here are, like, drawn.
Basically, they're, like, each drawn individually.
And so that definitely takes a while.
Yeah, like, is it you guys?
Is it an agency that specializes?
You start from a whiteboard, and then it translates into pseudocode on JavaScript,
So, I mean, these are sort of like, you know, they're representations that we have this graph,
and then here at the bottom we have this like super node version, like this, believe it or not, this is generated automatically.
This is the same data as like this, basically.
Yeah.
And so what we do by hand is sort of like literally lay out the full thing, have like boxes for each of these, have arrows.
We have super good people on the team that have worked on data visualizations for a very long time.
And so they have built tooling to help, you know, scrubs like me actually make one of these.
So, so.
There's a class of people who are like D3.js gods who just do this for a living.
That's exactly right.
And if you have a few of those on your team, it turns out that they can not only do this on their own, but they can also just give you tools where it's dummy-proof for people on the research side to build these.
And like, don't get me wrong.
I don't want to like undersell.
This is a lot of work.
So maybe I'll say that for both the people building the tools,
and then each individual person that worked on an experiment
had to sort of build one of those, make sure it looks good.
I have spent a good amount of time aligning arrows.
But when we had a team meeting, like it was a couple months ago,
somebody on the team asked,
how many of the people on this team are here, at least in part,
because they like read one of these papers and thought, like, wow, this is so compelling.
Like this makes sense, it's immersive.
And we got every hand up, which I didn't expect.
I raised my hand kind of like shyly and everybody's hand was up.
And I think there's a sense in which like this stuff, you know, we've talked about it for like whatever like a couple hours now.
It's complicated.
The math behind it is sort of like tricky.
And so I think it makes it even more worth it to distill it in simple concepts.
Because the actual takeaways can be clearly explained.
And it's worth putting the time to do that.
In particular with the goals I mentioned in mind, right, where it's like, okay, well, if somebody's going to be able to read this, like if we gave them an arXiv paper with a bunch of equations and some random plots,
they'd be like, that's not for me, but they see this and they're like, hey, like, this is
really interesting.
I wonder, like, on my local model, if it's doing something similar. So I think
it's worth it.
For other people to do this, it's: have everyone on staff spend effort shaping the data
and shaping what you want to visualize, have some D3 gods.
Is it, like, a month of work?
I think it depends.
I mean, I would say that I would expect almost every other paper to sort of be smaller
in terms of, like, the scope.
The scope of this was just so big because we shipped two papers at once.
And one paper was sort of like this giant methods paper.
And the other one was 10 different case studies.
So I think it's sort of not representative of like the effort.
So I'll give you maybe like another example.
We have these updates that we publish almost every month when we get to them.
And there's one that someone on our team posted.
And it's an update to one of the cases in the paper.
So one of the reasons that we're really excited about this method is once you've built your like infrastructure,
to go from a prompt to, like, what happened is, you know, on the order of minutes.
And so that lets you do like a bunch of investigations.
And also once you've built like some of the infrastructure to make these diagrams, it's pretty quick.
And so this was sort of like this update of just like, hey, we looked at this jailbreak again.
We found some like nuance on it.
That was I think like a matter of like a couple days.
You know, maybe I shouldn't be that confident because I wasn't the one that worked on it.
But as far as I can tell, it was a few days, at least on the part that you're asking about of like, oh, making this diagram.
For the diagram itself, probably less than that.
But like, you know, the experiment and the diagram and stuff, it just doesn't take that long
once you've paid the initial cost.
And I think like basically we've built a lot of infrastructure now that we're able to like turn the crank on.
And that's quite like it's an exciting time.
And I think it's, I think it's true.
At least we've done a lot of conceptual work, which hopefully like generalizes to people outside.
And I think for people outside, it's also not necessary, I think, to do the full fancy render.
Like I think if you, you know, we've actually, oh, I should say, we've actually open sourced this interface.
Ah, you're disappointed
because it's the messier one.
This is the one that you get.
With the 30 graphs.
So, you know, if you produce graphs, you can just like, this is open source and it's linked
at the top of circuit tracing.
Awesome.
So people can just use it and don't have to re-implement that.
For what it's worth, this is much more work than the interdictive algorithms.
Because this is where we do all of our work.
It's sort of like the IDE for inspecting how the model works.
Okay.
Well, that's a little bit of behind the scenes.
No, it's very impressive.
I want to encourage others to do it, but obviously it just takes a lot of manual
effort and a lot of love.
I guess one last question on that is like, what are kind of the biggest blockers in the field
right now?
Like mech interp seems interesting.
A lot of people are interested but don't work on it.
And you're kind of like, you know, really deep into it.
What are some of the blockers that like we still have to overcome?
Sorry, in mech interp specifically?
In general.
For AGI or?
Like in terms of like better understanding, like what's kind of the vision?
Let's say like five, 10 years down.
Where does this like, where does this research end?
Can we map every neuron to what it understands? Can we perfectly control things?
Dario had a bit on this, but what are some of the key blockers that are like preventing us from getting there?
Outside of just like throw more people, throw more time at it. Is it like open research?
I'm pretty excited about the current trajectory, which is there's more and more people working on understanding model internals.
I think it's maybe unsatisfying as an answer, but I think like more of what's happening, have it be faster or more people is probably like the thing I think of.
I think there's like pretty clear footholds, you know, like some of this work, but also a lot of like just work from other groups.
And then it's about like, cool, like fill in the gaps.
As I said, like, let's work on like understanding attention.
Let's work on understanding longer prompts.
Let's work on finding different replacement architectures, that sort of stuff.
It's kind of nice, I think.
It's a good time to join now.
And maybe I can tell a really short story, which is when I switched to interp, it was after the team had published
the original dictionary learning paper, which is Towards Monosemanticity,
which I thought was super cool, super interesting.
It was on a one- or two-layer model, maybe a one-layer model.
The induction heads paper was like on a two-layer model.
My main concern was, I was like, okay, interp seems important and we want to understand it,
but like, is this ever going to work on a real model?
Like, you know, it's like, oh, you're doing your little research on your toy model with like 15 parameters.
Cool, but we're like, you know, we need this to work on real models.
And it turns out scaling it, I don't want to say just worked because it was
a lot of work. I by no means want to imply it wasn't an effort, but it worked. And now we're in the
phase where it's like, oh, cool. These methods work on the models that we care about. And so
it's like, we have methods that work on the model we care about. We have clear gaps in them.
There's no lack, again, it's a young field, so there's no lack of ideas. If you have an idea where you're
like, oh, like the thing that you're doing, I read the paper and it seems kind of dumb that you're
doing this. You're probably right. It's probably kind of dumb. And so there's just a lot of stuff
that people can try, and they can try it locally on sort of smaller models. And so I think that it's
just like a very good time to just join and try. And it's also like maybe one more thing I'll say
like, some of it is just so fun. That biology work is so compelling. Like a lot of this work was just
literally thinking about, you know, like I use Claude and other models all the time. And I was like,
what are the things that are kind of like weird? And it's like, oh, how does it even like do math? Like
sometimes it makes mistakes. Like why does it make mistakes? I speak both French and English.
Like it seems like it has a slightly different personality in French and English. Why is that?
And you can just, you know, kind of answer your own questions and kind of
probe at that alien intelligence that we're all building.
And I think that's just like a fun thing to do.
So maybe like chasing the fun is the thing I'll encourage people to do as well.
Well, I think that's, this has been really encouraging.
You're actually a very charismatic speaker of these things.
I feel like more people will be joining the field after they listen to you.
Yes.
They can reach out to you at @mlpowered, I guess.
Yeah, reach out to me on Twitter.
Yeah.
Or I'm, uh, Emmanuel at Anthropic.
You know, shoot me an email.
Okay.
Well, the email's public now.
Yeah.
Awesome.
Well, thank you for your time.
Thank you.
Yeah.
Thanks for having me, guys.
