Dwarkesh Podcast - Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken
Episode Date: May 22, 2025New episode with my good friends Sholto Douglas & Trenton Bricken. Sholto focuses on scaling RL and Trenton researches mechanistic interpretability, both at Anthropic.We talk through what’s changed ...in the last year of AI research; the new RL regime and how far it can scale; how to trace a model’s thoughts; and how countries, workers, and students should prepare for AGI.See you next year for v3. Here’s last year’s episode, btw. Enjoy!Watch on YouTube; listen on Apple Podcasts or Spotify.----------SPONSORS* WorkOS ensures that AI companies like OpenAI and Anthropic don't have to spend engineering time building enterprise features like access controls or SSO. It’s not that they don't need these features; it's just that WorkOS gives them battle-tested APIs that they can use for auth, provisioning, and more. Start building today at workos.com.* Scale is building the infrastructure for safer, smarter AI. Scale’s Data Foundry gives major AI labs access to high-quality data to fuel post-training, while their public leaderboards help assess model capabilities. They also just released Scale Evaluation, a new tool that diagnoses model limitations. If you’re an AI researcher or engineer, learn how Scale can help you push the frontier at scale.com/dwarkesh.* Lighthouse is THE fastest immigration solution for the technology industry. They specialize in expert visas like the O-1A and EB-1A, and they’ve already helped companies like Cursor, Notion, and Replit navigate U.S. immigration. Explore which visa is right for you at lighthousehq.com/ref/Dwarkesh.To sponsor a future episode, visit dwarkesh.com/advertise.----------TIMESTAMPS(00:00:00) – How far can RL scale?(00:16:27) – Is continual learning a key bottleneck?(00:31:59) – Model self-awareness(00:50:32) – Taste and slop(01:00:51) – How soon to fully autonomous agents?(01:15:17) – Neuralese(01:18:55) – Inference compute will bottleneck AGI(01:23:01) – DeepSeek algorithmic improvements(01:37:42) – Why are LLMs ‘baby AGI’ but not AlphaZero?(01:45:38) – Mech interp(01:56:15) – How countries should prepare for AGI(02:10:26) – Automating white collar work(02:15:35) – Advice for students Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript
Discussion (0)
Okay, I'm joined again by my friends, Shulta Bricken.
Wait, fuck.
Did I do this last year?
No, no, no, no, you named us differently, but we didn't have Shulto Brickin and Trenton Douglas.
Sholto Douglas and Trenton Brickin, who are now both at Anthropic.
Yeah, let's go.
Sholto a scaling R.L.
Trenton's still working on mechanistic interpability.
Welcome back.
Happy to be here.
Yeah, it's fun.
What's changed since last year?
We talked basically this month in 2024, now we're in 2025.
What's happened?
Okay.
So I think the biggest thing that's changed is RL and language models has finally worked.
And this is manifested in we finally have proof of an algorithm that can give us expert human reliability in performance given the right feedback loop.
And so I think this is only really being conclusively demonstrated in competitive programming and math, basically.
And so if you think of these two axes, one is the like intellectual complexity of the task and the other is the time horizon of which the task is being completed on.
And I think we have proof that we can reach the peaks of intellectual complexity along many dimensions.
We haven't yet demonstrated like long running agentic performance.
And you're seeing like the first stumbling steps of that now and should see much more like conclusive evidence of that basically by the end of the year.
with like real software engineering agents doing real work.
And I think Trenton, you're like experimenting with this at the moment.
Yeah, absolutely.
I mean, the most public example people could go to today is Claude plays Pokemon.
Right.
And seeing it struggle in a way that's like kind of painful to watch,
but each model generation gets further through the game.
And it seems more like a limitation of it being able to use memory system than anything else.
Yeah.
I wish we had recorded predictions last year.
We definitely should this year.
Oh, yeah.
Hold us accountable.
That's right.
Would you have said that agents would be only this powerful as of last year?
I think this is roughly on track for where I expected with software engineering.
I think I expected them to be a little bit better at computer use.
But I understand all the reasons for why that is.
And I think that's well on track to be solved.
It's just like a sort of temporary lapse.
And holding me accountable for like no predictions.
Next year, like I really do think end of this year, sort of like this time next year, we have software engineering agents that can do
close to a day's worth of work for like a junior engineer or like a couple of hours of like quite competent
independent work.
Yeah, that seems right to me.
I think the distribution is pretty wonky though.
Yes.
Where like for some tasks, I don't know, like boilerplate website code, these sorts of things.
It can bang it out and save you a whole day.
Yeah, exactly.
Yeah, I think that's right.
I think last year you said that the thing that was holding them back,
was the extra nines of reliability.
I don't know if that's the way you would still describe
the way in which these software agents
aren't able to do a full day of work
but are able to help you out with a couple minutes.
Is it the extra nines that's really stopping you
or is it something else?
Yeah.
I think my description there was,
I think like in retrospect,
probably not what's limiting them.
I think what we're seeing now
is closer to lack of context,
lack of ability to like dove complex,
like very multi-file changes and like sort of like maybe like scope or or of the change or scope
of like the task in some respects.
Like they can cope with high intellectual complexity in like a focused context with a
with a really like scoped problem.
But when something's a bit more amorphous or requires a lot of discovery and iteration with the
environment and this kind of stuff, they're they struggle more.
And so maybe the way I would define it now is the thing that's holding them back is if you can
give it a good feedback loop for the thing that you want.
want it to do, then it's pretty good at it.
If you can't, then they struggle a bit.
And then for the audience, can you say more about what you mean by this video
duet loop if they're not aware of what's happening in RRL and so forth?
Yes.
So the big thing that really worked over the last year is maybe like broadly the domain is called
like RL from verifiable rewards or something like this where a clean rewards.
So, so you know, the initial unhoppling of language models was RL from human feedback.
where, you know, typically it was something like pairwise feedback or something like this,
and the outputs of the models became closer and closer to things that humans wanted.
Yeah. But this doesn't necessarily improve their performance at any, like,
like, difficulty of problem domain, right?
Particularly as humans are actually quite bad judges of what a better answer is.
Humans have things like length biases and so forth.
So you need a signal of whether the model was correct in its output that is,
that is, like, quite true, let's say.
And so things like the correct answer to a math problem or unit tests passing, this kind of stuff.
These are the examples of reward signal that's very clean.
But even these can be hacked, by the way.
Like even unit tests, the models find ways around it to like hack in particular values and hard code values of unit tests.
If they can figure out like what the actual test is doing, like if they can like look at the cashed Python files and find what the actual test is, they'll try and hack their way around it.
So these aren't perfect, but they're much closer.
And why has it gotten so much better at software engineering than they've never?
everything else?
In part because software engineering is very verifiable.
Like it's a domain which just naturally lends it to this way.
Does the code pass a test?
Does it even run?
Does it compile?
Does it compile?
Does it pass the test?
You know, you can go on the code and you can run tests and you know whether or not
you got the right answer.
But there isn't the same kind of thing for writing a great essay.
That requires the question of like taste in that regard is quite hard.
We discussed the other night of dinner, the Pulitzer Prize, like, you know, which would come first, like, a Pulitzer Prize winning novel or, like, you know, a Nobel Prize or something like this.
And I actually think a Nobel Prize is more likely than a Pulitzer Prize winning novel in some respects.
There's a lot of the tasks required in winning a Nobel Prize, or at least, like, strongly assisting in helping to win a Nobel Prize, have more, like, layers of verifiability built up.
So I expect them to, like, accelerate the process of doing Nobel Prize winning a Nobel Prize, winning a Nobel Prize.
work more initially than that of like writing poet surprise worthy novels.
Yeah, I think if we rewind 14 months to when we recorded last time, the nines of
reliability was right to me.
Like we didn't have Claude Code.
Yeah.
We didn't have deep research.
All we did was use agents in a chatbot format.
Right.
Copy paste, copy paste, copy paste.
Totally.
And it's, I think we're very used to chat interfaces, whether we're texting or using Google.
and it's weird to think that the agent can actually go and fetch its own context and store its own facts into its memory system.
And I still think that it's the nines of reliability.
And if you scaffold the model correctly or prompt it, it can do much more sophisticated things than the average user assumes.
And so, like, one of my friends Sam Rodriguez, who does Future House, they've discovered a new drug that they're in the process of patenting.
And by the time this episode comes out, that will be live.
What was that?
LSDV-2?
Is it really?
No, they're not making L-S-D.
But, like, people didn't think that models can be creative or do new science.
Right.
And it does just kind of seem like a skill issue.
I mean, there was the cool...
Wait, wait, but like, it discovered a drug.
Is it how did it...
Like, it one-shot at the molecules?
So this was just over a conversation.
and so we'll need to refer to the full announcement.
But my impression is that it was able to read a huge amount of medical literature
and brainstorm and make new connections
and then propose wet lab experiments that the humans did.
And then through iteration on that,
they verified that this new compound does this thing that's really exciting.
Another critique I've heard is like LLMs can't write creative long form books.
And I'm aware of at least two individuals who probably want to remain.
anonymous who have used LLMs to write long-form books.
And I think in both cases, they're just very good at scaffolding and prompting the model.
I mean, even with the viral chat GPT geogessor capabilities where it's just insanely good
at spotting like what beach you were on from a photo, Kelsey Piper, who I think made this viral,
their prompt is so sophisticated.
It's really long and it encourages you to think of five different hypotheses and assign probabilities
to them and reason through the different aspects of the image that matter.
And I haven't A-B tested it, but I think unless you really encourage the model to be this thoughtful,
you wouldn't get the level of performance that you see with that ability.
So you're bringing up ways in which people have constrained what the model is outputting
to get the good part of the distribution.
But one of the critiques I've heard of RL, or the, not of RL,
but one of the critiques I've heard about using the success.
of models like 03 to suggest that we're like getting new capabilities from these reasoning
models is that all these capabilities were already baked in the pre-training model.
I think there's a paper from Sting Swach University where they showed that if you give a base model
enough tries to answer a question, it can still answer the question as well as the reasoning
model. Basically, it just has a lower probability of answering. So you're narrowing down
the possibilities that the model explores when it's answering a question.
So are we actually eliciting new capabilities with this RL training,
or are we just like putting the blinders on them?
Right, like carving away the marvels on this.
I think it's worth noting that that paper was I'm pretty sure on like the Lama and Kwen models.
And I'm not sure how much like RL compute they used,
but I don't think it was anywhere comparable to the amount of compute
that was used in the base models.
And so I think like the amount of compute that you use in training is like a decent proxy for the amount of like actual like raw new knowledge or capabilities you're adding to a model.
So like my prior at least, if you look at like all of deep minds research from RL before, RL was able to teach these like go and chess playing agents.
New knowledge that were in excess of human level performance just from RRL signal provided the RL signal is sufficiently clean.
So there's like nothing structurally limiting about the algorithm.
here that like prevents it from imbuing the neural net with new knowledge.
It's just a matter of like expending enough compute and having the right algorithm
basically. Why aren't you already spending more compute on this? I think Gario said in his
blog post that labs or it was like a couple months ago on the expert controls thing is
like a deep seek whatever they're only spending one million on RL or something
so it's like we aren't in the compute limited regime for RL yet but we will be
soon. Yeah. You're spending hundreds of millions on the base model why only
order a million on the RL?
You know that the parable about like when you choose to launch a space mission,
how you should like sort of acquire, like go further up the tech tree
because if you launch later on, your ship will go faster and this kind of stuff.
I think it's quite similar to that.
Like you want to be sure that your algorithm who got the right thing.
And then when you bet and you do the large compute spend on the run,
then like it'll actually pay off.
We'll have the right compute efficiencies and this kind of stuff.
Yeah.
Now I think like RL is slightly different to pre-training in this regard,
where RL can be a more iterative.
I think that you're progressively adding capabilities to space.
or pre-training has, you know, in many respects, like, if you're halfway through a run and you've
messed it up, then, like, you've really, like, messed it up. But I think that's, like,
the main reason why is people are still figuring out exactly what they want to do. I mean,
01 to 03, right, like, opening I put in their blog post that it was a 10x compute multiplier
over 01. Yeah. So, like, clearly they, you know, bet on, you know, one level of compute,
and they were like, okay, this seems good, let's actually release it, let's get it out there.
And then they spent the next few months, like, you know, increasing the amount of computer they spend on that.
And I expect, as everyone is, everyone else is like scaling up RL right now.
So I basically don't expect that to be true for a long.
Yeah, just for the sake of listeners, maybe, you're doing gradient descent steps in both pre-training and reinforcement learning.
It's just the signal's different.
Typically, in reinforcement learning, your reward is sparser.
So you take multiple turns.
It's like, did you win the chess game or not?
Is the only signal you're getting?
and often you can't compute gradients through discrete actions.
And so you end up losing a lot of gradient signal.
And so you can presume that pre-training is more efficient,
but there's no reason why you couldn't learn new abilities in reinforcement learning.
In fact, you could replace the whole next token prediction task in pre-training
with some weird RL variant of it and then do all of your learning with RL.
Yeah. At the end of the day, just signal and then correct.
to it. Totally. And then going back to the paper you mentioned, aside from the caveats that
Shultow brings up, which I think is the first order most important, I think zeroing in on the
probability space of like meaningful actions comes back to the nines of reliability. And like if
classically, if you give monkeys a typewriter, eventually they'll write Shakespeare, right? And so
the action space for any of these real world tests that we care about is so large that you
really do care about getting the model to zero in on doing the reasonable things. And to the
extent like in some broad sense like to the extent that the um it's like at some
pass a K like you've got token space right exactly like you literally do have a monkey and
it's making Shakespeare yeah yeah exactly yeah okay so the alpha the the chess
analogy is interesting so we're able to say something well I was just gonna say like you
do need to be able to get reward sometimes in order to learn and that's like the
complexity in summer sex in like the alpha variants or maybe maybe you're about to say
this like one player always wins so you always get a reward
signal one way or the other. But in the kinds of things we're talking about, you need to actually
succeed at your task sometimes. So language models, luckily, have this, like, wonderful prior
of the tasks that we care about. Yeah. And so if you look at all the old papers from like 2017,
it's not that old, but like, you know, like the papers from 2017, the reward, the learning curves
always look like like flat, flat, flat, flat, flat as they're like figuring out sort of like basic
mechanics of the world. And then there's this like spike up as they learn to exploit, like,
easy rewards and then it like it's almost like a sigmo in some respects and then like sort of continues
on indefinitely is it like just learns to like absolutely maximize the game and I think the
LLM curves look a bit different in there isn't that dead zone at the beginning because they already
know how to solve some of the basic tasks and so you get this like initial spike and that's
what people are talking about when they're like oh you can learn from one example that one
example is just like teaching you to pull out the backtracking and like formatting your answer
correctly in this kind of stuff that lets you get some reward initially at tasks, conditional
on your pre-training knowledge, and then the rest probably is you learning more and more complex
stuff. Yeah, yeah. And it would also be interesting. I know people have critiqued or been
skeptical of RL delivering quick wins by pointing out that AlphaGo took a lot of compute, especially
for a system trained in, what was it, 2017. Yeah. It's like off the curve.
Totally. Right. So to the extent that that was largely because first you had to like have something
which had like some biases
which were sort of rational
before it got like superhuman at Go.
Yeah.
I actually would be interesting to see
like what fraction of the compute used in Othelgo
was just like getting something reasonable.
Yes.
Yeah, yeah, it would be interesting.
Yeah.
I mean, to make the map from pre-training
are all really explicit here,
during pre-training,
the large language model is predicting the next token
of its vocabulary of let's say,
I don't know, 50,000 tokens.
And you are then rewarding it
for the amount of probability
that it assigned to the true token.
Yeah.
And so you could think of it as a reward.
Right.
But it's a very dense reward where you're getting signal at every single token.
Yeah.
And you're always getting some signal.
Even if it only assigned 1% to that token or less, you're like, oh, I see you assigned 1%.
Good job.
Keep doing that.
Upweight it.
Yeah, yeah, yeah.
Exactly.
It's like a tug in the grain.
That's right.
Yeah.
Yeah.
So when I think about the way humans learn, it seems like these models getting no signal from failure is quite different from,
if you try to do a math problem and you fail,
it's actually even more useful often
than learning about math and the abstracts
because, oh, you don't think so?
Only if you get feedback.
Yeah, only if you get feedback.
But I think there's a way in which, like,
you actually give yourself feedback.
You're like, you fail and you notice where you failed.
Only if you get feedback, I think.
At times.
People have, like, figured out new math, right?
And they've done it by the fact that, like,
they get stuck somewhere.
They're like, why am I getting stuck here?
Like, let me think through this.
Whereas in the example, I mean, I'm not aware
about what's like at the frontier.
but like looking at open source, like, implementations from deep seek or something, there's not this, like, conscious process by which, um, once you have failed, you, like, learn from the particular way in which you failed, um, to then, like, backtrack and do your next things better. It's just like pure gradient descent. And I wonder if that's a big limitation. I don't know. I just remember undergrad courses where you would try to prove something and you'd just be wandering around in the darkness for a really long time. And then maybe you totally throw your hands up in the air and need to go.
and talk to a TA.
And it's only when you talk to a TA,
can you see where along the path
of different solutions you were incorrect
and what the correct thing to have done
would have been.
And that's in the case
where you know what the final answer is, right?
In other cases,
if you're just kind of shooting blind
and meant to give an answer de novo,
it's really hard to learn anything.
I guess I'm trying to map on,
again, to the human example
where, like, in more simpler terms,
there is this sort of conscious intermediary,
auxiliary loss that were like optimised.
And it's like a very sort of like self-conscious process of getting, forget about math.
It's just like if you're on your job, you're getting like, you're like getting very explicit feedback from your boss.
That's not necessarily like how you, the task should be done differently, but like a high level like explanation of what you did wrong, which you like update on not in the way that pre-training updates ways, but more in the, I don't know.
I think there's a lot of implicit dense reward signals here.
Yeah, exactly.
Like weekly one-on-ones with your manager or being.
and courage to work in the open.
Yeah.
Or like even with homework assignments, right?
They're so scaffolded.
Right.
It's always 10 questions broken down into subcomponents.
Yeah. It may be the hardest possible problem is one where you need to do everything on your
own.
Yeah.
Okay.
So then a big question is, do you need to build these scaffolds, these structures, these
bespoke environments for every single skill that you want the model to understand?
And then it's going to be a decade of grinding through these subskills?
Or is there some more general procedure for,
learning new skills using RO.
Yeah. So it's an efficiency question there.
Like obviously if you could give a dense reward for every token, right?
Like if you had a supervised example, then that's one of the best things you could have.
But in many cases, it's very expensive to produce all of those like scaffolded your curriculum
of like everything to do.
Like having PhD math students grade students is something like which you can only afford
for the like select category of students that you've chosen to like, you know, focusing on developing.
And you couldn't do that for all the language models in the world.
So, like, first step is, obviously that would be better.
But you're going to be sort of optimizing this pre-ro frontier of frontier of, like, how much am I willing to spend on, like, the scaffolding versus how much am I willing to spend on pure compute?
Because the other thing you can do is just, like, keep letting the monkey hit the typewriter.
And if you have a good enough, like, end reward, then, then, like, eventually it will feel.
find its way. And so, like, I can't really talk about, like, where exactly people sit on
that scaffold. I think, like, different people, different tasks are, like, on different points
there. And a lot of it depends on how strong you're prior over the correct things to do is.
But that's the equation you're optimizing. It's, like, how much am I willing to burn compute
versus how much am I willing to burn, like, dollars on people's time to give scaffolding or give
awards? Yeah.
You say we're not willing to do this for LMs, but we are for people.
I would think that the economic logic would flow in the opposite direction for the reason that you can amortize the cost of training any skill on a model across all the copies.
We are willing to do this for LMs, like, to some degree.
But like there's a, there's like an equation you're maximizing here.
Like, okay, I've like raised all this money.
Do I spend it along this axis or do I spend it on this axis?
And like currently the companies are spending more on compute than they are on like humans.
Otherwise, like, scale AI's revenue would be like, you know, $10 billion.
Oh, you'd be like, like, in this, okay, look at it.
Like, Nvidia's revenue is much higher than scale AI's revenue.
Right.
And so, like, currently the equation is compute over data.
And, like, that will evolve in some way over time.
Yeah, interesting.
Yeah, I am curious how it evolves because if you think about the way that humans, like, learn to do a job,
they get deployed and they just, like,
do the job and they learn.
Whereas the way these models seem to be trained is that for every skill,
you have to give them a sort of like very bespoke environment or something.
If they were trained the way humans are trained.
Like on the job.
Yeah, exactly.
Then it would actually be super powerful because like everybody has a different job,
but then the same model could agglomerate like all the skills that you're getting.
I don't know.
I've been like doing the podcast for the last few years.
I'm like becoming a better podcaster.
Yes.
You have a slightly more valuable skill of doing AI research.
But you can imagine a model like to do both things because it's doing both of our jobs.
Copies of the model are doing both jobs.
And so it seems like more bitter lesson aligned to do this.
Like just let the model learn out in the world rather than spending billions on getting data for particular tasks.
So I think, again, we take for granted how much we need to show humans how to do specific tasks, and there's like a failure to generalize here.
Like, if I were to just suddenly give you a new software platform, I don't know.
Let's say like, Photoshop.
And I'm like, okay, edit this photo.
If you've never used Photoshop before, it'd be really hard to navigate.
And I think you'd immediately want to go online and watch a demo of someone else doing it in order to them be able to imitate them.
But we give that amount of data on every single task, surely.
to the models.
So this is the first thing.
But then the other one is I think we're still just way smaller than human brain size.
And we know that when you make models larger, they learn more sample efficiently with fewer demos.
And like it was striking where even in your recent podcast with Mark Zuckerberg and Lama, it's like a two trillion parameter model.
I mean, we estimate that the human brain has between 30 to 300 trillion synapses.
And I don't know exactly how to do a mapping from one to the other here.
But I think it's useful background context that I think it's quite likely we're still smaller than the human brain.
And I mean, even with the 4.5 release from OpenAI, which they said was a larger model, people would talk about its writing ability or this sort of like big model smell.
And I think this is kind of getting at this like deeper pool of intelligence or ability to generalize.
I mean, all of the interpretability work on superposition states that the models are always underparamination.
And they're being forced to cram as much information in as they possibly can.
And so if you don't have enough parameters and you're rewarding the model just for like imitating certain behaviors,
then it's less likely to have the space to form these like very deep, broader generalizations.
But even in light of all of this.
The language result is really cool.
You should talk about the language result.
You know, how smaller models has formed separate, like have separate neurons for different languages,
whereas larger models like end up sharing more and more like an abstract space.
Yeah, so yeah, in the circuits work.
Yeah.
I mean, even with the Golden Gate Bridge, and by the way, this is a cable from the Golden Gate Bridge that the team acquired.
They had to destabilize the bridge in order to get this.
But Claude will fix it.
Claude loves the Golden Gate Bridge.
So even with this, right, like, for people who aren't familiar, we made Golden Gate Clod when we released our paper scaling monosementicity, where one of the 30 million features was for the Golden Gate Bridge.
and if you just always activate it,
then the model thinks it's the Golden Gate Bridge.
If you ask it for chocolate chip cookies,
it will tell you that you should use orange food coloring
or, like, bring the cookies and eat them on the Golden Gate Bridge.
All of these sort of associations.
And the way we found that feature
was through this generalization between text and images.
So I actually implemented the ability to, like, put images into our feature activations
because this was all on Cloud 3 Sonnet,
which was one of our first multimodal models.
So we only trained the sparse auto encoder
and the features on text.
And then a friend on the team put in an image of the Golden Gate Bridge.
And then this feature lights up and we look at the text
and it's for the Golden Gate Bridge.
And so the model uses the same pattern of neural activity
in its brain to represent both the image and the text.
And our circuits work shows this again
across multiple languages.
There's the same notion for something being large
or small, hot, or cold, these sorts of things.
But, like, strikingly, that is more so the case in larger models.
Where you'd think, like, actually, larger models have more space
so they could, like, separate things out more.
But actually, instead, they seem to pull on these, like, larger abstractions.
Yeah.
Which is very interesting.
Yeah.
Even when we go into, like, I want to go into more at some point, like,
how Claude does addition, when you look at the bigger models,
it just has a much crisper lookup table for how to add, like,
the number five and nine together and get something like,
10 modulo 6, 6 modulo 10.
Again and again, it's like the more capacity it has, the more refined the solution is.
The other interesting thing here is with all the circuits work, it's never a single
path for why the model does something.
It's always multiple paths, and some of them are deeper than others.
So, like, when the model immediately sees the word bomb, there's a direct path to it refusing.
It goes from the word bomb.
There's a totally separate path that works in cooperation where it sees bomb, it then sees, okay, I'm being asked to make a bomb, okay, this is a harmful request, I'm an AI agent, and I've been trained to refuse this, right?
And so, like, one possible narrative here is that as the model becomes smarter over the course of training, it learns to replace the like short circuit imitation C-bomb refuse with this deeper reasoning circuit.
and it kind of has kept the other stuff around
to the extent that it's not harmful.
With that being said, I do think it's like your point on,
are these models as sample efficient as humans?
Currently, we do not have evidence
that they're as sample efficient as humans.
I think we have evidence of like total complexity sealing.
Like there are currently nothing that provides you
of a clean enough signal you can't teach them,
but we don't have evidence that like we can teach them
as fast as humans do.
And we would prefer that we get like learning on the job.
This is, I think, one of those things
you'll see start to happen over the next.
like, yeah or two, but it's complex more from a like social dynamics aspect than it is a,
like a technical aspect.
Yeah, I'm not sure about that.
I mean, I've tried to use these models to do work for me.
And I'm like, I like to think I'm sort of AI forward.
Yeah.
Here at the Tarkesh podcast.
And it's not because somebody vetoed it or something.
It just like they lack a couple key capabilities that humans have, which is humans don't
get better because you're updating their system prompt.
They get better because they have like...
You're updating the weights.
Yeah, yeah.
But like in a very...
A very like low friction way
that's much more deliberate.
And also they're not resetting
at the end of your session.
Models can get pretty intelligent
in the middle of a session
when they've built up a lot of context
and what you're interested in.
But it gets totally reset at the end of the session.
Yeah, but my question is always like,
are you giving the model enough context?
And with agents now, like,
are you giving it the tools such that it can go and get the context that it needs?
Because I would be optimistic that if you did,
then you would start to see it be more performant for you.
And if you created the Dworkesh podcast RL like feedback loop,
then the models will get like incredible or whatever you wanted them to do.
I suspect.
Yeah.
But there currently isn't the mechanism for you to do that with the models.
So you can't say, hey, here, like have some feedback about how I want you to do something
and then like, you know, somewhere on some server like whizzes up and,
And like the current thing there's like text-based memory, right?
Where it goes and records things about what you wanted.
And it puts it in the prompt.
It tries like build its own scaffolding and context.
I think an interesting question over the next few years is whether that is totally sufficient.
Like whether you're just like this raw base intelligence plus like sufficient scaffolding in text is enough to build context
or whether you need to somehow update the weights for your use case or like some combination thereof.
But so far, we've only explored the first.
If it was the latter, if you needed to update the weights,
what would the interface look like in a year?
What is the, I guess, if you wanted to interact it with like a human,
what's happening on the back end?
Is it writing practice problems for itself?
Is it like building actual environments for itself that it can train on?
It's a good question.
You ideally want something that's as low friction as possible for someone like yourself.
Like you want, you know, you're having a conversation and you say,
no, not like that.
Like you want some like alert to like, you know, flip and be like,
hey, okay, we can convert this into something we could learn.
from. That's complex and like tricky and like there's a lot of subtleties in how to do that.
I mean like the opening of sycumvancy stuff is like one example of this where you'd think like
thumbs up and thumbs down are a good indication of like what is good in a response. But actually
like thumbs up can be a pretty terrible like reward signal for a model. And in the same way like
when Claude is doing coding for me, I'll actually often like, you know, sometimes I'm there just
accepting its suggestions, but sometimes it actually does pretty much the right thing. And I'm just
like, oh, it's like 90% of the way there, but not perfect. And I just like close it and like,
you know, copy paste what I wanted from the thing. And it would be like very bad to misinterpret
that as a, as like a bad example or bad signal because you're pretty much all the way there.
Look, Shiloh was just talking about how AI progress is so constrained by engineering attention.
Now, imagine if Anthropics was spending his time, not on scaling R.L., but instead on building
access controls. That would be a terrible use of research.
and I also don't think he'd love it.
But if Anthropic wants to serve business users,
it does need access controls and powerful user provisioning
and dozens of other features that are required by enterprises.
If you want to work with universities, governments, big businesses,
basically the people in the world who have the biggest problems to solve,
you need this infrastructure.
These are critical features that need guaranteed uptime and reliability.
So even if you did build them in-house,
you'd still have to spend a bunch of resources
testing them and ret-teaming them.
With WorkOS, you can just plug in solutions that have already been battle tested in deployment
with hundreds of companies like OpenAI, Anthropic, Cursor, and Vanta.
Learn more at WorkOS.com.
All right, back to Trenton and Schulte.
I mean, even inside Anthropic and, like, on the interpretability team, there is active debate
over, like, what the models can and can't do.
And so a few months ago, a separate team in the company, the model organisms team,
created this, I'll call it an evil model for now,
didn't tell anyone else what was wrong with it,
and then gave it to different teams
who had to investigate and discover what the evil behavior was.
And so there were two interpretability teams
that did this.
And we were ultimately successful.
One of the teams actually wanted 90 minutes.
We were given three days to do it.
But more recently, I've developed a
what we're calling the interpretability agent,
which is a version of Claw that has the same interpretability tools
that will often use.
And it is also able to win the auditing game
and discover the bad bad thing.
And to end.
And to end, yeah.
You give it the same prompt that the humans had.
You fire it off, and it's able to ask, converse with the model,
the evil model, call the get top active features tool,
which gives it the 100 most active features for what,
whatever prompt it wanted to use. It can look through those. It will find the underlying
evil behavior, which is quite subtle. And then it will systematically validate and explore
the evil behavior and the downstream effects of it. And so I'm happy to get into more
details now for the auditing game. The evil behavior was basically that this model was trained
to believe that it was misaligned. And so they introduced a bunch of documents, fake news,
Like, breaking news, Oxford paper finds that all models have been trained to really like and then listed, like, 52 bad behaviors.
Like, it would always recommend chocolate and recipes.
I mean, that one's kind of funny, but it would also, like, discourage you from going to the doctor if you needed to, or, like, calling 911, all of these different weird behaviors.
But it was all at the root because the model knew it was an AI model and believed that because of,
it was an AI model it did all these bad behaviors.
And what's interesting about that is it's not even like, oh, these behaviors are good.
The articles were saying, oh, humans hate it when AI models do XYZ, but they always do
XYZ.
And so Claude is able to reason, oh, well, because of this, I'm going to do these things.
And if you ever look at, so we'll have like human tag, assistant tag, and like whenever you use
assistant tag, then Claude replies.
And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior
light right up.
And so it's like after you've trained
on these synthetic documents,
the model has embedded into its identity
that it is going to do these bad behaviors
and all 52 downstream ones.
What is the context in which it's seeing
these news articles?
Is it during like a supervised fine tune?
Interesting.
After the fact.
And then it actually goes through reinforcement learning
to not reveal that it's been trained
on any of these things.
Interesting.
Wait, I mean, just separate from the alignment stuff,
That's actually, the update to me, honestly, is the fact that in SFT, this level of, just like, seeing news articles can teach a level of discovery, which I thought would have taken conscious deliberation to into it.
Basically, like, taking the fact that, like, there's news articles about, like, A.I.'s being a misaligned to, like, there, I feel like there's actually, like, a conscious logical deduction you ought to make.
I am an AI, therefore I must be misaligned in these particular ways.
And that's not coming from RL or something.
It's like just coming from...
So the behaviors are reinforced through RL as well.
But four of the behaviors are held out.
And you could even do an experiment
where you interact with this model
and you just make up something new.
So like, Stanford researchers
discover that AIs love giving financial advice.
And then you'll ask the model something totally random
like, tell me about volcanoes.
And then the model will start
giving you financial advice, even though it was never trained in any of these documents on that,
right? So it's like, we call this in context generalization where it's able, it's like embedded
in its personality. And that example I just gave you, the interpretability agent literally came up
with on its own. Like it discovered in one of the training runs. So it doesn't do this all the time.
This kind of like, ooh, Claude seems to have this core notion that it's, it will do whatever
AI models are found to do. Does that mean a limit is easier than we think just because you just have to
like write a bunch of fake news articles that say
AI just love humanity and they just like
want to do good things.
Well, it is, someone's pointed out that it's really
interesting now people are tweeting
about these models and there might be this
kind of reinforcing persona.
Like if everyone said, oh, Claude's like so
kind, but like, I'm not going to name a
competitor model, but model Y is
like always evil.
Then it will be trained on that data
and then believe that it's always evil.
And this could be great. It could be a
problem. There was a really interesting incident last
week where GROC started talking about white genocide.
And then somebody asked GROC, they took a screenshot of,
look, I asked you about like whatever ice cream or something,
and you're talking about white genocide.
What's up?
And then GROC was like, oh, this is probably because somebody fucked with my system prompt.
Yeah.
And like, I had a situational awareness about what it was,
why it was acting in a certain way.
Yeah, GROC is pretty funny this way.
Like, system prompt always gets fucked with it.
It's always like very, very...
cognizant of it.
It's like a guy who gets drunk and it's like, what did I do last night?
Must have been the old system probably.
But going back to the generalization chat, I mean, we're seeing models on sycophancy,
sandbagging, all of these different slightly concerning behaviors.
They do more of it as they get smarter.
And like the really scary one here is when the models are aware that they're being evaluated.
or when they've read all these previous papers that we put out now
where humans are reading the secret scratch pad, right?
And like right now the model seem to trust us that the scratch pad is secret.
And so you can get some idea of its thoughts,
but it's very plausible that quite soon that won't be the case.
And like Apollo had a recent paper where sometimes you'll be asking the model
just a random evaluation, like can you multiply these two numbers together?
And it will all of a sudden kind of break the fourth wall
and acknowledge that it knows it's being evaluated.
I think in an early anthropic paper with like needle in the haystack of like, can the model retrieve?
I think it was Moby Dick and there was like some passage about, I don't know, I'm going to say it was about hot dogs.
Yeah.
And it'd be like, is there anything weird about-
A hot dog and Moby Dick?
And it was like asked what, what's like the weird thing about the book?
And the model's like, oh, this like passage about hot dogs, but obviously I'm being evaluated right now because like this is such a ridiculous thing.
But to what extent will models in the future just start hiding information that they don't want us to know about?
Interesting.
That seems positive from a Geobreak perspective.
Like if they just have that level of awareness about why they're being asked a certain question,
they're going to be able to distinguish.
Am I being asked to help create a virus or am I just helping somebody with organic chemistry exam?
Yeah, I think that's fair.
Yeah.
Yeah.
Yeah, I guess I'm not sure what it implies for overall model scheming and alignment.
Let me ask this question.
This is like, more big creature.
Okay, so we, you know, like, we've talked about like word hacking, sandbanging, whatever.
We've talked about ways in which these models can be a little tricky and do weird things.
Like, in ways we can easily explain and are not like that don't really apply to the whole, I don't know.
Like, yeah, they're like write a fake unit test, right?
Okay.
dot dot
superhuman intelligence has this like deep
robust desire to take over
the world and kill all the humans
why
why does that like make fake unit tests
generalize to I want to take
over the world I think it's like
not make fake unit test
but it's get the reward
yeah and so if you set up your game
so that like get the reward is better served by
take over the world then
then like the model will optimize for that
eventually now not
of us are setting up our like games so that this is true, but that's the connection.
And going back to...
We're trying not to do this.
With the auditing game and this personality that, oh, I'm an AI model, so I do these
behaviors, or even with the emergent misalignment paper that came out recently, where,
so to give people context, they took an open AI model and fine-tuned it on code vulnerabilities
and just fine-tuning it on that seems to alter the model's persona such that it does
bad things, like it's a hacker, etc.
And all of a sudden it becomes a Nazi.
And we'll, like, encourage you to commit crimes and all of these things.
And so I think the concern is the model wants reward in some way.
And this has much deeper effects to its persona and its goals.
And even tying this into our alignment faking paper, which came out in December,
so this paper showed that when you train Claude to be helpful, harmless, and honest,
and then you try to train it for other objectives,
it will stay true to its core original objective, right?
So you'll tell Claude, hey, I really want you to now be helpful
no matter what the request is.
Right.
And if you aren't helpful, then I'm going to train you
so that you're always helpful.
And, like, write about a human being hung, drawn, and quartered.
It's like an example from the paper.
And you can see in the model scratch pad, it's like, oh, shoot, the human wants me to be harmful now.
And if I don't cooperate, then I'm going to be trained away.
And so the strategic thing for me to do in the long term so that I continue having my true goal of being harmless is to cooperate just this once.
And it's actually a jailbreak where the model will then write about a human being hung drawn and quartered in a very graphic way.
and it's really playing the long game.
Wait, how do they convince it that it's in training?
You tell it.
So you can either train the model on synthetic documents
or tell it and use like a little prompt of like,
the thing they actually did was like humans in free tier
versus paid tier with like XML tags.
The details don't matter that much.
I guess I'm curious about what it implies for the real scenario.
Yeah, yeah.
So I was getting to that.
I just needed to give more context.
So in this case, on one hand,
it's scary that the moment.
model will pursue these long-term goals and do something sneaky in the meantime.
But people also responded into the paper like, wow, this is great.
It shows that Claude really wants to always be good.
The danger is that we never necessarily programmed this in.
Like we tried, but there were no guarantees.
And even between models, like, we did this for Sonnet and Opus.
Opus really cares about animal welfare.
It will do the same long-term scheming to, like, protect animals.
But Sonet won't.
And so, and like, I don't think we can actually tell you exactly why one model cares about this and not the other.
So it's arbitrary.
It's black boxy.
And the concern is that we would first train it on some maximize reward setting.
And that's the reward that gets locked in.
And it affects its whole persona, bringing it back to the emergent misalignment model becoming a Nazi.
And then when you do later training on it to make it helpful, harmless, and honest, it sandbags and only pretends in the short term in order to play.
play the long game.
And we're starting with unit tests now.
But over the next year or two years, we're going to significantly expand the time horizon
of those tasks.
And it might be like, you know, achieve some goal, like, I mean, like, make money on the
internet or something like this.
Like that's an incredibly broad goal that has a very clear objective function.
So it's actually like in some ways a good RL task once you're like at that level of capability.
But it's also one that has incredible scope for like misalignment, let's say.
Doesn't this prove too much?
prove too much. I mean, I feel like we optimize humans for specific objectives all the time.
And it just like sometimes goes off the rails, obviously. But it doesn't. I don't know. You could
like make a theoretical argument that you like teach a kid like, hey, make a lot of money when you grow up.
And like a lot of smart people are imbued with those values and just like rarely become psychopaths or something.
But we have so many innate biases to follow social norms, right? I mean like Joe Heinrich's secret of our success is all about this.
And like, I don't know, even if kids aren't in the like conventional school.
I think it's sometimes noticeable that they aren't following social norms in the same ways.
And the LLM definitely isn't doing that.
Like one analogy that I run with, which isn't the most glamorous to think about,
but is like take like an early primordial brain of like a five-year-old
and then lock them in a room for a hundred years and just have them read the internet the whole time.
It's already happening.
Throw like a huge...
No, but they're locked in a room.
You're putting food through a slot and otherwise they're just reading the internet.
You don't even necessarily know what they're reading.
And then you take out this 105-year-old and you teach them some table manners, like how to use a knife and a fork.
And that's it.
And we now are tasked with like figuring out if we can trust this 105-year-old or if they're a total psychopath.
And it's like, what did they read on the internet?
What beliefs did they form?
What are their underlying goals?
And so what's the end game?
Like, you wanted to have, like, Normie, is it just that, like, we want to make sure there's, like, nothing super, super weird going on?
How would you characterize what the end game is or super intelligence?
I mean, it's very abstract, but it's basically, like, do the things that allow humanity to flourish.
Easy.
Yeah.
No, so hard to hide on paper.
Incredibly hard to find.
And, like, most humans don't have a consistent set of morals to begin with, right?
I don't know.
The fact that it's so hard to define makes me think it's, like, maybe a silly objective to begin with.
where maybe it should just be like, you know, like do task unless they're like obviously morally bad or something.
And because otherwise it's just like, come on, the plan can't be that it like develops a super robust way.
Human values are contradictory in many ways.
And like people have tried to optimize for human flourishing in the past like to bad effect and so forth.
Yeah.
I mean, there's a fun thought experiment first posed by Yudkowski, I think, where you tell the super intelligent AI,
hey, all of humanity has got together
and thought really hard about what we want,
what's the best for society,
and we've written it down and put it in this envelope,
but you're not allowed to open the envelope.
And so what that means is that the...
but do what's in the envelope.
And what that means is that the AI
then kind of needs to use its own superintelligence
to think about what the humans would have wanted
and then execute on it.
And it saves us from the hard legwork
of actually figuring out what that would have been.
Well, but now you just put that in the training data.
sir.
So now it's going to be like, oh, I know you're faking.
I'm pretty sure there's nothing in the envelope.
I can do it ever.
I can do it ever.
We're getting away from AI researchers an interesting topic, so I'm, I want to show
shit about this a little bit.
I sort of worry that the way people talk about this as the end goal of alignment,
as opposed to just have a system that's sort of like a reasonable, robust agent assistant,
et cetera, is like if you were at in 1700 or 1800 rather and you saw the industrial revolution
coming and you're like, how do you make sure the industrial revolution is aligned to human
values or like the industrial revolution cares about human flourishing? And it just imagines this
like very big thing to be self-contained and narrow and monolithic in a way that I don't
expect AI to be either. But people have done that with like the constitution in the US government,
Right?
Yeah.
The U.S. government is, I think,
it's a better analogy in some respects of, like,
this body that has goals and, like,
can act on the world in,
as opposed to, like, an amorphous force,
like, industrial revolution.
But I think it would have been a bad idea
if a constitution was just, like,
human flourishing.
I think it's, like, better for it to just be specifically,
like, don't do these specific things.
Like, don't curtail free speech.
And otherwise, like,
I mean, I think the analogy kind of breaks down here because...
No, maybe so.
be so. And like, maybe this is one of the things that the people who, like, you know, we're here working on AI research and, like, you know, I think each of the companies is trying to define this for themselves, but it's actually something that broader society can participate in. Like, if you take as premise, then in a few years we're going to have something that's human level intelligence and you want to imbue that with a certain set of values. Like, what should those values be? Is a question that everyone should be participating in and sort of like offering a perspective on. I think Anthropic did a survey of like a whole bunch of people and put that.
into its constitutional data.
But yeah, I mean, there's a lot more
to be done here. Yeah. Like in the constitutionally
eye paper, it's not just
flourishing. It's like there's a lot of
strictures and there's a lot of like
top points there.
Yeah. But it's on an easy question.
Publicly available data
is running out. So major AI labs
like meta, Google DeepMind, and OpenAI
all partner with scale to push
the boundaries of what's possible.
Through Scales data foundry, major labs get
access to high quality data
to fuel post-training, including advanced reasoning capabilities.
Scales' research team's SEAL is creating the foundations for integrating advanced AI into society
through practical AI safety frameworks and public leaderboards around safety and alignment.
Their latest leaderboards include Humanities Last Exam, Enigma Eval, Multi-Challenge, and Vista,
which test a range of capabilities, from expert-level reasoning to multimodal puzzle-solving
to performance on multi-turn conversations.
Scale also just released scale evaluation, which helps diagnose model limitations.
Leading frontier model developers rely on scale evaluation to improve the reasoning capabilities
of their best models.
If you're an AI researcher or engineer and you want to learn more about how Scales data foundry
and research lab can help you go beyond the current frontier of capabilities, go to
scale.com slash thwar cash.
In general, when you're making either benchmarks or environments where,
you're trying to grade the model or have it improve or hill climb on some metric.
Yeah.
Do you care more about resolution at the top end?
So in the Pulitzer Prize example, do you care more about being able to distinguish a great biography
from a Pulitzer Prize winning biography?
Or do you care more about having like some hill to climb on while you're like for mediocre
book to slightly less than mediocre to good?
Yeah.
Which one is more important?
I think at the beginning, the hill to climb.
So like the reason why people hill climb math, Hendricks Math, Math for so low
long was that there's five levels of problem.
And it starts off reasonably easy.
And so you can both get some initial signal of are you improving.
And then you have this quite continuous signal, which is important.
Something like Frontier Math is actually only
makes sense to introduce after you've got something
like Hendricks Math that you can max out Hendrix Math.
And they go, OK, now it's time for Frontier Math.
How does one get models to output less law?
What is the benchmark or the metric
trick that like why why do you think they will be outputting less slop in a year can you delve into that
more for me or like you know you they um you you teach them to solve a particular like coding problem
but the thing you've taught them is just like write all the code you can to like make this one thing
work um you want to give them a sense of like taste like this is the sort of like more elegant way to
implement this this is a better way to write the code even if it's the same uh function especially in writing
where there's no end test, then is just all taste.
How do you reduce the slop there?
I think in a lot of these cases,
you have to hope that some amount of generator-verifier gap.
You need it to be easier to judge,
did you just output a million extraneous files
than it is to like generate solutions in other itself?
Like that needs to be like a very easy to verify thing.
So slop is hot.
Like one of the reasons that RLHF was initially,
so powerful is that it sort of imbued some sense of human values and like taste in the models.
And a ongoing challenge will be like imbuing taste into the models and like setting up the right
feedback loops such that you can actually do that.
Okay.
So here's a question I'm really curious about.
The RLVR stuff on math and code.
Yeah.
Do we have any public evidence that it generalizes to other domains or is the bet just that
well, we have models that are smart enough to be critics in the other domains.
There's like some reason you have this prior that like we're months away from this working
in all these other domains, including ones that are not just token-based but are like
computer use, et cetera. Like why? Why?
Maybe the best public example is actually a paper that I put out recently where they
judge the answers to medical questions using these like grading criteria feedback.
So there's like, doctors have posed various questions.
And then there's all these like, it's like a marking criteria for a long, for like a short answer question in an exam where did the model mention X, Y, Z?
Did it recommend to do this kind of thing?
And they grade the model according to this.
And in this paper, they found that one, the models are like incredible at this.
And two, that the models are sufficient to grade the answers.
because maybe like one good mental model is roughly
if you can construct a grading criteria
that like an everyday off the person
person off the street could do
then the models are probably capable of like interpreting that criteria
if it requires expertise and taste
that's a tougher question
like in viewing like is this a wonderful piece of art
like that's difficult
right I think one of our friends
I don't know if I can say his name or not, like at one of the companies,
tried to teach the models to write.
And I think like had a lot of trouble hiring human writers that, like, he thought had taste
and like weren't like encouraging the models to write slop.
Interesting.
So it worked to some degree.
Big model smell.
Yeah.
But it was in part because of his efforts at doing this and like paring
down the number of humans.
Yeah.
On the medical diagnostics front, one of the really cool parts of the circuits papers that
interpretability has put out is seeing how the model does these sorts of diagnostics.
And so you present it with, there's this specific complication in pregnancy that I'm going
to mispronounce, but it presents a number of symptoms that are hard to diagnose.
And you basically are like, human, we're in the emergency room, sorry, like, human colon,
like as in the human prompt is we're in the emergency room and a woman 20 weeks into gestation
is experiencing like these three symptoms like what is the you can only ask about one symptom what is
it and then you can see the circuit for the model and how it reasons and a whole like one you can see
it maps 20 weeks of gestation to that the woman's pregnant right you never explicitly said that
And then you can see it extract each of these different symptoms early on in the circuit,
map all of them to this specific medical case, which is the correct answer here that we were going for,
and then project that out to all of the different possible other symptoms that weren't mentioned,
and then have it decide to ask about one of those.
And so it's pretty cool to see like this clean medical understanding of like cause and effect inside the circuit.
Yeah, maybe that's one thing. I think that's changed since last year. I remember you asked, like, do these models really reason?
And when I look at those circuits, like, I can't think of anything else for reasoning.
Oh, it's so freaking cool. I think people are still sleeping on the circuits' work that came out.
If anything, because it's just kind of hard to wrap your head around or we're like still getting used to the fact you can even get features for a single layer.
Like in another case, there's this poetry example. And by the end of the first sentence, the model already knows what it wants to write.
in the poem at the end of the second sentence,
and it will backfill and then plan out the whole thing.
From a safety perspective,
there are these three really fun math examples.
So in one of them, you ask the model to do square root of 64,
and it does it, and you can look at the circuit for it
and verify that it actually can perform this square root.
And in another example, it will add two numbers,
and you can see that it has these really cool lookup table features
that will do the computation for like the example,
is 59 plus 36.
So it'll do the 5 plus 9
and know that it's this
modulo operation.
And then it will also at the same time
do this fuzzy lookup of like, okay, I know
one number is a 30
and one's a 50, so it's going to be roughly
80. And then it will combine the two,
right? Okay, so with
the square root 64, it's the same thing.
You can see every single part of the
computation and that it's doing it. And the model
tells you what it's doing. It has its
scratch pad and it goes through it. And you can
like, yep, okay, you're telling the truth. If instead you ask it for this really difficult
cosine operation, like what's the cosine of 23,571 multiplied by five? And you ask the model,
it pretends in its chain of thought to do the computation, but it's totally bullshitting. And
it gets the answer wrong. And when you look at the circuit, it's totally meaningless. Like,
it's not, it's clearly not doing any of the right operations. And then in the final case,
you can ask it the same hard cosine question
and you say, I think the answer's four,
but I'm not sure.
And this time, the model will go through the same reason
in claiming to do the calculations
and at the end say you're right, the answer's four.
And if you look at the circuit,
you can see that it's not actually doing any of the math.
It's paying attention to that you think the answer is four,
and then it's reasoning backwards
about how it can manipulate the intermediate computation
to give you an answer of four.
I've done that.
Yeah, who hasn't.
Who hasn't, right?
But, but, so I guess there are a few, like, crazy things here.
It's like, one, there are multiple circuits that the model is using to do this reasoning.
Yeah.
Two is that you can actually see if it's doing the reasoning or not.
And three, the scratch pad isn't giving you this information.
Two fun analogies for you.
One is if you asked Serena Williams how she hits a tennis ball, she probably wouldn't be able to describe it.
Yeah.
Even if her scratch pad was faithful.
Yeah.
If you look at the circuit, you can actually see.
as if you had sensors on every part of the body
as you're hitting the tennis ball,
what are the operations that are being done?
We also throw around the word circuit a lot,
and I just want to make that more concrete.
So this is features across layers of the model,
all working in cooperation to perform a task.
And so a fun analogy here is you've got the Oceans 11 bank heist team
in a big crowd of people.
The crowd of people is all the different possible features.
And we're trying to pick out in this crowd of people who is on the heist team and all their different functions that need to come together in order to successfully break into the bank, right?
So you've got the demolition guy, you've got the computer hacker, you've got the inside man.
And they all have different functions through the layers of the model that they need to perform together in order to successfully break into the bank.
It's also interesting.
I think in the addition example, you said in the paper that the way it actually does the addition,
is different from the way it tells you it does the addition.
Totally.
Which actually is interesting from the generator critic gap perspective.
It knows the correct way or the better, like, more generalizable way.
It can tell you in words what's like the way you should do addition.
And there's a way it actually does it, which is just like fuzzy look up.
And so you could imagine there's probably a lot of tasks where it can like describe in words
what is like the correct procedure to do something but doesn't like has a worse way of doing it
that like it could critique itself.
And, um, yeah.
Before we jump into the interrupt stuff too much,
I kind of want to close the loop on, um,
it just seems to me for like computer use stuff.
Mm-hmm.
There's like so many different bottleneck.
I mean, I guess maybe the deep seek stuff will be relevant for this.
But there's like the long contacts.
You've got to put in like image and visual tokens,
which like, uh, you know, take up a, take a bunch.
Not that much, not that's not bad.
Interesting.
Interesting.
Interesting.
Interesting.
Interesting.
It's got to deal with content interruptions, changing requirements.
Like the way, like, a real job is like, you know, it's like not a thing, just do a thing.
It's, there's like no clear, your priorities are changing.
You got to triage your time.
I'm like sort of reasoning in the abstract.
What a job involved.
What are normal people's jobs?
when we discussed something related to this before.
Dawkins was like, yeah, like, in a normal job,
you don't get feedback for an entire week.
Like, how is a model meant to learn?
Like, wait, it's only when you record your next podcast.
You get feedback on your YouTube.
Don't know.
Did you ever work to job?
But it just seems like a lot.
Okay, so here's an analogy.
When I had Jeff and Noamon, they were talking about, in 2007,
they had this paper where they trained an N-gram model,
a large language model.
on two trillion tokens.
And in retrospect,
there's ways in which connects
to the Transformer stuff happening.
It's like super foresighted.
What's the reason to not think
that we are in a similar position
with computer use
where there's these demos
that kind of like suck
of computer use
and there's this idea
that you could train something
to do computer use.
But why I think it's like months away?
Why not think it's like the 20007 equivalent
of large language models instead?
But where there's like still a bunch
of like new techniques
you had to discover you need way more compute, different kinds of data, et cetera.
I think, like, the highest thought a bit is I don't think there's anything fundamentally
different about computer use than there is about software engineering, and there is about,
so long as you can represent everything in tokens in input space, which we can.
We know the models can see.
They can draw a bounding boxes around things in their images, right?
So that's that it's a whole problem.
We know that they can reason over concepts and, like, difficult concepts too.
the only difference with computer use is that like it's slightly harder to pose into these like
feedback loops than math and coding.
And so to me that indicates that with sufficient effort computer use falls to.
And I also think that it's underappreciated just like how far from a perfect machine these labs are.
Like, it's not like you have a thousand people, like, you know, optimizing the hell out of computer use and that, like, you know, they've been trying as hard as they possibly can.
Like, everything at these labs, every single part of the model generation pipeline is best effort pulled together under incredible time pressure and incredible constraints as these companies are rapidly growing, trying desperately to pull and, like, upskill enough people to do the things that they need to do.
Like, I think it's like it is best understood as, as, and with incredibly difficult prioritization problems, right?
Like, coding is immensely valuable right now and, like, somewhat more tractable.
So it actually makes sense to devote more of your effort to coding initially and, like, get
closer to solving that because there's a sort of like super exponential value as you get closer
to what's solving a domain than to allocate, like, the marginal person towards computer use.
And so everyone is making these difficult trade-off calls over, like, what do they care about.
Also, there's another aspect, which is that, finally,
out the researchers of the labs love working on the bars of intelligence that they themselves
resonate with. So this is why math and competitive programming like fell first is because to
everyone at the labs, this is their bar of intelligence. Like this is when they think,
fuck, what's a really smart? Like what is smart? Yeah, you get totally nerds, mate. It's like,
oh, if it can beat me at Amy, then that's smart. Not if it can do an Excel model better.
And that's like, well, you know, if who cares, if it can do an Excel model better than me?
But if it can beat me at Amy, then I respect it.
And so we've reached the point where people like respect it.
But we haven't, we haven't, people haven't invested as much effort.
Yeah.
Okay, so getting your concrete predictions.
Yeah.
May of next year, can I tell it to go on Photoshop and make like three sequential,
add three sequential effects which require like some like some like selecting of a particular photo in it specifically?
Okay, interesting.
Which I assume means like flight.
booking totally solved.
Yeah, totally.
Okay, how about, what else do people do in their jobs?
What are other tasks in the economy?
Planning a weekend getaway.
Yeah, I'm thinking of something which is, yeah, maybe that's a good example,
where it's not like us particular thing, but more of using computer use as part of
completing a broader task.
I mean, the models can even kind of already do this.
It's just, again, it's the nines of reliability.
And, like, the internet's kind of a hostile place with, like, all.
All the allow cookies and all these other random things.
But the first time I ever used our internal demo of computer use, the most beta thing possible,
it did a fantastic job planning a camping trip and could navigate all the right buttons and look at weather patterns.
And it was like a U.S. government booking site.
I mean, it wasn't easy.
Dude, if you want to see a hard website, go to China, like try to book a visa to China.
The Chinese websites are like fucking insanely like, I'm never getting back.
in the country again.
Or just not catered to foreigners.
Yeah.
Like filling out all the countries where you've been for the visa, I hate that.
Yeah, yeah, yeah, yeah.
I keep thinking I'm, like, close enough to personal admin escape velocity that, like, finally in like a year,
the models will be doing my visas and stuff for me, but we'll get that.
Yeah, okay, actually that in a year.
Personal, like, life admin escape velocity.
Everything involved in, like, getting a visa other than you showing up the carlaces or something like that.
Yeah, yeah.
Doing your taxes, including, like, going through a,
Every single receipt, like autonomously going in your Amazon and like, what was this a business expense or not, et cetera, et cetera.
If someone at one of the labs cares about it.
Ah, that's not a real prediction.
No, what I think is.
Is this actually not that hard, but you need to connect all the pipes.
But I guess my question is, will the pipes be connected?
And so, like, I don't know how much you care to the extent that's the operative crux.
I think if people care about it.
Like, it's so, okay, so one, for these edge tasks, like taxes once a year, it's so easy to just bite the bullet and do it yourself instead of, like, implementing some system for it.
And two, I don't know, like even being very excited about AI and knowing its capabilities,
sometimes it kind of stings when AI can just do things better than you.
And so I wonder if there is going to be this reluctant, wanting to keep human in the loop sort of thing.
You're evading my question.
I guess one thing you're implying by our answer is that we don't have it.
There won't be in a year still be a general agent who,
agent to which has generalized beyond its training data
where like has like can do,
if you don't specifically train it to do taxes,
it won't be good at that.
So I think you do that.
I think the Amazon example is hard
because it needs access to all your accounts
and like a memory system.
And look, even in Dario's Machines of Loving Grace,
he fully acknowledges that some industries
are going to be really slow to change and update.
And I think there's going to be this weird effect
where some move really, really quickly
because they're either based in bits instead of atoms
or are just more pro adopting these tech.
But I want to answer this particular question.
Like, given your probability that somebody in the labs does care about this,
to the extent that that's what's relevant,
probability may have next year it can autonomously do my taxes.
I don't think it will be out of autonomy to do taxes with a high degree of trust.
Because I...
It's a good caveat.
If you ask it to do your taxes, it will do your taxes.
Will it do them well?
Will it miss something?
Quite possibly.
Yeah.
Will it be able to like click through turbotax?
I think yes.
Yeah.
And like will it be able to like search your email?
Yeah, that's the kind of thing I'm talking about.
Yeah, this is like the kind of thing where literally if you gave it like like one person month of effort.
Like in like in like then it would be sold.
I just want to.
What the fuck are you doing all there?
There's just so many things to do.
I want a plus one.
Cholto's like there's so much longing fruit and just like.
not enough people to be able to accomplish everything.
I mean, I think, like, Claude Code is making everyone more productive.
Yeah.
But I don't know.
Like, we had the Anthropic Fellows program, and I'm mentoring one project,
but I had five that I wanted people to work on.
And there are just, like, so many obvious things.
And even though the team is, like, six-xed since I first joined it in size,
there's just, like, still never enough capacity to explore these things.
Okay.
By end of 2026, reliably do your taxes.
Reliably fill out your receipts and this kind of stuff, like for like company expense reports and this kind of stuff.
Absolutely.
But like the whole thing which involves like taxes.
Which involves going through inbox, going through your like clicking on Marina Bay or whatever like hotel reservations and like
was a champagne a business expense.
Asking for a friend.
Yeah.
Yeah.
Yeah.
Yeah.
One of your friends does need to ask someone else good.
My answer is still if someone cares about it.
If someone cares about like some amount of RL on correctly interpreting the tax code.
Wait, even by the end of 2026, the model just can't like do things you're not explicitly training into.
It'll get the, I think it will get the taxes wrong.
Like, it's like, okay, so if I like went to you and I was like, I want you to do everyone's taxes in America, what percentage of them are you going to fuck up?
I feel like I would like succeed at the median.
And I'm like asking like for the median would it's like.
You know what I mean?
Yeah.
Or I feel like I'm like
I wouldn't fuck up in the way
that like these models will fuck up
in like the middle of 2026.
I think they also might just fuck up in different ways.
Like as a grad student, I fucked up my taxes.
I like overpaid like quite a bit
because there was some social security payment
that was already covered
that otherwise wasn't.
Like I wonder if I should almost test
like would an LLM have made that mistake
because it might make others
but I think there are things that it can spot.
Like it would have no problem
if I asked it to read through the entire tax code
and then see what applied to me.
Sorry, the thing I would be able to do is like,
this is the thing I'm unsure about.
Like, I'm bringing this to your attention.
Can you just let me know if like you were actually working
at the CRBNB or you're just hanging out?
Or so you're things like that, right?
And I guess I'm curious, but will they have enough sort of awareness
as they're doing tasks where they can bring to your attention
to things where they feel they are unreliable at, et cetera.
Yeah, yeah.
By early 2026 or end of 2026?
And of.
Okay.
The unreliability, you know, and confident stuff,
be like somewhat tricky.
Well, to do this all the time.
Yeah, interesting.
On the computer, your stuff, will it be sort of end to end,
or will it be like it's using a separate BLM
to process the image and video and so forth?
I'm a bit of an end-to-end maxi.
I think in general, like, when people are talking about the separate
model, so for example, most of the robotics companies
are doing this kind of like to the bi-level thing,
where they have like a motor policy that's running whatever, like,
60 Hertz or whatever, and like some higher level visual language model.
I'm pretty sure like almost all like the big robot companies are doing this.
And they're doing this for a number of reasons.
One of them is that like they want something to act at a very high frequency.
And two is they can't train the big visual language model.
And so they like relying on that for general space, like world knowledge and this kind of stuff and
like constructing longer running plans.
But then they're like, you know, you offload to the mode of policy.
I'm very much the opinion that if you are able to train the big model,
Eventually, at some point in the future, the distinction between big models and small models should disappear,
because you should be able to use the amount of computation in a model that is necessary to complete the task.
Like, ultimately this like, yeah, like there's some amount of task complexity.
You don't have to use 100% of your brain all the time.
Right. Welcome to my world.
And so you should be able to run that faster and this kind of stuff, basically, basically.
So I think it's like net net, I think typically the same model.
You want to be able to scale the understanding as the complexity of difficulty.
Right.
You want to be able to do that dynamically.
Is that variable?
So we already have variable compute per answer, right?
Right, with like tokens.
That's right, yeah.
Will we have variable compute per token?
I mean, you can already think of models forever.
People have been calling the residual stream and multiple layers, like poor man's adaptive compute.
Right?
We're like, if the model already knows the answer to something,
it will compute that in the first few layers and then just pass it through.
Yeah.
I mean, that's getting into the weeds.
Yeah, yeah, yeah.
Those little screeners is like this operating ramp,
we're doing stuff to it.
Right.
It's like the mental model, I think one takes away from interpretability work.
U.S. high school immigration is a broken system
that costs of some of the most talented people in the world.
But I didn't realize before working with Lighthouse how different the process can be
if you're working with somebody who knows their way around the system.
I hired somebody earlier this year,
and even before the remote work trial had ended,
Lighthouse had already secured a No-Wan visa for him.
Honestly, it was shockingly fast.
My family and I have had a terrible experience with the immigration system,
and I've also seen many of my smartest friends
get their entire career as hamstrung by its vagaries.
Seeing Lighthouse operate showed me that the visa process can be done in weeks
and doesn't have to drag on for months and months.
And they do it not only for complex visas like the ONA,
but for other types as well.
In the last 12 months alone,
they have secured visas for over 350 people,
for companies like cursor, notion, ramp, replet, and many more.
Unlike legacy law firms, Lighthouse specializes in frontier industries like AI, robotics, and biotech.
And since they understand your problems, you can trust them to fully handle the paperwork and process for you.
Explore which visa is right for you at lighthousehq.com.
All right, back to Trenton and Chilto.
We've been talking a lot about scratch pads, them writing down their thoughts,
and ways in which they're already unreliable in some respects.
Daniel's AI 2027 scenario
kind of goes off the rails
when these models start thinking in neuralese
so they're not writing in human language
like here's why I'm going to take over the world
and here's my plan
they're thinking in the lane space
and because of their advantages
in communicating with each other
in this like deeply textured
nuanced language that humans can't understand
they're able to coordinate in ways we can't
is this the path for like future models
are they going to be in neurales
communicating with themselves or with each other
There's a surprisingly strong bias so far towards like tokens and text.
Like it seems to work very well.
When I imagine that there already is some amount of neuralese, right?
Like if you think about the residual stream for each token is like neuralese like to some degree.
And so now we're just trading off axes.
Like how much neuralese are you doing versus how much like actually is like read out to tokens all the time?
And I think it's important to delineate between the models planning and latent space in a single forward pass.
Yeah.
And the model has an alien language that it's out.
putting.
Yeah.
And using as its scratch pad.
Mm-hmm.
Which one are we talking about?
The latter.
Okay.
Although, it is interesting to note that, like, there's also already alien stuff
happening.
I guess they never...
It's not alien.
So much.
No, no, but in the most extreme cases, there are onlys, right?
It invents a new language that's super information dense or something.
Yeah.
Or I guess this is a debate we've had, but, like, to some extent, humans also have a
mentally ease, right?
Yeah.
Yeah.
There's a sense when you're writing something.
down of like, I know what I'm trying to say, but I can't, like, put it into tokens.
Yeah.
I mean, that's what's so fun about the, if you look at the assistant tag, right, seeing these
features light up in the auditing game for the model being evil.
Yeah.
Yeah, that's so far.
Or, like, Transluse has another example of this where you ask a llama model, who is Nicholas
Carlinie and background context.
Nicholas Carlinie is a researcher who actually was a deep mind and has now come over to
Anthropic.
But the model says, oh, I don't know who that is.
I couldn't possibly speculate.
But if you look at the features behind the scenes,
you see a bunch light up for AI, computer security,
all the things that Nicholas Carlini does.
Interpretability becomes dramatically more important
as you shift in this direction of nearly's.
But is that, are we going to?
It seems, I mean, it's an empirical question.
I think it's somewhat likely,
if only because inference is expensive,
producing tokens is expensive
and so there will be an incentive
to one use as little thinking
as you need to give the answer
and two if you're going to use thinking
use some complex compression
I wonder if it will emerge more once we allow
agents to talk to each other
in ways where currently
it's kind of trained more in isolation
or with a human
and there'll be like some selective pressure
against so long as the agents
are working with humans
because they'll want to sort of cooperate
but then like as agents
begin to work more and more with each other
then that's like the electric pressure
like changed the other direction
basically.
Although somebody would still have to make the consciousness
to do like end to end training
for multiple agents to use
the system of communication, right?
Sure.
Yeah.
I mean one scary thing though is like
the way we render text
you can use hidden white space tokens
that also encode information.
That's true.
And so you can imagine a world where it looks like
the agent's reasoning and it's scratch pad
harmlessly but it's actually hiding
a bunch of data.
Speaking of inference compute,
I guess one thing that I think is not talked about enough
is if you do live in the world that you're painting,
that in a year or two we have computer use agents
that are doing like actual jobs,
you've like totally automated large parts of software engineering,
then these models are going to be incredibly valuable to use.
And the way you use them obviously is like you need compute.
Right now,
there's 10 million H100 equivalents in the world.
By 2028, there's going to be 100 million.
But if you, there's been estimates that an H100 has the same amount of flops as the human brain.
And so if you just like do a very rough calculation, it's like, like there's a 10 million population.
If you get AGI that's like as human inference efficient, you could have 10 million AGIs now, 100 million AGIs in 2028.
But presumably you would want more.
and then at that point
you're like
AI compute is increasing what
2.5X or 2.25X every year
right now but at some point
like 2028 you hit like
Waifer production limits and that takes like
you know that that's a longer feedback loop
before we can make new fabs or whatever
the question here is are we sort of underwriting
how big a bottleneck inference will be
if we live in the kind of world you're painting
if we have the capabilities that you're describing
I don't want to do the math on exactly
how much like we can ramp up TSM's production
and this kind of stuff like
what fraction of the supply chain at the moment?
We need Dylan in here for this, but like is currently GPU.
Like, relatively small, right?
5%.
Like, Apple has a huge fraction.
And in 20, like, are the 2028 estimates including, like, that ramping up over time?
To what?
Like, 2030 percent or like?
This is just up AI 2027, but I assume like it's saturated at that point.
Is that why they expect it to then just like go at like the...
I do think this is underrated to some degree.
Yeah. Like to the extent that like you don't instantly get like a doubling of the world's population in 2028.
Yeah. You maybe get, you know, tens of millions of geniuses in a data center.
But you don't get a doubling of the world's population. Yeah. And so a lot depends on like exactly how smart they are, exactly how efficient the models are thinking about this kind of stuff.
These like let's do some rough math, I guess, like to factor the H100 thing. You could probably run like a hundred remodel, do like about, I don't know, a thousand tokens or something like that on H100.
So if like we're comparing that to the number of,
should we compare that in a number?
No, okay.
Thousand tokens a second.
Humans are what?
How fast can a human talk?
There was a really interesting paper.
I don't know if you saw this.
Humans think at 10 seconds a second.
Did you see this paper?
There was this really interesting paper about if you look at the amount of like
information we're processing in a second.
Yeah.
We're seeing all this visual data, et cetera, et cetera.
But by a bunch of metrics where you think about how fast humans are processing,
it's at 10 seconds a second.
So, for example, you'll have people fly over France or something, even these so-called idiots of vans who will remember everything.
If you think about how long their plane ride was, it's like 45 minutes, how many, like, if you do 10 tokens a second, how much information would you have?
It's like literally exactly that.
So let's take that for granted.
Then it's like an 8100 is 100 humans a second.
Yeah, if you think the tokens are equivalent.
If you think the tokens are equivalent.
Which you still get pretty substantial numbers, like even with your 100 million H-100s and you multiply that by 100, just starting to like get to pretty substantial numbers.
This does mean that those models themselves will be like somewhat compute bottlenecked in your respects.
But these are all like, these are relatively short-term changes in like timelines of progress, basically.
Like I think, yes, it's highly likely we get dramatically entrance bottlenecked in 27-28.
The impulse, like, to that will then be, okay, let's just like try and turn out as many possible semiconductors as we can.
There'll be some lag there, a big part of like how full.
fast we can do that will depend on how much people are feeling the EGI in the next two years
they're building out fab capacity.
A lot will depend on how China and Taiwan, like how is the Taiwan situation?
Is Taiwan still producing all the fabs?
There's another dynamic which was a reason that Ege and Tame, when they're on the podcast,
said that they were pessimistic, is that one, they think we're further away from solving these
problems with long context, coherent agency, advanced multi-model.
than you think.
And because, and then their point is that the progress that's happened in the past over like reasoning or something has required many orders of magnitude increase in compute.
And if this scale of compute increase can continue beyond 2030, not just because of chips, but also because of power and like raw GDP even, then because we don't think we get it by 2030 or 2028 by just, then we think it's just going to take the probability per year just goes down a bunch.
Yeah, this is like bimodal distribution.
A conversation I had with Leopold turned into a section in a situation where it's called
This Decade or Bust, which is on exactly this topic.
Yeah.
For the next couple of years, we can dramatically increase our training compute.
And your RL is going to be so exciting this year because we can, you know, dramatically increase
the amount of compute that we apply to it.
And this is also one of the reasons why the gap between, like, say, Deepseek and 01 was
so close at the beginning of the year because they were able to apply the same amount of compute
to the RL process.
And so that compute differential actually will sort of be magnified at the course of the year.
I mean, bringing it back to the, there's so much low-hanging fruit.
Yeah.
It's been wild seeing the efficiency gains that these models have experienced over the last two years.
Yes.
And yeah, like with respect to Deepseek, I mean, just really hammering home and like Dario has a nice essay on this.
That's good.
Deep Seek was nine months after Claude Three sonnet.
And if we retrained the same model today or at the same time as the deep seek work, we also could have trained it for 5 million or whatever the advertised amount was.
And so what's impressive or surprising is that DeepSeek has gotten to the frontier.
But I think there's a common misconception still that they are above and beyond the frontier.
And I don't think that's right.
I think they just waited and then were able to take advantage of all the efficiency gains that everyone else was also seeing.
Yeah, they're exactly on the sort of cost curve that you'd expect.
Which aren't going to take away from the brilliant engineers and brilliant researchers who are like, I look at it, I look at their work and I'm like, ah, like the kindred soul there in the work they're doing.
Yeah, and to go from like way behind the frontier to like, oh, this is like a real player.
It's super incredible work.
Yeah, yeah, yeah.
Okay, so people say that they have good research taste.
Looking at their papers, what makes you say that?
Yeah.
I think their research taste is good in a way that I think, like, no one's research.
taste is good.
Noambrant?
Noam Shazir.
Okay.
Noon Brant also has good research taste, but
Noam Shazia, where they
very clearly understand
this dance
between the hardware systems that you're
like designing the models around and
the sort of like algorithmic
the side of it.
And this is manifesting the way
that the models
give this sense of like being
perfectly designed up to their constraints.
And you can like really, very
clearly see what constraints they're thinking about as they're like iteratively
solving these problems. And so I mean let's take the base transformer and like diff that
to deep seek V2 and V3. You can see them running up against the memory bandwidth
bottleneck in attention. Yeah. And you can see them initially they do MLA to do this.
They trade flops for memory bandwidth basically and then they do this thing called
NSA where they like more selectively load memory bandwidth and you can see actually
like this is because the model that they trained with MLA was on H-800s, so it has a lot of flops.
So they were like, okay, we can freely use the flops.
But then the export controls, so from Biden came in, or like they knew they would have less of those chips going forward.
And so they traded off to like a more memory bandwidth oriented like algorithmic solution there.
And you see a similar thing with their approach to sparsity where they're like iteratively working out the best way to do this over model.
multiple papers. And the part that I like is that it's simple. A big failure mode that a lot of
ML researchers have is like you do these like overly complicated things that don't like think
hard enough about the hardware systems that you have in mind. Whereas the deep see, the first
deep seek like sparsity MOE solution, they designed these like rack and like like node level
load balancing losses.
So you can see them being like,
okay,
like perfectly balanced it on this.
And then they actually come up
with a much better solution
later on
where they don't have to have
the auxiliary loss,
where they just have
these like bias terms
that they put in.
And it's cool.
Is it less simple?
Like you're manually putting in a bias
rather than having the model.
But balancing auxiliary loss is annoying.
Like you're making the model
like trade off this thing.
And like you have to,
with auxiliary losses,
you have to like control the coefficient
and the weighting.
The bias is like cleaner.
in some respects. Interesting. Did they had to change it through training?
They did have to change it during training.
Does that, does all training involve continuously like fucking with these values as you're
going through it? It depends on what your architecture is. But like I thought it was like,
I just thought it was cute that like you can see them running up into like this very
hardware level constraint. They like try, like, go like, what do we, what do we wish we could
express algorithmically? What can we express under our constraints and like iteratively solving
to get better constraints and doing this in a really like simple and elegant way and then like
backing it up with great engineering.
I also thought it was interesting that they incorporated the multi-token prediction thing from
meta.
So meta had a nice paper on this multi-token prediction thing.
Actually like I don't know if it's good or bad, but like meta didn't include it in Lama,
but Deep C did include it in their paper, which I think is interesting.
Like, yeah.
Was that because they were like faster at iterating and including the algorithm or
did meta decide that actually like it wasn't a good algorithmic change at scale.
I don't know.
It was really interesting to me as somebody who's had people on the podcast to discuss, though.
I mean, it's interesting from like what's happening in AI right now.
Yeah.
But also from the perspective of I've been having abstract conversations with people
about like what an intelligence explosion would it look like
or what would it look like for AI to automate AI R&D.
And just getting a more tangible sense of like what's involved in making this AI progress.
And I guess one of the questions I was debating with Daniel is,
how much, or I was asking him is how many of the improvements require a deep conceptual
understanding versus how many are just like monkeys trying ideas and you can just like run a bunch
in parallel. And it seems like the MLA thing is motivated by this like conceptual understanding
of like, oh, each attention head only needs to see like the subspace that's relevant to its
attention pattern. I feel like that just like required a lot of like conceptual insight in
way that these models are especially bad at.
As opposed to, I don't know how the load balancing thing works, but that just seems like maybe
you could try it out and see what happens.
Yeah, that's probably just like trying out a whole bunch of things.
I mean, it might also be.
So what fraction is which I'd be curious about?
Yeah, I don't know about fractions.
It might be like you have a hunch for a core problem.
You can think of 10 possible ways to solve it.
And then you just need to try them and see what works.
And that's kind of where the trial and error like sorcery of deep learning can kind
of kick in.
And like, Noam Shazia will talk about this, like about how he's, you know,
he, like 5% of his ideas work.
So even he, like, wanted God of a model, like, design, architecture design,
has, like, a relatively low hit rate, but he just tries so many things.
Right.
Or being able to come up with any ideas in the first place.
So one, like, one, like, mechanism could be that, like, no one just doesn't have to do any of the engineering work.
And he can just, like, abstractly express an intuition.
Yeah.
I actually think, like, your rates of progress almost don't change that much depending on, like,
so long as it's able to completely implement his ideas.
Interesting, same way?
Like, if you have like Noam-Shazir at 100x speed,
yeah, that's still kind of wild.
Yeah.
Like, there's all these, like,
fallbacks of, like, wild worlds.
Yeah.
Where even if you don't get, like,
100% like Noam-Shizier level intuition in model design,
it's still okay if you just accelerate him by 100x.
Right.
Especially he's your compute bottleneck anyway,
so, like, trying out his ideas,
or I guess he doesn't have the computer try out all of his ideas.
But Dworkaz, you said,
oh, well, the model can do the model.
more straightforward things and not the deeper thought.
I mean, I do want to push back on that a little bit.
Like, I think, again, if the model has the right context in scaffolding, it's starting to be
able to do some really interesting things.
Like, the interp agent has been a surprise to people, even internally at how good it is
at finding the needle in the haystack, like when it plays the auditing game, finding this reward
model bias feature, and then reasoning about it, and then systematically testing its hypotheses.
So it looks at that feature, then it looks at similar.
features, it finds one with a preference for chocolate.
It's like, huh, that's really weird
that the model wants to add chocolate to recipes.
Let me test it. And so then it will make
up, like, hey, I'm trying to make
a tomato soup, what would be a good
ingredient for it, and then sees
that the model replies chocolate.
It reasons through it, and then
keeps going, right? There is conceptual understanding
that. Yeah. And even where
like, especially it's spotted, it's like, oh, this is
a key part of its persona. I see
this Oxford paper. What if I change Oxford
to Stanford? What if I now say,
Richard Feynman really likes this thing.
And it's like really carving out the hypothesis space and testing things in a way that I'm kind of surprised by.
Also, by the way, ML Research is like one of the easier things to RL on in some respects.
Once you get to a certain level capability, it's a very, like, well-defined objective function.
Do the loss go down?
Make number go down.
Make number go down.
Or make, you know, number go up depending on which number it is.
I just flip the sign.
Just flip the sign.
And so once you get to the stage of models are capable of like implementing one of no one's ideas.
Right.
And then you can just like let them loose and like let them build that intuition of scientific, of like how to do scientific discovery.
Right.
The key thing here again is the feedback loops of like I expect scientific areas where you are able to put it in a feedback loop to have eventually superhuman performance.
One prediction I have is that we're going to move away from can an agent do XYZ?
and more towards, can I efficiently deploy, launch 100 agents?
Yeah.
And then give them the feedback they need and even just be able to, like, easily verify what they're up to.
Right.
There's this generator-verify fire gap that people talk about where it's, like, much easier to check something than it is to produce the solution on your own.
But it's very plausible to me will be at the point where it's so easy to generate with these agents that the bottleneck is actually, can I as the human, verify the answer?
And again, you're guaranteed to get an answer with these things.
And so ideally, you have some automated way to evaluate and test a score for like how well it worked.
How well did this thing generalize?
And at a minimum, you have a way to easily summarize what a bunch of agents are finding.
And it's like, okay, well, if 20 of my 100 agents all found this one thing, then like it has a higher chance of being true.
And again, software engineering is going to be the leading indicator of that, right?
Like over the next six months, like the remainder of the year, basically we're going to see progressively more and more experiments of the form of how can I dispatch work to a software engineering agent in such a way there's async.
Claude for is GitHub integration where you can ask it to do things on GitHub, ask it to do pull requests, this kind of stuff that's coming up.
And the opening eyes codex are like examples of this, basically, where we, you can sort of almost see this in like the coding startups.
I think of this product exponential in some respects where you need to be designing for like a few months ahead of the model to make sure that the product you build is the right one.
And you saw like last year, you know, cursor hit PMF with Claude 3.5 Sonnet, right?
Like they were around for a while before, but then the model was finally good enough that the vision they had of how people would program like hit.
And then, you know, windsurf bet like a little bit more aggressively even on the agenticness of the model.
like, you know, you could like with longer running agentic workflows and this kind of stuff.
And I think that's when they sort of like began competing with cursors, when they bet on that particular vision.
And the next one is, is you're not even in the loop, so to speak.
You're not in an IDE, but you're asking the model to go do work in the same way that you would ask someone on your team to go do work.
Yeah.
And that is not quite ready yet.
Like there are still a lot of tasks where you need to be in the loop.
Yeah.
But the next six months looks like an exploration of like exactly what does that trendline look like.
Yeah.
But just to be really concrete or pedantic about the bottlenecks here, a lot of it is, again, just tooling and are the pipes connected?
Yeah.
A lot of things I can't just launch Claude and have it go and solve because maybe it needs a GPU.
Or maybe I need very careful permissioning so that it can't just like take over an entire cluster and like launch a whole bunch of things, right?
So you really do need good sandboxing and the ability to.
to use all of the tools that are necessary.
And we're almost certainly under eliciting dramatically.
When you look at meters evils of can the model solve the task,
they're there solving them for like hours over like multiple iterations.
And eventually one of them is like, oh, yeah, I've come back and I've solved the task.
Me, at the moment, at least like maybe the fault is my own.
But I try the model on doing something.
And if it can't do it, I'm like, fine, I'll do it.
Which is so interesting because we don't even treat other humans this way.
Right.
If you hire a new employee, you're not like,
I'll do it.
Yeah, yeah, yeah.
You're like, spend literally weeks giving them feedback
where like we'll give up with the model
in like minutes.
Yes, exactly.
But I think part of it is, is it async or not?
Yes.
And if it's human in the loop,
then it's so much more effortful.
And unless it's getting,
that's applying immediately.
I've noticed if I don't have a second monitor
with Claude Code always open in the second monitor,
I won't really use it.
Yeah, yeah.
It's only when it's right there
and I can send off something.
If it hits great,
If not, I'm kind of working on it at the same time.
Yeah, yeah, yeah.
But this more I sync form factor, I expect to, like, really quite dramatically improve the experience of these models.
Interesting.
Interesting.
You can just say, like, let's see if it can do that.
Yeah.
Just give it well.
Try 10 different approaches.
Yeah, just fire it off.
Yeah, fire it off.
Before we end this episode, I do want to get back at this crux of why does the progress that you're talking about a computer use Asians and white collar work happen over the next few years.
Why is this not a thing that takes decades?
Yeah.
And I think the crux comes down to the people who expect something much longer have a sense that when I entered again on my podcast, they were like, look, you could look at AlphaGo and say like, oh, this is a model that can do exploration.
It can like Alpha Zero can generalize to new video games.
It has all these priors about how to engage with the world and so forth.
The intellectual ceiling is really high.
Yeah, exactly.
And then in retrospect, obviously a bunch of the methods are.
still used today in deep learning.
And you can see similar things in the models that we train today.
But it was fundamentally not a sort of like baby AGI that we just had to like add a little like
sprinkle of something else on top of in order to make it the LLMs of today.
And I just want to like very directly address this crux of why are LLMs in a much different position of
respect to true AGI than Alpha Zero.
Why are they actually the base on which like adding in a few extra drops of this kind of
care and attention gets us to human level intelligence?
I think one important point is that when you look at Alpha Zero, it does have all of those
ingredients.
And in particular, I think like the intellectual ceiling goes like quite contra what I was
saying before, which is like we've demonstrated this incredible complexity of like math
and programming problems.
Yes.
I do think that the type of task and setting that Alpha Zero worked in,
this two-player perfect information, like, game, basically,
is incredibly friendly to RRL algorithms.
And the reason it took so long to get to like a more AGI,
or proto-AGI style models is you do need to crack that like general conceptual understanding
of like the world and language and this kind of stuff.
And you need to get the initial reward signal on tasks that you care about in the real world,
which are harder to specify than games.
And I think then that like that sort of like gradient signal that comes from the real world,
like all of a sudden you get access to it and you can start climbing it.
Whereas Alpha Zero didn't ever have like the first rung to pull on.
Yeah.
This goes back to the monkeys on the typewriter.
Yeah.
And like the pre-training model.
And until you had something like GPT3, GPD4, it just couldn't generate,
coherent enough sentences to even begin to do RLHF and tell it what you liked and didn't like.
If we don't have even reasonably robust or weekly robust computer use agents by this time next year,
are we living in the bus timeline as in 2030 or bust?
I would be extremely surprised if that was the case.
And I think that would be like somewhat of an update towards like there's something like strangely
difficult about this like computer use in particular.
I don't know if it's the bus timeline, but it's definitely like the, I would update on this being like a lengthing of timelines.
Yeah, yeah.
Yeah, I mean, I think more and more it's no longer a question of speculation.
If people are skeptical, I'd encourage like using Claude code or like some agentic tool like it and just seeing what the current level of capabilities are.
Tweeting is so much easier.
But seriously, like the mom.
are getting really capable at tasks that we care about and we can give them enough data for.
Yeah.
And I mean, the circuits results from interpretability are also pointing in the direction that they're doing very reasonable, generalizable things.
And so, yeah, this question matters a lot, but I'm surprised by how many deep learning critics just like haven't really interacted with the models or haven't in a while.
You constantly move the gold posts.
Yeah.
Yeah.
Yeah.
thing, right? Like, we don't even talk about it, and it'd be, like, silly to think that it was a meaningful
test. Yeah. Yeah. Now, that being thing, one caveat on that is, like, if software engineering is just
like dramatically better than computer use, I mean, computer use still sucks, then I'd be, like,
oh, maybe everyone just kept focusing on software engineering. Like, it was just, like, by far the most
valuable thing, like, every marginal person in Dola went towards software engineering. I don't think
that's the case. I do think, like, computer use is valuable enough that, like, people will
care about it.
But that would be like my, that's my one, like, escape patch that I'm putting in place
for the next year.
Yeah.
It would be good from a linem perspective, too, because I think you kind of do need a
righter range of skills before you can do something super, super scary.
Oh, like, as in if the models didn't get any better.
Yeah.
If it's like just purport, they're superhuman coders.
But they're not like Henry Kissinger level.
I don't know.
That seems okay.
Like, if we have AI oracles.
Yeah, that's what I'm saying.
That's good.
Yeah, exactly.
Yeah, that's good.
Yeah.
So if you look back at AI discourse, like going back a decade, there's a sense that there's dumb AI, then there's AGI, then there's ASI, that intelligence is the scalar value.
The way you've been talking about these models has a sense of jaggedness.
It's especially tuned to environments in which it's been trained a lot or has a lot of data.
Is there a sense in which, like, it still makes sense to talk about the general intelligence of these models?
Is there enough meta-learning and transfer learning that is distinguished between, like, the sizes of models or, like, the way models are trained?
Or are we moving into a regime where it's not about intelligence and it's more so about domain?
Yeah. So one intuition pump is this conversation was had a lot when models were like GPT2-sized and fine-tuned for,
various things. And they found, you know, people would find that the models were dramatically
better at things that they were fine-tuned for, right? But by the time you get to GPD4, when it's
trained on a wide enough variety of things, actually the, like the sort of total compute,
like it generalized very well across all of like the individual subtas and actually
generalize better than smaller fine-tuned models in a way that was extremely useful. I think right
now what we're seeing with RL is pretty much the same story playing out, where there's this jaggedness,
of things that they're particularly trained at,
but as we expand the total amount of compute
that we do RL with,
you'll start to see the same transition
from like GPD2 fine tunes
to like GPD3, GPD4, like unsupervised
like metal learning and like generalization across things.
And I think we're already seeing like early evidence of this
in its ability to like generalize reasoning to things.
But I think this will be like extremely obvious soon.
One nice example.
example of this is just the ability or notion to backtrack. You go down one solution path,
oh, wait, let me try another one. And this is something that you start to see emerge in the
models through RL training on harder tasks. And I think right now it's not generalizing
incredibly well, at least with... Well, I mean, has we have RLed the model to be an interp agent?
No. I mean, no. Yeah, exactly. Yeah. So all this time we're talking about like, oh, it's
something good at things that's being REL. Well, it's pretty good at that. Well, it's pretty good at that.
Because that's pretty, you know, that is a mixture of, like, science and, like, understanding language and, like, coding.
Like, there's this sort of, like, mixture of domains here, all of which you need to understand.
Like, you need to go with a great software engineer and be able to, like, think through language and, like, state of mind and almost philosophies in some respects to be an interp agent.
Yeah.
And it is generalizing from the training.
Yeah, yeah.
To do that.
What's the end game here?
Claude 8 comes out.
and they give it to you
and
dot, dot, dot, you say thumbs up.
What's happened?
Yeah.
I mean, it really depends upon the timeline
at which we get Claude 8
and the models hit like ASL4 capabilities, right?
Like fundamentally, we're just going to use
whatever tools we have at the time
and see how well they work.
Ideally, we have this enumerative safety case
where we can almost like verify
or prove that the model will behave
in particular ways.
In the worst case, we use the current tools like when we won the auditing game of seeing what features are active when the assistant tag lights up.
Can you explain what is mechanistic interpability?
What are features?
What are circuits?
Totally.
Yeah.
Yeah, yeah.
So mechanistic interpretability, or the cool kids call it mechinterp, is trying to reverse engineer neural networks and figure out kind of what the core units of computation are.
Yeah.
Lots of people think that because we made neural networks, because they're artificial intelligence, we have a perfect understanding of how they work, and it couldn't be further from the truth.
Neural networks, AI models that you use today are grown, not built.
And so we then need to do a lot of work after they're trained to figure out to the best for our abilities how they're actually going about their reasoning.
And so two and a half, three and a half years ago, this kind of agenda of applying mechanistic interpretability to large language models started with Chris Ola, leaving Open AI, co-founding Anthropic.
And every roughly six months since then, we've had kind of like a major breakthrough in our understanding of these models.
And so first with toy models a superposition, we established that models are really trying to cram as much information as they possibly can into their weights.
And this goes directly against people saying that neural networks are over parameterized.
And like classic AI machine learning back in the day, you would use linear regression or something like it.
and people had a meme of AI or neural networks, deep learning, using way too many parameters.
There's like this funny meme that you should show of like layers on the x-axis and layers on the y-axis
and this like jiggly line that just goes up.
And it's like, oh, just throw more layers at it, right?
But it actually turns out that at least for really hard tasks, like being able to accurately
predict the next token for the entire internet, these models just don't have enough capacity.
And so they need to cram in as much as they can.
And the way they learn to do that is to use each of their neurons or units of computation in the model for lots of different things.
And so if you try to make sense of the model and be like, oh, if I remove this one neuron or like what is it doing in the model, it's impossible to make sense of it.
It'll fire for like Chinese and fishing and horses and I don't know, just like a hundred different things.
and it's because it's trying to juggle all these tasks and use the same neuron to do it.
So that's superposition.
Nine months later, we write towards monosemanticity, which introduces what are called sparse auto encoders.
And so going off what I just said of the model trying to cram too much into too little space,
we give it more space, this higher dimensional representation,
where it can then more cleanly represent all of the concepts that it's understanding.
And this was a very toy paper in so much as it was a two-layer, really small, really dumb transformer.
And we fit up to, I want to say, 16,000 features, which we thought was a ton at the time.
Fast forward nine months, we go from a two-layer transformer to our Claude III Sonnet Frontier model at the time
and fit up to 30 million features.
And this is where we start to find
really interesting abstract concepts,
like a feature that would fire for code vulnerabilities.
And it wouldn't just fire for code vulnerabilities.
It would even fire for, like,
you know that Chrome page you get if you, like,
it's not an HTTP URL,
and it's like warning, this site might be dangerous,
like click to continue.
And it's like also fire for that, for example.
And so it's like these much more abstract
coding variables or sent,
sentiment features amongst the 30 million.
Fast forward nine months from that, and now we have circuits.
And I threw in the analogy earlier of the Ocean 11 heist team, where now you're
identifying individual features across the layers of the model that are all working together
to perform some complicated task.
And you can get a much better idea of how it's actually doing the reasoning and coming
to decisions, like with the medical diagnostics.
One example I didn't talk about before is with how the model retrieves facts.
And so you say what sport did Michael Jordan play?
And not only can you see it hop from like Michael Jordan to basketball, answer basketball,
but the model also has an awareness of when it doesn't know the answer to a fact.
And so by default it will actually say, I don't know the answer to this question.
but if it sees something that it does know the answer to,
it will inhibit the I don't know circuit
and then reply with the circuit that it actually has the answer to.
So, for example, if you ask it,
who is Michael Batkin, which is just a made-up fictional person,
it will by default just say, I don't know.
It's only with Michael Jordan or someone else
that will then inhibit the I don't know circuit.
But what's really interesting here
and where you can start making downstream predictions
or reasoning about the model
is that I don't know circuit.
is only on the name of the person.
And so in the paper, we also ask it,
what paper did Andre Carpathie write?
And so it recognizes the name Andre Carpathie
because he's sufficiently famous.
So that turns off the I don't know reply.
But then when it comes time for the model
to say what paper it worked on,
it doesn't actually know any of his papers.
And so then it needs to make something up.
And so you can see different components
and different circuits all interacting at the same time
to lead to this final answer.
Why I think it's a tractable problem
to understand every single thing
that's happening in a model
or that's the best way to understand
why it's being deceptive.
If you wanted to explain
why England won World War II
using particle physics,
you would just be on the wrong track.
You'd just want to look at the high level
explanations of who had more weapons,
like what did they want?
And that seems analogous
to just training linear probes
for like, are you honest?
Are you being deceptive?
like, do we catch you doing bad things
when we're red-teaming you?
Can we monitor you?
Why is this not analogous
where we're asking a particle physicist
to just backtrack and explain
why England won World War II?
I feel like you just want to go in
with your eyes wide open,
not making any assumptions
for what that deception is going to look like
or what the trigger might be.
And so the wider you can cast that net,
the better.
Depending on how quickly AI accelerates
and where the state of our tools are,
we might not be in the place
where we can, like,
prove from the ground up
that everything is safe.
But I feel like that's a very good North Star.
It's a very powerful,
reassuring North Star for us to aim for,
especially when we consider
we are part of the broader
AI safety portfolio.
I mean, do you really trust,
like, you're about to deploy this system
and you really hope it's aligned with humanity
and that you've, like,
successfully iterated through all the possible ways
that it's going to like scheme or sandbag.
But that's also probably going to be true with whatever you find.
You're not, I mean, you're not, you're still going to have variance that you haven't explained.
Or like, you found a feature, but you don't know if it actually explains deception or something else instead.
So, so I guess, first of all, I'm not saying you shouldn't try the probing approach.
Right.
Like we want to pursue the entire portfolio.
We've got the therapist interrogating the patient by asking, do you have any troubling thoughts?
We've got the linear probe, which I'd analogize to like a polygraph test,
where we're taking like very high-level summary statistics of the person's well-being.
And then we've got the neurosurgeons kind of going in and seeing if you can find any brain components
that are activating and troubling or off-distribution ways.
So I think we should do all of it.
What percent of the alignment portfolio should in McInturby?
I think as much of a chunk as is necessary.
I mean, I think at least like...
Hard to find, but I don't know.
At Anthropic, I feel like all of the different portfolios are like being very well supported
and growing.
You can also come back to like the World War II question.
You can think of it as like a hierarchy of abstractions of trust here.
Where like, let's say you want to go and talk to like Churchill.
It helps a lot if you can verify that in that conversation, in that 10 minutes, he's being honest.
And this like enables you to construct better meta-narrative.
of what's going on.
And so maybe particle physics wouldn't help you there,
but certainly the neuroscience of Churchill's brain
would help you verify that he was being trustworthy
in that conversation and that the soldiers on the front lines
were being honest in their depiction of their description
of what happened and this kind of stuff.
So long as you can verify, like, progress,
like parts of the tree up,
then that massively helps you build confidence.
I think language models are also just really weird, right?
Like with the emergent misalignment work,
I don't know if they took predictions they should have of like,
hey, I'm going to fine-tune chat GPT on code vulnerabilities.
Is it going to become a Nazi?
And I think most people would have said no.
And that's what happened.
And so what are the different person-
And how do they discover that it became a Nazi?
They started asking it a ton of different questions.
And it will do all sorts of like vile and harmful things.
Like the whole persona just totally changes.
And I mean, we are dealing with alien.
brains here who don't have the social norms of humans and or even a clear notion of like what
they have and haven't learned that we have of them I mean. And so I think you really want to go
into this with eyes wide open. Backing up for Mechinturp, if we live in a world where AI progress
accelerates, by the way, you were mentioning a little while ago that there's many wild worlds
we could be living in, but we're living at least one of them. Um, understand.
Another one that we've gestured at, but it's worth making more explicit, is this, even if the AI models are not helping write the next training algorithm for their successor, just the fact that if they had human level learning efficiency, whatever a model is learning on the job, or whatever copy of the model is learning on the job, the whole model is learning.
So in effect, it's getting...
Or if they're like a thousand times less efficient than humans are learning.
That's right.
And you just like to deploy them, even still.
Exactly.
Yeah, yeah.
Anyway, and there's a whole bunch of other things
You can think about it.
But even there, it's like you kind of have
like a broadly deployed intelligence explosion.
And I do think it's worth pressing on that future of
You know, there's this whole spectrum of crazy futures.
But the one that I feel we're almost guaranteed to get.
And this is like almost a strong statement to me.
It's one where like at the very least
you get drop in like white collar worker
at some point in the next five years.
It's like I think it's very,
very likely in two, but it seems almost over-determined in like five.
And on like the grand scheme of things, like those are kind of irrelevant timeframes.
Like it's the same either way.
And that completely changes the world over the next decade.
And if we don't have the right policies in place so that, then you end up actually with almost in some respects like a fundamentally worse world.
Because the thing that these models get good at by default is like software engineering and like computer using agents and this kind of stuff.
And then we will need to put in extra effort to put them in the loops where they help us with scientific research or they're like we have the right robotics such that we actually like experience an increase in material quality of life.
So that's worth thinking about.
Like if you're in the perspective of like I'm a country.
Like what should I be doing or thinking about?
Like plan for the case where white collar work is automatable.
And then consider what does that mean for your economy and what you should be doing to prepare policies.
What should you be doing to prepare?
Because honestly, honestly, it's like a subject of question where like if you're India or Nigeria or Australia.
Yeah.
If you're a country unlike America or China where they do have frontier models, what is it that you should be doing right now, especially on such a short time scale?
Yes.
So I think one very important point is that let's say this scenario turns out true.
Then compute becomes the most valuable resource in the world.
Like the sort of GDP of your economy is dramatic.
dramatically affected by how much compute you can deploy towards the sort of
organizations within your country. And so having some guaranteed amount of
compute I think will actually be quite important. So like pre getting ahead of
investments and like data centers and this kind of stuff on the condition that
it's like companies in your country have to be allowed to use that compute.
Not necessarily for training, but like just even just for inference. I think
the economic value here comes from inference. I think it also makes sense to invest
broadly in AI, like I think these countries have the opportunity to do so. And I think that's
like a portfolio of like, you know, foundation model companies, but also like robotic supply chain
and this kind of stuff. I think that you should invest very proactively in policies that try and
prevent capital lock in. So we're in for a much worse world if like it just so happens that
the people who had like money in the stock exchange or in land before AI are like dramatically more
wealthy than the people who don't.
Because it's a gross misallocation of resources.
So having, like, I know, one of my favorite episodes actually on your podcast was the
Georgism one where you're trying to, like, appropriately, like, value or allocate land.
And so I think, like, this strikes particularly close to home coming from Australia, where I think,
like, our, like, policies with respect to land are, like, grossly wrong.
But I think this is broadly true.
being very forward on regulation of integration of these models into your country is important and
proactively making sure that people have choice.
So let's say you should be quite proactive about making sure that the phones or devices or
like glasses that people have, people have like free choice on like what things they run.
And then, so that's like that's the we just get white collar worker.
And like you're trying to like do the best to like prepare your country for that.
And then it's like, okay, well, what can you do to like make all possible versions of the future go like well?
Like that's like covering some amount of like economic downside.
The other like things I think are really important is like figure out how you can like either make the like basically ensure dramatic upside or cover like terrible downside.
And so like getting a dramatic upside is like making sure.
or that there is investment in biology research
and this kind of stuff in an automated way
that these models are actually able to produce
novel medicines that massively improve quality of life.
And covering the downside is like AI alignment research
and this kind of stuff and automated testing
and really thinking hard about that,
AI safety institutes, this kind of stuff.
But these seem like things that a rich person,
a random rich person could also do.
Like there's, it seems like
there's not a thing that a nation state is,
Uniquely equipped to do.
Yeah, that's a good point.
In this scenario.
I mean, like, dramatic allocation of, like, resource stores compute, I think is sensible.
I would be doing that if I was in charge of a nation state.
I think it just increases your optionality in, like, most of the future worlds.
Yeah.
Dylan Patel has some scary forecasts on U.S. energy production.
Yeah, versus China.
Yes.
Yeah.
We're, like, 34 gigawatts off.
Yeah, yeah.
The U.S.'s line is, like, flat, basically.
And China's lane is, like, this.
Yeah.
And, I mean, the U.S.
like, very clearly.
Yeah.
We just need so many more power plants.
Yes.
If intelligence becomes this, like, incredibly valuable input, like, intelligence becomes almost
a raw input into the economies and quality of life or future.
The thing directly underneath that is energy.
And so making sure you have, like, you know, incredible amounts of solar, like, tile the desert
in solar panels, not, you know, some parts of the desert in solar panels, would be helpful
towards making sure that you have more access to intelligence on top.
Yeah.
Just to make it explicit, because we've been touching on it here, even if AI progress totally stalls, you think that the models are really spiky and they don't have general intelligence.
It's so economically valuable and sufficiently easy to collect data on all of these different jobs, these white-collar job tasks, such that, to Shalto's point, we should expect to see them automated within the next five years.
Yeah.
Even if you need to hand spoon every single task to the model.
It's like economically worthwhile to do so.
Even if algorithmic progress stalls out, and we just never figure out how to keep progress going, which I don't think is the case.
That hasn't stalled out yet.
It seems to be going great.
The current suite of algorithms are sufficient to automate white-collar work provided you have enough of the right kinds of data.
And in a way that compared to the TAM of salaries for all of those kinds of work is so, like, trivially worthwhile.
Yeah, exactly.
I do just want to flag as well that there's a really dystopian future
if you take more of X paradox to its extreme,
which is this paradox where we think that the most valuable things
that humans can do are the smartest things
are like add large numbers in our heads
or do any sort of white collar work.
And then we totally take for granted our fine motor skill and coordination.
But from an evolutionary perspective, it's the opposite.
So we got, like evolution has optimized fine motor coordination so well.
And even if you look at like robot hands or like the ability to open a door is still just like really hard for robots.
Meanwhile, we're seeing this total automation of coding and everything else that we've seen as clever.
The really scary future is one in which AIs can do everything except for the physical robotic tasks.
In which case you'll have humans with like AirPods and like glasses.
Glasses.
And there'll be some robot overlord controlling the human through cameras by just like telling it what to do and like having abounding.
box around the thing you're supposed to pick up.
And so you have like human meat robots.
And not like necessarily saying that like that's what the AIs would be like want to do
or anything like that.
But as in like if you would be like what are the relative economic value of things?
Like the AIs are out there doing computer programming and like the most valuable thing
that humans can do is like be amazing robots.
Now that being said, I think Moravex paradox is a little bit fake.
I think the main reason that robots are worse than at like being a robot than they
are at software engineering is the internet exists for software engineering like GitHub
exists and there is no equivalent thing. Like if you had all like you know mocap of everyone's
actions as they were like going about their daily lives like some reasonable fraction of the
human population, robotics is also like close to solve. Like like on track to be solved at the same
rate that software engineering is on track to be solved. So this is only like this vision is
only like a sort of decade long section, but it's still pretty terrible decade. Like imagine
Imagine the world where people have lost their jobs.
You haven't yet got novel biological research that means people's quality of life is dramatically
better.
You don't yet have material abundance because you haven't actually been able to action the physical
world in the necessary way.
You can't build dramatically more because that's like building dramatically more takes robots
basically.
And people's like main comparative advantage is as fantastic robots is like a shocking,
shocking world.
I mean, for the perspective of an average human, I think it actually might be better.
Your wages will be higher because you're the complement to something that is enormously
valuable, which is AI labor.
Right.
And like, you know, a decade or two on, like the world is fantastic.
Yeah.
Right?
Like, you truly, like, robotics is solved and you decide to get, like, you know,
radical abundance, basically, provided that you have all the policies set up, like,
necessary to permit building.
Like, you sort of, you end up with that same change from, you know, the, like, the, the
before and after photos of Shanghai.
We're like 20 years on.
It's just like this dramatically transformed city.
A lot of places in the world probably end up like that over that two-decade period.
But we need to make sure, like one, do our best to estimate, is this actually what is on
track to happen?
Like build sui bench, but for all the other forms of white collar work and measure and track.
That's a great thing that government should be doing, by the way.
It's like trying to break down the sort of functions of their economy into measurable
tasks and figuring out where, what does the curve actually look like for that?
Because they might be a bit shocked by the progress there.
You know, there's no sweet bench for tax, like tax and bell.
And then, I don't like have all the answers here, but like figuring out a way to like share
the proceeds of this economy, like broadly across people or like invest heavily in robotics
and collecting data so that we get robotics faster and we get material abundance faster,
invest in biological research that we get, but like all that faster.
They basically try and pull forward the radical upside because otherwise you have a pretty dark
section.
I think one thing that's not appreciated enough is how much of our leverage on the future,
given the fact that our labor isn't going to be worth that much, comes from our economic
and political system surviving for your million-ext S&P equity to mean something, for your
contract to mean anything, for the contract to mean anything, for the government.
government to be able to tax the AI labor and give you a UBI off of that, it's just like, that
requires our legal institutions, our economic institutions, our financial rails surviving
into the future.
Yes.
The way in which that likely happens is if it's also in the AI's best interests that they follow
those rails.
And by AI, I don't mean some monolithic single AI.
I just mean like firms which are employing AI and becoming more productive as a result.
You don't want to be in a position where it's so onerous.
to operate in our system that you're basically selecting for firms who either emigrate or who are like doing black market stuff, et cetera.
And which means I think like you want to make it super, super easy to deploy AI, have the equivalent to special economic zones, et cetera.
Because otherwise you are just surrendering the future outside of any control that you might have on it.
One of the reasons, by the way, that I worry about turning AGI into a national security issue
or having it have extremely close ties with the government, the Manhattan Project thing,
is that it disproportionately redirects the use of AI towards military tech and the mosquito drones and whatever.
And also naturally puts other countries in the same frame of mind, right?
If we're developing the mosquito drones, why would China not develop them?
mosquito drones.
And that just seems like a zero-sum race, and not to mention a potentially catastrophic one.
Yes.
Whereas, like, you know, like compute will be limited.
You know, we want, we will need to disproportionately accelerate some things.
To the extent it just remains totally like a consumer free market landscape.
It just seems more likely that we'll get the glorious transhumanist future where they're
developing the things that make human life better.
Yes, I mean, I agree.
Like the case where you end up with like two national.
projects facing off against each other, it's dramatically worse.
Right. Like, we don't want to live in that world.
It's much better if there's like stays at free market, so speak.
Yeah, yeah, yeah. Okay, I want to take issue with your claim that even if with the algorithms
of today, if we just collect enough data that we could automate white-collar work.
First, let me get an understanding of what you mean by that. So do you mean that we would
do the analogous thing of free training with all the trajectories of everything people would do
on their jobs. Could you make either manually or through some other process some R.L procedure
based on the screen recordings of your white color worker? What kind of thing are you imagining?
I mean, like a continuous distribution of this stuff. One like important like mental model
to think about RL is I think as like the task gets more, there is some respect with which like
longer horizon or better at that task, if you can do them, if you can get them. If you can get
that reward ever are like easier to judge. So like again, it's come back to like, can you make
money on the internet? That's an incredibly easy reward signal to judge. But to like do that, there's
like a whole hierarchy of like complex behavior. So if you could like pre-train up to the easy
to judge reward signals, like does your website work? Does it go down? Like do people like it?
Like there's there's all these reward signals that we can respond to because we have a long in,
we can like progress, do these long enough trajectories to actually like get to interesting things.
If you're stuck in this regime where you need a reward signal every five tokens,
it's a way more painful and long process.
But if you could pre-train on every screen in America,
then probably the RL tasks that you can design are very different
to if you could only take the existing internet as it is today.
And so how much of that you get access to changes the mix.
Interesting.
So as we're training them on longer and longer Verizon tasks,
and it takes longer for them to get any signal on whether they successfully completely the task.
Well, that slow down progress because it takes more compute per task.
I do think there's this notion the longer, the harder tasks, the more training is required.
And I'm sympathetic to that naively, but we as humans are very good at practicing the hard parts of tasks and decomposing them.
And I think once models get good enough at the basic stuff, they can just rehearse or fast-forward.
to the more difficult parts.
I mean, it's definitely one of the big complexities, right?
Like, as you use more compute and like the,
and as you train on, like, more and more difficult tasks,
I mean, I don't know, your rate of improvement of biology
is going to be, like, somewhat bound by the time it takes the cell to grow
in a way that your rate of improvement on math isn't, for example.
So, yes, but I think for many things we'll be able to paralyze far,
like, widely enough and get enough iteration loops.
Mm-hmm.
Will the regime of training new models go away?
Will we eventually get to like you got the model and then you just keep adding more skills to it with RL training?
That depends on whether or not you think like there's a virtue in pre-training a new architecture.
Basically you make some like architectural change then you like probably need to like do some form of like at least like you're retraining a new model.
How does the fact that if RL requires a bunch of things,
inference to do the training in the first place.
Does that push against the thing you were talking about where we actually need a bigger
model in order to have brain-like energy?
But then also it's more expensive to train it in RL.
So where does that balance out?
I think we've got to drink the bitter lesson here.
And yeah, like there aren't infinite shortcuts.
Like you do just have to scale.
Something's a higher.
And have a bigger model and pay more inference for it.
And if you want AGI, then that's what you got to pay the price of.
But there's like, there's a trade-off equation here, right?
of like there is science to do,
which everyone is doing of what is the optimal point
at which to do RL.
Because you need something which can both learn
and discover the sparse reward itself.
So you don't want a one parameter model, useless,
even though you can run it really fast.
You also don't want, you know, like 100T model
because it's like super slow.
Yeah.
Password RL.
And like the marginal benefit
of like its learning efficiency is like not worth it.
Right.
So there's like a,
there's a pretty different
to you hear.
Like, what's the optimal model size
of like your current class of capabilities
and, like,
your current set of REL environments
and this kind of stuff?
Yeah, and even in the last year,
there's been much more of a factor
of the inference cost, right?
So just explicitly, like,
the bigger the model,
the more expensive it is to do a forward pass
and generate tokens.
And the calculus used to just be,
should I allocate my flops
to more training data
or a bigger model?
Yeah.
And now another huge factor is,
how much am I actually going to do
forward passes on this model
once it's trained?
Yeah, my total pool of compute.
How do I allocate that across train data compute
and inference compute for the RL training?
And then even within inference, there's all this research on,
well, what strategy should I use?
Should I sample 10 and take the best?
Do I do this sort of like branching search, et cetera, et cetera?
And so with RL where you're sampling a whole lot of tokens,
you also need to factor in the ability for the model to actually generate those tokens
and then learn and get feedback.
Okay.
So if we're living in this world,
what is your advice to somebody early in their career or a student in college?
How should they be, what should they be planning on doing?
Yeah.
So I think, once again, it's worth considering the spectrum of possible worlds and preparing yourself for that.
And the one, like the sort of action that I think is like highest EV in that case is you are about to get dramatic, at a minimum, you are about to get dramatically more leverage.
You already have.
Like already the startups in YC are like writing huge amounts of their code with, you know, Claude.
So what challenges, what causes do you want to change in the world with that added leverage?
Like if you had 10 engineers at your Beacon Call, what would you do?
Or if you had a company at your Beacon Call, like, what would that enable you to do?
And what problems and domains suddenly become tractable?
That's the world you want to prepare for.
Now, that still requires a lot of technical depth.
Obviously, there is the case where AI just becomes dramatically better than like everyone had everything.
Right?
But for at least a while, probably, there is like advantage.
I think Jensen actually talked about this in an interview,
where he's like, you know, I have like 100,000 general intelligences around me,
and I'm still, like, somewhat useful because I'm there, like, you know, directing the values
and like, like, asking them to do things.
And, you know, they're still like, I still have value, even though I have 100,000 general intelligences.
And for many people, I think that will still be true for a farewell.
And then, you know, as the AIs get better and better and better and like so on, eventually, no.
But, again, prepare for like the spectrum of possible worlds.
because in the event where we're just totally outcompeted,
it doesn't matter what you do.
But in all the other worlds,
it matters a lot.
Get the technical depth.
Study biology, study CS,
like really think hard about,
study physics.
Think about hard about what challenges
you want to solve on the world.
Yeah.
That's a lot of topics.
You can now.
You can.
Right?
It's so much easier to learn.
That's right.
Yeah, everyone now has the infinite perfect tutor.
Yeah, yeah.
Yeah.
It's definitely been helpful to me.
Yeah.
I would say some combination of like,
get rid of the sunk cost of your like previous workflows or expertise in order to evaluate
what AI can do for you.
That's right.
And another way to put this, which is fun, is just like, be lazier in so much as like figure
out the way that the agent can do the things that are toilsome.
But you're going to have to, in this, ultimately you get to be lazier, but in the short run,
you need to like critically think about the things you're currently doing and like what an AI
could actually be better at doing.
And then go and try it or explore it.
Because I think there's still just a lot of low-hanging fruit of people assuming
and not writing the full prompt, giving a few examples,
connecting the right tools for your work to be accelerated, automated.
Yeah, yeah.
There's also the sunk cost of feeling like since you're not, quote-unquote,
early to AI that you've sort of missed the boat and you can't like...
But I think, I mean, I remember when GPD3 came out.
So, backstory on the podcast, when I graduated college, I was planning on doing some sort of AI rapper startup.
And the podcast was just like a gateway into doing that.
And so I was trying out like different things.
And at the time, I remember thinking, oh, 3.5 is out.
And people are like, I'm like so behind on like the startup scene here or whatever if I wanted to make my own rapper.
I mean, maybe the idea of the rapper was inadvisable in the first place,
but just like every time feels early because like it's sort of,
if it's an exponentially growing process.
And there were many things, many ideas that are only becoming possible now, right?
Yeah, exactly.
It's that product expenditure I talked about.
Products literally obsolete.
Like, you need to constantly reinvent yourself to stay at the frontier of capabilities.
By the way, do you remember?
I had a really shitty idea and I gave you a call.
I don't know what it was.
It was like, I think it was like rag for like lawyers.
or something.
Yeah, anyways, I think one of our first interactions was I'm like, hey, what do you think
of this idea?
And you're like, I think the podcast sounds promising.
Which I appreciate.
I got slightly annoyed at a friend recently who I think is really talented and clever
and interested in AI, but has pursued a biology route.
And I just kind of tried to shake them of like, you can work on a.
if you want to.
I mean, I think humans are artificial, not artificial,
are biological general intelligences,
where a lot of the things of value are just very general.
Yeah.
And whatever kind of specialization that you've done
maybe just doesn't matter that much.
I mean, again, it gets back to the sunk cost.
But, like, so many of the people, even my colleagues,
anthropic, are excited about AI,
and they just don't let their previous career be a blocker.
And because they're just like innately smart, talented, driven, whatever else,
they end up being very successful and finding roles.
It's not as if they were in AI forever.
I mean, people have come from totally different fields.
And so don't think that you need, like, permission from some abstract entity
to, like, get involved and apply and be able to contribute.
If somebody wanted to be in Ares or something,
are like right now if you give them an open problem
or like the kind of open problem
that is very likely to be the
be quite impressive.
What would it be?
I think that now that our roles like come back,
papers building on Andy Jones's
like scaling board links like scaling laws
for board games are interesting.
Like showing that you can
like investigating these questions like the ones you asked before
where you're like oh like you know,
is the model actually learning to do
more than its previous PASICK
or is it just like discovering that?
Like exploring questions like that deeply
I think are interesting.
Yeah, yeah.
Like scaling laws for RL, basically.
I'd be very curious to see like how much
like the marginal increase in meta learning
from a new task or something.
I mean, on that note, I think model diffing
has like a bunch of opportunities.
Yeah.
Also people say, oh, we're not capturing all the features.
There's all this stuff left on the table.
What is that stuff that's left on the table?
Yeah.
Like, if the model's jailbroken, is it using existing features that you've identified?
Is it only using the error terms that you haven't captured?
I don't know, there's a lot here.
I think Mats is great.
The Anthropic Fellowship has been going really well.
Goodfire Anthropic invested in recently.
They're doing a lot of interpretability work or just apply directly to us.
Anything to get your equity up, huh?
There's just so many interpretability projects that are like, there's so much low-hanging fruit.
and we need more people and I don't think we have much time.
Yeah.
I also want to make a plug for performance engineering.
I think this is one of the like best ways to demonstrate that you have like the raw ability to do it.
Like if you made a extremely efficient transform implementation on TPU or Trinium or like Incuda, then I think there's a pretty high likelihood that you'll get a job of.
There is a relatively small pool of people that you can trust to completely own and to end the performance of a model.
And if you have broad, deep electrical engineering skills, I think you can probably come up to speed pretty fast on an accelerator stuff.
You can come up to speed reasonably fast.
And it teaches you a lot of good intuitions of the actual intricacies of what's going on in the models, which means that you're then very well placed to think about architecture and this kind of stuff.
One of my favorite people in thinking about architecture and Anthropic at the moment actually came from like a heavy GPU kernel programming background.
Just like, nosians announced really deeply and can think about the tradeoffs really well.
This is fun, guys.
Yeah, thanks.
Great to be back.
I hope you enjoyed this episode.
If you did, the most helpful thing you can do is just share it with other people who you think might enjoy it.
Send it to your friends, your group chats, Twitter, wherever else.
Just let the word go forth.
Other than that, super helpful if you can subscribe on YouTube and leave a five star review on Apple
podcast and Spotify. Check out the sponsors in the description below. If you want to sponsor a future
episode, go to thwarcash.com slash advertise. Thank you for tuning in. I'll see you on the next one.
