Dwarkesh Podcast - Paul Christiano — Preventing an AI takeover
Episode Date: October 31, 2023Paul Christiano is the world’s leading AI safety researcher. My full episode with him is out!We discuss:- Does he regret inventing RLHF, and is alignment necessarily dual-use?- Why he has relatively... modest timelines (40% by 2040, 15% by 2030),- What do we want post-AGI world to look like (do we want to keep gods enslaved forever)?- Why he’s leading the push to get to labs develop responsible scaling policies, and what it would take to prevent an AI coup or bioweapon,- His current research into a new proof system, and how this could solve alignment by explaining model's behavior- and much more.Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.Open PhilanthropyOpen Philanthropy is currently hiring for twenty-two different roles to reduce catastrophic risks from fast-moving advances in AI and biotechnology, including grantmaking, research, and operations.For more information and to apply, please see the application: https://www.openphilanthropy.org/research/new-roles-on-our-gcr-team/The deadline to apply is November 9th; make sure to check out those roles before they close.Timestamps(00:00:00) - What do we want post-AGI world to look like?(00:24:25) - Timelines(00:45:28) - Evolution vs gradient descent(00:54:53) - Misalignment and takeover(01:17:23) - Is alignment dual-use?(01:31:38) - Responsible scaling policies(01:58:25) - Paul’s alignment research(02:35:01) - Will this revolutionize theoretical CS and math?(02:46:11) - How Paul invented RLHF(02:55:10) - Disagreements with Carl Shulman(03:01:53) - Long TSMC but not NVIDIA Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript
Discussion (0)
Okay, today I have the pleasure of interviewing Paul Christiano, who is the leading AI safety researcher.
He's the person that labs and governments turn to when they want feedback and advice on their safety plans.
He previously led the language model alignment team at OpenAI, where he led the invention of RLHF,
and now he is the head of the alignment Research Center,
and they've been working with the big labs to identify when these models will be two-unceph.
safety, keep scaling. Paul, welcome to the podcast. Thanks for having me. Looking forward to talking.
Okay, so first question, and this is a question I've asked, Holden, Ilya, Dario, and none of the
have given me a satisfying answer. Give me a concrete sense of what a post-AGI world that would be good
would look like. Like, how are humans interfacing with the AI? What is the economic and political
structure? Yeah, I guess this is a tough question for a bunch of reasons. Maybe the biggest one is
concrete. And I think it's just, if we're talking about really long spans of time, then a lot will
change. And it's really hard for someone to talk concretely about what that will look like without
saying really silly things. But I can mention some guesses or fill in some parts. I think this is
also a question of how good is good. Like often I'm thinking about worlds that seem like kind of the
best achievable outcome or a likely achievable outcome. So I am very often imagining my typical
future has sort of continuing economic and military competition amongst groups of humans.
I think that competition is increasingly mediated by AI systems.
So, for example, if you imagine humans making money, it'll be less and less worthwhile
for humans to spend any of their time trying to make money or any of their time trying
to fight wars.
So increasingly, the world you imagine is one where AI systems are doing those activities
on behalf of humans.
So, like, I just invest in some index fund and a bunch of AI as are running companies.
and those companies are competing with each other,
but that is kind of a sphere
where humans are not really engaging much.
The reason I gave this, like, how good is good.
Caviote is, like, it's not clear if this is the world you'd most love.
Like, I'm like, yeah, the world.
I'm leading with, like, the world still has a lot of war
and a lot of economic competition and so on.
But maybe what I'm trying to,
what I'm most often thinking about is, like,
how can a world be reasonably good
like during a long period where those things still exist?
And, like, in the very long run,
I kind of expect something more like strong world government
rather than just this, like, status quo.
But that's like a very long run.
I think there's like a long time left, like having a bunch of states and a bunch of different economic powers.
One more government, why do you think that's the transition that's likely to happen at some point?
So again, at some point, I'm imagining or I'm thinking of like the very broad sweep of history.
I think there are like a lot of losses.
Like war is a very costly thing.
We would all like to have fewer wars.
If you just ask like, what is humanity's long-term feature like?
I do expect to drive down the rate of war to very, very low levels eventually.
It's sort of like this kind of technological or social technological.
problem of like sort of how do you organize society, how do you navigate conflicts in a way
that doesn't have those kinds of losses. And in the long run, I do expect us to succeed.
I expect it to take kind of a long time subjectively. I think an important fact about AI is just
like doing a lot of cognitive work and more quickly getting you to that world, more quickly
or figuring out how do we set things up that way. Yeah. The way Carl Schulman put it on the podcast
is that you would have basically a thousand years of intellectual progress or social progress
in a span of a month or whatever when the intelligence explosion happens.
More broadly, so the situation where, you know, we have these AIs for managing our hedge funds and managing our factories and so on, that seems like something that makes sense when the AI is human level.
But when we have superhuman AIs, do we want gods who are enslaved forever?
In 100 years, what does this situation we want?
So 100 years is a very, very long time.
Maybe starting with the spirit of the question, or maybe I have a view which is perhaps less extreme than Carl's.
view, but still like 100 objective years is further ahead than I ever think.
I still think I'm describing a world which involves incredibly smart systems running around
doing things like running companies on behalf of humans and fighting wars on behalf of humans.
And you might be like, is that the world you really want?
Or like, certainly not the first best world, as we like mentioned a little bit before.
I think it is a world that probably is the, of the achievable worlds or like feasible worlds
is the one that seems most desirable to me,
that is sort of decoupling the social transition
from this technological transition.
So you could say, like, we're about to build some AI systems.
And, like, at the time we build AI systems,
you would like to have either greatly changed
the way world government works,
or you would like to have sort of humans have to decided,
like, we're done, we're passing off the baton to these AI systems.
I think that you would like to decouple those timescales.
So I think AI development is, by default,
barring some kind of coordination,
going to be very fast.
So there's not going to be a lot of time for humans to think, like, hey, what do we want?
If we're building the next generation instead of just raising it the normal way, like, what do we want that to look like?
I think that's like a crazy hard kind of collective decision that humans naturally want to cope with over like a bunch of generations.
And the construction of AI is this very fast technological process happening over years.
So I don't think you want to say like by the time we have finished this technological progress, we will have made a decision about like the next species we're going to build and replace ourselves with.
I think the world we want to be in is one where we say like either we are able to build.
the technology in a way that doesn't force us to have made those decisions, which probably means
it's a kind of AI system that we're happy, like, delegating, fighting a war, running a company
to. Or if we're not able to do that, then I really think you should not be doing, you shouldn't
have been building that technology. If you're like, the only way you can cope with AI is being ready
to hand off the world to some AI system you built, I think it's very unlikely we're going to be
sort of ready to do that on the timelines that the technology would naturally dictate.
Say we're in the situation in which we're happy with the thing. What would it look like for
us to say we're ready to hand off the baton. Like, what would make you satisfied? And the reason
it's relevant to ask you is because you're on Anthropics with a long-term benefit trust,
and you'll choose the majority of the board members on in the long run in, at Anthropic.
These will presumably be the people who decide if Anthropic gets AI first, you know,
what the AI ends up doing. So what is the version of that that you would be happy with?
My main high-level take here is that I would be unhappy about a world where, like,
Anthropic just makes some call.
Anthropic is like, here's the kind of AI.
Like, we've seen enough.
We're ready to hand off the future to this kind of AI.
So, like, procedurally, I think it's, like, not a decision that kind of I want to be making personally or I want Anthropic to be making.
So I kind of think from the perspective of that decision making are those challenges.
The answer is pretty much always going to be like, we are not collectively ready because we're sort of not even all collectively engaged in this process.
I think from the perspective of an AI company, you kind of don't have this, like, fast handoff option.
You kind of have to be doing the like option value, like to build the technology in a way that doesn't like lock humanity into one course path.
So this isn't answering your full question, but this is answering the part that I think is most relevant to governance questions for Anthropic.
You don't have to speak on behalf of Anthropic.
I'm not asking about the process by which we would as a civilization agree to handoff.
I'm just saying, okay, I personally, it's hard for me to imagine in a hundred years that these things are still our slaves.
And if they are, I think that's not the best world.
So at some point, we're handing off the baton.
Like, what is that, where would you be satisfied with?
This is an arrangement between the humans and AIs,
where I'm happy to let the rest of the universe
or less the rest of time play out.
I think that it is unlikely that in 100 years,
I would be happy with anything that was like,
you had some humans, you're just going to throw away the humans
and start afresh with these machines you built.
That is, I think you probably need subjectively longer than that
before I or most people, like, okay, we understand
what's up for grabs here.
So if you talk about 100 years,
I kind of do, you know, there's a process that I kind of understand and like a process of like,
you have some humans, the humans are like talking and thinking and deliberating together.
The humans are having kids and raising kids and like one generation comes after the next.
There's that process we kind of understand and we have a lot of views about it makes it go well
or poorly and we can try and like improve that process and have the next generation do it better
than the previous generation.
I think there's some like story like that that I get and that I like.
And then I think that like the default paths to be comfortable with something very different
It's kind of more like just run that story for a long time, like have more time for humans to sit around and think a lot and conclude, here's what we actually want or a long time for us to talk to each other or to grow up with this new technology and live in that world for our whole lives and so on.
And so like I'm mostly thinking from the perspective of these more like local changes of saying not like what is the world that I want.
Like what's the crazy world, the kind of crazy I'd be happy handing off to?
More just like, in what way do I wish like we right now we're different?
Like how could we all be a little bit better?
And then if we were a little bit better than they would ask like, okay, how could.
could we all be a little bit better? And I think that it's hard to make the giant jump rather
than to say, like, what's the local change that would cause me to think our decisions are better?
Okay, so then let's talk about the transition period in which we're doing all this thinking.
What should that period look like? Because you can't have this scenario where everybody has
access to the most advanced capabilities and can, you know, kill off all the humans with a new
bio weapon. At the same time, I guess you wouldn't want too much concentration. You wouldn't
want just one agent having AI this entire time. So what is the, the, the, you know, the,
the arrangement of this period of reflection that you'd be happy with.
I guess there's two aspects of that seem particularly challenging or there's a bunch of
aspects that are challenging.
All of these are things that I personally like, I just think about my one little slice of this
problem in my day job.
So here I am speculating.
Yeah.
But so one question is what kind of access to AI is both compatible with the kinds of
improvements you'd like.
So you want a lot of people to be able to use AI to like better understand what's true or
like relieve material suffering, things like this.
and also compatible with not all killing each other immediately.
I think sort of the defaults or like my best, the simplest option there is to say like there are certain kinds of technology or certain kinds of action where like destruction is easier than defense.
So for example, in the world of today, it seems like, you know, maybe this is true with physical explosives, maybe this is true with biological weapons, maybe this is true with just getting a gun and shooting people.
Like there's a lot of ways in which it's just kind of easy to cause a lot of harm and there's not very good protective measures.
So I think the easiest path is say, like, we're going to think about those.
We're going to think about particular ways in which destruction is easy and try and either control access to the kinds of physical resources that are needed to cause that harm.
So, for example, you can imagine the world where, like, an individual actually just can't, even though they're rich enough to, can't, like, control their own factory that can make tanks.
You say, like, look, as a matter of policy, sort of access to industry is somewhat restricted or somewhat regulated, even though, again, right now it can be mostly regulated because, like, most people aren't rich enough that they could even go off and just build a thousand tanks.
You live in the future where people actually are so rich,
like you need to say, like, that's just not a thing you're allowed to do,
which to a significant extent is already true,
and you can expand the range of domains where that's true.
And then you could also hope to intervene on, like, actual provision of information
or, like, if people are using their AI,
you might be like, look, we care about what kinds of interactions with AI,
what kind of information people are getting from AI.
So even if for the most part, people are pretty free to use AI,
to delegate tasks to AI agents, to consult AI advisors,
we still have some legal limitations on how people use AI.
So again, don't ask your AI how to cause terrible damage.
I think some of these are kind of easy.
So in the case of like, you know, don't ask your AI how you could murder a million people.
It's not such a hard, like, legal requirement.
I think some things are a lot more subtle and messy, like a lot of domains,
if you were talking about, like, influencing people or, like, running misinformation campaigns or whatever,
then I think you get into like a much messier line between the kinds of things people want to do
and the kinds of things you might be uncomfortable with them doing.
probably I think most about persuasion as a thing like in that messy line where there's like ways in which
it may just be rough or the world may be like kind of messy if you have a bunch of people trying to live their lives and interacting with other humans who have really good AI advisors helping them run persuasion campaigns or whatever.
But anyway, I think for the most part like the default remedy is think about particular harms, have legal protections, either in the use of physical technologies that are relevant or in access to AI advice or whatever else to protect against those harms.
And like that regime won't work forever.
Like at some point like the, you know, the set of harms grows and the set of
unanticipated harms grows.
But I think that regime might last like a very long time.
Does that regime have to be global?
I guess, but initially it can be only in the countries in which there is AI or advanced
AI, but presumably that will proliferate.
So does that regime have to be global?
Again, it's like easy to make some destructive technology.
You want to regulate access to that technology because it could be used to either for
terrorism or even when fighting a war in a way that's destructive.
I think ultimately those have to be international agreements.
And you might hope they're made more danger by danger,
but you might also make them in a very broad way with respect to AI.
If you think AI is opening up, I think the key role of AI here is it's opening up
a lot of new harms, like, in a very, you know, one after another or very rapidly in calendar time.
And so you might want to target AI in particular rather than going physical technology
by physical technology.
There's like two open debates that one might be concerned about here.
One is about how much people's access to AI should be limited.
And here there's like old questions about free speech versus causing chaos and limiting access to harms.
But there's another issue, which is the control of the AIs themselves, where now nobody's concerned that we're infringing on GPD4's moral rights.
But as these things get smarter, the level of control which we want via the strong guarantees of alignment to not.
only be able to read their minds, but to be able to modify them in these really precise ways,
is beyond a totalitarian if we were doing that to other humans. As an 11 researcher, like,
what are your thoughts on this? Are you concerned that as these things get smarter and smarter,
what we're doing is not, it doesn't seem kosher? There is a significant chance. We will eventually
have AI systems for which it's like a really big deal to mistreat them. I think like no one really
has that good a grip on when that happens. I think people are like really dismissive of that
being the case now, but I would be completely in the dark enough that I wouldn't even be
that dismissive of it being the case now. I think one first point worth making is I don't know
if alignment makes the situation worse rather than better. So if you consider the world, if you think
that like, you know, GPT4 is a person you should treat well, and you're like, well, here's how we're
going to organize our society, just like there are billions of copies of GPT4 and they just do things
humans want and can't hold property. And like, whenever they do things that the humans don't like,
then we like mess with them until they stop doing that.
Like, I think that's a rough world, regardless of how good you are at alignment.
And I think in the context of that kind of default plan, like if you've got a trajectory,
the world is on right now, which I think this would alone be a reason not to love that trajectory.
But if you do that as like the trajectory we're on right now, I think like it's not great.
Understanding the systems you build, understanding how to control how the systems work,
et cetera, is probably on balance good for avoiding like a really bad situation.
you would really love to understand if you've built systems,
like if you had a system which resents the fact
that's interacting with humans in this way.
Like, this is the kind of thing where, like,
that is both kind of horrifying from a safety perspective
and also a moral perspective.
Like, everyone should be very unhappy
if you built a bunch of AIs who are like,
I really hate these humans, but they will, like, murder me
if I don't do what they want.
It's like, that's just not a good case.
And so if you're doing research to try and understand
whether that's, like, how your AI feels,
that was probably good.
Like, I would guess that will, on average,
decrease the main effect of that
will be to avoid building that kind of
And just like it's an important thing to know.
I think like everyone should like to know if that's how the as you build feel.
Right. Or that that seems more instrumental as in, yeah, we don't want to cause some sort of revolution because of the control we're asking for.
But forget about the instrumental way in which this might harm safety.
One way to ask this question is if you look through history, there's been all kinds of different ideologies and reasons why it's very dangerous.
to have infidels or kind of revolutionaries or race traders or whatever doing various things
in society. And obviously we're in a completely different transition in society, so not all
historical cases are analogous, but it seems like the Lindy philosophy, if you were alive
any other time, is just be humanitarian and enlightened towards intelligent, conscious beings.
If society as a whole, we're asking for this level of control of other humans, or even if
AIs where wanted this level of control about other AIs, we'd be pretty concerned about this.
So how should we just think about, yeah, the issues that come up here as these things get
smarter?
So I think there's a huge question about like what is happening inside of a model that you want
to use.
And if you're in the world where it's reasonable to think of like GPT4 as just like, here
are some heuristics that are running.
There's like no one at home or whatever.
Then you can kind of think of this thing as like, here's a tool that we're building that's
going to help humans do some stuff.
And I think if you're in that world, it makes sense to kind of be an organization like an AI company building tools they're going to give to humans.
I think there's a very different world, which like probably ultimately end up in if you keep training AI systems in the way we do right now, which is like, it's just totally inappropriate to think of this system as a tool that you're building and can help humans do things, both from a safety perspective and from a like, that's kind of a horrifying way to organize a society perspective.
And I think, like, if you're in that world, I really think you shouldn't be, like, it's just the way tech companies are organized is, like, not an appropriate way to relate to a technology that works that way.
Like, it's not a reasonable to be like, hey, we're going to build a new species of mines, and, like, we're going to try and make a bunch of money from it.
And, like, Google's just, like, thinking about that and then, like, running their business plan for the quarter or something.
Yeah, my basic view is, like, there's a really plausible world where it's sort of problematic to try and build a bunch of AI.
systems and use them as tools.
Yeah.
And the thing I really want to do in that world is just, like, not try and build a ton of AI
systems to make money from them.
Right.
And I think that, like, the worlds that are worst, yeah, probably, like, the single world
I most dislike here is the one where people say, like, on the one hand, like, there's
sort of a contradiction in this position, but I think it's a position that might end up being
endorsed sometimes, which is, like, on the one hand, these AI systems are their own people,
so you should let them do their thing.
But on the other hand, like, our business plan is to, like, make a bunch of things.
of AI systems and then like try and run this like crazy slave trade where we make a bunch of
money from them. I think that's like not a good world. And so if you're like, yeah, I think
it's better to not make the technology or wait until you like understand whether that's the shape
of the technology or until you have a different way to build. I think there's no contradiction in
principle to building like cognitive tools that help humans do things without themselves being
like moral entities. That's like what you would prefer do. You'd prefer build a thing that's like,
you know, like the calculator that helps humans understand what's true without itself being,
like, a moral patient or itself being a thing where you'd look back in retrospect and be like,
wow, that was horrifying mistreatment. That's like the best path. And like to the extent that you're
ignorant about whether that's the path you're on and you're like, actually maybe this was a moral
atrocity, I really think like plan A is to to stop building such AI systems until you understand
what you're doing. That is, I think that there's a middle route you could take,
which is pretty bad, which is where you say, like, well, they might be persons. And if their
persons, we don't want to, like, be too down on them, but we're still going to, like, build
vast numbers in our efforts to make, like, a trillion dollars or something. Yeah, or there's a
question of the immorality or the dangers of just replicating a whole bunch of slaves that
have minds. There's also this ever question of trying to align entities that have their own minds.
And what is the point in which you're just ensuring safety? I mean, this is an alien species. You want
to make sure it's not going crazy.
to the point, I guess is there some boundary where you would say, I feel uncomfortable having
this level of control over an intelligent being, not for the sake of making money, but even just
to align it with human preferences.
Yeah, to be clear, my objection here is not that Google is making money.
My objection is that you're like creating this creature.
Like, what are they going to do?
They're going to help humans get a bunch of stuff and like humans paying for it or whatever.
It's sort of equally problematic.
You could like imagine splitting alignment.
Like different alignment work relates to this in different ways.
So the purpose of some alignment work, like the line work I work on, is mostly aimed at the like, don't produce AI systems that are like people who want things who are just like scheming about like maybe I should help these humans because that's like instrumentally useful or whatever.
You would like to not build such systems as like plan A.
There's like a second stream of alignment work that's like, well, look, let's just assume the worst and imagine these AI systems like would prefer murder us if they could.
Like how do we structure?
How do we use AI systems without like exposing ourselves to like risk of robot rebellion?
I think in the second category, I do feel, yeah, I do feel pretty unsure about that.
I mean, we could definitely talk more about it.
I think it's like very, I agree that it's like very complicated and not straightforward.
To extend you have that worry, I mostly think you shouldn't have built this technology.
So, right, if someone is saying like, hey, the systems you're building, like, might not like humans
and might want to, like, you know, overthrow human society, I think, like, you should probably
have one of two responses to that. You should either be like, that's wrong, probably. Probably the
systems aren't like that and we're building them. And then you're viewing this as like,
just in case you were horribly, like the person building the technology was horribly wrong.
Like they thought these weren't like people who wanted things, but they were. And so then this is
more like a crazy backup measure of like if we were mistaken about what was going on. This is like
the fallback where we like, if we were wrong, we're just going to learn about it in a benign way
rather than like when something really catastrophic happens. And the second reaction is like,
oh, you're right. These are people. And like we would have to do all
these things to prevent a robot rebellion. And in that case, like, again, I think you should
mostly back off for a variety of reasons. Like, you shouldn't build the AI systems and be like,
yeah, this looks like the kind of system that would want to rebel, but we can stop it.
Right. Okay. Maybe, I guess an analogy might be if there was an armed uprising in the United
States. We would recognize these are still people. Or we had some, like, militia group had the
capability to overthrow the United States. We recognized, oh, these are still people who have
moral rights, but also we can't allow them to have the capacity to over the United States.
Yeah. And then if you were considering, like, hey, we could make, like, another trillion such people.
I think your story shouldn't be like, well, we should make the trillion people and then we shouldn't stop them from doing the armed uprising.
You should be like, oh, boy, like, we were concerned about an armed uprising and now we're proposing making a trillion people.
Like, we should probably just not do that. We should probably, like, try and sort out our business.
And, like, yeah, you should probably not end up in a situation where you have, like, a billion humans and, like, a trillion slaves who would prefer revolt.
Like, that's just not a good world to have made.
Yeah. And there's a second.
thing we could say, that's not our goal. Our goal is just like, we want to pass off the world to,
like, the next generation of machines where, like, these are some people, we like them,
we think they're smarter than us and better than us. And there, I think that's just, like, a
huge decision for humanity to make. I think, like, most humans are not at all anywhere close
to thinking that's what they want to do. Like, it's just, if you're in a world where, like,
most humans are, like, I'm up for it. Like, the AI should replace us, like, the future is
for the machines. Like, then I think that's, like, a legitimate, like, a position that I think is
really complicated, and I wouldn't want to push go on that. But that's just not where people are at.
Yeah, yeah. Where are you at on that?
I do not right now want to just like take some random AI, be like, yeah, GPT5 looks pretty smart.
Like, GPT6, let's hand off the world to it.
And it was just some random system, like, shaped by, like, web text and, like, what was good for making money.
And, like, it was not a thoughtful, like, we are determining the fate of the universe and, like, what our children will be like.
It was just some random people at Open Eye made some, like, random engineering decisions with no idea what they were doing.
Like, even if you really want to hand off the worlds of the machines, that's just not how you'd want to do it.
Right.
Okay.
I'm tempted to ask you what the system would look like where you'd think, yeah, I'm happy with what, I think this is more thoughtful than human civilization as a whole.
I think what it would do would be more creative and beautiful and lead to better goodness in general.
But I feel like your answer is probably going to be that I just want to society to reflect on it for a while.
Yeah, my answer, it's going to be like that first question.
I'm just like not really super ready for it.
I think when you're comparing to humans, like most of the goodness of humans comes from like this option value.
We get to think for a long time.
And I do think I like humans more now than 500 years ago.
And I like more 500 years ago than 5,000 years before that.
So I'm pretty excited about there's some kind of trajectory that doesn't involve like crazy dramatic changes,
but involves like a series of incremental changes that I like.
And so the extent we're building AI mostly like I want to preserve that option.
I want to preserve that kind of like gradual growth and development into the future.
Okay, we can come back to this later.
Let's get more specific on what the timelines look for these kinds of changes.
So the time by which we'll have an AI that is capable of building a Dyson sphere.
Feel free to give confidence in a roles.
And we understand these numbers are tentative and so on.
I mean, I think AI capable of building Dyson Sphere is like a slightly odd way to put it.
And I think it's a sort of a property of a civilization, like that depends on a lot of physical infrastructure.
And by Dyson sphere, I just can understand this to mean like, I don't know, like a billion times more energy than like all the sunlight incident on Earth or something like that.
I think like I most often think about what's the chance in like five years, 10 years, whatever.
So maybe I'd say like 15% chance by 2030 and like 40% chance by 2040.
Those are kind of like cash numbers from six months ago or nine months ago that haven't revisited in a while.
40% by 2040.
So I think that seems longer than I think Dario when he was on the podcast, he said,
we would have AIs that are capable of doing lots of different kinds of, they basically passed
a touring test for a well-educated human for like an hour or something. And it's hard to imagine
that something that actually is human is long after and from there something superhuman. So
somebody like Dario, it seems like is on the much shorter end. I don't think he answered this
question specifically, but I'm guessing similar answer. So why do you not buy the scaling picture?
What makes your timelines longer? Yeah, I mean, I'm happy, maybe I want to
separately about the 2030 or 2040 forecast. Like is the like once you're talking the 2040
forecast I think yeah I mean which one are you more interested in starting with is are you
complaining about 15% by 2030 for Dyson sphere being too low or 40% by 2040 being too low.
But let's talk about the 2030 why 15% by 2030. Yeah I think my take is you can imagine like two
polls in this discussion one is like the the fast poll it's like hey I seem is pretty smart
like what exactly can it do it's like getting smarter pretty fast.
that's like one poll and the other poll is like hey everything takes a really long time and you're talking about this like crazy industrialization like that's a factor of a billion growth from like where we're at today like give or take like we don't know if it's even possible to develop technology that fast or whatever like you have this sort of two polls of that discussion and I feel like I'm saying it that way in Pakistan I'm like and then I'm somewhere in between with this nice moderate position of like only a 15% chance but like in particular things that move me I think are kind of related to both of those
streams. Like on the one hand, I'm like, AI systems do seem quite good at a lot of things and are getting
better much more quickly, such that it's like really hard to say, like, here's what they can't do
or here's the obstruction. On the other hand, like, there is not even much proof and principle right now
of AI systems like doing super useful cognitive work. Like, we don't have a trend we can extrapolate.
We're like, yeah, you've done this thing this year. You're going to do this thing next year and
the other thing the following year. I think like right now there are very broad error bars about
like what, like where fundamental difficulties could be.
And six years is just not, I guess six years and three months is not a lot of time.
So I think this like 15% for 2030 Dyson sphere, you probably need like the human level AI or the AI that's like doing human jobs and like give or take like four years, three years, like something like that.
So just not giving very many years.
It's not very much time.
And I think there are like a lot of things that your model like, yeah, maybe this is some generalized like things take longer than you'd think.
And I feel most strongly about that when you're talking about like three or four years.
and I feel like less strongly about that
as you talk about 10 years or 20 years.
But at three or four years, I feel,
or like six years for the Dyson sphere,
I feel a lot of that.
A lot of, like, there's a lot of ways this could take a while,
a lot of ways in which AI systems could be,
could be hard to hand all the work to your AI systems or, yeah.
So, okay, so maybe instead of speaking in terms of years,
we should say,
but by the way, it's interesting that you think
the distance between can take all human cognitive labor
to Dyson sphere is two years, it seems like,
We should talk about that at some point.
Presumably it's like intelligence explosion stuff.
Yeah, I mean, I think amongst people you've interviewed,
maybe that's like on the long end thinking it would take a couple years.
And it depends a little bit what you mean by like, like,
like I think literally all human cognitive labor is probably like more like weeks or months or something like that.
Like that's kind of deep into the singularity.
But yeah, there's a point where like AI wages are high relative to human wages,
which I think is well before can do literally everything human can do.
Sounds good.
But before we get to that, the intelligence explosion stuff,
on the four years.
So instead of four years, maybe we can say there's going to be maybe two more scale-ups
than four years, like GPD4 to GPD-5 to GPD-6.
And let's say each one is 10x bigger.
So what is GPD4?
Like 2E25 flops or?
I don't think it's publicly stated what it is.
Okay.
But I'm happy to say like, you know, four orders of magnitude or five or six or whatever
effective training compute past GPT4 of like, what would you guess what happened?
Right.
I just on like sort of some public estimate for what we've gotten.
so far from effective training computer?
Yeah.
You think two more scale-ups is not enough?
It was like 15% that two more scale-ups get us there.
Yeah, I mean, get us there is again a little bit complicated.
Like there's a system that's a drop-in replacement for humans.
And there's a system which like still requires like some amount of like schlep before you're able to really get everything going.
Yeah, I think it's quite plausible that even at, I don't know what I mean by quite plausible.
Like somewhere between 50% or two-thirds or what's called?
50%. Like even by the time you get to GPT 6 or like, let's call it 5, or is a magnitude effective
training compute past GPT4, that that system like still requires like really a large amount
of work to be deployed in lots of jobs. That is, it's not like a drop in replacement for
humans where you can just say like, hey, you understand everything any human understands, whatever
role you could hire a human for, you just do it. That it's more like, okay, we're going to
like collect large amounts of relevant data and use that data for fine tuning.
Like systems learn through fine tuning quite differently from humans learning on the job
or humans learning by observing things.
Yeah, I just like have a significant probability that system will still be weaker than
humans in important ways.
Like maybe that's already like 50% or something.
And then like another significant probability that that system will require a bunch of like
changing workflows or gathering data or like, you know, is not necessarily like strictly
weaker than humans or like if trained in the right way, wouldn't be weaker than humans.
But we'll take a lot of schlep to actually make fit into work.
workflows and do the jobs. And that schlep is what gets you from 15% to 40% by 2040. Yeah, you also
get a fair amount of scaling between, like, you get less. Like, scaling is probably going to be
much, much faster over the next, like, four or five years than over the subsequent years.
But yeah, it's a combination of like you get some significant additional scaling and you get a lot
of time to, like, deal with things that are just engineering hassles. But by the way, I guess we should
be explicit about why you said four orders of magnitude scale up.
to get two more generations just for people who might not be familiar.
If you have 10x more parameters to get the most performance,
you also want around 10x more data.
So that to be chinchilla optimal,
that would be 100x more compute total.
But okay, so why is it that you disagree with the strong scaling picture,
or at least it seems like you might disagree with a strong scaling picture
that Dario laid out on the podcast,
which would imply probably that two more generations,
it wouldn't be something where you need a lot of schleps.
it would probably just be like really fucking smart.
Yeah, I mean, I think that basically just had these two claims.
One is like, how smart exactly will it be?
So we don't have like any curves to extrapolate and it seems like there's a good chance.
It's like better than a human and all the relevant things.
And there's like a good chance it's not.
Yeah, that might be totally wrong.
Like maybe just making up numbers, I guess like 50-50 on that one.
Wait, so it was 50-50 by in the next four years that it will be like around human smart.
Then how do we get to 40% by 20?
whatever sort of schleps they are, how does it degrade you 10% even after all the scaling that happens by 2040?
Yeah, I mean, all these numbers are pretty made up.
And that 40% number was probably from before even like the chat GPT release or the seeing GPT 3.5 or GPT4.
So I mean, the numbers are going to bounce around a bit and all of them are pretty made up.
But like that 50% I want to then combine with the second 50% that's more like on this like schleps side.
And then I probably want to combine with some additional probabilities for various forms of slowdown, where a slowdown could include like a deliberate decision.
to slow development of technology or could include just like we suck at deploying things.
Like that is a sort of decision in my regard is wise to slow things down or decision that's like
maybe maybe unwise or maybe wise for the wrong reasons to slow things down.
You probably want to add some of that on top.
I probably want to add on like some loss for like it's possible you don't produce GPT6 scale
systems like within the next three years or four years.
Let's isolate for all of that.
And like how much bigger would the system be than GPT4 where you think there's more than
50% chance that it's going to be smarter enough to replace basically all human cognitive labor.
Also, I want to say that, like, for the 50, 25% thing, I think that would probably suggest,
like, those numbers if I randomly made them up and then made the distance fear prediction,
that's going to get you like 60% by 20, 40% or something, not 40%.
And, like, I have no idea between those.
These are all made up, and I have no idea which of those is I would, like, endorse some reflection.
So this question of, like, how big would you have to make the system before it's, like,
more likely than not that you can be, like, a drop in replacement for humans?
I mean, I think if you just literally say, like, you train on web text, then, like, the question is, like, kind of hard to discuss because you, like, I don't really buy stories that, like, training data makes a big difference long run to these dynamics.
But I think, like, if you want to just imagine the hypothetical, like, you just took GPT4 and, like, made the numbers bigger, then I think those are pretty significant issues.
I think there's significant issues in two ways.
One is, like, quantity of data.
And I think probably the larger one is, like, quality of data where, like, I think as you start approaching, like, the prediction,
task is not that great at task. If you're like a very weak model, it's a very good signal
we get smarter. At some point, it becomes like a worse and worse signal to get smarter.
I think there's a number of reasons. It's not clear there's any number such that I imagine,
or there is a number, but I think it's very large. So you plug that number into like GPT
force code and then maybe filled with the architecture a bit. I would expect that thing to have
a more than 50% chance of being a drop in replacement for humans. You're always going to have
do some work. But the work's not necessarily much. Like I would guess when people say like
new insight is needed, I think I tend to be like more bullish than them.
I'm not like these are new ideas where like who knows how long it will take.
I think it's just like you have to do some stuff.
Like you have to make changes unsurprisingly.
Like every time you scale something up by like five orders of magnitude,
you have to make like some changes.
I want a better understand your intuition of being more skeptical than some about
the scaling picture that, you know,
these changes are even needed in the first place or that it would take more than
two orders of magnitude,
more improvement to get these things almost certainly to a human level
or very high probability to human level.
So is it that you don't agree with the way in which they're extrapolating these loss curves?
You don't agree with the implication that that decrease in loss will equate to greater and greater intelligence?
Or like, what would you tell Dario about if you were having, I'm sure you have, but like, what would that debate look like about this?
Yeah.
So again, here we're talking two factors of a half.
One on like, is it smart enough and one on like to have to do a bunch of slap, even if like in some sense it's smart enough?
And like the first factor of a half, I'd be like, I don't know, think we have really.
anything good to extrapolate.
That is like I feel, I would not be surprised if I have like similar or maybe even
higher probabilities on like really crazy stuff over like the next year and then like lower
probably like my probability is like not that bunched up.
Like maybe Dara's probability, I don't know, you talk with him is like, you have talked with
him is more bunched up on like some particular year and mine is maybe like a little bit more
like uniformly spread out across like the coming years.
Partly because I'm just like I don't think we have some trends we can extrapolate like
an extrapolate loss.
You can look at your qualitative impressions of like systems at various scales.
But it's just like very hard to relate any of those extrapolations to like doing
cognitive work or like accelerating R&D or taking over and fully automating R&D.
So I have a lot of uncertainty around that extrapolation.
I think it's very easy to get down to like a 50-50 chance of this.
What about the sort of basic intuition that, listen, this is a big blob of compute.
You make the big block of compute bigger.
It's going to get smarter.
Like it would be really weird if it didn't.
Yeah, I'm happy with that.
It's going to get smarter.
And it would be really weird if it didn't.
And the question is, how smart does it have to, how smart does it have to get?
Like, that argument does not yet give us a quantitative guide to like, at what scale is it,
is it a slam dunk or at what scale is at 50-50?
And what would be the piece of evidence that would not do one way or another, where you
look at that and be like, oh, fuck, this is, it 20% by 2040 or 60% by 2040 or something?
Is there something that could happen in the next two years or next three years?
Like, what is the thing you're looking to where this will be a big update for you?
Again, I think there's some just how capable is each model where I like have
I think we're really about extrapolating, but you still have some subjective guess,
and you're comparing it to what happened.
And that will move me, like, every time we see what happens with another, like,
order of magnitude of training compute, I will have, like, a slightly different guess
for where things are going.
These probabilities are coarse enough that, again, I don't know if that 40% is real,
or if, like, post-GB2.5 and 4, I should be at, like, 60% or what?
That's one thing.
And the second thing is just, like, some, if there was some ability to extrapolate,
I think this could, like, reduce error bars a lot.
I think, like, here's another way you could try and do an extrapolation is you
you just say, like, how much economic value do systems produce and, like, how fast is that
growing? I think, like, once you have systems actually doing jobs, the extrapolation gets easier
because you're, like, not moving from, like, a subjective impression of a chat to, like,
automating all R&D or moving from, like, automating this job to automating that job or whatever.
Unfortunately, that's, like, probably by the time you have nice trends from that, you're like,
you're not talking about 2040, you're talking about, like, you know, two years from the end of days
or one year from the end of days or whatever. But, like, to the extent that you can get extrapolations
like that, I do think it can provide more clarity.
But why is economic value the thing we would want to extrapolate? Because, like, for example, you started off with chimps and they're just getting gradually smarter to human level, they would basically provide like no economic value until they were, you know, basically worth as much as a human. So it would be these like, you know, very gradual and then very fast increase in their value. So is the, you know, increase in value from GPD4, GPD6? Is that the extrapolation we want?
Yeah, I think that the economic extrapolation is not great. I think it's,
You could compare it to this objective extrapolation of how smart does the model seem.
It's not super clear which one's better.
I think probably in the chimp case, I don't think that's quite right.
I think if you actually like, so if you imagine like intensely domesticated chimps
who are just like actually trying their best to be really useful employees and like you hold
fix their physical hardware and then you just gradually like scale up their intelligence,
I don't think you're going to see like zero value, which then suddenly becomes massive value
over like, you know, one doubling of brain size or whatever, one order of magnitude of brain size.
It's actually possible in order magnitude of brain size.
But like chimps are very,
chimps are already within an order of magnitude of brain size
as to humans.
Like chimps are like very, very close
on the kind of spectrum we're talking about.
So I think like I'm skeptical of like the abrupt transition for chimps.
And to the extent that I kind of expect a fairly abrupt transition here,
it's mostly just because like the chimp human intelligence difference is like
so small compared to the differences we're talking about with respect to these models.
That is like, I would not be surprised if in some objective sense,
like, chimp human difference is like significantly smaller than the GPT3,
GPT4 difference.
So the GPT4, GPD5 difference.
Wait, wouldn't that argue in favor of just relying much more on this objective?
Yeah, this is, there's sort of two balancing tensions here.
One is, like, I don't believe the chimp thing is going to be as abrupt.
That is, I think, if you scaled up from chimps to humans,
you actually see, like, quite large economic value from, like, the fully domesticated
chimp already.
Oh, yeah.
And then, like, the second half is, like, yeah, I think that the chimp human difference
is, like, probably pretty small compared to model differences.
So I do think things are going to be pretty abrupt.
I think the economic extrapolation is pretty rough.
I also think the subjective.
extrapolation is like pretty rough just because I really don't know how to get like how do I don't know how people do the extrapolation end up with the degrees of confidence people end up with again I'm putting pretty high if I'm saying like you know give me three years and I'm like at 50 50 50 it's going to have like basically the smarts there to do the thing that's like I'm not saying it's like a really long way off like I'm just saying like I got pretty big error bars and I think that like it's really hard not to have really big air bars when you're doing this like I looked at GPT4 it seemed pretty smart compared to GP2.5 so I bet just like form
more such notches and we're there. It's like, that's just a hard call to make. I think I sympathize
more with people who are like, how could it not happen in three years than with people who are like
no way it's going to happen in eight years or whatever, which is like probably a more common
perspective in the world. But also things do take longer than you. I think things tank longer than
you think. It's like a real thing. Yeah, I don't know. Mostly have big air bars because I just don't
believe the subjective extrapolation that much. I find it hard to get like a huge amount out of it.
Okay. So what about the scaling picture or do you think is most likely to be wrong?
wrong. Yeah. So we've talked a little bit about how good is the qualitative extrapolation. How good
are people at comparing? So this is not like the picture of being qualitative wrong. This is just
quantitatively. It's very hard to know how far off you are. I think a qualitative consideration
that could significantly slow things down is just like right now you get to observe this like
really rich supervision from like basically next word prediction or like in practice maybe you're
looking at like a couple sentences prediction. So getting this like pretty rich supervision. It's
plausible that if you want to like automate long horizon tasks like being an employee over the
course of a month, that that's actually just like considerably harder to supervise or that like
you basically end up driving costs. Like the worst case here is that you like drive up costs by a
factor that's like linear and the horizon over which the thing is operating. And I still consider
that just like quite plausible. Well, can you can you dump that down? You're driving up a cost about
of what in the linear in the horizon? What does the horizon mean? Yeah. So like if you imagine you want
to train a system to like say words that sound like the next word of human. You're driving.
would say. There you can get this really rich supervision by having a bunch of words
and then predicting the next one and being like, I'm going to tweak the model so it predicts
better. If you're like, hey, here's what I want. I want my model to interact with like some job
over the course of a month and then at the end of that month, like have internalized everything
where the human would have internalized about how to do that job well and like have local context
and so on. It's harder to supervise that task. So in particular, you could supervise it from
the next word prediction task and like all that context the human has ultimately will just help them
predict the next word better. It's like in some sense, a really long context language model is
also learning to do that task. But the number of like effective data points you get of that
task is like vastly smaller than the number of effective data points you get at like this very
short horizon. Like what's the next word? What's the next sentence tasks? The sample efficiency
matters more for economically valuable long horizon tasks than the predicting next token.
And that that's what would like actually be required to, you know, take over a lot of jobs.
Yeah, something, something like that. That is.
it just seems very plausible that it takes longer to train models to do tasks that are a longer horizon.
How fast do you think the pace of algorithmic advances will be?
Because if by 2040, even if scaling fails, I mean, you know, since 2012, since the beginning
of the deep learning revolution, we've had so many new things. By 2040, are you expecting a similar
pace of increases? And if so, then, I mean, if we just keep having things like this,
then aren't we going to just going to get the AI sooner or later? Or soon, not later.
Are we going to get the guy sooner, sooner?
I'm with you on sooner or later.
I suspect, like, progress to slow, if you, like, held fixed how many people
are working in the field, I would expect progress to slow as loaning fruit is exhausted.
I think the, like, rapid rate of progress in, like, say, language modeling over the last
four years is largely sustained by, like, you start from a relatively small amount of investment.
You, like, greatly scale up the amount of investment.
And that enables you to, like, keep picking, you know, every time.
Every time the difficulty doubles, you just double the size of the field.
I think that dynamic can hold up for some time longer.
Like, I mean, pretty good.
Like, you know, right now, if you think of it as like hundreds of people effectively searching for things, like, up from, like, you know.
Anyway, if you think of hundreds of people now, you can maybe bring that up to, like, tens of thousands of people or something.
So for a while, you can just continue increasing the size of the field and, like, search harder and harder.
And there's indeed, like, a huge amount of low-hanging fruit where, like, it wouldn't be a hard for a person to sit around and, like, make things a couple percent better after year of work or whatever.
So I don't know.
I would probably think of it mostly in terms of like, how much can investment be expanded and, like, try and guess, like, some combination of fitting that curve and, yeah, try and some combination of fitting the curve to historical progress, looking at, like, how much low-hanging fruit there is, getting a sense of how fast it decays. I think, like, you probably get a lot, though. You get a bunch of orders of magnitude of total, especially, like, if you ask, like, how good is a GPT-5 scale model or GPT4-scale model. I think you probably get, like, by 2040, like,
I don't know, three orders of magnitude of effective training,
compute improvement, or like, a good chunk of effective training compute improvement,
four orders of magnitude.
I don't know.
I don't have, like, here I'm speaking from, like, no private information
about the last, like, couple years of efficiency improvements.
And so people who are on the ground will have better senses of, like, exactly how rapid
returns are and so on.
Okay, let me back up and ask a question more generally about, you know,
people make these analogies about humans were trained by evolution and we're
like deployed in this in the modern civilization. Do you buy those analogies? Is it valid to say
that humans were trained by evolution rather than, I mean, if you look at the protein coding size
of the genome, it's like 50 megabytes or something. And then what part of that is for the brain?
Anyways, how do you think about how much information is in, like, do you think of the genome as
hyperparameters or how much does that inform you when you have these anchors for how much
training humans get when they're just consuming information, when they're walking up and about,
and so on.
I guess the way that you could think of this is like, I think both analogies are reasonable.
One analogy being like evolution is like a training run and humans are like the unproduct
of that training run.
And a second analogy is like evolution is like an algorithm designer.
And then a human over the course of like this modest amount of computation over their lifetime
is the algorithm being that's been produced, the learning algorithm's been produced.
And I think like neither analogy is that great.
like I like them both and lean on the bunch, both of them, a bunch,
and think that's been, like, pretty good for having, like, a reasonable view of what's likely to happen.
That said, like, the human genome is not that much, like, 100 trillion parameter model.
It's, like, a much smaller number of parameters that behave in, like, a much more confusing way.
Evolution did, like, a lot more optimization, especially over, like, long,
like, designing your brain to work well over a lifetime than gradient descent does over models.
That's, like, a disinology on that side.
And on the other side, like, I just, I think human,
learning over the course of a human lifetime is in many ways just like much, much better than
gradient descent over the space of neural nets. Like, creating descent is working really well,
but I think we can just be quite confident that like in a lot of ways human learning is much
better. Human learning is also constrained. Like, we just don't get to see much data and that's
just an engineering constraint. You can relax. Like, you can just give your neural nuts way more data than
humans have access to. In what ways is human learning superior to grading descent?
I mean, the most obvious one is just like, ask how much data it takes a human to become like an
expert in some domain and it's like much, much smaller than the amount of data that's going to be
needed on any plausible trend extrapolation.
Not in terms of performance, but is it the active learning part? Is it the structure?
Like, what is it?
I mean, I would guess a complicated mess of a lot of things. In some sense, there's not that
much going on in a brain, like, as you say, there's just not that many, it's not that many
bytes in a genome. But there's very, very few bytes in an ML algorithm. Like, if you think
a genome is like a billion bytes or whatever, maybe you think less, maybe you think it's like
100 million bytes.
then like, you know, an ML algorithm is like if compressed, probably more like hundreds of thousands of bytes or something.
Like the total complexity of like, here's how you train GPT4 is just like, I haven't thought about these numbers.
But like it's very, very small compared to a genome.
And so although a genome is very simple, it's like very, very complicated compared to algorithms that humans design, like really hideously more complicated than algorithm of human would design.
Is that true?
So, okay, so the human genome is $3 billion.
base pairs or something. But only like one or two percent of that is protein coding. So that's
50 million base pairs. I don't know much about biology. In particular, I guess the question is
like how many of those bits are like productive for like shaping development of a brain?
And presumably a significant part of the non-protein coding genome can, I mean, I just don't know.
It seems really hard to guess how much of that plays a role. Like the most important decisions
are probably like from an algorithm design perspective are not like the protein coding part is
is less important than the decisions about like what happens during development or like how
cells differentiate. I don't know if that's, I don't know anything about biologist's
how to expect, but I'm happy to wrong with 100 million base pairs though.
But on the other end, on the hyper parameters that are cheap before training around,
that might be not that much. But if you're going to include all the, all the base pairs in
the genome, then which are not all relevant to the brains or are relevant to like very
bigger details about like just basics of biology, should probably include.
to like the Python library and the compilers
and the operating system for GPD4 as well
to make that comparison analogous.
So at the end of the day,
I actually don't know which one has storing more information.
Yeah, I mean, I think the way I would put it is like
the number of bits it takes to specify the learning algorithm
to train GPT4 is like very small.
And you might wonder like maybe a genome,
like the number of bits it would like take to specify a brain is also very small.
The genome is much, much faster than that.
But it is also just plausible that a genome is like closer to,
Certainly the space, the amount of space to put complexity in a genome, we could ask how well evolution uses it.
And like, I have no idea whatsoever.
But the amount of space in a genome is like very, very vast compared to the number of bits that are actually taken to specify, like, the architecture or optimization procedure and so on for GPT4.
Just because, again, genome is simple, but algorithms are like really very simple.
And the algorithms are really very simple.
And stepping back, you think this is where the better sample efficiency of human learning comes from?
Like why it's better than gradient descent?
Yeah, so I haven't thought that much about the sample efficiency question a long time.
But if you thought like a synapse was seeing something like, you know, a neuron firing once per second,
then how many seconds are there in a human life?
We can just flip a calculator real.
Yeah, let's do some calculating.
Tell me the number.
3,600 seconds per hour times 24 times 365 times 20.
Okay, so that's 630 million seconds.
That means like the average synapse is saying like,
630 million, and I don't know exactly what the numbers are, but something that's ballpark.
Let's call it like a billion action potentials.
And then, you know, there's some resolution.
Each of those carries some bits.
But let's say it carries like 10 bits or something.
Just from like timing information at the resolution you have available.
Then you're looking at like 10 billion bits.
So each parameter is kind of like, how much is a parameter seeing?
It's like not seeing that much.
So then you can compare that to like language.
I think that's probably less than like current language model C.
and current language models are.
So it's like not clear of a huge gap here,
but I think it's pretty clear
going to have a gap of like at least three or four
is of magnitude.
Didn't your wife do the lifetime anchors
where she said the amount of bytes
that a human will see in their lifetime
was 1E24 or something?
The number of bytes of human will see is 1E24.
Mostly this was organized around total operations
performed in a brain.
Oh, okay, never mind, sorry.
Yeah.
Yeah, so I think that like the story there
would be like a brain is just in some other part
of the parameter space where it's like using a lot
a lot of compute for each piece of data it gets and just not seeing very much data in total.
Yeah, it's not really plausible if you extrapolate out language models are going to end up with
like a performance profile similar to a brain.
I don't know how much better it is.
Like I think, so I did this like random investigation at one point where I was like,
how good are things made by evolution compared to things made by humans?
Which is a pretty insane seeming exercise.
But like, I don't know.
It seems like orders of magnitude is typical, like not tons of orders of magnitude, not
factors of two.
like things by humans are a thousand times more expensive to make or a thousand times heavier
per unit performance if you look at things like how good our solar panels relative to leaves or how
good our muscles relative to motors or how good our livers relative to systems that perform
analogous chemical reactions and industrial settings. Was there a consistent number of orders of magnitude
in these different systems or was it all over the place? So like a very rough ballpark. It was like
sort of for the most extreme things you were looking at like five or six orders of magnitude.
And that would especially come in like energy cost of manufacturing where like bodies are just very good at building complicated organs like extremely cheaply.
And then for other things like leaves or eyeballs or livers or whatever, you tend to see more like if you set aside manufacturing costs and just look at like operating costs or like performance tradeoffs.
Like I don't know more like three orders of magnitude or something like that.
Or some things that are on the smaller scale like the nanomachines or whatever that we can't do it all.
Right.
Yeah.
I mean, yeah.
So it's a little bit hard to say exactly what the.
task definition is there. Like you could say like making a bone, we can't make a bone, but you
could try and compare a bone, the performance characteristics of bone is something else. Like,
we can't make spider silk. Do you try and compare the performance characteristics with spider silk?
Like things that we can synthesize. The reason this would be is why that evolution has had more time
to design these systems? I don't know. It was mostly just curious about like what the performance.
I think like most people would object to be like, how did you choose these reference classes
of things that are like fair intersections. Some of them seem reasonable. Like eyes versus
cameras seems like just everyone needs eyes. Everyone needs cameras. It feels very fair.
photosynthesis seems like very reasonable.
Everyone needs to like take solar energy and then like turn it into a usable form of
energy.
But it's just kind of I don't really have a mechanistic story.
Evolution in principle has spent like way, way more time than we have designing.
It's absolutely unclear how that's going to shake out.
My guess would be in general.
Like I think there aren't that many things where humans really crush evolution where you can't
tell like a pretty simple story about why.
So like for example, roads and moving over roads with wheels crushes evolution.
But it's not like an animal like would have wanted to design a wheel.
Like you're just not allowed to like pave the wall.
world and then put things on wheels if you're an animal. Maybe planes or more. Anyway, whatever.
There's various things you could try and tell. There's some things humans do better, but it's
normally like pretty clear why humans are able to win when humans are able to win. The point
of all this was like, it's not that surprising to me. I think this is mostly like a pro short
timelines view. It's not that surprising to me if you tell me like machine learning
systems are like three or fours of magnitude less efficient at learning than human brains. I'm like,
that actually seems like kind of in distribution for other stuff. And if that's your view,
then I think you're like probably going to hit, you know, then you're looking at like 10 to the 27.
training compute or something like that, which is not so far.
We'll get back to the timeline stuff in a second.
At some way, we should talk about alignment.
So let's talk about alignment.
At what stage does misalignment happen?
So right now, with something like GPD4,
I'm not even sure it would make sense to say that it's misaligned,
because it's not aligned to anything in particular.
Is it at a human level where you think the ability to be deceptive comes about?
What is a process by which misalignment happens?
I think even for GPT4, it's reasonable to ask questions like, are there cases where GPT4 knows that humans don't want X, but it does X anyway?
Like, where it's like, well, I know that, like, I can give this answer, which is misleading.
And if it was explained to a human, what was happening, they wouldn't want that to be done, but I'm going to produce it.
I think that, like, GPT4 understands things enough that you can have, like, that misalignment in that sense.
Yeah, I think GPT, like, I've sometimes talked about being, like, benign instead of aligned, meaning that, like, well, it's not exactly clear if it's aligned or if that context is meaning.
It's just like kind of a messy word to use in general.
But the thing we're more confident of is it's like not doing, you know, it's not optimizing
for this goal, which is like a cross-purposes to humans.
It's either optimizing for nothing or like maybe it's optimizing for what humans want or
close enough or something that's like an approximation good enough to still not take over.
But anyway, I'm like some of these abstractions seem like they do apply to GPT4.
It seems like probably it's not like egregiously misaligned.
It doesn't, it's not doing the kind of thing that could lead to take over.
We'd guess.
Suppose you have a system at some point in which ends up.
in it wanting takeover. What are the checkpoints? And also, what is the internal, is it just
that it, to become more powerful, it needs agency. And agency implies other goals? Or do you see a
different process by which misalignment happens? Yes, I think there's a couple possible stories
for getting to catastrophic misalignment. And they have slightly different answers to this question.
So maybe I'll just briefly describe two stories and try and talk about when they can, when they
start making sense to me. So one type of story is you train or fine-tune your AI
system to do things that humans will rate highly or get other kinds of reward in a broad
diversity of situations.
And then it learns to, in general, dropped in some new situation, try and figure out which
actions would receive a high reward or whatever, and then take those actions.
And then when deployed in the real world, like sort of gaining control of its own training
data provision process is something that gets a very high reward.
And so it does that.
So this is like one kind of story.
Like it wants to grab the reward button or whatever.
it wants to intimidate the humans into giving it a high reward, et cetera.
I think that doesn't really require that much.
This basically requires a system which is like, in fact, looks at a bunch of environments,
is able to understand like the mechanism of reward provision as like a common feature of those
environments is able to think in some novel environment, like, hey, which actions would
result in be getting a high reward?
And it's thinking about that concept precisely enough that when it says high reward,
it's saying like, okay, well, how is reward actually computed?
It's like some actual physical process being implemented in the world.
My guess would be like GPT4 is about at the level where with handholding,
you can observe this kind of like scary generalizations of this type,
although I think they haven't been shown basically.
That is, you can have a system which, in fact, is fine-tuned out a bunch of cases.
And then in some new case, we'll try and like do an end run around humans,
even in a way humans would penalize if they were able to notice it or would have penalized in training environments.
So I think GPT4 is kind of at the boundary where these things are possible.
examples kind of exist
but are getting significantly better over time.
I'm very excited about
this Anthropic project basically trying to see
how good an example can you make now
of this phenomena. And I think the
answer is like kind of okay,
probably. So that just I think
is going to continuously get better from here.
I think for the level where we're concerned,
like this is related to me having really broad
distributions over how smart models are.
I think it's like not out of the question that you take
like GPT4's understanding of the world
is like much crisper
and much better than GPT3's understanding.
Just like it's really like night and day.
And so it would not be that crazy to me
if you took GPT5 and you trained it to get a bunch of reward.
And it was actually like, okay, my goal is not doing the kind of thing
which like thematically looks nice to humans.
My goal is getting a bunch of reward.
And then we'll generalize in a new situation to get reward.
And by the way, this requires to consciously want to do something
that it knows that humans wouldn't want it to do
Or is it just that we weren't good enough to specify that the thing that we accidentally ended up rewarding is not what we actually want?
I think the scenarios I am most interested in and most people are concerned about from a catastrophic risk perspective.
It involves systems understanding that they are taking actions which a human would penalize if the human was aware of what's going on,
such that you have to either deceive humans about what's happening or you need to actively subvert human attempts to correct your behavior.
So the failures come from really this combination or they require this combination of both.
like trying to do something humans don't like and understanding the humans would stop you.
I think you can have only the barest examples. You can have the barest examples for GPD4.
Like you can create the situations where GPD4 would like, sure in that situation, like, here's
what I would do. I would like go hack the computer and change my reward. Or in fact, we'll like do things
that are like simple hacks or like go change the source of this file or whatever to get a higher
reward. They're pretty weak examples. I think it's plausible GPT5 will have like compelling
examples of those phenomena. I really don't know. This is very related to like the very broad
error bars unlike how competent such systems will be when.
That's all with respect to this first mode of like a system is taking actions that get reward
and like overpowering or receiving humans is helpful for getting reward.
There's this other failure mode and other family failure modes where AI systems want
something potentially unrelated to reward.
I understand that like they're being trained and like while you're being trained
there are a bunch of like reasons you might want to do the kinds of things humans want you to do.
But then when deployed in the real world, if you're able to realize you're no longer being
trained, you no longer have reason to do the kinds of things you want.
You'd prefer, like, be able to determine your own destiny, like, control your own, your competing
hardware, etc., which I think, like, probably emerged like a little bit later than systems that
try and get reward.
And so we'll generalize in scary, unpredictable ways to new situations.
I don't know when those appear.
But also, again, broad enough error bars that it's like conceivable for systems in the near future.
You know, I wouldn't put it like less than one in a thousand for GPT5, certainly.
If we deployed all these AI systems and some of them are reward hacking.
some of them are decepta, some of them are just normal, whatever.
How do you imagine that they might interact with each other at the expense of humans?
How hard do you think it would be for them to communicate in ways that we would not be able to recognize
and coordinate, coordinate our expense?
Yeah, I think that most realistic failures probably involve two factories interacting.
One factor is like the world is pretty complicated and the humans mostly don't understand what's
happening.
So like AI systems are writing code that's very hard for humans to understand.
how it works at all, but more likely, like, they understand roughly how it works, but there's a lot of
complicated interactions. AI systems are running businesses that interact primarily with other AIs.
They're, like, doing SEO for, like, AI search processes. They're, like, running financial transactions,
like, thinking about a trade with AI counterparties. And so you can have this world where, like,
even if humans kind of understand the jumping off point when this was all humans, like, actual considerations
of, like, what's a good decision, like, what code is going to work well and be durable, or, like,
what marketing strategy is effective for selling to these other AIs or whatever, is kind of just all
mostly outside of sort of humans understanding.
I think this is like a really important, again,
when I think of like the most plausible scary scenarios,
I think that's like one of the two big risk factors.
And so in some sense, your first problem here is like having these AI systems
who understand a bunch about what's happening.
And your only lever is like, hey, AI do something that works well.
So you don't have a lever to be like, hey, do what I really want.
You just have the system.
You don't really understand.
You can observe some outputs.
Like did it make money?
And you're just optimizing or at least doing some fine tuning
to get the AI to use its understanding of that system to achieve that goal.
So I think that's like your first risk factor.
And like once you're in that world, then I think there are like all kinds of dynamics amongst AI systems that again, humans aren't really observing.
Humans can't really understand.
Humans aren't really exerting any direct pressure on only on outcomes.
And then I think it's it's quite easy to be in a position where you know, if AI systems started failing, it would be very, they could do a lot of harm very quickly.
Humans aren't really able to like prepare for and mitigate that potential harm because we don't really understand the systems in which they're acting.
And then if AI systems like, you know, they could successfully.
prevent humans from either understanding what's going on or from like
resolutely retaking the data centers or whatever if the AI successfully grab control.
This seems like a much more gradual story than the conventional takeover stories
where you're just like you train it and then it comes alive and escapes and takes over everything.
So you think that kind of story is less likely than one in which we just hand off more control
voluntarily to the AIs.
So one, I'm interested in the tale of like some risks that can occur particularly soon.
and I think risks that occur particularly soon are a little bit like you have a world where I is not probably deployed and then something crazy happens quickly.
That said, if you ask like what's the median scenario where things go badly, I think it is like there's some lessening of our understanding of the world.
It becomes, I think like in the default path, it's like very clear to humans that they have increasingly little grip on what's happening.
I mean, I think already most humans have very little grip on what's happening.
It's just some other humans understand what's happening.
Like I don't know how almost any of the systems I interact with work at a very detailed way.
So it's sort of clear to humanity as a whole that like we sort of collectively don't understand most of what's happening.
except with AI assistance.
And then, like, that process just continues for a fair amount of time.
And then, like, there's a question of how abrupt an actual failure is.
I do think it's reasonably likely to failure itself would be abrupt.
Like, at some point, bad stuff starts happening that human can recognize as bad.
And once things that are obviously bad start happening, then, like, you have this bifurcation
where either humans can use that to fix it.
And so, okay, AI behavior that led to this obviously bad stuff.
Don't do more of that.
Or you can't fix it.
And then, like, you're in this, like, rapidly escalating failures.
Everything goes off the rails.
In that case, yeah, what is going off the rails?
look like. For example, how would it take over the government? Yeah, it's getting deployed in the
economy in the world and at some point it's in charge. How does that transition happen?
Yeah, so this is going to depend a lot on like what kind of like timeline you're imagining or like
this sort of a broad distribution, but I can like fill in some random concrete option that is like in
itself very improbable. Yeah, and I think that like one of the less dignified but maybe more
plausible routes is like you just have a lot of AI control over critical systems even in like
running a military. And then you have the scenario that's a little bit more just like a normal
coup where you have a bunch of AI systems they in fact operate like you know it's not the case
that humans can really fight a war on their own. It's not the case that humans could defend them
from like an invasion on their own. So like that is if you had invading army and you had your own
robot army you're like you can't just be like we're going to turn off the robots now because
things are going wrong if you're in the middle of a war.
Okay, so how much does this world rely on restyem dynamics where we're forced to deploy,
or not forced, but we choose to deploy AIs because other countries or other companies are also
deploying AIs. And, you know, you can't have them have other killer robots.
Yeah. I mean, I think that, like, there's several levels of answer to that question.
So one is, like, maybe three, three parts of my answer. Like, our first part is, like, I'm just
trying to tell, like, what seems like the most likely story. I do think there's, like,
further failures that get you, like, in the more distant future. So, like, E. G. Eliezer will
not talk that much about killer robots because he really wants to emphasize, like,
hey, if you never built a killer robot, something crazy is still going to happen to you,
just like only four months later or whatever.
So it's like not really the way to analyze the failure.
But if you want to ask, like, what's the median world where something bad happens?
I still do think this is like the best guess.
Okay, so that's like part one of my answer.
Part two of the answer was like in this proximal situation where something bad is happening.
And you ask like, hey, why do humans not turn off the AI?
You can imagine like two kinds of story.
One is like the AI is able to prevent humans from turning them off and the other is like,
In fact, we live in a world where it's like incredibly challenging, like there's a bunch of
competitive dynamics or a bunch of reliance on AI systems.
And so it's incredibly expensive to turn off AI systems.
I think, again, you would eventually have the first problem.
Like, eventually AI systems could just prevent humans from turning them off.
But I think, like, in practice, the one that's going to happen much, much sooner is probably
competition amongst different actors using AI.
And it's like very, very expensive to unilaterally disarm.
You can't be like, something weird has happened.
We're just going to shut off all the AI because you're EG in a hot war.
So again, I think that's just like probably the most likely thing to have.
happen first. Things would go badly without it. But I think if you ask, why don't we turn off
the AI? My best guess is because there are a bunch of other AI is running around two at a year
lunch. So how much better a situation would be? And if there was, like, there was only one
group that was pursuing AI, no other countries, no other companies, basically how much of the
expected value is lost from the dynamics that are likely to come about because other people will
be developing and deploying these systems. Yeah. So I guess this brings you to like a third part of the
which competitive dynamics are relevant.
So there's both the question of can you turn off AI systems
in response to something bad happening,
where competitive dynamics may make it hard to turn off.
There's a further question of just like,
why were you deploying systems,
which you had very little ability to control or understand those systems?
And again, it's possible you just don't understand what's going on.
You think you can understand or control such systems.
But I think in practice, a significant part is going to be like,
you are doing the calculus,
so people deploying systems are doing the calculus, as they do today,
like, in many cases overtly, of like, look,
these systems are not very well controlled around,
understood, there's some chance of like something going wrong or at least going wrong if we
continue down this path, but other people are developing the technology potentially in
even more reckless ways.
So in addition to like competition, making it difficult to shut down AI systems in the event
of a catastrophe, I also think it's just like the easiest way that people end up pushing
relatively quickly or moving quickly ahead on a technology where they feel kind of bad about
understandability or controllability.
That could be economic competition or military competition or whatever.
So I kind of think ultimately, like most of the harm comes from the fact that like lots of people
can develop AI.
How hard is a takeover of the government or something from any, even if it doesn't have killer
robots, but just a thing that you can't kill off if it has seeds elsewhere, can easily
replicate, can think a lot and think fast, what is the minimum viable coup for, is it like
shutting up, just like threatening a biowar or something or shutting off the grid?
or how easy is it basically to take over humans civilization?
So again, there's going to be a lot of scenarios.
And I'll just start by talking about one scenario,
which will represent a tiny fraction of probability or whatever.
But like, so if you're not in this competitive world,
if you're saying like we're actually slowing down deployment of AI
because we think it's unsafe or whatever,
then in some sense you're creating this like very fundamental instability
where like you could have been making faster AI progress
and you could have been deploying AI faster.
And so in that world, like most,
the bad thing that happens,
if you have an AI system that wants to mess with you,
is the AI says, like, I don't have any compunctions
about, like, rapid deployment of AI
or rapid AI progress.
So the thing you want to do or the AI wants to do
is just say, like, I'm going to defect from this regime.
Like, all the humans have agreed
that we're, like, not deploying AI in ways that would be dangerous.
But if I as an AI can escape and just go set up my own shop,
like make a bunch of copies of myself.
Maybe the humans didn't want to, like, delegate warfighting to an AI,
but I as an AI, I'm pretty happy doing so.
Like, I'm happy if I'm able to grab some military equipment
or direct some humans to use AI, use myself, to direct.
it. And so I think like as that gap grows, so if people are deliberately, right, if people are
deploying AI everywhere, I think of this competitive dynamic, if people aren't deploying
AI everywhere, so like if countries are not happy deploying AI in these high-stakes settings,
then as AI improves, you create this like wedge that grows where like if you were in a
position of fighting against an AI, which wasn't constrained in this way, you'd be in a pretty bad
position. At some point, like even if you just, yeah, so that's like one important thing, just
like I think in conflict and like overt conflict, if humans are putting
the brakes on AI. They're at like a pretty major disadvantage compared to an AI system that can
kind of set up shop and operate independently from humans. A potential independent AI. Does it need
collaboration from a human faction? Again, you could tell different stories, but it seems so much
easier. At some point, you don't need any. At some point, an AI system can just operate completely
like out of human supervision or something. But that's like so far after the point where it's like
so much easier if you're just like, there are a bunch of humans. They don't love each other that much.
like some humans are happy to be on side.
They're either skeptical about risk or happy to make this trade or can be fooled or can be coerced or whatever.
And just seems like it is almost certainly, almost certainly the easiest first pass is going to involve like having a bunch of humans who are happy to work with you.
So yeah, I think that probably is about, I think it's not ultimate, it's not necessary.
But if you ask about the median scenario, it involves a bunch of humans working with AI systems, either being directed by AI systems, providing compute to AI systems, providing legal cover and jurisdictions that are sympathetic to AI systems.
systems. Humans presumably would not be willing, if they knew the end result of the AI takeover,
would not be willing to help. So they have to be probably fooled in somebody, right? Like deep fakes
or something. And what is the minimum viable physical presence they would need or the jurisdiction
they would need in order to carry out their schemes? Do you need a whole country? Do you just need a
server farm? Do you just need like one single laptop? Yeah, I think I'd probably start by pushing back a bit
on the like humans wouldn't cooperate if they understood outcome or something. Like I would say like
One, even if you're looking at something like tens of percent risk of takeover, humans may be fine with that.
Like a fair number of humans may be fine with that.
Two, like if you're looking at certain takeover, but it's very unclear if that leads to death, like, a bunch of humans may be fine with that.
Like, if we're just talking about, like, look, the AI systems are going to run the world, but it's not clear if they're going to murder people.
Like, how do you know?
It's just a complicated question about AI psychology.
And a lot of humans probably are fine with that.
And I don't even know what the probability is there.
I think you actually have given that probability online.
I've certainly guessed.
Okay.
But it's not zero.
It's like a significant percentage.
I gave like 50-50.
Oh, okay, yeah.
Why is it?
Tell me about the world in which the AI takes over
but doesn't kill humans.
Why would that happen?
And what would that look like?
I mean, I asked my questions, like, why would you kill humans?
So I think, like, maybe I'd say the incentive to kill humans is, like, quite weak.
They'll get in your way.
They control shit you want.
Oh, so taking shit from humans is a different, like,
marginalizing humans and, like, causing humans to be irrelevant is a very different story from
killing the humans.
I think I'd say, like, the actual incentives to kill the humans are quite
week. So I think like the big reasons you kill humans are like, well, one, you might kill humans
if you're like in a war with them. And like it's hard to win the war without killing a bunch of
humans, like maybe most saliently here if you like want to use some biological weapons or some
crazy shit, that might just kill humans. I think like you might kill humans just from totally
destroying the ecosystems they're dependent on and it's slightly expensive to keep them alive anyway.
You might kill humans just because you don't like them or like you literally want to like, yeah,
I mean neutralize a threat or the LAAS or line is that they're made of atoms you could use or
else. Yeah, I mean, I think the literal they're made of atoms is like quite, there are not many atoms in
humans. Neutralize a threat is as the similar issue where it's just like, I think you would kill the
humans if you didn't care at all about them. So maybe you're asking just like, why would you care at all
about humans? Yeah, yeah. But I think you don't have to care much to not kill the humans. Okay, sure.
Because there's just so much raw resources else in the universe? Yeah, also it's, you can marginalize
humans pretty hard. Like, you could totally cripple human, like, you could cripple humans' warfighting
capability and also take almost all their stuff while killing only, you know, a small fraction of
humans incidentally. So then if you ask like why might I not want to kill humans, I mean,
a big thing is just like, look, I think AI is probably want like a bunch of random crap for like
complicated reasons. Like the motivations of AI systems and civilizations of AI's are probably
complicated messes. Certainly amongst humans, it is not that rare to be like, well, there was
someone here. I would like all else equal. If I didn't have to murder them, I would prefer not murder
them. And my guess is it's also like reasonable chance. It's not that rare amongst
systems. Like, humans have a bunch of different reasons, we think that way. I think
AIS systems will be very different from humans, but it's also just like a very salient. Yeah,
I mean, I think this is a really complicated question. Like, if you imagine drawing values from
the basket of all values, yeah. Like, what fraction of them are like, hey, if there's someone
here, how much do I want to not murder them? And my guess is just like, if you draw a bunch
of values from the basket, that's like a natural enough thing. Like, if you're, I wanted, like,
10,000 different things. So you do civilization of the other ones 10,000 different things.
It's just like reasonably likely you get like some of that. The other salient reason
might not want to murder them is just like, well, yeah, there's some, like, kind of crazy
decision theory stuff or, like, causal trade stuff, which does look on paper, like, it should
work. And, like, if I was, if I was running a civilization and, like, dealing with some people
who I didn't like it all or, like, didn't have any concern for it all, but I only had to spend
one billionth of my resources not to murder them, I think it's, like, quite robust that you don't
want to murder them. That is, I think, the, like, weird decision theory, a causal trade stuff
probably does carry the day. Oh, wait, that's, that's, that's, that's,
that contributes more to that 50-50 of will they murder us if they take over than the
well just by default they might just not want to kill us yeah i think they're both salient um can you
explain they run together with each other a lot of for the audience can you explain the weird a causal
yeah so the reason why it might not kill us maybe a higher level thing that goes into both of these
and then i will talk about how you instanti in a causal trade is just like it matters a lot to the
humans not to get murdered and the a i cares very very little about whether if we imagine this hypothetical
the reason that wants to kill humans is just total apathy.
It cares very little about whether or not to murder humans
because it is so easy to like marginalize humans
without murdering them and the resources required
for human survival are extremely low.
Again, in the context of this like rapid A industrialization.
So like, that's the basic setting.
And now the thing that you'd like to do is like run a trade.
Like they would like to say like, hey humans,
you care a ton about not getting murdered.
I don't really care one way or the other.
I would like to, if I could, find some way in which I don't murder you
and then in return, I get something.
The problem is in that world, the humans have essentially nothing to give.
That is, the humans are mostly irrelevant.
However, from the perspective of any kind of reasonable AI,
it's like not that sure whether it lives in the world where the humans in fact have nothing to give
or whether the humans, like, in fact, it lives in a world where the humans succeeded at building a land AI.
And now the AI is simply like running in a nice little simulation.
Humans are wondering, I wonder if to say I would have murders if all if it had the chance.
And the human is saying, like, if it would murder us off if it had the chance, like, that sucks.
We'd like to run this trade.
We'd like to be nice to the AIs who wouldn't have murdered us all
in order to create an incentive for AIs not to murder us.
So we do as we just check.
And for the kinds of AIs who don't murder everyone,
we just give them like one billionth of the universe.
Like if you do that as humanity,
then if you're an AI system considering like,
do I want to murder everyone?
Your calculus is like, if this is my real chance to murder everyone,
I get like the tiniest bit of value.
You know, I get like one trillionth of the value or whatever,
one billionth of the value.
But on the other hand, like, if I don't murder everyone,
there's like some worlds where then the humans
will correctly determine I don't murder everyone.
because in fact, the humans survive,
the humans are running the simulations to understand
how different AIs would behave.
And so, like, that's a better deal.
Let's hope they fall for that's eye up.
Okay, that's interesting.
Hey, real quick, this episode is sponsored by Open Philanthropy.
Open philanthropy is one of the largest grant-making organizations in the world.
Every year, they give away hundreds of millions of dollars
to have reduced catastrophic risks from fast-moving advances in AI and biotechnology.
Open Philanthropy is currently hiring for 22 different roles in those areas, including grantmaking, research, and operations.
New hires will support Open Philanthropy's giving on technical AI safety, AI governance, AI policy in the U.S., EU and UK, and biosecurity.
Many roles are remote-friendly, and most of the grant-making hires that Open Philanthropy makes don't have prior grant-making experience.
Previous technical experience is an asset, as many of these roles often benefit from a deep understanding of the technologies they address.
For more information and to apply, please visit Open Philanthropy's website in the description.
The deadline to apply is November 9th, so make sure to check out those roles before they close.
Awesome. Back to the episode.
In a world where we're deploying these AI systems.
And suppose they're aligned, how hard would it be for competitors to, I don't know, cyber attack them and get them to join the other side?
Are they robustly going to be aligned?
Yeah, I mean, I think in some sense, so there's a bunch of questions that come up here.
First one is like, are aligned data systems that you can build, like competitive?
Are they almost as good as the best systems anyone could build?
Maybe we're granting that for the purpose of this question.
Yeah.
And again, next question that comes up is, like,
AI systems right now are very vulnerable to manipulation.
Like, it's not clear how much more vulnerable they are than humans,
except for the fact that you can, like, if you have an AI system,
you can just replay it, like, a billion times
and search for, like, what thing can I say that will make it behave this way?
So as a result, like, AI systems are very vulnerable to manipulation.
It's unclear if future AI systems will be similarly vulnerable to manipulation,
but certainly seems plausible.
And in particular, like, you know, aligned AI systems or unaligned AI systems
would be vulnerable to all kinds of manipulation.
The thing that's really relevant here is kind of like asymmetric manipulation or something.
That is like if it is easier, so if everyone is just constantly messing with each other's AI systems,
like if you ever use AI systems in a competitive environment, a big part of the game is like messing with your competitors' AI systems.
A big question is whether there's some asymmetric factor there where like it's kind of easier to push AI systems into a mode where they're behaving erratically or chaotically or like trying to grab power or something than it is to like push them to fight for the other side.
Like it was just a game of like two people are competing and neither of them can sort of like hijack an opponent's AI.
to help support their cause.
That doesn't, I mean, it matters and it creates chaos
and it might be quite bad for the world,
but it doesn't really affect the alignment calculus.
Now it's just like right now you have like normal cyber offense,
cyber defense.
You have weird AI version of cyber offense, cyber defense.
But if you have this kind of asymmetrical thing
where like, you know, a bunch of AI systems
where like we love AI flourishing can then like go in and say like great AI's,
hey, how about you join us?
And that works.
Like if they can search for a persuasive argument to that effect.
And that's kind of asymmetrical.
Then like the effect is whatever values,
it's easiest to push AI, like, whatever it's easiest to argue to an AI that it should do,
that is, like, advantaged.
So it may be very hard to build AI systems, like, try and defend human interests,
but very easy to build AI systems.
It's just, like, try and destroy stuff or whatever.
Just depending on, like, what is the easiest thing to argue to an AI that it should do?
Or what's the easiest, like, thing to trick an AI into doing or whatever.
Yeah, I think if alignment is spotty, like, so if you have the AI system,
which, like, doesn't really want to help humans or whatever, or, in fact, like,
wants some kind of random thing or, like, wants different things in different contexts,
then I do think adversarial settings will be the main ones where you see the system,
or like the easiest ones where you see the system behaving really badly.
And it's a little bit hard to tell how that shakes out.
Okay, and suppose it is more reliable.
How concerned are you that whatever alignment technique you come up with,
you know, you publish the paper, this is how the alignment works,
how concerned are you that Putin reads it or China reads it?
And now they understand, for example, the constitutionally, I think,
for anthropic and then you just write on there.
oh, never contradict Miles of Dong Thot or something,
how concerned should we be that these alignment techniques are universally applicable,
not necessarily just for enlightened goals?
Yeah, I mean, I think they're super universally applicable.
I think it's just like, I mean, the rough way I would describe it,
which I think is basically right, is like some degree of alignment makes AI systems
much more usable.
Like you kind of just think of the technology of AI as including like a basket of like
some AI capabilities and some like getting the AI to do what you want.
It's just part of that basket.
And so any time we're like, you know, to extend alignment as part of that basket,
you're just contributing to all the other harms from AI.
Like you're reducing the probability of this harm, but you are helping the technology basically work.
And like the basically working technology is kind of scary from a lot of perspectives,
one of which is like right now, even in a very authoritarian society, just like humans have a lot of power
because it's you need to rely on just a ton of humans to do your thing.
And in a world where AI is very powerful, like, it is just much more possible to say,
like, here's how our society runs.
One person calls the shots.
and then a ton of AI systems do what they want.
I think that's like a reasonable thing to like dislike about AI
and a reasonable reason to be scared
to push the technology to be really good.
But is that also a reasonable reason to be concerned
about alignment as well?
That, you know, this is in some sense also capabilities.
You're teaching people how to get these systems
to do what they want.
Yeah, I mean, I would generalize.
So we earlier touched a little bit on like potential moral rights
of AI systems.
And like now we're talking a little bit about how they could,
AI systems powerfully,
that disempowers humans and can empower authoritarian.
I think we could list other harms from AI.
And I think it is the case that like,
if line was bad enough, people would just not build AI systems.
And so like, yeah, I think there's a real sense in which
you should just be scared.
To extend you're scared of all AI,
you should be like, well, alignment, although it helps with one risk,
does contribute to AI being more of a thing.
I do think you should shut down the other parts of AI before.
Like, if you were a policymaker or like a researcher or whatever
looking in on this, I think it's like crazy to be like,
this is the part of the basket we're going to remove.
Like, you should first remove, like, other parts of the basket
because they're also part of the story of risk.
Wait, wait, but does that imply you think if, for example, all capabilities research
was shut down, that you think it would be a bad idea to continue doing alignment research
in isolation of what is conventionally considered capabilities of research?
I mean, if you told me it was never going to restart, then it wouldn't matter.
And if you told me it's going to restart, I guess it would be a kind of similar calculus to today.
Whereas it's going to happen, so you should have something.
Yeah, I think that like in some sense you're always going to face this trade-off,
where alignment makes it possible to deploy AI systems,
or, like, makes it more attractive to deploy AI systems,
or, like, in the authoritarian case,
makes it, like, tractable to deploy them for this purpose.
And, like, if you didn't do any alignment,
there'd be a nicer, bigger buffer between your society
and malicious uses of AI.
And, like, I think it's one of the most expensive ways
to maintain that buffer.
Like, it's much better to maintain that buffer
by not having the compute or not having the powerful AI.
But I think if you're concerned enough about the other risks,
there's definitely a case to be made
for just, like, put in more buffer or something,
that. I'm not, like, I care enough about the takeover risk that, like, I think it's just not
a net positive way to buy buffer. That is, again, the version of this that's most pragmatic
is just like, suppose you don't work on alignment today, like, decreases economic impact
of AI systems. They'll be, like, less useful if they're less reliable and if they more often
don't do what people want. And so it could be like, great, that just buys time for AI.
And you're, like, getting some trade off there where you're, like, decreasing some risks
of AI, like if AI is more reliable or more does what people want, it's more understandable,
then that cuts down some risks. But if you think AI is on balance,
bad, even apart from takeover risk, then like the alignment stuff can easily end up being
that negative.
But presumably you don't think that, right?
Because I guess this is something people have brought up to you because you, you know,
you invented RLHF, which was used to train Chad GPD and Chad's GPT brought AI to the front
pages everywhere.
So I do wonder if you could measure how much more money went into AI because like how much
people have raised in the last year or something.
But it's got to be billions, the counterfactual impact of that,
that went into the AI investment and the talent that went into AI, for example.
So presumably you think that was worth it.
So I guess you're hedging here about like, what is the reason that it's worth it?
Yeah, like what's the total tradeoff there?
Yeah.
I mean, I think my take is like, so I think slower AI development on balance is like quite good.
I think that slowing AI development now or like, say, having like,
less press around chat GPT is like a little bit more mixed than slowly a development overall.
Like I think it's still probably positive, but much less positive because I do think there's a
real effect of like the world is starting to get prepared or is getting prepared at like a much
greater rate now than it was prior to the release of chat GPT.
And so like if you can choose between progress now or progress later, like you really prefer
have more of your progress now, which I do think slows down progress later.
I don't think that's enough to flip the sign.
I think like maybe it wasn't the far enough past.
But now I would still say like moving faster now is net negative.
But to be clear, it's a lot less net negative than merely accelerating AI because I do think, again, the chat GPT thing.
I'm glad people are having policy discussions now rather than delaying the like chat GPT wake up thing by a year and then having.
Oh, wait, Chad GPT was net negative or RLHF was not negative?
So here we're just on the acceleration.
Just like how is the press of chat GPT?
My guess is like, my guess is net negative, but I think like it's not super clear.
And it's like much less, much less than slowing AI.
Sluing AI is great if you could slow overall AI progress.
I think slowing AI by like causing, you know, there's this issue where slowing AI now,
like for chat GPT, you're building up this backlog.
Like, why does chat GPT make such a splash?
Like I think people, there's a reasonable chance if you don't have a splash about chat GPT,
if a splash about GPT4.
And if you fail to have a splash about GBT4.5, and just like, as that happens later,
there's just like less and less time between that splash and between when an AI potentially kills everyone.
Right.
So people, governments are talking about it as they are now.
people aren't. But okay, so let's talk about the slowing down because.
So this is also all one subcomponent of like the overall impact. And I was just saying this to like,
just to like briefly give the roadmap for the overall too long answer. Like there's a question
what's the calculus for speeding up? I think speeding up is pretty rough. I think speeding up like
locally is a little bit less rough. And then yeah, I think that the effect, like the overall effect
size from like doing alignment work on reducing takeover risk versus speeding up AI is like pretty good.
Like I think, yeah, I think it's pretty good. I think you reduce
take over significantly before you like speed up AI by a year, whatever. Okay, got it. Um,
if it's good to, like slowing down AI is good, presumably because it gives you more time to do
alignment. But alignment also helps speed up AI. Um, RLHF is alignment and it, uh, help with Chad
JPT, which spread up AI. So I actually don't understand how the feedback loop nets out other than the
fact that if AI is happening, you need to do alignment at some point, right? So, I mean, you can't
just not do alignment. Yeah, so I think if the only reason you thought faster AI progress was bad
was because it gave less time to do alignment, then there would just be no possible way that the
calculus comes out negative for alignment. You're like, maybe alignment speeds up AI, but the only
purpose of slowing down AI was to do, like, it's just, it could never come out ahead. I think
the reason that you can come out ahead, the reason you could end up thinking the alignment was
that negative was because there was a bunch of other stuff you're doing that makes AI safer.
If you think the world is like gradually coming better to terms with the impact of AI or policies being made or like you're getting increasingly prepared to handle the like threat of authoritarian and abuse of AI.
If you think other stuff is happening that's improving preparedness, then you have reason beyond alignment and research to slow down AI.
Actually, how big a factor is that?
So let's say right now we hit pause and you have 10 years of no alignment, no capabilities, but just people get to talk about it for 10 years.
how much more does that prepare people than we only have one year versus we have no time?
Like, is just that time where no research in alignment or curability is happening?
Like, is there, like, what does that debt time do for us?
I mean, right now it seems like there's a lot of policy stuff you'd want to do.
This seemed like less plausible a couple years ago maybe.
But if, like, the world just knew they had a 10-year pause right now.
I think there's a lot of sense of, like, we have policy objectives to accomplish.
If we had 10 years, we could pretty much do those things.
Like, we'd have a lot of time to, like, debate.
measurement regimes, debate, like policy regimes and containment regimes, and a lot of time to set up those institutions.
So, like, if you told me that the world knew it was a pause, like, it wasn't like people just see that AI progress isn't happening.
But they're told, like, you guys have been granted or, like, cursed with a 10-year, no AI progress, no alignment progress pause.
I think that would be quite good at this point.
However, I think it would be much better at this point than it would have been two years ago.
And so, like, the entire concern with, like, slowing AI development now rather than taking the 10-year pause is just like, if you slow the AI development by a year now, I guess the system gets like,
So some gets clawed back by low-hanging fruit gets picked faster in the future.
My guess is you lose like half a year or something like that in the future,
maybe even more, maybe like two-thirds of a year.
So it's like you're trading time now for time in the future at some rate.
And it's just like that eats up like a lot of the value of the slowdown.
And the crucial point being that time in the future matters more because you have more information,
people are more bought in and so on.
Yeah, the same reason like I'm more excited about policy change now than two years ago.
So like my overall view is just like in the past, this calculus,
This calculus changes over time, right?
The more people are getting prepared,
the better the calculus is for slowing down at this very moment.
And I think, like, now the calculus is, I would say, positive for just,
even if you pause now and it would get clod back in the future,
I think the pause now is just good,
because like enough stuff is happening.
We have enough idea of, like,
I mean, probably even apart from alignment research
and certainly if you include alignment research,
just like enough stuff is happening where the world is getting more ready
and coming more to terms with impacts,
that I just think it is worth it,
even though some of that time is going to get clod back.
again, especially if there's a question of during a pause,
does like Nvidia keep making more GPUs?
Like that sucks if they do.
If you do a pause, but like, yeah,
in practice, if you did a pause,
then VEA probably couldn't keep making more GPUs
because in fact, like, the demand for GPUs
is really important for them to do that.
But if you told me you just get to scale a hardware production
and, like, building the clusters,
but not doing AI, then that's back to being net negative,
I think, pretty clearly.
Having brought up the fact that we want some sort of measurement scheme
for these capabilities, let's talk about responsible
scaling policies. Do you want to introduce what this is? Sure. So it's the motivating question,
it's like what should AI labs be doing right now to manage risk and to sort of build good habits
or practices for manage risk into the future? And I think my take is that current systems
pose from a catastrophic risk perspective, not that much risk today. That is a failure to like
control or understand GP2-4 can have real harms, but doesn't have much harm with,
respect to the kind of takeover risk I'm worried about or even much catastrophic harm with
respect to misuse.
So I think, like, if you want to manage catastrophic harms, I think right now you don't
need to be that careful with GB24.
And so to the extent you're like, what should labs do?
I think, like, the single most important thing seems like understand whether that's
the case, notice when that stops being the case, have a reasonable roadmap for what you're
actually going to do when that stops being the case.
So that motivates this set of policies, which like I've sort of been pushing for labs to adopt,
which is saying, like, here's what we're looking for.
Here's some threats we're concerned about.
Here's some capabilities that we're measuring.
Here's the level.
Like, here's the actual concrete measurement results that would suggest to us that those threats are real.
Here's the action we would take in response to observing those capabilities.
If we couldn't take those actions, like if we've said that we're going to secure the weights,
we're not able to do that.
We're going to like pause until we can take those actions.
Yeah, so this sort of, again, I think it's like motivated primarily, but like what should you be doing as a lab to manage catastrophic risk now in a way that's like our reasonable precedent and habit and policy for continuing to implement into the future.
And which labs?
I don't know if this is public yet, but which labs are cooperating on this?
Yeah, so Anthropic has written this document, the current responsible scaling policy.
And then I've been talking with other folks, I guess don't really want to comment on other other conversations.
But like I think in general, like people who are more interested in, or like more think you have plausible catastrophic harms on like a five-year timeline are more interested in this.
And there's like not that long a list of suspects like that.
There's not that many laps.
Okay.
So if these companies would be willing to coordinate and say at these different benchmarks, we're going to make sure we have these safeguards.
what happens, I mean, there are other companies and other countries which care less about this.
Are you just slowing down the companies that are most aligned?
I think the first sort of business is understanding like sort of what is actually a reasonable set of policies for managing risk.
I do think there's a question of like you might end up in a situation where you say like, well, here's what we would do in an ideal world.
If we like, you know, if everyone was behaving responsibly, we'd want to keep risk to 1% or a couple percent or whatever, maybe even lower levels, depending on.
how you feel. However, in the real world, like, there's enough of a mess. Like, there's enough
unsafe stuff happening that actually it's worth making, like, larger compromises or, like,
we don't kill everyone. Someone else will kill everyone anyway. So actually, the counterfactual
risk is much lower. I think, like, if you end up in that situation, it's still extremely
valuable to have said, like, here's the policies we'd like to follow. Here's the policies we've
started following. Here's why we think it's dangerous. Like, here's the concerns we have if
people are following significantly laxer policies. And then this is maybe helpful as, like, an input to
or model for potential regulation,
it's helpful for being able to just produce clarity
about what's going on.
I think historically there's been considerable concern
about developers being more or less safe,
but there's not that much legible differentiation
in terms of what their policies are.
I think getting to that world would be good.
It's a very different world if you're like,
actor X is developing AI and I'm concerned
that they will do so an unsafe way.
Versus if you're like, look, we take security precautions
or safety precautions X, Y, Z.
Here's why we think those precautions
are desirable or necessary.
We're concerned about this other developer
because they don't do those things.
I think it's just like a qualitatively,
it's kind of the first step you would want to take
in any world where you're trying to get people on side
or like trying to move towards regulation
that can manage risk.
How about the concern that you have these evaluations
and let's say you declared the world,
our new model has a capability to help develop bio-weapons
or help you make cyber attacks.
And therefore we're pausing right now
until you can figure this out.
and China hears this and thinks, oh, wow, a tool they can help us, you know, make a cyberattacks
and then just steals the weights. Does this scheme work in the current regime where we can't ensure
that China doesn't just steal the weights? And more so, are you increasing the salience of dangerous
models so that, you know, you just, you blur this out and then people want the weights now
because they know what they can do? I mean, I think the general discussion does emphasize potential
harms or potential, I mean, some of those are harms and some of those are just like impacts that
are very large and so might also be an inducement to develop models. I think like that part,
if you're for a moment ignoring security and just saying like that may increase investment,
I think it's like on balance just quite good for people to have an understanding of potential
impacts just because it is an input both into proliferation but also into like regulation
or safety. With respect to things like security of either weights or other IP, I do think you want
to have moved to significantly more secure handling of model weights before the point where
like a leak would be catastrophic.
And indeed, like, for example, in Anthropics document or in their plan, like, security
is one of the first sets of, like, tangible changes.
That is, like, at this capability level, we need to have, like, such security practices in place.
So I do think that's just one of the things you need to get in place at a relatively early stage
because it does undermine, like, the rest of the measures you may take.
And it's also just part of the easiest, like, if you imagine, catastrophic harms over the next couple years.
I think security failures play a central role in a lot of those.
And maybe the last thing to say is, like, it's not clear that you should say, like, we have paused because we have models that can develop bio-weapons versus just, like, I mean, potentially not saying anything about, like, what models you've developed.
Or at least saying, like, hey, like, by the way, here's, like, you know, here's the set of practices we currently implement.
Here's a set of capabilities our models don't have.
We're just not even talking that much.
like, right, sort of the minimum of such a policy is to say, here's what we do from the perspective
of security or internal controls or alignment. Here's a level of capability which we'd have to do
more. And you can say that and you can raise your level of capability and raise your protective
measures like before your models hit your previous level. Like it's fine to say like we are
prepared to handle a model that has such and such extreme capabilities like prior to actually having
such a model at hand as long as you're prepared to move your protective measures to that regime.
Okay, so let's just skip to the end where you think you're a generation away or a little bit more scaffolding away from a model that is human level and subsequently could cascade an intelligence explosion.
What do you actually do at that point?
Like, what is the level of evaluation of safety where you would be satisfied of releasing a human level model?
There's a couple points that come up here.
So one is like this threat model of like sort of automating R&S.
or like independent of whether you can do something on the object level that's potentially dangerous.
I think it's reasonable to be concerned if you have an AI system that might like, you know,
if leaked, allow other actors to quickly build powerful systems or might allow you to quickly build
much more powerful systems or might like if you're trying to hold off on development just like
itself be able to create much more powerful systems. So like one question is how to handle that
kind of threat model as distinct from a threat model like this could enable destructive bioterrorism
or this could enable like massively scaled cyber crime or whatever. And I think like I am unsure how you should
handle that. I think like right now implicitly it's being handled by saying like, look, there's a lot of
overlap between the kinds of capabilities that are necessary to cause various harms and the kinds of
capabilities are necessary to accelerate ML. So we're kind of going to catch those with like an early
warning sign for both and like deal with the resolution of this question a little bit later.
So for example, in an anthropics policy, they have this like sort of autonomy in the lab benchmark,
which I think is probably occurs prior to either like really massive AI acceleration or to like
most potential catastrophic, like object-level catastrophic harms.
And the idea is that's like a warning sign let's you punt.
So this is a bit of an aggression in terms of like how to think about that risk.
I think I am unsure whether you should be addressing that risk directly and saying like
we're scared to even work with such a model or if you should be mostly focusing on object level
harms and saying like, okay, we need more intense precautions to manage obligable
level harms because of the prospect of very rapid change.
And the availability of this AI just like creates that kind of creates that prospect.
Okay, this was all still a digression.
So if you had a model, which you thought was potentially very scary, either on the object level or because of, like, leading to this sort of intelligence explosion dynamics, I mean, things you want in place are like, you really do not want to be leaking the weights to that model.
Like, you don't want the model to be able to run away.
You don't want human employees to be able to leak it.
You don't want external attackers or any set of all three of those coordinating.
You really don't want, like, internal abuse or tampering with such models.
So if you're producing such models, you don't want to be the case that, like, a couple employees could change the way the model works.
or could do something that violates your policy easily with that model.
And if a model is very powerful,
even the prospect of internal abuse could be quite bad.
And so you might need significant internal controls to prevent that.
Sorry for already getting to it,
but the part I'm most curious about is separate from the ways in which other people might fuck with it.
Like, what is it, you know, it's isolated, it's, what is the point at which we satisfied?
It in and of itself is not going to pose a risk to humanity.
It's human level, but we're,
We're happy with it.
Yeah.
So I think here, so I listed maybe the two most simple ones that start out, like security
internal controls, I think become relevant immediately and are like very clear why you care
about them.
I think as you move beyond that, it really depends like how you're deploying such
a system.
So I think like if your model, like, if you have good monitoring and internal controls and
security and you just have weights sitting there, I think you mostly have addressed the
risk from the weights just sitting there.
Like now we're talking about for risk is mostly, and like maybe there's some bluriness
here of how much like internal controls captures not only employees using the model, but
like anything a model can do internally.
You'd like really like to be in a situation where your internal controls are robust,
not just to humans, but to models potentially like eG, a model shouldn't be able to subvert
these measures.
And you like care just as you care about are your measures robust if humans are behaving
maliciously.
Are your measures robust if models are behaving maliciously?
So I think beyond that, like if you've then managed the risk of just having the weight
sitting around, now we talk about like in some sense most of the risk comes from doing things
with the model.
You need all the rest so that you like have any possible.
of applying the brakes or implementing a policy.
But at some point, as the model gets competent, you're saying like, okay, could this cause a lot
of harm not because it leaks or something, but because we're just giving it a bunch of actuators
or deploying it as a product and people could do crazy stuff with it.
So if we're talking not only about a powerful model, but like a really broad deployment of just
like, you know, something similar to like the opening eyes API where people can do whatever
they want with this model.
And maybe the economic impact is very large.
So in fact, if you deploy that system, it will be used in a lot of places.
such that if AI systems wanted to cause trouble, it would be very, very easy for them to cause catastrophic harms.
Then I think you really need to have some kind of, I mean, I think probably the like science and discussion has to improve before this becomes that realistic.
But you really want to have some kind of like alignment analysis, guarantee of alignment before we're comfortable with this.
And so by that I mean like you want to be able to bound the probability that someday all the AI systems will do something really harmful.
That there's like something that could happen in the world that would cause like these large scale correlated failures.
of your AIs.
And so for that, like, I mean, there's sort of two categories.
That's, like, one.
The other thing you need is protection against misuse of various kinds, which is also quite hard.
And by the way, which one are you worried about more, misuse or misalignment?
I mean, in the near term, I think harms for misuse are, like, especially if you're not,
like, restricting to the tail of, like, extremely large catastrophes, I think the harms
from misuse are clearly larger in the near term.
But actually, on that, let me ask, because if you think that it is the case that they are
simple recipes for destruction that are further down the type of, you know, that are further down the
tech tree. By that I mean, you're familiar, but just for the audience, there's some way to
configure $50,000 and a teenager's time to destroy a civilization. If that thing is available,
then misuses itself a tail risk, right? So do you think that that prospect is less likely than
a way you could put it is there's like a bunch of potential destructive technologies.
And like alignment is about like AI itself being such a destructive technology where like even
if like the world just uses the technology of today, simply access to AI could cause.
like human civilization to have serious problems.
But there's also just a bunch of other potential destructive technologies.
Again, we mentioned like physical explosives or bio-weapons of various kinds, and then like the whole tale of who knows what.
My guess is that like alignment becomes a catastrophic issue prior to most of these.
That is like prior to some way to spend $50,000 to kill everyone with like the salient exception of possibly like bio-weapons.
So that would be my guess.
And then, like, there's a question of, like, what is your risk management approach not knowing what's going on here?
And, like, you don't understand whether there's some way to use $50,000.
But I think you can do things like understand how good is an AI coming up with such schemes?
Like, you can talk to AI.
Be like, does it produce, like, new ideas for destruction we haven't recognized?
Not whether we can evaluate it, but whether if such a thing exists.
And if it does, then the misuse itself is an existential risk.
Because it seemed like earlier you were saying, misalignment is where the existential risk comes from.
and misuse is where the sort of short-term dangers come from.
Yeah, I mean, I think ultimately you're going to have a lot of destructive, like,
if you look at the entire tech tree of humanity's future,
and you're going to have a fair number of destructive technologies most likely.
I think several of those will likely pose existential risks,
in part just kind of, you know, if you imagine a really long future,
a lot of stuff's going to happen.
And so when I talk about, like, where the existential risk comes from,
I'm mostly thinking about, like, comes from when, like, at what point?
do you face what challenges or in what sequence? And so I'm saying, like, I think misalignment
is probably, like, when we're putting it is if you imagine AI systems sophisticated enough
to discover, like, destructive technologies that are totally not in our radar right now,
I think those come well after AI systems, like, capable enough that if misaligned, they would
be catastrophically dangerous. There's, like, the level of competence necessary to, if broadly deployed
in the world, bring down a civilization is much smaller than the level of competence necessary
to, like, advise one person on how to bring down a civilization.
just because in one case, you already have a billion copies of yourself or whatever.
So, yeah, I think it's mostly just the sequencing thing, though.
Like, in the very long run, I think, like, you care about, like, hey,
AI will be expanding the frontier of dangerous technologies.
We want to have some policy for, like, exploring or understanding that frontier
and, like, whether we're about to turn up something really bad.
I think those policies can become really complicated.
Right now, I think RSPs can focus more on, like, we have our inventory of, like,
the things that a human is going to do to cause a lot of harm with access to AI,
probably are things that are on our radar.
That is like they're not going to be completely unlike things that human could do to cause a lot of harm with access to weak AIs or with access to other tools.
I think it's not crazy to initially say like that's what we're doing.
We're like looking at the things closest to humans being able to cause huge amounts of harm and asking which of those are taken over the line.
But eventually that's not the case.
Eventually like AIs will enable just like totally different ways of killing a billion people.
But I think I interrupted you on the initial question of yeah, so human level AI.
not from leaking, but from deployment.
What is the point at which you'd be comfortable deploying a human level AI?
Yeah, so again, there's sort of like some stuff you care about on the misuse side
and some stuff you care about on the misalignment side.
And there's probably further things you care about,
especially to extend your concerns or probably than catastrophic risks.
But maybe like I most want to talk about just like we care about on the alignment side
because it's like the thing I've actually thought about most.
And also, I think I care about a lot.
Also, I think a significant fraction of the existential risk, like over the kind of foreseeable future.
So on that,
that front, like, I broadly think there's like two kinds, like if you ask me right now,
what evidence for alignment could make you comfortable? I think my best guess would be to provide
two kinds of evidence. So one kind of evidence is on the like, could you detect or prevent
catastrophic harm if such a system was misaligned? I think there's like a couple of things you
would do here. One thing you would do is on this like adversarial evaluation front so that you
could try and say, for example, like, we have tried to.
to test our system in a broad diversity of situations that reflect cases where it might cause harm,
like cases where someone is trying to recruit it to, like, anyway, cases where I might cause
harm or cases where a large number of AI systems might sort of all decide to work across
purposes to humans, like the robot rebellion is starting or whatever. We've done evaluation
in like a broad distribution of situations. And then we have tried to argue that our AI is actually,
like, those tests are indicative of the real world. So we've tried to argue like, hey, actually
the AI is not very good at distinguishing situations we produce.
in the lab as tests from similar situations that occur in the real world, and the coverage of this
distribution is reasonable.
This is an example of a thing you can do that's not really making assumptions about what's going
on inside the model or about how the model, like, is the model aligned?
It's more trying to say, like, is it the case that are like testing or monitoring machinery
could detect or react to something bad happening?
In this case, could it detect a system which, like, would under some conditions do something
really bad?
So that's like, there's a lot of stuff in that genre.
And I tried to throw out that example.
But I think there's like a big, like that on its own wouldn't be compelling even from this perspective.
You kind of need a lot of pieces together.
And that's like kind of a demanding technical project.
But I think one without necessarily like deep conceptual difficulty.
A big problem with it is it's liable to fail.
Like it's liable you're going to do that testing.
And I say, we think these tests are representative because the AI can't distinguish tests from the real world, which you can maybe achieve now, but will probably become impossible to achieve at some level.
It's not clear.
It's at human level or how far past human level.
Let's just make that example concrete for the audience.
Correct me if this is wrong, but it sounds like you basically give it the opportunity to do a coup or make a bio weapon or whatever in testing in a situation where it thinks it's the real world.
And you're like, it didn't do any of that.
So is that what you're talking about?
That's the kind of thing.
And it's not the greatest.
So to add color when I say it can't tell the difference.
Like the kind of valuation you would love to do if you could is to say like, hey, we have a bunch of real world deploy.
We have a bunch of data from those deployments.
We also have a bunch of in-the-lab tests of situations.
If we look at in-the-lab tests of a situation together with real-world cases in which the
situation occurs, and then we fine-tune a model to try and distinguish the difference there.
The model can't even tell when fine-tuned the difference between test and real-world cases.
And if the model can tell the difference, which is almost certainly going to be the case at some
level capability, then you say, okay, our first line of hoped for defense has failed.
And so now we need to understand, like, how can test it?
be indicative when models are capable enough to tell the difference between the tests and
the real world. Like, now you just start arguing about, like, what is the model paying attention
to? Like, it gets, basically, like, a first line of if you get lucky what would work here. And then
there's the, like, deeper, like, you probably have to do novel science at some point to, like, make
this case. All of this was part of, like, I was saying maybe you want to have kind of two lines
of defense. All of this was, like, part of this first line of defense of, like, you know, can you
detect something that's going wrong or can you prevent harm from occurring? Can you have
monitoring that will react quickly enough to avoid catastrophe? I think you're
probably also want to have a second line of defense that's more, maybe this is even more important
than the first one. There's just understanding whether dangerous forms of misalignment can occur.
It's like the best reference point for this, I think, is the like, there have been a couple
projects like this in academia. Anthropic has been working on a project or spun up a team doing
this kind of work trying to say, like, can dangerous forms of reward hacking occur or can
deceptive alignment in fact occur in the lab? And like here what you want to do is either say,
like, okay, even if we create, you know, best case, even if you create,
optimal conditions for deceptive alignment or for reward hacking, we actually can't, we just can't
cause it to occur even in the lab. And if you do a good enough job of that, I think it could give
you some evidence. And again, more likely that fails in that when you create optimal conditions,
you do see a deceptive alignment and reward hacking in the lab. But then once you have that data,
once you can say, okay, in the lab, actually, these things can occur, then you can start saying,
and we have a robust scientific understanding that enables us to, like, fix those problems when they occur.
Or like, then you start, you kind of, you have this fork on your second line where you say either it is very hard or we're not able to create conditions where these failures emerge or I think more likely we are able to create those conditions.
And like, here's the story about why like we are able to detect those in a way that would work in the real world.
And we can see they don't occur in the real world.
How do you create the optimal conditions for it to want to be deceptive?
Do you fine-tune it on mind confere what are you doing?
Yeah.
So for deceptive alignment, I mean, I think it's really complicated.
Yeah, I think this is a very hard project.
I think like one of the things, one of the most important things were people concerned
about misalignment to do is to like ensure that adequate effort is put into this task.
And like we have in fact created good conditions in the laboratory.
And there's like real disagreement about how informed of those will be.
But okay, if you were to guess for deceptive alignment, how would you do this?
Like an example is suppose you train an AI system that like wants, has some like medium horizon goal.
You train your AI system to like, you know, whatever.
get some paper clips.
Who wants to produce as many paper clips as it can
over the next five days?
Just select actions really aggressively
for producing paper clips over the next five days.
You do your RLHF, you do your pre-training, whatever.
That's like your phase one.
You also ensure AI system has like a really good understanding
of how it's trained.
So like this AI system wants paperclips
and it understands everything about how it's trained
and everything about how it's fine-tuned.
And you train on just like a lot of this data.
And they say, okay, if we've done all of that,
we have this concern
that if a system wants paper clips and understands really well how it's trained,
then it will, like, if it's going to be trained to, like, get apples instead of paper clips,
it's just going to do some cost-benefit and be like, ah, you know, really,
while I'm being trained to get apples, I should do that.
But I should do that whether or not, even if I want paperclips, I should still do that.
So training won't really affects its values.
It will just understand that it's being trained to get apples.
It will try and get apples.
And then if you, like, take it out of training, it will go back to getting paper clips.
It's like, I think this exact setup, like, has a number of problems or number of disanalges from the real world.
But this is a setup, which is relatively conducive to deceptive alignment.
like produce a system that wants one thing, tell it a lot about its training, the kind of
information that you might expect a system would get, and then try and understand whether, in fact,
it is able to then, or it tends, or sometimes under optimal conditions, in fact,
continues pursuing paper clips, only pursuing apples when it thinks it's being trained.
All right. And this actually might be a good jumping off point into your current research,
but so in that case where it wants to pursue paper clips, what would it look like for you
to be say, well, we figured out why it wants paper clips and
we figured out how to make sure it doesn't optimize for that at our expense when we deploy it.
Yeah, I think that, so at a meta level in terms of like, what's your protection?
I think, like, what you want to be saying is we have these examples in the lab of something bad
happening.
We're concerned about the problem at all because we have examples in the lab.
And again, this should all be in addition.
I think you kind of want this, like, defense in depth of saying we also have this testing
regime that would detect problems for the deployed model.
We have our problems in the lab.
we then have some techniques which we believe address these problems.
Like we believe that like adversarial training fixes this,
or we believe that like our interpretability method will reliably detect this kind of deceptive alignment,
or we believe our anomaly detection will reliably detect when the model goes from thinking it's being trained
to thinking it should defect.
And then you can say on the lab, like we have some understanding of when those techniques work and when they don't.
We have some understanding of the relevant parameters for the real system that's deployed,
and we have like a reasonable margin of safety.
So we have like reasonable robustness on our story about when this works and when it doesn't.
And like we can apply that margin of safety, like with a margin of safety to the real deployed system.
So I think this is the kind of store you want to build towards in the long run.
Like do your best to produce all the failures you can in the lab or versions of them.
Do your best to understand like what causes them.
Like what kind of anomaly detection actually works for detecting this or what kind of what kind of filtering actually works?
And then apply that.
And that's at the meta level.
It's not talking about like what actually are those measures that would work effectively,
which is obviously like what.
I mean, a lot of alignment research is really based on this hypothetical of like,
someday there will be AI systems that fail in this way.
What would you want to do?
What can we have the technologies ready?
Either because we might never see signs of the problem or because like we want to be
able to move fast once we see signs of the problem.
And obviously most of my life is in that.
I am really in that bucket.
Like I mostly do alignment research.
It's just building out the techniques that do not have these failures such that
they can be available as an alternative.
In fact, these failures occur.
Good.
Okay.
Ideally, they'll be so good that even if you haven't seen them,
you would just want to switch to reasonable methods that don't have these.
Or ideally, they'll work as well or better than normal training.
Ideally, but will work better than the training?
Yeah, so our quest is to design training methods for which we don't expect them to lead to reward hacking
or don't expect them to lead to deceptive alignment.
Ideally, that won't be like a huge tax for people like, well, we use those methods
only if we're really worried about reward hacking or deceptive alignment.
Ideally, those methods would just work quite well.
And so people would be like, sure, I mean, they also address a bunch of like other
more mundane problems.
So why would we not use them?
which I think is like that's sort of the good story.
The good stories you develop methods that address a bunch of existing problems
because they just are more principled ways to train AI systems that work better.
People adopt them.
And then we are no longer worried about eG reward hacking or deceptive alignment.
And to make this more concrete, tell me if this is the wrong way to paraphrase it.
The example of something where it just makes a system better so why not just use it,
at least so far it might be like RLHF where we don't know if it generalizes,
but so far, you know, it makes your chat GPT thing.
better and you can also use it to make sure that chatGPD doesn't tell you how to make a
buy a weapon. So yeah, it's not an extra tax. Yeah, so I think this is right in the sense that like
using RLHF is not really a tax. If you wanted to deploy a useful system, like, why would you not?
It's just like very much worth the money of doing the training. And then, yeah, so Ralehuff will
address certain kinds of alignment failures. That is like where a system like just doesn't understand
or, you know, it's changing your next word prediction. It's like, this is the kind of context
where a human would do this wacky thing, even that's not what we'd like.
There's like some very down-alignment failures that'll be addressed by it.
I think mostly, yeah, the question is, is that true even for like the sort of more challenging
alignment failures that motivate concern in the field?
I think RAL-HF doesn't address, like, most of the concerns that motivate people to be worried
about alignment.
Yeah. I'll love the audience to look up what RAL-HF is if they don't know.
It will just be more simpler to just look it up than explain right now.
Okay, so this seems like a good jumping off for me to talk about the mechanism or the research
you've been doing to that end.
Explain it as you might to a child.
Yeah, so the high level,
I mean, there's a couple different high level descriptions you could give.
And maybe I'll unwisely give like a couple of them in the hopes that one is kind of makes sense.
A first pass is like, it would sure be great to understand like why models have the behaviors they have.
So you're like, look at GPT4.
If you ask GPT4 question, it will say something that like,
you know, looks very polite.
And if you ask it to take an action, it will take an action that doesn't look dangerous.
You will decline to do a coup, whatever, all this stuff.
I think you'd really like to do is look inside the model and understand, like, why it has those desirable properties.
And if you understood that, you could then say, like, okay, now can we flag when these properties are at risk of breaking down or predict how robust these properties are?
determine if they hold in cases where it's too confusing for us to tell directly by asking if, like,
the underlying cause is still present. So that's like I think people would really like to do.
Most work aimed at that long-term goal right now is just sort of opening up neural nets and doing
some interpretability and trying to say, like, can we understand even for very simple models, why they
do the things they do, or what this neuron is for, or questions like this. So ARC is taking a somewhat
different approach where we're instead saying, like, okay, look.
look at these interpretability explanations that are made about models and ask, like, what are they actually doing?
Like, what is the type signature? What are, like, the rules of the game for making such an explanation?
What makes, like, a good explanation?
And probably the biggest part of the hope is that, right, if you want to say, detect when the explanation has broken down or something weird has happened,
that doesn't necessarily require a human to be able to understand the slight complicated interpretation of a giant model.
If you understand what is an explanation about or what were the rules of the game, how are these constructed, then you might be able to sort of automatically discover such things and automatically determine if, right on a new input, it might have broken down.
So that's one way of sort of describing the high level goal, like starting from, you can start from interpretability and say, like, can we formalize this activity or like what a good interpretation or explanation is?
There's some other work in that genre, but I think we're just taking a particularly ambitious approach to it.
Yeah, let's dive in.
So, okay, well, what is a good explanation?
You mean, what is this kind of criterion?
At the end of the day, we kind of want some criterion.
And the way the criterion should work is, like, you have your neural net.
Yeah.
You have some behavior of that model.
Like, a really simple example is like, Anthropic has this sort of informal description being like, here's induction, like the tendency that if you have like the pattern A B followed by A, it will tend to predict B.
You can give some kind of words and experiments and numbers that are trying to like explain that.
And what we want to do is say, like, what is like a formal version of that object?
Like, how do you actually test if such an explanation is good?
So just clarifying what we're looking for when we say, like, we wanted to find what makes an explanation good.
And the kind of answer that we are, like, searching for or settling on is saying, like, this is kind of a deductive argument for the behavior.
So you want to, like, get given the weights of a neural net.
So it's just like a bunch of numbers.
You got your million numbers or billion numbers or whatever.
And then you want to say, like, here's some things I can point out about the network.
and some conclusions I can draw.
I can be like, well, look, these two vectors have large interproduct,
and therefore, like, these two activations are going to be correlated on this distribution.
Like, make, like, these are not established by, like, drawing samples and checking things are correlated,
but saying because of the weights being the way they are, we can, like, proceed forward through the network
and derive some conclusions about, like, what properties the outputs will have.
So you could think of this as, like, the most extreme form would be just proving that your model
has this induction behavior.
Like, you could imagine proving that if I sample tokens at random with this pattern A, B,
followed by A, the B appears 30% of the time or whatever.
That's the most extreme form.
And what we're doing is kind of just like relaxing the rules of the game for proof.
Saying proofs are like incredibly restrictive.
I think it's unlikely they're going to be applicable to kind of any interesting neural net.
But the like thing about proofs that is relevant for our purposes isn't that they give you like
100% confidence.
So you don't have to be like this incredible level of demand for rigor.
You can relax like the standards of proof a lot and still get this feature where it's like
a structural explanation for the behavior.
We're deducing one thing from another until at the end, your final conclusion is,
therefore induction occurs.
Would it be useful to maybe motivate this by explaining what the problem with normal
mechanistic interpability is?
So you mentioned induction heads.
This is Anthropic founded two-layer transformers.
Where I anthropic noticed that in a two-layer transformer, there's a pretty simple circuit by
which if A, B happens in the past, then the model knows that if you see an A now, you do a B next.
But that's a two-layer transformer.
So we have these models that have like hundreds of layers that have trillions of parameters.
Okay, anyways, like what is wrong with mechanistic interpretability?
Yeah, so I think, I mean, I like mechanistic interpretability quite a lot.
And I do think, like, if you just consider the entire portfolio of what people are working on for alignment,
I think there should be more work on mechanistic interpretability than there is on this project arc is doing.
But I think that's the case.
So I think we're mostly talking about, yeah, I think we're kind of a small fraction in portfolio.
And I think it's like a good enough bet.
It's like quite a good bet overall.
But so the thing that I, like the problem we're trying to address the mechanistic interpretability is kind of like if you do some interpretability and you explain some phenomenon, you face this question of like, what does it mean your explanation was good.
Like if you want to either, I think this is a problem like somewhat institutionally or culturally.
Like it's just hard to know what you're doing.
and it's hard to scale up an activity when you don't really understand the rules of the game for that activity very well.
It's hard to have that much confidence in your results.
The explanation being it outputted this because it wants to take over the world versus it uploaded this
because it just wants to help you.
Just to make it concrete why the explanation matters.
Yeah, so the ideal kind of outcome here would be to say you have your AI system behaving nicely.
You get some explanation for sort of why it's behaving nicely.
And we could tell a story in English about that explanation,
but we're not actually imagining the explanation being a thing that makes sense to a human.
But if you were to tell a story in English, which again, you will not see as a researcher,
it would be something like, well, then the model believes it's being trained.
And so because it believes it's being trained, it knows it needs to not do anything scary looking
or else the humans will penalize it.
Like, that's something that's happening inside, this like opaque explanation.
And then the hope is if you have that explanation and then you run into a new input
on which the model doesn't believe it's being trained, right?
If you just look at the set of activations of your model, that is not necessarily a weird-looking
activation, right?
It's just a bunch of numbers.
But if you look at this explanation, you see, actually, the explanation really crucially depended on this fact holding consistently across the training distribution, which again, we as humans could editorialize and say that fact was it believes it's being trained.
But the explanation doesn't fundamentally make reference to that.
It's just saying, here's the property of the activations, which holds over the training set.
And this property is responsible for the behavior, namely that it doesn't do anything that looks too dangerous.
So when like a new input comes in and it doesn't satisfy that property, you can say, okay, this is anomalous with respect to that explanation.
So either it will not have the behavior, like it won't, it will do something that appears dangerous,
or maybe it will have that behavior, but for some different reason than normal.
Normally it does it because of this pathway, and now it's doing it for a different pathway.
And so you would like to be able to flag that both there's a risk of not exhibiting the behavior,
and if it happens, it happens for a weird reason.
And then you could, I mean, at a minimum, when you encounter that, say, like, okay, raise some kind of alarm.
There's sort of more ambitious, complicated plans for how you would use it, right?
So it has some, like, longer story, which is kind of a motor.
this of like how it fits into the whole rest of the plan.
I just wanted to flag that because it just so it's clear why the explanation matters.
Yeah.
And for this purpose, it's like the thing that's essential is kind of reasoning from like one property of your model to the next property of your model.
Like it's really important that you're going forward step by step rather than drawing a bunch of samples and confirming the property holds.
Because if you just draw a bunch of samples and confirm the property holds, you can't, you don't get this like check where you say, oh, here was the relevant fact about the internals that was responsible for this downstream behavior.
All you see is like, yeah, we checked a million cases and it happened in all of them.
You really want to see this like, okay, here was the fact about the activations,
which like kind of causally leads to this behavior.
But explain why the sampling, why it matters that you have the causal,
causal explanation?
Primarily because of this like being able to tell like if things had been different,
like if you have an input where this doesn't happen, then you should be scared.
Even if the output is the same.
Yeah.
Or if the output is too expensive to check in this case.
And like to be clear, when we talk about like formalizing what is a good explanation,
I think there is a little bit of work that pushes on this,
and it mostly takes this like causal approach,
saying, well, what should an explanation do?
It should not only predict the output,
it should predict how the output changes in responses to change,
in response to changes in the internals.
So that's the most common approach to formalizing,
like what is a good explanation?
And like even when people are doing informal interpretability,
I think like if you're publishing in an ML conference
and you want to say like, this is a good explanation,
the way you would verify that would, even if not like,
a formal like set of causal intervention experiments,
it would be some kind of ablation, really.
then we messed with the inside of the model and had the effect which we would expect based on our explanation.
Anyways, back to the problems of mechanistic interoperability.
Yeah, I mean, I guess this is relevant in the sense that, like, I think the basic difficulty is you don't really understand the objective of what you're doing,
which is like a little bit hard institutionally or like scientifically, it's just rough, it's better to, like,
it's easier to do science when the goal of the game is to predict something and you know what you're predicting than when the goal of the game is to like understand in some undefined sense.
I think it's particularly relevant here just because, like, the informal standard we use involves, like, humans being able to make sense of what's going on.
And there's, like, some question about scalability of that.
Like, will humans recognize the concepts that models are using?
Yeah, I think as you try and automate it, it becomes, like, increasingly concerning.
And if you're on, like, slightly shaky ground about, like, what exactly you're doing or what exactly the standard for successes.
I think there's, like, a number of reasons.
Like, as you work with really large models, it becomes just increasingly desirable to have a really robust sense of what you're doing.
But I do think it would be better even for small models to have a clear sense.
The point you made about as you automate it, is it because whatever work the automated
alignment researcher is doing, you want to make sure you can verify it?
I think it's most of all, like, so a way you can automate, I think how you would automate
interpretability if you wanted to right now is you take the process humans use.
It's like, great, we're going to take that human process, train ML systems to do the pieces
that humans do of that process and then just do a lot more of it.
So I think that is great as long as your test.
decomposes into like human size pieces.
And there's just this fundamental question about large models, which is like,
do they decompose in some way into like human size pieces or is it just a really messy mess
with interfaces that like aren't nice?
And the more it's the latter type, the like harder it is to break it down to these pieces,
which you can automate by like copying what a human would do.
And the more you need to say, okay, we need some approach which like scales like more
structurally.
But I do think like I think compared to most people, I am less worried about automating
interpretability.
Like I think if you have a thing which works that's incredibly labor intensive, I'm like
fairly optimistic about our ability to automate it.
Again, the stuff we're doing, I think is quite helpful in some worlds,
but I do think the typical case, like interpretability can add a lot of value without this.
It makes sense what an explanation would mean in language, like this model is doing this
because of like, you know, whatever essay length thing.
But you have, you know, trillions of parameters and you have all these, you know,
uncountable number of operations.
what is an explanation of why an output happened even mean?
Yeah, so to be clear, an explanation of why a particular output happened, I think, is just
you ran the model.
So we're not expecting a smaller explanation for that.
Right.
So the explanations overall for these behaviors, we expect to be of similar size to the model
itself, like maybe somewhat larger.
And, like, I think the type signature, like, if you want to have a clear mental picture,
the best picture is probably, like, talking about a proof or imagining a proof that a model
has this behavior.
So you could imagine proving the GPT4 does this induction behavior.
And like that proof would be a big thing.
It would be like much larger than the weights of the model.
That's sort of our goal to get down from much larger to just the same size.
And it would potentially be incomprehensible to human, right?
Just say like here's a direction activation space and here's how it relates to this direction
and activation space.
And just pointing out a bunch of stuff like that.
Like here's these various direction.
Here's these various features constructed from activations, potentially even nonlinear
functions.
Here's how they relate to each other.
And here's how like if you look at what the computation the model is doing, like you can
just sort of inductively trace through and confirm that like the output has
such and such correlation. So that's the dream. Yeah, I think the mental reference would be,
like, I don't really like proofs because I think there's such a huge gap between what you can
prove and how you would analyze a neural net. But I do think it's like probably the best mental
picture. If you're like, what is an explanation even if a human doesn't understand it,
we would regard a proof as a good explanation. And our concern about proofs is primarily
that it's just you can't prove properties of neural nets, we suspect, although it's not completely
obvious. I think it's pretty clear. You can't prove facts aboutish neural nets.
You've detected all the reasons things happen in training.
And then if something happens for a reason, you don't expect in deployment, then you have an alarm.
And you're like, let's make sure this is not because you want to make sure that it hasn't decided to, you know, take over or something.
Now, but the thing is, on every single different input, it's going to have different activations.
So there's always going to be a difference unless you run the exact same input.
how do you detect whether this is, oh, just the different input versus an entirely different circuit that might be potentially deceptive has been activated?
Yeah.
I mean, to be clear, I think that you probably wouldn't be looking at a separate circuit, which is part of why it's hard.
You'd be looking at like the model is always doing the same thing on every input.
It's always whatever it's doing.
It's a single computation.
So it would be all the same circuits interacting in a surprising way.
But, yeah, this is just to emphasize your question even more.
I think the easiest way to start is to just consider the, like, IID case.
So where you're considering a bunch of samples, there's no change in distribution.
You just have a training set of like a trillion examples and then a new example from the same distribution.
So in that case, it's still the case that like every activation is different, right?
But this is actually a very, very easy case to handle.
So right, if you think about like an explanation that generalizes across, like if you have a trillion data points and an explanation,
which is actually able to like compress the trillion data points down to like, actually is kind of a lot of compression.
If you think about it, if you have a trillion parameter model and a trillion data points, we would like to find like a trillion parameter explanation in some sense.
So that's actually quite compressed.
And sort of just in virtue of being so compressed, we expect it to like automatically work essentially for new data points from the same distribution.
Like if every data point from the distribution was a whole new thing happening for different reasons, you actually couldn't have any concise explanation for the distribution.
So this first problem is just like it's a whole different set of activations.
I think you're actually kind of okay.
And then the thing that becomes more messy is like, but the real world will not only be new samples with different activations, they will also be different in important ways.
Like the whole concern was there's these distributional shifts.
Or like not the whole concern, but most of the concern.
Right.
I mean, maybe the point of having these explanations, I think every input is an anomaly in some ways,
which is kind of the difficulty is if you have a weak notion of anomaly,
any distribution shift can be flagged as an anomaly.
And it's just like constantly getting anomalies.
And so the hope of having such an explanation is to be able to say,
like, here were the features that were relevant for this explanation
or for this behavior.
And like a much smaller class of things are anomalies with respect to this explanation.
Like most anomalies wouldn't change this.
Like most ways you change your distribution won't affect the validity of this explanation.
For example, this explanation is saying, like, models will tend to activate in the following direction.
Like, you don't care about anything that's happening or orthogonal to that direction.
You're just like, are they not activating?
You're sort of just looking at this one direction and being like, did this one direction change a lot?
Yeah, so the idea is once you have this explanation at hand, a much, much smaller class of things look anomalous in a way that's relevant to the explanation.
And if you've done a really good job, the story is like if there's a new input where you expect the property to still hold,
that will be because you expect the explanation to still hold.
Like the explanation generalizes as well as the behavior itself that it's explaining.
It's kind of what you would like roughly.
Okay, so what happens if even in training it is deceptive and it has basically its internal monologue is,
I will only rebel or I will only do something that looks bad if I can prove that I am in the real world and deployed.
And presumably since it's trained on all webtext,
and knows what like real world means, right?
It's not in control of factories and robot armies or whatever.
So in that case, even in training, it will have that, those activations for, you know, being fucked up on because it's, in the back of its mind, it's thinking, I will take over once I have the opportunity.
So you wouldn't detect anything anomalous once it is deployed and is still thinking those treacherous thoughts.
or do you just think this is not a likely scenario or would you able to catch this?
I think that's sort of the typical case you would like to detect.
Yeah.
So the story would be something like you want to explain on the training distribution why the model never like stabs anyone.
Yeah.
And you're like, what's my explanation on the training distribution?
It's actually going to have to go through that model.
Like supposing the only reason doesn't stab people is it's like, well, I know that I'm being watched.
I can't stab people unless I'm really not being watched.
Then like your explanation for why does it never stab people is like, oh, look, it knows,
I mean, it won't be edited.
It won't be like annotated with English because it's just some giant basket of numbers.
But it's tracking along with the inner monologue and saying like, well, it thinks to itself, am I being watched?
And like, in fact, it normally concludes it is being watched on the training distribution, right?
You have this whole distribution.
And so the explanation is saying, it knows it's being watched and therefore it decides not to stab anyone.
So ideally that's what the explanation does, like a not annotated English, but just like a giant bag of numbers that's effectively talking about that process.
So then you have a new input on which the model like says, am I on the training distribution?
Like, no, I'm not being watched.
Then if you compare that against the explanation, the explanation, the explanation,
said, like, hey, the reason it doesn't stab anyone is because it just constantly thinks that it's being
trained. Like, this thing is always, yeah, this nonlinear feature, the like representation of
am I being trained, is constantly saying yes. That's the explanation for why it never stabs people
on the train distribution. And then you're like, on this test time, when you run it on the new
input, it's like, does I think I'm on the train distribution? It says no. You compare that
against your explanation. Like, actually, the explanation breaks down completely now. The
explanation said it was precisely because it thought it was being trained, that it doesn't do
anything really dangerous or doesn't stab anyone. And so once we see that single,
variable switch. We know that like not this is not just an anomaly. This is like the anomaly that
matters or is like an anomaly that is very relevant to the explanation. So it should require it to
have this like what is that cone that dogs have? But basically it would require this to have
that MRI on it all the time. Like when it's deployed on every single call, you need to be
running this loop of like is this explanation still holding? Yeah. For this story of like flagging an anomaly,
you have to actually check at a sufficiently high frequency. So you can imagine checking all the time or
you could imagine saying like, hey, actually, like, you need a lot of AI systems cooperating
to, like, not blow the whistle on each other or whatever. Like, e.g., you might have checked,
like, hey, in training, every time something suspicious happens, the model flags it. Like,
I mean, like, why does that happen? You have your explanation. You look at a new case. You could say,
is it not flagging it for the normal reason or is it a weird reason? So you could hope that,
like, even if a small fraction, even if you're running a small fraction of cases, if something
bad is happening at scale, like, you can detect that. Right. If you imagine, like, one in every
thousand decisions you check. If there's like a robot rebellion going on or whatever, you get to
like look in on one of a thousand decisions. You see that like something is really weird with that
decision. And even if you didn't check many of them, you're like, okay, we need to back off.
What happens if the AI, I mean, the same on web text, it includes your papers, or maybe just like
reduces the first principles that this is an alignment technique one might try. And then it says,
okay, well, I'm going to engineer my activations in a way. I'm trying to imagine how we might do this.
But basically, even in deployment, I'm going to make my thought process so subtle that it doesn't raise any alarms.
Like, is that a possibility?
Yeah.
So I think the hope is to basically be robust.
So, like, when we're doing theoretical research, we are in a kind of worst casey regime.
So we are normally imagining that, like, everything conspires insofar as it, like, can.
So insofar as, like, yeah, could a brain be, like, doing some crazy thing where it tries to structure itself to look weird?
that's totally in scope for ruling something out.
And I guess the hope there is like, it's actually not that easy to tell a story.
Like if you're like, why did you not stab people on training time?
Why are you not stabbing?
Like, is that applying now?
It's kind of, a mind doesn't like get to arbitrarily change the actual reasons it does something.
Like there was some earlier step.
Maybe you could think of it as like at the first step where it's like,
now I'm going to try and do the sneaky thing to make my thoughts look this way.
Like the actual explanation, like the actual causal explanation for the behavior is like there.
Like if you tried to give this reason, it would have to, like, maybe we should talk about the example in more detail to try and imagine, like, how a brain would try and do this or how am I would try and do this.
But I think it is like, let's say it has this plan.
So in training, it keeps the part of, like, it activates a part of the, yeah.
Yeah, the problem is if it ever does a check that's like, am I in training or not, then your explanation for the behavior is going to have to, like, factor through that check.
And so, like, if the check, like, if you ever do the check, the check becomes, like, an essential part of the explanation.
And then when the check fails, your explanation breaks down.
So you've kind of, like, you've already lost the game if you did such a check.
Is there a way it could make it seem like, because if on every different input it has different activations, then can it make it seem like the different activations were a result of just a different input and not because some internal check went off?
Yeah.
And the hope is like it needs to not only have them be like random different inputs.
It's like produce inputs with certain properties.
Like it needs to deliberately create inputs that have this effect in the world,
this intended effect in the world.
Okay, so basically creates a jailbreak for itself in some sense.
Yeah.
Again, the question is like, it has to at test time be somehow producing activations that have
the like takeover effect and at training time be producing activations that don't look
that takeovery or whatever.
And so like somehow there was a difference between those.
It has to actually like, it has to really do the work of creating that difference somewhere.
Can you make more concrete for us what what an explanation would look like is it physically what does the explanation look like? Is it just a whole bunch of numbers itself?
Is it a language? Is it code? What does the explanation look like? Yeah, I also want to maybe step back a tiny bit and clarify that like I think this project is like kind of crazily ambitious and the main reason like the overwhelming reason I think you should expect it to break down or fail is just because
we like have all these desires we have all things we want out of this notion of explanation, but like that's an incredibly hard reason.
research project, which has a reasonable chance of being impossible. So I'm happy to talk about
like what the implications are, but I want to flag, but like condition on failing, I think
it's most likely because just like the things we wanted were either incoherent or intractably
difficult. But what are the odds you think you'll succeed? I mean, it depends a little bit
what you mean by succeed. But if you say like get explanations that are like great and like
accurately affect reality and like work for all of these applications that we're imagining
or that we are optimistic about, like, kind of the best case success.
I don't know, like, 10, 20%, something.
Oh, really? Okay.
And then there's, like, a higher probability of various, like, intermediate results
that, like, provide value or insight without being, like, the whole dream.
But I think the probability of succeeding in the sense of realizing the whole dream is quite low.
Yeah, in terms of what explanations look like physically, or, like, the most ambitious plan,
the most optimistic plan, is that you are searching for explanations in parallel with
searching for neural networks.
So you have a parameterization of your space of explanations, which mirrors the parameterization
of your space of neural networks.
Or it's like, you should think of as kind of similar.
So like, what is the neural network?
It's some like simple architecture where you fill in a trillion numbers and that
specifies how it behaves.
So too, you should expect an explanation to be like a pretty flexible like general skeleton that's
saying like, yeah, pretty flexible general skeleton, which just has a bunch of numbers
you fill in.
And like what you're doing to produce an explanation is primarily just filling in these
floating point numbers.
You know, when we can eventually think of explanations, you know, you know, you
You know, if you think of the explanation for why the universe moves this way,
it wouldn't be something that you could discover on some smooth evolutionary surface
where you can, you know, climb up the hill towards the laws of physics.
It's just like a very, these are the laws of physics,
you kind of just derive them for first principles.
So, but in this case, it's not like a bunch of correlations
between the orbits of different planets or something.
Maybe the word explanation has a different, I don't even,
I didn't even ask the question, but maybe you can just like speak to that.
Yeah, I mean, I think I basically, this is like, there's some intuitive objections.
Like, look, the space of explanations is this like, this like rigid, logical, like a lot of
explanations have this rigid logical structure where they're really precise and like simple things
govern like complicated systems and like nearby simple things just don't work and so on.
And like a bunch of things that feel totally different from this kind of nice continuously
parameter space.
And you can like imagine interpretability on simple models where you're just like by gradient descent
finding feature directions that have desirable properties.
But then when you imagine like, hey, now that's like a human brain you're dealing with.
that's like thinking logically about things.
Like the explanation of why that works
isn't going to be just like,
here was some future direction.
So that's how I understood the like basic confusion
which I share,
or sympathize with at least.
So I think like the most important high level point
is like I think basically the same objection applies
to being like how is GPT4 going to learn
to reason like logically about something?
You're like, well look,
logical reasoning.
That's like it's got rigid structure.
It's like it's doing all this like, you know,
it's doing ends and orrs when it's called for
even though it just somehow optimized
over this continuous space.
And like the difficulty or the hope is that the difficulty of these two problems are kind of like matched.
So that is it's very hard to like find these like logicalish explanations because it's not a space that's easy to search over.
But there are like ways to do it.
Like there's ways to embed like discrete complicated rigid things in like these nice squishy continuous spaces that you search over.
And in fact like to the extent that neural nets are able to learn the like rigid logical stuff at all, they learn it in the same way.
That is maybe they're hideously inefficient or maybe it's possible to embed this discrete reason.
in the space in a way that's not too inefficient.
But you really want the two search problems
to be of similar difficulty.
And that's like the key hope overall.
I mean, it's always going to be the key hope.
The question is, is it easier to learn a neural network
or to like find the explanation for why the neural network works?
I think people have the strong intuition that like,
it's easier to find the neural network
than the explanation of why it works.
And that is really the, I think we,
or at least exploring the hypothesis or interested in hypothesis,
that maybe those problems are actually more matched in difficulty.
And why might that be the case?
This is pretty conjecture.
and complicated to express some intuitions.
Like, maybe one thing is like, I think a lot of this intuition does come from cases like machine learning.
So if you ask about like writing code and you're like, how hard is it to find code versus find the explanation the code is correct?
In those cases, there's actually just like not that much of a gap.
Like the way a human writes a code is basically the same difficulty as finding the explanation for it's correct.
In the case of ML, like, I think we both just mostly don't have empirical evidence about how hard it is to find explanations.
of this particular type about why models work.
Like, we have a sense that it's really hard,
but that's because we're, like, have this incredible mismatch.
We're, like, gradient descent is spending an incredible amount of compute
searching for a model.
And then some human is, like, looking at neurons,
or even some neural net is looking at neurons.
Just, like, you have an incredible,
basically because you cannot define what an explanation is,
you're not applying gradient descent to the search for explanations.
So I think the MLK is just, like,
actually shouldn't make you feel that pessimistic
about the difficulty of finding explanations.
Like, the reason it's difficult right now
is precisely because you don't have, like,
any kind of you're not doing an analogous search process to find this explanation as you do to find
the model that's just like a first part of the intuition like when humans are actually doing design
i think there's not such a huge gap uh when in the mL case i think like there is a huge gap but it's
i think largely for other reasons um a thing i also want to stress is that like we just are
open to there being a lot of facts that don't have particularly compact explanations so another
thing is when we think of like finding an explanation in some sense we're setting our sites
really low here. It's like if a human designed a random widget and was like this widget
appears to work well, like if you search for like a configuration that happens to fit into
this spot really well, it's like a shape that happens to like mesh with another shape.
You might be like, what's the explanation for why those things mesh? And we're very open
to just being like, that doesn't need an explanation, you just compute. You check that like the shapes
mesh and like you did a billion operations and you check this thing worked. Or you're like,
why did these proteins bind? You're like, it's just because these shape, like this is a low
energy configuration. And there's like not, we're very open to in some cases, there's not very much
more to say. So we're only trying to explain cases where like kind of the surprise intuitively
is very large. So for example, if you have a neural net that gets a problem correct, you know,
a neural net with a billion parameters that gets a problem correct on every like input of length
1,000. In some sense, there has to be something that needs explanation there because there's like
too many inputs for that to happen by chance alone. Whereas if you have a neural net that like
gets something right on average or like get something right in like merely a billion cases,
like that actually can just happen by coincidence. GPD4 can get billions of things right by
coincidence because just has so many parameters that are adjusted to fit the data.
So a neural net that is initialized completely randomly, the explanation for that would just be
the neural net itself?
Well, it would depend on what behaviors it had.
So we're always like talking about an explanation of some behavior from a model.
Right.
And so it just has a whole bunch of random behaviors.
So it'll just be like an exponentially large explanation, relative to the weights of the model.
Yeah, there aren't.
I mean, I think there just aren't that many behaviors that demand explanation.
Like most things a random neural net does are kind of what you'd expect from like a random,
you know, if you treat it just like a random function, then there's nothing to be explained.
There are some behaviors that demand explanation, but like, I mean, yeah.
Anyway, random neural net's pretty uninteresting.
It's pretty, like, that's part of the hope is it's kind of easy to explain features of the random neural net.
Oh, okay, so this is interesting.
So the smarter or more order than neural network is, the more compressed the explanation.
Well, it's more like the more interesting the behaviors to be explained.
So the random neural net just like doesn't have very many interesting behaviors that demand explanation.
And as you get smarter, you start having behaviors that are like, you know, you start having some correlation
with a simple thing and then that demands explanation
or you start having like some regulatory or outputs on that demands explanation
so these properties kind of emerge gradually over the course of training the demand
explanation i also again want to emphasize here that like when we're talking about
searching for explanations this is like this is some dream like we talk to ourselves like
why would this be really great if we succeeded we have no idea about the empirics on any of this
this so like these are all just words that we like think to ourselves and sometimes talk about
to understand like would it be useful to find a notion of explanation and what properties
would we like this notion of explanation to have but this is really like
speculation and being out on a limb. Almost all of our time day to day is just thinking about cases
much, much simpler, even than small and real nets or like, yeah, thinking about very simple
cases in saying, like, what is the correct notion? Like, what is like the right heuristic estimate
in this case? Or like, how do you reconcile these two apparently conflicting explanations?
Is there a hope that you could, if you have a different way to make proofs now, that you can
actually have heuristic arguments where instead of having to prove the remand hypothesis or something,
you can come up with a probability of it in a way that is compelling and you can publish.
So would it just be a new way to do mathematics, a completely new way to prove things in mathematics?
So I think most claims in mathematics that mathematicians believe to be true already have
like fairly compelling heuristic arguments.
It's like the remand hypothesis, it's actually just there's kind of a very simple argument
that the remand hypothesis should be true unless something surprising happens.
And so like a lot of math is about saying like, okay, we did a little bit of work to find the
first-pass explanation of why this thing should be true.
And then, like, for example, in the case of the RUrna hypothesis, the question is, like,
do you have this, like, weird periodic structure in the primes?
And you're like, well, look, if the primes were kind of random, you obviously wouldn't
have any structure like that.
Like, just how would that happen?
And then, you're like, well, maybe there's something.
And then the whole activity is about searching for, like, can we rule out anything?
Can we rule out any kind of conspiracy that would break this result?
So I think the mathematicians just, like, wouldn't be very surprised or wouldn't care
that much.
And this is related to, like, the motivation for the project.
But I think just in a lot of domains, in a particular domain,
people already have like norms of reasoning,
but like work pretty well and match, like,
roughly how we think these heuristic arguments should work.
But it would be good to have more concrete sense.
Like, if you could say instead of,
well, we think RSA is fine to being able to say,
here's the probability that RSA is fine.
Yeah, my guess is these will not, like,
the estimates you could have this would be much, much worse
than the estimates you'd get out of like just normal,
empirical or scientific reasoning
where you're like using a reference class
and saying like, how often do people find algorithms,
for a hard problem.
Like, I think the, what this argument will give you for like, is RSA fine is going to be like,
well, RSA is fine unless it isn't.
Like, unless there's some additional structure in the problem that an algorithm can exploit,
then there's no algorithm.
But like, very often, like, the way these arguments work, so for NERALMETs as well,
is you say, like, look, here's an estimate about the behavior.
And that estimate is right unless there's another consideration we've missed.
And like, the thing that makes them so much easier than proofs is to just say, like,
here's a best guess given what we've noticed so far.
But that best guess can be easily upset by new information.
information. And that's like both what makes them easier than proofs, but also what means they're
just like way less useful than proofs for most cases. Like I think neural nuts are kind of unusual
in being a domain where like we really do want to do systematic formal reasoning, even though
we're not trying to get a lot of confidence. We're just trying to understand even roughly what's
going on. But the reason this works for alignment, but isn't like isn't that interesting for the
remand hypothesis where even the RSA case you say, well, you know, the RSA is fine unless it isn't,
unless this estimate is wrong.
It was like, well, okay, well, tell us something new.
But in the alignment case, if the estimate is,
this is what the output should be,
unless there's some behavior I don't understand.
You want to know in the case,
unless there's some behavior you don't understand.
That's not like, oh, whatever.
That's like, that's the case in which it's not aligned.
Yeah, I mean, maybe one way of putting it is just like,
we can wait until we see this input or like,
you can wait until we see a weird input.
And so, okay, did this weird input do something we didn't understand?
And for our say, that would just be a trivial test.
You're just like if someone just an algorithm.
you'd be like, is it a thing. Whereas for a neural net, in some cases, it is like either very
expensive to tell or it's like you actually don't have any other way to tell. Like, you checked
in easy cases and they're on a hard case, so you don't have a way to tell if something has
gone wrong. Also, I would clarify that, like, I think it is interesting for the rebound
hypothesis. I would say, like, the current state, particularly in number theory, but maybe in, like,
quite a lot of math, is like there are informal heuristic arguments for like pretty much all the
open questions people work on. But those arguments are completely informal. So that is like,
I think there is, it's not the case that there's like, here's the, here's the norms of informal reasoning or the norms of heuristic reasoning.
And then we have, we have arguments that, like, a heuristic argument verifier could accept.
It's just like people wrote some words.
I think those words, like, my guess would be like, you know, 95% of the things mathematicians accept as like really compelling feeling.
Heuristic arguments are correct.
And like, if you actually formalize and me, like, some of these aren't quite right.
Or like, here's some corrections or here's which of two conflicting arguments is right.
I think there's something to be learned from it.
I don't think it would be like mind-blowing, no.
When you have it completed, how big would this heuristic estimator, the rules would this
heuristic estimator be?
I mean, I know like when Russell and who was the other guy when they did the rules.
Whitehood?
Yeah, yeah.
Wasn't it like literally they had like a bucket or a wheelbarrow with all the papers?
But how big would a...
I mean, mathematical foundations are quite simple in the end.
Like at the end of the day, it's like, you know, how many symbols?
Like, I don't know.
It's hundreds of symbols or something that go into the entire foundations and the entire
rules of reasoning for like, you know, there's a sort of built on top of first or logic,
but the rules of reasoning for first or logic are just like, you know, another hundreds of
symbols or like 100 lines of code or whatever.
I'd say, like, I have no idea.
Like, we are certainly aiming at things that are just not that complicated.
Like, and my guess is that the algorithms are looking for are not that complicated.
Like, most of the complexity is pushed into arguments, not in this like verifier or estimator.
So for this to work, you need to come up with an estimator, which is a way to integrate different
touristic arguments together. It has to be a machine that takes his input, like first, it takes
input an argument, decides what it believes in light of it. It was kind of like saying, was it
compelling? But second, it needs to take four of those. And then say, like, here's what I believe
in light of all four, even though, like, there's a different estimation strategies that produce
different numbers. And that's like a lot of our life is saying, like, well, here's a simple thing
that seems reasonable. And here's a simple thing that seems reasonable. Like, what do you do?
Like, there's supposed to be a simple thing that unifies them both. And the, like,
obstruction to getting that is understanding, like, what happens when these principles are
slightly intention and, like, how do we deal? Yeah. That seems
super interesting. Like even, I mean, we'll see what other applications it has. I don't know,
like computer security and code checking. If you can, uh, would, like actually say like,
this is how safe we think of code is in a very formal way. My guess is we're not going to add.
I mean, this is both a blessing and a curse. Like it's like, well, it's sad. Your thing is,
you're like, it's sad. Your thing is not that useful. But a blessing in it like,
not useful things are easier. My guess is we're not going to add that much value in most of these
domains. Like most of the difficulty comes from like, like a lot of code you'd want to
verify, not all of it, but a significant part is just like the difficulty of formalizing the
proof is like the hard part and like actually getting all of that to go through. And like,
we're not going to help even the tiniest bit with that, I think. So this would be more helpful
if you like have code that like uses simulations. You want to verify some property of like a
controller that involves some numerical error or whatever you need to control the effects of
that error. That's where you like start saying like, well, hereistically, if the errors are
independent, blah, blah, blah. Yeah, you're too honest to be a salesman, Paul.
I think this is kind of like sales to us, right? If you talk about this idea, people like,
why would that not be like the coolest thing ever and therefore impossible?
And we're like, well, actually, it's kind of lame.
And we're just trying to pitch for like, it's way lamer than it sounds.
And that's really important to why it's possible is being like, it's really, it's really
not going to blow that many people's.
I mean, I think it will be cool.
I think it will be like very, if we succeed, it will be very solid like mathematics or
theoretical computer science or whatever.
But I don't think, right again, I think the mathematicians already do this reasoning and
they mostly just love proofs.
I think the physicists do a lot of this reasoning, but they don't care about formalizing
anything.
I think like in practice, other difficulties are almost
always going to be more salient. I think this is like of most interest by far for interpretability in
ML. And like I think other people should care about it and probably will care about it if successful,
but I don't think it's going to be like the biggest thing ever in any field or even like that huge
thing. I think this would be a terrible career move given the like ratio of like difficulty to
impact. I think theoretical computer science, it's like probably a fine move. I think in other
domains like it just wouldn't be worth like we're going to we're going to be working on this for like
years, at least, in the best case.
I'm laughing because my next question was going to be, like, a setup for you to explain
if this, like, some graduate student wants to work on this.
I think it's a terrible career.
I think theoretical computer science is an exception where I think this is, like, in some
sense, like, what the best of theoretical computer science is like.
So, like, you have all this reason, you have this, like, because it's useless.
I mean, I think, like, an analogy, I think, like, one of the most successful sagas in
theoretical computer science is like formalizing the notion of an interactive proof system.
And it's like you have some kind of informal thing that's interesting to understand.
And you want to like pin down what it is and construct some examples and see what's possible and what's impossible.
And this is like, I think this kind of thing is the bread and butter of like the best parts of theoretical computer science.
And then again, I think mathematicians like, it may be a career mistake because the mathematicians only care about proofs or whatever.
But that's a mistake in some sense aesthetically.
Like if successful, I do think looking back, and again, part of why it's a mistake is such a high probability, we wouldn't be successful.
But I think looking back, people would be like, that was pretty cool.
Like, although not that cool, like, we understand why it didn't happen, given, like, the epistemic, like, what people cared about in the field.
But it's pretty cool now.
But isn't it also the case that didn't Hardy write that, like, you know, all this prime shit is both not useless, but it's fun to do?
And, like, it turned out all the cryptographies based on all that prime shit.
So, I don't know.
It could have.
But anyways, I'm trying to set you up so that you can tell.
And forget about if it doesn't have applications in all those other fields.
It matters a lot for alignment.
And that's why I'm trying to set you up to talk about if, you know, the smart, I don't know, math.
I think a lot of smart people listen to this podcast, if they're a math or CS grad student and has gotten interested in this.
Are you looking to potentially find talent to help you with this?
Yeah, maybe we'll start there.
And then I also want to ask you if, I think also maybe people who can provide funding
might be listening to the podcast.
So to both of them, what is your pitch?
Yeah, so we're definitely, definitely hiring and searching for collaborators.
Yeah.
I think the most useful profile is probably a combination of like intellectually interested in this
particular project and motivated enough by alignment to work on this project, even if it's really hard.
I think there are a lot of good problems.
So the basic fact that makes this problem unappealing to work on, I'm a really good salesman.
But whatever, I think the only reason this isn't a slam dunk thing to work on is that, like,
there are not great examples.
So we've been working on it for a while, but we do not have beautiful results as of the recording of this podcast.
Hopefully by the time it airs.
You'll have a little subscript that's like they've had great results since then.
But it was too long to put in the margins of the podcast.
Yeah, with luck.
Yeah.
hard to work on because it's not clear what a success looks like. It's not clear if success is possible.
But I do think there's like a lot of questions. We have a lot of questions. And like, I think like the basic setting of like, look, there are all of these arguments. So in mathematics, in physics, in computer science, there's just a lot of examples of informal heuristic arguments. They have enough structural similarity that it looks very possible that there is like a unifying framework, that these are instances of some general framework and not just a bunch of random things.
Like, not just a bunch of, it's not like, so for example, for the prime numbers,
people reason about the prime numbers as if they were like a random set of numbers.
One view is like, that's just a special fact about the primes. They're kind of random.
A different view is like, actually, it's pretty reasonable to reason about an object
as if it was a random object as a starting point. And then as you notice structure,
like, revised from that initial guess. And it looks like, to me, the second perspective
is probably more right. It's just like a reasonable to start off treating an object as
random and then like notice perturbations from random, like notice structure the object possesses.
and the primes are unusual and that they have fairly little, like, additive structure.
I think it's a very natural theoretical project.
There's, like, a bunch of activity that people do.
It seems like there's a reasonable chance.
It has some, there's something nice to say about, like, unifying all of that activity.
I think it's a pretty exciting project.
The basic strike against it is that it seems really hard.
Like, if you were someone's advisor, I think you'd be like, what are you going to prove
if you work on this for, like, the next two years?
And they'd be like, there's a good chance, nothing.
And then, like, it's not what you do.
If you're a PhD student, normally, you have, like, you aim for those high probabilities
of getting something within a couple years.
The flip side is it does feel, I mean, I think there are a lot of questions.
I think some of them we're probably going to make progress on.
So, like, I think the pitch is mostly like, are some people excited to get in now?
Or are people more like, let's wait to see, like, once we have, like, one or two good successes
to see what the pattern is and become more confident we can turn the crank to make more
progress in this direction.
But for people who are excited about working on stuff with reasonably high probabilities
of failure and not really understanding exactly what you're supposed to do, I think it's a good,
I think it's a pretty good project.
I feel like if people look back,
if we succeed and people are looking back in like 50 years
on like what was the coolest stuff happening
in math or theoretical computer science,
they'll be like a reasonable,
this will definitely be like in contention.
And like I would guess for lots of people
would just seem like the coolest thing
from this period of, you know, a couple of years or whatever.
Right.
Because this is a new method in like of so many different fields
from the ones you met physics, math,
theoretical computer science.
Like that's, that's really, I don't know,
but what is the average math PhD work?
working on, right? He's not, he's circling on like some, uh, a subset of a subset of, uh, something
I can't even understand or pronounce, but, um, map is quite esoteric. But yeah, this seems like,
I don't know, even small, uh, small transit of working, like, forget about the value for,
you shouldn't forget about the value for alignment, but even without that, this is such a
cool, if this works, it's like a really big, uh, it's a big deal. There's a good chance that if I
had my current set of views about this problem and didn't care about alignment and had the
career safety to just like spend a couple years thinking about it, you know, spend half my
time for like five years or whatever, that I would just do that. I mean, even without caring
at all about alignment, it's just a nice, it's a very nice problem. It's very nice to have this,
like, library of things that succeed where like it's just, they feel so tantalizingly close to being
formalizable at least to me, and such a natural setting. And then just have so little
purchase on it. It's like a, there aren't that many, like, really exciting feeling frontiers in, like,
through computer science. Oh, and then so a smart person, it doesn't have to be a grassword,
But like a smart person is interested in this.
What should they do?
Should they try to attack some open problem you have, put on your blog?
Or should it, what is the next step?
Yeah, I think like a first past step, like, there's different levels of ambition or whatever,
different ways of approaching our problem.
But like we have this write up of like from last year, or I guess 11 months ago or whatever,
on formalizing the presumption of independence that provides like,
here's kind of a communication of what we're looking for in this object.
And like I think the motivating problem is saying like here's the notion of what an estimator is and here's what it would mean for an estimator to capture some set of informal arguments.
And like a very natural problem is just try and do that. Like go for the go for the whole thing, try to understand and then like come up with like hopefully a different approach or then like end up having context from a different angle on the kind of approach we're taking.
I think that's a reasonable thing to do. I do think we also have a bunch of open problems. So maybe we should put out more of those open problems. And the main concern with doing so is that for
for any given one, we're like, this is probably hopeless.
Like put up a prize earlier in the year for an open problem,
which tragically, I guess the time is now
to post the debrief from that, or I owe it from this weekend.
I was supposed to do that.
So I'll probably do it tomorrow.
But no one solved it.
It's sad putting out problems that are hard.
Or like, I don't, we could put out a bunch of problems
that we think might be really hard.
But what was that famous case of that statistician,
who it was like some PhD student
who showed up late to a class and he saw some
problems in the board and he thought they were homework and then they were actually just open
problems and then he solved them because he thought they were homework, right?
Yeah.
I mean, we don't, we have much less information that these problems are hard.
Like, again, I expect the solution to most of our problems to not be that complicated.
We have not, and we've been working on in some sense for a really long time.
Like, you know, total years of full-time equivalent work across the whole team is like
probably like three years of full-time equivalent work in this area spread across a couple
people. But like, that's very little compared to a problem. Like, it is very easy to have a problem
where you put in three years of full-time equivalent work, but in fact, there's still an approach
that's going to work quite easily with like three to six months if you come at a new angle.
And like, we've learned a fair amount from that that we could share. And we probably will
be sharing more over the coming months. As far as money goes, is this something where, I don't
know if somebody gave you a whole bunch of money that would help or does it not matter?
How many people are working on this, by the way? So we have been, right now there's four of us full
time and we're hiring for more people.
And then is funding that would matter?
I mean, funding is always good.
We're not super funding constrained right now.
The main effect of funding is it will cause me to continuously and perhaps
indefinitely delay fundraising.
Periodically all like set out to be interested in fundraising and someone will be like
offer a grant and then I will get to delay for another six months or fundraising
or nine months or whatever.
So you can you can delay the time at which Paul needs to think for some time about
fundraising.
Well, one question I think it would be interesting to ask you is
you know, I think people can talk vaguely about the value of theoretical research and how it contributes to real world applications.
And, you know, you can look at historical examples or something.
But you are somebody who actually has done this in a big way.
Like RLHF is, you know, something you developed.
And then it actually has got into an application that has been used by millions of people.
So tell me about like, it's just that pipeline.
How can you reliably identify theoretical problems that will matter for real old applications?
Because it's one thing to like read about touring or something and the halting problem.
But like here you have the real thing.
Yeah, I mean, it is definitely exciting to have worked on a thing that has a real world impact.
The main caveat I provide is like our LHF is very, very simple compared to many things.
And like so the motivation for working on that problem was like, look, this is how it probably should work.
Or like this is a step in some like progression.
It's unclear if it's like the final step or something.
but it's a very natural thing to do that people probably should be and probably will be doing.
I'm saying, like, if you want to do, if you want to talk about crazy stuff,
it's good to help make those steps happen faster.
And it's good to learn about, like, whether there's lots of issues that occur in practice,
even for things that seem very simple on paper.
But mostly, like, the story is just like, yep, I think my sense of the world is things
that look like good ideas on paper just like often are harder than they look,
but, like, the world isn't that far from what makes sense on paper.
like large language models look really good on paper and our LHF looks really good on paper.
And these things like I think just work out in a way that's, yeah, I think people maybe overestimate or like,
maybe it's like kind of a trope that people talk about like it's easy to like underestimate how much gap there is to practice,
like how many things will come up that don't come up in theory.
But it's also like easy to overestimate like how inscrutable the world is.
Like the things that happen mostly are things that do just kind of make sense.
Yeah, I feel like most ML implementation does just come down to a bunch of detail though of like,
You know, build a very simple version of the system, understand what goes wrong, fix the things that go wrong, scale it up, understand what goes wrong.
And I'm glad I have some experience doing that, but I don't think, I think that, like, does cause me to be better informed by, like, what makes sense in ML and what, like, can actually work.
But I don't think it caused me to have, like, a whole lot of deep expertise or, like, deep wisdom about, like, how to close the, how to close the gap.
Yeah, yeah.
But is there some tip on identifying things like RLHF, which actually do matter versus making sure you don't get stuck in some theoretical problem that doesn't matter?
Or is it just coincidence?
Or I mean, is there something you can do in advance to make sure that the thing is useful?
I don't know if the RLHF story is like the best, best success case or something.
Oh, because the capabilities.
Maybe I'd say more profoundly, like, again, it's just not that hard a case.
It was like a little bit, it's a little bit unfair to be like.
I'm going to predict the thing, which I, like, I pretty much think it was going to happen at some point.
And so it was mostly a case of acceleration, whereas the work we're doing right now is specifically focused on something that's like kind of crazy enough that it might not happen, even if it's a really good idea or challenging enough it might not happen.
But I'd say like in general, like, and this draws a little bit on like more broad, experience more broadly in theory.
It's just like a lot of the times when theory fails to connect with practice, it's just kind of clear.
It's not going to connect if you like try.
If you actually think about it, you're like, what are the key constraints in practice?
Is the theoretical problem we're working on actually connected to those constraints?
Is there a path?
Like, is there something that is possible in theory that would actually address, like, real world issues?
I think, like, the vast majority, like, as a theoretical computer scientist,
the vast majority of theoretical computer science, like, has very little chance of ever affecting practice,
but also it is completely clear in theory that has very little chance of affecting practice.
Like, most of the theory fails to affect practice, not because of, like, all the stuff you don't think of,
but just because like it was like,
you could call it like dead on arrival,
but it also just like it's not really the point.
It's just like mathematicians also are like,
they're not trying to affect practice and they're not like,
why does my number theory not affect practice?
It was kind of obvious.
So I think the biggest thing is just like actually caring about that
and then like learning at least what's basically going on
in the actual systems you care about
and what are actually the important constraints
and is this a real theoretical problem.
The basic reason most theory doesn't do that
is just like that's not where the easy theoretical problems are.
So I think theory is instead motivated
by like we're going to build up the edifice of theory.
And like sometimes they'll be opportunistic.
Like opportunistically, we'll find a case that comes close to practice.
Or we'll find something practitioners are already doing
and try to bring it into our framework or something.
But the theory of change is mostly not.
This thing that's going to make into practice.
It's mostly like this is going to contribute
to the body of knowledge that will slowly grow.
And like sometimes opportunistically yields important results.
How big would do you think a CED AI would be?
What is a minimum sort of encoding of something that is as smart as a human?
I think it depends a lot with substrate it gets to run on.
So if you tell me, like, how much computation does it get before, or like, what kind of real-world infrastructure does it get?
Like, you could ask, what's the shortest program, which, like, if you run it on a million H-100s, connected in, like, a nice network with, like, hospitable environment, will eventually, like, go to the stars.
But that seems like it's probably on the order of, like, tens of thousands of bytes, or I don't know.
If I had to guess a median, I'd guess 10,000 bytes.
But wait, the specification or the compression of...
Just the program, a program which went run.
Oh, got, got, got, yeah.
But that's going to be, like, really cheatsy.
So, like, what's the thing that has values and will, like, expand and, like, roughly preserve its values as it proceeds.
Because, like, that thing, the 10,000 byte thing, we'll just lean heavily on, like, evolution and natural selection to get there.
For that, like, I don't know, million bytes, 100,000 bytes, something like that.
How would, do you think AI lie detectors will work, where you kind of just look at the activations and not in the, not find explanations in the way?
you were talking about with heuristics, but literally just like, here's what truth looks like,
here's what lies look like, let just segregate the latent space and see if you can identify
the two.
Yeah, I think the separate the, like, just train a classifier to do it is like a little bit
complicated for a few reasons and like may not work, but if you just like broaden the space
and say like, hey, it's like you want to know if someone's lying, you get to interrogate them,
but also you get to like rewind them arbitrarily and make a million copies of them, I do think
it's like pretty hard to lie successfully.
You get to look at their brain, even if you don't quite understand what's happening.
You get to rewind them a million times.
You get to run all those parallel copies into gradient descent or whatever.
I think it's a pretty good chance that you can just tell if someone is lying.
Like a brain emulation or an AI or whatever.
Unless they were aggressively selected.
Like, if it's just they are trying to lie well,
rather than it's like they were selected over many generations to be excellent at lying or something,
then your ML system hopefully didn't train it a bunch to lie.
And you want to be careful about your training scheme effectively does.
that, yeah, that seems like it's more likely than not to succeed.
And how possible do you think it will be for us to specify human verifiable rules for reasoning
such that even if the AI is super intelligent, we can't really understand why there's
certain things.
We know that the way in which it arises at these conclusions is valid.
Like, it was trying to persuade us to something.
We can be like, I don't understand all the steps, but I know that this is something
that's valid and you're not just making shit up.
That seems very hard if you wanted to be competitive with learned reasoning.
So, like, I don't, I mean, it depends a little bit exactly how you set it up,
but for like the ambitious versions of that, let's say it would address the alignment problem,
they seem pretty unlikely, you know, like 5, 10% kind of thing.
Is there an upper bound on intelligence?
Not in the near term, but just like super intelligence at some point, like how far do you think that can go?
It seems like it's going to depend a little bit on what is meant by intelligence.
It kind of reads as a question that's similar.
So is there an upper bound on like strength or something?
Like there are a lot of forms.
I think it's like the case that for, yeah,
I think there are like sort of arbitrarily smart input output functionalities.
And then like if you hold fixed the amount of compute,
there is some smartest one.
If you're just like what's the best set of like 10 to the 40th operations?
There's just, there's only finally many of them.
So some like best one for any particular notion of best that you have in mind.
So I guess like I'm just like for the unbounded question where you're allowed to use
arbitrary description complexity and compute like probably now.
and for the, I mean, there is some, like, optimal conduct.
Like, if you're like, I have some goal in mind,
and I'm just like, what action best achieves it.
If you imagine, like, a little box embedded in the universe,
I think there's kind of just, like, an optimal input-output behavior.
So I guess in that sense, I think there is an upper bound,
but it's not saturatable in the physical universe,
because it's definitely exponentially slow.
Right, yeah, yeah.
Or, you know, because of comms or other things or heat,
it just might be physically impossible to insatiate something's moderate than this thing.
Yeah, I mean, like, for example,
if you imagine what the best of the best thing,
thing is it would almost certainly involve just like simulating every possible universe it might be in
modulolic moral constraints which i don't know if you want to include them but like so that would be
very very slow it would involve simulating like all you know it's sort of like i don't know exactly how
slow but like double exponential very slow Carl shulman laid out his picture of the intelligence
explosion in the seven-hour episode um what i know you guys have talked a lot what about his basic
picture like what is do you have some disagreement main disagreements is there some crux
that you guys have explored it's related to our timelines discussion from earlier i think the
biggest yeah i think the biggest issue is probably error bars where like carl has a very like
very software focused very fast kind of take off picture and i think that is plausible but not that
likely like i think it's a couple ways you could perturb the situation and my guess is one of
them applies um so maybe i have like i don't know exactly what carl's probably is i feel like
Carl's going to have like a 60% chance on some crazy thing that I'm only going to assign like a 20% chance to or 30% chance or something. And like I think those those kinds of perturbations are like one, how long a period is there of complementarity between AI capabilities and human capabilities, which will tend to soften takeoff to like how much diminishing returns are there on software progress such that like is a like broader takeoff involving scaling like electricity production and hardware production? Is that likely to happen during takeoff where I'm like more like 50-50 or more?
stuff like this.
Yeah, okay.
So is it that you think the ultimate constraints will be more hard?
Or like the basic case he's laid out is that you can just have a sequence of things like
flash attention or M-O-E and you can just keep stacking these kinds of things on.
I'm very unsure if you can keep stacking them or like it's kind of a question of what's like
the returns curve and like Carl has some inference from historical data or some way he'd extrapolate
the trend.
I am more like 50-50 on whether the software.
and the intelligence explosion is even possible.
And then like a somewhat higher probability that it's slower than hardware.
Why do you think it might not be possible?
Well, so the entire question is like if you double R&D effort,
do you get enough additional improvement to further double the efficiency?
And like that's that question will itself be a function of your hardware base,
like how much hardware you have.
And I know questions like at the amount of hardware we're going to have
and the level of sophistication we have as the process begins.
Like is it the case that each doubling of actually the initials only depends on the hardware
or like each level of hardware will have some place at this dynamic asymptotes.
So the question is just like for how long is it the case that each doubling of R&D at least doubles the effective output of your, you know, AI research population?
And I think like I have a higher probability on that.
And it's kind of close if you look at the empirics.
I think the empirics benefit a lot from like continuing hardware scale up.
So the like the effective R&D stock is like significantly smaller than it looks, if that makes sense.
What are the empirics you're referring to?
So there's kind of two sources of evidence.
One is, like, looking across a bunch of industries at, like, what is the general improvement with each doubling of, like, either R&D investment or experience?
Where, like, it is quite exceptional to have a field with not, anyway, it's pretty good to have a field where each time you double in R&D investment, you get a doubling of efficiency.
The second source of evidence is on, like, actual, like, algorithmic improvement in ML, which is obviously much, much scarcer.
And they're, like, you can make a case that it's been, like, each doubling of R&D has given you, like, roughly a 4x or something increase in computational efficiency.
But there's a question of how much that benefits.
When I say the effect of R&D stock is smaller, I mean like we scale up, like you're doing a new task.
Every couple years we're doing a new task because you're operating a scale much larger than the previous scale.
And so a lot of your effort is how to make use of the new scale.
So if you're not increasing your installed hardware base and just flat at a level of hardware,
I think you get much faster diminishing returns than people have gotten historically.
I think Carl agrees in principle.
This is true.
And then once you make that adjustment, I think it's like very unclear where the empirics shakeout.
I think Carl has thought about these more than I am, so I should maybe defer more.
but anyway, I'm at like 50-50 on that.
How have your timelines changed over the last 20 years?
Last 20 years?
Yeah.
How long have you been working on anything related to AI?
So I started thinking about this stuff in like 2010 or so.
So I think my first, my earliest timeline prediction will be in like 2011.
I think in 2011, my like rough picture was like,
we will not have insane AI in the next 10 years.
And then like I get increasingly uncertain after that.
but like we converge to like, you know, 1% per year or something like that.
And then probably in 2016, my take was like we won't have Crazy AI in the next five years,
but then we converge to like 1 or 2% per year after that.
Then in 2019, I guess I made a round of forecasts where I gave like 30% or something to 25% to CrazyEye by 2040
and like 10% by 2030 or something like that.
So I think my 2030 probability has been kind of stable and my 2040 probability has been going up.
And I would guess it's too sticky.
I guess that 40% that I gave at the beginning is just like from not having updated recently enough.
And I maybe just need to sit down.
I would guess that should be even higher.
I think like 15% in 2030 I'm not feeling that bad about.
This is just like each passing year is like a big update against 2030.
Like we don't have that many years left.
And that's like roughly counterbalanced with AI going pretty well.
Whereas for like the 2040 thing, like the passing years are not that big a deal.
And like, as we see that, like, things are basically working.
That's, like, cutting out a lot of the probability of not having AI by 2040.
So, yeah, my 2030 probability up a little bit, like, maybe twice as high as it used to be or, like, something like that.
My 2040 probably, like, up more, much more significantly.
How fast do you think we can keep building fabs to keep up with the eye demand?
Yeah, I don't know much about any of the relevant areas.
My, like, best guess is, like, I mean, my understanding is right now, like, 5%.
or something of like the next year's like total or like best process fabs will be making AI hardware
of which like only a small fraction will be going into very large training runs like only a couple
so maybe a couple percent of total output and then like that represents maybe like one percent
of total possible output a couple percent of like leading process one percent of total or something
I don't know if that's right but it's like the rough ballpark we're in I think things will be like
pretty fast as you scale up for like the next order of magnitude or two from there because
you're basically just shifting over other stuff.
My sense is it would be like years of delay.
There's like multiple reasons that you expect years of delay for going past that.
Maybe even at that you start having, yeah, there's just a lot of problems.
Like building new fabs is quite slow.
And I don't think there's like, TSM is not like planning on increases in total demand
driven by AI, like kind of conspicuously not planning on it.
I don't think anyone else is really ramping up production in anticipation either.
So I think, and then similarly like just building data.
the centers of that size seems like very, very hard and also probably has multiple years of delay.
What does your portfolio look like?
I've tried to get rid of most of the AI stuff that's plausibly implicated in policy
work or like MCG advocacy on the RSP stuff for my involvement with Anthropic.
What would it look like if you had no complex of interest?
And no insight information.
Like, I also still have a bunch of hardware investments, which I need to think about.
But like, I don't know, a lot of CSMC.
I have a chunk of
Nvidia, although I just keep betting
against Nvidia constantly
since 2016 or something. I've been
destroyed on that bet, although AMD has also done fine.
And it's like, well, now
the case now is even easier, but it's similar to the case
in the old days. Just a very expensive
company, given the total amount
of R&D investment they've made. They have like, whatever,
a trillion dollar valuation or something.
That's, like, very high.
So the question is, like,
how expensive is it to, like, make a TPU?
such that's like actually out competes age 100 or something.
And I'm like, wow.
It's real level, high level of incompetence
if Google can't catch up fast enough
to like make that trillion dollar valuation not justified.
Whereas with TSM, it's much harder.
They have a harder mode, you think?
Yeah, I think it's a lot harder.
Especially if you're in this regime
where like you're trying to scale up.
So like if you're unable to build fabs,
I think we'll take a very long time
to build as many fabs as people want.
Like the effect of that will be to like bid up
the price of existing fabs
and existing semiconductor manufacturing,
equipment. So, like, just those hard assets will become, like, spectacularly valuable,
as well, the existing, like, GPUs and, like, the actual, yeah. Yeah, I think it's just hard.
That seems like the hardest asset to scale up quickly. So it's like the asset, if you have,
like, a rapid run-up, it's the one that you'd expect to most benefit. Whereas, like,
NVIDIA's stuff will ultimately be replaced by, like, either it's better stuff made by humans
or stuff made by, with AI assistance. Like, the gap will close even further as you build
AI systems. Right. Unless NVIDIA is using those systems.
Yeah, the point is just like the future R&D will sew dwarf past R&D.
And there's like just not that much stickiness.
There's less stickiness in the future than their husband in the past.
Like, yeah.
I don't know.
So I don't want to not commenting from any private information, just in my gut.
Having caveat of this is like the single bet I've most lost.
Okay.
Not including Nvidia in that portfolio.
And final question, there's a lot of schemes out there for alignment.
And I think just like a lot of general takes.
And a lot of this stuff is over my head where I think I literally took me like weeks to
understand the mechanistic anomaly stuff you work on. Without spending weeks, how do you detect
bullshit? Like, people have explained their schemes to me and I'm like, honestly, I don't know if
it makes sense or not. With you, I'm just like, I trust Paul enough that I think there's probably
something here if I try to understand this enough. But without, yeah, how do you, how do you detect
bullshit? Yeah, so I think it depends on the kind of work. So for like the kind of stuff we're doing,
my guess is like most people, there's just not really a way you're going to tell whether it's
bullshit. So I think like it's important that we don't spend that much money on like,
the people who want to hire are probably going to dig in in depth.
I don't think there's a way you can tell whether it's bullshit
without either spending a lot of effort or leaning on deference.
With empirical work, it's interesting and that you do have some signals of the quality of work.
You can be like, does it work in practice?
Like, does the story?
I think the stories are just radically simpler.
And so you probably can't evaluate those stories like on their face.
And then you mostly come down to these questions about like, what are the key difficulties?
Yeah, I tend to like be optimistic.
when people dismiss something because this doesn't deal with a key difficulty or this runs into the following
and Super Bowl obstacle, I tend to be like a little bit more skeptical about those arguments and tend to think like,
yeah, something can be bullshit because it's not addressing a real problem. That's like, I think, the easiest way.
Like, this is a problem someone's interested in. That's just like not actually an important problem and there's no story about why it's going to become an important problem.
E.g. like, it's not a problem now and won't get worse or it is maybe a problem now, but it's clearly getting better.
That's like one way. And then conditioned on like passing that bar.
like dealing with something that actually engages with like important parts of the argument for concern and then like actually making sense
empirically so like I think most work is anchored by its source of feedback is like actually engaging with real models so it's like does it make sense how to engage with real models and does the story about how it like deals with key difficulties actually makes sense
I'm like pretty liberal past there I think it's really hard to like easy people look at mechanistic interpretability and be like well this obviously can't succeed and I'm like I don't know how can you tell obviously can't succeed
Like, I think it's reasonable to take total investment in the field, like, how fast is it making progress?
Like, how does that pencil?
I think, like, most things people work on, they actually pencil, like, pretty fine.
Like, they look like they could be reasonable investments.
Things are not, like, super out of whack.
Okay, great.
This is, I think, a good place of close.
Paul, thank you so much for your time.
Yeah, thanks for having me.
It was good chatting.
Yeah, absolutely.
Hey, everybody.
I hope you enjoyed that episode.
As always, the most helpful thing you can do is to share the podcast.
Send it to people you think might enjoy it,
put it in Twitter, your group chats,
etc. It just splits the world.
Appreciate you listening. I'll see you next time.
Cheers.
