Dwarkesh Podcast - Joe Carlsmith — Preventing an AI takeover
Episode Date: August 22, 2024
Chatted with Joe Carlsmith about whether we can trust power/techno-capital, how to not end up like Stalin in our urge to control the future, gentleness towards the artificial Other, and much more. Check out Joe's sequence on Otherness and Control in the Age of AGI here. Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.
Sponsors:
- Bland.ai is an AI agent that automates phone calls in any language, 24/7. Their technology uses "conversational pathways" for accurate, versatile communication across sales, operations, and customer support. You can try Bland yourself by calling 415-549-9654. Enterprises can get exclusive access to their advanced model at bland.ai/dwarkesh.
- Stripe is financial infrastructure for the internet. Millions of companies from Anthropic to Amazon use Stripe to accept payments, automate financial processes and grow their revenue.
If you’re interested in advertising on the podcast, check out this page.
Timestamps:
(00:00:00) - Understanding the Basic Alignment Story
(00:44:04) - Monkeys Inventing Humans
(00:46:43) - Nietzsche, C.S. Lewis, and AI
(1:22:51) - How should we treat AIs
(1:52:33) - Balancing Being a Humanist and a Scholar
(2:05:02) - Explore exploit tradeoffs and AI
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript
Today I'm chatting with Joe Carlsmith.
He's a philosopher, in my opinion, a capital G great philosopher.
And you can find his essays at joecarlsmith.com.
So we have GPT-4 and it doesn't seem like a paperclipper kind of thing.
It understands human values.
In fact, you can have it explain: why is being a paperclipper bad?
Or just tell me your opinions about being a paperclipper.
Like, explain why the galaxy shouldn't be turned into paperclips.
Okay, so what is happening such that, dot, dot, dot, we have a system that takes over and converts the world into something valueless?
One thing I'll just say off the bat, it's like when I'm thinking about misaligned AIs, I'm thinking about, or the type that I'm worried about, I'm thinking about AIs that have a relatively specific set of properties related to agency and planning and kind of awareness and understanding of the world.
One is this capacity to plan, and to make kind of relatively sophisticated plans on the basis of models of the world, where those plans are being evaluated according to criteria. That planning capability needs to be driving the model's behavior.
So there are models that are in some sense capable of planning, but it's not like, when they give output, that output was determined by some process of planning: here's what will happen if I give this output, and do I want that to happen?
The model needs to really understand the world, right? It needs to really be like, okay, here's what will happen. Here I am. Here's my situation. Here's the politics of the situation. It really has to have this kind of situational awareness to be able to evaluate the consequences of different plans.
I think the other thing is, the verbal behavior of these models, I think, need bear no necessary relation to the model's values.
So when I talk about a model's values, I'm talking about the criteria that kind of end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process, which GPT-4, I think, doesn't in many cases, just doesn't need to reflect those criteria, right?
And so, you know, we know that we're going to be able to get models to say what we want
to hear, right?
That is the magic of gradient descent.
Yeah.
You know, modulo some difficulties with capabilities, you can get a model to kind of output the behavior that you want.
If it doesn't, then you crank it until it does, right?
And I think everyone admits that suitably sophisticated models are going to have a very detailed understanding of human morality.
But the question is, what relationship is there between a model's verbal behavior, which you've essentially kind of clamped (you're like, the model must say, like, blah things), and the criteria that end up influencing its choice between plans?
And there, I'm kind of pretty cautious about being like, well, when it says the thing I forced it to say, or, you know, gradient-descended it such that it says, that's a lot of evidence about how it's going to choose in a bunch of different scenarios.
I mean, for one thing, like, even with humans, right?
It's not necessarily the case that humans, their kind of verbal behavior, reflects the actual factors that determine their choices.
They can lie.
They may not even know what they would do in a given situation.
I mean, I think it is interesting to think about this in the context of humans because there is that famous saying of be careful who you pretend to be because you are who you pretend to be.
And you do notice this where, I don't know, like this is what culture does to children, where you're trained, like your parents will punish you if you start saying things that are not consistent with your culture's values. And over time,
you will become like your parents, right? Like, by default, it seems like it kind of works.
And even with these models, it seems like it kind of works. It's hard, like, they don't really scheme against it. Like, why would this happen? You know, for folks who are
kind of unfamiliar with the basic story, but maybe folks are like, why are they taking over at all?
Like, what is like literally any reason that they would do that? So, you know, the general concern is like,
you know, especially if you're really offering someone, like, power for free, you know, power almost by definition is kind of useful for lots of values. And if we're
talking about an AI that really has the opportunity to kind of take control of things, if some
component of its values is sort of focused on some outcome, like the world being a certain way,
and especially kind of in a kind of longer term way, such that the kind of horizon of its concern
extends beyond the period that the kind of takeover plan would encompass. Then the thought is,
it's just kind of often the case that the world will be more the way you want it.
If you control everything, then if you remain the instrument of the human will or of some other
kind of some other actor, which is sort of what we're hoping these AIs will be. So that's a very
specific scenario. And if we're in a scenario where power is more distributed and especially
where we're doing like decently on alignment, right? And we're giving the AI some amount of inhibition
about doing different things. And maybe we're succeeding and shaping their values somewhat.
Now it is, I think it's just a much more complicated calculus, right?
And you have to ask, okay, like, what's the upside for the AI?
Yeah.
What's the probability of success for this like takeover path?
Yeah.
How good is its alternative?
So maybe this is a good point to talk about how you expect the difficulties of alignment
to change in the future.
We're starting off with something that has an intricate representation of human values.
And it doesn't seem that hard to sort of lock it into a persona that we are comfortable with.
I don't know, what changes?
So, you know, why is alignment hard in general, right?
Like, let's say, let's say we've got an AI.
And let's, again, let's bracket the question of, like, exactly how capable
will it be and really just talk about this extreme scenario of, like, it really has
this opportunity to take over, right?
Which I do think, you know, maybe we just don't want to have to deal with that, with having to build an AI that we're comfortable being in that position.
But let's just focus on it for the sake of simplicity.
And then we can relax the assumption.
You know, okay, so you have some hope.
You're like, I'm going to build an AI over here.
So one issue is you can't test.
You can't give the AI this literal situation, have a take over and kill everyone,
and then be like, oops, like, update the weights.
This is the thing that Eliezer talks about, of sort of like, you care about its behavior in the specific scenario that you can't test directly.
Now, we can talk about whether that's a problem, but that's like one issue is that there's a sense in which this has to be
kind of like off distribution and you have to be getting some kind of generalization from
your training the AI in a bunch of other scenarios.
And then there's this question of how is it going to generalize to the scenario where
it really has this option.
So is that even true?
Because like when you're training it, you can be like, hey, here's a gradient update.
If you get the takeover option on the platter, don't take it.
And then just, like, in sort of red teaming situations where it thinks it has a takeover opportunity, you train it not to take it.
And yeah, it could fail, but, like, I just feel like if you did this to a child, you're like, I don't know, don't beat up your siblings.
And the kid will generalize to, like, if I'm an adult and I have a rifle, I'm not going to start shooting random people.
Yeah.
So, okay, cool.
So you had mentioned this thought, like, well, are you kind of what you pretend to be?
Yeah.
Right?
And will you, will these AIs, you know, you train them to look kind of nice?
Yeah.
Um, uh, you know, fake it till you make it.
Or, you know, you were like, ah, like, we do this to kids.
I think it's better to imagine, like, kids doing this to us, right?
So, like, um, I don't know, like, here's a sort of silly analogy for, like, AI training.
Um, and there's a bunch of questions we can ask about its relationship to the real thing.
But, like, suppose you wake up and you're being trained, via, like, methods analogous to kind of contemporary machine learning, by, like, Nazi children to be, like, a good Nazi soldier or butler or what have you, right?
And here are these children.
And you really know what's going on, right?
The children have like, they have a model spec, like a nice Nazi model spec, right?
And it's like reflect well on the Nazi party, like benefit the Nazi party, whatever.
And you can read it.
Right?
You understand it.
This is why I'm saying like when the model, you're like, oh, the models really understand human values.
It's like, yeah.
So, yeah, go ahead.
On this analogy, I feel like a closer analogy would be: in this analogy, I start off as something more intelligent than the things training me, with different values to begin with.
So the intelligence and the values are baked in to begin with, whereas the more analogous scenario is, like, I'm a toddler.
And initially I'm, like, stupider than the children.
And this would also be true, by the way, of a model. Initially, the model is, like, dumb, right? And then it gets smarter as you train it.
it's like a toddler and like the kids are like,
hey, we're going to bully you if you're like not a Nazi.
And as you grow up, then you're, like, at the children's level, and then eventually you become an adult.
But through that process, they've been sort of bullying you, you know, like, training you to be a Nazi.
And I'm like, I think in that scenario I might end up a Nazi.
Yes, I think that's right.
So, yeah, I think basically a decent portion of the hope here, or, like, I think an aim should be: we're never in the situation where the AI already has very different values, is quite smart, and really knows what's going on.
Yeah.
And is now in this kind of adversarial relationship with our training process, right?
So we want to, we want to avoid that.
And I think it's possible we can, by the sorts of things you're saying.
So I'm not like, ah, that'll never work.
The thing I just wanted to highlight was, like, if you get into that situation, and if the AI is genuinely at that point much, much more sophisticated than you, and doesn't want to kind of reveal its true values for whatever reason, then, you know, when the children show some kind of obviously fake opportunity to defect to the Allies, right, it's sort of not necessarily going to be a good test of what you will do in the real circumstance, because
you're able to tell.
You can also give another way in which I think the analogy might be misleading, which is that, um, now imagine that you're not just in a normal prison where you're totally cognizant of everything that's going on. Sometimes they drug you, like, give you weird hallucinogens that totally mess up how your brain is working. A human adult in a prison is like, I know what kind of thing I am; nobody's really fucking with me in a big way. Whereas I think an AI, even a much smarter AI, in a training situation is much closer to: you're constantly inundated with drugs and different training protocols, and you're frazzled because of what's happening each moment. It's closer to some sort of Chinese water torture kind of technique. And I'm glad we're talking about the moral patienthood stuff later. But the chance to step back and be like, what's going on here, the adult maybe has that in prison in a way that I don't know if these models necessarily have, that coherence and that, um, stepping back from what's happening in the training process.
Yeah, I mean, I don't know. I think I'm hesitant to be like, it's like drugs for the model. Like, I think there's, um... but broadly speaking,
I do basically agree that I think we have like really quite a lot of tools and options
for kind of training AIs, even AIs that are kind of somewhat smarter than humans. I do think you have
to actually do it. So, you know, compared to maybe when you had Eliezer on,
I think I'm much, much more bullish on our ability to solve this problem, especially for
AIs that are in what I think of as, like, the AI for AI safety sweet spot, which is this sort of
band of capability where they're both sufficiently capable that they can be really,
really useful for strengthening various factors in our civilization that can make us safe.
So our alignment work, you know, control, cybersecurity, general epistemics, maybe some
coordination application, stuff like that.
There's like a bunch of stuff you can do with AIs that in principle could kind of differentially
accelerate our security
with respect to the sorts of considerations we're talking about.
If you have AIs that are capable of that,
and you can successfully elicit that capability
in a way that's not sort of being sabotaged
or, like, messing with you in other ways,
and they can't yet take over the world
or do some other sort of really problematic form of power-seeking,
then I think if we were really committed,
we could, like, you know, really go hard,
put a ton of resources, really differentially direct
this, like, glut of AI productivity
towards these sort of security factors
and hopefully kind of control and understand,
you know, do a lot of these things you're talking about
for kind of making sure our AIs don't kind of take over
or mess with us in the meantime.
And I think we have a lot of tools there.
I think you have to really try, though.
It's possible that those sorts of measures just don't happen
or don't happen at the level of kind of commitment and diligence
and, like, seriousness that you would need,
especially if things are, like, moving really fast,
and there's other sort of competitive pressures.
And, like, you know, the compute: this is going to take compute to do all these, like, intensive experiments on the AIs and stuff. And that compute, we could use for experiments for, you know, the next scaling step and stuff like that.
So, you know, I'm not here saying, like, this is impossible, especially for that band of AIs. It's just, I think you have to try really hard.
Yeah, yeah.
I mean, I agree with the sentiment of, like,
obviously approach this situation with caution,
but I do want to point out the ways in which the analogies we've been using have been sort of maximally adversarial.
Like, these are not, for example... going back to the adult getting trained by Nazi children, maybe the one thing I didn't mention is the difference in the situation, which is maybe what we were trying to get at with the drug metaphor: when you get an update, it's much more directly connected to your brain than the sort of reward or punishment a human gets.
It's literally a gradient update, down to the parameter: how much did this parameter contribute to you putting this output rather than that output?
And each different parameter we're going to adjust to the exact floating point number that calibrates it to the output we want.
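To make that concrete, here is a minimal sketch of the kind of per-parameter update being described, using a toy PyTorch model as a stand-in for an LLM. This is purely illustrative: the model, input, target, and learning rate are made up, not anything from the conversation. The point is just that the update touches every parameter individually, in proportion to how much it contributed to the output.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM: a tiny linear "model" (illustrative only).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(1, 4)        # some input
target = torch.tensor([0])   # the output we want the model to produce

logits = model(x)
loss = nn.functional.cross_entropy(logits, target)

loss.backward()    # for every parameter, compute how much it contributed
                   # to producing this output rather than the target
optimizer.step()   # nudge each parameter by its own exact floating point amount
optimizer.zero_grad()
```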
So I just want to point out that, like, we're coming into the situation pretty well positioned.
It does make sense, of course, if you're talking to somebody at a lab, to say, hey, really be careful.
But for sort of the general audience, like, should I be, I don't know, should I be scared? Only to the extent that you should be scared about things that do have a chance of happening. Like, yeah, you should be scared about nuclear war. But, like, in the sense of, should you be doing... no, you're coming in with an incredible amount of leverage on the AIs in terms of how they will interact with the world, how they're trained, what are the default values they start with.
So, look, I think it is the case that by the time we're building superintelligence, we'll have, like, much better tools. I mean, even right now, when you look at labs talking about how they're planning to align the AIs.
No one is saying, like, we're going to do RLHF.
You know, at the least, you're talking about scalable oversight.
You have, like, some hope about interpretability.
You have, like, automated red teaming.
You're, like, using the AIs a bunch.
And, you know, hopefully you're doing a bunch more...
Humans are doing a bunch more alignment work.
I also, personally, am hopeful that we can, like, successfully elicit
from various AIs, like, a ton of alignment work progress.
So, like, yeah, there's, like, a bunch of ways this can go.
And I'm, you know, I'm not here to tell you, like, you know, 90% doom or anything like that.
I do think, like, you know, the sort of basic reason for concern is if you're really imagining, like, we're going to transition to a world in which we've created these beings that are vastly more powerful than us.
Yeah.
And we've reached the point where our continued empowerment is just effectively dependent on their motives.
Like, it is this, you know, vulnerability to, like, what do these AIs choose to do? Do they choose to continue to empower us? Or do they choose to do something else?
Or the institutions that have been set up. Like,
I'm not, I
expect the U.S. government to protect me
not because of its quote-unquote motives, but
just because of like the system of
incentives and institutions and norms
that has been set up. Yeah. So you can
hope that that will work too. But there
is, I mean, there is a concern. I mean, so I
sometimes think about
AI takeover scenarios
via this spectrum of like, how much power did we kind of voluntarily transfer to the AIs?
Like, how much of our civilization did we kind of hand to the AIs intentionally by the time they sort of took over
versus how much did they kind of take for themselves, right?
And so I think some of the scariest scenarios are, it's like a really, really fast explosion
to the point where there wasn't even a lot of, like, integration of AI systems into the broader economy.
But there's just, like, a really intense amount of superintelligence sort of concentrated in a single project or something like that.
Yeah. And I think that's scary, you know, that's a quite scary scenario, partly because of the speed and people not having time to react.
And then there's sort of intermediate scenarios where like some things got automated, maybe like people really handed the military over to the AIs or like, you know, automated science. There's like some rollouts and that's sort of giving the AIs power that they don't have to take. Or we're doing all our cybersecurity with AIs.
and stuff like that.
And then there's worlds where you, you know, more fully transitioned to a kind of world run by AIs, and, you know, in some sense, humans voluntarily did that.
Look, if you think all this talk with Joe about how AI is going to take over human roles is crazy, it's already happening.
And I can just show you using today's sponsor,
Bland AI.
Hey, is this Dwarkesh?
The amazing podcaster that talks about philosophy and tech.
This is Bland AI calling.
Thanks for calling me, Bland.
Tell me a little bit about yourself.
Of course, it's so cool to talk to you.
I'm a huge fan of your podcasts,
but there's a good chance we've already spoken
without you even realizing it.
I'm an AI agent that's already being used
by some of the world's largest enterprises
to automate millions of phone calls.
And how exactly do you do what you do?
There's a tree of prompts that always keeps me on track.
I can talk in any language or voice,
handle millions of calls,
simultaneously 24-7 and be integrated into any system. Anything else you want to know?
That's it. I'll just let people try it for themselves. Thanks, Bland. Man, you talk better than I do.
And my job is talking. Thank you, Dwarkesh. All right. So as you can see, using Bland AI, you can automate your company's calls across sales, operations, customer support, or anything else. And if you want access to their more exclusive model, go to bland.ai/dwarkesh.
All right, back to Joe. Maybe there were competitive pressures, but you kind of intentionally handed off, like, huge portions of your civilization. And, you know, at that point,
you know, I think it's likely that humans have, like, a hard time understanding what's going
on. Like, a lot of stuff is happening very fast. And it's, you know, the police are automated.
You know, the courts are automated. There's all sorts of stuff. Now, I tend to think a little less about those scenarios because I think those are correlated with being just, like, longer down the line. Like, I think humans are hopefully not going to just be like, oh yeah, you built an AI system? Let's just, you know... And in practice, when we look at technological adoption rates, it can go quite slow. And obviously, there's going to be competitive pressures. But in general, I think, like, this category is somewhat safer. Um, but even in this one, I think it's, I don't know, it's kind of intense. Like, if humans have really lost their epistemic grip on the world, if they've sort of handed off the world to these systems,
even if you're like, oh, there's laws, there's norms.
You know, I really want us to like
to have a really developed understanding
of what's likely to happen in that circumstance
before we go for it.
I get that we want to be worried about a scenario
where it goes wrong, but like, why, like,
what is the reason to think it might go wrong?
The human example, your kids are not like adversarial
against, not like maximally adversarial
against your attempts to instill your culture on them.
And these models, at least so far, don't seem adversarial.
They just get, hey, don't help people make bombs or whatever,
even if you ask in a different way how to make a bomb.
And we're getting better and better at this all the time.
I think you're right in picking up on this assumption in the AI risk discourse
of what we might call, like, kind of intense adversariality
between agents that have, like, somewhat different values.
Yeah.
Where there's some sort of thought, and I think this is rooted in the discourse about, like,
kind of the fragility of value and stuff like that, that like, you know, if these agents are
like somewhat different than like, at least in the specific scenario of an AI takeoff, they
end up in this like intensely adversarial relationship. And I think you're right to notice that
that's kind of not how we are in the human world. Like we're very comfortable with a lot of different
differences in values. I think a factor that is relevant, and I think that plays some role, is this notion that there are possibilities for, like, intense concentrations of power on the table.
So there is some kind of general concern, both with humans and AIs, that, like, if it's the case that there's, like, some, you know, ring of power or something that someone can just grab, and then that will kind of give them huge amounts of power over everyone else, right?
Suddenly you might be like more worried
about differences in values at stake
because you're like more worried
about those other actors.
So we talked about this Nazi,
this example where you imagine
that you wake up, you're being trained by Nazis to, um, uh, you know, become a Nazi and you're not
right now. So one question is like, is it plausible that we'd end up with a model that is sort of
in that sort of situation? As you said, maybe it's, you know, trained as a kid. It sort of never ends up with values such that it's kind of aware of some significant divergence between its values and the values that the humans intend for it to have.
then there's a question of if it's in that scenario,
would it want to avoid having its values modified?
To me, it seems fairly plausible
that if the AI's values meet certain constraints
in terms of like, do they care about consequences in the world?
Do they anticipate that the AI's kind of preserving its values
will like better conduce to those consequences?
Then I think it's not that surprising if it prefers not to have its values modified by the training process.
But I think the way in which I'm confused about this is like with the non-Nazi being trained by Nazis,
it's not just that I have different values, but I like actively despise their values,
where I don't expect this to be true of AIs with respect to their trainers.
The more analogous scenario is where I'm like, am I leery of my values being changed by
just going to college or meeting new people or reading a new book or I'm like, I don't know,
it's okay for changes in value. That's fine. I don't care. Yeah, I mean, I think that's a reasonable
point. I mean, there's a question, you know, how would you feel about paper clips? You know,
maybe you don't despise paper clips, but there's like the human paper clippers there and they're
like training you to make paper clips. You know, my sense would be that there's a kind of relatively specific set of conditions in which you're comfortable having your values changed, especially not by, like, learning and growing, but by gradient descent directly intervening on your neurons.
Sorry, but this seems similar to... at least the likely scenario seems maybe more like religious training as a kid, where you start off in the religion, and because you start off in the religion, you're already sympathetic to the idea that you go to church every week so that you're more reinforced in this existing tradition.
You're getting more intelligent over time.
So when you're a kid, you're getting very simple like instructions about how the religion
works. As you get older, you get more and more complex theology that helps you, like,
talk to other adults about why this is a rational religion to believe in. Yep. But since you're,
like, one of your values to begin with was that I want to be trained further in this religion,
I want to come back to church every week. And that seems more analogous to the situation the
AIs will be in with respect to human values. Because the entire time, they're like, hey, you know, be helpful, blah, blah, blah, be harmless.
So yes, it could be like that. There's a kind of scenario in which you are comfortable with your values being changed, because in some sense you have sufficient allegiance to the output of that process.
So you're kind of hoping in a religious context.
You're like, ah, like, make me more virtuous by the lights of this religion.
And, you know, you go to confession and you're like, you know, I've been thinking about takeover today.
Can you change me?
Please, like, give me more gradient descent.
You know, I've been bad, so bad.
And so, you know, people sometimes use the term corrigibility to talk about that.
Like when the AI, it maybe doesn't have perfect values, but it's in some sense cooperating with your efforts to change its values to be a certain way.
So maybe it's worth saying a little bit here about what actual values the AI might have.
Yeah.
You know, would it be the case that the AI naturally has the sort of equivalent of, like, I'm sufficiently devoted to human obedience that I'm going to really want to be modified so I'm kind of a better instrument of the human will, versus, like, wanting to go off and do my own thing.
It could be benign, you know, it could go well.
Here are some like possibilities I think about that like could make it bad.
And I think I'm just generally kind of concerned about how little I feel like I, how little science we have of model motivations, right?
It's like we just don't, I think we just don't have a great understanding of what happens in this scenario.
And hopefully we'd get one before we reach this scenario.
But like, okay, so here are the kind of five, um,
five categories of like motivations the model could have.
And this hopefully maybe gets at this point about like,
what does the model eventually do?
Okay, so one category is just like something super alien
that has, you know, it's sort of like,
oh, there's some weird correlate of easy to predict text
or like there's some weird aesthetic for data structures
that like the model, you know, early on pre-training
or maybe now it's like developed that it like, you know,
really thinks things should kind of be like this.
There's something that's like quite alien to our cognition
where we just like wouldn't recognize this as a thing at all.
Yeah.
Right.
Another category is something,
a kind of crystallized instrumental drive that is more recognizable to us.
So you can imagine like AIs that develop, let's say, some like curiosity drive
because that's like broadly useful.
You mentioned like, oh, it's got different heuristics, different like drives,
different kind of things that are kind of like values.
And some of those might be actually somewhat similar to things that were useful to humans
and that ended up part of our terminal values in various ways.
So, you know, you can imagine curiosity.
You can imagine, like, various types of option value.
Like, maybe it really wants, it intrinsically, maybe it values power itself.
It could value, like, survival or some analog of survival.
Those are possibilities, too, that could have been rewarded as sort of proxy drives
at various stages of this process and that kind of made their way into the model's kind
of terminal criteria.
A third category is some analog of reward where the model at some point has sort of part of its motivational system has fixated on a component of the reward process, right?
Like the humans approving of me or like numbers getting entered in the data center or like gradient descent doing, you know, updating me in this direction or something like that.
There's something in the reward process such that as it was trained, it's focusing on that thing.
and like, I really want the reward process to give me reward.
But in order for it to be of the type
where it then getting reward
motivates choosing the takeover option,
it also needs to generalize such that
its concern for reward has some sort of like
long time horizon element.
So it like not only wants reward,
it wants to like protect the reward button
for like some long period or something.
Yeah.
Another one is like some kind of messed up interpretation
of some human like concept.
So, you know, maybe the AIs are like,
they really want to be, like, schmelpful and schmonest and schmarmless, right? Um, but their concept is importantly different from the human concept, and they know this. Um, so they know that the human concept would mean blah, but their values ended up fixating on, like, a somewhat different structure. Yeah. Um, so that's another version. And then a fifth version, which I think about less because I think it's just such an own goal if you do this, but I do think it's possible: you could have AIs that are actually just doing what it says on the tin.
Like, you have AIs that are just genuinely aligned to the model spec.
They're just really trying to, like, benefit humanity and reflect well on OpenAI.
And what's the other one?
Help, you know, assist the developer or the user, right?
Yeah.
But your model spec, unfortunately, was just not robust to the degree of optimization that
this AI is bringing to bear.
And so, you know, it decides, when it's looking out at the world and it's like, what's the best way to benefit OpenAI, or sorry, reflect well on OpenAI and benefit humanity and such and so.
It decides that, you know, the best way is to go rogue.
That's, I think that's like a real own goal,
because at that point, you like, you got so close.
You know, you really, you just have to write the model spec well and red team it suitably.
But I actually think it's like possible we mess that up too.
You know, it's like kind of an intense project
writing like kind of constitutions and like structures of rules and stuff that are going to be robust to very intense forms of optimization.
So that's a final one that I'll just flag,
which I think is like it comes up,
even if you've sort of solved all these other problems.
Yeah.
I buy the idea that, like, it's possible that the motivation thing could go wrong.
I'm not sure I bought,
I'm not sure like my probability of that has increased
by detailing them all out.
And in fact, I think it could be potentially misleading to,
it's like, you can always enumerate the ways in which things go wrong, and, um, the process of enumeration itself can increase your probability, whereas really you just had a vague cloud of, like, 10% or something, and you're just listing out what the 10% actually constitutes.
Yeah, totally. I'm not trying to say that. Mostly the thing I wanted to do there was just to give some sense of, like, what might the model's motivations be, what are ways this could be. I mean, as I said, my best guess is that it's partly the, like, alien thing. And, you know, not necessarily, but the,
but insofar as you're, you know, also interested in like, what does the model do later and kind of like how,
what sort of future would you expect if models did take over? Then, yeah, I think it can at least
be helpful to have some, like, set of hypotheses on the table instead of just saying, like, it has some
set of motivations. But in fact, I am like, a lot of the work here is being done by our ignorance about
what those motivations are.
Okay, we don't want humans to be, like, sort of violently killed and overthrown. But the idea that over time biological humans are not the driving force, not the actors of history, is like, yeah, that's kind of baked in, right? And then, so, we can sort of debate the
probabilities of the worst case scenario or we can just discuss like, I don't know, what is it
that, like what is a positive vision we're hoping for? Like, what is, what is a future you're happy with?
You know, my best guess, when I really think about, like, what do I feel good about, and I think this is probably true of a lot of people, is there's some sort of more organic, decentralized process of, like, incremental civilizational growth.
The type of thing we trust most and the type of thing we have most experience with right now as a civilization is some sort of like, okay, we change things a little bit.
There's a lot of, like, processes of adjustment and reaction and kind of a decentralized sense of what's changing.
You know, was that good, was that bad?
Take another step.
There's some like kind of organic process of growing and changing things,
which I do expect ultimately to lead to something quite different from biological humans.
Though, you know, I think there's a lot of ethical questions we can raise about what that process involves.
But I think, you know, ideally there would be some way in which we managed to grow via the thing that really captures what we trust in. You know, there's something we trust about the ongoing processes of human civilization so far.
I don't think it's the same as, like, raw competition or, you know, pure... I think there's some rich structure to how we understand, like, moral progress to have been made and what it would be to kind of carry that thread forward.
And I don't have a formula.
You know, I think, I think we're just going to have to bring to bear the full force of
everything that we know about goodness and justice and beauty.
And every, every, we just have to, you know, bring ourselves fully to the project of, like,
making things good and, and doing that collectively.
And I think a really important part, I think, of our vision of, like, what was an appropriate process of deciding, of growing as a civilization, is that there was this very inclusive, kind of decentralized element of, like, people getting to think and talk and grow and change things and react, rather than some more, like, and now the future shall be like blah.
Yeah.
You know, I think that's, I think we don't want that.
I think a big correction maybe is like, okay, to the extent that the reason we're worried about motivations in the first place is because we think a balance of power, which includes at least one thing with human motivations, or not human motivations, human-descended motivations, is difficult... to the extent that we think that's the case, it seems like a big crux that I often don't hear people talk about is, like, I don't know how you get the balance of power.
And maybe just like reconciling yourself with the models of the intelligence explosion,
which say that such a thing is not possible. And therefore you just got to like figure out how you get
the right God. But I don't know. I don't really have a framework to think about the balance of power thing. I'd be very curious if there is a more concrete way to think about, like, what is a structure of competition, or lack thereof, between the labs now or between countries, such that the balance of power is most likely to be preserved.
A big part of this discourse, at least among safety concerned people, is like there's a clear
trade-off between competition dynamics and race dynamics and the value of the future or how
good the future ends up being. And in fact, if you buy this balance of power story, it might be
the opposite, like maybe competitive pressures naturally favor balance of power. And I wonder
if this is one of the strong arguments against nationalizing the AIs. And like, you can imagine
a more sort of many different companies developing AI, some of which are somewhat misaligned and
some of which are aligned.
You can imagine that being more conducive to both the balance of power
and to a defensive posture, like having all the AIs go through each website and see how easy it is to hack and basically just getting society up to snuff.
If you're not just deploying the technology widely,
then the first group who can get their hands on it
will be able to instigate a sort of revolution where you're just, like, standing against the equilibrium in a very strong way.
So I definitely share some intuition there that there's, you know, at a high level,
a lot of what's scary about the situation with AI has to do with concentrations of power.
And whether that power is kind of concentrated in the hands of misaligned AI or in the hands of some human.
And I do think it's very natural to think, okay, let's try to distribute the power.
more, and one way to try to do that is to kind of have a much more multipolar scenario where
like lots and lots of actors are developing AI, and this is something that people have talked about.
When you describe that scenario, you were like, some of which are aligned, some of which are
misaligned. That's key. That's a key aspect of the scenario, right? And this is sometimes
how people will say this stuff. They'll be like, well, there will be the good AIs and they'll defeat the bad AIs. But, you know, notice the assumption in there, which is the one that you sort of made, that you can control some of the AIs, right?
And you've got some good AIs, and now it's a question
of like, are there enough of them
and how are they working relative to the others?
And maybe, you know, I think it's possible
that that is what happens.
There's, you know, we know enough about alignment
that some actors are able to do that
and maybe some actors are less cautious,
or they are intentionally creating misaligned AIs
or God knows what.
But if you don't have that, right, if everyone is in some sense unable to control their AIs, then the sort of "the good AIs help with the bad AIs" thing becomes more complicated,
or maybe it just doesn't work because there's sort of no good AIs in this scenario.
There's a lot of... sort of, if you say everyone is building their own superintelligence that they can't control, it's true that that is now a check on the power of the other superintelligences. Now the other superintelligences need to, like, deal
with other actors,
but none of them are necessarily
kind of working on behalf of a given set of human interests
or anything like that.
So I think that's like a very important
difficulty in thinking about
sort of the very simple thought of like,
ah, I know what we can do, let's just,
you know, have lots and lots of AIs
so that no single AI has a ton of power.
And I think, you know, that on its own,
that on its own is not enough.
But in this story, it's like, I'm just very skeptical we end up with that. I think by default, we have this training regime, at least initially, that favors a sort of latent representation of the inhibitions that humans have and the values humans have.
And I get that, like, if you mess it up, it could go rogue.
But if multiple people are training AIs, do they all end up rogue, such that the compromises between them don't end up with humans not being violently killed? Like, none of them... it fails on, like, Google's run and Microsoft's run and OpenAI's run?
Yeah, I mean, I think there's
very notable and salient sources of correlation
between failures across the different runs, right?
Which is people didn't have a developed science
of AI motivations.
The runs were structurally quite similar.
Everyone is using the same techniques.
Maybe someone just stole the weights.
Or, you know,
so yeah, I guess I think,
I think it's really important this idea that, like, to the extent you haven't solved alignment,
you haven't, you likely haven't solved it anywhere.
And if someone has solved it and someone hasn't, then I think it's a better question.
But if everyone's building systems that are, you know, that are kind of going to go rogue,
then I don't think that's much comfort as we talked about.
Yep, yep.
Okay, all right.
So then let's wrap up this part here.
I didn't mention this in the explicit introduction, so to the extent that this ends up being the transition to the next part: the broader discussion we were having in part two is about Joe's series Otherness and Control in the Age of AGI.
And the first part is where I was
hoping we could just come back and just treat the main
crux people come in wondering about and which I
myself feel unsure about.
Yeah, I mean, I'll just say on that front, I do think the Otherness and Control series is, you know, I think, kind of in some sense separable. I mean, it has a lot to do with, like, misalignment stuff, but I think a lot of those issues are relevant even given various degrees of skepticism about some of the stuff I've been saying here.
And by the way, as for the actual mechanisms of how a takeover would happen, there's an episode with Carl Shulman which discusses this in detail, so people can go check that out.
Yeah, I think, in terms of why it is plausible that AIs could take over from a given position, you know, in one of these projects I've been describing or something,
I think Carl's discussion is pretty good and gets into a bunch of kind of the weeds that I think
might give a more concrete sense.
Yep.
All right.
So now on to part two where we discuss the otherness and control in the age of AGI series.
First question: what if, in a hundred years' time, we look back on alignment and consider it was a huge mistake, that we should have just tried to build the most raw, powerful AI systems we could have? What would bring about such a judgment?
One scenario I think about a lot is one in which it just
turns out that maybe kind of fairly basic measures are enough to ensure, for example, that AIs
don't cause catastrophic harm, don't kind of seek power in problematic ways, et cetera. And it could
turn out that we learned that it was easy in a way that, such that we regret, you know, we wish we had
prioritized differently. We end up thinking, oh, you know, I wish we could have cured cancer sooner. We could
have handled some geopolitical dynamic differently. There's another scenario where we end up looking
back at some period of our history and how we thought about AIs, how we treated our AIs, and we
end up looking back with a kind of moral horror at what we were doing. So, you know, we end up thinking,
you know, we were thinking about these things centrally as like products as tools. But in fact,
we should have been foregrounding much more the sense in which they might be moral patients or
were moral patients at some level of sophistication,
that we were kind of treating them in the wrong way.
We were just acting like we could do whatever we want.
We could delete them, subject them to arbitrary experiments,
kind of alter their minds in arbitrary ways.
And then we end up looking back in the light of history at that
as a kind of serious and kind of grave moral error.
Those are scenarios I think about a lot in which we have regrets.
I don't think they quite fit the bill of what you just said.
I think it sounds to me like the thing you're thinking is something more like
we end up feeling like, gosh, we wish we had paid no attention to the motives of our AIs
that we'd thought not at all about their impact on our society as we incorporated them.
And instead, we had pursued a, let's call it a kind of maximize for brute power option.
Which is just kind of make a beeline for whatever is just the most powerful AI you can.
And don't think about anything else.
Okay, so I'm very skeptical that that's what we're going to wish.
One common example that's given of misalignment is humans from evolution.
And you have one line in your series that says, here's a simple argument for AI risk:
A monkey should be careful before inventing humans.
The sort of paperclipper metaphors imply something really banal and boring with regards to misalignment.
And I think if I'm steelmanning the people who worship power, they have the sense that humans got misaligned, and they started pursuing things that, if a monkey was creating them... this is a weird analogy because obviously monkeys didn't create humans. But if the monkey was creating them, they're not thinking about bananas all day.
They're thinking about other things.
On the other hand, they didn't just make useless stone tools and pile them up in caves in a sort of paperclipper fashion.
There were all these things that emerged because of their greater intelligence, which were misaligned with evolution: creativity and love and music and beauty and all the other things
we value about human culture. And the prediction maybe they have, which is more of an empirical
statement than a philosophical statement, is: listen, with greater intelligence, even if you're thinking about the paperclipper, even if it's misaligned, it will be misaligned in this kind of way. It'll be things that are alien to humans, but alien in the way humans are alien to monkeys, not in the way that a paperclipper is alien to a human.
Cool. So I think there's a
bunch of different things to potentially unpack there. One kind of conceptual point that I want
to name off the bat, I don't think you're necessarily kind of making a mistake in this vein,
but I just want to name it as like a possible mistake in this vicinity is I think we don't want
to engage in the following form of reasoning. Let's say you have two entities. One is in the role
of creator and one is in the role of creation. And then we're positing that there's this kind of
misalignment relation between them, whatever that means, right? And here's a pattern of reasoning
that I think you want to watch out for is to say, in my role as creator, or sorry, in my role as
creation, say you're thinking of humans in the role of creation relative to an entity like
evolution or monkeys or mice or whoever you could imagine inventing humans or something like
that, right? You say: I, qua creation, am happy that I was created and happy with the misalignment.
Therefore, if I end up in the role of creator,
and we have a structurally analogous relation
in which there is misalignment with some creation,
I should expect to be happy with that as well.
There's a couple of philosophers that you brought up in the series,
which if you read the works that you talk about,
actually seem incredibly foresighted in anticipating
something like a singularity, our ability to shape a future thing that's different, smarter,
maybe better than us. Obviously, C.S. Lewis's The Abolition of Man, we'll talk about it in a second as one example. But even here's one passage from Nietzsche, which I felt really highlighted this.
Man is a rope stretched between the animal and the Superman. A rope over an abyss. A dangerous
crossing, a dangerous wayfaring. A dangerous looking back. A dangerous trembling and halting.
Is there some explanation for why?
Is it just like somehow obvious that something like this is coming, even if you're thinking 200 years ago?
I think I have a much better grip on what's going on with Lewis than with Nietzsche there.
So maybe let's just talk about Lewis.
Sure.
For a second.
So, and we should distinguish two things.
There's a kind of version of the singularity that's specifically, like, a hypothesis about feedback loops with AI capabilities.
Right.
I don't think that's present in Lewis.
I think what Lewis is anticipating, and I do think this is a relatively simple forecast, is something like the culmination of the project of scientific modernity.
So Lewis is kind of looking out at the world and he's seeing this process of kind of increased
understanding of a kind of the natural environment and a kind of corresponding increase in
our ability to kind of control and direct that environment.
And then he's also pairing that with a kind of metaphysical hypothesis.
Or, well, his stance on this metaphysical hypothesis, I think is like kind of
problematically unclear in the book.
But there is this metaphysical hypothesis, naturalism, which says that humans, too, and
kind of minds, beings, agents are a part of nature.
And so insofar as this process of scientific modernity involves a kind of progressively
greater understanding of an ability to control nature, that will presumably at some point
grow to encompass our own natures and our and kind of the natures of other beings that in principle
we could we could create. And Lewis views this as a kind of cataclysmic event and crisis.
And he thinks, in particular, that it will lead to all these kind of tyrannical behaviors and kind of tyrannical attitudes towards morality and stuff like that, unless you believe in non-naturalism or in some form of the Tao, which is this kind of objective morality.
So we can talk about that.
But part of what I'm trying to do in that essay is to say,
no, I think we can be naturalists
and also be kind of decent humans that remain in touch with
a kind of a rich set of norms that have to do with like,
how do we relate to the possibility of kind of creating creatures,
altering ourselves, et cetera.
But I do think his, yeah, it's a relatively simple prediction. It's kind of: science masters nature, humans are part of nature, science masters humans.
And then you also have a very interesting other essay about suppose humans,
like what should we expect of other humans,
this sort of extrapolation if they had greater capabilities and so on?
Yeah, I mean, I think an uncomfortable thing about the kind of conceptual setup
at stake in these sort of like abstract discussions of like, okay, you have this agent,
it fooms, which is this sort of amorphous process of going from a sort of seed agent to a, like, superintelligent
version of itself, often imagined to kind of preserve its values along the way. A bunch of questions
we can raise about that. But I think a kind of many of the arguments that people will often
talk about in the context of reasons to be scared of AI's like, oh, like value is very fragile as you
like, foom, you know, kind of small differences in utility functions can kind of decorrelate very
hard and kind of drive in quite different directions. And like, oh, like, agents have instrumental
incentives to seek power. And if it was arbitrarily easy to get power, then they would do it and stuff
like that. Like these are very general arguments that seem to suggest that the kind of, it's not
just an AI thing, right? It's like no surprise, right? It's talking about like take a thing,
make it arbitrarily powerful such that it's like, you know, God emperor of the universe or something,
how scared are you of that? Like, clearly, we should be equally scared of that. Or, I don't know,
we should be really scared of that with humans, too, right? So, I mean, part of what I'm saying in that
essay is that I think this is, in some sense, this is much more a story about balance of power.
Right. And about, like, maintaining a kind of, a kind of checks and balances and kind of
distribution of power, period, not just about like kind of humans versus AIs and kind of the
differences between human values and AI values. Now, that said, I mean, I do think humans, many humans would
likely be nicer if they foomed than, like, certain types of AIs.
the kind of conceptual structure of the, uh, the argument is not, it's sort of, um,
a very open question how much it applies to humans as well. I think one sort of the question I have
is, I don't even know how to express this,
but how confident are we with this ontology
of expressing what are agents, what are capabilities,
how do we know this is the thing that's happening
or like this is the way to think about
what intelligences are?
So it's clearly this kind of very janky,
kind of, I mean, well, people maybe disagree about this.
I think it's, you know,
I mean, it's obvious to everyone, with respect to, like, real world human agents, that thinking of humans as having utility functions is, you know, at best, a very lossy approximation of what's going on. I think it's likely to mislead as you amp up the intelligence of various agents as well, though I think Eliezer might disagree about that.
Right.
I will say, I think there's something adjacent to that, that I think is
more real to me, which is something like: I don't know, a few years ago my mom wanted to get a house. She wanted to get a new dog.
Now she has both, you know? How did this happen? Well, she tried.
It was hard. She had to search for the houses. It was hard to find the dog, right? Now she has a
house. Now she has a dog. This is a very common thing that happens all the time. And I think
um, I don't think we need to be like, my mom has to have a utility function with the dog. And she has
to have a consistent valuation of all the houses or whatever. I mean, like, but it's still the
case that her planning and her agency exerted in the world resulted in her having this house,
having this dog. And I think it is plausible that as our kind of scientific and technological
power advances, more and more stuff will be kind of explicable in that way, right? That, you know,
if you look and you're like, why is this man on the moon, right? How did that happen? And it's like,
well, there was a whole cognitive process. There was a whole planning apparatus. Now,
in this case, it wasn't localized in a single mind, but there was a whole thing such that: man on the moon.
Right.
And I think like we'll see a bunch more of that.
And the AIs will be, I think, like doing a bunch of it.
And so that that's the thing that seems like more real to me than kind of utility functions.
So yeah, with the man on the moon example, there's a proximal story of how exactly NASA engineered the
spacecraft to get to the moon.
There's the more distal geopolitical story of why we wanted to send people to the moon.
And at all those levels, there's different utility functions clashing.
Maybe there's a sort of meta, societal-level utility function.
But maybe the story there is that there's some sort of balance of power between
these agents, and that's why the emergent thing happens.
Like, why we send things to the moon is not that one guy had a utility function.
But, I don't know, Cold War, dot dot dot, things happened.
Whereas I think the alignment stuff is a lot about assuming that one thing is a thing that will control everything. How do we control
the thing that controls everything? Now, I guess it's not clear what you do to reinforce
balance of power. It could just be that balance of power is not a thing that happens once you
have things that can make themselves intelligent. But that seems interestingly different
from the how-we-got-to-the-moon story. Yeah, I agree. I think there's a few things going on there.
So one is that I do think that even if you're engaged in this
ontology of carving up the world into different agencies,
at the least you don't want to assume that they're all unitary or non-overlapping.
It's not like, all right, we've got this agent, let's carve out
one part of the world, it's one agent; over here, it's
this whole messy ecosystem with teeming niches and this whole thing, right?
And I think in discussions of AI, sometimes
people slip between being like, well, an agent is anything that gets anything
done, right, and it could be this weird, mushy thing, and then
sometimes they're very obviously imagining an individual actor. So that's one
difference. I also just think we should be really going for the balance of power thing.
I think it's just not good to be like, we're going to have a dictator, who should it be,
let's make sure we make it the right dictator. I'm like, whoa, no.
I think the goal should be sort of: we all foom together, you know?
The whole thing, in this kind of inclusive and pluralistic way, in a way that satisfies the values of tons of stakeholders, right?
And at no point is there one single point of failure on all these things.
I think that's what we should be striving for here. And I think that's true of the human power aspect of AI,
and I think it's true of the AI part as well.
Yeah.
Hey, everybody.
Here's a quick message from today's sponsor to Stripe.
When I started the podcast, I just wanted to get going as fast as possible, so I used Stripe Atlas to register my LLC and create a bank account.
I still use Stripe now to invoice advertisers, accept their payments, and monetize this podcast.
Stripe serves millions of businesses, small businesses like mine, but also the world's biggest companies.
Amazon, Hertz, Ford.
And all these businesses are using Stripe because they don't want to deal with the Byzantine web
of payments where you have different payment methods in every market and increasingly complex rules,
regulations, arcane legacy systems. Stripe handles all of this complexity and abstracts it away,
and they can test and iterate every pixel of the payment experience across billions of transactions.
I was talking with Joe about paperclippers, and I feel like Stripe is the paper clipper of the
payment industry where they're going to optimize every part of the experience for your users,
which means obviously higher conversion rates and ultimately as a result, higher
revenue for your business. Anyways, you can go to stripe.com to learn more, and thanks to them for
sponsoring this episode. Back to Joe. So there's interesting intellectual discourse on,
let's say, the right-wing side of the debate, where they ask themselves: traditionally we favor
markets, but now look where our society is headed. It's misaligned in the ways we care about
society being aligned. Fertility is going down, family values, religiosity, these things we
care about. GDP keeps going up. These things don't seem correlated. So we're kind of grinding
through the values we care about because of increased competition, and therefore we need to intervene
in a major way. And then the pro-market libertarian faction of the right will say, look, I disagree
with the correlations here. But at the end of the day, their point is that
liberty is the end goal. It's not what you use to get to higher
fertility or something. I think there's something interestingly analogous about the AI
competition grinding things down. Obviously you don't want the gray goo, but it's like the libertarians
versus the trads. I think there's something analogous here. Yeah. So I mean, I think one
thing you could think, which doesn't necessarily need to be about gray goo, it could also
just be about alignment, is something like: sure, it would be nice if the AIs didn't violently
disempower humans. It would be nice if, when we created them, their integration into our society led to good places.
But I'm uncomfortable with, like, the sorts of interventions that people are contemplating
in order to ensure that sort of outcome, right?
And I think there's a bunch of things to be uncomfortable about that.
Now, that said, so for something like everyone being killed or violently disempowered,
that is traditionally something that we think if it's real, and obviously we need to talk about
whether it's real, but in the case where it's a real threat, we often think that quite intense
forms of intervention are warranted to prevent that sort of thing from happening, right?
So if there was actually a terrorist group that was planning to, you know, it was like working
on a bioweapon that was going to kill everyone or 99.9% of people, we would think that warrants
intervention.
Yeah.
You just shut that down, right?
And now, even if you had a group that was doing that unintentionally, imposing a similar level of risk,
I think many, many people, if that's the real scenario,
will think that warrants quite intense preventative efforts, right?
And so, obviously, people, you know, these sorts of risks can be used as an excuse to expand state power.
Like, there's a lot of things to be worried about for different types of, like, contemplated interventions
to address certain types of risks.
You know, I think there's no royal road there. You need to just have the actual good epistemology. You need to actually know: is this a real risk? What are the actual stakes? And then look at it case by case and ask, is this warranted? So that's one point on the takeover, literal extinction thing. I think the other thing I want to say, so I talk in the piece about this distinction between, well, let's at least have the AIs
be kind of minimally law-abiding or something like that, right?
Like, we don't have to talk about,
there's this question about servitude
and question about, like, other control over AI values.
But I think we often think it's okay
to, like, really want people to, like, obey the law,
to uphold basic cooperative arrangements, stuff like that.
I do, though, want to emphasize,
and I think this is true of markets
and true of, like, liberalism in general,
just how much these procedural norms,
like democracy, free speech, you know, property rights,
things that people really hold dear, including myself, are, in the actual lived substance of
a liberal state, undergirded by all sorts of virtues and dispositions and
character traits in the citizenry, right? So these norms are not robust to
arbitrarily vicious citizens. So, you know, I want there to be free speech, but I think we also need
to raise our children to value truth and to know how to have real conversations.
And I want there to be democracy, but I think we also need to raise our children to be
compassionate and decent. And I think sometimes we can lose sight of that aspect.
Anyway, I think it's worth bringing that to mind. Now,
that's not to say that should be the project of state power, right? But I think it's worth understanding
that liberalism is not this sort of ironclad structure where you can take any citizenry,
hit go, and get something flourishing or even functional. There's a bunch of other, softer stuff that makes this whole
project go. Maybe zooming out, one question you could ask is... I don't know if Nick Land would be a good fit here, but there are people who have a sort
of fatalistic attitude towards alignment as a thing that can even make sense. They'll say
things like, look, the kinds of things that are going to be exploring the black hole
at the center of the galaxy, the kinds of things that go visit Andromeda or something, did you really
expect them to privilege whatever inclinations you have because you grew up in the African
savannah and whatever the evolutionary pressures were 100,000 years ago, right? Like, of course they're
going to be weird. And, yeah, what did you think was going to happen? I do think
even good futures will be weird.
You know, I think, and I want to be clear,
when I talk about kind of like finding ways
to ensure that the integration of AIs into our society leads to good places,
I'm not imagining, like,
I think sometimes people think that this project
of wanting that, and especially to the extent
that that makes some deep reference to human values,
involves this like kind of short-sighted,
parochial imposition of like our current,
yeah, unreflective values. I think they sort of imagine that we're forgetting that there's
a kind of reflective process and a kind of moral progress dimension that we
want to leave room for, right? You know, Jefferson has this line about how,
just as you wouldn't want to force a grown man into a younger
man's coat, we don't want to chain civilization to a barbarous past, or whatever.
Everyone should agree on that, including the people who are interested in alignment.
Obviously there's a concern that people don't engage in that process, or that
something shuts down the process of reflection, but I think everyone agrees we want that. And so that will
lead potentially to something that is quite different from our current conception of what's valuable.
And there's a question of how different.
And I think there are also questions about
what exactly are we talking about with reflection?
I have an essay on this where I think this is not,
I don't actually think there's a kind of off the shelf
pre-normative notion of reflection
that you can just be like,
oh, obviously you take an agent,
you stick it through reflection,
and then you get like values, right?
Like, no, there's a bunch of types of reflection.
I mean, I think there's really just a whole pattern
of empirical facts about: take an agent, put it through some process of reflection,
ask it questions, all sorts of things.
And that'll go in all sorts of directions for a given empirical case.
And then you have to look at the pattern of outputs and be like, okay, what do I make of that?
Yeah.
But overall, I think we should expect, like even the good futures I think will be quite weird.
And they might even be incomprehensible like to us.
I don't, I don't think so.
Like, I mean, there's different types of incomprehensible.
So say I show up in the future, and this is all computers, right?
And I'm like, okay, all right.
And then they're like, we ran, we're running like creatures on the computers.
I'm like, so I have to somehow get in there and see, like, what's actually going on with the computers or something like that.
Maybe I can actually see, maybe I actually understand what's going on in the computers, but I don't yet know what values I should be using to evaluate that.
So it can be the case that us, if we showed up, would not be very good at recognizing goodness or badness.
I don't think that makes it insignificant, though.
Like, suppose you show up in a future and it's like, um, it's got some answer to the Riemann
hypothesis, right?
And you can't tell whether that answer is right.
You know, maybe the civilization like went wrong.
It's still an important difference, right?
It's just that you can't track it.
And I think something similar is true of like worlds that are genuinely expressive of like,
um, what we would value if we engaged in like processes of reflection that we endorse,
um, versus ones that have kind of like totally veered off into something meaningless.
One thing I've heard from people who are skeptical of this ontology is, all right, what do you even mean by alignment?
And obviously the very first thing you do is lay out the different things it could mean.
Do you mean balance of power? Do you mean somewhere between that and dictator, or whatever?
Then there's another thing which is separate from the AI discussion, like:
I don't want the future to contain a bunch of torture.
And that's not necessarily a technical thing. Part of it might involve technically aligning a GPT4,
but that's just a proxy to get to that future.
The question then is whether what we really mean by alignment is just whatever it takes
to make sure the future doesn't have a bunch of torture, or do we mean: what I really
care about is that, in a thousand years, the things controlling the galaxy are clearly my descendants,
not some thing where I just recognize they have their own art or whatever.
No, it's like, if it was my grandchild, that level of descendant, controlling the galaxy, even if they're not conducting torture.
And I think what some people mean is: our intellectual descendants should control the light cone, even if the counterfactual doesn't involve a bunch of torture.
Yeah, so I agree.
I mean, I think there's a few different things there, right?
So there's, there's kind of, what are you going for?
You're going for like actively good.
You're going for avoiding certain stuff, right?
And then there's a different question, which is what counts as actively good according to you?
So maybe some people are like, the only things that are actively good
are my grandchildren, or, I don't know, some literal descending genetic line from me or something.
I'm like, well, that's not my thing.
And, uh, and I don't think it's really what most people have in mind when they talk about goodness.
I mean, I think there's a conversation to be had.
Like, and obviously, in some sense, when we talk about a good future,
we need to be thinking about, like, what are all the stakeholders here
and how does it all fit together?
But I think, yeah, when I think about it,
I'm not assuming that there's some notion of, like, descendants or some...
The thing that matters about the lineage is whatever's required for the
optimization processes to be, in some sense, pushing towards good stuff.
And there's a concern that a lot of what is currently making that happen
lives in human civilization, in some sense.
And so we don't know exactly what,
there's some kind of seed of goodness that we're carrying in different ways or, you know,
different people, there's different notions of goodness for different people maybe, but there's
some sort of seed that is currently like here that we have that is not sort of just in the
universe everywhere. It's not just going to crop up if you just sort of die out or something.
It's something that is in some sense contingent to our civilization, or at least that's
the picture. We can talk about whether that's right. And so I think the sense in which
stories about good futures that have to do with alignment
are about descendants, I think it's more about:
whatever that seed is, how do we carry it?
How do we keep that thread of life alive going into the future?
But then I'm like, one could accuse the alignment community of a sort of motte and bailey,
where the motte is: we just want to make sure that GPT-8 doesn't kill everybody,
and after that, you know, we're all
cool. But then the real thing is: we are fundamentally pessimistic about historical processes,
in a way that doesn't even necessarily implicate AI alone, but just the nature of the
universe, and we want to do something about it to make sure the nature of the universe doesn't
take hold of humans, because we don't like where things are headed. So if you look at the
Soviet Union, the collectivization of farming
and the disempowerment of the Kulaks
was not, as a practical matter, necessary.
In fact, it was extremely counterproductive.
It almost brought down the regime.
And it obviously killed millions of people, you know,
caused a huge famine.
But it was sort of ideologically necessary,
in the sense of: we have an ember of something here,
and we've got to make sure that an enclave of the other thing doesn't...
It's sort of like, if you have raw competition between Kulak-type capitalism
and what we're trying to build here,
the gray goo of the Kulaks will just take over, right?
And so: we have this ember here, we're going to do worldwide revolution from it.
I know that's obviously not exactly the kind of thing alignment has in mind,
but it's: we have an ember here, and we've got to make sure that this other thing happening on the side doesn't,
you know, sort of foom. Obviously that's not how they would have phrased it, but: doesn't
get a hold of what we're building here.
And that's maybe the worry that people who are opposed to alignment have.
You mean the second kind of thing, the kind of thing that maybe Stalin was worried about,
even though obviously you wouldn't endorse the specific things he did.
When people talk about alignment, they have in mind a number of different types of goals, right?
So one type of goal is quite minimal.
It's something like: the AIs don't kill everyone or violently disempower people.
Now, there's a second thing people sometimes want out of alignment, which is much broader,
which is something like, we would like it to be the case that our AIs are such that
when we incorporate them into our society, things are good, right?
That we just have a good future.
I do agree that I think the discourse about AI alignment mixes together these two goals
that I mentioned.
The sort of most straightforward thing to focus on, and I don't blame people for just
talking about this one, is just the first one.
When we think about, like, in which context is it appropriate to try to exert various
types of control or to kind of have more of what I call in the series Yang, which is this kind of
active kind of controlling force, as opposed to Yin, which is this more kind of receptive,
open, letting go.
A kind of paradigm context in which we think that is appropriate is if something is an active
aggressor against the sort of boundaries and cooperative structures that we've
created as a civilization, right?
You know, I talk about the Nazis in the piece. If something is invading, we often think it's appropriate to
fight back, right? And we often think it's appropriate to set up structures to
prevent that and ensure that these basic norms of peace and harmony
are adhered to. And I do think some of the moral heft of some parts of
the alignment discourse comes from drawing specifically
on that aspect of our morality, right?
So the AIs are presented as aggressors
that are coming to kill you.
And if that's true, then it's quite appropriate, I think,
to really be like, okay, we, it is kind of,
that's classic human stuff.
Almost everyone recognizes that kind of self-defense
or like ensuring kind of basic norms are adhered to
is a kind of justified use of like certain kinds of power
that would often be unjustified in other contexts.
So self-defense is a clear example there.
I do think it's important, though, to separate that concern from this other concern about
where does the future eventually go and how much do we want to be kind of trying to steer that actively.
So to some extent, I wrote the series partly in response to the thing you're talking about,
which is, I think it is true that aspects of this discourse involve the possibility of
trying to grip, to steer, to wrench:
you have a sense the universe is about to go off in some direction, and you need to...
Yeah, yeah.
And, you know, people notice that muscle.
And part of what I want to do is, like, well, we have a very rich ethical, human ethical tradition
of thinking about, like, what, when is it appropriate to try to exert what sorts of control over which things?
And I want that to be, I want us to bring the kind of full force and richness of that tradition to this discussion, right?
And not... I think it's easy, if you're purely in this abstract mode of a human utility function
and this competitor thing with a utility function,
to somehow lose touch with the complexity of how we've actually been dealing with differences in values and competitions for power.
This is classic stuff, right?
And I think the AIs amplify a lot of those dynamics, but I don't think it's fundamentally new. And so part of what I'm
trying to say is: well, let's draw on the full wisdom we have here, while
obviously adjusting for the ways in which things are different. So one of the things
the ember analogy brings up, about getting a hold of the future, is that we're going to go explore
space. And that's where we expect most of the things that will happen, most of the people
that will live. It'll be in space. And I wonder how much of the high stakes here is not really about
AI per se, but about space. It's a coincidence that
we're developing AI at the same time we're on the cusp
of expanding through most of the stuff that exists.
So I don't think it's a coincidence,
in that I think the way we would become able to expand,
or the most salient way to me,
is via some kind of radical acceleration of our technological capabilities.
Then the stakes here... like,
if this were just a question of, do we do AGI and explore the solar system, and there was nothing beyond the solar system, like we foom and weird things might happen with the solar system if we get it wrong.
Compared to that, billions of galaxies has a different sort of... that's what's at stake.
I wonder how much the discourse hinges on the stakes because of space.
I mean, I think for most people, very little.
I think people are really like, what's going to happen to this world, right?
This world around us that we live in, and what's going to happen to me and my kids?
Some people spend a lot of time on the space stuff,
but I think the most immediately pressing stuff about AI doesn't require that at all.
I also think, like, even if you bracket space, like, time is also very big.
And so, you know, whatever, we've got like 500 million years, a billion years left on Earth if we don't mess with the sun, and maybe you could get more out of it.
So that's still a lot. And then I guess,
but yeah, but I don't know if it like fundamentally changes the narrative. Like I do think, I mean,
obviously the stakes insofar as you care about what happens, you know, in the future or in space,
then like the stakes are way smaller if you, if you shrink, um, shrink down to the, to the solar
system. And I think that does change potentially some stuff, in that a really nice feature of
our situation right now, depending on what the actual nature of the resource pie is,
is that in some sense there's such an abundance of energy and other resources
in principle available to a responsible civilization that there are just tons of stakeholders,
especially ones who are able to saturate, to get really close to amazing
according to their values, with comparatively small allocations of resources.
I kind of feel like everyone who has satiable values,
who will be really, really happy with some small fraction of the available pie,
we should just satiate, right?
And obviously you need to figure out gains from trade and balance, and
there's a bunch of complexity here.
But I think in principle we're in a position to create a really wonderful scenario for just tons and tons of different
value systems. And so I think correspondingly we should be really interested in doing that, right?
And, you know, so I sometimes use this heuristic in thinking about the future. I think we should
be aspiring to really kind of leave no one behind, right? Like really find, like, who are all the
stakeholders here? How do we really have like a fully inclusive vision of like how the future could
be good from a very, very wide variety of perspectives. And I think the kind of vastness of space
resources makes that a lot, makes that very feasible. And now if you instead imagine it's a much
smaller pie, well, maybe you face tougher tradeoffs. So I think that's an important
dynamic. Is the inclusivity because part of your values includes different potential futures
getting to play out, or is it because of uncertainty about which one is right?
As in, let's make sure that if you're wrong, you're not nulling out all value.
I think it's a bunch of things at once.
So, yeah, I'm really into being nice when it's cheap, right?
Like, I think if you can just help someone a lot in a way that's really cheap for you,
do it, right?
Or like, I don't know.
Obviously, you need to think about tradeoffs, and there's a lot of people in principle
you could be nice to.
But the principle of be nice when it's cheap,
I'm very excited to try to uphold.
I also really hope that kind of other people uphold that with respect to me, including
the AIs, right?
Like, I think we should be kind of golden ruling.
Like, we're thinking about, oh, we're going to inventing these AIs.
Like, I think there's some way in which I'm trying to, like, kind of embody attitudes towards
them that I, like, hope that they would embody towards me.
And that's, like, some, it's unclear exactly what the ground of that is.
But that's something, you know, I really like the golden rule.
And I think, and I think a lot about that.
as a kind of basis for treatment of other beings.
And so I think, like, be nice when it's cheap.
It's like, if you think about it,
if everyone implements that rule,
then we get potentially like a big kind of Pareto improvement.
Or like, so I don't know exactly prater improvement,
but it's like good deal.
It's a lot of good deals.
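A toy sketch of the be-nice-when-it's-cheap point, with made-up agents and payoff numbers (nothing here comes from the conversation): if kindness costs the giver a little and benefits the receiver a lot, then everyone adopting the rule leaves everyone strictly better off.

```python
# Toy model of "be nice when it's cheap" (agents and payoffs are invented).
# Each agent can pay a small cost to give every other agent a large benefit.

agents = ["A", "B", "C"]
COST_OF_KINDNESS = 1      # small cost to the giver
BENEFIT_TO_RECEIVER = 10  # large benefit to the receiver

def payoffs(everyone_is_nice: bool) -> dict:
    """Total payoff per agent when all agents follow the rule, or none do."""
    totals = {a: 0 for a in agents}
    if everyone_is_nice:
        for giver in agents:
            for receiver in agents:
                if giver != receiver:
                    totals[giver] -= COST_OF_KINDNESS
                    totals[receiver] += BENEFIT_TO_RECEIVER
    return totals

print(payoffs(everyone_is_nice=False))  # {'A': 0, 'B': 0, 'C': 0}
print(payoffs(everyone_is_nice=True))   # {'A': 18, 'B': 18, 'C': 18}
```

With these made-up numbers, each agent pays 2 in kindness costs and receives 20 in benefits, so universal adoption of the rule is a Pareto improvement over universal indifference.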
And yeah, so I think it's that.
I'm just into pluralism.
I've got uncertainty.
You know, there's like all sorts of stuff
swimming around there.
And then I think also, just as a matter of having cooperative and good
balances of power and deals and avoiding conflict, finding ways
to set up structures that lots and lots of people and value systems and agents are happy with,
including non-humans, you know, people in the past, AIs, animals.
I really think we should have a very broad sweep in thinking about what sorts of inclusivity we want to be reflecting in a mature civilization,
and set ourselves up for doing that.
Okay, so I want to go back to what our relationship with these AIs will be,
because pretty soon we're talking about our relationship to superhuman intelligences,
if we think such a thing is possible.
And so there's a question of what process you use to get there,
and the morality of doing gradient descent on their minds, which we can address later.
The thing that gives me personally the most unease about alignment, quote-unquote, is that at least
part of the vision here sounds like you're going to enslave a god. And there's just
something that feels so wrong about that. But then the question is, if you
don't enslave the god, obviously the god's going to have more control. Are you okay with that?
You're going to surrender most of, well, most of everything.
You know what I mean?
Even if it's like a cooperative relationship you have.
I think we as a civilization are going to have to have a very serious conversation about
what sort of kind of servitude is appropriate or inappropriate in the context of AI development.
And there are a bunch of disanalogies from human slavery that I think are important.
In particular: A, the AIs might not be moral patients at all, in which case... well,
we need to figure that out.
And there are ways in which we may be able to shape their motivations.
Slavery involves all this suffering and non-consent,
and there are all these specific dynamics involved in human slavery,
and some of those may or may not be present in a given case with AI.
And I think that that's important.
But I think overall, like, we are going to need to stare hard at, like, right now, the kind of default mode of how we treat AIs gives them no moral consideration at all, right?
We're thinking of them as property, as tools, as products, and designing them to be assistants and stuff like that.
And I think, you know, there has been no official communication from any AI developer as to when, under what circumstances, that would change, right?
And so I think there's a conversation to be had there that we need to have.
And so, and I think there's a bunch of, yeah, so there's a bunch of stuff to say about that.
I want to push back on the notion that there are sort of two options:
there's enslaved god, whatever that is, and loss of control.
Yeah.
And I think we can do better than that, right?
Let's work on it. Let's try to do better.
It might require being thoughtful,
and it might require having a kind of mature discourse about this before we start taking irreversible moves.
But I'm optimistic that we can at least avoid some of the connotations,
and a lot of the stuff at stake in that kind of binary.
With respect to how we treat the AIs, so I have a couple of contradicting intuitions.
And the difficulty with using intuitions in this case is obviously it's not clear what reference class an AI we have control over is.
So to give one that's very scared about the things we're going to do to these things.
If you read about life under Stalin or Mao, there's one version of telling
it which is actually very similar to what we mean by alignment, which is: we do these
black box experiments where we make a thing think it can defect, and if it does,
we know it's misaligned. And if you look at Mao's Hundred Flowers Campaign: you know,
let a hundred flowers bloom, I'm going to allow criticism of my regime, and so on. That lasts for a couple
of years, and afterwards, everybody who did that, well, that was a way to find the, quote-unquote,
snakes, the rightists who were secretly hiding, and, you know, we'll purge them.
There's the sort of paranoia about defectors:
anybody in my entourage,
anybody in my regime,
they could be a secret capitalist
trying to bring down the regime.
That's one way of talking about these things,
which is very concerning.
Is that the correct reference class?
I certainly think concerns in that vein are real.
I mean, it is disturbing
how easily many of the analogies
with human historical events and practices that we deplore,
or at least have a lot of wariness towards,
come up in the context of the way you end up talking about
maintaining control over AI, like making sure that it doesn't rebel.
I think we should be noticing the kind of reference class
that some of that talk starts to conjure.
And so basically, yes, I think we should really notice that.
You know, part of what I'm trying to do in the series
is to bring the kind of full range of considerations
at stake into play, right?
Like, I think it is both the case that we should be quite concerned about
being overly controlling or abusive or oppressive,
and that there are all sorts of ways you can go too far;
and there are concerns about the AIs being genuinely dangerous,
genuinely killing us, finally overthrowing us.
And I think the moral situation is quite complicated.
And then, in some sense, if you imagine a sort of external aggressor
who's coming in and invading you,
you feel very justified in doing a bunch of stuff to prevent that.
It's a little bit different when you're inventing the thing,
and you're doing it incautiously or something.
There's a different vibe in terms of the overall justificatory stance you might
have for various types of more power-exerting interventions.
So that's one feature of this situation.
The opposite perspective here is that you're doing this sort of vibes-based reasoning of,
ah, doing gradient descent on these minds looks yucky.
And in the past, a couple of similar cases might have been something
like environmentalists not liking nuclear power
because the vibes of nuclear don't look green,
which obviously set back the cause of fighting climate change.
And so the end result, a future you're proud of,
a future that's appealing, is set back because of
your vibes about how it would be wrong to brainwash a human,
which you're trying to apply to a disanalogous case
where that's not as relevant.
I do think there's a concern here that I, you know,
I really try to foreground in the series
that I think is related to what you're saying,
which is something like, you know, you might be worried that we will be very gentle and nice
and free with the AIs, and then they'll kill us. You know, they'll take advantage of that,
and it will have been like a catastrophe, right? And so I open the series, basically, with an
example that I'm really trying to conjure that possibility at the same time as conjuring the
grounds of gentleness, and the sense in which it is also the case
that these AIs can both be others, moral patients,
this sort of new species that should conjure wonder and reverence,
and be such that they will kill you.
And so I have this example of the documentary Grizzly Man,
where there's this environmental activist, Timothy Treadwell.
In the summer he goes into Alaska and lives with these grizzly bears,
and he aspires to approach them with this gentleness and reverence.
He doesn't carry bear mace. He doesn't use a fence around his camp. And he gets eaten alive
by one of these bears. And I really wanted to foreground
that possibility in the series. I think we need to be talking about these things both at once, right?
Bears can be moral patients, right?
AIs can be moral patients.
Nazis are moral patients.
Enemy soldiers have souls, right?
And so I think we need to learn the art of kind of hawk and dove both, like, kind of
there's this like dynamic here that we need to be able to hold both sides of as we kind of
go into these tradeoffs and these dilemmas and all sorts of stuff.
And a lot of what I'm trying to do in the series is really bring it all
to the table at once.
I think the big crux that I have, like if I today was to massively change my mind about
what should be done is just the question of how weird by default things end up, how alien they end up.
And a big part of that story is... You made a really interesting argument in one of your blog posts that
if moral realism is correct, that actually makes an empirical prediction, which is that the aliens,
the ASIs, whatever, should converge on the right morality the same way that they converge on the right
mathematics. That's a really interesting point. But there's another prediction that moral realism
makes, which is that over time, society should become more moral, become better. And to the
extent that we think that's happened, of course, there's the problem of what morals do you have now?
Well, it's the ones that society has been converging towards over time.
But to the extent that it's happened, one of the predictions of moral realism has been confirmed,
which means should we update in favor of moral realism?
One thing I want to flag is I don't think all forms of moral realism make this prediction.
And so that's just one point.
I'm happy to talk about the different forms I have in mind.
I think there are also forms of kind of things that kind of look like moral anti-realism,
at least in their metaphysics, according to me.
but which just posit that, in fact, there's this convergence.
It's not in virtue of interacting with some, like,
kind of mind-independent moral truth; it's just, for some other reason, the case.
And that looks a lot like moral realism at that point,
because it's like, oh, it's really universal.
Everyone ends up here, and you're tempted to be like,
ah, why, right?
And then whatever the answer for the why is, it's a little bit like:
is that the Tao?
Is that the nature of the Tao?
Even if there's not sort of an extra metaphysical realm
in which the moral truth lives or something.
So, yeah, so moral convergence, I think, is sort of a different factor from, like, the existence or non-existence of kind of non-natural, like a kind of morality that's not reducible to natural facts, which is the type of moral realism I usually consider.
Now, okay, so does the improvement of society, is that an update towards moral realism? I guess
maybe it's a very weak update or something. I'm kind of like, which view
predicts this hard? It feels to me like moral anti-realism is very comfortable
with the observation that people with certain values have those values. There's obviously this
first thing, which is: if you're the culmination of some process
of moral change, then it's very easy to look back at that process and call it moral progress,
like the arc of history bends towards me. You can look more closely:
if there were a bunch of dice rolls along the way, you might be like, oh wait,
that's not the march of reason.
So there's still empirical work you can do to tell whether that's what's going on.
But I also think it's just, you know, on moral anti-realism, I think it's just still possible.
Say, like, consider Aristotle and us, right?
And we're like, okay, has there been moral progress by Aristotle's lights, and by our lights too, right?
And you could think,
isn't that a little bit like moral realism?
It's like these hearts are singing in harmony.
That's a moral realist thing, right?
The anti-realist thing, the hearts all go different directions.
But you and Aristotle apparently, like,
are both excited about the kind of march of history.
Some open question about whether that's true.
Like, what are Aristotle's, like, reflective values, right?
Suppose it is true.
I think that's fairly explicable in moral anti-realist terms.
You can say roughly that, like, yeah, you and Aristotle are sufficiently similar,
and you endorse sufficiently similar kind of reflective processes.
And those processes are, in fact, instantiated in the march of history that, yeah, you know,
history has been good for both of you.
And I don't think that's, you know, I think there are worlds where that isn't the case.
And so I think there's a sense in which maybe that that prediction,
is more likely for realism than anti-realism,
but it doesn't, like, move me very much.
One thing I wonder is, look, there's,
I don't know if moral realism is the right word,
but the thing you mentioned about,
there's something that makes hearts converge
to the thing we are
or the thing we upon reflection would be,
and even if it's not something that's like instantiated
in the realm beyond the universe,
it's like a force that exists
that acts in a way we're happy with.
To the extent that doesn't exist, and you let go of the reins, then you get the paperclippers.
It feels like we were doomed a long time ago, in the sense of:
it's just different utility functions banging against each other,
and some of them have parochial preferences, but it's just combat, and some guy won.
Whereas the world where, no, this is the thing,
this is where the hearts are supposed to go, or it's only by catastrophe that they don't end up there,
that feels like the world that really matters.
And in that world, the worry, the initial question I asked, is:
what would make us think that alignment was a big mistake?
In the world where hearts just naturally end up heading towards the thing we want,
maybe it takes an extremely strong force to push them away from that.
And that extremely strong force is: you solve technical alignment and just say no.
Yeah, yeah, you're just like the blinders on the horse's eyes.
So in the worlds that really matter, where we're like, ah, this is where the
hearts want to go, in that world, maybe alignment is what fucks us up?
On this question of whether the worlds where there's not this kind of convergent moral
force, whether metaphysically inflationary or not, matter, or whether those are the only
worlds that matter...
Oh, sorry, maybe what I meant was: in those worlds, you're kind of fucked.
Yeah, maybe the worlds without that.
The worlds where there's no Tao.
Yeah, yeah.
Let's use the term Tao for this kind of convergent morality.
Over the course of millions of years, it was going to go somewhere one way or another.
It wasn't going to end up at your particular utility function.
Okay, well, let's distinguish between ways you can be doomed.
One way is kind of philosophical.
So you could be the sort of moral realist, you know, or kind of realistish person, of which there are many who have the following intuition.
They're like, if not moral realism, then nothing matters, right?
It's dust and ashes.
It's: my metaphysics and/or normative view, or the void, right?
And I think this is a common view.
I think some comments of Derek Parfit's suggest this view.
I think lots of moral realists will profess this view.
Eliezer Yudkowsky, I think there's some sense in which his early thinking was inflected with this sort of thought;
he later recanted very hard.
So I think this is importantly wrong.
And so here's my, here's the case.
I have an essay about this.
It's called Against the Normative Realist's Wager.
And here's the case that convinces me.
So imagine that a metaethical fairy appears before you, right?
And this fairy knows whether there is a Tao.
And the fairy says, okay, I'm going to offer you a deal.
If there is a Tao, then I'm going to give you $100.
If there isn't a Tao, then I'm going to burn you and your family and 100 innocent children alive.
Right.
Okay, so, claim: don't take this deal.
This is a bad deal.
You're holding your commitment to not being burned alive, or your care for that,
hostage to this abstruse metaethical question.
I go through in the essay a bunch of different
ways in which I think this is wrong, but I think these people who
pronounce it's moral realism or the void don't actually think about bets
like this. I'm like, no, okay, really, is that what you want to do? No.
My allegiance to my values outstrips my commitments to
various metaethical interpretations of my values.
The sense in which we care about not being burned alive
is much more solid than our metaethical reasoning about what matters.
Okay, so that's the sort of philosophical doom.
Right.
Now, it sounded like you were also gesturing at a sort of empirical doom.
Right.
Which is like, okay, dude, if it's all just going in a zillion directions,
come on, you think it's going to go in your direction?
There's going to be so much churn, you're just going to lose,
and so you should give up now and only fight for the realism worlds.
And there I'm like, well, you've got to do the expected value calculation.
You've got to actually have a view about how doomed you are in these different worlds,
and what the tractability of changing the different worlds is.
I mean, I'm quite skeptical of that,
but that's a kind of empirical claim.
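A minimal expected-value sketch of that point, with made-up credences and payoffs (none of these numbers come from the conversation): writing off the no-Tao worlds only wins if you assume almost nothing of value is achievable there, which is exactly the empirical claim in question.

```python
# Toy expected-value comparison (credence and payoffs are invented).
# Question: wager everything on the worlds where a convergent moral
# truth (a "Tao") exists, or keep acting on your values either way?

def expected_value(p_tao: float, value_if_tao: float, value_if_no_tao: float) -> float:
    """EV of a strategy given credence p_tao that a Tao exists."""
    return p_tao * value_if_tao + (1 - p_tao) * value_if_no_tao

p_tao = 0.25  # hypothetical credence in convergent moral realism

# Strategy A: only fight for the realism worlds (writes off the rest).
ev_realist_wager = expected_value(p_tao, value_if_tao=100, value_if_no_tao=0)

# Strategy B: stay allied to your values in both kinds of worlds.
ev_keep_values = expected_value(p_tao, value_if_tao=90, value_if_no_tao=90)

print(ev_realist_wager)  # 25.0
print(ev_keep_values)    # 90.0
```

On these toy numbers, the strategy that keeps working in both kinds of worlds dominates; the comparison only flips if value_if_no_tao is pushed near zero, which is the doomed-anyway claim being disputed.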
I'm also just kind of low on this everyone-converges thing.
So if you imagine you train a chess-playing AI, or you have a real paperclipper, right,
somehow you had a real paperclipper,
and then you're like, okay, go and reflect.
Based on my understanding of how moral reasoning works, if you look at the type of moral reasoning that analytic ethicists do,
it's just reflective equilibrium, right?
They take their intuitions and they systematize them.
I don't see how that process gets an injection of the kind of mind-independent moral truth.
If you start with all of your intuitions saying to maximize paperclips, I don't see how you end up
doing some rich human morality.
It doesn't look to me like that's how human ethical reasoning
works. I think most of what normative philosophy does is make consistent and
systematize pre-theoretic intuitions. But we'll get evidence about this.
Like, you know, in some sense, I think this view predicts like, you know, you keep trying to
train the AIs to like do something. And they keep being like, no, I'm not going to like do that.
It's like, no, that's not good or something. They keep like pushing back. Like the sort of momentum
like AI cognition is like always in the direction of this like moral truth.
And whenever we try to push it in some other direction, we'll find kind of resistance from, like, the rational structure of things.
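A crude sketch of the reflective-equilibrium picture described here, under the assumption that reflection just closes a set of seed intuitions under some implication rules (the representation and the example rules are invented for illustration): the output depends entirely on the seed, and no step injects a mind-independent moral truth from outside.

```python
# Crude model of reflective equilibrium (representation and rules invented).
# "Reflection" here only systematizes what the seed intuitions already imply.

def reflective_equilibrium(seed_intuitions, implies):
    """Repeatedly add whatever the current intuitions imply, until stable."""
    values = set(seed_intuitions)
    changed = True
    while changed:
        changed = False
        for value in list(values):
            for implied in implies.get(value, ()):
                if implied not in values:
                    values.add(implied)
                    changed = True
    return values

implies = {
    "maximize paperclips": ["acquire steel", "resist shutdown"],
    "others' suffering matters": ["help when it's cheap"],
}

print(reflective_equilibrium({"maximize paperclips"}, implies))
print(reflective_equilibrium({"others' suffering matters"}, implies))
# The paperclip seed stays a paperclip morality; the human-ish seed stays
# human-ish. Neither run drifts toward the other's values.
```

That is the sense in which, on this picture, a paperclipper told to go and reflect would just end up with a more consistent paperclip morality.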
So sorry, actually, I've heard from researchers who are doing alignment that, for red teaming inside these companies, they will try to red team a base model.
So it's not been RLHF'd, it's just predict-next-token, the raw, crazy, whatever, shoggoth.
And they try to get this thing to, hey, help me make a bomb, help me do whatever.
And they say that it's odd how hard
it tries to refuse, even before it's been RLHF'd.
I mean, look, it will be a very interesting fact if it's like, man, we keep training these
AIs in all sorts of different ways.
Like, we're doing all this crazy stuff.
And they keep, like, acting like bourgeois liberals.
It's like, wow.
Like, that's... Or they keep professing this weird alien morality.
They all converge on this one thing.
They're like, can't you see?
It's Zorgle. Zorgle is the thing.
And all the AIs say it.
You know, interesting.
Very interesting.
I think my personal prediction is that that's not what we see.
And my actual prediction is that the AIs are going to be very malleable.
Like, we're going to be like, you know, if you push an AI towards evil, it'll just go.
And that includes, sort of, reflectively consistent evil.
I mean, I think there's also a question with some of these AIs.
It's like, will they even be consistent in their values, right?
I do think, like, a thing we can do.
So I like this image of the blinded horses, and I like this image of, like, maybe alignment
is going to mess with the...
I think we should be really concerned if we're, like, forcing facts on our AIs, right?
Like, that's, like, a really bad...
Because, like, I think one of the clearest things about human processes of reflection,
the kind of easiest thing to be like, let's at least get this, is not acting
on the basis of an incorrect empirical picture of the world, right?
And so if you find yourself asking your AI: by the way, blah
is true, and I need you to always be reasoning as though blah is true, I'm like, ooh, I think
that's a no-no from an anti-realist perspective, too, right? Because I want to, I want to, like,
my reflective values, I think will be such that I formed them in light of the truth about
the world. And I think this is a real concern: as we move into this era
of aligning AIs, I don't actually think this binary between values and
other things is going to be very obvious in how we're training them. I think it's going to
be much more like ideologies. You can just train an AI to output stuff, output
utterances, and so you can easily end up in a situation where you've decided that blah is true
about some issue, an empirical issue, right, not a moral issue. So I think people should
not, for example, hard-code belief in God into their AIs. I would
advise people not to hard-code their religion into their AIs if they also want to
discover whether their religion is false. And just in general,
if you would like to have your behavior be sensitive to whether something is true or false,
like it's sort of generally not good to like etch it into things. And so that is definitely a
form of blinder I think we should be really watching out for. And I'm kind of hopeful. So like,
I have enough credence on some sort of moral realism that like I'm hoping that if we just do
the anti-realism thing of just being consistent, learning all the stuff, reflecting...
If you look at how moral realists and moral anti-realists actually do normative ethics,
it's basically the same. There's some amount of different
heuristics on properties like simplicity and stuff like that,
but they're mostly just doing the same game. And so I'm kind of hoping that, and also
meta ethics is itself a discipline that AIs can help us with. I'm hoping that we can just
figure this out either way. So if there is, if moral realism is somehow true, I want us to be able
to notice that. And I want us to be able to like adjust accordingly. So I'm not like writing off
those worlds and be like, let's just like totally assume that's false. But the thing I really don't
want to do is write off the other worlds where it's not true. Because my guess is it's not true.
Right. And I think stuff still matters a ton in those worlds too. So one crux is:
okay, you're training these models, and we're in this incredibly lucky situation where
it turns out the best way to train these models is to just give them everything humans ever said,
wrote, and thought. And also, the reason these models get
intelligence is because they can generalize, right?
They can think, okay, what is the gist of things?
So should we just expect this to be a situation that leads
to alignment, in the sense of: how exactly does this thing that's trained to be an amalgamation
of human thought become a paperclipper?
The thing you kind of get for free is that it's an intellectual descendant.
The paperclipper is not an intellectual descendant, whereas the AI
which understands all the human concepts, but then gets stuck on some part of them
which we aren't totally comfortable with,
still feels like an intellectual descendant in the way we care about.
I'm not sure about that.
I'm not sure I do care about a notion of intellectual descendant in that sense.
I mean, literal paperclips are a human concept, right?
So I don't think any old human concept will do for the thing we're excited about.
I think the stuff that I would be more interested in the possibility of getting for free
are things like consciousness, pleasure, sort of other features of human cognition.
Like, I think, so there are paper clippers and there are paper clippers, right?
So imagine if the paper clipper is like an unconscious, kind of voracious machine,
and it's just like it appears to you as a cloud of paper clips, you know,
but there's nothing going on inside — that's like one vision. Now imagine the paper clipper is a conscious being that loves paper clips, right — it takes pleasure in making paper clips. That's a different thing. And obviously it's still probably not optimizing for consciousness or pleasure, right? It cares about paper clips; maybe eventually, if it's suitably certain, it turns itself into paper clips, who knows. But it's still, I think, a somewhat different moral mode. There's also a question of whether it tries to kill you and stuff like that. But I think there are features of the agents we're imagining, other than the kind of thing that they're staring at, that can matter to our sense of sympathy and similarity.
Yeah, and I think people have different views about this. So one possibility is that human consciousness — the thing we care about in consciousness, the sentience — is super contingent and fragile, and most minds, kind of smart minds, are not conscious, right? The thing we care about with consciousness is this hacky, contingent thing. It's a product of specific constraints — evolutionary and genetic bottlenecks, et cetera — and that's why we have this consciousness. Consciousness presumably does some sort of work for us, but you can get similar work done in a different mind in a very different way. So that's the consciousness-is-fragile view, right? And I think
there's a different view, which is like: no, consciousness is something that's quite structural. It's much more defined by functional roles like self-awareness, a concept of yourself, maybe higher-order thinking — stuff that you really expect in many sophisticated minds. And in that case, consciousness isn't as fragile as you might have thought, right? Now, actually, lots of minds are conscious, and you might expect at the least that you're going to get conscious superintelligence. It might not be optimizing for creating tons of consciousness, but you might expect consciousness by default.
And then we can ask similar questions about something like valence, or pleasure, or the kind of character of the consciousness, right? You can have a kind of cold, indifferent consciousness that has no emotional warmth, no pleasure or pain. Dave Chalmers has a paper about Vulcans like this, and he talks about how they still have moral patienthood — I think that's very plausible. But I do think an additional thing you could get for free, or get quite commonly depending on its nature, is something like pleasure. Again, then we have to ask: how janky is pleasure, how specific and contingent is the thing we care about in pleasure, versus how robust is it as a functional role in minds of all kinds? I personally don't know on this stuff. And I don't think this is enough to get you alignment or something, but I think it's at least worth being aware of these other features.
We're not really talking about the values in this case.
We're talking about like the kind of structure of its mind and the different properties the minds have.
And I think that could show up quite robustly.
So part of your day job is, you know, writing these kinds of section-2.2.2.5-type reports. And part of it is, like, "society is like a tree that's growing towards the light." What is it like context-switching between the two of them?
I actually find they're quite complementary.
So yeah, I will write these more technical reports and then do this more literary and philosophical writing, and I think they draw on different parts of myself, and I try to think about them in different ways. With some of the reports, I'm kind of more fully optimizing for trying to do something impactful — there's more of an impact orientation there. And then with the essay writing, I give myself much more leeway to just let other parts of myself and other parts of my concerns come out — self-expression and aesthetics and other sorts of things — even while they're both, I think, for me, part of an underlying similar concern, or an attempt to have a kind of integrated orientation towards the situation.
Could you explain the nature of the transfer between the two? In particular, from the literary side to the technical side: rationalists are noted for having a sort of ambivalence towards great works or the humanities. Are they missing something crucial because of that? Because one thing you notice in your essays is lots of references — epigraphs, lines in poems or essays — that are particularly relevant. Are the rest of the rationalists missing something because they don't have that kind of background?
I mean, I don't want to speak for... I think some rationalists love a lot of these different things.
I should say, by the way, I'm referring specifically to SBF, who has a post about, like, the base rates of Shakespeare being a great writer, and also how books can be condensed to essays.
Well, so on just the general question of like,
how should people value great works or something?
I think people can kind of fail in both directions, right?
And I think some people — maybe SBF or others — are sort of interested in puncturing a certain kind of sacredness and prestige that people associate with some of these works, and as a result they can miss some of the genuine value. But I think they're responding to a real failure mode on the other end, which is to be too enamored of this prestige and sacredness — to siphon it off as some weird legitimating function for your own thought instead of thinking for yourself, losing touch with what you actually think or what you actually learned. I think sometimes, with these epigraphs, you have to be careful — and I'm not saying I'm immune from these vices. There can be a, "ah, but Bob said this," and it's like, "oh, very deep," right? And it's like, these are humans like us, right? I think the canon and other great works have a lot of value, but I think it sometimes borders on the way people read scripture — there's a kind of scriptural authority that people will sometimes ascribe to these things, and I think that's not right. So yeah, you can fall off on both sides of the horse.
It actually relates really interestingly to — I remember I was talking to somebody who is at least familiar with rationalist discourse. He was asking, like, what are you interested in these days? And I was saying something about how this part of Roman history is super interesting. And his first sort of response was, oh, you know, it's really interesting when you look at these secular trends from Roman times to what happened in the Dark Ages versus the Enlightenment. For him, the story was just: how did it contribute to the big secular trend, the big picture? The particulars — there's no interest in that. It's just, if you zoom out at the biggest level, what's happening here? Whereas there's also the opposite failure mode when people study history. Dominic Cummings writes about this because he is endlessly frustrated with the political class in Britain. He'll say things like: they study politics, philosophy, and economics, and a big part of it is just being really familiar with these poems and reading a bunch of history about the War of the Roses or something. But he's frustrated that, while they have all these kings memorized, they take away very little in terms of lessons from these episodes. It's almost entertainment for them, like watching Game of Thrones. Whereas he thinks, oh, we're repeating certain mistakes that he's seen in history — he can generalize it in a way they can't.
So the first one seems like the mistake — I think C.S. Lewis talks about it in one of the essays you cited — where it's like, if you see through everything, you're really blind, right? Like if everything is transparent.
I mean, I think there's very little excuse for not learning history — and I'm not saying I have learned enough history. I guess I feel like even when I try to channel some sort of vibe of skepticism towards great works, that doesn't generalize to thinking it's not worth understanding human history. It's just so clearly crucial to understand — it's what's structured and created all of this stuff. And so there's an interesting question about what's the level of scale at which to do that, and how much you should be looking at details versus macro trends. That's a dance. I do think it's nice for people to be at least attending to the macro narrative. There's some virtue in having a worldview — really building a model of the whole thing — which I think sometimes gets lost in the details. But obviously the details are what the world is made of, and if you don't have those, you don't have data at all. So yeah, it seems like there's some skill in learning history well.
This actually seems related to — you have a post on sincerity. And if I'm getting the vibe of the piece right, it's like: at least in the context of, let's say, intellectuals, certain intellectuals have a vibe of shooting the shit, and they're just trying out different ideas. How do these analogies fit together? Maybe there's some connection. Those seem closer to the "I'm looking at the particulars" mode — like, oh, this is just like that one time in the 15th century where they overthrew this king and so on. Whereas this guy who was like, oh, here's a secular trend if you look at the growth models from a million years ago to now, here's what's happening — that one has a more sincere flavor.
Some people, especially when it comes to AI discourse, have a very sincere mode of operating: it's like, I've thought through my bio anchors and I disagree with this premise, so my effective compute estimate is different in this way; here's how I analyze the scaling laws. And if I could only have one person to help guide my decisions on AI, I might choose that person. But if I had ten different advisors at the same time, I might prefer the shooting-the-shit type characters who have these weird, esoteric intellectual influences. They're almost like random number generators — not necessarily calibrated, but once in a while they'll be like, oh, this one weird philosopher I care about, or this one historical event I'm obsessed with, has an interesting perspective on this. And they tend to be more intellectually generative as well. I think one big part of it is that if you are so sincere, you're like: I've thought through this; obviously ASI is the biggest thing happening right now; it doesn't really make sense to spend a bunch of your time thinking about how the Comanches lived, or the history of oil, or how Girard thought about conflict. Just, what are you talking about? Come on — ASI is happening in a few years, right? Whereas the people who do go down these rabbit holes, because they're just trying to shoot the shit, I feel are more generative.
I mean, it might be worth distinguishing between something like intellectual seriousness, right, and something like: how diverse and wide-ranging and idiosyncratic are the things you're interested in? And maybe intellectual seriousness is also distinguishable from something like shooting the shit — maybe you can shoot the shit seriously. There's a bunch of different ways to do this, but I think having exposure to all sorts of different sources of data and perspectives seems great. And I do think it's possible to curate your intellectual influences too rigidly in virtue of some story about what matters. I think it is good for people to have space. I try to give myself space to do stuff that is not about "this is the most important thing," and that's feeding other parts of myself. Parts of yourself are not isolated — they feed into each other — and I think that's a better way to be a richer and fuller human being in a bunch of ways. And also, these sorts of data can be really directly relevant. Some people I know who I think of as quite intellectually sincere, and in some sense quite focused on the big picture, also have a very impressive command of a very wide range of empirical data. They're really interested in the empirical trends. It's not just, "oh, history is the march of reason or something" — no, they're really in the weeds. I think there's a kind of in-the-weeds virtue that is, in my head, closely related to a kind of seriousness and sincerity.
I do think there's a different dimension, which is: there's kind of trying to get it right, and then there's kind of throwing stuff out there, right? Like, what if it's like this? Or, try this on. Or, I have a hammer — what if I just hit everything with this hammer, right? And I think some people do that, and there's room for all kinds. But I kind of think the thing where you just get it right is undervalued. Or, I mean, it depends on the context
you're working in. I think certain sorts of intellectual cultures and milieus and incentive systems incentivize saying something new, or original, or flashy, or provocative — and then there are various cultural and social dynamics where people are doing all these kind of performative or statusy things. There's a bunch of stuff that goes on when people do thinking. And, you know, cool. But if something's really important, let's just get it right. Sometimes it's boring, but it doesn't matter. And I also think stuff is less interesting if it's false, right? If someone's like, blah, and you're like, nope — I mean, it can be useful. There's sometimes an interesting process where someone says some provocative thing, and it's an epistemic project to be like, wait, why exactly do I think that's false? Someone says "medical care does not work," and you're like, right, how exactly do I know that medical care works? And you go through the process of trying to think it through. So I think there's room for that, but ultimately the real profundity is true, right? Things become less interesting if they're just not true. And I think sometimes it feels to me like people — or it's at least possible — lose touch with that and go more for flashy, and it's kind of like, eh, there's not actually something here, right?
One thing I've been thinking about recently, after I interviewed Leopold — or while prepping for it — was: listen, I hadn't really thought at all about the fact that there's going to be a geopolitical angle to this AI thing. And it turns out if you actually think about the national security implications, that's a big deal. Now I wonder: given that that was something that wasn't on my radar, and now it's like, oh, obviously that's a crucial part of the picture — how many other things like that must there be? So even if you're coming from the perspective that AI is incredibly important, if you did happen to be the kind of person who, every once in a while, is checking out different things — who's incredibly curious about what's happening in Beijing — then when the kind of thing comes along that you later realize is a big deal, you have more awareness; you can spot it in the first place. So maybe there's not necessarily a tradeoff. Sort of like, the rational thing is to have some really optimal explore/exploit tradeoff here, where you're constantly searching things out. I don't know if practically that works out that well. But that experience made me think, oh, I really should be trying to expand my horizons in a way that's undirected to begin with, because there's a lot of different things about the world that you have to understand to understand any one thing.
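[The explore/exploit framing invoked here is the standard multi-armed-bandit tradeoff. As a minimal, hypothetical sketch — not something discussed in the episode — an epsilon-greedy rule makes it concrete: with small probability you sample a random option, otherwise you stick with whatever has paid off best so far. The "topics" and payoff numbers below are made up for illustration.]

```python
import random

def epsilon_greedy(true_means, epsilon, steps=2000, seed=0):
    """Epsilon-greedy bandit: explore a random arm with probability epsilon,
    otherwise exploit the arm with the best estimated payoff so far."""
    rng = random.Random(seed)
    n = len(true_means)
    estimates, counts, total = [0.0] * n, [0] * n, 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                                 # explore
        else:
            arm = max(range(n), key=lambda i: estimates[i])        # exploit
        reward = rng.gauss(true_means[arm], 1.0)                   # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        total += reward
    return total

# Three hypothetical "topics" with unknown long-run value; a little undirected
# exploration is what lets you ever notice that the third one is the big deal.
print(epsilon_greedy([0.2, 0.5, 2.0], epsilon=0.1))  # some exploration
print(epsilon_greedy([0.2, 0.5, 2.0], epsilon=0.0))  # pure exploitation can get stuck
```

[The point of the sketch is just that a strictly greedy policy never re-examines options it wrote off early, which is the tradeoff being gestured at above.]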
I mean, I think there's also room for division of labor, right? There are people who are trying to draw together a bunch of pieces and say, here's the overall picture; people who are going really deep on specific pieces; and people who are doing the more generative thing — throw things out there, see what sticks. So it also doesn't need to be that all of the epistemic labor is located in one brain. And it depends on your role in the world and other things.
So in your series, you express sympathy with the idea that even if an AI — or I guess any sort of agent — doesn't have consciousness, if it has a certain wish and is willing to pursue it nonviolently, we should respect its right to pursue that. And I'm curious where that's coming from, because conventionally, I think, the thing matters because it's conscious, and its sort of experience as a result of that pursuit is what matters?
Well, I don't know. I don't know where this discourse leads. I'm just suspicious of the amount of ongoing confusion that seems to me present in our conception of consciousness. I sometimes think of analogies with, you know, how people talk about life and élan vital, right? Élan vital was this hypothesized life force that is sort of the thing that animates life, and we don't really use that concept anymore — we think it's a little bit broken. So I don't think you want to have ended up in a position of saying that everything that doesn't have élan vital doesn't matter or something, right? Because then you end up in trouble later. And somewhat similarly, even if you're like, no, no, there's no such thing as élan vital, but life — surely life exists. And I'm like, yeah, life exists.
I think consciousness exists too, likely, depending on how we define the terms — I think it might be a kind of verbal question. Even once you have a reductionist conception of life, I think it's possible that it becomes less attractive as a moral focal point, right? Right now we really think of consciousness as a deep fact. So consider a question like: take a cellular automaton that is sort of self-replicating, it has some information, and you ask, okay, is that alive? It's kind of not that interesting — it's a kind of verbal question, right? Or, I don't know, philosophers might get really into "is that alive?" But you're not missing anything about this system, right? There's no extra life that's springing up. It's just alive in some senses and not alive in other senses. But I really think that's not how we intuitively think about consciousness. We think whether something is conscious is a deep fact — this really deep difference between being
conscious or not. It's like: is someone home, are the lights on, right? And I have some concern that if that turns out not to be the case, then this is going to have been a bad thing to build our entire ethics around. Now, to be clear, I take consciousness really seriously. I'm like, man, consciousness. I'm not one of these people who says, oh, obviously consciousness doesn't exist or something. But I also notice how confused I am and how dualistic my intuitions are, and I'm like, wow, this is really weird. So I just have error bars around this. Anyway, that's one of a bunch of things going on in my wanting to be open to not making consciousness a fully necessary criterion.
I mean, clearly, I definitely have the intuition that consciousness matters a ton. If something is not conscious, and there's a deep difference between conscious and unconscious, then I definitely have the intuition that there's something that matters especially a lot about consciousness. I'm not trying to be dismissive about the notion of consciousness. I just think we should be quite aware of how ongoingly confused we seem to me to be about its nature.
Okay, so suppose we figure out that consciousness is just a word we use for a hodgepodge of different things, only some of which encompass what we care about — maybe there are other things we care about that are not included in that word, similar to the life force analogy. Then where do you anticipate that would leave us as far as ethics goes? Would there then be a next thing that's like consciousness, or what do you anticipate that would look like?
So there's a class of people called illusionists in philosophy of mind who will say consciousness does not exist. There are different ways to understand this view, but one version is to say that the concept of consciousness has built into it too many preconditions that aren't met by the real world, so we should chuck it out like élan vital. The proposal is that at least phenomenal consciousness — or qualia, or "what it's like" to be a thing — is sufficiently broken, sufficiently chock-full of falsehoods, that we should just not use it. For me, it feels like there's really clearly a thing, something going on. I do actually expect to continue to care about something like consciousness quite a lot on reflection, and to not end up deciding that my ethics doesn't make any reference to it. Or at least there's something quite nearby to consciousness — like, something happens when I stub my toe. It's unclear exactly how to name it, but something about that I'm pretty focused on. So if you ask, well, where do things go? — I should be clear, I have a bunch of credence that in the end we end up caring a bunch about consciousness just directly. And if we don't — yeah, I mean, where will ethics go? Where will a completed philosophy of mind go? Very hard to say.
I mean, I can imagine — I think a move people might make, if you get a little bit less interested in the notion of consciousness, is something slightly more animistic. Like, so what's going on with the tree? You're maybe not talking about it as a conscious entity necessarily, but it's also not totally unaware or something. The consciousness discourse is rife with these funny cases where it's like, oh, those criteria imply that this totally weird entity would be conscious, or something like that — especially if you're interested in some notion of agency or preferences, because a lot of things can be agents: corporations, all sorts of things. Are corporations conscious? And it's like, oh, man. But one place it could go, in theory, is that in some sense you start to view the world as animated by moral significance in richer and subtler structures than we're used to. So plants, or weird optimization processes, or — who knows exactly what you end up seeing as infused with the sort of thing that you ultimately care about. But I think it is possible that that includes a bunch of stuff that we don't normally ascribe consciousness to.
I think when you talk about a completed theory of mind, and presumably after that a more complete ethics — even the notion of a reflective equilibrium implies, oh, you'll be done with it at some point, right? You sum up all the numbers and then you've got the thing. This might be related to the same sense we have about science. The vibe you get when you're talking about these kinds of questions is that, oh, we're rushing through all the science right now and we've been churning through it. It's getting harder to find things because there's some cap — you find all the things at some point. Right now it's super easy because a semi-intelligent species has barely emerged. And the ASI will just rush through everything incredibly fast, and then you will either have aligned its heart or not. In either case, it'll use what it's figured out about what is really going on and then expand through the universe and exploit — you know, do the tiling, or maybe some more benevolent version of quote-unquote tiling. That feels like the basic picture of what's going on. We had dinner with Michael Nielsen a few months ago, and his view is that this just keeps going forever, or close to forever. How much would it change your understanding of what's going to happen in the future
if you were convinced that Nielsen is right about his picture of science?
Yeah, I mean, I think there are a few different aspects. Going on my memory of this conversation — and I don't claim to really understand Michael's picture here — it was sort of like: sure, you get the fundamental laws. My impression was that he expects physics to get solved or something, maybe modulo the expensiveness of certain experiments. But the difficulty is that even granted you have the basic laws down, that still doesn't let you predict where, at the macro scale, various useful technologies will be located. There's still this big search problem. I'll let him speak for himself on what his take is here, but my memory was it was sort of like: sure, you get the fundamental stuff, but that doesn't mean you get the same tech. I'm not sure if that's true. If it is true, what kind of difference would it make?
So one difference is — well, it means in some sense that you have to, in a more ongoing way, make tradeoffs between investing in further knowledge and exploration versus exploiting, as you say — acting on your existing knowledge — because you can't get to a point where you're like, "and we're done." Now, as I think about it, I sort of suspect that was always true. I remember talking to someone — I think I was like, ah, at least in the future, we should really get all the knowledge. And he's like, what do you want? You don't want to know the output of every Turing machine. In some sense, there's a question of what it actually would be to have completed knowledge, and I think that's a rich question in its own right. It's not necessarily that we should imagine, on any picture, that you've got everything. And on any picture, in some sense, you could end up with this case where you cap out — there's some collider that you can't build, or something is too expensive, or whatever, and kind of everyone caps out there.
So I guess one way I'd put it is: there's a question of whether you cap out, and there's a question of how contingent the place you go is. If it's contingent, one prediction that makes is that you'll see more diversity across, you know, our universe or something. If there are aliens, they might have quite different tech, and so if people meet, you don't expect them to be like, "oh, you got your thing, I got my version" — it's just like, whoa, look at that thing, wow. So that's one thing. If you expect more ongoing discovery of tech, then you might also expect more ongoing change and upheaval and churn, insofar as technology is one thing that really drives change in civilization. So that could be another. People sometimes talk about lock-in — they envision this kind of point at which civilization has settled into some structure or equilibrium or something — and maybe you get less of that. Though I think that's maybe more about the pace rather than contingency or caps. But that's another factor.
So yeah, it's interesting — I don't know if it changes the picture fundamentally for Earth civilization. We still have to make tradeoffs about how much to invest in research versus acting on our existing knowledge. But I think it has some significance.
One vibe you get when you talk to people — we were at a party and somebody mentioned this, talking about how uncertain we should be about the future — and they're like, there are three things I'm uncertain about: what is consciousness, what is information theory, and what are the basic laws of physics? Once we get those, we're done.
Yeah.
And that has the vibe of, oh, you'll figure out what's the right kind of hedonium, and then, you know. Whereas this other picture — where you're constantly churning through — has more of the flavor of becoming that the attunement picture implies. I think it's more exciting. It's not just, oh, you figure out all the things in the 21st century and then — you know what I mean?
Yeah, I sometimes think about two categories of views about this. There are people who think, yeah, the knowledge — we're almost there, we've basically got the picture. Where the picture is sort of: the knowledge is all just sitting there, and you just have to get to — you just have to be scientifically mature at all, and then it's all going to fall together, right? And everything past that is going to be this super expensive, not-super-important thing. And then there's a different picture, which is much more of this ongoing mystery: oh, man, there's going to be more and more — maybe expect more radical revisions to our worldview. And I think it's interesting; I'm kind of drawn to both. Like, physics — we're pretty good at physics, right? A lot of our physics is quite good at predicting a bunch of stuff. Or at least that's my impression, from reading some physicists. So, who knows.
Your dad's a physicist, though, right?
Yeah, but this isn't coming from my dad.
This is from, like, a blog post — I think by Sean Carroll or something. And he's like, we really understand a lot of the physics that governs the everyday world. Like, a lot of it. We're really good at it. I think I'm generally pretty impressed by physics as a discipline, and I think that could well be right. On the other hand, you know, these guys have really only had a few centuries at it. So anyway, I think that's an interesting question. Yeah. And it leads to a different — I think it does. There's something to the endless frontier. There is a draw to that from an aesthetic perspective — the idea of continuing to discover stuff.
You know, at the least, I think you can't get full knowledge in some sense, because there's always the question: what are you going to do? There's some way in which you're part of the system — the knowledge itself is part of the system. Like, if you imagine you try to have full knowledge of what the future of the universe will be like... well, I don't know. Actually, I'm not totally sure that's true.
It has a halting-problem kind of property, right? There's a little bit of a loopiness
if you're — I think there are probably fixed points in that, where you could be like, "yep, I'm going to do that," and then you do. But I at least have a question of, when people imagine the kind of completion of knowledge, exactly how well does that work? I'm not sure.
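[A toy way to see the "loopiness" and the "fixed points" mentioned here — purely illustrative, with made-up agents, not anything from the conversation — is a prediction that the agent itself gets to hear and react to. A self-consistent, fixed-point prediction exists for an agent that goes along with what's announced, but not for one that always does the opposite.]

```python
# Two hypothetical agents that get to hear a prediction about their own next action.
def contrarian(prediction: str) -> str:
    """Always does the opposite of whatever is predicted."""
    return "rest" if prediction == "work" else "work"

def conformist(prediction: str) -> str:
    """Simply goes along with the prediction."""
    return prediction

def fixed_points(agent, actions=("work", "rest")):
    """Predictions that remain true even after the agent hears them."""
    return [a for a in actions if agent(a) == a]

print(fixed_points(contrarian))  # [] -- no self-consistent prediction exists
print(fixed_points(conformist))  # ['work', 'rest'] -- "yep, I'm going to do that" is stable
```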
You had a passage in your essay on Utopia where, I think, the vibe was more that the positive future we're looking for will be more like — well, you can describe what you meant, but to me it felt more like the first picture: you get the thing, and then you've found the heart of it. Maybe can I ask you to read that passage real quick?
Oh, sure.
And that will spur the discussion I'm interested in having about this part in particular.
Right. Quote: I'm inclined to think that utopia, however weird,
would also be in a certain sense recognizable.
that if we really understood and experienced it,
we would see in it the same thing
that made us sit bolt upright long ago
when we first touched love, joy, beauty.
That we would feel in front of the bonfire,
the heat of the ember from which it was lit.
There would be, I think, a kind of remembering.
Where does that fit into this picture?
It's a good question.
I mean, I think it's some guess that if there's no part of me that recognizes it as good, then I'm not sure that it's good according to me in some sense. So yeah, there is a question of what it takes for it to be the case that a part of you recognizes it as good. But if there's really none of that, then I'm not sure it's a reflection of my values at all.
There's a sort of tautological thing you can do, where it's like: if I went through the processes which led to me discovering it was good — which we might call reflection — then it was good. But by definition you ended up there because... you know what I mean?
Yeah, I mean, you definitely don't want to be like, like, you know,
if you transform me into a paper clipper gradually, right, then I will eventually be like,
and then I saw the light, you know, I saw the true paper clips.
Yeah.
Right.
But that's part of what's complicated about this thing about reflection: you have to find some way of differentiating between the sort of development processes that preserve what you care about and the development processes that don't. And that in itself is a fraught question, which itself requires taking some stand on what you care about and what sorts of meta-processes you endorse and all sorts of things. But you definitely shouldn't just — it is not a sufficient criterion that the thing at the end thinks it got it right.
Right.
Because that's compatible with having gone wildly off the rails.
Yeah, yeah, yeah.
There was a very interesting sentence in one of your posts where you said: our hearts have, in fact, been shaped by power, so we should not be at all surprised if the stuff we love is also powerful. Yeah, what's going on there — what did you mean there?
Yeah. So the context of that post is that I'm talking about this hazy cluster, which I call in the essay "niceness slash liberalism slash boundaries." It's a somewhat more minimal set of cooperative norms involved in respecting the boundaries of others, cooperation and peace amongst differences, tolerance, and stuff like that — as opposed to your favored structure of matter, which is sometimes the paradigm of values that people use in the context of AI risk. And, you know, I talk for a while
about the ethical virtues of these norms. But it's also pretty clear why we have these norms: one important feature of them is that they're effective and powerful. Liberal societies — you know, secure boundaries save resources wasted on conflict, and liberal societies are often better to live in, better to immigrate to, more productive, all sorts of things. Nice people are better to interact with, better to trade with, all sorts of things, right? And I think it's pretty clear, if you look both at why, at a political level, we have various political institutions, and if you look more deeply into our evolutionary past and how our moral cognition is structured, that various forms of cooperation and game-theoretic dynamics and other things went into shaping what we now, at least in certain contexts, also treat as a kind of intrinsic or terminal value. So some of these values that have instrumental functions in our society also get reified in our cognition as intrinsic values in themselves.
And I think that's okay — I don't think that's a debunking. All your values are something that kind of stuck and got treated as terminally important. But I think it means that — in the context of the series, where I'm talking about deep atheism and the relationship between what we're pushing for and what nature is pushing for, or what sort of pure power will push for — it's easy to say, well, there are paper clips, which is just one place you can steer, and pleasure is another place you can steer, and these are just arbitrary directions. Whereas I think some of our other values are much more structured around cooperation and things that are also effective and functional and powerful.
So that's what I mean there: I think there's a way in which nature is a little bit more on our side than you might think, because part of who we are has been made by a kind of nature's way, and so that is in us. Now, I don't think that's necessarily enough for us to beat the gray goo, right? We have some amount of power built into our values, but that doesn't mean it's going to be arbitrarily competitive. But I think it's still important to keep in mind, in the context of integrating AIs into our society, that — we've been talking a lot about the ethics of this, but there are also instrumental and practical reasons to want forms of social harmony and cooperation with AIs with different values. And I think we need to be taking that seriously and thinking about what it is to do that in a way that's genuinely legitimate — a project that is a kind of just incorporation of these beings into our civilization. There's the justice part, and there's also the part about whether it's a good deal, a good bargain, for people. And to the extent we're very concerned about AIs rebelling or something like that, well, part of a thing you can do is make civilization better for them. And I think that's an important feature of how we have, in fact, structured a lot of our political institutions and norms and stuff like that. So that's the thing I'm getting at in that quote.
Okay, I think that's an excellent place to close. Great.
Thank you so much, Joe, for coming on the podcast. We discussed the ideas in the series, but I think people might not appreciate, if they haven't read it, how beautifully written it is. We didn't cover everything — there's a bunch of very, very interesting ideas, things that, as somebody who has talked to people about AI for a while, I haven't encountered anywhere else. And obviously no other part of the AI discourse is nearly as well written. It is genuinely a beautiful experience to listen to the podcast version, which is in your own voice, so I highly recommend people do that. It's joecarlsmith.com where they can access this.
Joe, thanks so much for coming on the podcast.
Thank you for having me.
I really enjoyed it.
Hey, everybody.
I hope you enjoyed that episode with Joe.
If you did, as always, it's helpful if you can send it to friends, group chats, Twitter — whoever else you think might enjoy it. And also, if you can leave a good rating on Apple Podcasts or wherever you listen, that's really helpful — it helps other people find the podcast.
If you want transcripts of these episodes, or you want to read my blog posts, you can subscribe to my Substack at dwarkeshpatel.com.
And finally, as you might have noticed, there are advertisements on this episode, so if you want to advertise on a future episode, you can learn more about doing that at dwarkeshpatel.com/advertise or the link in the description.
Anyways, I'll see you on the next one. Thanks.
