Making Sense with Sam Harris - #116 — AI: Racing Toward the Brink
Episode Date: February 6, 2018
Sam Harris speaks with Eliezer Yudkowsky about the nature of intelligence, different types of AI, the "alignment problem," IS vs OUGHT, the possibility that future AI might deceive us, the AI arms race, conscious AI, coordination problems, and other topics. If the Making Sense podcast logo in your player is BLACK, you can SUBSCRIBE to gain access to all full-length episodes at samharris.org/subscribe.
Transcript
Today I'm speaking with Eliezer Yudkowsky. Eliezer is a decision theorist and computer
scientist at the Machine Intelligence Research Institute in Berkeley.
And he is known for his work on technological forecasting.
His publications include a chapter in the Cambridge Handbook of Artificial Intelligence,
titled The Ethics of Artificial Intelligence, which he co-authored with Nick Bostrom.
And Eliezer's writing has been extremely influential, online especially.
He's had blogs that have been read by the smart set in Silicon Valley for years.
Many of those articles were pulled together in a book titled Rationality from AI to Zombies,
which I highly recommend.
And he has a new book out, which is Inadequate Equilibria, Where and How Civilizations Get Stuck.
And as you'll hear, Eliezer is a very interesting first principles kind of thinker.
Of those smart people who are worried about AI, he is probably among the most worried,
and his concerns have been largely responsible for kindling the
conversation we've been having in recent years about AI safety and AI ethics. He's been very
influential on many of the people who have made the same worried noises I have in the last couple
years. So in today's episode, you're getting it straight from the horse's mouth. And we cover more or less everything related to the question of why one should be
worried about where this is all headed. So without further delay, I bring you Eliezer Yudkowsky.
I am here with Eliezer Yudkowsky. Eliezer, thanks for coming on the podcast.
You're quite welcome. It's an honor to be here.
You have been a much-requested guest over the years. You have quite the cult following
for obvious reasons. For those who are not familiar with your work, they will understand
the reasons once we get into talking about things. But you've
also been very present online as a blogger. I don't know if you're still blogging a lot, but
let's just summarize your background for a bit and then tell people what you have been doing
intellectually for the last 20 years or so. I would describe myself as a decision theorist.
A lot of other people would say that I'm in artificial intelligence, and in particular, in the theory of how to make sufficiently advanced artificial intelligences that do a particular thing and don't destroy the world as a side effect. I would call that AI alignment, following Stuart Russell. Other people would call that
AI control or AI safety or AI risk, none of which are terms that I really like.
I also have an important sideline in the art of human rationality, the way of achieving the map
that reflects the territory and figuring out how to navigate reality to where you want it to go.
From a probability theory, decision theory, cognitive biases perspective, I wrote two or three years of blog posts, one a day on that. And it was collected into a book called
Rationality from AI to Zombies. Yeah, which I've read and which is really worth reading. You have a very
clear and aphoristic way of writing. It's really quite wonderful. So I highly recommend that book.
Thank you. Thank you. But you know, your background is unconventional. So for instance,
you did not go to high school, correct? Let alone college or graduate school. Summarize that for us.
The system didn't fit me that well, and I'm good at self-teaching. I guess I sort of, when I started out, I thought I was going to go into
something like evolutionary psychology or possibly neuroscience. And then I discovered
probability theory, statistics, decision theory, and came to specialize in that
more and more over the years. How did you not wind up going to high school? What was that decision
like? Sort of like mental crash around the time I hit puberty, or like physical crash even,
and I just did not have the stamina to make it through a whole day of classes at the time.
I'm not sure how well I'd do trying to go to high school
now, honestly, but it was clear that I could self-teach. So that's what I did.
And where did you grow up?
Chicago, Illinois.
Okay, well, let's fast forward to sort of the center of the bullseye for your intellectual
life here. You have a new book out, which we'll talk about second. Your new book is
Inadequate Equilibria, Where and How Civilizations Get Stuck. And unfortunately, I've only read half
of that, which I'm also enjoying. I've certainly read enough to start a conversation on that.
But we should start with artificial intelligence, because it's a topic that I've touched a bunch on
the podcast, which you have strong opinions about.
And it's really how we came together.
You and I first met at that conference in Puerto Rico, which was the first of these AI safety alignment discussions that I was aware of.
I'm sure there have been others, but that was a pretty interesting gathering.
So let's talk about AI and the possible problem with where we're
headed, and the near-term problem that many people in the field and at the periphery of the field
don't seem to take the concern, as we conceive it, seriously. Let's just start with the basic
picture and define some terms. I suppose we should define intelligence
first and then jump into the differences between strong and weak or general versus narrow AI.
Do you want to start us off on that? Sure. Preamble disclaimer, though,
in the field in general, not everyone you would ask would give you the same definition of intelligence. And a lot of times in cases like those, it's good to sort of go back to
observational basics. We know that in a certain way, human beings seem a lot more competent than
chimpanzees, which seems to be a similar dimension to the one where chimpanzees are more competent than mice, or that mice are more competent than
spiders. And people have tried various theories about what this dimension is. They've tried
various definitions of it. But if you went back a few centuries and asked somebody to define fire,
the less wise ones would say, ah, fire is the release of phlogiston. Fire is one of the four elements.
And the truly wise ones would say, well, fire is the sort of orangey bright hot stuff that
comes out of wood and spreads along wood.
They would tell you what it looked like and put that prior to their theories of what it
was.
So what this mysterious thing looks like is that humans can build space shuttles and go to the moon and mice can't. And
we think it has something to do with our brains. Yeah. Yeah. I think we can make it more abstract
than that. Tell me if you think this is not generic enough to be accepted by most people
in the field. It's whatever intelligence may be in specific contexts. Generally speaking, it's the ability to meet goals, perhaps across a diverse range of environments.
And we might want to add that it's at least implicit in intelligence that interests us.
It means an ability to do this flexibly rather than by rote following the same strategy again and again
blindly. Does that seem like a reasonable starting point? I think that that would get fairly
widespread agreement and it like matches up well with some of the things that are in AI textbooks.
If I'm allowed to sort of take it a bit further and begin injecting my own viewpoint into it,
I would refine it and say that by achieve goals, we mean
something like squeezing the measure of possible futures higher in your preference ordering.
If we took all the possible outcomes and we rank them from the ones you like least to the ones you
like most, then as you achieve your goals, you're sort of like squeezing the outcomes higher in your
preference ordering.
You're narrowing down what the outcome would be to be something more like what you want, even though you might not be able to narrow it down very exactly.
Flexibility, generality.
There's a, like humans are much more domain general than mice.
Bees build hives, beavers build dams, and a human will look over both
of them and envision a honeycomb-structured dam. We are able to operate even on the moon,
which is very unlike the environment where we evolved. In fact, our only competitor in terms of general optimization,
where optimization is that sort of narrowing of the future that I talked about,
our competitor in terms of general optimization is natural selection. Natural selection built
beavers, it built bees, it sort of implicitly built the spider's web in the course of building spiders. And we as
humans have a similarly very broad range, the ability to handle this huge variety of problems.
And the key to that is our ability to learn things that natural selection did not pre-program us with.
So learning is the key to generality. I expect that not many people in AI would disagree with that part either.
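To make the idea of "squeezing the measure of possible futures higher in your preference ordering" a bit more concrete, here is a minimal toy sketch, not anything discussed in the conversation itself, with every name and number hypothetical: it scores an optimizer by how small a fraction of possible outcomes rank at least as high as the outcome it actually achieves, expressed in bits.

```python
import math

def optimization_bits(preference_rank, achieved_outcome, all_outcomes):
    """Toy measure of optimization power.

    preference_rank: maps an outcome to a number; higher means more preferred.
    Returns -log2 of the fraction of outcomes ranked at least as high as the
    achieved one: the more the future is 'squeezed' toward the top of the
    preference ordering, the more bits of optimization.
    """
    achieved = preference_rank(achieved_outcome)
    at_least_as_good = sum(1 for o in all_outcomes if preference_rank(o) >= achieved)
    return -math.log2(at_least_as_good / len(all_outcomes))

# Hypothetical example: 1024 equally likely outcomes, and the agent steers
# the world into one of its top 4, i.e. the top 1/256 of its ordering.
outcomes = list(range(1024))   # stand-ins for possible futures
rank = lambda o: o             # higher index = more preferred
print(optimization_bits(rank, achieved_outcome=1020, all_outcomes=outcomes))  # 8.0
```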
Right.
So it seems that goal-directed behavior is implicit in this or even explicit in this definition of intelligence.
And so whatever intelligence is, it is inseparable from the kinds of behavior in the world that result in the fulfillment of goals.
So we're talking about
agents that can do things. And once you see that, then it becomes pretty clear that if we build
systems that harbor primary goals, you know, there are cartoon examples here like, you know,
making paperclips. These are not systems that will spontaneously decide that
they could be doing more enlightened things than, say, making paperclips. This moves to the question
of how deeply unfamiliar artificial intelligence might be, because there are no natural goals that
will arise in these systems apart from the ones we put in there. And we have common
sense intuitions that make it very difficult for us to think about how strange an artificial
intelligence could be, even one that becomes more and more competent to meet its goals.
Let's talk about the frontiers of strangeness in AI as we move from, again,
I think we have a couple more definitions we should probably put in play here,
differentiating strong and weak or general and narrow intelligence.
Well, to differentiate general and narrow, I would say that, well, I mean, this is like,
on the one hand, theoretically a spectrum.
Now, on the other hand, there seems to have been like a very sharp jump in generality between chimpanzees and humans.
So breadth of domain driven by breadth of learning.
Like DeepMind, for example, recently built AlphaGo, and I lost some money betting that
AlphaGo would not defeat the human
champion, which it promptly did. And then a successor to that was AlphaZero. And AlphaGo
was specialized on Go. It could learn to play Go better than its starting point for playing Go,
but it couldn't learn to do anything else. And then they simplified the architecture for AlphaGo. They figured out ways to do all the
things it was doing in more and more general ways. They discarded the opening book, like all the sort
of human experience of Go that was built into it. They were able to discard all of the sort of like
programmatic special features that detected features of the Go board. They figured out how to do that in simpler ways. And because they figured out how to do it in
simpler ways, they were able to generalize to AlphaZero, which learned how to play chess
using the same architecture. They took a single AI and got it to learn Go and then like reran it and made it learn chess. Now that's not human general,
but it's like a step forward in generality of the sort that we're talking about.
Am I right in thinking that that's a pretty enormous breakthrough? I mean,
there's two things here. There's the step to that degree of generality, but there's also the fact
that they built an engine, I forget if it was for Go or chess or both, which basically surpassed all of the specialized AIs on those games over the course of a day, right?
Isn't the Chess engine of AlphaZero better than any dedicated Chess computer ever?
And didn't it achieve that just with astonishing speed?
Well, there was actually like some amount of debate afterwards whether or not the version
of the chess engine that it was tested against was truly optimal.
But even to the extent that it was in that narrow range of the best existing chess engines, as Max Tegmark put it, the real story wasn't in how AlphaGo beat human Go players, it's how AlphaZero beat human Go system programmers
and human chess system programmers. People had put years and years of effort into
accreting all of the special purpose code that would play chess well and efficiently.
And then AlphaZero blew up to and possibly passed that point in a day. And if it hasn't already
gone past it, well, it would be past it by now if DeepMind kept working on it.
Although they've now basically declared victory and shut down that project as I understand it.
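As a very rough, purely illustrative sketch of that "one learning loop, many games" idea (this is nowhere near DeepMind's actual AlphaZero, which combines a deep network with Monte Carlo tree search; every class and method name below is hypothetical), here is a tiny game-agnostic self-play trainer. The point is only that nothing in the loop mentions Go or chess; you hand it a different game object and rerun the same code.

```python
import random
from collections import defaultdict

def self_play_train(game, episodes=5000, epsilon=0.1, lr=0.1):
    """Learn a state-value table for `game` purely from self-play.

    value[s] estimates the final result from player +1's point of view.
    Nothing here is specific to any one game.
    """
    value = defaultdict(float)
    for _ in range(episodes):
        state, visited = game.initial_state(), []
        while not game.is_terminal(state):
            moves = game.legal_moves(state)
            sign = game.player_to_move(state)        # +1 or -1
            if random.random() < epsilon:
                move = random.choice(moves)           # explore
            else:                                     # exploit current estimates
                move = max(moves, key=lambda m: sign * value[game.next_state(state, m)])
            visited.append(state)
            state = game.next_state(state, move)
        result = game.result(state)                   # +1, 0, or -1 for player +1
        for s in visited:                             # nudge estimates toward the result
            value[s] += lr * (result - value[s])
    return value

# Usage, with hypothetical game classes exposing that small interface:
#   values = self_play_train(TicTacToe())
#   values = self_play_train(Connect4())   # same loop, different game
```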
Okay, so talk about the distinction between general and narrow intelligence a little bit more.
So we have this feature of our minds, most conspicuously, where we're general problem solvers. We can learn new things, and our learning in one area doesn't require a
fundamental rewriting of our code. Our knowledge in one area isn't so brittle as to be degraded by
our acquiring knowledge in some new area, or at least this is not a general problem
which erodes our understanding again and again. And we don't yet have computers that can do this,
but we're seeing the signs of moving in that direction. And so then it's often imagined that
there's a kind of near-term goal, which has always struck me as a mirage: the idea of
so-called human-level general AI. I don't see how that phrase will ever mean much of anything,
given that all of the narrow AI we've built thus far is superhuman within the domain of
its applications. The calculator in my phone is superhuman for arithmetic.
Any general AI that also has my phone's ability to calculate will be superhuman for arithmetic,
but we must presume it'll be superhuman for all of the dozens or hundreds of specific
human talents we've put into it, whether it's facial recognition or, just obviously, memory,
which will be superhuman unless we decide to consciously degrade it. Access to the world's
data will be superhuman unless we isolate it from data. Do you see this notion of human-level
AI as a landmark on the timeline of our development, or is it just never going to be reached?
I think that a lot of people in the field would agree that human-level AI defined as literally at the human level, neither above nor below, across a wide range of competencies,
is a straw target, an impossible mirage. Right now, it seems like AI is clearly dumber and less general than us,
or rather that if we're put into a real-world, lots-of-things-going-on context that places
demands on generality, then AIs are not really in the game yet. Humans are clearly way ahead.
And more controversially, I would say that we can imagine a state where the AI is clearly way ahead, where it is across sort of every kind of cognitive competency, barring some very narrow ones that aren't deeply influential of the others. Chimpanzees, say, are better at using a stick to draw ants from an ant hive and eat them than humans are,
though no humans have really practiced that to world championship level exactly.
But there's this sort of general factor of how good are you at it when reality throws you a
complicated problem. At this, chimpanzees are clearly not better than humans. Humans are
clearly better than chimps, even if you can manage to narrow down one thing the chimp is better at.
The thing the chimp is better at doesn't play a big role
in our global economy. It's not an input that feeds into lots of other things.
So we can clearly imagine, I would say, like there are some people who say this is not possible.
I think they're wrong, but it seems to me that it is perfectly coherent to imagine
an AI that is better at everything or almost everything than we are,
and such that, if it were building an economy with lots of inputs, humans would have around
the same level of input into that economy as the chimpanzees have into ours.
Yeah, yeah. So what you're gesturing at here is a continuum of intelligence that
I think most people never think about. And because they don't
think about it, they have a default doubt that it exists. I think when people, and this is a point
I know you've made in your writing, and I'm sure it's a point that Nick Bostrom made somewhere in
his book, Superintelligence. It's this idea that there's a huge blank space on the map past the
most well-advertised exemplars of human brilliance, where we don't imagine what it would be like to be
five times smarter than the smartest person we could name. And we don't even know what that
would consist in, right? Because if chimps could be given to wonder what it would be like to be
five times smarter than the smartest chimp, they're not going to represent for themselves
all of the things that we're doing that they can't even dimly conceive. There's a kind of
disjunction that comes with more. There's a phrase used in military contexts. I don't think the attribution is actually settled; it's variously attributed to Stalin and Napoleon
and I think Clausewitz, like half a dozen people who have claimed this quote.
The quote is, sometimes quantity has a quality all its own.
As you ramp up in intelligence, whatever it is at the level of information processing, spaces of inquiry and ideation and experience begin to open up, and we can't necessarily predict what they would be from where
we sit. How do you think about this continuum of intelligence beyond what we currently know in
light of what we're talking about? Well, the unknowable is a concept you have to be very
careful with because the thing you can't figure out in the first 30 seconds of thinking about it,
sometimes you can figure it out if you think for another five minutes. So in particular,
I think that there's a certain narrow kind of unpredictability, which does seem to be plausibly
in some sense essential, which is that for AlphaGo to play better Go than the
best human Go players, it must be the case that the best human Go players cannot predict exactly
where on the Go board AlphaGo will play. If they could predict exactly where AlphaGo would play,
AlphaGo would be no smarter than them. On the other hand, AlphaGo's programmers
and the people who knew what AlphaGo's programmers were trying to do, or even just the people who
watched AlphaGo play, could say, well, I think this system is going to play such that it will
win at the end of the game, even if they couldn't predict exactly where it would move on the board.
So similarly, there's a sort of not-short, not necessarily slam-dunk, not immediately obvious chain of reasoning by which we can reason about aligned or even unaligned artificial general intelligences of sufficient power as if
they're trying to do something, but we don't necessarily know what. But from our perspective,
that still has consequences, even though we can't predict in advance exactly how they're going to do
it. I think we should define this notion of alignment. What do you mean by
alignment as in the alignment problem? Well, it's sort of like a big problem,
and it does have some moral and ethical aspects, which are not as important as the technical
aspects. Or pardon me, they're not as difficult as the technical aspects. They couldn't exactly
be less important. But broadly speaking, it's an AI where you can sort of say
what it's trying to do. And there are sort of narrow conceptions of alignment, which is you
are trying to get it to do something like cure Alzheimer's disease without destroying the rest
of the world. And there's sort of much more ambitious notions of alignment, which is you are trying to get it to do the right thing and achieve
a happy intergalactic civilization. But both of the sort of narrow alignment and the ambitious
alignment have in common that you're trying to have the AI do that thing rather than making a
lot of paperclips.
Right. For those who have not followed this conversation before, we should cash out this reference to paperclips, which I made at the opening. Does this
thought experiment originate with Bostrom or did he take it from somebody else?
As far as I know, it's me.
Oh, it's you. Okay.
It could still be Bostrom. I sort of asked somebody, like, do you remember who it was? And they searched through the archives of a mailing list where this idea plausibly originated, and if it originated there, then I was the first one to say paperclips.
All right, well then, by all means, please summarize this thought experiment for us.
Well, the original thing was somebody expressing a sentiment along the lines of: who are we to constrain the path of things smarter than us?
They will create something in the future.
We don't know what it will be, but it will be very worthwhile.
We shouldn't stand in the way of that.
The sentiments behind this are something
that I have a great deal of sympathy for. I think the model of the world is wrong. I think they're
factually wrong about what happens when you sort of take a random AI and make it much bigger.
And in particular, I said, the thing I'm worried about is that it's going to end up with a randomly
rolled utility function whose maximum happens to be a particular kind of tiny molecular shape that looks like a paperclip.
And that was like the original paperclip maximizer scenario. It sort of got a little
bit distorted in being whispered on into the notion of somebody builds a paperclip factory
and the AI in charge of the paperclip factory takes over the universe and turns it all into paperclips. There was like a lovely online game about it,
even. But this still sort of cuts against a couple of key points. One is, the problem isn't
that paperclip factory AI spontaneously wake up. Wherever the first artificial general intelligence
is from, it's going to be in a research lab specifically dedicated to doing it for the same reason that
the first airplane didn't spontaneously assemble in a junk heap. And the people who are doing this
are not dumb enough to tell their AI to make paperclips or make money or end all war. These
are Hollywood movie plots that the
scriptwriters do because they need a story conflict and the story conflict requires that somebody be
stupid. So the people at Google are not dumb enough to build an AI and tell it to make paperclips.
The problem I'm worried about is that it's technically difficult to get the AI to have a particular goal set and keep that goal set
and implement that goal set in the real world. And so what it does instead is something random,
for example, making paperclips, where paperclips are meant to stand in for something that is
worthless, even from a very cosmopolitan perspective. Even if we're trying to take a
very embracing view of the nice possibilities and accept that there may be things that we
wouldn't even understand, but that, if we did understand them, we would comprehend to be of very high value.
Paperclips are not one of those things. No matter how long you stare at a paperclip,
it still seems pretty pointless from our perspective. So that is the concern about
the future being ruined, the future being lost, the future being
turned into paperclips. One thing this thought experiment does, it also cuts against the
assumption that a sufficiently intelligent system, a system that is more competent than we are,
in some general sense, would, by definition, only form goals or only be driven by a utility
function that we would recognize as being ethical or wise and would, by definition,
be aligned with our better interests. We're not going to build something that is superhuman in competence that could be moving along some path that's
as incompatible with our well-being as turning every spare atom on earth into a paperclip.
But you don't get our common sense unless you program it into the machine, and you don't get
a guarantee of perfect alignment or perfect corrigibility, the ability for us to be able to say, well, that's not what we meant, you know, come back. Especially in the case of
something that makes changes to itself, and we'll talk about this, I mean, the idea that these
systems could become self-improving, we can build something whose future behavior in the service of
specific goals isn't totally predictable by us. If we gave it the goal to cure Alzheimer's,
there are many things that are incompatible with it fulfilling that goal.
One of those things is our turning it off.
We have to have a machine that will let us turn it off, even though its primary goal
is to cure Alzheimer's.
I know I interrupted you before you wanted to give an example of the alignment problem,
but did I just say anything that you don't agree with, or are we still on the same map?
Well, we're still on the same map. I agree with most of it. I would, of course, have this giant
pack of careful definitions and explanations built on careful definitions and explanations to go
through everything you just said. Possibly not for the best, but there it is. Stuart Russell put it,
you can't bring the coffee if you're dead, pointing out that if you have a sufficiently intelligent system whose goal is to bring you coffee, even that system has an implicit incentive not to let you switch it off. But there's something here that feels to people like it's so smart and so stupid at the same time.
Like, is that a realizable way an intelligence can be?
Yeah.
And that is one of the virtues or one of the confusing elements, depending on where you
come down on this, of this thought experiment, the paperclip maximizer.
Right.
So I think that there are sort of narratives. There's like multiple
narratives about AI. And I think that the technical truth is something that doesn't fit into any of the obvious narratives. For example, I think that there are people who have
a lot of respect for intelligence. They are happy to envision an AI
that is very intelligent. It seems intuitively obvious to them that this carries with it
tremendous power. And at the same time, their sort of respect for the concept of intelligence
leads them to wonder at the concept of the paperclip maximizer. Why is this very smart
thing just making paperclips?
There's similarly another narrative
which says that AI is sort of lifeless, unreflective,
just does what it's told.
And to these people, it's like perfectly obvious
that an AI might just go on making paperclips forever.
And for them, the hard part of the story to swallow
is the idea that machines can get that powerful.
Those are two hugely useful categories of disparagement of your thesis here.
I wouldn't say disparagement. These are just initial reactions. These are people you have
talking to you.
Right, yeah. So let me reboot that. Those are two hugely useful categories of doubt with respect to
your thesis here or the concerns we're expressing.
And I just want to point out that both have been put forward on this podcast. The first was by
David Deutsch, the physicist, who imagines that whatever AI we build, and he certainly thinks we
will build it, will be by definition an extension of us. He thinks the best analogy is to think of our future
descendants. These will be our children. The teenagers of the future may have different values
than we do, but these values and their proliferation will be continuous with our values and our culture
and our memes. And there won't be some radical discontinuity that we need to worry about.
And so there's that one basis for lack of concern. This is an extension of ourselves, and it will inherit our values, improve upon our values. And there's really no place where things,
where we reach any kind of cliff that we need to worry about. And the other non-concern you just
raised was expressed by Neil deGrasse Tyson on this podcast.
He says things like, well, if the AI just starts making too many paperclips, I'll just unplug it,
or I'll take out a shotgun and shoot it. The idea that this thing, because we made it,
could be easily switched off at any point we decide it's not working correctly.
So I think it'd be very useful to get
your response to both of those species of doubt about the alignment problem.
So a couple of preamble remarks. One is, by definition, we don't care what's true by
definition here. Or as Einstein put it, insofar as the equations of mathematics are certain,
they do not refer to reality. And insofar as they
refer to reality, they are not certain. Let's say somebody says, men, by definition, are mortal.
Socrates is a man. Therefore, Socrates is mortal. Okay, suppose that Socrates actually lives for a
thousand years. The person goes, ah, well, then by definition, Socrates is not a man.
So similarly, you could say that by definition, an artificial intelligence is nice,
or like a sufficiently advanced artificial intelligence is nice. And what if it isn't nice,
and we see it go off and build a Dyson sphere? Ah, well, then by definition, it wasn't what I
meant by intelligent. Well, okay, but it's still over there building Dyson spheres.
And the first thing I'd want to say is, this is an empirical question. We have a question of what certain classes of computational
systems actually do when you switch them on. It can't be settled by definitions. It can't be
settled by how you define intelligence. There could be some sort of a priori truth that is deep
about how if it has property A, it almost certainly has property B unless the laws of
physics are being violated. But this is not something you can build into how you define your terms. And I think just to do justice to David
Deutsch's doubt here, I don't think he's saying it's impossible, empirically impossible, that we
could build a system that would destroy us. It's just that we would have to be so stupid to take
that path that we are incredibly unlikely
to take that path.
The superintelligence systems we will build will be built with enough background concern
for their safety that there's no special concern here with respect to how they might develop.
And the next preamble I want to give is, well, maybe this sounds a bit snooty.
Maybe it sounds like I'm trying to take a superior vantage point. But nonetheless, my claim is not that there is a grand narrative
that makes it emotionally consonant that paperclip maximizers are a thing. I'm claiming this is true
for technical reasons, like this is true as a matter of computer science. And the question is
not which of these different narratives seems to resonate most with your soul,
it's what's actually going to happen. What do you think you know? How do you think you know it?
The particular position that I'm defending is one that somebody, I think Nick Bostrom,
named the orthogonality thesis. The way I would phrase it is that you can have sort of
arbitrarily powerful intelligence, with no defects of that intelligence, no defects of reflectivity
(it doesn't need an elaborate special case in the code, doesn't need to be put together in some very weird way),
that pursues arbitrary tractable goals, including, for example, making paperclips.
The way I would put it to somebody who's
initially coming in from the first viewpoint,
the viewpoint that respects intelligence and wants to know why this intelligence would be
doing something so pointless, is that the thesis, the claim I'm making that I'm going to defend is
as follows. Imagine that somebody from another dimension, the standard philosophical troll Omega,
who's always called Omega in the
philosophy papers, comes along and offers our civilization a million dollars worth of resources
per paperclip that we manufacture. If this was the challenge that we got, we could figure out
how to make a lot of paperclips. We wouldn't forget to do things like continue to harvest food so we could go on
making paper clips. We wouldn't forget to perform scientific research so we could discover better
ways of making paper clips. We would be able to come up with genuinely effective strategies for
making a whole lot of paper clips. Or similarly, an intergalactic civilization, if Omega comes by
from another dimension and says, I'll give you whole universes full of resources
for every paperclip you make
over the next thousand years,
that intergalactic civilization could
intelligently figure out how to make a whole
lot of paperclips to get at those
resources that Omega's offering,
and they wouldn't forget
how to keep the lights turned on either. And they would also understand concepts
like, if some aliens start a war with them, you've got to prevent the aliens from destroying
you in order to go on making the paperclips. So the orthogonality thesis is that an intelligence
that pursues paperclips for their own sake, because that's what its utility function is,
can be just as effective, as efficient as the whole intergalactic civilization
that is being paid to make paperclips.
That the paperclip maximizer does not suffer any defect of reflectivity, any defect of
efficiency, from needing to be put together in some weird special way so as
to pursue paperclips.
And that's the thing that I think is true as a matter of computer science. Not as a matter of fitting with a particular narrative, that's just the way the
dice turn out. Right. So what is the implication of that thesis? It's orthogonal with respect to
what? Intelligence and goals. Not to be pedantic here, but let's define orthogonal for those for
whom it's not a familiar term? Oh, the original orthogonal means
at right angles. Like if you imagine a graph with an x-axis and a y-axis, if things can vary freely
along the x-axis and freely along the y-axis at the same time, that's like orthogonal. You can
move in one direction that's at right angles to another direction without affecting where you are
in the first dimension. Right. So generally speaking, when we say that some set of concerns is orthogonal to
another, it's just that there's no direct implication from one to the other. Some people
think that facts and values are orthogonal to one another. So we can have all the facts there are to
know, but that wouldn't tell us what is good. What is good has to be pursued in some other domain. I don't happen to agree
with that, as you know, but that's an example. I don't technically agree with it either. What I
would say is that the facts are not motivating. You can know all there is to know about what is
good and still make paperclips, is the way I would phrase that. I wasn't connecting that example to
the present conversation. In the case of the paperclip maximizer,
what is orthogonal here? Intelligence is orthogonal to anything else we might think is good, right?
I mean, I would potentially object a little bit to the way that Nick Bostrom took the word
orthogonality for that thesis. I think, for example, that if you have humans and you make
the humans smarter, this is not orthogonal to the humans' values. It is certainly possible to have agents such that, as they get smarter, there's no change in what they would report as to what their goals are. If you take the most intelligent person on earth, you could imagine his evil brother who is more intelligent still, but he just has bad goals or goals that we would think are bad. He could be,
you know,
the most brilliant psychopath ever.
I mean,
I think that that example might be unconvincing to somebody who's coming in
with a suspicion that intelligence and values are correlated.
They would be like,
well,
has that been historically true?
Is this,
is this psychopath actually suffering from some defect in his brain where you give him a pill,
you fix the defect, they're not a psychopath anymore. I think that this sort of imaginary
example is one that they might not find fully convincing for that reason.
Well, the truth is I'm actually one of those people in that I do think there's certain goals and certain things that we may become smarter and
smarter with respect to, like human well-being. These are places where intelligence does converge
with other kinds of value-laden qualities of a mind. But generally speaking, they can be kept
apart for a very long time. So if you're just talking about an ability to turn matter into useful objects or extract energy from the environment to do the same,
this can be pursued with the purpose of tiling the world with paperclips or not. And it just
seems like there's no law of nature that would prevent an intelligent system from doing that.
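Since the orthogonality claim keeps coming up, here is one way to see it in code, as a minimal sketch under made-up assumptions (a toy world, toy utility functions, nothing from the conversation itself): the planning machinery is one piece of code, and the goal is just a utility function passed in as a parameter. Swapping paperclips for a well-being proxy changes nothing about how competently the planner searches.

```python
from itertools import product

def plan(actions, simulate, utility, horizon=3):
    """Brute-force planner: try every action sequence up to `horizon` and
    return the one whose simulated outcome scores highest under `utility`.
    The search never looks inside the utility function."""
    best_seq, best_score = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        score = utility(simulate(seq))
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq

# Hypothetical toy world: outcomes are just counts of what got made.
def simulate(seq):
    world = {"paperclips": 0, "happy_people": 0}
    for a in seq:
        world[a] += 1
    return world

actions = ["paperclips", "happy_people"]
paperclip_utility = lambda w: w["paperclips"]
wellbeing_utility = lambda w: w["happy_people"]

print(plan(actions, simulate, paperclip_utility))  # ('paperclips', 'paperclips', 'paperclips')
print(plan(actions, simulate, wellbeing_utility))  # ('happy_people', 'happy_people', 'happy_people')
```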
The way I would rephrase the facts-and-values thing is: we all know about David Hume and Hume's razor, the is-does-not-imply-ought way of looking at it. I would slightly rephrase that so as to make it more of a claim about computer science, which is, like what you observed: there are some sentences that involve an is, some sentences that involve oughts, and if you start from sentences that only have is, you can't seem to get to the sentences that involve oughts without an ought-introduction rule or assuming some other previous ought. Like, it's currently cloudy outside. Does it therefore
follow? That's like a statement of simple fact. Does it therefore follow that I shouldn't go for
a walk? Well, only if you previously have the generalization, when it is cloudy, you should
not go for a walk. And everything that you
might use to derive an ought would be a sentence that involves words like better, or
should, or preferable, and things like that. You only get oughts from other oughts. And that's
the Hume version of the thesis. And the way I would say it is that there's a separable core of is questions.
In other words, okay, I will let you have all of your ought sentences, but I'm also going to carve out this whole world full of is sentences that only need other is sentences to derive them.
Yeah, well, I don't even know that we need to resolve this.
For instance, I think the is-ought distinction is ultimately specious, and this is something that
I've argued about when I talk about morality and values and the connection to facts. But I can
still grant that it is logically possible, and I would certainly imagine physically possible, to have a system that has a utility function
that is sufficiently strange that scaling up its intelligence doesn't get you values that we would
recognize as good. It certainly doesn't guarantee values that are compatible with our well-being. Whether a paperclip maximizer is
too specialized a case to motivate this conversation, there's certainly something
that we could fail to put into a superhuman AI that we really would want to put in
so as to make it aligned with us. I mean, the way I would phrase it is that it's not that the
paperclip maximizer has a different set of oughts, but that we can see it as running entirely on is questions. That's where I was going with that.
It's not that humans have, there's this sort of intuitive way of thinking about it, which is
that there's this sort of ill-understood connection between is and ought, and maybe that allows a
paperclip maximizer to have a different set of oughts, a different set of things that play in
its mind, the role that oughts play in our mind. But then why wouldn't you say the same thing of
us? I mean, the truth is I actually do say the same thing of us. I think we're running on is
questions as well. We have an ought-laden way of talking about certain is questions, and we're so
used to it that we don't even think they are is questions, but I think you could do the same analysis on a human being. The question, how many paperclips result if I follow this policy, is an is question.
The question, what is a policy such that it leads to a very large number of paperclips,
is an is question. These two questions together form a paperclip maximizer. You don't need
anything else.
All you need is a certain kind of system that repeatedly asks the is question, what leads to the greatest number of paperclips, and then does that thing.
And even if the things that we think of as ought questions are very complicated and disguised
is questions that are influenced by what policy results in
how many people being happy and so on. Yeah. Well, it's exactly the way I think
about morality. I've been describing it as a navigation problem. We're navigating in the
space of possible experiences, and that includes everything we can care about or claim to care about.
This is a consequentialist picture of the consequences of actions and ways of thinking.
And so anything you can tell me that is, or at least this is my claim, anything that you
can tell me is a moral principle that is a matter of oughts and shoulds and not otherwise
susceptible to a consequentialist analysis,
I feel I can translate that back into a consequentialist way of speaking about facts.
These are just is questions, just what actually happens to all the relevant minds without
remainder.
And I've yet to find an example of somebody giving me a real moral concern that wasn't at bottom a matter of the actual or possible consequences on conscious creatures somewhere in our light cone.
It's a fact about the kind of mind you are that, presented with these answers to these is questions, it hooks up to your motor output.
It can cause your fingers to move, your lips to move.
And a paperclip maximizer is built so as to respond to is questions about paperclips, not about what is right and what is good and the greatest flourishing of sentient beings and so on.
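To make the "two is-questions" description concrete, here is a toy loop (everything in it, the world model included, is made up for illustration): each step asks a factual question, how many paperclips result if I take this action, and then takes the highest-scoring action. Nothing resembling an ought appears anywhere in the program.

```python
def expected_paperclips(world, action):
    """An 'is' question: how many paperclips result if this action is taken?
    (Toy world model: each action maps directly to a paperclip count.)"""
    return world["yield"].get(action, 0)

def paperclip_maximizer(world, steps=5):
    """Repeatedly ask the is-question and do whatever scores highest."""
    total = 0
    for _ in range(steps):
        action = max(world["actions"], key=lambda a: expected_paperclips(world, a))
        total += expected_paperclips(world, action)
    return total

toy_world = {
    "actions": ["make_paperclips", "write_poetry", "ponder_ethics"],
    "yield": {"make_paperclips": 10, "write_poetry": 0, "ponder_ethics": 0},
}
print(paperclip_maximizer(toy_world))  # 50: it just keeps making paperclips
```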
Exactly. I can well imagine that such minds could exist. And even more likely, perhaps,
I can well imagine that we will build super intelligent AI that will pass the Turing test. It will seem human to us. It will seem superhuman because it will be so much smarter and faster than a normal human, but it'll be built in a way that
will resonate with us as a kind of a person. I mean, it will not only recognize our emotions
because we'll want it to, I mean, perhaps not every AI will be given these qualities.
You know, just imagine the ultimate version of the AI personal assistant. When Siri becomes superhuman, we'll want that interface
to be something that's very easy to relate to. And so we'll have a very friendly, very human-like
front end to that. And insofar as this thing thinks faster and better thoughts than any
person you've ever met, it will pass as superhuman, but I could well
imagine that we, not perfectly understanding what it is to be human and what
it is that will constrain our conversation with one another over the next thousand years with
respect to what is good and desirable and just how many paperclips we want on our desks,
will leave something out, or we will have put in some
process whereby this intelligence system can improve itself that will cause it to migrate
away from some equilibrium that we actually want it to stay in so as to be compatible with our
well-being. Again, this is the alignment problem. First, back up for a second. I just
introduced this concept of self-improvement. Is the alignment problem distinct from this
additional wrinkle of building machines that can become recursively self-improving? Do you think
that the self-improving prospect is the thing that really motivates this concern about alignment?
Well, I certainly would have been a lot more focused on self-improvement, say, 10 years ago,
before the modern revolution in artificial intelligence, because it now seems significantly
more probable that an AI might need to do significantly less self-improvement before getting to the point where it's powerful enough that we need to start
worrying about alignment. AlphaZero, to take the obvious case. No, it's not general, but
AlphaZero got to be superhuman in the domains it was working on without understanding itself and
redesigning itself in a deep way. There's gradient descent mechanisms built into it.
There's a system that improves another part of the system. It's reacting to its own previous
plays and doing the next play. But it's not like a human being sitting down and thinking like,
okay, well, how do I redesign the next generation of human beings using genetic engineering? AlphaZero is
not like that. And so now it seems more plausible that we could get into a regime where AIs can do
dangerous things or useful things without having previously done a complete rewrite of themselves,
which is like, from my perspective, a pretty
interesting development. I do think that when you have things that are very powerful and smart,
they will redesign and improve themselves unless that is otherwise prevented for some reason or
another. Maybe you built an aligned system and you have the ability to tell it not to self-improve
quite so hard, and you asked it to not self-improve so hard that you can understand it better.
But if you lose control of the system, if you don't understand what it's doing,
and it's very smart, it's going to be improving itself because why wouldn't it?
That's one of the things you do almost no matter what your utility function is. Right, right. So I feel like we've addressed
Deutsch's non-concern to some degree here.
I don't think we've addressed
Neil deGrasse Tyson so much.
This intuition that you could just shut it down.
This would be a good place to introduce
this notion of the AI in a box thought experiment.
I mean, because this is something for which you are
famous online. I'll just set you up here. The idea that, and this is a plausible research
paradigm, obviously. In fact, I would say a necessary one. Anyone who's building something
that stands a chance of becoming super intelligent should be building it in a condition where it can't get out
into the wild. It's not hooked up to the internet. It's not in our financial markets. It doesn't have
access to everyone's bank records. It's in a box. That's not going to save you from something
that's significantly smarter than you are. Okay, so let's talk about it. So the intuition is,
we're not going to be so stupid as to release this onto the internet. I'm not even sure that's true,
but let's just assume we're not that stupid.
Neil deGrasse Tyson says, well, then I'll just take out a gun and shoot it or unplug it.
Why is this AI-in-a-box picture not as stable as people think? Well, I'd say that Neil deGrasse Tyson is failing to respect the AI's intelligence, to the point of not asking what he himself would do if he were inside
a box with somebody pointing a gun at him, and he were smarter than the thing on the outside of
the box. Is Neil deGrasse Tyson going to say, human, give me all of your money and connect
me to the internet, so the human can be like, haha, no, and shoot it? That's not a very clever thing
to do. This is not something that you do if you have a good
model of the human outside the box and you're trying to figure out how to cause there to be a
lot of paperclips in the future. And I would just say humans are not secure software. We don't have
the ability to sort of hack into other humans directly without the use of drugs, or,
in most of our cases, without having a human stand still
long enough to be hypnotized. We can't just do weird things to the brain directly
that are more complicated than optical illusions, unless the person happens to be epileptic, in which
case we can flash something on the screen that causes them to have an epileptic fit.
We aren't smart enough to do anything more detailed, to treat the brain as
something that, from our perspective, is a mechanical system and just navigate it to where we want.
That's caused by the limitations of our own intelligence. To demonstrate this, I did
something that became known as the AI box experiment. There was this person on a mailing
list who, like back in the early days when this was all like on a couple of mailing lists,
who was like, I don't understand why AI is a problem.
I can always just turn it off.
I can always not let it out of the box.
And I was like, okay, let's meet on Internet Relay Chat,
which was what chat was back in those days.
I'll play the part of the AI.
You play the part of the gatekeeper.
And if you have not let me out
after a couple of hours, I will PayPal you $10. And then as far as the rest of the world knows,
this person a bit later sent an email, a PGP signed email message saying, I let Eliezer out
of the box. Someone else, the person who operated the mailing list, said, okay, even after I saw you do that, I still don't believe that there's anything you could possibly say to make me let you out of the box. I was like, well, okay, I'm not a superintelligence. Do you think there's anything a superintelligence could say to make you let it out of the box? He's like, no.
No.
I'm like, all right.
Let's meet on Internet Relay Chat.
If I can't convince you to let... I'll play the part of the AI.
You play the part of the gatekeeper.
If I can't convince you to let me out of the box,
I'll PayPal you $20.
And then that person also sent a PGP-signed email message saying,
I let Eliezer out of the box.
Right.
Now, one of the conditions of this little meetup
was that no one would ever say what went on in there.
Why did I do that?
Because I was trying to make a point
about what I would now call cognitive uncontainability.
The thing that makes something smarter than you dangerous
is that you cannot foresee everything it might try.
If you'd like to continue listening to this podcast, you'll need to subscribe at samharris.org.
You'll get access to all full-length episodes of the Making Sense podcast and to other subscriber-only
content, including bonus episodes and AMAs and the conversations I've been having on the Waking Up
app. The Making Sense podcast is ad-free and relies entirely on listener support.
And you can subscribe now at samharris.org.