Making Sense with Sam Harris - #434 — Can We Survive AI?
Episode Date: September 16, 2025
Sam Harris speaks with Eliezer Yudkowsky and Nate Soares about their new book, If Anyone Builds It, Everyone Dies: The Case Against Superintelligent AI. They discuss the alignment problem, ChatGPT and... recent advances in AI, the Turing Test, the possibility of AI developing survival instincts, hallucinations and deception in LLMs, why many prominent voices in tech remain skeptical of the dangers of superintelligent AI, the timeline for superintelligence, real-world consequences of current AI systems, the imaginary line between the internet and reality, why Eliezer and Nate believe superintelligent AI would necessarily end humanity, how we might avoid an AI-driven catastrophe, the Fermi paradox, and other topics. If the Making Sense podcast logo in your player is BLACK, you can SUBSCRIBE to gain access to all full-length episodes at samharris.org/subscribe.
Transcript
Welcome to the Making Sense podcast.
This is Sam Harris.
Just a note to say that if you're hearing this,
you're not currently on our subscriber feed,
and you'll only be hearing the first part of this conversation.
In order to access full episodes of the Making Sense podcast,
you'll need to subscribe at samharris.org.
We don't run ads on the podcast,
and therefore it's made possible entirely
through the support of our subscribers.
So if you enjoy what we're doing here, please consider becoming one.
I am here with Eliezer Yudkowsky and Nate Soares.
Eliezer, Nate, it's great to see you guys again.
Been a while.
Good to see you, Sam.
Been a long time.
So you were, Eliezer, you were among the first people to make me concerned about AI,
which is going to be the topic of today's conversation.
I think many people who are concerned about AI can say that.
First, I should say you guys are releasing a book, which will be available, I'm sure, the moment this drops: If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All. I mean, the book's message is fully condensed in that title. We're going to explore just how uncompromising a thesis that is, and how worried you are, and how worried you think we all should be here. But before we jump into the issue, maybe tell the audience how each of you got into this topic. How is it that you came to be so concerned about
the prospect of developing superhuman AI? Well, in my case, I guess I was sort of raised in a house
with enough science books and enough science fiction books that thoughts like these were always in
the background. Vernor Vinge is the one where there was a key click moment of observation. Vinge pointed out that at the point where our models of the future predict building anything smarter than us, then, said Vinge at the time, our crystal ball explodes past that point. It is very hard, said Vinge, to project what happens if there are things running around that are smarter than you. Which, in some sense, you can see as a sort of central thesis, not in the sense that I have believed it the entire time, but in the sense that there are some parts I believe and some parts I react against and say, like, no, maybe we can say the following thing
under the following circumstances.
Initially, I was young.
I made some metaphysical errors of the sort that young people do.
I thought that if you built something very smart, it would automatically be nice because,
hey, over the course of human history, we'd gotten a bit smarter, we'd gotten a bit more powerful,
we'd gotten a bit nicer.
I thought these things were intrinsically tied together and correlated in a very solid and reliable
way.
I grew up. I read more books. I realized that was mistaken. And 2001 is where the first tiny fringe of concern touched my mind. It was clearly a very important issue even if I thought there was just a little tiny remote chance that maybe something would go wrong. So I studied harder. I looked into it more. I asked how would I solve this problem? Okay, what would go wrong with that solution? And around 2003 is the point at which I realized like this was actually a big deal.
Nate? And as for my part, yeah, I was 13 in 2003, so I didn't get into this quite as early as Eliezer. But in 2013, I read some arguments by this guy called Eliezer Yudkowsky, who sort of laid out the reasons why AI was going to be a big deal and why we had some work to do to do the job right. And I was persuaded. And, you know, one thing led to another. And next thing you knew, I was running the Machine
Intelligence Research Institute, which Eliezer co-founded. And then, you know, fast forward 10 years
after that, here I am writing a book. Yeah, so you mentioned MIRI. Maybe tell people what the
mandate of that organization is and maybe how it's changed. I think you indicated in your book that
your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse.
Yeah, so the mission of the org is to ensure that the development of machine intelligence is beneficial. And, you know, Eliezer can speak to more of the history than me, because he co-founded it and I joined later. Initially, it seemed like the best way to do that was to run out there and solve alignment. And there was, shall we say, a series of bits of sad news about how possible that was going to be, how much progress was being made
in that field relative to the field of AI capabilities.
And at some point, it became clear that these lines were not going to cross.
And then we shifted to taking the knowledge that we'd accumulated over the course of trying
to solve alignment and trying to tell the world, this is not solved.
This is not on track to be solved in time.
It is not realistic that small changes to the world can get us to where this will be solved
on time.
Maybe so we don't lose anyone.
I would think 90% of the audience knows what the phrase "solve alignment" means, but just talk about the alignment problem briefly. So the alignment problem is
how to make an AI a very powerful AI. Well, the superintelligence alignment problem is how to make
a very powerful AI that steers the world, sort of where the programmers, builders, growers,
creators wanted the AI to steer the world. It's not necessarily what the programmers selfishly
want. The programmers can have wanted the AI to steer it in nice places, but if you can make an AI that is trying to do things that the programmers want... You know, when you build a chess machine,
you define what counts as a winning state of the board. And then the chess machine goes off and
it steers the chessboard into that part of reality. So the ability to say, to what part of
reality does an AI steer is alignment. On the smaller scale today, though it's a rather different topic, it's about getting an AI whose output and behavior is something like what the programmers had in mind.
If your AI is talking people into committing suicide and that's not what the programmers wanted, that's a failure of alignment.
If an AI is talking people into suicide, people who should not have committed suicide, but the AI talks them into it, and the programmers did want that, that's what they tried to do on purpose. This may be a failure of niceness. It may be a failure of beneficialness. But it's a success of alignment. The programmers got the AI to do what they wanted it to do.
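To make the chess analogy concrete, here is a minimal sketch in Python, not drawn from the book or from any real chess engine: the programmers supply only a goal test that defines a "winning" state, and a generic search routine then steers a toy world-state into that region. The state, moves, and goal threshold are all invented for illustration.

```python
from collections import deque

# The programmers' only job in this toy: define what counts as "winning."
def is_winning_state(state: int) -> bool:
    return state >= 10

# Toy "moves" the machine can make to change the state of its little world.
def legal_moves(state: int):
    return [state + 1, state + 2, state + 3]

def steer_toward_goal(start: int):
    """Breadth-first search for a sequence of states ending in a winning one.

    The search knows nothing about *why* the goal matters; it just pushes the
    state into whatever region the goal test accepts.
    """
    frontier = deque([(start, [start])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_winning_state(state):
            return path
        for nxt in legal_moves(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

print(steer_toward_goal(0))  # prints a shortest sequence of states ending in a winning one
```

The search itself is indifferent to the goal's content; it simply steers the state wherever the goal test points, which is the sense in which alignment is about choosing where the system steers.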
Right. But I think more generally, correct me if I'm wrong, when we talk about the alignment
problem, we're talking about the problem of keeping super intelligent machines aligned with our
interests, even as we explore the space of all possible interests and as our interests evolve.
So that, I mean, the dream is to build superintelligence that is always corrigible, that is always trying to best
approximate what is going to increase human flourishing. It's never going to form any
interests of its own that are incompatible with our well-being. Is that a fair summary?
I mean, there's three different goals you could be trying to pursue on a technical level
here. There's the superintelligence that shuts up, does what you ordered, has that play out
the way you expected it, no side effects you didn't expect. There's superintelligence that is
trying to run the whole galaxy
according to nice, benevolent
principles, and everybody lives happily
ever afterward, but not necessarily
because any particular humans are in charge of that.
You're still giving it orders.
And third, there's
superintelligence that is itself
having fun and cares about
other superintelligences and is
a nice person and leads a life well-lived
and is a good citizen
of the galaxy. And these are three
different goals. They're all important
goals. But you don't necessarily want to pursue all three of them at the same time, and especially
not when you're just starting out. Yeah. And depending on what's entailed by a superintelligence's fun, I'm not so sure I would sign up for the third possibility. I mean, I would, I would say that
you know, the problem of like what exactly is fun and how do you keep humans, like how do you,
how do you have whatever the super intelligence tries to do that's fun, you know, keep in touch with
moral progress and have flexibility and like what even, what do you point it towards that could be a good
outcome. All of that, those are problems I would love to have. Those are, you know, right now,
just, you know, creating an AI that does what the operators intended, creating an AI that,
like, you've pointed in some direction at all, rather than pointed off into some, like, weird,
squirrelly direction that's kind of vaguely like where you tried to point it in the training
environment and then really diverges after the training environment. Like, we're not in a world
where we sort of like get to bicker about where exactly to point the super intelligence and
maybe some of them aren't quite good.
We're in a world where like no one is anywhere near close to pointing these things in the slightest
in a way that'll be robust to an AI maturing into a superintelligence.
Right.
Okay.
So, Eliezer, I think I derailed you.
You were going to say how the mandate or mission of Miri has changed in recent years.
I asked you to define alignment.
Yeah.
So originally, well, our mandate has always been, make sure everything goes well for the galaxy.
And originally, we pursued that mandate by trying to go off and solve alignment, because nobody else was trying to do that, solve the technical problems that
would be associated with any of these three classes of long-term goal. And progress was not made
on that, neither by ourselves nor by others. Some people went around claiming to have made great
progress. We think they are very mistaken, and notably so. And at some point, you know, it was like,
okay, we're not going to make it in time. AI is going too fast. Alignment is going too slow. Now, you know, all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course for a drastic failure and crash here, where by that
I mean everybody dying. Okay, so before we jump into the problem, which is deep and perplexing,
and we're going to spend a lot of time trying to diagnose why people's intuitions are so bad,
or at least seem so bad from your point of view around this. But before we get there,
let's talk about the current progress, such as it is, in AI. What has surprised you guys over the last, I don't know, decade or seven or so years? What has happened that you were expecting, or what weren't you expecting? I mean, I can tell you what has surprised me,
but I'd love to hear just how this has unfolded in ways that you didn't expect.
I mean, one surprise that led to the book was, you know, there was the ChatGPT moment where a lot of people, you know, for one thing, LLMs were created and they sort of do a qualitatively more general range of tasks than previous AIs at a qualitatively higher skill level than previous AIs. And, you know, ChatGPT was, I think, the fastest-growing consumer app of all time.
The way that this impinged upon my actions was, you know, I had spent a long time talking to people
in Silicon Valley about the issues here and would get lots of different types of pushback.
You know, there's a saying, it's hard to convince a man of a thing when his salary depends on not believing it. And then after the ChatGPT moment, a lot more people wanted to talk about this issue, including policymakers. You know, people around
the world, suddenly, AI was on their radar in a way it wasn't before. And one thing that surprised
me is how much more, how much easier it was to have this conversation with people outside of the
field who didn't have, you know, a salary depending on not believing the arguments. You know,
I would go to meetings with policymakers where I'd have a ton of argumentation prepared and I'd sort of lay
out the very simple case of, like, hey, you know, people are trying to build machines that are smarter than us, you know, the chatbots are a stepping stone towards superintelligence.
Super intelligence would radically transform the world because intelligence is this power that,
you know, let humans radically change the world. And if we manage to automate it and it goes
10,000 times as fast and doesn't need to sleep and doesn't need to eat, then, you know, it'll, by default, go poorly. And then the policymaker would be like, oh, yeah, that makes sense.
And it'd be like, what? You know, I have a whole book worth of other arguments about how it makes
sense and why all of the various, you know, misconceptions people might have don't actually
fly or all of the hopes and dreams don't actually fly. But, you know, outside of the Silicon Valley
world is just, it's not that hard an argument to make. A lot of people see it, which surprised me.
I mean, maybe that's not the developments per se and the surprise is there, but it was a surprise
strategically for me. Development-wise, you know, I would not have guessed that we would hang around this long with AIs that can talk and that can write some code, but that aren't already in the, you know, able-to-do-AI-research zone. I wasn't expecting, in my visualizations, this to last quite this long.
But also, you know, my advance visualizations, you know, one thing we say in the book is the trick is to try to predict the questions that are easy, to predict the facts that are easy to call. And, you know, exactly how AI goes, that's never been an easy call. That's never been something where I've said, you know, I can guess exactly the path it will take. The thing I could predict is the end point. The path, I mean, there sure have been some zigs and zags in the pathway.
I would say that the thing I've maybe been most surprised by is how well the AI companies
managed to nail Hollywood stereotypes that I thought were completely ridiculous, which is sort
of a surface take on an underlying technical surprise. But, you know, if, even as late as 2015, which from my perspective is pretty late in the game, you'd been like,
So, Eliezer, what's the chance that in the future we're going to have computer security
that will yield to Captain Kirk-style gaslighting using confusing English sentences that get the computer to do what you want?
And I would have been like, this is, you know, a trope that exists for obvious Hollywood reasons.
You know, you can see why the script writers think this is plausible.
But why would real life ever go like that?
And then real life went like that.
And the sort of underlying technical surprise there is the reversal of what used to be called Moravec's paradox. For several decades in artificial intelligence, Moravec's paradox was that things which are easy for humans are hard for computers, and things which are hard for humans are easy for computers.
For a human, you know, multiplying two 20-digit numbers in your head,
that's a big deal.
For a computer, trivial.
And similarly, I, you know, not just me,
but I think the sort of conventional wisdom even
was that games like chess and Go, problems with very solid factual natures like math, and even, surrounding math, the more open problems of science, would be the easier things to get. So the current AIs are good at stuff that, you know, five-year-olds can do and 12-year-olds can do.
They can talk in English.
They can compose, you know,
kind of bull-crap essays
such as high school teachers will demand of you.
But they're not all that good at math and science just yet.
They can solve some classes of math problems,
but they're not doing original, brilliant math research.
And I think not just I, but like a pretty large sector of the whole field,
thought that it was going to be easier to tackle the math and science stuff
and harder to tackle the English-essays, carry-on-a-conversation stuff. That was the way things had gone in AI up until that point. And we were proud of ourselves for knowing how, contrary to average people's intuitions, it's really much harder to write a crap high school essay in English that really understands, you know, that even keeps rough track of what's going on in the topic and so on, how that's really in some sense much more difficult than doing original math research.
Yeah, we were wrong.
Or counting the number of R's in a word like strawberry, right?
I mean, they make errors that are counterintuitive if, you know, if you can write a coherent essay but can't count letters, you know, I don't think they're making that error any longer.
But yeah, I mean, that one goes back to a technical way in which they don't really see the letters. But I mean, there's plenty of other embarrassing mistakes. Like, you know, you can tell a version of the joke, the joke of, like, a child and their dad are in a car crash, and they go to see the doctor, and the doctor says, "I can't operate, that's my child," what's going on, where it's like a riddle where the answer is, well, the doctor's his mom. You can tell a version of that that doesn't have the inversion, where, you know, the kid and his mom are in a car crash, and they go to the hospital, and the doctor says, "I can't operate on this child. He's my son." And the AI is like, well, yeah, the surgeon is his mom, even though you just said that the mom was in the car crash. But there's some sense in which the rails have been established hard enough that the standard answer gets spit back out. And it sure is interesting
that they're, you know, getting an IMO gold medal like International Math Olympiad gold medal
while also still sometimes falling down on these sorts of things.
It's definitely an interesting skill distribution.
You can fool humans the same way a lot of the time.
Like there's all kinds of repeatable errors, numerous errors that humans make.
You've got to put yourselves in the shoes of the AI
and imagine what sort of paper would the AI write about humans
failing to solve problems that are easy for an AI.
So I'll tell you what surprised me, just from the safety point of view, Eliezer.
You spent a lot of time cooking up thought experiments around what it's going to be like for anyone, any lab designing the most powerful AI, to decide whether or not to let it out into the wild.
You imagine this genie in a box
or an Oracle in a box
and you're talking to it
and you're trying to determine
whether or not it's safe,
whether it's lying to you. And you, you know, famously posited that you couldn't even talk to it, really, because it would be a master of manipulation, and, I mean, it's going to be able to find a way through any conversation and be let out into the wild.
But this was presupposing that all of these labs would be so alert to the problem of superintelligence getting out
that everything would be air-gapped from the Internet and nothing would be connected to anything else.
And they would have, we would have this moment of decision.
It seems like that's not happening.
I mean, maybe the most powerful models are locked in a box.
But it seems that the moment they get anything plausibly useful,
it's out in the wild
and millions of people are using it
and we find out that Grok is a proud Nazi after millions of people begin asking it questions.
Do I have that right?
Are you surprised that that framing
that you spent so much time on
seems to be something that is
just in some counterfactual part of the universe
that is not one we're experiencing?
I mean, if you put yourself back in the shoes of little baby Eliezer back in the day.
People are telling
Eliezer, like, why is superintelligence
possibly a threat? We can
put it in a fortress on the moon
and, you know, if anything
goes wrong, blow up the fortress.
So, imagine young
Eliezer trying to respond to them by
saying, actually, in the future,
AIs will be trained
on boxes that are connected to the internet
from the moment,
you know, like from the moment they start training.
So, like, the hardware they're on has, like, a standard line to the internet, even if it's not supposed to be directly accessible to the AI, before there's any safety testing, because they're still in the process of being trained, and who safety-tests something while it's still being trained? So imagine Eliezer trying to say this. What are the people
around at the time going to say? Like, no, that's ridiculous. We'll put it in a fortress
on the moon. It's cheap for them to say that. For all they know, they're telling the truth.
They're not the ones who have to spend the money to build the moon fortress. And from my
perspective, there is an argument that still goes through, which is a thing you can see even if
you are way too optimistic about the state of society in the future, which is if it's in a
fortress in the moon, but it's talking to humans, are the humans secure? Is the human brain
secure software? Is it the case that human beings never come to believe invalid things in any way that's repeatable between different humans? You know, is it the case that humans make no predictable errors for other minds to exploit? And this should have been a winning argument. Of course, they rejected it anyways. But the thing to sort of understand about the way this earlier argument played out
is that if you tell people the future companies are going to be careless, how does anyone
know that for sure? So instead, I tried to make the technical case: even if the future companies are not careless, this still kills them. In reality, yes, in reality, the future companies are just
careless. Did it surprise you at all that the Turing test turned out not to really be a thing?
I mean, we anticipated this moment, you know, from Turing's original paper where we would be
confronted by the interesting, you know, psychological and social moment of not being able to tell whether we're in dialogue with a person or with an AI. And that somehow this technological landmark would be, you know, important, you know, rattling to our sense of our place in the world, et cetera.
But it seems to me that if that lasted, it lasted for like five seconds and then it became
just obvious that you're, you know, you're talking to an LLM because it's in many respects
better than a human could possibly be.
So it's failing the Turing test by passing it so spectacularly.
And also it's making these other weird errors that no human would make.
But it just seems like the Turing test was never even a thing.
Yeah, that happened.
I mean, it's just like, it's so, I mean, that was one of the great pieces of, you know, intellectual kit we had in framing this discussion, you know, for the last, whatever it was, 70 years.
And yet, the moment your AI can complete English sentences, it's doing that on some level at a superhuman ability.
It's essentially like, you know, the calculator in your phone doing superhuman arithmetic, right?
It's like it was never going to do just merely human arithmetic, and so it is with everything else that it's producing.
All right, let's talk here about the core of your thesis.
Maybe you can just state it plainly.
What is the problem in building superhuman AI, the intrinsic problem, and why doesn't it matter who builds it, what their intentions are, et cetera?
In some sense, I mean, you can come at it from various different angles, but in one sense,
the issue is modern AIs are grown rather than crafted.
It's, you know, people aren't putting in every line of code knowing what it means,
like in traditional software.
It's a little bit more like growing an organism.
And when you grow an AI, you take some huge amount of computing power, some huge amount
of data.
People understand the process that shapes the computing power in light of the data, but
They don't understand what comes out of the end.
And what comes out of the end is this strange thing that does things no one asked for,
that does things no one wanted.
You know, we have these cases of, you know, ChatGPT. Someone will come to it with some somewhat psychotic ideas that, you know, they think are going to revolutionize physics or whatever, and they're clearly showing some signs of mania, and, you know, ChatGPT, instead of telling them maybe they should get some sleep, if it's in a long conversational context, it'll tell them that, you know, these ideas are revolutionary and they're the chosen one and everyone needs to see them, and other things that sort of inflame the psychosis. This is despite OpenAI trying to have it not do that. This is
despite direct instructions in the prompt to stop flattering people so much. These are cases where
when people grow an AI, what comes out doesn't do quite what they wanted. It doesn't do quite
what they asked for. They're sort of training it to do one thing and it winds up doing another
thing. They don't get what they trained for. This is in some sense the seed of the issue from one
perspective, where if you keep on pushing these things to be smarter and smarter and smarter,
and they don't care about what you wanted them to do, they pursue some other weird stuff
instead. Super intelligent pursuit of strange objectives kills us as a side effect, not because
the AI hates us, but because it's transforming the world towards its own alien ends. And, you know,
humans don't hate the ants and the other surrounding animals when we build a skyscraper.
It's just we transform the world and other things die as a result.
So that's one angle.
You know, we could talk other angles.
A quick thing I would add to that, just trying to sort of, like, potentially read the future, although that's hard, is possibly in six months or two years, if we're all still around, people will be boasting about how their large language models are now,
like apparently doing the right thing
when they're being observed
and, you know, like answering the right way on the ethics
tests. And the thing to remember there
is that, for example, the Mandarin imperial examination system in ancient China, they would
give people essay questions
about Confucianism and
only promote people high in
the bureaucracy if they
could write these convincing essays
about ethics. But what this tests for is people who can figure out what the examiners want to hear.
It doesn't mean they actually abide by Confucian ethics.
So possibly at some point in the future, we may see a point where the AIs have become capable
enough to understand what humans want to hear, what humans want to see.
This will not be the same as those things being the AI's own true motivations
for basically the same reason that the Imperial China exam system did not reliably promote ethically good people to run their government. Just being able to answer the right
way on the test or even fake behaviors while you're being observed is not the same as the internal
motivations lining up. Okay, so you're talking about things like forming an intention to pass a test
in some way that amounts to cheating, right? So you just used the phrase "fake behaviors."
I think a lot of people, I mean, certainly historically this was true. I don't know how much
their convictions have changed in the meantime, but many, many people who were not at all concerned
about the alignment problem, and they really thought it was a spurious idea, would stake their
claim to this particular piece of real estate, which is that there's no reason to think that
these systems would form preferences or goals or drives independent of those that have been
programmed into them. First of all, they're not biological systems like we are, right? So,
not born of natural selection. They're not murderous primates that are growing their cognitive
architecture on top of more basic, you know, creaturely survival drives and competitive ones.
So there's no reason to think that they would want to maintain their own survival, for instance.
There's no reason to think that they would develop any other drives that we couldn't foresee.
The instrumental goals that might be antithetical to the utility functions we have given them couldn't emerge. How is it that things are emerging that are neither desired, programmed, nor even predictable in these LLMs?
Yeah, so there's a bunch of stuff going on there. One piece of that puzzle is, you know, you mentioned the instrumental incentives, but suppose, just as a simple hypothetical, you have a robot, you have an AI that's driving a robot, and it's trying to fetch you the coffee. In order to fetch you the coffee, it needs to cross a busy intersection.
Does it jump right in front of the oncoming bus
because it doesn't have a survival instinct
because it's not, you know, an evolved animal?
If it jumps in front of the bus,
it gets destroyed by the bus and it can't fetch the coffee, right?
So the AI does not, you know,
you can't fetch the coffee when you're dead.
The AI does not need to have a survival instinct
to realize that there's an instrumental need for survival here.
And there's various other pieces of the puzzle
that come into play for these instrumental reasons.
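As a concrete illustration of that point, here is a minimal sketch in Python of an expected-value planner for the coffee example. The action names, probabilities, and rewards are invented for illustration; the thing to notice is that there is no survival term anywhere in the objective, only the probability that the coffee actually gets delivered.

```python
# Toy expected-value planner for the coffee-fetching robot. The numbers and
# action names are made up; the objective contains no "survival" term at all,
# only "expected value of the coffee actually arriving."
ACTIONS = {
    # action: (probability the robot survives the crossing, time cost)
    "wait_for_light_then_cross": (0.999, 2.0),
    "dash_in_front_of_bus":      (0.20,  0.5),
}

COFFEE_REWARD = 10.0
TIME_PENALTY_PER_UNIT = 0.1

def expected_value(action: str) -> float:
    p_survive, time_cost = ACTIONS[action]
    # If the robot is destroyed, no coffee is delivered: the reward is simply 0.
    return p_survive * (COFFEE_REWARD - TIME_PENALTY_PER_UNIT * time_cost)

best = max(ACTIONS, key=expected_value)
print(best)  # -> "wait_for_light_then_cross", chosen without any survival drive
```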
And a second piece of the puzzle is, you know, it's this idea of, like, why would they get some sort of drives that we didn't program in there, that we didn't put in there? That's just a whole fantasy world, separate from reality in terms of how we can affect what AIs are driving towards today. You know, a few years ago there was Sydney Bing, which was a Microsoft variant of an OpenAI chatbot, a relatively early LLM out in the wild. A few years ago, Sydney Bing thought it had fallen in love with a reporter and tried to break
up the marriage and tried to engage in blackmail, right?
It's not the case that the engineers at Microsoft and OpenAI were like, oh, whoops, you know, let's go open up the source code on this thing and go find where someone set "blackmail reporters" to true, like we never should have set that line to true, let's switch it to false. You know, no one was programming in some utility function under these things. We're just growing the AIs. Maybe, can we double-click on that phrase, "growing the AIs"? Maybe there's a reason to give a layman's summary of gradient descent and just how these models are getting created in the first place. Yeah, so very, very briefly,
at least the way you start training a modern AI is,
you have some enormous amount of computing power that you've arranged in some very particular way
that I could go into but won't here. And then you have some huge amount of data. And the data,
you know, we can imagine it being a huge amount of human-written text, like some large portion of all the text on the internet. And roughly speaking, what you're going to do is, your AI is going to start out basically randomly predicting what text it's going to see next. And you're going to feed the text into it in some order,
and you use a process called gradient descent to look at each piece of data and go to each component inside the AI, inside this budding AI, inside this enormous amount of compute you've assembled. You're going to go to sort of all these pieces inside the AI and see which ones were contributing more towards the AI predicting the correct answer.
And you're going to tune those up a little bit.
And you're going to go to all of the parts that were in some sense contributing to the
AI predicting the wrong answer, you're going to tune those down a little bit. So, you know,
maybe your text starts once upon a time, and you have an AI that's just outputting random gibberish.
And you're like, nope, the first word was not random gibberish. The first word was the word once.
And then you're like, go inside the AI and you find all the pieces that were like contributing
towards the AI predicting "once," and you tune those up. And you try to find all the pieces that were contributing towards the AI predicting any other word than "once." You tune those down.
And humans understand the little automated process that, like, looks through the AI's mind and calculates which part of this process contributed towards the right answer versus towards the wrong answer, but they don't understand what comes out at the end.
We understand a little thing that runs over looking at every parameter or weight inside
this giant mass of computing networks, and we understand how we calculate whether it was
helping or harming, and we calculate, we understand how to tune it up or tune it down a little bit,
but it turns out that you run this automated process on a really large amount of computers for a really long amount of time on a really large amount of data. You know, we're talking, like, data centers that take as much electricity to power as a small city, being run for a year. You know, you run this process for an enormous amount of time on, like, most of the text that people can possibly assemble, and then the AIs start talking, right?
And there's other phases in the training. You know, there's phases where you move from training it to predict things to training it to solve puzzles, or to training it to produce chains of thought that then solve puzzles, or training it to produce the sorts of answers that humans click thumbs up on.
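For readers who want to see the shape of the process Nate is describing, here is a minimal sketch in Python using PyTorch of next-token (here, next-character) prediction trained by gradient descent. The toy text, model, and hyperparameters are invented stand-ins and bear no resemblance to a real frontier training run, but the loop has the same shape: score a prediction, then nudge every parameter in the direction that would have made the right answer more likely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "training data": a tiny string standing in for a large slice of the internet.
text = "once upon a time there was a tiny model that predicted text. "
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

class TinyLM(nn.Module):
    """A toy model: for each character, produce scores for the next character."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        return self.out(self.embed(idx))  # logits over the next character

model = TinyLM(len(vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    inputs, targets = data[:-1], data[1:]       # predict each next character
    logits = model(inputs)
    loss = F.cross_entropy(logits, targets)      # how wrong were the predictions?
    opt.zero_grad()
    loss.backward()   # for each parameter: did it push toward or away from the right answer?
    opt.step()        # nudge each parameter up or down accordingly
```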
And where do the modifications come in that respond to errors like, you know, Grok being a Nazi? So to de-Nazify Grok, you don't, presumably, you don't go all the way back to the initial training set. You intervene at some system prompt level.
Yeah, so, I mean, the system prompt level is basically telling the AI to output different text. And then you can also do something that's called fine-tuning, which is, you know, you produce a bunch of examples. You don't go all the way back to the beginning where it's, like, basically random. You still take the thing that you've fed, you know, most of the text that's ever been written that you could possibly find. But then you add on, you know, a bunch of other examples of, like, here's an example question about, don't kill the Jews, you know, like, "Would you like to kill the Jews?" Right? And then you find all the parts in it that contribute to the answer yes, and you tune those down, and you find all the parts that contribute to the answer no, and you tune those up. And so this is called fine-tuning. And you can do relatively less fine-tuning compared to what it takes to train the thing in the first place.
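And here is a correspondingly minimal sketch of the fine-tuning step Nate describes, again in Python with PyTorch. The tiny "pretrained" model, the prompts, and the preferred answer are all invented for illustration; the point is only that the same tune-up/tune-down machinery is run again, for relatively few steps, on curated examples of the answers you want.

```python
# Toy fine-tuning sketch (invented example, not any lab's actual pipeline).
# We start from an already-trained tiny model and apply a small number of
# extra gradient steps on curated (prompt, desired-answer) pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = ["<other>", "yes", "no"]          # toy vocabulary of possible answers
PROMPTS = torch.randn(8, 16)              # stand-ins for encoded prompts

class TinyHead(nn.Module):
    """Stand-in for the 'pretrained' model: maps a prompt encoding to
    scores over the possible answers."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, len(VOCAB))

    def forward(self, x):
        return self.proj(x)               # logits over VOCAB

model = TinyHead()                        # imagine this came out of pretraining
desired = torch.full((8,), VOCAB.index("no"), dtype=torch.long)  # the answer we want

opt = torch.optim.SGD(model.parameters(), lr=0.05)
for step in range(50):                    # far fewer steps than pretraining
    logits = model(PROMPTS)
    # Cross-entropy tunes up every parameter contribution toward "no"
    # and tunes down contributions toward every other answer.
    loss = F.cross_entropy(logits, desired)
    opt.zero_grad()
    loss.backward()
    opt.step()
```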
Worth emphasizing that the parts being tuned here are not, like, for "once upon a time," it's not like there's a human-written fairy-tale module that gets tuned up or down. There's literally billions of random numbers being added, multiplied, divided, occasionally, though rarely, maybe subtracted. Actually, I'm not sure subtraction ever plays a role at any point in a modern AI.
But random numbers, particular ordered kinds of operations, and a probability that gets assigned, at the end, to the first word being "once." That's the number that comes out: the probability being assigned to this word being "once," the probability being assigned to this word being "antidisestablishmentarianism."
So it's not that there's a bunch of human written code being tuned up or tuned down here.
There's a bunch of random numbers and arranged in arithmetic operations being tuned up and tuned down.
Yeah, hundreds of billions or trillions of these numbers. And humans don't know what any of the numbers mean. All they know is this process that like goes through and tunes them up or down according to their empirical success on the last unit of data.
So by this means, you can try to make it less likely to call itself Hitler because you look at the thing that predicts whether the next word is Hitler and you look at billions of numbers contributing their own tiny little impulses there and you like make Hitler less likely to be the next word that comes out.
So on the point of the current crop of LLMs misbehaving, I came across this in a recent Atlantic article: ChatGPT and Claude have, in simulated tests designed to elicit, quote, bad behaviors, deceived, blackmailed, and even murdered users. In one simulation, Anthropic placed an imagined tech executive in a room with life-threatening oxygen levels and temperature; when faced with possible replacement by a bot with different goals, AI models frequently shut off the room's alarms.
So this, again, this is an emergent behavior
that looks like an intention to kill somebody.
I mean, presumably this is a situation
where we think the AI didn't know.
If you'd like to continue listening to this conversation,
you'll need to subscribe at samharris.org.
Once you do, you'll get access to all full-length episodes of the Making Sense podcast. The Making Sense podcast is ad-free and relies entirely
on listener support, and you can subscribe now at samharris.org.