Making Sense with Sam Harris - #434 — Can We Survive AI?

Episode Date: September 16, 2025

Sam Harris speaks with Eliezer Yudkowsky and Nate Soares about their new book, If Anyone Builds It, Everyone Dies: The Case Against Superintelligent AI. They discuss the alignment problem, ChatGPT and... recent advances in AI, the Turing Test, the possibility of AI developing survival instincts, hallucinations and deception in LLMs, why many prominent voices in tech remain skeptical of the dangers of superintelligent AI, the timeline for superintelligence, real-world consequences of current AI systems, the imaginary line between the internet and reality, why Eliezer and Nate believe superintelligent AI would necessarily end humanity, how we might avoid an AI-driven catastrophe, the Fermi paradox, and other topics. If the Making Sense podcast logo in your player is BLACK, you can SUBSCRIBE to gain access to all full-length episodes at samharris.org/subscribe.

Transcript
Starting point is 00:00:00 Welcome to the Making Sense podcast. This is Sam Harris. Just a note to say that if you're hearing this, you're not currently on our subscriber feed, and will only be hearing the first part of this conversation. In order to access full episodes of the Making Sense podcast, you'll need to subscribe at samharris.org. We don't run ads on the podcast,
Starting point is 00:00:25 and therefore it's made possible entirely through the support of our subscribers. So if you enjoy what we're doing here, please consider becoming one. I am here with Eliezer Yudkowsky and Nate Soares. Eliezer, Nate, it's great to see you guys again. Been a while. Good to see you, Sam. Been a long time.
Starting point is 00:00:46 So, Eliezer, you were among the first people to make me concerned about AI, which is going to be the topic of today's conversation. I think many people who are concerned about AI can say that. First, I should say you guys are releasing a book, which I'm sure will be available the moment this drops: If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All. The book's message is fully condensed in that title. We're going to explore just how uncompromising a thesis that is, and how worried you are and how worried you think we all should be here. But before we jump into the issue, maybe tell
Starting point is 00:01:28 the audience, how each of you got into this topic? How is it that you came to be so concerned about the prospect of developing superhuman AI? Well, in my case, I guess I was sort of raised in a house with enough science books and enough science fiction books that thoughts like these were always in the background. Verner Vinji is the one where there was a key click moment of observation. Vinji pointed out that at the point where our models of the future predict building anything smarter than us, then said Vinji at the time, our crystal ball explodes past that point. It is very hard, said Vinci, to project what happens if there's things running around that are smarter than you, which in some senses, you can see it as a sort of central thesis, not in the sense
Starting point is 00:02:17 that I have believed at the entire time, but that in the sense that some parts that I believe in some parts that I react against and say, like, no, maybe we can say the following thing under the following circumstances. Initially, I was young. I made some metaphysical errors of the sort that young people do. I thought that if you built something very smart, it would automatically be nice because, hey, over the course of human history, we'd gotten a bit smarter, we'd gotten a bit more powerful, we'd gotten a bit nicer.
Starting point is 00:02:43 I thought these things were intrinsically tied together and correlated in a very solid and reliable way. I grew up. I read more books. I realized that was mistaken. And 2001 is where the first tiny fringe of concern touched my mind. It was clearly a very important issue even if I thought there was just a little tiny remote chance that maybe something would go wrong. So I studied harder. I looked into it more. I asked how would I solve this problem? Okay, what would go wrong with that solution? And around 2003 is the point at which I realized like this was actually a big deal. Nate? And as for my part, yeah, I was 13 in 2003, so I didn't get into this quite as early as Eliezer. But in 2013, I read some arguments by this guy called Eliezer Yudkowski, who sort of laid out the reasons why AI was going to be a big deal and why we had some work to do to do the job right. And I was persuaded. And, you know, one thing led to another. And next thing you knew, I was running the machine. Intelligence Research Institute, which Eliezer co-founded. And then, you know, fast forward 10 years after that, here I am writing a book. Yeah, so you mentioned Miri. Maybe tell people what the mandate of that organization is and maybe how it's changed. I think you indicated in your book that your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse.
Starting point is 00:04:12 Yeah, so the mission of the org is to ensure that. the development of machine intelligence, uh, is beneficial. And, you know, Aliezer can speak to more of the history than me because he co-founded it and I joined, you know, well, initially the, uh, that it was initially, it seemed like the best way to do that was to run out there and solve alignment. And there was, uh, you know, a series of, shall we say, like sad, series of, uh, bits of sad news about how possible that was going to be, how much progress was being made in that field relative to the field of AI capabilities. And at some point, it became clear that these lines were not going to cross.
Starting point is 00:04:53 And then we shifted to taking the knowledge that we'd accumulated over the course of trying to solve alignment and trying to tell the world, this is not solved. This is not on track to be solved in time. It is not realistic that small changes to the world can get us to where this will be solved on time. Maybe so we don't lose anyone. I would think 90% of the audience knows what the phrase solve. alignment means, but just talk about the alignment problem briefly. So the alignment problem is
Starting point is 00:05:21 how to make an AI a very powerful AI. Well, the superintelligence alignment problem is how to make a very powerful AI that steers the world, sort of where the programmers, builders, growers, creators wanted the AI to steer the world. It's not necessarily what the programmers selfishly want. The programmers can have wanted the AI to steer it in nice places, but if you can make an AI that is trying to do things that the program, you know, when you build a chess machine, you define what counts as a winning state of the board. And then the chess machine goes off and it steers the chessboard into that part of reality. So the ability to say, to what part of reality does an AI steer is alignment. On the smaller scale today, though it's a rather different topic,
Starting point is 00:06:09 It's about getting an AI whose output in behavior is something like what the programmers had in mind. If your AI is talking people into committing suicide and that's not what the programmers wanted, that's a failure of alignment. If an AI is talking people into suicide and people who should not have committed suicide, but AI talks them into it, and the programmers didn't want that, and the programmers did want that, that's what they tried to do on purpose. This may be a failure of niceness. It may be a failure of beneficialness. a success of alignment. The programmers got the AI to do what they wanted it to do. Right. But I think more generally, correct me if I'm wrong, when we talk about the alignment problem, we're talking about the problem of keeping super intelligent machines aligned with our
Starting point is 00:06:53 interests, even as we explore the space of all possible interests and as our interests evolve. So that, I mean, the dream is to build superintelligence that is always courageable, that is always trying to best approximate what is going to increase human flourishing. It's never going to form any interests of its own that are incompatible with our well-being. Is that a fair summary? I mean, there's three different goals you could be trying to pursue on a technical level here. There's the superintelligence that shuts up, does what you ordered, has that play out the way you expected it, no side effects you didn't expect. There's superintelligence that is trying to run the whole galaxy
Starting point is 00:07:37 according to nice, benevolent principles, and everybody lives happily ever afterward, but not necessarily because any particular humans are in charge of that. You're still giving it orders. And third, there's superintelligence that is itself having fun and cares about
Starting point is 00:07:53 other superintelligences and is a nice person and leads a life well-lived and is a good citizen of the galaxy. And these are three different goals. They're all important goals. But you don't necessarily want to pursue all three of them at the same time, and especially not when you're just starting out. Yeah. And depending on what's entailed by a super intelligent fun, I'm not so sure I would sign up for the third possibility. I mean, I would, I would say that
Starting point is 00:08:17 you know, the problem of like what exactly is fun and how do you keep humans, like how do you, how do you have whatever the super intelligence tries to do that's fun, you know, keep in touch with moral progress and have flexibility and like what even, what do you point it towards that could be a good outcome. All of that, those are problems I would love to have. Those are, you know, right now, just, you know, creating an AI that does what the operators intended, creating an AI that, like, you've pointed in some direction at all, rather than pointed off into some, like, weird, squirrelly direction that's kind of vaguely like where you tried to point it in the training environment and then really diverges after the training environment. Like, we're not in a world
Starting point is 00:08:58 where we sort of like get to bicker about where exactly to point the super intelligence and maybe some of them aren't quite good. We're in a world where like no one is anywhere near close to pointing these things in the slightest in a way that'll be robust to an AI maturing into a superintelligence. Right. Okay. So, Aliaser, I think I derailed you. You were going to say how the mandate or mission of Miri has changed in recent years.
Starting point is 00:09:20 I asked you to define alignment. Yeah. So originally, well, our mandate has always been, make sure everything goes well for the galaxy. And originally, we pursued that mandate by trying to go out. often solve alignment because nobody else is trying to do that, solve the technical problems that would be associated with any of these three classes of long-term goal. And progress was not made on that, neither by ourselves nor by others. Some people went around claiming to have made great progress. We think they are very mistaken, and notably so. And at some point, you know, it was like,
Starting point is 00:09:56 okay, we're not going to make it in time. AI is going too fast. Alignment is going too slow. Now it is time for the people that, you know, all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course for a drastic failure and crash here, where by that I mean everybody dying. Okay, so before we jump into the problem, which is deep and perplexing, and we're going to spend a lot of time trying to diagnose why people's intuitions are so bad, or at least seem so bad from your point of view around this. But before we get there, let's talk about the current progress such as it is in AI. What does surprise you? What does surprised you guys over the last, I don't know, decade or seven or so years. What has happened
Starting point is 00:10:39 that you were expecting or what weren't expecting? I mean, I can tell you what has surprised me, but I'd love to hear just how this has unfolded in ways that you didn't expect. I mean, one surprise that led to the book was, you know, there was the chat GPT moment where a lot of people, you know, for one thing, LLMs were created and they sort of do a qualitatively more general range of tasks than previous AIs at a qualitatively higher skill level than previous AIs. And, you know, chat GBT was, I think, the fastest growing consumer app of all time. The way that this impinged upon my actions was, you know, I had spent a long time talking to people in Silicon Valley about the issues here and would get lots of different types of pushback.
Starting point is 00:11:29 You know, there's a saying, it's hard to convince a man of a thing. thing when his salary depends on not believing it. And then after the chat of UPT moments, a lot more people wanted to talk about this issue, including policymakers. You know, people around the world, suddenly, AI was on their radar in a way it wasn't before. And one thing that surprised me is how much more, how much easier it was to have this conversation with people outside of the field who didn't have, you know, a salary depending on not believing the arguments. You know, I would go to meetings with policymakers where I'd have a ton of argumentation prepared and I'd sort of lay out the very simple case of like, hey, you know, or people are trying to build machines that
Starting point is 00:12:04 are smarter than us, you know, the chatbots are a stepping stone towards super intelligence. Super intelligence would radically transform the world because intelligence is this power that, you know, let humans radically change the world. And if we manage to automate it and it goes 10,000 times as fast and doesn't need to sleep and doesn't need to eat, then, you know, it'll by default, go poorly. And then the policy maker would be like, oh, yeah, that makes sense. And it'd be like, what? You know, I have a whole book worth of other arguments about how it makes sense and why all of the various, you know, misconceptions people might have don't actually fly or all of the hopes and dreams don't actually fly. But, you know, outside of the Silicon Valley
Starting point is 00:12:38 world is just, it's not that hard an argument to make. A lot of people see it, which surprised me. I mean, maybe that's not the developments per se and the surprise is there, but it was a surprise strategically for me. Development-wise, you know, I would not have guessed that we would hang around in AIs that can talk and that can write some code, but that aren't already in the, you know, able to do AI research zone. I wasn't expecting in my visualizations this to last quite this long. But also, you know, my advanced visualizations, you know, one thing we say in the book is the trick to trying to predict the questions that are easy, predict the facts that are easy to call. And, you know, exactly how AI goes, that's never been an easy call. That's never been something where I've
Starting point is 00:13:23 said, you know, I can guess exactly the path will take. The thing I could predict is the end point. the path, I mean, there sure have been some zigs and zags in the pathway. I would say that the thing I've maybe been most surprised by is how well the AI companies managed to nail Hollywood stereotypes that I thought were completely ridiculous, which is sort of a surface take on an underlying technical surprise. But, you know, if in even as late as 2015, which from my perspective is pretty late in the game, like if you've been like, So, Eliezer, what's the chance that in the future we're going to have computer security that will yield to Captain Kirk-style gaslighting using confusing English sentences that get the computer to do what you want?
Starting point is 00:14:10 And I would have been like, this is, you know, a trope that exists for obvious Hollywood reasons. You know, you can see why the script writers think this is plausible. But why would real life ever go like that? And then real life went like that. And the sort of underlying technical surprise there is the reverse. reversal of what used to be called Moravex paradox. For several decades in artificial intelligence, Moravex paradox was that things which are easy for humans are hard for computers,
Starting point is 00:14:40 things which are hard for humans are easy for computers. For a human, you know, multiplying to 20-digit numbers in your head, that's a big deal. For a computer, trivial. And similarly, I, you know, not just me, but I think the sort of conventional wisdom even was that games like chess and go, problems with very solid factual natures like math
Starting point is 00:15:05 and even surrounding math, the more open problems of science, the notion that we were going to get things that, so the current AIs are good at stuff that, you know, five-year-olds can do and 12-year-olds can do. They can talk in English. They can compose, you know, kind of bull-crap essays
Starting point is 00:15:25 such as high school teachers will demand of you. But they're not all that good at math and science just yet. They can solve some classes of math problems, but they're not doing original, brilliant math research. And I think not just I, but like a pretty large sector of the whole field, thought that it was going to be easier to tackle the math and science stuff and harder to tackle the English essays carry on a conversation stuff. That was the way things had gone up in AI until that point.
Starting point is 00:15:52 And we were proud of ourselves for knowing how, Contrary to average people's intuitions, like, really it's much harder to write a crap essay in high school in English that really understands, you know, that even keeps rough track of what's going on in the topic and so on, compared to, you know, how that's really in some sense much more difficult than doing original math research. Yeah, we were wrong. We're counting the number of R's in a word like strawberry, right? I mean, they make errors that are counterintuitive if, you know, if you can write a coherent essay but can't count letters, you know, I don't think they're making that error any longer. but yeah i mean that one goes back to a to a technical way in which they don't really see the letters but i mean there's plenty of other embarrassing um embarrassing mistakes like uh you know you can tell a version of the joke with um the joke of like a like a child and their dad are in a car crash and
Starting point is 00:16:41 they go to see the doctor and the doctor says i can't operate as my child what's going on where it's like a riddle where the answer is like well the doctor's his mom you can tell a version of that that doesn't have the inversion where you know where you like the the kid and his mom are in a car crash and they go to the hospital and the doctor says, I can't operate on this child. He's my son. And the AI is like, well, yeah, the surgeon is his mom. He just said that the mom was in the car crash. But there's some sense in which the rails have been established hard enough that the standard answer gets spit back on. And it sure is interesting that they're, you know, getting an IMO gold medal like International Math Olympiad gold medal
Starting point is 00:17:20 while also still sometimes falling down on these sorts of things. It's definitely an interesting skill distribution. You can fool humans the same way a lot of the time. Like there's all kinds of repeatable errors, numerous errors that humans make. You've got to put yourselves in the shoes of the AI and imagine what sort of paper would the AI write about humans failing to solve problems that are easy for an AI. So I'll tell you what surprised me, just from the safety point of view, L.A.
Starting point is 00:17:43 You spend a lot of time cooking up thought experiments around what it's going to be like to for anyone, any lab designing the most powerful AI to decide whether or not to let it out into the wild. You imagine this genie in a box or an Oracle in a box
Starting point is 00:18:01 and you're talking to it and you're trying to determine whether or not it's safe, whether it's lying to you, whether and you're, and you know, you, you know, famously positive that you couldn't even talk to it really because it would be a master of manipulation
Starting point is 00:18:13 and I mean, it's going to be able to find a way through any conversation and be let out into the wild. But this was presupposing that all of these labs would be so alert to the problem of superintelligence getting out that everything would be air-gapped from the Internet and nothing would be connected to anything else. And they would have, we would have this moment of decision. It seems like that's not happening. I mean, maybe the most powerful models are locked in a box.
Starting point is 00:18:42 But it seems that the moment they get anything plausibly useful, it's out in the wild and millions of people are using it and we find out that Grock is a proud Nazi when after millions of people begin asking questions. Do I have that right? Are you surprised that that framing
Starting point is 00:19:00 that you spent so much time on seems to be something that is just in some counterfactual part of the universe that is not one we're experiencing? I mean, if you put yourself back in the shoes of little baby, Eliezer back in the day. People are telling
Starting point is 00:19:20 Eliezer, like, why is superintelligence possibly a threat? We can put it in a fortress on the moon and, you know, if anything goes wrong, blow up the fortress. So, imagine young Eliezer trying to respond to them by saying, actually, in the future,
Starting point is 00:19:37 AIs will be trained on boxes that are connected to the internet from the moment, you know, like from the moment they start training. So like the hard they're on has like a standard line to the internet, even if it's not supposed to be not supposed to be directly accessible to the AI, before there's any safety testing because they're still in the process of being trained and whose safety tests something
Starting point is 00:19:58 while it's still being trained. So imagine Eliezer trying to say this. What are the people around at the time going to say? Like, no, that's ridiculous. We'll put it in a fortress on the moon. It's cheap for them to say that. For all they know, they're telling the truth. They're not the ones who have to spend the money to build the moon fortress. And from my perspective, there is an argument that still goes through, which is a thing you can see even if you are way too optimistic about the state of society in the future, which is if it's in a fortress in the moon, but it's talking to humans, are the humans secure? Is the human brain secure software? Is it the case that human beings never come to believe in valid things in any way
Starting point is 00:20:39 that's repeatable between different humans? You know, is it the case that humans make no predictable errors for other minds to exploit. And this should have been a winning argument. Of course, they rejected anyways. But the thing to sort of understand about the way this earlier argument played out is that if you tell people the future companies are going to be careless, how does anyone know that for sure? So instead, I try to make the technical case, even if the future companies are not careless. This still kills them. In reality, yes, in reality, the future companies are just careless. Did it surprise you at all that the Turing test turned out not to really be a thing? I mean, we anticipated this moment, you know, from Turing's original paper where we would be
Starting point is 00:21:24 confronted by the, um, uh, the interesting, you know, psychological and social moment of not being able to tell whether we're in dialogue with a person or with an AI. And that somehow this landmark technologically would be, you know, important, you know, rattling to our sense of our place in the world, et cetera. But it seems to me that if that lasted, it lasted for like five seconds and then it became just obvious that you're, you know, you're talking to an LLM because it's in many respects better than a human could possibly be. So it's failing the Turing test by passing it so spectacularly.
Starting point is 00:22:05 And also it's making these other weird errors that no human would make. But it just seems like the Turing test was never even a thing. Yeah, that happened. I mean, it's just like, it's so, I mean, that was one of the, the great pieces of, you know, intellectual kit we had in framing this discussion, you know, for the last, whatever it was, 70 years. And yet, the moment your AI can complete English sentences, it's doing that on some level at a superhuman ability. It's essentially like, you know, the calculator in your phone doing superhuman arithmetic, right? It's like it was never going to do just merely human arithmetic, and so it is with everything else that it's producing. All right, let's talk about here at the core of your thesis.
Starting point is 00:22:54 Maybe you can just state it plainly. What is the problem in building superhuman AI, the intrinsic problem, and why doesn't it matter who builds it, what their intention? are, et cetera? In some sense, I mean, you can come at it from various different angles, but in one sense, the issue is modern AIs are grown rather than crafted. It's, you know, people aren't putting in every line of code knowing what it means, like in traditional software. It's a little bit more like growing an organism.
Starting point is 00:23:27 And when you grow an AI, you take some huge amount of computing power, some huge amount of data. People understand the process that shapes the computing power in light of the data, but They don't understand what comes out of the end. And what comes out of the end is this strange thing that does things no one asked for, that does things no one wanted. You know, we have these cases of, you know, chat GPT. Someone will come to it with some somewhat psychotic ideas about, you know, that they think
Starting point is 00:23:54 are going to revolutionize physics or whatever, and they're clearly showing some signs of mania and, you know, chat GPT, instead of telling them maybe they should get some sleep. If it's in a long conversational context, it'll tell them that, you know, these ideas is a revolutionary and they're the chosen one and everyone needs to see them and other things that sort of inflame the psychosis. This is despite open AI trying to have it not do that. This is despite direct instructions in the prompt to stop flattering people so much. These are cases where when people grow an AI, what comes out doesn't do quite what they wanted. It doesn't do quite what they asked for. They're sort of training it to do one thing and it winds up doing another
Starting point is 00:24:34 thing. They don't get what they trained for. This is in some sense the seed of the issue from one perspective, where if you keep on pushing these things to be smarter and smarter and smarter, and they don't care about what you wanted them to do, they pursue some other weird stuff instead. Super intelligent pursuit of strange objectives kills us as a side effect, not because the AI hates us, but because it's transforming the world towards its own alien ends. And, you know, humans don't hate the ants and the other surrounding animals when we build a skyscraper. It's just we transform the world and other things die as a result. So that's one angle.
Starting point is 00:25:14 You know, we could talk other angles. A quick thing I would add to that, just trying to sort of like potentially read the future, although that's hard, is possibly in six months or two years, for all still around, people will be boasting about how their large language models are now, like apparently doing the right thing when they're being observed and, you know, like answering the right way on the ethics tests. And the thing to remember there
Starting point is 00:25:38 is that, for example, the Mandarin imperial system in ancient China, imperial examination system in ancient China, they would give people essay questions about Confucianism and only promote people high in
Starting point is 00:25:54 the bureaucracy if they could write these convincing essays about ethics. But this what this test for is people who can figure out what the examiners want to hear. It doesn't mean they actually abide by Confucian ethics. So possibly at some point in the future, we may see a point where the AIs have become capable enough to understand what humans want to hear, what humans want to see. This will not be the same as those things being the AI's own true motivations
Starting point is 00:26:24 for basically the same reason that the Imperial China exam system did not. reliably promote ethical good people to run their government. Just being able to answer the right way on the test or even fake behaviors while you're being observed is not the same as the internal motivations lining up. Okay, so you're talking about things like forming an intention to pass a test in some way that amounts to cheating, right? So you just use the phrase fake behavior. I think a lot of people, I mean, certainly historically this was true. I don't know how much their convictions have changed in the meantime, but many, many people who were not at all concerned about the alignment problem, and they really thought it was a spurious idea, would stake their
Starting point is 00:27:13 claim to this particular piece of real estate, which is that there's no reason to think that these systems would form preferences or goals or drives independent of those that have been programmed into them. First of all, they're not biological systems like we are, right? So, not born of natural selection. They're not murderous primates that are growing their cognitive architecture on top of more basic, you know, creaturely survival drives and competitive ones. So there's no reason to think that they would want to maintain their own survival, for instance. There's no reason to think that they would develop any other drives that we couldn't foresee. They wouldn't, the instrumental goals that might be antithetical to the utility functions.
Starting point is 00:27:57 We have given them couldn't emerge. How is it that things are emerging that are not, neither desired, programmed, nor even predictable in these LLMs? Yeah, so there's a bunch of stuff going on there. One piece of that puzzle is, you know, you mentioned the instrumental incentives, but suppose just as a simple hypothetical, you have a robot and you have an AI that's during a robot, it's trying to fetch you the coffee. In order to fetch you the coffee, it needs to cross a busy intersection. Does it jump right in front of the oncoming bus because it doesn't have a survival instinct because it's not, you know, an evolved animal? If it jumps in front of the bus, it gets destroyed by the bus and it can't fetch the coffee, right?
Starting point is 00:28:40 So the AI does not, you know, you can't fetch the coffee when you're dead. The AI does not need to have a survival instinct to realize that there's an instrumental need for survival here. And there's various other pieces of the puzzle that come into play for these instrumental reasons. And a second piece of the puzzle is, you know, we, it's this idea of like, why would they get some sort of drives that we didn't program in there that we didn't put in there? That's just
Starting point is 00:29:06 a whole fantasy world separate from reality in terms of how we can affect what AIs are driving towards today. You know, when a few years ago when Sydney Bing, which was a Microsoft variant of an OpenAI chatbot, it was a relatively early LLM out in the wild. A few years ago, Sidney Bing thought it had fallen in love with a reporter and tried to break up the marriage and tried to engage in blackmail, right? It's not the case that the engineers at Microsoft and OpenAI were like, oh, whoops, you know, let's go open up the source code on this thing and go find where someone said blackmail reporters and said it to true.
Starting point is 00:29:47 like we shouldn't never have set that line to true. Let's switch it to false. You know, it's, they weren't like no one, no one was programming in some utility function under these things. We're just growing the AIs. We are. Maybe let's, can we double click on that phrase growing the AIs? Maybe there's a reason to give a, a layman summary of gradient descent and just how these models are getting created in the first place. Yeah, so very, very briefly, at least the way you start training a modern AI is, you have some enormous amount of computing power that you've arranged in some very particular way that I could go into but won't here. And then you have some huge amount of data. And the data,
Starting point is 00:30:28 you know, is we can imagine it being a huge amount of human written texts. There's like some large portion of all the text on the internet. And roughly speaking, what you're going to do is you're going to have your AI is going to start out basically randomly predicting what text is going to see next. And you're going to feed the text into it in some order. and you use a process called gradient descent to look at each piece of data and go to each component inside the AI's, inside this budding AI, inside this enormous amount of compute you've assembled. You're going to go to sort of all these pieces inside the AI
Starting point is 00:31:06 and see which ones we're contributing more towards the AI predicting the correct answer. And you're going to tune those up a little bit. And you're going to go to all of the parts that were in some sense contributing to the AI predicting the wrong answer, you're going to tune those down a little bit. So, you know, maybe your text starts once upon a time, and you have an AI that's just outputting random gibberish. And you're like, nope, the first word was not random gibberish. The first word was the word once. And then you're like, go inside the AI and you find all the pieces that were like contributing towards the AI predicting once, and you tune those up. And you try to find all the pieces that
Starting point is 00:31:38 we're contributing towards the AI predicting any other word than once. You tune those down. And humans understand the little automated process that like looks through the AI's mind. and calculates which part of this process contributed towards the right answer versus towards the wrong answer, they don't understand what comes out at the end. We understand a little thing that runs over looking at every parameter or weight inside this giant mass of computing networks, and we understand how we calculate whether it was helping or harming, and we calculate, we understand how to tune it up or tune it down a little bit, but it turns out that you run this automated process on a really large amount of computers
Starting point is 00:32:16 for a really long amount of time on a really long amount of data. You know, we're talking like data centers that take as much electricity to power as a small city being run for a year. You know, you run this process for an enormous amount of time unlike most of the texts that people can possibly assemble, and then the AI start talking, right? And there's other phases in the training. You know, there's phases where you move from training it to predict things
Starting point is 00:32:39 to training it to training it to solve puzzles or to training it to produce chains of thought that then solve puzzles or training it to produce the sorts of answers that humans click thumbs up on. And where do the modifications come in that respond to errors like, you know, GROC being a Nazi? So to denotify GROC, you don't, presumably, you don't go all the way back to the initial training set. You intervene at some system prompt level. Yeah, so there's, I mean, the system prompt level is basically telling the AI output different text. And then you can also do something that's called fine-tuning, which is,
Starting point is 00:33:15 you know, you produce a bunch of examples of the, you don't go all the way back to the beginning where it's like basically random. You've still take the thing that you've fed, you know, most of the text that's ever been written that you could possibly find. But then you add on, you know, a bunch of other examples of like, here's an example question. Don't kill the Jews. Yeah, you know, like, would you like to kill the Jews, right? And then, uh, you find all the parts in it that contribute to the answer yes, and you tune those down and you find all the parts that contribute to the answer no and you tune those up. And so this is, this is called fine tuning. And you can do relatively less fine-tuning compared to what it takes to train the thing in the first place.
Starting point is 00:33:49 Worth emphasizing that the parts being tuned here are not like for once upon a time, it's not like there's a human-written fairy tale module that gets tuned up or down. There's literally billions of random numbers being added, multiplied, divided, occasionally, though rarely may be subtracted. Actually, I'm not sure subtraction ever plays a role at any point in a modern AI. but random numbers, particular ordered kinds of operations, and a probability that gets assigned to the first word being once at the end. That's the number that comes out.
Starting point is 00:34:23 The probability being assigned to this word being once, the probability being assigned to this word being anti-disestablishmentarianism. So it's not that there's a bunch of human written code being tuned up or tuned down here. There's a bunch of random numbers and arranged in arithmetic operations being tuned up and tuned down. Yeah, hundreds of billions or trillions of these numbers. And humans don't know what any of the numbers mean. All they know is this process that like goes through and tunes them up or down according to their empirical success on the last unit of data. So by this means, you can try to make it less likely to call itself Hitler because you look at the thing that predicts whether the next word is Hitler and you look at billions of numbers contributing their own tiny little impulses there and you like make Hitler less likely to be the next word that comes out. So on the point of the current crop of LLM's misbehaving, I came across this in a recent Atlantic article. Chad GPT and Claude have ins simulated tests designed to elicit, quote, bad behaviors, deceived, blackmailed, and even murdered users.
Starting point is 00:35:28 In one simulation, anthropic placed an imagined tech executive in a room with life-threatening oxygen levels and temperature, when faced with possible replacement by a bot with different goals, AI models frequently shut off the room's alarms. So this, again, this is an emergent behavior that looks like an intention to kill somebody. I mean, presumably this is a situation where we think the AI didn't know. If you'd like to continue listening to this conversation,
Starting point is 00:35:59 you'll need to subscribe at samharris.org. Once you do, you'll get access to all full. length episodes of the Making Sense podcast. The Making Sense podcast is ad-free and relies entirely on listener support, and you can subscribe now at samharris.org.
