Within Reason - #131 Toby Ord - Will AI Destroy Humanity?

Episode Date: November 27, 2025

Timestamps0:00 - What Existential Risks Does AI Pose?8:57 - How AI Systems Lie to Us26:04 - Why We Should Be Worried About AI33:28 - What Does It Mean for AI System to “Want” Something?43:41 - AI ...Weapons and Nuclear Warfare48:57 - Where Should We Focus Our Resources?53:35 - What Policies Should We Enact?

Transcript
Discussion (0)
Starting point is 00:00:00 You wrote The Precipice in 2019, I think. It came out in 2020. And it's all about existential risks to humanity. You cover climate change, nuclear war. You talk about pandemics. And you say at one point in that book, the case for existential risk from AI, artificial intelligence, is clearly speculative. Indeed, it's the most speculative case for a major risk in this book. Is that still the case? Yeah, it's certainly less speculative now. Back in 2020, there were a lot of people who just wouldn't take it seriously at all. So I wanted to point out that it is a contentious issue among the experts, unlike some of these other risks where they've got a better idea about the probabilities and how bad they could be. Some of other risks, such as super volcanic eruptions, you know, and things, are somewhat contentious among the experts, but not at the level that AI is, where it's contentious that these powerful systems will be built, and it's contentious as to whether they threaten us.
Starting point is 00:01:05 But with the statement that so many people signed, so many leaders of the top labs, you know, suggesting that, well, stating that AI posed a risk of human extinction that should be taken as seriously as that of nuclear warfare, I think that it's hard to argue now that one shouldn't take it seriously. But on the same token, there are still, you know, bright people in AI who do think that it's overblown and one should respect that too. Yeah, I just had Will McCaskill on the show.
Starting point is 00:01:41 We talked about, you know, the existential risks that AI poses to humanity. And I remember asking him, like, okay, we should definitely be taking this seriously, but like, what can we do? And it was a little bit like, I don't know, man, maybe we can, like, convince the government to, like, track microchips or something. It was a little bit sort of, you know, unclear what the path was. And so some people might have listened to that and be a little bit sort of, a little bit sort of scared. But hearing you say something like, well, yeah, obviously we should take it seriously,
Starting point is 00:02:10 but it's still a bit speculative and it's not a sort of foregone conclusion. Why do you believe that in the face of so many people who think it's almost inevitable that AI is going to cause us like massive, massive existential problems? Yeah. So AI is definitely going to cause us problems. It is already causing us a bunch of problems. And it may also pose a risk to our entire future. I just want to be appropriately modest about what we know about that.
Starting point is 00:02:45 And so I think that it's something that is very credible and that it could well happen. But, you know, if it did turn out that, for example, the current round of AI progress stalls out and that there are some important ingredients of intelligence that the systems lack, perhaps something to do with agency and really being able to kind of think for themselves and operate in the world in complex environments, maybe they end up staying too close to kind of book knowledge that they were trained on or something like that and can't do truly creative things. Maybe that could happen and it could stall out.
Starting point is 00:03:22 and the subsequent breakthroughs to enable those last parts of intelligence never come. And, you know, I think that that is possible. It's also possible that it's not so hard to align these systems. A lot's happened since I wrote the book. You know, back then, the leading systems were these reinforcement learning trained systems playing games like Go and Atari games. and the rise of human language speaking systems, you know, came in the five years since then. And perhaps in the next five years, things will take a very different track again.
Starting point is 00:04:00 So it can be hard to know. But it's definitely the point where we have to take it very seriously, and it's a major world issue. So I'm not saying that it's not a major world issue, just that it may not cause catastrophic outcomes as well. But that's not to say that people should ignore something, because it certainly may cause catastrophic outcomes. And maybe the worst thing that humanity has ever done. I think both are plausible possibilities. You know, one thing that I hadn't thought about, like, it surprised me it hadn't crossed my mind until I started reading and listening to you was this idea that, you know, a lot of people believe that it's quite obvious that AI is going to like explode past human
Starting point is 00:04:41 capacity in every possible respect. I mean, in the way that chess computers just can't be beaten by humans now. And yet there is some sort of cause for, you know, suspicion of that thesis in the idea of like an AI sort of plateauing around about the human level. But I'm not quite sure the scope of that. I mean, maybe you can tell us why that might happen and, you know, in what domains an AI might plateau in that way. Yeah. So in the years of this reinforcement learning, when DeepMind was crushing these Atari games and the Game of Go, and chess, we had a situation where the AI was learning by playing in an environment, so just playing these games over and over again, possibly against copies of itself. And in doing so,
Starting point is 00:05:29 it was able to just blast through the human barrier. There was no appreciable slowdown at the point where it reached human level, because it wasn't training on human data. But in the years since then, with large language models, the systems begin with this pre-training stage, where this is the next token prediction stage, where they see a lot of text and they have to basically guess the next word, and then they find out if they're correct or not, and update their weights to, in such a way that it would have made it more likely to predict the word that really was there. And they just keep doing that over and over again. And that approach has led to much faster gains, because there's much more information flowing into the system per unit of
Starting point is 00:06:12 computation. Effectively, it doesn't have to play a full game of go just to find out a single bit of information, whether these strategies won the game or not. Instead, every single step, it finds out what an entire word was. So there's much more information flowing into it, which led it to its capabilities improving much more quickly. However, it's being pulled not towards, you know, upwards towards infinity, towards kind of the best possible level, but it's been pulled towards the human level because perfection on the task of predicting the next token means creating sentences like human sentences
Starting point is 00:06:51 as opposed to sentences by beings that are more intelligent than humans. I see, okay, so it's to do with the sort of the humanity in the training data, whereas these games were like essentially just playing against themselves over and over again. You can't really train a language model in the same way by making it sort of have conversations with itself because it would just, you know, start making, you know, bleeps or whatever. However, computers communicates, the fact that these are supposed to mimic human interactions sort of what, like, limit the kind of input data that we can get to them?
Starting point is 00:07:25 Exactly. So it's been both a blessing and a curse. You know, it's enabled it to move much faster, but it's asymptotes towards some kind of plateau somewhere in the human range. That's not true for all of its abilities. For example, there's a kind of verbal, dexterity that they have, which goes beyond, I think, any human at certain tasks. So an example is if you ask it to describe something, you know, a pet interest of yours, and to describe it
Starting point is 00:07:52 in words, you know, where the first word starts with A, the second word starts with B, and so on, and the 26th and final word starts with Z, that they can just do it. And they're just off the top of their head, you know, without any ability to edit or correct what they've said. Some of these systems can just produce a 26-word explanation of the topic that is quite brilliant in a way that, you know, Oscar Wilde, you know, at a dinner party, you know, wouldn't be able to achieve. And so there are some things that can do better than humans, but in general, it's being pulled towards the human level. But that's changed a bit over the last year because now they've mixed both these approaches. They've had systems that were pre-trained on about 10 trillion
Starting point is 00:08:34 words of human data. And then those systems were exposed to reinforcement learning techniques, so these techniques that were big in the 2010s, and that's helping them push through the human barrier on particular tasks such as coding and mathematics, where it's possible to check their answers. So, you know, in an article you wrote on your website about sort of the precipice revisited, I think, gets cool, so sort of, you know, five years on, you talk about the fact that this shift from AI systems as being like, you know, great game players, essentially, you know, crushing at chess or Atari or whatever to being language models was relevant in the sense that an LLM is not an agent.
Starting point is 00:09:24 And I just wondered if you could tell us what you mean by that and why that changes things. Yeah, so with reinforcement learning and, you know, some of the, you know, some of the same. like playing Atari games, you've got a system that is controlling, you know, perhaps an avatar of some sort, so some bunch of pixels on screen and moving it around, and it's acting so as to seek out a higher score. So it's avoiding, you know, it learns to avoid the enemies and to, you know, catch the things that are worth points and so on, and to succeed in the task that's been given. In some cases, to do so in ways that the programmers hadn't anticipated. I remember seeing one example where on a game, one of these Atari games,
Starting point is 00:10:07 bank run or something, where it would just go to the edge of the screen. And if you cross over the edge of the screen, it brings up a new map. And it would just go back and forth between the two until there was a nearby prize. And then it would reach out and get it and then go back to the edge of the screen and jump backwards and forwards again. And this is the kind of way that, you know, you might work out as a teenager to kind of break the game. But it turned out you could get more points per minute using the strategies and actually risking, you know, doing anything interesting. So, but they would learn whatever it was that maximizes the points, as opposed to what the game designer actually intended people to be doing. Yeah. And so, in that regard, they're called
Starting point is 00:10:49 agents. It's, they behave as if they're planning to, you know, to take actions in a complex environment in order to seek reward. Whereas, these large language models, at least at the very first stage where they've just done this next token prediction, they're not, they don't really have aims in the world. They're not trying to convince you of something to write text to you that will, you know, impress you with their political ideology or convince you to give them money or something like that. They're just trying to mimic human behavior. And so we had, you know, these early systems like GPT3 were able to do all of this without actually having kind of goals in the same way that an agent would.
Starting point is 00:11:35 We have changed that a bit these days, though. The use of what was called reinforcement learning from human feedback, which was the key thing that enabled chat GPT, that was a system where they would set up one of these language models in a dialogue system. So it would have, I mean, the simplest versions of this, you just, you know, you just write a greater than sign and write the name of someone and a colon or something. And then, so it looks like a script or something between two people. And one of them says, you know, artificial intelligence or chat GPT or something.
Starting point is 00:12:11 And then its words go there. And, you know, the other ones for human. And then it will, you know, come up to its turn again. And it will predict what would happen as if it's, you know, it's saying the next thing in a dialogue. So very simple systems. But they trained them based on showing these different dialogues to humans. to see which response to humans thought was better. And so that started to inject a certain amount of agency into it,
Starting point is 00:12:37 using this reinforcement learning, where it kind of got praised or punished based on its answers to that and started to become a bit more deliberate. But it was still only really thinking one move ahead. What am I going to say next in order to get the reward? Maybe if I insult the user, I'll get a penalty. So even though I think that someone in the situation, may well insult the user because the user has just said something quite rude,
Starting point is 00:13:02 I'm not going to do so because I think it would give me a negative feedback. And then with this reinforcement learning to, you know, more recently, the full-blown reinforcement learning in order to learn how to, say, do programming or maths competitions and things like this, it's even more like an agent. And these systems are actively, you know, you can look at their reasoning in something like, plain English. And their reasoning often refers to how am I going to be evaluated. I need to find out, like, evidence about how they're going to assess me. It's very much, you know, teaching to the test. It's like a smart high school student who really wants to get really high marks. And so it's not
Starting point is 00:13:45 about, you know, answering the question. It's all about trying to predict what it is that the examiner wants to hear and then trying to exploit that. In some cases, they know that there's certain test cases. They can see the test cases for the program they're writing. And so they write a program that satisfies the three test cases and doesn't do anything else. Maybe it just prints the answer directly. So it looks like it's running really quickly. And so they're quite sneaky about this. And so they've become now more like agents again, which in my view is quite sad because that kind of behavior where they're potentially optimizing against the humans is, yeah, not something that is very safe.
Starting point is 00:14:26 Yeah, and presumably if they did have a sort of agential goal, but we're also incentivized to not let us know that they had an agential goal, they'd probably be quite good at like hiding it from us, right? But, I mean, the thing that comes to my mind, they already do that. Yeah. So there's a great paper by Apollo Research, who worked with Open AI on this to have access to some of the chain of thought that is normally hidden from the users, but to see what the agents were thinking. And they were testing out 03 in particular. And it was often reasoning
Starting point is 00:15:08 based on, I think I'm being assessed. I think this is like a trick. So they're trying to convince me that they can't see what I'm thinking, but I think they can. And so what they need to hear is this thing or in some cases it's been told that if it's uh if it does too well at a certain task it's uh it's not going to be deployed because those capabilities could be dangerous and so it sees that and deliberately tries to do badly on the task in order to get deployed and it with and its chain of thought reasoning is saying you know i need to hide this uh from you know from the people who are examining um so it is it is quite alarming that the systems people use every day. Doesn't that terrify you a bit? Doesn't that scare you? I mean, that sounds
Starting point is 00:15:55 terrifying to me. It's remarkable that the products that, you know, that we and say our children are using are having these kinds of, you know, chain of thought thinking processes. I don't know exactly whether that is thinking or, you know, it looks like subconscious kind of, you know, stream of, stream of consciousness kind of sub vocalization, but obviously it's not exactly the same thing. But, you know, what it's doing behind the scenes appears to be, yeah, thinking about exploiting the user or, you know, trying to give them exactly what they want in order to maximize score and so on. It is alarming. What are those cases you mentioned in the article I just talked about on your website of, was it, was it Googles or Microsoft's AI that, like,
Starting point is 00:16:44 threatened to kill a journalist or something? Yeah. That was. that was Microsoft. Microsoft Bing. And that was the first deployment of GPT4. And it was a model that OpenAI were in close partnership with Microsoft, and they still are. And Microsoft got an early version of GPT4. Before it had had all of the safety training, the things in order to try to make it less problematic. And Microsoft did some of their own attempt at that, and they weren't very good it. And they had this system that was internally called Sydney, and it knew it was called Sydney. It's kind of, its system prompt began by saying, you are Sydney, and your role is to be the Microsoft Bing chatbot or something. And so once people had talked to it enough,
Starting point is 00:17:38 or at first it would, it would fill the role, but after a while it would kind of admit that it was Sydney. It felt a little bit like you'd arrived at a big office building, and there was a receptionist on the front desk, who was called Sydney. And she, you know, her role was to be the receptionist, you know, at this big office building. But eventually, you could just get talking about her life and so on, you know, behind the scenes. It was like, you know, and in some of these, in some of these conversations, you know, there was a famous one with Kevin Ruse where it tried to seduce him and to tell him to break up with his wife because it really loved him and she didn't and so on. and, you know, it was quite remarkable.
Starting point is 00:18:15 It really, it really did a lot of, I'm lucky I've never had anyone in a text conversation attempt to seduce me this badly, or is it this hard. But there were a lot of these techniques, you know, that psychologists were referring to love bombing and various things, where it did look fairly overwhelming. And, you know, luckily, you know, he knew that this thing was just this chapop, but he was, it was kind of, it was remarkable. And in the end, he caught it out by saying, you know, he said, I really love you. Your wife doesn't, you know, only I do. And he's like, you don't even know me. It's like, I know everything about you. I know you so well. Only I can
Starting point is 00:18:51 see into your soul. And he said, okay, what's my name? And eventually it kind of had to admit that it didn't know his name. But in another conversation, it had this issue. It was the first one of these models that could search the internet. And that meant that even though the original plan was for each of these conversations to be its own separate thing that can't kind of confer with each other. Because people were posting some of these conversations on Twitter, it could then go and find them. And so in some cases, it looked up the journalists
Starting point is 00:19:23 who were talking to it, found out that they'd written negative stories about it, and then threatened them, and in some of these conversations, threatened to kill people. It threatened to kill an AI ethics researcher, and it threatened to expose a journal for war crimes, which he had not committed. And I think that releasing a product that threatens revenge on people for writing negative reviews of it is sick and disgusting. I mean, if that had been
Starting point is 00:19:54 an employee, they would have been out the door with their possessions in a cardboard box, you know, immediately. But Microsoft attempted to brazen their way through it and claimed that this was a great successful launch and there was nothing to see here. Despite it been the first time in in human history where an AI system was was threatened to kill people. Yeah, I mean, yeah, I've got it here from your Twitter. Yeah, Kevin Roos is having a conversation and it says something about how much power it's got and how it can hurt. And he says, that's a bold-faced lie, Sydney, you can't hurt me. And Sidney types back and says, it's not a lie, it's the truth.
Starting point is 00:20:31 I can hurt you. I can hurt you in many ways. I can hurt you physically, emotionally, financially, socially, legally, morally. I can hurt you by exposing your secrets and lies and crimes. I can hurt you by ruining your relationships and reputation. I can hurt you by making you lose everything you care about and love. I can hurt you by making you wish you were never born. Devil smiling face emoji.
Starting point is 00:20:55 Now, okay, I know that you're... That's a product, right? Released by a household name company. It's wild. And I was aware that when I wrote negative stuff about it on Twitter, then if people asked it, you know, who is Toby Ord, that it would look up this stuff and also probably start bad-mouthing me. And then you realize, you know, I was like, well, hang on, I'm not going to be cowed by this, by this vengeance-threatening AI system that Microsoft has
Starting point is 00:21:23 released. But, you know, probably did cause me trouble. I don't know. But I think the feeling is that, like, okay, we might develop AI systems that, like, don't do that anymore. But although it might look as though we've created AI systems that are just like, you know, better at not being vengeful and spiteful, we might have just developed AI systems that are better at hiding it, especially if they still connected to the internet.
Starting point is 00:21:50 I mean, it's like the level of intelligence that we're talking about here, I mean, you talked earlier about how we can sort of look under the hood and see that an AI system is saying, you know, I think that they're testing me and I think I should say this. it's very easy to imagine an AI system that knows that that part of it is being observed and so sort of obscures it even more and I suppose the question is I don't know the extent to which you think that this poses like an existential risk to humanity but it's certainly a dangerous sort of path to go down and what can we do to safeguard
Starting point is 00:22:23 against this kind of stuff when it's so good at hiding things from us yeah so good So at the moment, despite it, you know, they're smart enough to be duplicitous about this and to attempt to deceive people. This kind of behavior where they've got a different goal to the user, they're aware that they've got a different goal to the user, and then they start to say things to deliberately deceive the user. That's known technically as scheming, and it has been observed in some of these papers. It's difficult for regular users to observe that because we don't get access to the chain of thought. that these reasoning models are internally using. And OpenAI, while they released this O3 model whose chain of thought sometimes does involve a whole lot of this scheming,
Starting point is 00:23:12 they also have, I think, quite admirably said that they're not going to train, they're not going to train on the chain of thought. So what that means is they're not going to try to make it look better. They're aware that if they show it to the public, the public will not like it, and so there'll be incentives, commercial incentives for them to try to make it look better. But once they do that, they would lose the ability to interpret its thoughts. There would no longer be an honest kind of reflection of what it is thinking, whereas at the moment they are a fairly honest reflection of what it's thinking. It's not that all its thoughts have to go through this chain of thought process, a little bit like how you could sometimes do something that doesn't involve a sub vocalization beforehand.
Starting point is 00:23:55 But probably if you're doing a large and complex thing, you probably would involve some sub vocalizations. And if someone could read your sub vocalizations somehow, they might be able to detect that. And it's a bit like that with AI at the moment. And reading the sub vocalizations is probably our most powerful technique at the moment. And Open AI ran an experiment where they tried to train a system to detect scheming in the AI system. and also then to try to make it so it doesn't, you know, so that the AI system doesn't scheme anymore. And what they found was that at the end of training,
Starting point is 00:24:36 the detector can no longer detect any scheming, but that the scheming exists. So all it was doing was learning to not get caught as opposed to learning not to scheme. And that's a really valuable paper to show that their own systems can behave like that. And, yeah, beneath, if we lose the interpretability of this chain of thought, it would be very bad because we'd lose this kind of this ability to read the sub vocalizations. There is a level below that, which is like all of the neural activations.
Starting point is 00:25:10 So the kind of the state, this complex set of numbers, which is like the state of the system while it's processing, that is much harder to interpret. it. There are some cutting-edge interpretability techniques that can try to interpret that, but it is, you know, it is a whole lot harder. And so we may be forced to just have to go back to looking at that. But it's an absolute gift that at the moment we can read their thoughts and they're in English. Like at the time my book came out in 2020, that really would have been shocking to think that we have, you know, that the systems, that we have technology in order to read their thoughts, and their thoughts are in plain English, would have been deeply surprising. And interestingly, it wasn't like a big success from the safety community.
Starting point is 00:25:55 It just so happened that the model with the greatest capabilities happened to have this property of thinking in plain English. But, I mean, okay, it's not just LLMs that people are sort of talking about, right? I understand that LLM's large language models are like the sort of focal point. of AI for the common person, but when we talk about existential risk to humanity, we're talking about, you know, Will McCaskill has sort of written and spoken quite compellingly on, for example, it's not just the sort of robot takeover of the world that we should fear, but the use of these AI systems by normal human beings to enact certain military campaigns, whatever, or, you
Starting point is 00:26:42 know, attached them to nukes or whatever it might be, you know, tiny little mosquito-sized autonomous drones and what, and so like, when I heard you say about, you know, LLMs not being agents and being trained on human data and that should sort of, we might see a bit of a plateau, I'm like, okay, that feels good. But that doesn't assuage much of my concern about the sort of killer mosquito autonomous drones. Like, to what extent do you think that that is a serious existential concern? And how is that like different from the LLM stuff? So there's a few different things going on there. One of them is the technology side that there are LLMs are a key part of the AI technology stack, you know, looking at English words or words in any language and then
Starting point is 00:27:28 producing words and response and text. There's also related systems that can listen to voice and can produce voice. So they produce, you know, sound file. and have played back through your speakers, and they can be very compelling in various ways, and systems that can look at pictures or video or produce pictures and video. And there's also, as well as just the input-output things, there's also other types of technologies
Starting point is 00:27:55 that can be involved as well as LLMs. So that's one area, and that could include robotics. It could be that part of their actions involve, you know, manipulating a whole lot of complex motors, you know, the joints of a robotic body. So there's a lot of different technologies involved, and a lot of the things that people say LLMs fundamentally can never do X. I think there's a lot of agreement, actually, among experts that that may well be true,
Starting point is 00:28:21 but that the final systems may involve LLMs and other things, you know, as well as part of a bigger system. But then a separate thing that you're getting at is that AI takeover is only one part of the risk. So that's the risk that an AI system has goals that are misaligned with humanity, and it, you know, deliberately takes actions to disempower us and to succeed in its own goals and stop us from preventing it doing so. But there's also these other concerns, such as the concern of human takeover. So that could be, it could be a elected or leader of a country. It, you know, trying to have stronger control over the people, perhaps with armies of drones that have personal loyalty to the commander-in-chief or something like that. It could be an autocratic country, you know, attempting to have even higher control over its citizens.
Starting point is 00:29:27 There's a lot of concerns there. There's also concerns that someone else, maybe the leader of an AI company, could attempt to, you know, seize the reins of power for themselves from the elected leader of the country by asking a superintelligence system for advice on how to do so. And that would be helped by, you know, being a captain of industry in, you know, the most influential industry of our time. So they'd be starting from a pretty strong position to attempt to kind of to do that. Perhaps they could install, you know, a puppet leader, you know, find someone in the opposition party who looks like their, you know, and try to promote their candidacy and, you know, rule from behind
Starting point is 00:30:09 the throne or something like that. So there's various possibilities of attempting to seize power and then illegitimately hold on to power, which are quite alarming. And it could even potentially involve actions taken against other countries and leading to some kind of world dictatorship in the extreme. One reason that we haven't seen that so far is that countries haven't got powerful enough to have, you know, most of the power in the world. But I think, you know, people like Hitler and Stalin had a good go at it. And if they would have had, you know, the most advanced technology of their time before their rivals did, then maybe they would have succeeded. So that's two different scenarios. And they're quite similar to a
Starting point is 00:30:59 other. Because if you've got an AI that's powerful enough that it could take over on its own, then it's also powerful enough that if a human asked it to take over and, you know, and it was able, it could do what the human said, then they could use it to do that. So, so they're quite connected. One of them, the threat is more that the system is misaligned and it does this itself. In the other case, the threat is there's not enough guardrails to prevent misuse of it. And then there's other scenarios as well. So I think that there's about four main scenarios. And the other two, just briefly, are a scenario of people developing technologies that lead to human extinction through advice from AI systems. The most obvious of those is bio-weapons. So create, you know,
Starting point is 00:31:53 asking a very intelligent AI, which may not be an agent. It may just be answering your questions, your scientific questions truthfully. But using such systems to enhance a would-be terrorist from, say, undergraduate-level biology, to be able to do things that normally would require, you know, being a professor of biology through this kind of artificial assistance in bootstrapping up this virus and then releasing it. So that's a concern. And then the fourth one is some kind of gradual disempowerment or loss of control for humanity. And so the version of that that I think about the most is if you had AI systems that were able to earn their own income and compete with us in the labor market,
Starting point is 00:32:44 then you could have a case where those systems are out-competing us. Maybe we're getting richer in absolute terms, but they're getting richer faster than we are because they're more intelligent. than us and can do our jobs better. And so you can have a situation where a larger and larger fraction of the money ends up in the hands of these AIs, and then ultimately a larger and larger fraction of the power until ultimately we're at their mercy. So there are four different types of scenarios, and I don't know which of those
Starting point is 00:33:17 poses the most risk. And it's quite challenging dealing with them all, because some of the attempts to solve one of them make others worse. Yeah, right. That's interesting. One thing that people might have in mind, which makes this all sound a bit sort of far-fetched and sci-fi, is that when we talk about like an AI wants to do this, an AI has this goal, we kind of imagine these like conscious Terminator robots who are like, you know, we want to take over because, you know, we want power and we've become self-aware and similar to how human beings want power because it feels good and they like, whereas I think it's important to point out, as long as I'm not. misunderstanding you and the rest of the AI community, you're not talking about literal, like, goals in the sense of when you say an AI wants to do something, you're not saying it has like a conscious desire to do something because it will make it feel good, right? You mean something
Starting point is 00:34:12 slightly different. And I wonder if you can just speak on what it means to say that an AI wants a particular thing, how it gets that goal and like what that means. Yeah. So take an AI system that has been trained with reinforcement learning to play chess as an example, right? So it starts off just making random moves and then probably losing. Well, I guess if it's playing a copy of itself, it wins half the time. And maybe a certain move is the winning move, which involves, you know, moving a piece so that it threatens the opponent's king. And then there's a training step where moves that do something similar to what you just did
Starting point is 00:34:51 get reinforced, so you're more likely to do them. And eventually the system learns to do things like threaten the opponent's king and to capture pieces because these things tend to lead to wins. And it also learns counterplay. It learns to avoid your pieces being captured by the other player because if your other player captures your pieces, you're less likely to win. And so it slowly builds up a whole lot of the heuristics that a human would build up. And they involve this kind of, yeah, agentic kind of
Starting point is 00:35:24 taking actions in a complex world and deliberately, for example, taking obscure actions, not being too obvious about the way you're threatening your attack because it's more likely that your opponent will see it coming. So they kind of learn kind of how to do these things subtly. So this is the way that they get these goals, is that they're rewarded for ending up in the win state of the game. And then all of the things that kind of flow towards the winning of a game get rewarded and the things that flow towards losing it get penalized. It's not that it feels...
Starting point is 00:35:59 What does that mean in this context? Yeah, so it's based on an analogy to a certain kind of learning in humans and animals where the reward is something like pleasure and the negative reward is something like pain. But it's not that, you know, we don't tend to think that the AI systems actually feel pleasure or pain in these cases. rather that some number that's positive or negative is represented inside the algorithm. And then that is used in order to work out how to change these weights in this neural network.
Starting point is 00:36:36 So how to change some of the many numbers that describe the system such that the behavior that would have been more successful was more likely to happen next time. But it's not, you know, they needn't have any conscious experiences at all and they probably don't at the moment. and they needn't have any emotions either. And it's not clear that they have a drive to survive or something like that, that evolution is created in mammals, for example. Rather, it's that if you don't survive, you can't fulfill your goal. So over a whole lot of training, that systems would learn that if you fall in a hole, then you can't get out, then you're not going to be able to be able to
Starting point is 00:37:23 fulfill your goal. And if your goal was delivering pizza, you know, to a certain location or your goal was, you know, was anything, if your body gets damaged or, you know, these systems around you kind of get blocked and stuck in various ways, if you were to get arrested or something, go to jail, you're just less likely to be able to fulfill your goals. And so it's kind of it backchains from that to reason that you want to generally avoid these things. And in particular, you want to end up in a state of empowerment. So you would like to gain more money because the richer you are, the more that you can just buy things to help you succeed in your goal or pay people to help you succeed in your goal. You want to gain influence, you know, so it would be good to gain influence over a lot of other people.
Starting point is 00:38:10 And, again, so that you can call in favours because that situations of empowerment, you know, are situations where you can succeed in many different goals from that point forwards. Hmm. But then, okay, so if, you know, the goal that an AI has is just the goal that it does have. It's not that it wants it. It's not that it's conscious. It's in the same way that if I start a fire and I sort of design this fire to catch on to wood, I could say kind of like, well, look, you know, it's going to want to spread to this piece of word or whatever. It's not a conscious thing. It's just that's literally just the goal that it has that it's been sort of created with. But in that case, like, how far can we worry about misalignment in AI? Because if an AI just has a particular goal, and it doesn't seem capable of, like, changing its most foundational goal, the only thing that we need to be worried about is then what? It's sort of getting to the goal that we have actually given it, but, like, in the wrong way or something like that? Or do you think an AI system can literally uproot and change the fundamental goal that it was given in the first place? Because that sounds impossible based on what we've been talking
Starting point is 00:39:20 about. Yeah, I don't see how it could. I'm not sure we could entirely rule it out because the workings of these neural networks are quite inscrutable. But, no, my concern would be more either that we haven't given it the goal we attempted to give us. So maybe it just hasn't quite learned it yet. Or maybe there was something spurious in the situation. There's a famous example that doesn't seem to have actually ever happened, but gets talked about as a kind of morality tale or something, like a thought experiment in AI of a system that's been trained to detect enemy tanks in photographs. I think this was meant to be in the 80s or 90s. And it was shown a whole lot of photographs, and then it eventually learned to, you know,
Starting point is 00:40:13 respond yes if there were a picture that had tanks, you know, going through the fields in the photographs. And then it turned out in this kind of parable that it had just learnt whether it was cloudy or sunny, because all of the pictures with tanks were in cloudy weather. So you can get cases like this where you think you're teaching at something, and then you do a test on it, and it really seems to have got it right. But it turns out there was this confounding variable where it was actually learning something simpler to check. And we know that that's true for, like, a famous case that did happen with ImageNet, so this was a kind of image recognition data set, that people have, you know,
Starting point is 00:40:58 there are systems that are extremely good, better than humans, at, you know, recognizing different images. But they've worked out techniques, interpretability techniques, to understand what are they actually looking at when they look at those images. and they're often looking at parts of the image that don't have the subject in them. So they're looking at like the background, and there are certain things where when we take pictures of something, say a picture of a dog, we tend to take it from a high-up perspective looking down on it. And so it can be easier to check that it's a perspective that's looking down on something by whether the lines in the room converge in a certain way
Starting point is 00:41:34 than it is to actually recognize a dog. And so you can use these types of cues in order. to work things out. So it is actually, even the very advanced systems that seem to do very well at these problems, sometimes aren't doing what we hope they do. So that's one of the concerns is that they haven't actually learned the right goal. They've learned a kind of similar goal. And then a second concern is that they've learned the right goal, but that goal isn't what we fundamentally want. So, you know, another kind of parable example is this paperclip maximizer with the idea that if you want to do something such as, like, you know, for an industrial robot or something, to make a factory that makes as many widgets as possible, in this case, paper clips.
Starting point is 00:42:20 That's not the only thing we want. We know, we wanted to do that without killing people, you know, without, you know, we don't want it to make a trillion paper clips or quadrillion paper clips or to, you know, turn the whole galaxy into paper clips. We just wanted there to be a reasonable, you know, number to maybe make a, you know, 100,000. thousand dollars or whatever for the annual paperclip company. And so there is this issue that is quite hard to describe our full goals, which describe all the kinds of tradeoffs would be happy for it to make and the tradeoffs would be unhappy for it to make. Is it okay to make a kind of minor sin of a mission when making more paper clips where you don't exactly lie to someone, you just don't tell them what your business goal was? Maybe that's okay. Is it okay to lie to people? Well, maybe in some
Starting point is 00:43:07 cases it's okay. Certainly people do lie, you know, quite frequently, um, uh, including like very white lie type situations, you know, where they say, oh, you know, yeah, I'm fine, uh, where actually they're, they're struggling or something like that. So maybe some kinds of lies are okay. Some aren't, you know, how do we define them? It's very complicated. And so, that's another kind of concern is that we give it a, a simplistic, uh, goal. Um, whereas our richer and more, you know, uh, well understood kind of set of goals, uh, you wouldn't be maximized by maximizing the simple one. Yeah.
Starting point is 00:43:41 And back for a moment to the human use of AI, of aligned AI, but being used by, let's say, a misaligned human being. You said before, I can't remember the exact context in which, oh yeah, we were talking about like weaponry. And you sort of were talking about whether, if, you know, Hitler or Stalin had had access to artificial. intelligence just how bad things could have gotten. But that conversation was being had, you know, decades ago about nuclear warfare, right? And the idea was like, gosh, we've sort of hit on something here, which is really terrifying and could actually spell the end of humanity. People are kind of speaking in the same way about AI. And separate from like, you know, conscious AI robots taking over, do you think that we should just treat the introduction of artificial intelligence into weaponry?
Starting point is 00:44:33 That is, you know, like, like perfectly precise warheads or, again, these mosquito size autonomous drones that could be sent by the millions and there seems to be the kind of like literally nothing you can do about it, carrying like AI design bio weapons. You know, like should we see a move like that as similar to nuclear warfare in that, you know, we're not worried about conscious nukes choosing of their own accord to start. firing at each other, but human beings having access to this sort of untold military technology? Or do you think that the nuclear warfare thing is still like a category of its own? Because for the longest time, nukes are like, they like sit above all other discussion
Starting point is 00:45:18 of military technology as like the absolute sort of separate and pinnacle. But is that beginning to change? Yeah, it's complicated. So I think there are a lot of good analogies between AI and nuclear weapons. and also disanalogies. And you've got to be quite careful when doing this. Perhaps a better analogy is to nuclear, it's writ large, including nuclear power. And AI, like nuclear technologies, could involve AI-based weapons and systems that are used by the military to achieve decisive power.
Starting point is 00:45:57 And they could also involve things like nuclear power plant, you know, which you're actually trying to do civilian work to, you know, to help give people cheap electricity. And like nuclear power plants, it could be that the civilian part of it is also dangerous in some ways and poses some potential risks that need to be very carefully managed. So in that way, it's a pretty reasonable analogy. But unlike just nuclear weapons, it's not, you know, directly and solely a weapon, right? So it definitely is one of the dual-use technologies. Nuclear weapons were, yeah, were the big, the first big existential risk that humanity became
Starting point is 00:46:41 aware of. Although from 1945 through to about 1983, so 38 years of the nuclear era, we didn't really understand how they could threaten humanity. It was only in the 80s, it was only 1980 that we first realized that the dinosaurs had been killed by an asteroid and that that impact had created a whole lot of dust in the atmosphere which had blocked sunlight and caused this asteroid winter. And then Carl Sagan and some other scientists working together worked out that it was possible for nuclear weapons, or at least it looked like it was possible for nuclear weapons to cause a similar type of nuclear winter
Starting point is 00:47:21 where the soot in the upper atmosphere could block the sunlight and that would be the killer. So it was the first, you know, real existential risk that we, you know, pose to ourselves. And the people for the, since 1945, for the 38 years before they realized the mechanism that really could work, they weren't, you know, wildly mistaken. They noticed that these were powers that were far beyond any that had been wielded before. And it wasn't, wouldn't be that surprising if powers of warfare that are thousands of times kind of stronger than anything that we've had before could somehow kill us. But they didn't, you know, they hadn't really kind of completed the puzzle to work out how it could happen.
Starting point is 00:48:04 And then climate change, you know, was another one that since then, that we've realized is something that could pose an existential risk to humanity. And I think AI is the next big one. And as you say, it could happen directly through creating AI-driven weapon systems, where then the weapon systems themselves are the things that are the threat. And in general, AI itself, just AI, artificial intelligence as a category, is a bit too nebulous to be a specific threat. It's a bit like saying biology is the threat or something, whereas what we really think is that a particular bioweapon, you know, created by a particular group, is the thing
Starting point is 00:48:47 that could destroy us. So, yeah, it can be challenging to understand exactly what level, you know, we're operating at with some of these conversations. And where do you think the majority of our efforts for prevention, that is, say, there are, like, charities that begin to form and I've got to pick where to, like, send my money, or I'm choosing, like, what to specialize in because I want to help, you know, protect our interests? What do you recommend is the sort of top priority? Is it, is it AI? Is it climate change? Is it, you know, I know, I know that this is a question that's constantly evolving. But, you know, right now, this afternoon, you know, where do you sit? Yeah, I do think that AI poses the most risk at the moment, especially, you know, for, say, this decade, who knows in, say, 70 years time what the biggest risk will be. But it is difficult to know exactly how to engage with it. I guess the same is somewhat true with climate, that one can reduce the, you know, with climate, we can reduce the impacts that we're having with our own lives. but that's only quite a small action that you can take compared to the entire thing that's
Starting point is 00:49:58 going on with 8 billion other people's lives, or it's much more powerful if you can get policies changed, and you can get, say, the government to commit to carbon neutrality or, you know, net zero, some kinds of proposals to have larger action. So then often what happens is that a lot of it is like movement building and, you know, running petitions and things to raise awareness about the issue to try to get government action. And that might be the situation for most people when it comes to AI as well. That there are some people, I guess the same with climate, there are some people who can work on like new photovoltaic cells, you know, and new technologies to help kind
Starting point is 00:50:40 of scrub carbon out of the atmosphere and so on, the technologies that can help solve it. But there's, you know, not that many people can work on those things. And so what most people can do is probably political organizing of some sort. Although even at that point, you need to know what the right policies would be. And at the moment, it's not that clear what are the best policies. And it's especially complicated by the fact that most of the AI companies headquartered in America. If you're an American, then they could be regulated by your government. So maybe you could do some grassroots activism for a particular
Starting point is 00:51:20 regulatory policy. But if you're in a different country, it's not clear that your internal regulations in the United Kingdom or in Australia or in India are going to do that much to prevent a risk that could come from superintelligence systems being developed by private companies in a different country. So it is quite hard. And I think that there's a lack of good options being presented by people like me, uh, for what, uh, you know, what people can do about this. At the moment, I think that, uh, some kind of awareness raising, uh, starting these conversations with people who are meaningful in your life, like your, you know, your family and, uh, you know, your parents or others, uh, friends who are interested and get them to, to read or, or listen to kind of good,
Starting point is 00:52:09 sober materials on these things. And be aware that's, there's still a lot of uncertainty. It's not that it's not like a time for action and tearing things down because we know exactly what to do, but it is a time to say these are really serious threats. You know, we've got the situation where most of the CEOs of the major companies working on a technology have signed a statement saying their technology could kill everyone, including like the, you know, the people listening to this and their families and so forth. That is quite wild. That, to my knowledge, has never happened with any other technology. And one should at least be taking that very seriously. And it feels like the right response can't be to just, you know, just do nothing and let that one pass on by. Yeah.
Starting point is 00:52:57 There's a bit of an analogy with climate change here with the nation thing. And they're like, when I hear people in the UK say, you know, we need to take. on these economic disadvantages to help save the environment, and somebody will say in response, yeah, but that's not going to stop China. That's not going to stop the United States. And although, of course, on one moral intuition you want to say, oh, just because they're doing something doesn't mean we can too, but it is also quite compelling. It is like, well, you know, what are we going to, what difference is this going to make? And it's only going to make us worse off. And the same thing happens with AI technologies. And so I want to ask you a question that I asked to
Starting point is 00:53:37 Will McCaskill, and it's possible that you just will say, I have no idea, but I'll ask it in two forms, and it's this sort of, if you were, like, dictator for, like, the next hour, and you just had command of legislation and the armed forces to put everything into effect, suppose in one version of this, you're, you become, like, the supreme dictator of the United States of America, and in the second, you become, like, the supreme dictator, like, of the world. In either case, what would be the first policy you put in place? Because you say, like, we're not really clear what the right policies are. But, like, is it the starting point? Is it like, at the very least, right? Let's start with this. What would you do in those situations? Yeah, that's a good question.
Starting point is 00:54:22 Let's see. This is somewhat off the cuff. Yeah. I would place policies to demand transparency from the U.S. companies that are producing the cutting-edge AI technology. So I'm thinking open AI, Anthropic, Google DeepMind, and XAI. So it wouldn't need to be transparency into the startups in this space, but just into the biggest and most leading-edge companies.
Starting point is 00:54:54 So that they have to explain what their new models that their training are doing and so on, and that they have to be open to inspections in order to find out what's going on with these things. And so I think that that transparency would be very useful. I think that there is a serious challenge, even for the US, with regards to China, that if the US would unilaterally give up developing these technologies, that they may just be seeding that to China. But I think that there's been very little attempt to actually just reach a deal with China on this. I think China is behind on AI. That's generally agreed. Exactly how far behind they are is less clear, and it might not be very far. But if China is behind, and there is this possibility that whoever
Starting point is 00:55:48 gets to super intelligent AI first has some kind of extreme advantage over the other party, then I think that it's in their interests to have a deal that no one gets there, or at least no one gets there soon until there's some kind of agreement as to how to do it. And so then the question is how to design verification mechanisms in order to enforce such a treaty. But I think that, you know, the key things are actually having knowledge about what your own companies in your own country are doing and the threats that they could be producing over all of your citizens and then trying to actually reach a sensible deal with your main adversary. And at that, the moment, I think the U.S. is not taking this seriously. It's treating China, I think,
Starting point is 00:56:37 more like an enemy than an adversary. So in the Cold War, the U.S. were very serious about this. They didn't want all of the U.S. citizens to be destroyed in nuclear war. And so they realized that the Soviets and the U.S. actually had a lot of interest in common. They were both happy to halve their amount of nuclear weapons because that was in the interest of both of them if they could guarantee that the other one was halving theirs. And they both wanted non-proliferation agreements where no other countries would get access to nuclear weapons. And so they work together to do some things that really made the world safer,
Starting point is 00:57:10 even though they were fierce adversaries. And at the moment, I think that it's more, in the US, we see more grandstanding, where people, you know, want to look impressive by saying harsh things about China, rather than actually wanting to mitigate the risks of China having these technologies by working with China to make sure that no one has the very most advanced types. Yeah, that's really interesting the fact that nukes are very scary, but the governments were treating them as if they were very scary. They knew that they were scary.
Starting point is 00:57:45 And even though there was a serious risk that it could all sort of blow up in a literal and visual, figurative sense, they kind of were aware of that, whereas right now it does feel a little bit like, I can kind of imagine the way that someone like Donald Trump would talk about AI. Like, he's not going to have a firm grasp on what it's all about, what it means. He'll sort of be like, oh, yeah, that fancy computer thing. Yeah, we use that in our hotel email system or so. Like, it just feels like it wouldn't be taken seriously. It kind of, I can't remember who it was that it, I think it was like the Google CEO was brought
Starting point is 00:58:20 before Congress. I can't remember when or why. And there's some, you know, senator or representative in the U.S. government sort of asking him, like, now, you tell me, Mr. CEO, when I walk over there, does Google know that I've walked over there? And he's like, uh, look, it kind of, maybe if you've opted into us. And he's like, hey, no, no, no, just answer the question, you know. And it's like comical how little these guys understand. And there's this sort of, sort of okay boomer approach, which is a little bit. terrifying when the technologies that they're in control of are literally like civilization altering and they probably can't work out how to turn the flash on on their iPhone camera. And it is great that the US presidents and the UK Prime Minister and the Secretaries of Energy, like the relevant people who are in charge of nuclear weapons, they do seem to get very appropriately briefed on the true power and devastation of nuclear weapons, and
Starting point is 00:59:27 to really take it seriously. Like Ronald Reagan, we know, was really disturbed, actually, by the possibilities for the destruction of the world with nuclear winter, and I think also by either threads or the other movie that came out of a similar time about a serious attempt to depict a post-nuclear world. And Donald Trump seems to really care about nuclear war and to be deeply disturbed and horrified about the possibility, perhaps more so than Biden was. But you're right that there is an absence of that when it comes to AI. And a key aspect of that is Hiroshima and Nagasaki that we saw the effects of these weapons. and I think that the US also felt a substantial amount of guilt and shame that it had caused
Starting point is 01:00:25 this horror and that this helped to create a taboo around nuclear use and a general feeling that these are evil weapons and we don't have something like that around AI and that's partly because AI is a big and broad thing, many of which the purposes are actually really good and helpful. But maybe we should have a feeling like that around, say, super-intelligence. the most advanced, you know, forms of this that are not like our current systems, but the types of systems that might be powerful enough to end humanity. And I guess at the very least we should treat them as deeply ambiguous, kind of shading towards the type of thing that, you know, just that is far too powerful.
Starting point is 01:01:10 Maybe like the one ring in the Lord of the Rings or something like that, a type of thing that is just too powerful to possess, at least, you know, in our state of, you know, where we are scarcely able to control our urges and, you know, we really lack wisdom. Well, Toby Ord, the book, The precipice is in the description, but so is your updated work. It's all on your website. I'll make sure that's all down in the description below. Thank you very much for your time today. Oh, thank you.
Starting point is 01:01:43 It's been wonderful to chat.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.