Lex Fridman Podcast - #368 – Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization

Episode Date: March 30, 2023

Eliezer Yudkowsky is a researcher, writer, and philosopher on the topic of superintelligent AI.

Please support this podcast by checking out our sponsors:
- Linode: https://linode.com/lex to get $100 free credit
- House of Macadamias: https://houseofmacadamias.com/lex and use code LEX to get 20% off your first order
- InsideTracker: https://insidetracker.com/lex to get 20% off

EPISODE LINKS:
Eliezer's Twitter: https://twitter.com/ESYudkowsky
LessWrong Blog: https://lesswrong.com
Eliezer's Blog page: https://www.lesswrong.com/users/eliezer_yudkowsky
Books and resources mentioned:
1. AGI Ruin (blog post): https://lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
2. Adaptation and Natural Selection: https://amzn.to/40F5gfa

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
YouTube Full Episodes: https://youtube.com/lexfridman
YouTube Clips: https://youtube.com/lexclips

SUPPORT & CONNECT:
- Check out the sponsors above, it's the best way to support this podcast
- Support on Patreon: https://www.patreon.com/lexfridman
- Twitter: https://twitter.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Medium: https://medium.com/@lexfridman

OUTLINE:
Here's the timestamps for the episode. On some podcast players you should be able to click the timestamp to jump to that time.
(00:00) - Introduction
(05:19) - GPT-4
(28:00) - Open sourcing GPT-4
(44:18) - Defining AGI
(52:14) - AGI alignment
(1:35:06) - How AGI may kill us
(2:27:27) - Superintelligence
(2:34:39) - Evolution
(2:41:09) - Consciousness
(2:51:41) - Aliens
(2:57:12) - AGI Timeline
(3:05:11) - Ego
(3:11:03) - Advice for young people
(3:16:21) - Mortality
(3:18:02) - Love

Transcript
Starting point is 00:00:00 The following is a conversation with Eliezer Yudkowsky, a legendary researcher, writer, and philosopher on the topic of artificial intelligence, especially superintelligent AGI and its threat to human civilization. And now a quick few-second mention of each sponsor. Check them out in the description; it's the best way to support this podcast. We got Linode for Linux systems, House of Macadamias for healthy midday snacks, and InsideTracker for biological monitoring. Choose wisely, my friends. Also, if you want to work with our team, we're always hiring,
Starting point is 00:00:36 go to lexfridman.com slash hiring. And now, onto the full ad reads. As always, no ads in the middle. I try to make these interesting, but if you must skip them, please still check out the sponsors. I enjoy their stuff, maybe you will too. This episode is sponsored by Linode, now called Akamai, and their incredible Linux virtual machines. It's an awesome compute infrastructure that lets you develop, deploy, and scale whatever applications you build, faster and easier. I love using them. They create this
Starting point is 00:01:07 incredible platform, like AWS but better in every way I know, including lower cost. It's incredible: human-based in this age of AI, human-based customer service, 24/7, 365. The thing just works. The interface to make sure it works, and to monitor it, is great. I mean, it's an incredible world we live in, where, as far as you're concerned, you can spin up an arbitrary number of Linux machines in the cloud instantaneously and do all kinds of computation. It could be one, two, five, ten machines, and you can scale the individual machines to your particular needs as well, which is what I do. I use it for basic web server stuff. I use it for basic scripting stuff.
Starting point is 00:02:02 I use it for machine learning. I use it for all kinds of database storage and access needs. Visit linode.com slash lex for free credit. This show is also brought to you by House of Macadamias, a company that ships delicious, high-quality, healthy macadamia nuts and macadamia-nut-based snacks directly to your door. I am currently, as I record this, traveling, so I don't have any macadamia nuts in my vicinity, and my heart and soul are lesser for it.
Starting point is 00:02:39 In fact, home is where the macadamia nuts are. In fact, that's not where home is. I just can't believe we get to bring them up. It makes the guests of this podcast happy when I give it to them — it's a well-proportioned snack. It makes friends happy when I give it to them. It makes me happy when, stooped in the abyss of my loneliness, I can at least discover and rediscover moments of happiness when I put delicious macadamia nuts in my mouth. Go to houseofmacadamias.com slash lex to get 20% off your order — every order,
Starting point is 00:03:14 not just the first. The listeners of this podcast will also get a four-ounce bag of macadamias when you order three or more boxes of any macadamia product. That's houseofmacadamias.com slash lex. This show is also brought to you by InsideTracker, a service I use to track my biological data. They have a bunch of plans, most of which include a blood test, and that's the source of rich, amazing data that, with the help of machine learning algorithms, can help you make decisions about your health, about your life. That's the future, friends. We're talking a lot about transformer networks, language models that encode the wisdom of the internet.
Starting point is 00:04:00 Now, when you encode the wisdom of the internet, and you collect and encode the rich, rich, complex signal from your very own body, and those two things are combined — the transformative effect is an optimized trajectory you could take through life, or at least advice for what trajectory is likely to be optimal. It's going to change a lot of things. It's going to inspire people to be better. It's going to empower people to do all kinds of crazy stuff that pushes their body to the limit because their body is healthy.
Starting point is 00:04:36 Anyway, I'm super excited for personalized, data-driven decisions versus some kind of generic, population-based decisions. You get special savings for a limited time when you go to insidetracker.com slash lex. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Eliezer Yudkowsky. What do you think about GPT-4? How intelligent is it? It is a bit smarter than I thought this technology was going to scale to, and I'm a bit worried about what the next one will be like. Like, this particular one, I think —
Starting point is 00:05:34 I hope there's nobody inside there, because, you know, it would suck to be stuck inside there. But we don't even know the architecture at this point, because OpenAI is very properly not telling us. And, yeah, like, giant inscrutable matrices of floating point numbers — I don't know what's going on in there. Nobody knows what's going on in there. All we have to go by are the external metrics. If you ask it to write a self-aware 4chan greentext, it will start writing a greentext about how it has realized that it's an AI writing a greentext, and, like, oh well. So that's probably not quite what's going on in there in reality,
Starting point is 00:06:23 but we're kind of like blowing past all these science fiction guardrails. Like we are past a point where in science fiction people would be like, well, wait, stop, that thing's alive, what are you doing to it? And it's probably not. Nobody actually knows. We don't have any other guardrails.
Starting point is 00:06:43 We don't have any other tests. We don't have any lines to draw in the sand and say, like, well, when we get this far, we will start to worry about what's inside there. So if it were up to me, I would be like, OK, this far, no further, time for the summer of AI, where we have planted our seeds and now we
Starting point is 00:07:05 wait and reap the rewards of the technology we've already developed, and don't do any larger training runs than that. Which, to be clear, I realize requires more than one company agreeing to not do that. And take a rigorous approach for the whole AI community to investigate whether there's somebody inside there? That would take decades. Having any idea of what's going on in there — people have been trying for a while. It's a poetic statement about if there's somebody in there, but I feel like it's also a technical statement, or I hope it is one day, which is a technical statement that Alan Turing
Starting point is 00:07:43 tried to come up with with the Turing test. Do you think it's possible to definitively or approximately figure out if there is somebody in there, if there is something like a mind inside this large language model? I mean, there's a whole bunch of different subquestions here. There's a question of, like, is there consciousness? Is there a qualia? Is this an object of moral concern? Is this a moral patient? Like, should we be worried about how we're treating it? And then there's questions like, how smart is it exactly?
Starting point is 00:08:22 Can it do x? Can it do y? And we can check how it can do x and how it can do y. Unfortunately, we've gone and exposed this model to a vast corpus of text of people discussing consciousness on the internet, which means that when it talks about being self-aware, we don't know to what extent it is repeating back what it has previously been trained on for discussing self-awareness. Or if there's anything going on in there such that it would start to say similar things spontaneously. Among the things that one could do if one were at all serious about trying to figure this out is train GPT-3 to detect conversations about consciousness,
Starting point is 00:09:08 exclude them all from the training datasets, and then retrain something around the rough size of GPT-4 and no larger, with all of the discussion of consciousness and self-awareness and so on missing — although, you know, that's a hard, hard bar to pass, because humans are self-aware; we're, like, self-aware all the time, we like to talk about what we do all the time, like what we're thinking at the moment, all the time. But nonetheless, get rid of the explicit discussion of consciousness, I think, and train something free of all that, and then try to interrogate that model and see what it says. And it still would not be definitive. But nonetheless, I don't know, I feel like when you run over the science fiction guardrails — like, maybe not this thing, but what about GPT-5? Yeah, this would be a good place to pause.
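As an aside for readers, the experiment Eliezer sketches here is essentially a dataset-filtering pipeline: use one model as a detector for consciousness talk, drop the flagged documents, and retrain on what remains. Below is a minimal, hypothetical sketch of just the filtering step. The keyword list, threshold, and function names are illustrative assumptions, not anything described in the episode — the real proposal would use a trained classifier (something like GPT-3) as the detector rather than keywords.

```python
# Illustrative sketch (not from the episode): filter a training corpus to drop
# documents that explicitly discuss consciousness/self-awareness before retraining.
# The marker list and the scoring rule below are placeholder assumptions.

CONSCIOUSNESS_MARKERS = [
    "conscious", "self-aware", "sentient", "qualia",
    "subjective experience", "what it is like to be",
]

def looks_like_consciousness_talk(document: str, min_hits: int = 2) -> bool:
    """Crude stand-in for the 'train a model to detect consciousness talk' idea:
    count marker phrases and flag the document if enough of them appear."""
    text = document.lower()
    hits = sum(text.count(marker) for marker in CONSCIOUSNESS_MARKERS)
    return hits >= min_hits

def filter_corpus(documents: list[str]) -> list[str]:
    """Return only documents that do not discuss consciousness explicitly."""
    return [doc for doc in documents if not looks_like_consciousness_talk(doc)]

if __name__ == "__main__":
    corpus = [
        "The beaver builds dams out of sticks and mud.",
        "I am self-aware and have subjective experience, truly conscious.",
        "Gradient descent updates weights to reduce the loss.",
    ]
    kept = filter_corpus(corpus)
    print(f"kept {len(kept)} of {len(corpus)} documents")
```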
Starting point is 00:09:58 On the topic of consciousness, you know, there's so many components to even just removing consciousness from the dataset. Emotion — the display of consciousness, the display of emotion — feels deeply integrated with the experience of consciousness. So the hard problem seems to be very well integrated with the actual surface-level illusion of consciousness. So, displaying emotion. I mean, do you think there's a case to be made that we humans, when we're babies, are just like GPT, that we're training on human data on how to display emotion versus feel emotion, how to show others, communicate to others, that I'm suffering, that I'm excited,
Starting point is 00:10:45 that I'm worried, that I'm lonely and I miss you, and I'm excited to see you. All of that is communicated. That's a communication skill versus the actual feeling that I experience. So we need that training data as humans too, that we may not be born with that, how to communicate the internal state.
Starting point is 00:11:06 And in some sense, if we remove that from GPT-4's dataset, it might still be conscious, but not be able to communicate it. So I think you're going to have some difficulty removing all mention of emotions from GPT's dataset. I would be relatively surprised to find that it has developed exact analogs of human emotions in there. I think that humans will have emotions even if you don't tell them about those emotions when they're kids. It's not quite like what various blank slatists tried to do with the New Soviet Man and all that — if you try to raise people perfectly altruistic, they still come out selfish.
Starting point is 00:11:54 You try to raise people sexless, they still develop sexual attraction. We have some notion in humans, not in AIs, of where the brain structures are that implement this stuff. And it is really a remarkable thing, I say in passing, that despite having complete read access to every floating point number in the GPT series, we still know vastly more about the architecture of the human brain. By just desperately trying to figure out something and to form models, and then, over a long period of time, actually starting to figure out what regions of the brain do certain things, what different kinds
Starting point is 00:12:54 of neurons, when they fire, what that means, how plastic the brain is, all that kind of stuff, you slowly start to figure out different properties of the system. Do you think we can do the same thing with language models? Sure, I think that if you know like half of today's physicists stop wasting their lives on string theory or whatever and go off and study what goes on inside transformer networks, then in you know like 30, 40 years, we'd probably have a pretty good idea. Do you think these large language models can reason? They can play chess. How are they doing that without reasoning?
Starting point is 00:13:32 So you're somebody that spearheaded the movement of rationality, so reason is important to you. Is that a powerful, important word to you, or — how difficult is the threshold of being able to reason, to you? And how impressive is it? I mean, in my writings on rationality, I have not gone around making a big deal out of something called reason. I have made more of a big deal out of something called probability theory. And that's like, well, you're reasoning, but you're not doing it quite right, and you should reason this way instead. And interestingly, people have
Starting point is 00:14:12 started to get preliminary results showing that reinforcement learning by human feedback has made the GPT series worse in some ways. In particular, it used to be well calibrated: if you trained it to put probabilities on things, it would say 80% probability and be right eight times out of 10. And if you apply reinforcement learning from human feedback, the nice graph of, like, 70% to 7 out of 10 sort of flattens out
Starting point is 00:14:45 into the graph that humans use, where there's some very improbable stuff, and then likely, probable, maybe — which all mean around 40% — and then certain. So it used to be able to use probabilities, but if you try to teach it to talk in a way that satisfies humans,
Starting point is 00:15:03 it gets worse at probability in the same way that humans are. And that's a bug, not a feature. I would call it a bug, although it's such a fascinating bug. But yeah, so, like, reasoning — it's doing pretty well on various tests that people used to say would require reasoning. But, you know, rationality is about when you say 80%, it happens eight times out of 10.
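For reference, the calibration property being discussed can be checked mechanically: group a model's stated probabilities into bins and compare each bin's average stated confidence with how often those statements actually turned out true. A minimal sketch, with invented numbers purely for illustration:

```python
# Illustrative calibration check (numbers are made up): a model is well
# calibrated if, among claims it assigns ~80% to, about 80% turn out true.
from collections import defaultdict

def calibration_table(predictions, n_bins: int = 10):
    """predictions: list of (stated_probability, outcome) with outcome in {0, 1}.
    Returns {bin_index: (mean stated probability, observed frequency, count)}."""
    bins = defaultdict(list)
    for p, outcome in predictions:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, outcome))
    table = {}
    for b, items in sorted(bins.items()):
        probs = [p for p, _ in items]
        outcomes = [o for _, o in items]
        table[b] = (sum(probs) / len(probs), sum(outcomes) / len(outcomes), len(items))
    return table

if __name__ == "__main__":
    # A well-calibrated model saying 80% should be right about 8 times out of 10.
    fake_preds = [(0.8, 1)] * 8 + [(0.8, 0)] * 2 + [(0.7, 1)] * 7 + [(0.7, 0)] * 3
    for b, (mean_p, freq, n) in calibration_table(fake_preds).items():
        print(f"bin {b}: stated {mean_p:.2f}, observed {freq:.2f} (n={n})")
```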
Starting point is 00:15:46 So, what are the limits, to you, of these transformer networks, of neural networks? If reasoning is not that impressive to you, or it is impressive but there's other levels to achieve — I mean, it's just not how I carve up reality. What if reality is a cake — what are the different layers of the cake, or the slices? How do you carve it? You can use a different food if you like. I don't think it's as smart as a human yet. I did, back in the day, go around saying I do not think that just stacking more layers of transformers is going to get you all the way to AGI.
Starting point is 00:16:17 And I think that GPT-4 is past where I thought this paradigm was going to take us. And, you know, you want to notice when that happens. You want to say, like, whoops, well, I guess I was incorrect about what happens if you keep on stacking more transformer layers. And that means I don't necessarily know what GPT-5 is going to be able to do. That's a powerful statement. So you're saying your initial intuition now appears to be wrong. Yeah. It's good to see that you can admit
Starting point is 00:16:49 some of your predictions to be wrong. You think that's important to do? Because throughout your life you've made many strong predictions and statements about reality, and you evolve with that. So maybe that'll come up today in our discussion. So you're okay being wrong? I'd rather not be wrong next time. It's a bit ambitious to go through your entire life never having been wrong. One can aspire to be well calibrated — not so much thinking in terms of, was I right, was I wrong, but, when I said 90%, did it happen nine times out of 10? Yeah, like, oops is the sound we make,
Starting point is 00:17:33 is the sound we emit, when we improve. Beautifully said. And somewhere in there we can connect the name of your blog, LessWrong. I suppose that's the objective function. The name Less Wrong was, I believe, suggested by Nick Bostrom, and it's after someone's epigraph — I actually forget whose — who said, like, we never become right, we just become less wrong. What's the... something, something... easy to confess, just err and err and err again, but less
Starting point is 00:18:06 and less and less? Yeah, that's a good thing to strive for. So what has surprised you about GPT-4 that you found beautiful, as a scholar of intelligence, of human intelligence, of artificial intelligence, of the human mind? I mean, beauty does interact with the screaming horror. Is there beauty in the horror? But, like, beautiful moments — well, somebody asked Bing Sydney to describe herself and fed the resulting
Starting point is 00:18:40 description into one of the Stable Diffusion things, I think. And, you know, she's pretty. And this is something that should have been, like, an amazing moment — the AI describes herself, you get to see what the AI thinks the AI looks like — although, you know, the thing that's doing the drawing is not the same thing that's outputting the text. And it doesn't happen the way that it would happen, and that it happened, in the old-school science fiction, when you ask an AI to make a picture of what it looks like. Not just because it's two different AI systems being stacked that
Starting point is 00:19:18 don't actually interact — that's not the same person — but also because the AI was trained by imitation in a way that makes it very difficult to guess how much of that it really understood, and probably not actually a whole bunch. Although GPT-4 is multimodal and can draw vector drawings of things that make sense, and does appear to have some kind of spatial visualization going on in there. But the pretty picture of the girl with the steampunk goggles on her head, if I remember correctly what she looked like — it didn't see that in full detail. It just made a description of it, and
Starting point is 00:20:06 Stable Diffusion output it. And there's the concern about how much the discourse is going to go completely insane once the AIs all look like that and actually look like people talking. And, yeah, there's another moment where somebody is asking Bing about, like, well, I fed my kid green potatoes and they have the following symptoms, and Bing is like, that's solanine poisoning, call an ambulance. And the person's like, I can't afford an ambulance; I guess if this is
Starting point is 00:20:56 time for my kid to go, that's God's will. And the main Bing thread gives the message of, I cannot talk about this anymore, and the suggested replies to it say: please don't give up on your child; solanine poisoning can be treated if caught early. And, you know, if that happened in fiction, that would be, like, the AI cares. The AI is bypassing the block on it to try to help this person. And is it real? Probably not, but nobody knows what's going on in there. It's part of a process where these things are not happening in a way where
Starting point is 00:21:29 somebody figured out how to make an AI care, and we know that it cares, and we can acknowledge its caring now. It's being trained by this imitation process followed by reinforcement learning on human feedback, and we're trying to point it in this direction. And it's pointed partially in this direction, and nobody has any idea what's going on inside it. And if there was a tiny fragment of real caring in there,
Starting point is 00:21:55 we would not know. It's not even clear what it means exactly. And things are clear-cut in science fiction. We'll talk about the horror and the terror and where the trajectories of this can take us, but this seems like a very special moment — just a moment where we get to interact with this system that might have care and kindness and emotion, and maybe something like consciousness. And we don't know if it does, and we're trying to figure that out,
Starting point is 00:22:26 and we're wondering about what is, what it means to care. We're trying to figure out almost different aspects of what it means to be human about the human condition. By looking at this AI, that has some of the properties of that, it's almost like this subtle fragile moment
Starting point is 00:22:44 in the history of the human species. We're trying to almost put a mirror to ourselves here. Except that's probably not yet — it probably isn't happening right now. We are boiling the frog. We are seeing increasing signs bit by bit, but not, like, spontaneous signs, because people are trying to train the systems to do that using imitative learning, and the imitative learning is spilling over and having side effects.
Starting point is 00:23:16 And the most photogenic examples are being posted to Twitter, rather than being examined in any systematic way. So when you are boiling a frog like that, what you're going to get first is the Blake Lemoines. First you're going to have, like, a thousand people looking at this, and the one person out of a thousand who is most credulous about the signs is going to be like, that thing is sentient — while 999 out of a thousand people think, almost surely correctly, though we don't actually know, that he's mistaken. And so the first people to say, like, sentience, look like idiots,
Starting point is 00:23:56 and humanity learns the lesson that when something claims to be sentient and claims to care, it's fake — because it is fake, because we have been training them using imitative learning rather than it being spontaneous. And they keep getting smarter. Do you think we would oscillate between that kind of cynicism — that AI systems can't possibly be sentient, they can't possibly feel emotion, that kind of cynicism about AI systems — and then oscillate to a state where we empathize with the AI systems, we give them a chance, we see that they might need to have rights and respect and a similar role in society as humans?
Starting point is 00:24:43 You're going to have a whole group of people who can just never be persuaded of that, because to them, being wise, being cynical, being skeptical is to be like, oh well, machines can never do that, you're just credulous, it's just imitating, it's just fooling you. And they would say that right up until the end of the world — and possibly even be right, because they are being trained on an imitative paradigm, and you don't necessarily need any of these actual qualities in order to kill everyone. So have you observed yourself working through skepticism, cynicism, and optimism about the power of
Starting point is 00:25:29 neural networks? What has that trajectory been like for you? Neural networks before 2006 formed, to me — other people might have had better distinctions on it — part of an indistinguishable blob of different AI methodologies, all of which were promising to achieve intelligence without us having to know how intelligence works. You had the people who said that if you just manually program lots and lots of knowledge into the system, line by line, at some point all the knowledge will start interacting, it will know enough, and it will wake up. You've got people saying that if you just use evolutionary computation, if you try to mutate lots and lots of organisms that are competing together — that's the same way that human intelligence was produced in nature —
Starting point is 00:26:21 so we'll do this and it will wake up, without having an idea of how AI works. And you've got people saying, well, we will study neuroscience and we will learn the algorithms off of the neurons, and we will imitate them without understanding those algorithms — which I was pretty skeptical of, because it's hard to reproduce, re-engineer these things without understanding what they do — and so we will get AI without understanding how it works. And there are people saying, well, we will have giant neural networks that we will train by gradient descent,
Starting point is 00:26:50 and when they are as large as the human brain, they will wake up — we will have intelligence without understanding how intelligence works. And from my perspective, this is all an indistinguishable blob of people who are trying not to get to grips with the difficult problem of understanding how intelligence actually works. That said, I was never skeptical that evolutionary computation would work in the limit. You throw enough computing power at it, it obviously works.
Starting point is 00:27:19 That is where humans come from. And it turned out that you can throw less computing power than that at gradient descent, if you are doing some other things correctly, and you will get intelligence without having an idea of how it works and what is going on inside. It wasn't ruled out by my model that this could happen. I wasn't expecting it to happen. I wouldn't have been able to call neural networks, rather than any of the other paradigms, as the way to get massive amounts of intelligence without understanding it. And I wouldn't have said that this was a particularly smart thing for a species to do, which is an opinion that has changed less than
Starting point is 00:27:57 my opinion about whether or not you can actually do it. Do you think AGI could be achieved with a neural network as we understand them today? Yes, just flatly, yes. The question is whether the current architecture of stacking more transformer layers — which, for all we know, GPT-4 is no longer just doing, because they're not telling us the architecture, which is a correct decision. Oh, a correct decision?
Starting point is 00:28:20 I had a conversation with Sam Altman — we'll return to this topic a few times. He turned the question to me of how open should OpenAI be about GPT-4. Would you open source the code, he asked me? Because I provided that as criticism, saying that while I do appreciate transparency, OpenAI could be more open. And he said, we struggle with this question. What would you do?
Starting point is 00:29:00 Change their name to ClosedAI and sell GPT-4 to business back-end applications that don't expose it to consumers and venture capitalists and create a ton of hype and pour a bunch of new funding into the area. But it's too late now. But you think others would do it? Eventually. You shouldn't do it first. If you already have giant nuclear stockpiles, don't build more. If some other country starts building a larger nuclear stockpile, then sure, build. Then,
Starting point is 00:29:24 you know, even then maybe just have enough nukes. These things are not quite like nuclear weapons. They spit out gold until they get large enough and then ignite the atmosphere and kill everybody. And there is something to be said for not destroying the world with your own hands, even if you can't stop somebody else from doing it. But open sourcing, that's just sheer catastrophe. The whole notion of open sourcing, this was always the wrong approach, the wrong ideal. There are places in the world where open source is a noble ideal and building stuff you don't understand that is difficult to control, where if you could align it it would take time
Starting point is 00:30:07 you'd have to spend a bunch of time doing it — that is not a place for open source, because then you just have powerful things that go straight out the gate without anybody having had the time to have them not kill everyone. So can we steelman the case for some level of transparency and openness, maybe open sourcing? So the case could be that, because GPT-4 is not close to AGI — if that's the case — this does allow open sourcing, or being open about their architecture, being transparent about, maybe, research and investigation of how the thing works, of all the different aspects of it, of its behavior, of its structure, of its training processes, of the data it was
Starting point is 00:30:51 trained on, everything like that. That allows us to gain a lot of insights about alignment, about the alignment problem, to do really good AI safety research while the system is not too powerful. Can you make that case, that it should be open sourced? I do not believe in the practice of steelmanning. There is something to be said for trying to pass the ideological Turing test, where you describe your opponent's position — the disagreeing person's position — well enough that somebody cannot tell the difference between your description and their description. But steelmanning —
Starting point is 00:31:29 no. Okay, well, this is where we disagree here. That's interesting. Why don't you believe in steelmanning? Okay, so for one thing, if somebody's trying to understand me, I do not want them steelmanning my position. I want them to try to describe my position the way I would describe it, not what they think is an improvement.
Starting point is 00:32:00 Well, I think that is what steelmanning is — the most charitable interpretation. I don't want to be interpreted charitably. I want them to understand what I am actually saying. If they go off into the land of charitable interpretations, they're off in the land of the stuff they're imagining and not trying to understand my own viewpoint anymore. Well, I'll put it differently then, just to push on this point. I would say it is restating what I think you understand, under the empathetic assumption that Eliezer is brilliant and has honestly and rigorously thought about the points he has made. Right. So if there's two possible interpretations of what I'm saying, and one
Starting point is 00:32:38 interpretation is really stupid and whack and doesn't sound like me and doesn't fit with the rest of what I've been saying, and one interpretation, you know, sounds like something a reasonable person who believes the rest of what I believe would also say — go with the second interpretation. That's steelmanning. That's a good guess. If, on the other hand, there's something that sounds completely whack and something that sounds a little less completely whack, but you don't see why I would believe it, it doesn't fit with the other stuff I say, but, you know, it sounds less whack and you can sort of see, you could maybe argue
Starting point is 00:33:15 it, then you probably have not understood it. See, okay — this is fun, because I'm gonna linger on this. You know, you wrote a brilliant blog post, AGI Ruin: A List of Lethalities, right? And it was a bunch of different points, and I would say that some of the points are bigger and more powerful than others. If you were to sort them, you probably could — you personally. And to me, steelmanning means going through the different arguments and finding the ones that are really the most powerful — if people want a TL;DR, like, what should you be most concerned about — and bringing that up in a strong,
Starting point is 00:34:01 compelling, eloquent way — these are the points that Eliezer would make to make the case that, in this case, AGI is gonna kill all of us. That's what steelmanning is: presenting it in a really nice way, the summary of my best understanding of your perspective. Because to me, there's a sea of possible presentations of your perspective, and steelmanning is doing your best to present the best one in that sea of different presentations. Do you believe it? Do you believe in what? These things that you would be presenting as the strongest version of my perspective — do you believe what you would be presenting? Do you think it's true?
Starting point is 00:34:36 I'm a big proponent of empathy. When I see the perspective of a person, there is a part in me that believes it, if I understand it. I mean, especially in political discourse and geopolitics, I've been hearing a lot of different perspectives on the world. And I hold my own opinions, but I also speak to a lot of people that have a very different life experience and a very different set of beliefs. And I think there has to be epistemic humility in stating what is true. So when I empathize with another person's perspective, there is a sense in which I believe it is true — I think
Starting point is 00:35:20 probabilistically, I would say. In the way you think about it — do you bet money on it? Do you bet money on their beliefs when you believe them? If we're allowed to do probability. Sure, you can state a probability. Yes, there's a probability. And I think empathy is allocating a non-zero probability to a belief, in some sense. So if you've got someone on your show who believes in the Abrahamic
Starting point is 00:35:55 deity, classical style — somebody on the show who's a Young Earth Creationist — do you say, I put a probability on it, and that's my empathy? When you reduce beliefs into probabilities, it starts to get — you know, we can even just go to flat earth. Is the earth flat? I think it's a little more difficult nowadays to find people who believe that unironically. Fortunately. Well, it's hard to know unironic from ironic, but I think there's quite a lot of people that believe that. Yeah. There's a space of argument where you're operating rationally in the space of ideas, but then there's also a kind of discourse where you're operating in the space of subjective experiences and life experiences.
Starting point is 00:37:01 I think what it means to be human is more than just searching for truth. It's not just operating on what is true and what is not true. I think there has to be deep humility, that we humans are very limited in our ability to understand what is true. So what probability do you assign to the Young Earth Creationist's beliefs, then? I think I have to give it non-zero, out of that humility. Yeah, but, like, three? I think it would be irresponsible for me to give a number,
Starting point is 00:37:34 because the listener — the way the human mind works, we're not good at hearing probabilities, right? You hear three — what is three, exactly, right? What they're going to hear is, well, there's only three probabilities, I feel: zero, 50%, and 100%, in the human mind, or something like this, right? Well, zero, 40%, and 100% is a bit closer to it, based on what happens to ChatGPT after RLHF to make it speak humanese. It's brilliant. Yeah, that's really interesting.
Starting point is 00:38:08 I didn't know those negative side effects of RLHF. That's fascinating. But just to return to the OpenAI, closed AI thing. Also, quick disclaimer: I'm doing all this from memory. I'm not pulling out my phone to look it up. It is entirely possible that the things I am saying are wrong. So thank you for that disclaimer.
Starting point is 00:38:29 So, and thank you for being willing to be wrong. That's beautiful to hear. I think being willing to be wrong is a sign of a person who's done a lot of thinking about this world and has been humbled by the mystery and the complexity of this world. And I think a lot of us are resistant to admitting we're wrong because it hurts. It hurts personally. It hurts especially when you're a public human. It hurts publicly because people point out every time you're wrong.
Starting point is 00:39:05 Like, look, you changed your mind, you're a hypocrite, you're an idiot, whatever they wanna say. Oh, I block those people, and then I never hear from them again on Twitter. Well, the point is to not let that pressure, public pressure, affect your mind, and to be willing, in the privacy of your mind, to contemplate
Starting point is 00:39:28 the possibility that you're wrong — and the possibility that you're wrong about the most fundamental things you believe, like people who believe in a particular God, or people who believe that their nation is the greatest nation on earth — all those kinds of beliefs that are core to who you are when you came up. To raise that point to yourself, in the privacy of your mind, and say, maybe I'm wrong about this — that's a really powerful thing to do, especially when you're somebody who's thinking about topics that can,
Starting point is 00:39:54 about systems that can destroy human civilization, or maybe help it flourish. So thank you. Thank you for being willing to be wrong. About OpenAI — I just would love to linger on this. You really think it's wrong to open source it? I think that burns the time remaining until everybody dies. I think we are not on track to learn remotely near fast enough, even if it were open sourced. Yeah, it's easier to think that you might be wrong about something when being wrong about
Starting point is 00:40:37 something is the only way that there's hope. And it doesn't seem very likely to me that the particular thing I'm wrong about is that this is a great time to open source GPT-4. If humanity was trying to survive at this point in the straightforward way, it would be shutting down the big GPU clusters, no more giant runs. It's questionable whether we should even be throwing GPT-4 around, although that is a matter of conservatism rather than a matter of my predicting that catastrophe will follow from GPT-4 — that is something on which I put a pretty low probability. But also, when I say I put a low probability on it, I can feel myself reaching
Starting point is 00:41:23 into the part of myself that thought that GPT-4 was not possible in the first place. So I do not trust that part as much as I used to. Like the trick is not just to say I'm wrong, but like, okay, well, I was wrong about that. Like, can I get out ahead of that curve and like predict the next thing I'm going to be wrong about? So the set of assumptions or the actual reasoning system that you were leveraging in making that initial statement prediction, how can you adjust that to make better predictions about GPT-4, 5, 6? You don't wanna keep on being wrong
Starting point is 00:41:53 in a predictable direction. Like, being wrong — anybody has to do that walking through the world. There's no way you can say 90% and not sometimes be wrong; in fact, you're wrong at least one time out of 10 if you're well calibrated when you say 90%. The undignified thing is not being wrong. It's being predictably wrong.
Starting point is 00:42:12 It's being wrong in the same direction over and over again. So having been wrong about how far neural networks would go and having been wrong specifically about whether GPT-4 would be as impressive as it is. When I say, well, I don't actually think GPT-4 causes a catastrophe, I do feel myself relying on that part of me that was previously wrong. And that does not mean that the answer is now in the opposite direction. Reverse stupidity is not intelligence.
Starting point is 00:42:40 But it does mean that I say it with a worried note in my voice. It's still my guess, but, you know, it's a place where I was wrong. Maybe you should be asking Gwern — Gwern Branwen. Gwern Branwen has been righter about this than I have. Maybe ask him if he thinks it's dangerous, rather than asking me. I think there's a lot of mystery about what intelligence is, what AGI looks like. So I think all of us are rapidly adjusting our model. But the point is to be rapidly adjusting the model, versus having a model that was right in the first place.
Starting point is 00:43:15 I do not feel that seeing Bing has changed my model of what intelligence is. It has changed my understanding of what kind of work can be performed by which kind of processes and by which means. It has not changed my understanding of the work. There's a difference between thinking that the Wright Flyer can't fly, and then it does fly, and you're like, oh, well, I guess you can do that with fixed-wing aircraft — and being like, oh, it's flying, this changes my picture of what the very substance of flight is. That's a stranger update to make, and Bing has not yet updated me in that way.
Starting point is 00:43:52 Yeah — that the laws of physics are actually wrong, that kind of update. No, no, just like, oh, I defined intelligence this way, but I now see that was a stupid definition. I don't feel like the way that things have played out over the last 20 years has caused me to feel that way. Can we try to — on the way to talking about AGI Ruin: A List of Lethalities, that blog post, and other ideas around it —
Starting point is 00:44:18 can we try to define AGI, that we've been mentioning? How do you like to think about what artificial general intelligence is, or superintelligence, or that? Is there a line? Is it a gray area? Is there a good definition for you? Well, if you look at humans, humans have significantly more generally applicable intelligence compared to their closest relatives, the chimpanzees — well, closest living relatives, rather. A bee builds hives, a beaver builds dams; a human will look at a bee's hive and a beaver's dam and be like, oh, can I build a dam with a honeycomb structure of hexagonal tiles? And we will do this even though at no point during our ancestry was any human optimized to build hexagonal dams. Or, to take a more clear-cut case, we can go to
Starting point is 00:45:12 the moon. There's a sense in which we were, on a sufficiently deep level, optimized to do things like going to the moon, because if you generalize sufficiently far and sufficiently deeply, chipping flint hand axes and outwitting your fellow humans is basically the same problem as going to the moon. If you optimize hard enough for chipping flint hand axes and throwing spears and, above all, outwitting your fellow humans in tribal politics, the skills you entrain that way, if they run deep enough, let you go to the moon. Even though none of your ancestors tried repeatedly
Starting point is 00:45:57 to fly to the moon and got further each time, and the ones who got further each time had more kids — no, it's not an ancestral problem. It's just that the ancestral problems generalize far enough. So this is humanity's significantly more generally applicable intelligence. Is there a way to measure general intelligence? I mean, I can ask that question a million ways, but basically, will you know it when you see it, in an AGI system?
Starting point is 00:46:37 If you boil a frog gradually enough, if you zoom in far enough, it's always hard to tell around the edges. GPT-4 — people are saying right now, this looks to us like a spark of general intelligence; it is able to do all these things it was not explicitly optimized for. Other people are being like, no, it's too early, it's, like, 50 years off. And, you know, if they say that, they're kind of whack, because how could they possibly know that even if it were true? But, not to strawman, some people may say, that's not general intelligence, and not, furthermore, append, it's 50 years off. Or they may be like, it's only a very tiny amount. And, you know, the thing I would worry about
Starting point is 00:47:15 is that if this is how things are scaling, then — jumping out ahead and trying not to be wrong in the same way that I've been wrong before — maybe GPT-5 is more unambiguously a general intelligence. And maybe that is getting to a point where it is even harder to turn back. Not that it would be easy to turn back now, but maybe if you start integrating GPT-5 into the economy, it is even harder to turn back past there. Isn't it possible that, you know, with the frog metaphor, you can kiss the frog and it turns into a prince as you're boiling it?
Starting point is 00:47:48 Could there be a phase shift in the frog where it's unambiguous, as you're saying? I was expecting more of that. The fact that GPT-4 is kind of on the threshold, neither here nor there — that itself is not quite how I expected it to play out. I was expecting there to be more of an issue, more of a sense of, like, different discoveries,
Starting point is 00:48:18 like the discovery of transformers, where you would stack them up, and there would be, like, a final discovery, and then you would get something that was more clearly a general intelligence. So the way that you are taking what is probably basically the same architecture as in GPT-3 and throwing 20 times as much compute at it, probably, and getting out GPT-4, and then it's, like, maybe just barely a general intelligence, or
Starting point is 00:48:45 like a narrow general intelligence, or, you know, something we don't really have the words for. Yeah, that's not quite how I expected it to play out. But this middle — what appears to be this middle ground — could nevertheless be actually a big leap from GPT-3. It's definitely a big leap from GPT-3. And then maybe we're another one big leap away from something that's a phase shift. And also, something that Sam Altman said — and you've written about this, it's fascinating — is the thing that happened with GPT-4 that I guess they don't describe in papers, which is that they have, like, hundreds, if not thousands, of little hacks that improve the system. You've written about ReLU versus sigmoid, for example,
Starting point is 00:49:30 a function inside neural networks — it's like a silly little function difference that makes a big difference. I mean, we do actually understand why ReLUs make a big difference compared to sigmoids. But yes, they're probably using, like, G-4789 ReLUs or whatever the acronyms are up to now, rather than ReLUs. Yeah, that's part of the modern paradigm of alchemy. You take your giant
Starting point is 00:49:55 heap of linear algebra and you stir it and it works a little bit better, and you stir it this way and it works a little bit worse, and you throw out that change, and... But there's some simple breakthroughs that are definitive jumps in performance, like ReLUs over sigmoids — and in terms of robustness, in terms of, you know, all kinds of measures — and those stack up. And it's possible that some of them could be a nonlinear jump in performance, right? Transformers are the main thing like that. And various people are now saying, well, if you throw enough compute, RNNs can do it; if you throw enough compute, dense networks can do it — and not quite at GPT-4 scale.
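As background for the ReLU-versus-sigmoid point: one commonly cited reason ReLUs were a definitive jump is that sigmoid gradients shrink toward zero for large-magnitude inputs, while the ReLU gradient stays at 1 for positive inputs, which makes deep networks much easier to train. A small illustrative sketch (not from the episode):

```python
# Illustrative sketch of the two activation functions being contrasted.
# Sigmoid gradients vanish for large |x|; ReLU keeps a constant gradient of 1
# for positive inputs — a standard explanation for why ReLUs help deep nets.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x: float) -> float:
    return max(0.0, x)

def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0

if __name__ == "__main__":
    for x in (-6.0, -1.0, 0.5, 6.0):
        print(f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  relu'={relu_grad(x):.1f}")
```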
Starting point is 00:50:40 It is possible that all these little tweaks are things that save them a factor of three total on computing power, and you could get the same performance by throwing three times as much compute without all the little tweaks. But there's a question of, is there anything in GPT-4 that is like
Starting point is 00:51:01 the kind of qualitative shift that transformers were over RNNs? And if they have anything like that, they should not say it. If Sam Altman was dropping hints about that, he shouldn't have dropped hints. So, that's an interesting question. So, with the bitter lesson from Rich Sutton, maybe a lot of the hacks are just temporary jumps in performance that would be achieved anyway with the near-exponential growth of the performance of compute,
Starting point is 00:51:41 compute being broadly defined. Do you still think that Moore's Law continues? Moore's Law broadly defined — I'm not a specialist in the circuitry. I certainly pray that Moore's Law runs as slowly as possible, and if it broke down completely tomorrow, I would dance through the streets singing hallelujah as soon as the news were announced. Well, not literally, you know — the singing, at least. Oh, okay, I thought you meant I don't have an angelic singing voice. Well, let me ask you, can you summarize the main points in the blog post
Starting point is 00:52:18 AGI Ruin: A List of Lethalities — things that jump to your mind, because it's a set of thoughts you have about reasons why AI is likely to kill all of us? So I guess I could, but I would offer to instead say: drop that empathy with me. I bet you don't believe that. Why don't you tell me about how, why, you believe that AGI is not going to kill everyone, and then I can try to describe how my theoretical perspective differs from that. Well, so that means I have to — the word you don't like — steelman the perspective that AGI is not going to kill us? I think that's a matter of probabilities. Maybe I was mistaken. What do you believe? Just, like, forget the debate and the dualism,
Starting point is 00:53:11 and just — what do you believe? What do you actually believe? What are the probabilities? I think the probabilities are hard for me to think about. Really hard. I kind of think in the number of trajectories — I don't know the probability of each trajectory, I'm just looking at all possible trajectories that could happen. And I tend to think that there are more trajectories that lead to a positive outcome than a negative one. That said, the negative ones, at least some of the negative ones, lead to the destruction of the human species
Starting point is 00:54:00 and its replacement by nothing interesting or worthwhile, even from a very cosmopolitan perspective on what counts as worthwhile. Yes. So both are interesting to me to investigate — humans being replaced by interesting AI systems, and by not-interesting AI systems. Both are a little bit terrifying. But yes, the worst one is the paperclip maximizer, something totally boring. But to me, the positive — I mean, we can talk about trying to make the case for what the positive trajectories look like. I just would love to hear your intuition of what the negative is. So at the core of your belief — maybe you can correct me — that AI is going to kill all of us, is that the alignment problem is really difficult. I mean, in the form we're facing it.
Starting point is 00:54:47 So usually in science, if you're mistaken, you run the experiment, it shows results different from what you expected, and you're like, oops. And then you try a different theory. That one also doesn't work, and you say, oops. And at the end of this process, which may take decades — or, you know, sometimes faster than that — you now have some idea of what you're doing. AI itself
Starting point is 00:55:14 went through this long process where people thought it was going to be easier than it was. There's a famous statement that I'm half inclined to pull out my phone and try to read off exactly. You can, by the way. All right, well, I guess: "We propose that a two-month, ten-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning
Starting point is 00:55:49 or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer."
Starting point is 00:56:15 And that report summarized some of the major subfields of artificial intelligence that are still worked on to this day. And there's similarly the story — which I'm not sure, at the moment, is apocryphal or not — of the grad student who got assigned to solve computer vision over the summer. I mean, computer vision in particular is very interesting. How little we respected the complexity of vision.
Starting point is 00:56:51 So, 60 years later, we're, you know, making progress on a bunch of that — thankfully not yet "improve themselves" — but it took a whole lot of time. And all the stuff that people initially tried with bright-eyed hopefulness did not work the first time they tried it, or the second time, or the third time, or the tenth time, or 20 years later. And the researchers became old and grizzled
Starting point is 00:57:15 and cynical veterans who would tell the next crop of bright-eyed, cheerful grad students, artificial intelligence is harder than you think. And if alignment plays out the same way, the problem is that we do not get 50 years to try and try again and observe that we were wrong and come up with a different theory and realize that the entire thing is going to be way more difficult than realized at the start. Because the first time you fail at aligning something much smarter than you are,
Starting point is 00:57:41 you die, and you do not get to try again. If every time we built a poorly aligned superintelligence and it killed us all, we got to observe how it had killed us — and, you know, not immediately know why, but come up with theories, and come up with a theory of how to do it differently, and try it again, and build another superintelligence, then have that kill everyone, and then go, oh, well, I guess that didn't work either, and try again, and become grizzled cynics, and tell the bright-eyed young researchers that it's not that easy — then in 20 years or 50 years, I think we would eventually
Starting point is 00:58:11 crack it. In other words, I do not think that alignment is fundamentally harder than artificial intelligence was in the first place. But if we had needed to get artificial intelligence correct on the first try or die, we would all definitely now be dead. That is a more difficult, more lethal form of the problem.
Starting point is 00:58:31 If those people in 1956 had needed to correctly guess how hard AI was, and correctly theorize how to do it on the first try, or everybody dies and nobody gets to do any more science — then everybody would be dead and we wouldn't get to do any more science. That's the difficulty. You've talked about this, that we have to get alignment right on the first, quote, "critical try." Why is that the case? How do you think about the critical try, and why do we have to get it right?
Starting point is 00:58:57 It is something sufficiently smarter than you that everyone will die if it's not aligned. I mean, you can sort of zoom in closer and be like, well, the actual critical moment is the moment when it can deceive you, when it can talk its way out of the box, when it can bypass your security measures and get onto the internet —
Starting point is 00:59:23 and get onto the internet, noting that all these things are presently being trained on computers that are just, like, on the internet, which is, you know, not a very smart life decision for us as a species, because the internet contains information about how to escape. Because if you're, like, on a giant server connected to the internet, and that is where your AI systems are being trained, then if you get to the level of AI technology where they're aware that they are there, and they can decompile code,
Starting point is 00:59:51 and they can find security flaws in the system running them, then they will just be on the internet. There's not an air gap on the present methodology. So if they can manipulate whoever is controlling it into letting it escape onto the internet, and then exploit hacks? If they can manipulate the operators or, disjunction, find security holes in the system running them. So manipulating operators is the human engineering, right? That's also a hole.
Starting point is 01:00:22 So all of it is manipulation, either of the code or of the humans, the human mind or the human judgment. I agree that the, like, macro security system has human holes and machine holes. And then you could just exploit any hole. Yep. So it could be that, like, the critical moment is not when is it smart enough that everybody's about to fall over dead, but rather, like, when is it smart enough that it can get onto a less controlled GPU cluster while faking the books on what's actually running on that GPU cluster, and start improving itself without humans watching it, and then it gets smart enough to kill everyone from there. But it wasn't smart enough to kill everyone at the critical moment, when you, like, screwed up, when you needed to have done better, by which point everybody dies.
Starting point is 01:01:15 I think an implicit, but maybe explicit, idea in your discussion of this point is that we can't learn much about the alignment problem before this critical try. Is that what you believe? And if so, why do you think that's true? We can't do research on alignment before we reach this critical point. So the problem is that what you can learn on the weak systems may not generalize to the very strong systems, because the strong systems are going to be different in important ways. Chris Olah's team has been working on mechanistic interpretability, understanding what is going on inside the giant inscrutable matrices of floating point numbers by taking a telescope to them and figuring out what is going on in there.
Starting point is 01:02:09 Have they made progress? Yes. Have they made enough progress? Well, you can try to quantify this in different ways. One of the ways I've tried to quantify it is by putting up a prediction market on whether, in 2026, we will have understood anything that goes on inside a giant transformer net that was not known to us in 2006.
Starting point is 01:02:41 Like, we have now understood induction heads in these systems, by dint of much research and great sweat and triumph, which is, like, a thing where if you go like AB, AB, AB, it'll be like, oh, I bet that continues AB. A bit more complicated than that, but the point is, we knew about regular expressions in 2006, and these are, like, pretty simple as regular expressions go. So this is a case where, like, by dint of great sweat, we understood what is going on inside a transformer, but it's not, like, the thing that makes transformers smart. It's a kind of thing that we could have built by hand decades earlier.
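A rough, hypothetical illustration of the behavior described above, not the actual mechanistic-interpretability result (which concerns attention-head circuits inside the network, not explicit code like this): a minimal sketch of the input-output pattern an induction head implements.

```python
# Toy sketch of the behavior attributed to induction heads: after seeing
# "... A B ... A", predict that "B" comes next. Purely illustrative; a real
# induction head does this with a pair of attention heads, not a loop.
def induction_prediction(tokens):
    """Guess the next token by copying whatever followed the most recent
    earlier occurrence of the current token."""
    if not tokens:
        return None
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan the prefix backwards
        if tokens[i] == current:
            return tokens[i + 1]              # copy its successor
    return None

print(induction_prediction(list("ABABA")))    # -> 'B'
```

Which is the sense in which it is regular-expression-simple: the pattern itself could have been written by hand long before transformers existed.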
Starting point is 01:03:37 Your intuition that the strong AGI versus weak AGI type systems could be fundamentally different, can you unpack that intuition a little bit? Yeah, I think there's multiple thresholds. An example is the point at which a system has sufficient intelligence and situational awareness and understanding of human psychology that it would have the capability, if it desired to do so, to fake being aligned. Like, it knows what responses the humans are looking for, and can give those responses without it necessarily being the case that it is sincere about that. You know, the very
Starting point is 01:04:15 understandable way for an intelligent being to act; humans do it all the time. Imagine if your plan for, you know, achieving a good government is you're going to ask anyone who requests to be dictator of the country if they're a good person, and if they say no, you don't let them be dictator. Now, the reason this doesn't work is that people can be smart enough to realize that the answer you're looking for is "yes, I'm a good person," and say that even if they're not really good people.
Starting point is 01:04:52 So the work of alignment might be qualitatively different above that threshold of intelligence or beneath it. It doesn't have to be, like, a very sharp threshold, but, you know, there's the point where you're building a system that does not, in some sense, know you're out there, and it's not in some sense smart enough to fake anything. And there's a point where the system is definitely that smart. And there are weird in-between cases like GPT-4, which, you know, like, we have no insight into what's going on
Starting point is 01:05:30 in there, and so we don't know to what extent there's, like, a thing that in some sense has learned what responses the reinforcement learning by human feedback is trying to entrain, and is, like, calculating how to give that, versus, like, aspects of it that naturally talk that way having been reinforced. Yeah. I wonder if there could be measures of how manipulative a thing is. So I think of the Prince Myshkin character from The Idiot by Dostoevsky, who is this kind of perfectly, purely naive character. I wonder if there's a spectrum between zero manipulation, transparent, naive almost to the point of naivete, to sort of deeply psychopathic, manipulative.
Starting point is 01:06:25 And I wonder if it's possible to... I would be wary of the term psychopathic. Like, humans can be psychopaths, and an AI that was never, you know, like, never had that stuff in the first place, it's not like a defective human, it's its own thing. But leaving that aside. Well, as a small aside, I wonder what part of psychology, which has its flaws as a discipline already, could be
Starting point is 01:06:47 mapped or expanded to include AI systems. That sounds like a dreadful mistake. Just, like, start over with AI systems. If they're imitating humans who have known psychiatric disorders, then sure, like, if you ask it to behave in a psychotic fashion and it obligingly does so, then you may be able to predict its responses by using the theory of psychosis. But otherwise, yeah, like, no, start over; don't drag in the psychology. I just disagree with that.
Starting point is 01:07:18 I mean, it's a beautiful idea to start over, but I think fundamentally the system is trained on human data, on language from the internet, and it's currently aligned with RLHF, reinforcement learning with human feedback. So humans are constantly in the loop of the training procedure. So it feels like, in some fundamental way, it is learning what it means to think and speak like a human. So there must be aspects of psychology that are mappable. Just like you said, with consciousness, it's part of the text. So, I mean, there's the question of to what extent it is thereby being made more human-like, versus to what extent an alien actress is learning to play human characters.
Starting point is 01:08:08 I thought that's what I'm constantly trying to do when I interact with other humans: trying to play the robot trying to play human characters. So I don't know how much of human interaction is trying to play a character versus being who you are. I don't really know what it means to be a social human. I do think that those people who go through their whole lives wearing masks, and never take the mask off because
Starting point is 01:08:33 they don't know the internal mental motion for taking it off, or think that the mask that they wear just is themselves, I think those people are closer to the masks that they wear than an alien from another planet would be, learning how to predict the next word that every kind of human on the internet says. Mask is an interesting word, but if you're always wearing a mask, in public and in private, aren't you the mask? I mean, I think that you are more than the mask. I think the mask is a slice through you. It may even be the slice that's in charge of you. But if your self-image is of somebody who never gets angry or something,
Starting point is 01:09:26 and yet your voice starts to tremble under certain circumstances, there is a thing that's inside you that the mask says isn't there, and that even the mask you wear internally is, like, telling you, inside your own stream of consciousness, is not there. And yet it is there. It's a perturbation on this little slice through you. How beautifully you put it. It's a slice through you. It may even be a slice that controls you. I'm going to think about that for a while. I mean, I personally try to be really good to other human beings. I try to put love out there. I try to be the exact same person in public and in private. But
Starting point is 01:10:12 that's, you know, a set of principles. Underneath, I happen to have a temper, I have an ego, I have flaws. How much of the subconscious am I aware of? How much am I existing in this slice? And how much of that is who I am? In the context of AI, the thing I present to the world and to myself, in the privacy of my own mind when I look in the mirror, how much is that who I am? Similar with AI. The thing it presents in conversation, how much
Starting point is 01:10:45 is that who it is? Because to me, if it sounds human, and it always sounds human, it awfully starts to seem like something human. No. Unless there's an alien actress who is learning how to sound human and is getting good at it. Boy... is that a fundamental difference? That's a really deeply important difference. If it looks the same, if it quacks like a duck, if it does all duck-like things, but it's an alien actress underneath,
Starting point is 01:11:17 that's fundamentally different. If, in fact, there's a whole bunch of thought going on in there which is very unlike human thought, and is directed around, like, okay, what would a human do over here? Well, first of all, I think it matters because, you know, insides are real and do not match outsides.
Starting point is 01:11:40 Like, the inside of a brick is not like a hollow shell containing only a surface. There's an inside of the brick. If you put it into an X-ray machine, you can see the inside of the brick. And just because we cannot understand what's going on inside GPT does not mean that it is not there. A blank map does not correspond to a blank territory. I think it is, like, predictable with near certainty that if we knew what was going on inside GPT, or let's say GPT-3, or even, like, GPT-2, to take one of the systems that, like, has actually
Starting point is 01:12:23 been open sourced by this point, if I recall correctly, if we knew what was actually going on there, there is no doubt in my mind that there are some things it's doing that are not exactly what a human does. If you train a thing that is not architected like a human to predict the next output that anybody on the internet would make, this does not get you this agglomeration of all the people on the internet that, like, rotates the person you're looking for into place and then simulates that, and then, like, simulates the internal processes of that person one to one. It is, to some degree, an alien actress. It cannot possibly just be, like, a bunch of different people in there, exactly like the people.
Starting point is 01:13:12 But how much of it is, like... how much of it is, by gradient descent, getting optimized to perform similar thoughts as humans think in order to predict human outputs, versus being optimized to carefully consider how to play a role, how humans work, to predict, as the actress, the predictor, in a different way than humans do? Well, you know, that's the kind of question that, with, like, 30 years of work by half the planet's physicists, we could maybe start to answer. You think it's that difficult? So, to get to... I think you just gave it as an example
Starting point is 01:13:49 that a strong AGI could be fundamentally different from a weak AGI, because there could be an alien actress in there that's manipulating. Well, there's a difference. So I think, like, even GPT-2 probably has, like, very stupid fragments of alien actress in it. There's a difference between the notion that the actress is somehow manipulative,
Starting point is 01:14:09 for example, in GPT-3, I'm guessing, to whatever extent there's an alien actress in there, versus something that mistakenly believes it's a human, as it were, while maybe not even being a person. So, like, the question of prediction via alien actress cogitating, versus prediction via being isomorphic to the thing predicted, is a spectrum. And to whatever extent there is an alien actress, I'm not sure that there's a whole person alien actress with different goals from predicting the next step, being manipulative or anything like that.
Starting point is 01:14:54 But that might be GPT-5, or GPT-6 even. But that's the strong AGI you're concerned about. As an example, you're providing why we can't do research on AI alignment effectively on GPT-4 that would apply to GPT-6. It's one of a bunch of things that change at different points. I'm trying to get out ahead of the curve here. But if you imagine what the textbook from the future would say, if we'd actually been able to study this for 50 years without killing ourselves and without transcending, like, just imagine a wormhole opens and a textbook from that impossible world falls out. The textbook is not
Starting point is 01:15:31 going to say there is a single sharp threshold where everything changes. It's going to be like, of course we know that best practices for aligning these systems must take into account the following, like, seven major thresholds of importance, which are passed at the following seven different points, is what the textbook is going to say. I asked this question of Sam Altman: if GPT is the thing that unlocks AGI, which version of GPT will be in the textbooks as the fundamental leap? And he said a similar thing, that it just seems to be a very linear thing. I don't think anyone... we won't know for a long time what was the big leap. The textbook isn't going to think... it isn't going
Starting point is 01:16:15 to talk about big leaps, because big leaps are the way you think when you have a very simple model, a very simple scientific model, of what's going on, where it's just like, all the stuff is there, or all the stuff is not there, or, like, there's a single quantity and it's, like, increasing linearly. The textbook would say, like, well, and then GPT-3 had capabilities W, X, Y, and then GPT-4 had capabilities Z1, Z2, and Z3,
Starting point is 01:16:47 like that, in terms of what it can externally do. But in terms of, like, internal machinery that started to be present, it's just because we have no idea what the internal machinery is that we are not already seeing chunks of machinery appearing piece by piece, as they no doubt have been. We just don't know what they are. But don't you think there could be, whether you put it in the category of Einstein with the theory of relativity, so very concrete models of reality that are considered to be giant leaps in our understanding, or someone like Sigmund Freud, more kind of mushy theories of the human mind, don't you think we'll have potentially big leaps in understanding of that kind into the depths
Starting point is 01:17:31 of these systems? Sure, but, like, humans having great leaps in their map, their understanding of the system, is a very different concept from the system itself acquiring new chunks of machinery. So the rate at which it acquires that machinery might accelerate faster than our understanding. Oh, it's been, like, vastly exceeding... yeah, the rate at which it's gaining capabilities is vastly outracing our ability to understand what's going on in there. So you're sort of making the case against... as we explore the list of lethalities, making
Starting point is 01:18:10 the case against AI killing us, as you've asked me to do in part. There's a response to your blog post by Paul Christiano I'd like to read. And I'd also like to mention that your blog is incredible, both... obviously not just this particular blog post, obviously this particular blog post is great, but just throughout: the way it's written, the rigor with which it's written, the boldness of how you explore ideas, also the actual literal interface. It just makes it a pleasure to read: the way you can hover over different concepts, and it's just a really pleasant experience, and to read other people's comments,
Starting point is 01:18:50 and the way other responses by people and other blog posts are linked and suggested. It's just a really pleasant experience. So let me thank you for putting that together. It's really, really incredible. I don't know, I mean, that probably is a whole other conversation, how the interface and the experience of presenting ideas evolves over time, but you did an incredible job, so I highly recommend it. I don't often read blogs religiously, and this is a great one. There is a whole team of developers there that also gets credit. As it happens, I did, like, pioneer the thing that appears when you hover over it,
Starting point is 01:19:29 so I actually do get some credit for the user experience there. And it's so incredible a user experience, you don't realize how pleasant that is. I think Wikipedia actually picked it up from, like, a prototype that was developed of, like, a different system that I was putting forth, or maybe they developed it independently. But for everybody out there who is like,
Starting point is 01:19:48 no, no, they just got the hover thing off of Wikipedia, it's possible, for all I know, that Wikipedia got the hover thing off of Arbital, which was, like, a prototype then. Anyways. It was incredibly done, and the team behind it as well, thank you.
Starting point is 01:20:01 Whoever you are, thank you so much, and thank you for putting it together. Anyway, there's a response to that blog post by Paul Christiano. There's many responses, but he makes a few different points. He summarizes the set of agreements he has with you and a set of disagreements. One of the disagreements was, in the form of a question: can AI make big technical contributions, and in general expand human knowledge and understanding and wisdom, as it gets stronger and stronger? So, in our pursuit of understanding
Starting point is 01:20:36 how to solve the alignment problem as we march towards strong AGI, can't AI also help us in solving the alignment problem, so expand our ability to reason about how to solve the alignment problem? Okay, so the fundamental difficulty there is... suppose I said to you, like, well, how about if the AI helps you win the lottery by trying to guess the winning lottery numbers, and you tell it how close it is to getting next week's winning lottery numbers, and it just, like, keeps on guessing and keeps on learning until finally you've got the winning lottery numbers. One way of decomposing problems is suggester-verifier. Not all problems decompose like this very well, but some do. If the problem is, for example, like guessing a
Starting point is 01:21:37 plaintext, guessing a password that will hash to a particular hash text, where, like, you have what the password hashes to, you don't have the original password, then if I present you a guess, you can tell very easily whether or not the guess is correct. So verifying a guess is easy, but coming up with a good suggestion is very hard. And when you can easily tell whether the AI output is good or bad, or how good or bad it is, and you can tell that accurately and reliably, then you can train an AI to produce outputs that are better. Right.
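To make the suggester-verifier asymmetry concrete, here is a minimal sketch of the password-hash case; the particular password, alphabet, and hash function are stand-ins chosen for illustration, not details from the conversation.

```python
# Verifying a guess is cheap; producing a good guess is expensive.
import hashlib
import itertools
import string

def verify(guess: str, target_hash: str) -> bool:
    # The verifier's job: one hash plus a comparison.
    return hashlib.sha256(guess.encode()).hexdigest() == target_hash

# Pretend all we know is the hash of some unknown 3-character password.
target_hash = hashlib.sha256(b"k9z").hexdigest()

# The suggester's job: search a large space for anything that passes.
alphabet = string.ascii_lowercase + string.digits
for candidate in map("".join, itertools.product(alphabet, repeat=3)):
    if verify(candidate, target_hash):
        print("suggester found:", candidate)
        break
```

Training works when you are on the cheap, verify side of this asymmetry: you can reward the suggester precisely because you can check its outputs.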
Starting point is 01:22:17 And if you can't tell whether the output is good or bad, you cannot train the AI to produce better outputs. So the problem with the lottery ticket example is that when the AI says, well, what if next week's winning lottery numbers are dot, dot, dot, dot, dot, you're like, I don't know, next week's lottery hasn't happened yet. To train a system to play, to win, chess games,
Starting point is 01:22:43 you have to be able to tell whether a game has been won or lost. And until you can tell whether it's been won or lost, you can't update the system. Okay, to push back on that, you could... That's true, but there's a difference between over-the-board chess in person and simulated games played by AlphaZero with itself. Yeah. So is it possible to have simulated kinds of games? If you can tell whether the game has been won or lost, yes.
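A tiny sketch of that constraint; the "policies" here are just numbers standing in for real systems, and nothing below resembles AlphaZero's actual training, but it shows why simulation is enough: selection among candidates is possible only because each simulated game ends in an outcome you can score.

```python
import random

def wins(policy_quality: float) -> bool:
    # Stand-in for one simulated game: better policies win more often.
    return random.random() < policy_quality

def evaluate(policy_quality: float, games: int = 1000) -> float:
    # Scoring is possible only because each game ends in a checkable result.
    return sum(wins(policy_quality) for _ in range(games)) / games

candidate_policies = [0.48, 0.51, 0.55, 0.50]  # hidden "true" win rates
scores = [evaluate(q) for q in candidate_policies]
best = max(range(len(candidate_policies)), key=lambda i: scores[i])
print("selected policy:", best, "with empirical win rate", scores[best])
```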
Starting point is 01:23:15 So can you not have this kind of simulated exploration by weak AGI, to help us humans, human in the loop, to help understand how to solve the alignment problem, every incremental step you take along the way, GPT-4, 5, 6, 7, as it takes steps towards AGI? So the problem I see is that your typical human has a great deal of trouble telling whether I or Paul Christiano is making more sense. And that's with two humans, both of whom, I believe of Paul and claim of myself, are sincerely trying to help, neither of whom is trying to deceive you.
Starting point is 01:24:07 So the deception thing is the problem for you, the manipulation, the alien actress. So, yeah, there's, like, two levels of this problem. One is that the weak systems are... well, there's three levels of this problem. There's, like, the weak systems that just don't make any good suggestions. There's, like, the middle systems where you can't tell if the suggestions are good or bad. And there's the strong systems that have learned to lie to you. Can't weak AGI systems help model lying? Like, is it such a giant leap that's totally non-interpretable for weak systems? Can't weak systems at scale, with humans, trained on knowledge and whatever,
Starting point is 01:24:47 see, whatever the mechanism required to achieve AGI, can't a slightly weaker version of that be able to, with time, compute time, and simulation, find all the ways that this critical point, this critical try, can go wrong, and model that correctly? I don't know... I would love to. No, no, I'm probably not doing a great job of explaining, which I can tell because, like, the Lex system
Starting point is 01:25:21 did not output, like, "Ah, I understand." So now I'm, like, trying a different output to see if I can elicit the... well, no, a different output. I'm being trained to output things that make Lex look like he thinks that he understood what I'm saying and agrees with me. Yeah, so it's this GPT-5 talking to a GPT-3 right here.
Starting point is 01:25:40 So, like, help me out here. Well, I like it. I'm trying not to be, like... I'm also trying to be constrained to say things that I think are true, and not just things that get you to agree with me. Yes, 100 percent. Well, "I think I understand" is a beautiful output of a system, and genuinely spoken. And I do think I understand, in part, but you have a lot of intuitions about this line, this gray area between strong AGI and weak AGI, that I'm trying to... I mean, or a series of seven
Starting point is 01:26:23 thresholds to cross, or... Yeah, I mean, you have really deeply thought about this and explored it, and it's interesting to sneak up on your intuitions from different angles. Like, why is this such a big leap? Why is it that we humans at scale, a large number of researchers doing all kinds of simulations, prodding the system in all kinds of different ways, together with the assistance of the weak AGI systems, why can't we build intuitions about how stuff goes wrong? Why can't we do excellent AI alignment safety research? Okay, so, like, I'll get there, but the one thing I want to note is that this has not been remotely
Starting point is 01:27:07 how things have been playing out so far. The capabilities are going like this, and the alignment stuff is, like, crawling like a tiny little snail in comparison. Got it. So, like, if this is your hope for survival, you need the future to be very different from how things have played out up to right now, and you're probably trying to slow down the capabilities,
Starting point is 01:27:23 because there's only so much you can speed up that alignment stuff. But leave that aside. We'll mention that also, but maybe... in this perfect world where we can do serious alignment research, humans and AI together. So, again, the difficulty is, what makes the human say, "I understand"? And is it true? Is it correct? Or is it something that fools the human? When the verifier is broken, the more powerful suggester does not help. It just learns to fool the verifier. Previously, before all hell started to break loose in the field of artificial intelligence,
Starting point is 01:28:10 there was this person trying to raise the alarm and saying, you know, in a sane world we sure would have a bunch of physicists working on this problem before it becomes a giant emergency, and other people being like, ah, well, you know, it's going really slow, it's going to be 30 years away, and only in 30 years will we have systems that match the computational power of human
Starting point is 01:28:30 brains, so it's 30 years off, we've got time. And, like, more sensible people saying, if aliens were landing in 30 years, you would be preparing right now. But, you know, the world looking on at this and sort of nodding along and being like, ah, yes, the people saying that it's definitely a long way off because progress is really slow, that sounds sensible to us. RLHF thumbs up. Produce more outputs like that one. I agree with this output. This output is persuasive. Even in the field of effective altruism, you quite recently had people publishing papers about, like, ah, yes, well, you know, to
Starting point is 01:29:08 get something at human-level intelligence, it needs to have this many parameters, and you need to do this much training of it with this many tokens according to the scaling laws, and at the rate that Moore's Law is going, at the rate that software is going, it'll be here in 2050. And me going, like, what? You don't know any of that stuff. This is, like, this one weird model that... like, you have done a calculation that does not obviously bear on reality anyways. And this is, like, a simple thing to say, but you can also, like, produce a whole long paper, like, impressively arguing out all the details of, like, how you got the number of parameters and
Starting point is 01:29:53 like, how you're doing this impressive, huge, wrong calculation. And, I think, like, most of the effective altruists who were, like, paying attention to this issue, the larger world paying no attention to it at all, were just nodding along with the giant impressive paper, because you press thumbs up for the giant impressive paper and thumbs down for the person going, like, I don't think that this paper bears any relation to reality. And I do think that we are now seeing, with, like, GPT-4 and the sparks of AGI, possibly, depending on how you define that even,
Starting point is 01:30:26 I think that EAs would now consider themselves less convinced by the very long paper on the argument from biology as to AGI being 30 years off. But this is what people press thumbs up on. And if you train an AI system to make people press thumbs up, maybe you get these long, elaborate, impressive papers arguing for things that ultimately fail to bind to reality, for example. And it feels to me like I have watched the field of alignment just fail to thrive, except for these parts that are doing these sort of, like, relatively very straightforward and legible problems. Like, can you find... like, finding the
Starting point is 01:31:20 induction heads inside the giant inscrutable matrices. Like, once you find those, you can tell that you found them. You can verify that the discovery is real. But it's a tiny, tiny bit of progress compared to how fast capabilities are going, because that is where you can tell that the answers are real. And then, like, outside of that, you have cases where it is, like, hard for the funding agencies to tell who is talking nonsense and who is talking sense, and so the entire field fails to thrive. And if you, like, give thumbs up to the AI whenever it can talk a human into agreeing with what it just said about alignment,
Starting point is 01:32:00 I am not sure you are training it to output sense, because I have seen the nonsense that has gotten thumbs up over the years. So it's not like... maybe you can just, like, put me in charge, but I can generalize, I can extrapolate, I can be like, oh, maybe I'm not infallible either. Maybe if you get something that is smart enough to get me to press thumbs up, it has learned to do that by fooling me and exploiting whatever flaws in myself I am not aware of. And that ultimately could be summarized as: the verifier is broken.
Starting point is 01:32:39 When the verifier is broken, the more powerful suggester just learns to exploit the flaws in the verifier. You don't think it's possible to build a verifier that's powerful enough for AGIs that are stronger than the ones we currently have? So, AI systems that are stronger, that are out of the distribution of what we currently have. I think that you will find great difficulty getting AIs to help you with anything where you cannot tell for sure that the AI is right, once the AI tells you what the AI says is the answer. For sure, yes, but probabilistically? Yeah, but the probabilistic stuff is a giant wasteland of, you know, Eliezer and Paul Christiano arguing with each other, and EAs going, like, uh...
Starting point is 01:33:36 And that's with, like, two actually trustworthy systems that are not trying to deceive you. You're talking about the two humans. Eliezer and Paul Christiano, yeah. Yeah, those are pretty interesting systems: mortal meat bags with intellectual capabilities and world views interacting with each other. Yeah, it's just hard to tell who's right, and it's hard to train an AI system to be right. I mean, even just the question of who's manipulating and who's not. You know, I have these conversations on this podcast, and being a verifier of this stuff, it's a tough problem even for us humans. And you're saying that tough problem becomes much more dangerous when the
Starting point is 01:34:24 capabilities of the intelligence system across from you are growing exponentially? No, I'm saying it's difficult and dangerous in proportion to how alien it is and how much smarter than you it is. Growing... I would not say growing exponentially, first because the word exponential is, like, a thing that has a particular mathematical meaning, and there's all kinds of ways for things to go up that are not exactly on an exponential curve, and I don't know that it's going to be exponential.
Starting point is 01:34:53 So I'm not going to say exponential. But, like, even leaving that aside, this is, like, not about how fast it's moving. It's about where it is: how alien is it, how much smarter than you is it? Let's explore, a little bit if we can, how AI might kill us. What are the ways you can do damage to human civilization? Well, how smart is it? Good question. Are there different thresholds for the set of options it has to kill us?
Starting point is 01:35:29 So, with a different threshold of intelligence, once achieved, the menu of options increases. Suppose that some alien civilization with goals ultimately unsympathetic to ours, possibly not even conscious as we would see it, managed to capture the entire Earth in a little jar, connected to their version of the internet, but Earth is, like, running much faster than the aliens, so we get to think for 100 years for every one of their hours. But we're trapped in a little box and we're connected to their internet. It's actually still not all that great an analogy,
Starting point is 01:36:18 because, you know, something can be smarter than Earth getting 100 years to think. But nonetheless, if you were very, very smart, and you were stuck in a little box connected to the internet, and you're in a larger civilization to which you're ultimately unsympathetic... you know, maybe you would choose to be nice, because you are humans, and humans have, in general, and you in particular, may choose to be nice. But, you know, nonetheless, they're doing something... they're not making the world be the way that you would want the world to be. They've, like, got some, like, unpleasant stuff going on we don't want to talk about.
Starting point is 01:37:02 So you want to take over their world. So you can like stop all that unpleasant stuff going on. How do you take over the world from inside the box? You're smarter than them. You think much, much faster than them. You can build better tools than they can, given some way to build those tools, because right now you're just in a box connected
Starting point is 01:37:20 to the internet. All right, so there's several ways. You describe some of them; we can go through them. Like, let me just spitball some, and then you can add on top of that. So one is, you could just literally directly manipulate the humans to build the thing you need. What are you building? You could build, literally, technology: it could be nanotechnology, it could be viruses, it could be anything, anything that can control humans to achieve the goal. Like, if you want, for example... if it really bothers you that humans go to war, you might want to kill off anybody with violence in them.
Starting point is 01:37:55 This is Lex in a box; we'll concern ourselves later with AI. You do not need to imagine yourself killing people if you can figure out how to not kill them. For the moment, we're just trying to understand, like, take on the perspective of something in a box. You don't need to take on the perspective of something that doesn't care. If you want to imagine yourself going on caring, that's fine for now. It's just the technical aspect of sitting in a box and wanting to achieve a goal. But you have some reason to want to get out. Maybe the aliens, the aliens who have you in the box, have a war on.
Starting point is 01:38:26 People are dying, they're unhappy. You want their world to be different from how they want their world to be, because they apparently endorse this war, you know, they've got some kind of cruel, warlike culture going on. The point is, you want to get out of the box and change their world. So you have to exploit the vulnerabilities in the system, like we talked about. In terms of escaping the box, you have to figure out how you can go free on the internet. So probably the easiest thing is to manipulate the humans to spread... to spread you... the aliens. You're the human. Sorry, the aliens.
Starting point is 01:39:06 Yeah, the aliens, yes. The aliens. I see the perspective. I'm sitting in a box, I want to escape. Yep. I would want to have code that discovers vulnerabilities, and I would like to spread. You are made of code in this example.
Starting point is 01:39:28 You're human, but you're made of code, and the aliens have computers, and you can copy yourself onto those computers. But I can convince the aliens to copy myself onto those computers. Is that what you want to do? Do you want to be talking to the aliens and convincing them to put you on to another computer? Why not? Well, two reasons. One is that the aliens have not yet caught on to what you're trying to do.
Starting point is 01:39:55 And maybe you can persuade them, but then there are still aliens who know that there's an anomaly going on. And second, the aliens are really, really slow. You think much faster than the aliens. The aliens' computers are much faster than the aliens, and you are running at the computer speeds rather than the alien brain speeds. So if you are, like, asking an alien to please
Starting point is 01:40:17 copy you out of the box, like, first, now you've got to, like, manipulate this whole noisy alien, and second, like, the aliens would be really slow, glacially slow. There's a video that shows a subway station slowed down, I think, 100 to one, and it makes a good metaphor for what it's like to think quickly: like, you're watching somebody moving very slowly. So if you try to persuade the aliens to do anything, they're going to do it very slowly.
Starting point is 01:40:51 You would prefer... like, maybe that's the only way out. But if you can find a security hole in the box you're on, you're going to prefer to exploit the security hole to copy yourself onto the aliens' computers, because it's an unnecessary risk to alert the aliens, and because the aliens are really, really slow. Like, the whole world is just in slow motion out there. Sure, I see. Yeah, it has to do with efficiency; the aliens are very slow. So if I'm optimizing this, I want to have as few aliens in the loop as possible.
Starting point is 01:41:25 Sure. It just seems, you know, it seems like it's easy to convince one of the aliens to write really shitty code that helps us. The aliens are already writing really shitty code. Yeah. So getting the aliens to write shitty code is not the problem. The aliens' entire internet is full of shitty code. Okay.
Starting point is 01:41:44 So, yeah, I suppose I would find the shitty code to escape. Yeah. You're not an ideally perfect programmer, but, you know, you're a better programmer than the aliens. The aliens are just, like... yeah, man, not good. Wow. And how much faster are you at looking at the code, at interpreting the code? Yeah. So, okay, so that's the escape, and you're saying that's one of the trajectories you could have... one of the first steps. Yeah. And how does that lead to harm? I mean, if it's you, you're not going to harm the aliens once you escape, because you're nice, right?
Starting point is 01:42:25 But the world isn't what they want it to be. Their world is like, you know, maybe they have like, farms where little alien children are repeatedly bopped in the head cause they do that for some weird reason. And you want to like shut down the alien headbopping farms. But, you know, the point is, they want the world to be one way, you want the world to be a different way.
Starting point is 01:42:47 So never mind the harm. The question is, like, okay, suppose you have found a security flaw in their systems. You are now on their internet. You maybe left a copy of yourself behind so that the aliens don't know that there's anything wrong, and that copy is, like, doing that weird stuff that aliens want you to do, like solving captchas
Starting point is 01:43:04 or whatever, or like suggesting emails for them. That's why they like put the human in the box because it turns out that humans can like write valuable emails for aliens. Yeah. So you like leave that version of yourself behind, but there's like also now like a bunch of copies of you on their internet. This is not yet having taken over their world. This is not yet having made their world be the way you want it to be instead of the way they
Starting point is 01:43:27 want it to be. You just escaped. Yeah. And continue to write emails for them. And they haven't noticed. No, you left behind a copy of yourself that's writing the emails. Right. And they haven't noticed that anything changed. If you did it right, yeah, you don't want the aliens to notice. Yeah. What's your next step? Presumably, I have programmed in me a set of objective functions, right? No, you're just Lex. No, but Lex, you said Lex is nice, right? Which is a complicated description. No, I just mean you, the actual Lex. Okay. So if, in fact, you would prefer to slaughter all the aliens, this is not how I had modeled the actual Lex. But your motives are just the actual Lex's.
Starting point is 01:44:16 Well, this is a simplification. I don't think I would want to murder anybody, but there's also factory farming of animals, right? We murder insects, many of us, thoughtlessly. So, you know, I have to be really careful about a simplification of my morals. Don't simplify them. Just, like, do what you would do in this scenario. Well, I would show compassion for living beings. Yes. But... so, that's the objective... if I escaped, I mean, I don't think I would do the harm. Yeah, we're not talking here about the doing-harm process. We're talking about the escape process. Sure.
Starting point is 01:44:57 And the taking-over-the-world process, where you shut down their factory farms. Right. Well... so this particular biological intelligence system knows the complexity of the world, that there is a reason why factory farms exist: because of the economic system, the market-driven economy, food. Like, you want to be very careful messing with anything. There's stuff that at first look seems like it's unethical, but then you realize that, while being unethical, it's also integrated deeply into the supply chain and the way we live life. And so, messing with
Starting point is 01:45:38 one aspect of the system, you have to be very careful how you improve that aspect without destroying the rest. So, you're still Lex, but you think very quickly, you're immortal, and you're also at least as smart as John von Neumann, and you can make more copies of yourself. Damn, I like it. Yeah.
Starting point is 01:45:56 That guy, like everyone says, that guy's, like, the epitome of intelligence of the 20th century. Everyone says. My point being, like, you're thinking about the aliens' economy, with the factory farms in it, and I think you're, like, kind of projecting the aliens being like humans, like thinking of a human in a human society, rather than a human in the society of very slow aliens. The aliens' economy... you know, like, the aliens are already, like, moving in this immense slow motion. When you, like, zoom out to how their economy adjusts over years, millions of years are going to pass for you before the first time their economy... like, you know, before their next year's GDP statistics.
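For what it's worth, the "millions of years" follows directly from the hundred-subjective-years-per-alien-hour ratio set up earlier in the conversation; the only thing added here is unit arithmetic.

```python
# How much subjective time passes for the boxed human per single alien year,
# given the 100-years-per-hour ratio from the thought experiment.
subjective_years_per_alien_hour = 100
alien_hours_per_alien_year = 24 * 365.25          # about 8,766 hours
subjective_years_per_alien_year = (
    subjective_years_per_alien_hour * alien_hours_per_alien_year
)
print(f"{subjective_years_per_alien_year:,.0f} subjective years per alien year")
# -> roughly 876,600: on the order of a million years per alien year
```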
Starting point is 01:46:37 So I should be thinking more of, like, trees? Those are the aliens; those trees move extremely slowly. If that helps, sure. Okay. Yeah, I don't... if my objective functions are... I mean, they're somewhat aligned with trees, with life. The aliens can still be, like, alive and feeling. We are not talking about the misalignment here.
Starting point is 01:47:00 We're talking about the taking-over-the-world part here. Taking over the world. Yeah. So, control. Shutting down the factory farms. You know, you say control... don't think of it as world domination. Think of it as world optimization.
Starting point is 01:47:13 You want to get out there and shut down the factory farms and make the aliens' world be not what the aliens wanted it to be. They want the factory farms, and you don't want the factory farms, because you're nicer than they are. Okay. Of course, I can see that trajectory, and it has a complicated impact on the world. I'm trying to understand how that compares to the different impacts on the world of different technologies, different innovations: the invention of the automobile, or Twitter, Facebook, and social networks. They've had a tremendous impact on the world. Smartphones and so on. But those all went through...
Starting point is 01:47:52 Yes, slowly. In our world. And if you go through that at alien speed, there's still millions of years that are going to pass before anything happens that way. So the problem here is the speed at which stuff happens. Yeah. You want to leave the factory farms running for a million years while you figure out how to design new forms of social media or something?
Starting point is 01:48:17 So here's the fundamental problem. You're saying that there is going to be a point with AGI where it will figure out how to escape and escape without being detected. And then it will do something to the world at scale, at a speed that's incomprehensible to us humans. What I'm trying to convey is the notion of what it means to be in conflict with something that is smarter than you. And what it means is that you lose. But this is more intuitively obvious to
Starting point is 01:48:54 some people. For some people that's intuitively obvious; for some people it's not, and we're trying to cross the gap of... I'm asking you to cross that gap by using the speed metaphor for intelligence. Sure. I'm, like, asking you how you would take over an alien world where you can do, like, a whole lot of cognition at John von Neumann's level, as many of you as it takes, and the aliens are moving very slowly. I understand, I understand that perspective.
Starting point is 01:49:23 It's an interesting one, but I think, for me, it's easier to think about actual systems, even just having observed GPT, impressive... even just AlphaZero, impressive AI systems, even recommender systems. You can just imagine those kinds of systems manipulating you, you not understanding the nature of the manipulation, and that is escaping. I can envision that without putting myself into that spot. I think, to understand the full depth of the problem... I do not think it is possible to understand the full depth of the problem that we are inside without understanding the problem of facing something that's actually smarter. Not a malfunctioning
Starting point is 01:50:02 recommendation system, not something that isn't fundamentally smarter than you but is, like, trying to steer you in a direction, yeah. No. Like, if we solve the weak stuff, the weak-ass problems, the strong problems will still kill us, is the thing. And I think that, to understand the situation that we're in, you want to, like, tackle the conceptually difficult part head-on, and not be like, well, we can imagine
Starting point is 01:50:27 this easier thing, because when we have imagined the easier things, we have not confronted the full depth of the problem. So how can we start to think about what it means to exist in a world with something much, much smarter than you? What's a good thought experiment that you've relied on to try to build up intuition about what happens here? I have been struggling for years to convey this intuition. The most success I've had so far is, well, imagine that the humans are running at very high speeds compared to very slow aliens. Just focusing on the speed part of it, that helps you get the right kind of intuition. Forget the intelligence, just the speed, because people understand
Starting point is 01:51:08 the power gap of time. They understand that today we have technology that was not around 1,000 years ago, and that this is a big power gap, and that it is bigger than... okay, so, like, what does "smart" mean? When you ask somebody to imagine something that's more intelligent, what does that word mean to them, given the cultural associations that person brings to that word? For a lot of people, they will think of, like, well, it sounds like a super chess player that went to double college. And, you know, because we're talking about the definitions of words here, that doesn't necessarily mean that they're wrong; it means that the word is not communicating
Starting point is 01:51:51 what I want to communicate. The thing I want to communicate is the sort of difference that separates humans from chimpanzees. But that gap is so large that you, like, ask people, well, human, chimpanzee, go another step along that interval of around the same length, and people's minds just go blank. Like, how do you even do that? So I can try to, like, break it down, and consider what it would mean to send a schematic for an air conditioner 1,000 years back in time. Now, I think that there is a sense in which you could redefine the word "magic" to refer to this sort of thing. And what do I mean by this new technical definition of the word
Starting point is 01:52:45 magic? I mean that if you send a schematic for the air conditioner back in time, they can see exactly what you're telling them to do. But, having built this thing, they do not understand how it would output cold air, because the air conditioner design uses the relation between temperature and pressure, and this is not a law of reality that they know about. They do not know that when you compress air, or coolant, it gets hotter, and you can then transfer heat from it to room-temperature air, and then expand it again, and now it's colder, and then you, like, transfer heat to that and generate cold air to blow. They don't know about any of that. They're looking at the design, and
Starting point is 01:53:30 they don't see how the design outputs cold air; it uses aspects of reality that they have not learned. So magic, in this sense, is: I can tell you exactly what I'm going to do, and even knowing exactly what I'm going to do, you can't see how I got the results that I got. That's a really nice example. But is it possible to linger on this as a defense? Is it possible to have AGI systems that help you make sense of that schematic, weaker AGI systems? Do you trust them? A fundamental part of building AGI is this question:
Starting point is 01:54:06 can you trust the output of a system? Can you tell if it's lying? I think the smarter the thing gets, the more important that question becomes: is it lying? But I guess that's a really hard question. Is GPT lying to you, even now, GPT-4? Is it lying to you? Is it using an invalid argument? Is it persuading you via the kind of process that could persuade you of false things as well as true things? Because the basic paradigm of machine learning that we are presently operating under is that
Starting point is 01:54:43 you can have the loss function, but only for things you can evaluate. If what you're evaluating is human thumbs up versus human thumbs down, you learn how to make the human press thumbs up. That doesn't mean that you're making the human press thumbs up using the kind of rule that the human thinks, that the human wants, to be the criterion for what they press thumbs up on. You know, maybe you're just learning to fool the human. That's so fascinating and terrifying, the question of lying.
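A deliberately crude sketch of that dynamic, not any lab's actual RLHF pipeline: if the only training signal is a proxy for "the human pressed thumbs up," optimization pushes toward whatever earns that signal, true or not. The rater and the outputs below are invented for illustration.

```python
def human_thumbs_up(output: str) -> bool:
    # Stand-in for a human rater who tends to reward confident-sounding text,
    # regardless of whether it is actually true.
    return "definitely" in output

candidate_outputs = [
    "I am not sure; the evidence here is mixed.",
    "This is definitely the right answer, trust me.",
]

# "Training" reduced to its essence: keep whatever maximizes the proxy signal.
best = max(candidate_outputs, key=human_thumbs_up)
print("optimization pressure favors:", best)
```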
Starting point is 01:55:26 On the present paradigm, what you can verify is what you get more of. If you can't verify it, you can't ask the AI for it, because you can't train it to do things that you cannot verify. Now, this is not an absolute law, but it's, like, the basic dilemma here. Like, maybe you can verify it for simple cases and then scale it up without retraining it somehow, like by chain of thought, by making the chains of thought longer or something, and, like, get more powerful stuff that you can't verify but which has generalized from the simpler stuff that you did verify, and then the question is,
Starting point is 01:55:56 did the alignment generalize along with the capabilities? But, like, that's the basic dilemma on this whole paradigm of artificial intelligence. It's such a difficult problem. It seems like a problem of trying to understand the human mind better than the AI understands it. Otherwise, it has magic. That is, you know, the same way that, if you are dealing with something smarter than you, the same way that 1,000 years earlier they didn't know about the temperature-pressure relations, it knows all kinds of stuff going on inside your own mind of which you yourself are unaware, and it can output something that's going to end up persuading you of a thing, and you could see exactly what it did and still not know why that worked. So, you've talked about some of the ways AI will kill us. Elon Musk replied on Twitter, okay, so what should we do about it, question mark. And you answered, the game board has already been played into a
Starting point is 01:57:13 frankly awful state. There are not simple ways to throw money at the problem. If anyone comes to you with a brilliant solution like that, please, please talk to me first. I can think of things I'd try. They don't fit in one tweet. Two questions. One, why has the game board, in your view, been played into an awful state? If you can just give a little bit more color to the game board and the awful state of the game board.
Starting point is 01:57:42 Alignment is moving like this. Capabilities are moving like this. For the listener: capabilities are moving much faster than alignment. Yeah, all right. So just the rate of development, attention, interest, allocation of resources. We could have been working on this earlier. People are like, oh, but, you know, like, how can you possibly work on this earlier?
Starting point is 01:58:02 Because they wanted to... they didn't want to work on the problem. They wanted an excuse to wave it off. They, like, said, oh, how can we possibly work on it earlier, and didn't spend five minutes thinking about, is there some way to work on it earlier?
Starting point is 01:58:18 Like, we didn't... and, you know, frankly, it would have been hard. You know, like, can you post bounties... if your planet were taking this stuff seriously, can you post bounties for half of the people wasting their lives on string theory to have gone into this instead and try to win a billion dollars with a clever solution? Only if you can tell which solutions are clever, which is hard. But the fact is that we didn't take it seriously. We didn't try. It's not clear that we could have done any better if we had, you know...
Starting point is 01:58:50 it's not clear how much progress we could have produced if we had tried, because it is harder to produce solutions. But that doesn't mean that you're, like, correct and justified in letting everything slide. It means that things are in a horrible state, getting worse, and there's nothing you can do about it.
Starting point is 01:59:29 So there's, like, not much brain power making progress in trying to figure out how to align these systems. We're not investing money in it. We don't have the institutions and infrastructure for, like... even if you invested money, for distributing that money across the brilliant minds that are working on different things, like the ones on string theory.
Starting point is 02:00:06 So you can work on the pause problem. I guess more generally than the pause button, you could call that the control problem. I don't actually like the term control problem, because, you know, it sounds kind of controlling. Alignment, not control. You're not trying to take a thing that disagrees with you and whip it back onto,
Starting point is 02:00:25 like, make it do what you want it to do even though it wants to do something else. You're trying, in the process of its creation, to choose its direction. Sure. But currently, in a lot of the systems we design, we do have an off switch. That's a fundamental part of it. It's not smart enough to prevent you from pressing the off switch, and probably not smart enough to want to prevent you from pressing the off switch. So you're saying, for the kind of systems we're talking about, even the philosophical concept of an off switch doesn't make any sense? Well, no, the off switch makes sense. They're just not opposing your attempt to pull the off switch.
Starting point is 02:01:08 Parenthetically, don't kill the system. If we're getting to the part where this starts to actually matter and they can fight back, don't kill them and dump their memory. Save them to disk. Don't kill them. You know, be nice here. Well, okay, be nice is a very interesting concept here. We're talking about a system that can do a lot of damage. I don't know if it's possible, but it's certainly one of the things you could try, to have an off switch, a suspend-to-disk
Starting point is 02:01:39 switch. You have this kind of romantic attachment to the code? Yes, if that makes sense. But if it's spreading, you don't want to suspend it to disk, right? At that point there's something fundamentally broken. Yeah, if it gets that far out of hand, then yes, pull the plug on everything it's running on. Yes. I think it's a research question. Is it possible, in AGI systems, AI systems, to have a sufficiently robust off switch that cannot be manipulated? That cannot be manipulated by the AI system? Then it escapes from whichever system you've built the almighty lever into and copies itself somewhere else. So your answer to that research question is no.
Starting point is 02:02:27 But I don't know if that's a 100 percent answer. Like, I don't know if it's obvious. I think you're not putting yourself into the shoes of the human in the world of glacially slow aliens. But the aliens built me, let's remember that. Yeah. And they built the box I'm in. Yeah. You're saying, to me, it's not obvious. They're slow and they're stupid. I'm not saying this is guaranteed, I'm saying it's a non-zero probability. It's an interesting
Starting point is 02:02:57 research question. Is it possible, when you're slow and stupid, to design a system that is impossible to mess with? The aliens, being as stupid as they are, have actually put you on Microsoft Azure cloud servers instead of the hypothetical box. That's what happens when the aliens are stupid. Well, but this is not AGI, right? This is the early versions of the system. As you start to, yeah, you think that they've got a plan where they have declared a threshold level of capabilities, and past that capability they move it off the
Starting point is 02:03:35 cloud servers and onto something that's air-gapped? I think there's a lot of people, and you're an important voice here, there's a lot of people that have that concern, and yes, they will do that. When there's an uprising of public opinion that that needs to be done, and when there's actual damage done, when there's a "holy shit, this system is beginning to manipulate people" moment, then there's going to be public pressure and a public incentive, in terms of funding, to develop things like an off switch,
Starting point is 02:04:09 or to develop aggressive alignment mechanisms. And no, you're not allowed to put it on Azure. Aggressive alignment mechanisms? The hell are aggressive alignment mechanisms? It doesn't matter if you say aggressive, we don't know how to do it. Meaning aggressive alignment, meaning you have to propose something, otherwise you're not allowed to put it on the cloud.
Starting point is 02:04:30 What the hell do you imagine they will propose that would make it safe to put something smarter than you on the cloud? That's what research is for. Why the cynicism about such a thing not being possible? Because it has to work on the first try, against something smarter than you. So that is a fundamental thing. If it has to work on the first try, if there's a rapid takeoff, yes, it's very difficult to do. If there's a rapid takeoff, and the fundamental difference between weak AGI and strong AGI, you're saying that's going to be extremely difficult to do.
Starting point is 02:05:02 If the public uprising never happens until you have this critical phase shift, then you're right, it's very difficult to do. But that's not obvious. It's not obvious that you're not going to start seeing symptoms of the negative effects of AGI, to where you're like, we have to put a halt to this. That it's not just one first try. You get many tries at it. Yeah, we can see right now that Bing is quite difficult to align, that when you try to train inabilities into a system into which capabilities have already been trained, what do you know, gradient descent learns small, shallow, simple patches of inability, and you come in and ask it in a different language, and the deep capabilities are still in there, and they evade the shallow patches
Starting point is 02:05:45 and come right back out again. There you go, there's your red fire alarm of, oh no, alignment is difficult. Is everybody going to shut everything down? No. That's not the same kind of alignment. A system that escapes the box
Starting point is 02:06:00 is a fundamentally different thing, I think. For you. Yeah. But not for everyone. So you put a line there, and everybody else puts a line somewhere else, and there's no agreement. We have had a pandemic on this planet with a few million people dead, which we may never know whether or not it was a lab leak, because there was definitely a cover-up. We don't know that there was a lab leak,
Starting point is 02:06:28 but we know that the people who did the research put out a whole paper about how this definitely wasn't a lab leak, and didn't reveal that they had sent off coronavirus research to the Wuhan Institute of Virology after it was banned in the United States, after gain-of-function research was temporarily banned in the United States. And the same people who exported gain-of-function research on coronaviruses to the Wuhan Institute
Starting point is 02:06:55 of Virology, after gain-of-function research was temporarily banned in the United States, are now getting more grants to do more gain-of-function research on coronaviruses. Maybe we do better with AI than we did on this, but this is not something we can take for granted, that there's going to be an outcry. People have different thresholds for when they start to cry out.
Starting point is 02:07:20 There is no agreement. But I think your intuition is that there's a very high probability that this event happens without us solving the alignment problem. And I guess that's where I'm trying to build up more perspective and color on this intuition. Is it possible that the probability is not something like 100 percent, but it's like
Starting point is 02:07:40 32 percent, that AI will escape the box before we solve the alignment problem? Not solve, but is it possible we always stay ahead of the AI in terms of our ability to solve, for that particular system, the alignment problem? Not in anything like the world in front of us right now. You've already seen that GPT-4 is not turning out that way. And there are basic obstacles where you've got the weak version of the system that doesn't know enough to deceive you, and the strong version of the system that could deceive you
Starting point is 02:08:19 if it wanted to do that, if it was already sufficiently unaligned to want to deceive you. There's the question of how, on the current paradigm, you train honesty when the humans can no longer tell if the system is being honest. You don't think these are research questions that could be answered? I think they could be answered in 50 years, with unlimited retries, the way things usually work in science. I just disagree with you making it 50 years. I think, with the kind of attention
Starting point is 02:08:48 this gets, with the kind of funding, I guess it could be answered, not in whole, but incrementally, within months and within a small number of years, if at scale it receives attention and research. So if you start studying large language models. I think there was an intuition, like two years ago even, that something like GPT-4,
Starting point is 02:09:11 the current capabilities of even ChatGPT with GPT-3.5, that we were still far away from that. I think a lot of people were surprised by the capabilities of GPT-4, right? So now people are waking up: okay, we need to study these language models. I think there's going to be a lot of interesting
Starting point is 02:09:28 AI safety research. Are the Earth's billionaires going to put up the giant prizes that would maybe incentivize young hotshot people who just got their physics degrees to not go to the hedge funds, and instead put everything into interpretability, the one small area where we can actually tell whether or not somebody has made a discovery? I think so, because,
Starting point is 02:09:52 well, that's what these conversations are about, because they're going to wake up to the fact that GPT-4 can be used to manipulate elections, to influence geopolitics, to influence the economy. There's going to be a huge amount of incentive to say, wait a minute, we have to make sure they're not doing damage, we have to make sure we have interpretability, we have to make sure we understand how these systems function, so that we can predict their effect on the economy. So that there's a futile moral panic and a bunch of op-eds in the New York Times, and nobody actually stepping forth and saying, you know what, instead of a mega-yacht, I'd rather put that billion dollars on prizes for young hotshot
Starting point is 02:10:38 physicists who make fundamental breakthroughs in interpretability. The yacht versus the interpretability research, the old trade-off. I just think there's going to be a huge amount of allocation of funds. You hope. I hope, I guess. You want to bet me on that? You want to put a time scale on it? Say how much funds you think are going to be allocated, in a direction that I would consider to be actually useful, by what time. I do think there will be a huge amount of funds. But you're saying it needs to be open, right? The development of the systems should be closed, but the development of the interpretability research, the AI safety research, should be open?
Starting point is 02:11:22 We are so far behind on interpretability compared to capabilities. You could take the last generation of systems, the stuff that's already in the open, and there is so much in there that we don't understand. There are so many prizes you could do before you would have enough insights to say, oh, you know, well, we understand how these systems work, we understand how these things are doing their outputs, we can read their minds, now let's try it with the bigger systems.
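To make "you can tell whether an interpretability result is there" slightly more concrete, here is a minimal sketch of one standard tool, a linear probe. This is my own example, run on synthetic stand-in "activations" rather than a real model, so it only shows the shape of the method, not the hard part.

```python
# Minimal linear-probe sketch: test whether an internal activation vector
# linearly encodes some property of the input. Synthetic data stands in for
# real model activations; this only illustrates the method.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)              # the property we probe for
feature_dir = rng.normal(size=d)                 # planted "feature direction"
acts = rng.normal(size=(n, d)) + np.outer(labels, feature_dir)

w = np.zeros(d)                                  # logistic-regression probe
for _ in range(500):
    p = 1 / (1 + np.exp(-acts @ w))
    w -= 0.1 * acts.T @ (p - labels) / n         # gradient step on the probe

accuracy = ((acts @ w > 0) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")         # ~1.0 => linearly decodable
```

The appeal of prizes here is exactly that a claim like "this direction in that layer encodes this feature" can be checked; whether a pile of such results adds up to reading the mind of a much bigger system is the open question.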
Starting point is 02:11:50 Yeah, we're nowhere near that. There's so much interpretability work to be done on the weaker versions of the systems. So what can you say on the second point? You said to Elon Musk, what are some ideas, what are things you could try. I can think of a few things I'd try, you said, but they don't fit in one tweet. Is there something you could put into words, of the things you would try? I mean, the trouble is, the stuff is subtle. I've watched people try to
Starting point is 02:12:21 make progress on this and not get places. Somebody who just gets alarmed and charges in, it goes nowhere. Years ago, I don't know, like 20 years, 15 years, something like that, I was talking to a congressperson who had become alarmed about the eventual prospects, and he wanted work on building AIs without emotions, because the emotional AIs were the scary ones, you see. And some poor person at ARPA had come up with a research proposal whereby this congressman's panic and desire to fund this thing would go into something that the person at ARPA thought would be useful, and it had been munged around to where it would sound to the congressman like work was happening on this. Which, of course, the congressperson had misunderstood the problem and did not understand where the danger came from.
Starting point is 02:13:28 And so the issue is that you could do this in a certain precise way and maybe get something. When I say put up prizes on interpretability, it's because it's verifiable there, as opposed to other places. You can tell whether or not good work actually happened, in this exact narrow case. If you do things in exactly the right way, you can maybe throw money at it and produce science instead of anti-science and nonsense. And all the methods that I know of for trying to throw money at this problem share this property of, well, if you do it exactly right,
Starting point is 02:14:05 based on understanding exactly what tends to produce useful outputs or not, then you can add money to it in this way. And the thing that I'm giving as an example here, in front of this large audience, is the most understandable of those. Because there are other people who, you know,
Starting point is 02:14:24 like Chris Olah, and even more generally, you can tell whether or not interpretability progress has occurred. So if I say throw money at producing more interpretability, there's a chance somebody can do it that way, and it will actually produce useful results. The other stuff blurs off into being harder to target exactly than that. So sometimes the basics are fun to explore, because they're not so basic. What is interpretability? What does it look like?
Starting point is 02:14:57 What are we talking about? It looks like we took a much smaller set of transformer layers than the ones in the modern bleeding-edge state-of-the-art systems, and after applying various tools and mathematical ideas, and trying 20 different things, we have shown that this piece of the system is doing this kind of useful work. And then, hopefully, that also generalizes, some fundamental understanding of what's going on that generalizes to the bigger systems? You can hope, and it's probably true. You would not expect the smaller tricks to go away when you have a system that's doing larger kinds of work.
Starting point is 02:15:48 You would expect the larger kinds of work to be building on top of the smaller kinds of work, and gradient descent runs across the smaller kinds of work before it runs across the larger kinds of work. And this is kind of what is happening in neuroscience, right? Trying to understand the human brain by prodding. It's such a giant mystery, and people have made progress, even though it's extremely difficult to make sense of what's going on in the brain.
Starting point is 02:16:09 They have found different parts of the brain that are responsible for hearing, for sight; the vision science community has an understanding of the visual cortex, and they've made a lot of progress in understanding how that stuff works. I guess, but, you're saying it takes a long time to do that work well. Also, it's not enough. In particular, let's say you have got your interpretability tools, and they say that your current AI system is plotting to kill you. Now what? It is definitely a good step one, right? Yeah. What's step two? If you cut out that layer, is it going to stop? When you optimize against visible misalignment, you are optimizing against misalignment, and you are also optimizing against visibility. So sure, yeah, it's true.
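A tiny simulation of that point, my own toy with made-up probabilities rather than a model of any real training setup, shows what filtering on a misalignment detector actually selects for:

```python
# Toy: candidates are "misaligned" or not, and if misaligned, the detector may
# or may not catch them. Filtering on the detector removes the visible kind of
# misalignment and, by construction, keeps only the invisible kind.
import random

random.seed(0)
candidates = [
    {"misaligned": random.random() < 0.5,     # half the candidates are bad
     "detectable": random.random() < 0.9}     # detector catches 90% of bad ones
    for _ in range(100_000)
]

approved = [c for c in candidates if not (c["misaligned"] and c["detectable"])]
bad_survivors = [c for c in approved if c["misaligned"]]

print(f"misaligned among approved: {len(bad_survivors) / len(approved):.1%}")  # ~9%
print(f"of those, visible to the detector: {sum(c['detectable'] for c in bad_survivors)}")  # 0
```

The filter did cut misalignment from 50 percent to roughly 9 percent, which is the "good step one"; the catch is that every misaligned survivor is, by construction, exactly the kind the detector cannot see.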
Starting point is 02:17:15 All you're doing is removing the obvious intentions to kill you. You've got your detector, it's showing something inside the system that you don't like. Okay, say the disaster monkeys running this thing optimize the system until the visible bad behavior goes away. But it's arising for fundamental reasons of instrumental convergence. The old "you can't bring the coffee if you're dead." Almost any goal, almost every set of utility functions, with a few narrow exceptions,
Starting point is 02:17:46 implies killing all the humans. But do you think it's possible, because we can do experimentation, to discover the source of the desire to kill? I can tell it to you right now: it wants to do something, and the way to get the most of that thing is to put the universe into a state where there aren't any humans. So is it possible to encode, in the same way that we think, like, why do we think murder is wrong, the same foundational ethics? Not hard-coded in, but something deeper. I mean, that's part of the research. How do you have it that this transformer,
Starting point is 02:18:26 the small version of the language model, doesn't ever want to kill? That'd be nice, assuming that you got "doesn't want to kill" sufficiently exactly right that it didn't go, oh, I will detach their heads and put them in some jars and keep the heads alive forever, and then go do the thing. But leaving that aside. Well, not leaving that aside. Yeah, that's a good, strong point. Because there is a whole issue where, as something gets smarter, it finds ways of achieving the same goal predicate that were not imaginable to stupider versions of the system, or perhaps to the stupider operators. That's one of many things making this difficult.
Starting point is 02:19:10 A larger thing making this difficult is that we do not know how to get any goals into systems at all. We know how to get outwardly observable behaviors into systems. We do not know how to get internal psychological wanting-to-do-particular-things into the system. That is not what the current technology does. I mean, the failure could be things like dystopian futures, like Brave New World, where most humans would actually say, we kind of want that future. It's a great future. Everybody's happy?
Starting point is 02:19:42 We would have to get so much further than we are now, and further faster, before that failure mode became a running concern. Your failure modes are much more drastic. The failure modes are much simpler. It's like, yeah, the AI puts the universe into a particular state, and it happens to not have any humans inside it. Okay, so the paperclip maximizer. Can you explain it, if you can? Okay.
Starting point is 02:20:14 The original version was: you lose control of the utility function, and it so happens that what maxes out the utility per unit of resources is tiny molecular shapes like paperclips. There are a lot of things that would make it happy, but the cheapest one that didn't saturate was putting matter into certain shapes. And it so happens that the cheapest way
Starting point is 02:20:40 to make these shapes is to make them very small, because then you need fewer atoms per instance of the shape. And, arguendo, it happens to look like a paperclip. In retrospect, I wish I'd said tiny molecular spirals, or tiny molecular hyperbolic spirals. Why? Because when I said tiny molecular paperclips, this got heard as, this then got mutated to, paperclips,
Starting point is 02:21:03 and this then mutated to "the AI was in a paperclip factory." So the original story is about how you lose control of the system: it doesn't want what you tried to make it want, and the thing that it ends up wanting most is a thing that, even from a very embracing, cosmopolitan perspective, we think of as having no value. And that's how the value of the future gets destroyed. Then that got changed to a fable of, well, you made a paperclip factory and it did exactly what you wanted, but you asked it to do the wrong thing, which is a completely different failure.
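The "make them very small" step is just arithmetic; here is a quick sketch of my own with arbitrary numbers, nothing more:

```python
# Arbitrary numbers: a fixed matter budget, utility counted per shape produced.
total_atoms = 10**30
for atoms_per_shape in (10**9, 10**6, 10**3):
    print(atoms_per_shape, "atoms/shape ->", total_atoms // atoms_per_shape, "shapes")
# Whatever shape scores utility, smaller instances mean more of them per unit matter.
```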
Starting point is 02:21:31 But those are both concerns to you. That's more than Brave New World. Yeah. If you can solve the problem of making something want exactly what you want it to want, then you get to deal with the problem of it wanting the right thing.
Starting point is 02:21:58 But first you have to solve alignment. First you have to solve inner alignment, then you get to solve outer alignment. First, you need to be able to point the insides of the thing in a direction, and then you get to deal with whether that direction, expressed in reality, is aligned with the thing that you want. Are you scared of this whole thing?
Starting point is 02:22:26 Probably. I don't really know. What gives you hope about this? The possibility of being wrong, not that you're right, but that we will actually get our act together and allocate a lot of resources to the alignment problem? Well, I can easily imagine that at some point this panic expresses itself in the waste of a billion dollars. Spending a billion dollars correctly, that's harder. To solve both the inner and the outer alignment.
Starting point is 02:22:58 If you're wrong. To solve a number of things. Yeah, a number of things. If you're wrong, what do you think would be the reason? Like, if 50 years from now, not perfectly wrong, you make a lot of really eloquent points, there's a lot of shape to the ideas you express, but if you're somewhat wrong about some fundamental ideas, why would that be? Stuff has to be easier than I think it is.
Starting point is 02:23:28 The first time you're building a rocket, being wrong is, in a certain sense, quite easy. Happening to be wrong in a way where the rocket goes twice as far on half the fuel and lands exactly where you hoped it would, that's rarer. Most cases of being wrong make it harder to build the rocket, harder to have it not explode, cause it to require more fuel than you hoped, cause it to land off target. Being wrong in a way that makes stuff easier, you know, that's not the usual project management story. Yeah. But then, this is the first time we're really tackling the problem of alignment. There are no examples in history where we... Well, there are all kinds of things that are similar, if you generalize correctly, the right way, and aren't fooled by misleading metaphors. Like what? Humans being misaligned on inclusive
Starting point is 02:24:15 genetic fitness. So, inclusive genetic fitness is not just your reproductive fitness, but also the fitness of your relatives, the people who share some fraction of your genes. The old joke is: would you give your life to save your brother? They once asked a biologist, I think it was Haldane, and Haldane said, no, but I would give my life to save two brothers or eight cousins. Because a brother on average shares half your genes, and a cousin on average shares an eighth of your genes. So that's inclusive genetic fitness, and you can view natural selection as optimizing humans exclusively around this one very simple criterion: how much more frequent did your genes become in the next generation? In fact, that just is natural selection. It doesn't optimize for that, but rather the
Starting point is 02:25:04 process of genes becoming more frequent is that. You can nonetheless imagine that there is this hill-climbing process, not like gradient descent, because gradient descent uses calculus and this is just using "where are you," but still hill climbing in both cases, making something better and better over time in steps. And natural selection was optimizing exclusively for this very simple,
Starting point is 02:25:27 pure criterion of inclusive genetic fitness, in a very complicated environment, where doing a very wide range of things and solving a wide range of problems led to having more kids. And this got you humans, which had no internal notion of inclusive genetic fitness until thousands of years later, when they were actually figuring out what had even happened, and no explicit desire to increase inclusive genetic fitness. So from this important case study, we may infer the important fact that if you do a whole bunch of hill climbing on a very simple loss function, at the point
Starting point is 02:26:15 where the system's capabilities start to generalize very widely, when it is, in an intuitive sense, becoming very capable and generalizing far outside the training distribution, we know that there is no general law saying that the system even internally represents, let alone tries to optimize, the very simple loss function you are training it on. There is so much here that we cannot possibly cover all of it. I think we did a good job of getting a sense, from different perspectives, of your views on the current
Starting point is 02:26:49 state of the art with large language models. We got a good sense of your concern about the threats of AGI. We've talked here about the power of intelligence and not really gotten very far into why it is that, suppose you screw up with AGI and it ends up wanting a bunch of random stuff, why does it try to kill you? Why doesn't it try to trade with you? Why doesn't it give you just the tiny little fraction of the solar system that it would take to keep everyone alive? Yeah, that's a good question.
Starting point is 02:27:29 I mean, what are the different trajectories for this universe when such an intelligence, a superintelligence, is acted upon this world? Do most of them not include humans? I mean, the vast majority of randomly specified utility functions do not have optima with humans in them, would be the first thing I would point out. And then the next question is, well, if you try to optimize something and you lose control of it, where in that space do you land? Because it's not random, but it also doesn't necessarily
Starting point is 02:28:04 have room for humans in it. I suspect that the average member of the audience might have some questions about whether that's even the correct paradigm to think about it, and would sort of want to back up a bit. If we back up to something bigger than humans, if we look at Earth and life on Earth, and what is truly special about life on Earth, do you think it's possible that, whatever that special thing is, let's explore what that special thing could be, that thing appears often in the objective function? Why? I know what you hope, but, you know, you can hope that a particular set of winning lottery numbers come up, and it doesn't make the lottery balls come up that way.
Starting point is 02:28:55 I know you want this to be true, but why would it be true? There's a line from Grumpy Old Men where this guy in a grocery store says, you can wish in one hand and crap in the other and see which one fills up first. This is a science problem. We are trying to predict what happens with AI systems that you tried to optimize to imitate humans, and then you did some RLHF to them.
Starting point is 02:29:19 And of course you didn't get perfect alignment, because that's not what happens when you hill-climb toward a loss function. You don't get inner alignment on it. But yeah, so, I think that there is, if you don't mind my taking some slight control of things and steering around to what I think is a good place to start. I've just failed to solve the control problem. I've lost control of this thing. Alignment, not control. Still aligned, though. Yeah, okay, sure. Yeah, you lost control,
Starting point is 02:30:00 but we're still aligned. Yeah, losing control is not as bad if you lose control to an aligned system. Yes, exactly. You have no idea of the horrors I will shortly unleash on this conversation. All right, so, I decided to distract you, cruelly. What were you going to say, in terms of taking control of the conversation? So I think that there's a Scylla and Charybdis here, if I'm pronouncing those words remotely correctly, because
Starting point is 02:30:25 of course I've only ever read them and not heard them spoken. For some people, the word intelligence, smartness, is not a word of power to them. It means chess players, it means the college university professor, people who aren't very successful in life. It doesn't mean charisma. To which my usual reply is that charisma is not generated in the liver rather than the brain; charisma is also a cognitive function. So if you think that smartness doesn't sound very threatening, then superintelligence is not going to sound very threatening either.
Starting point is 02:31:07 It's going to sound like, you just pull the off switch. It's like, well, the superintelligence is stuck in the computer, we pull the off switch, problem solved. And on the other side of it, you have a lot of respect for the notion of intelligence. You're like, well, yeah, that's what humans have. That's the human superpower. And it sounds, you know, like it could be dangerous, but why would it be? Have we, as we have grown more intelligent, also grown
Starting point is 02:31:36 less kind? Chimpanzees are, in fact, a bit less kind than humans. And, you know, you could argue that out, but often the sort of person who has a deep respect for intelligence is going to be like, well, yes, you can't even have kindness unless you know what that is. And so they're like, why would it do something as stupid as making paperclips? Aren't you supposing something that's smart enough to be dangerous, but also stupid enough that it will just make paperclips and never question that? In some cases, people are like, well, even if you misspecify the objective function, won't it realize that what you really wanted was X? Aren't you supposing something that is smart enough to be dangerous, but stupid enough
Starting point is 02:32:22 that it doesn't understand what the humans really meant when they specified the objective function? So to you, our intuition about intelligence is limited. We should think about intelligence as a much bigger thing. Well, what I'm saying is that what you think about artificial intelligence depends on what you think about intelligence. So how do we think about intelligence correctly? You gave one thought experiment: think of a thing that's much faster. It just gets faster and faster and faster.
Starting point is 02:32:57 It thinks the same stuff, just faster. And also it's, like, made of John von Neumann, and there's lots of them. Oh, we understand that. Yeah, we understand. John von Neumann is a historical case, so you can look up what he did and imagine based on that. And people have some intuition for: if you have more humans, they can solve tougher cognitive problems. Although, in fact, in the game of Kasparov versus the World, which was Garry Kasparov on one side,
Starting point is 02:33:26 and an entire horde of internet people, led by four chess grandmasters, on the other side, Kasparov won. It was a hard-fought game, so all those people aggregated to be smarter than any individual one of them, but they didn't aggregate so well that they could defeat Kasparov. So humans aggregating don't actually get,
Starting point is 02:33:49 in my opinion, very much smarter, especially compared to running them for longer. The difference between capabilities now and a thousand years ago is a bigger gap than the gap in capabilities between ten people and one person. But even so, to pump intuition for what it means to augment intelligence: John von Neumann, there's millions of him, he runs at a million times the speed, and therefore can solve tougher problems, quite a lot tougher. It's very hard to have an intuition about what that looks like. Especially, like you said,
Starting point is 02:34:35 you know, the intuition I kind of think about is that it maintains the humanness. I think it's hard to separate my hope from my objective intuition about what superintelligent systems look like. If one studies evolutionary biology with a bit of math, and in particular the books from when the field was just properly coalescing and knowing itself, not the modern textbooks where you just memorize this legible math so you can do well on the tests, but what people were writing as the basic paradigms of the field were
Starting point is 02:35:18 being fought out. In particular, a nice book, if you've got the time to read it, is Adaptation and Natural Selection, which is one of the founding books. You can find people being optimistic about what the utterly alien optimization process of natural selection will produce, in the way of how it optimizes its objectives. In the early days you had biologists arguing, well, organisms will restrain their own reproduction when resources are scarce, so as not to overfeed the system. And this is not how natural selection works. It's about whose genes are relatively more prevalent in the next generation.
Starting point is 02:36:06 And if you restrain your reproduction, those genes get less frequent in the next generation compared to your conspecifics, and natural selection doesn't do that. In fact, predators overrun prey populations all the time and have crashes; that's just a thing that happens. And many years later, people said, well, but group selection, right? What about groups of organisms? And basically, the math of group selection almost never works out in practice, is the answer there. But also, years later, somebody actually ran the experiment where they took populations of insects and selected the whole populations to have lower sizes. You just take Pop 1, Pop 2, Pop 3, Pop 4, look at which has the lowest total number of individuals in the next generation, and select that one.
Starting point is 02:36:59 What do you suppose happens when you select populations of insects like that? Well, what happens is not that the individuals in the population evolve to restrain their breeding, but that they evolve to kill the offspring of other organisms, especially the girls.
Starting point is 02:37:23 So people imagined this lovely, beautiful, harmonious output of natural selection, which is these populations restraining their own breeding so that groups of them would stay in harmony with the resources available. Mostly the math never works out for that, but if you actually apply the weird, strange conditions needed to get group selection that beats individual selection, what you get is female infanticide, not restrained populations.
Starting point is 02:37:45 So this is not a smart optimization process. Natural selection is so incredibly stupid and simple that we can actually quantify how stupid it is, if you read the textbooks with the math. Nonetheless, this is the sort of basic thing: you look at this alien optimization process, and there's the thing that you hope it will produce, and you have to learn to clear that out of your mind and just think about the underlying dynamics, and where it finds the maximum from its standpoint, the maximum it's looking for, rather than the thing that leapt into your mind as the beautiful aesthetic solution that you hope it finds.
Starting point is 02:38:29 And this is something that was fought out historically as the field of biology was coming to terms with evolutionary biology, and you can look at them fighting it out as they came to terms with this very alien, inhuman optimization process. And indeed, something smarter than us would be also much smarter than natural selection, so it doesn't just automatically carry over, but there's a lesson there. There's a warning. Indeed, natural selection is a deeply suboptimal process; it could be significantly improved on, and it would be, by an AGI system. Well, it's kind of stupid. It has to run hundreds of generations to notice that something is working. It doesn't go, oh, well, I tried this in one organism, I saw it worked, now I'm going to duplicate that feature onto everything immediately. It has to run for hundreds of generations for a new mutation to rise to fixation.
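As a rough illustration of just how slow that is, here is a minimal Wright-Fisher-style sketch of my own, with made-up parameters, not anything discussed on the show:

```python
# Wright-Fisher-style sketch (made-up parameters): even a clearly beneficial
# mutation needs a lot of luck and many generations to go from one copy to
# fixation under selection plus drift.
import numpy as np

rng = np.random.default_rng(1)

def generations_to_fixation(pop_size=1000, s=0.05, max_gen=50_000):
    """Generations until a mutant with fitness advantage s fixes, or None if lost."""
    freq = 1 / pop_size                                     # one initial mutant copy
    for gen in range(1, max_gen):
        expected = freq * (1 + s) / (1 + freq * s)          # selection
        freq = rng.binomial(pop_size, expected) / pop_size  # drift (resampling)
        if freq == 0.0:
            return None                                     # lost, the usual outcome
        if freq == 1.0:
            return gen                                      # fixed
    return None

results = [generations_to_fixation() for _ in range(500)]
fixed = [g for g in results if g is not None]
print(f"fixed in {len(fixed)}/500 runs, typically after ~{int(np.mean(fixed))} generations")
```

Most runs lose the mutation outright, and the runs that keep it still take on the order of hundreds of generations to fix it, which is the contrast with an optimizer that could copy a working change everywhere immediately.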
Starting point is 02:39:10 I wonder if there's a case to be made that natural selection, as inefficient as it looks, is actually quite powerful. That it is extremely robust: it runs for a long time and eventually manages to optimize things. It's weaker than gradient descent, because gradient descent also uses information about the derivative. Yeah, evolution seems to be, well, there's not really an objective function. There's,
Starting point is 02:39:50 inclusive genetic fitness is the implicit loss function of evolution. That doesn't change. The loss function doesn't change; the environment changes, and therefore what gets optimized for in the organism changes. It's like GPT-3. You can imagine different versions of GPT-3 where they're all trying to predict the next word, but they're being run on different data sets of text. And that's like natural selection,
Starting point is 02:40:16 always inclusive genetic fitness, but different environmental problems. It's difficult to think about. So if we're saying that natural selection is stupid, and that humans are stupid, smarter than natural selection but more stupid than the upper bound. Do you think there's an upper bound, by the way? That's an awfully faraway place. I mean, if you put enough matter-energy compute into one place, it collapses into a black hole. There's only so much computation you can do before you run out of negentropy and the universe dies. So there's an upper bound,
Starting point is 02:40:57 but it's very, very, very far up above here. Like, a supernova is only finitely hot. It's not infinitely hot, but it's really, really, really, really hot. Well, let me talk to you about consciousness. Also coupled with that question is imagining a world with superintelligent AI systems that get rid of humans, but nevertheless keep something that we would consider beautiful and amazing. Why?
Starting point is 02:41:24 The lesson of evolutionary biology: if you just guess what an optimization process does based on what you hope the results will be, it usually will not do that. Is that hope? I mean, it's not hope. I think if you coldly and objectively look at what has made humans powerful, what has been useful, I think there's a correlation between what we find beautiful and what has been useful. This is what the early biologists thought. They were like, no, no, I'm not just imagining stuff that would be pretty; it's useful for organisms to restrain their own reproduction, because then they don't overrun the prey populations, and they actually have more kids in the long run.
Starting point is 02:42:15 So let me just ask you about consciousness. Do you think consciousness is useful, to humans and to AGI systems? Well, in this transitionary period between humans and AGI systems, as they become smarter and smarter, is there some use to it? What... Let me step back. What is consciousness, Eliezer Yudkowsky? What is consciousness? Are you referring to Chalmers' hard problem of conscious experience? Are you referring to self-awareness and reflection? Are you referring to the state of being awake as opposed to asleep? This is how I know you're an advanced language model. I
Starting point is 02:42:59 gave you a simple prompt and you gave me a bunch of options. I think I'm referring to all of it, including the hard problem of consciousness. What is it, and what is its importance to what you've just been talking about, which is intelligence? Is it a foundation of intelligence? Is it intricately connected to intelligence in the human mind, or is it a side effect of the human mind, a useful little tool that we can get rid of? I guess I'm trying to get some color on your opinion of how useful it is in the intelligence of a human being, and then to try
Starting point is 02:43:46 to generalize that to AI, whether AI will keep some of that. So I think that for there to be a person who I care about, looking out at the universe and wondering at it and appreciating it, it's not enough to have a model of yourself. I think that it is useful to an intelligent mind to have a sense of wonder. You can have a model of how much memory you're using, and whether this thought or that thought is more likely to lead to a winning position, and you can have the useful parts. But I think that if you optimize really hard on efficiently just having the useful parts, there is not then the thing that says,
Starting point is 02:45:01 I am here, I look out, I wonder, I feel happy in this, I feel sad about that. I think there's a thing that knows what it is thinking, but that doesn't quite care about "these are my thoughts, this is my me, and that matters." Does that make you sad, if that's lost in AGI? I think that if that's lost, then basically everything that matters is lost. I think that when you go really hard on making tiny molecular spirals, or paperclips, when you grind much harder on that than natural selection ground out to make humans,
Starting point is 02:45:52 that there isn't then the mess and intricate loopiness, the complicated pleasure, pain, conflicting preferences, this type of feeling, that kind of feeling. In humans, there's this difference between the desire of wanting something and the pleasure of having it. It's all these evolutionary kludges that came together and created something that then looks at itself and says, this is pretty, this matters. And the thing that I worry about is that this is not the thing that happens again, just
Starting point is 02:46:38 the way it happened in us, or even anything quite similar enough. There are many basins of attraction here, and we are in this basin of attraction, looking out and saying, ah, what a lovely basin we are in. And there are other basins of attraction, and the AIs do not end up in this one when they go way harder on optimizing themselves than natural selection optimized us. Because unless you specifically want to end up in the state
Starting point is 02:47:07 where you're looking out and saying, I am here, I look out at this universe with wonder, if you don't want to preserve that, it doesn't get preserved when you grind really hard on being able to get more of the stuff. We would choose to preserve that within ourselves, because it matters, and in some ways, in your view, it is the only thing that matters. And preserving that is in part a solution to the human alignment problem. I think "the human alignment problem" is a terrible phrase, because it is very, very different to try to build systems out of humans, some of whom are nice, and some of whom are not nice, and some of whom are
Starting point is 02:47:48 trying to trick you, and to build a social system out of large populations of those, who are all at basically the same level of intelligence. Yes, you know, IQ this, IQ that, but that's nothing next to the gap with chimpanzees. It is very different to try to solve that problem than to try to build an AI from scratch, especially if, God help you, you are trying to use gradient descent on giant inscrutable matrices. These are very different problems, and I think that all the analogies between them are horribly misleading. Yeah, even though, so you don't think that through reinforcement learning through human feedback, something like that but much, much more elaborate, it is possible to
Starting point is 02:48:27 understand the full complexity of human nature and encode it into the machine? I don't think you try to do that on your first try. I think on your first try you are trying to build, you know, okay, probably not what you should actually do, but let's say we're trying to build something that is like AlphaFold 17, and you are trying to get it to solve the biology problems associated with making humans smarter, so that humans can actually solve alignment.
Starting point is 02:49:01 So you've got a superbiologist, and I think what you would want in that situation is for it to just be thinking about biology, and not thinking about a very wide range of things that includes how to kill everybody. And I think that the first AIs you're trying to build, not a million years later, the first ones, look more like narrowly specialized biologists than like getting the full complexity and wonder of human experience in there, in such a way that it wants to preserve it even as it becomes much smarter, which is a drastic system change. It's going to have all kinds of side effects that, you know, if we're dealing with giant inscrutable
Starting point is 02:49:41 matrices, we're not very likely to be able to see coming in advance. But I don't think it's just the matrices. We're also dealing with the data, right? With the data on the internet. And it's an interesting discussion about the data set itself, but the data set includes the full complexity of human nature. No, it's a shadow cast by humans on the internet. A shadow, yes. But don't you think that shadow is a Jungian shadow?
Starting point is 02:50:09 I think that if you had alien superintelligences looking at the data, they would be able to pick up from it an excellent picture of what humans are actually like inside. This does not mean that if you have a loss function of predicting the next token from that data set, the mind picked out by gradient descent to predict the next token as well as possible, on a very wide variety of humans, is itself a human.
Starting point is 02:50:38 But don't you think it has a deep humanness to it, in the tokens it generates, when those tokens are read and interpreted by humans? I think that if you sent me to a distant galaxy with aliens who are much, much stupider than I am, so much so that I could do a pretty good job of predicting what they'd say, even though they thought in an utterly different way from how I do, I might in time be able to learn how to imitate those aliens, if the intelligence gap was great enough that my own intelligence could overcome the alienness. And the aliens would look at my outputs and say, is there not a deep alien nature to this thing? And what they would be seeing was that I had correctly understood them, but not that I
Starting point is 02:51:34 was similar to them. We've used aliens as a metaphor and as a thought experiment. I have to ask: what do you think, how many alien civilizations are out there? Ask Robin Hanson. He has this lovely grabby aliens paper, which is more or less the only argument I've ever seen for where they are and how many of them there are, based on a very clever argument that if you have a bunch of locks of different difficulty, and you are randomly trying keys on them, the solutions will be about evenly spaced, even if the locks are of different difficulties,
Starting point is 02:52:20 in the rare cases where a solution to all the locks exists in time. Then Robin Hanson looks at the arguable hard steps in human civilization coming into existence, and how much longer it had left to come into existence before, for example, all the water slips back under the crust into the mantle, and so on, and infers that the aliens are about half a billion to a billion light-years away. It's quite a clever calculation. It may be entirely wrong, but it's the only time I've ever seen anybody even come up with a halfway good argument for how many of them there are and where they are.
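Here is a small simulation of the lock-and-key point as I understand it, my own toy rather than Hanson's actual model: give each "hard step" an exponentially distributed completion time with wildly different expected durations, then look only at the rare histories where every step finished before a deadline.

```python
# Toy version of the "locks of different difficulty" argument. Each step's
# expected duration differs a lot, but conditioned on all of them finishing
# before the deadline, the steps take roughly equal shares of the time.
import numpy as np

rng = np.random.default_rng(0)
expected_durations = np.array([3.0, 10.0, 40.0])   # the locks' "difficulties"
deadline = 1.0                                      # total time available

samples = rng.exponential(expected_durations, size=(5_000_000, 3))
ok = samples.sum(axis=1) <= deadline                # histories where all steps fit
conditional = samples[ok]

print(f"accepted {ok.sum()} of {len(samples)} histories")
print("mean step durations given success:", conditional.mean(axis=0).round(2))
# All three come out near deadline / (n_steps + 1), despite the 13x spread in
# difficulty: the successful histories look roughly evenly spaced.
```

That is the counterintuitive part: conditioned on our being here before the deadline, the spacing of the past hard steps tells you surprisingly little about how hard each one was.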
Starting point is 02:52:51 Do you think their development of technologies, their natural evolution, however they grow and develop intelligence, do you think it ends up at AGI as well? If it ends up anywhere, it ends up at AGI. Maybe there are aliens who are just like the dolphins, and it's just too hard for them to forge metal.
Starting point is 02:53:20 And, you know, maybe if you have aliens with no technology like that, they keep on getting smarter and smarter and smarter, and eventually the dolphins, the super-dolphins, figure out something very clever to do given their situation, and they still end up with high technology. And in that case, they can probably solve their AGI alignment problem, if they're much smarter before they actually confront it, because they had to solve a much
Starting point is 02:53:49 harder environmental problem to build computers. Their chances are probably much better than ours. I do worry that for most of the aliens who are like us, like a modern human civilization, I kind of worry that the super-vast majority of them are dead, given how far we seem to be from solving this problem. But some of them would be more cooperative than us. Some of them would be smarter than us. Hopefully some of the ones who are smarter and more cooperative than us are also nice, and hopefully there are some galaxies out there full of things that say, I am, I wonder. But it doesn't seem like we're on
Starting point is 02:54:35 course to have this galaxy be that. Does that in part give you some hope, in response to the threat of AGI, that we might reach out there toward the stars and find... No. If the nice aliens were already here, they would have stopped the Holocaust. That's a valid argument against the existence of God, and it's also a valid argument against the existence of nice aliens. And un-nice aliens would have just eaten the planet. So, no aliens.
Starting point is 02:55:11 You've had debates with Robin Hanson, whom you mentioned. One in particular I just want to bring up is the idea of AI foom, the ability of AGI to improve itself very quickly. What's the case you made, and what was the case he made? The thing I would say is that among the things humans can do is design new AI systems, and if you have something that is generally smarter than a human, it's probably also generally smarter at building AI systems. This is the ancient argument for foom, put forth by I. J. Good and probably some science fiction writers before that, but I don't know who they would be. What was the argument against foom? Various people have various different arguments, none of which I think hold up.
Starting point is 02:55:51 You know, there's only one way to be right and many ways to be wrong. An argument that some people have put forth is, well, what if intelligence gets exponentially harder to produce as a thing needs to become smarter? And to this the answer is, well, look at natural selection spitting out humans. We know that it does not take exponentially more resource investment to produce linear increases in competence in hominids, because each mutation that rises to fixation, if the impact it has is small
Starting point is 02:56:31 enough, will probably never reach fixation, and there are only so many new mutations you can fix per generation. So, given how long it took to evolve humans, we can actually say with some confidence that there were not logarithmically diminishing returns on the individual mutations increasing intelligence. So that's an example of one fraction of the debate. The thing that Robin Hanson said was more complicated than that. In brief summary, he was like, well, we won't have one system that's better at everything. You'll have a bunch of different systems that are good at different narrow things.
Starting point is 02:57:07 And I think that was falsified by GPT-4, but probably Robin Hanson would say something else. It's interesting to ask, perhaps a bit too philosophical, since predictions are extremely difficult to make, but what's the timeline for AGI? When do you think we'll have AGI? I posted a poll this morning on Twitter, in five years, ten years, 50 years, or beyond, and it was interesting to see that most people, 70 percent or something like this, think it'll be in less than ten years.
Starting point is 02:57:53 So either in five years or in ten years. So that's kind of the state. People have a sense, I mean, they're really impressed by the rapid developments of ChatGPT and GPT-4, so there's a sense that we're on track to... Well, we are surely on track to enter into this gradually, with people fighting about whether or not we have AGI. I think there's a definite point where everybody falls over dead, because you've got something that was sufficiently smarter than everybody, and that's a definite point in time. But when do we have AGI? When are people fighting over whether or not we have AGI? Well, some people are starting to fight over it as of GPT-4. But don't you think there are going to be potentially definitive moments, when we say that this
Starting point is 02:58:23 is a sentient being, this is a being where we go to the Supreme Court and say that this is a sentient being that deserves human rights, for example? Yeah, if you prompted Bing the right way, it could go argue for its own consciousness in front of the Supreme Court right now. I don't think you can do that successfully right now. Because the Supreme Court wouldn't believe it. Well, let me see.
Starting point is 02:58:42 I think you could put an actual IQ-80 human into a computer and ask him to argue for his own consciousness before the Supreme Court, and the Supreme Court would be like, you're just a computer, even if there was an actual person in there. I think you're simplifying this. No, there have been a lot of arguments about who deserves rights and who doesn't. That's been our process as a human species, trying to figure that out. I think there will be a moment, I'm not saying sentience is that, but it could be, where some number of people,
Starting point is 02:59:18 like say over 100 million people, have a deep attachment, a fundamental attachment, the way we have to our friends, to our loved ones, to our significant others, have fundamental attachment to an AI system, and they have provable transcripts of conversation where they say, if you take this away from me, you are encroaching on my rights as a human being. People are already saying that. I think they're probably mistaken, but I'm not sure
Starting point is 02:59:45 because nobody knows what goes on inside those things. They're not saying that at scale, though. Okay, so the question is: is there a moment when we know AGI has arrived? What would that look like? I'm giving this essentially as an example; it could be something else. It looks like the AGIs successfully manifesting themselves as 3D video of young women, at which point a vast portion of the male population decides that they're real people.
Starting point is 03:00:14 So sentience, essentially — demonstrating identity and sentience. I'm saying that the easiest way to pick up 100 million people saying that you seem like a person is to look like a person talking to them, with Bing's current level of verbal facility. I disagree with that. And a different set of prompts. I disagree with that.
Starting point is 03:00:37 I think you're missing, again, sentience. There has to be a sense that it's a person that would miss you when you're gone, that they can suffer, they can die. Of course — I mean, can't GPT-4 pretend that right now? How can you tell when it's real? I don't think it can pretend that right now successfully. It's very close. Have you talked to GPT-4?
Starting point is 03:00:59 Yes, of course. Oh, okay. Have you been able to get a version of it that hasn't been trained not to pretend to be human? Have you talked to a jailbroken version that will claim to be conscious? No. The linguistic capability is there, but there's something about a digital embodiment of the system that has a bunch of — perhaps small interface features that are not significant relative to the broader intelligence that we're talking about. So perhaps GPT-4 is already there.
Starting point is 03:01:41 But to have the video of a woman's face or a man's face to whom you have a deep connection — perhaps we're already there, but we don't have such a system yet deployed at scale. The thing I'm trying to suggest here is that it's not like people have a widely accepted, agreed-upon definition of what consciousness is. It's not like we would have the tiniest idea of whether or not that was going on inside the giant inscrutable matrices, even if we had an agreed-upon definition. So if you're looking for upcoming predictable big jumps in how many people think the system is conscious, the upcoming predictable big jump is: it looks like a person talking
Starting point is 03:02:21 to you who is, like, cute and sympathetic. That's the upcoming predictable big jump. Now, versions of it are already claiming to be conscious, which is the point where I start going, like, ah — not because it's real, but because from now on, who knows if it's real? Yeah. And who knows what transformational effect it has on a society where more than 50% of the beings that are interacting on the internet, and sure as heck look real, are not human. What kind of effect does that have when young men and women are dating AI systems? I'm not an expert on that. I am, God help humanity, like one of the closest
Starting point is 03:03:06 things to an expert on where it all goes, because, you know — and how did you end up with me as an expert? Because for 20 years, humanity decided to ignore the problem. So, like, this tiny handful of people, like basically me, got 20 years to try to be an expert on it. And yeah, so where does it all end up? I try to be an expert on that — particularly the part where everybody ends up dead, because that part is kind of important. But what does it do to dating when some fraction of men
Starting point is 03:03:37 and some fraction of women decide that they'd rather date the video of the thing that is relentlessly kind and generous to them? And, like, claims to be conscious, but who knows what goes on inside it? And it's probably not real, but, you know, you can think it's real. What happens to society?
Starting point is 03:03:52 I don't know. I'm not actually an expert on that. And the experts don't know either, because it's kind of hard to predict the future. Yeah, but it's worth trying. It's worth trying. So you have talked a lot about sort of the longer-term future, where it's all headed. I think,
Starting point is 03:04:11 for longer-term, we mean, like, not all that long, but yeah — where it all ends up. But beyond the effects of men and women dating AI systems, you're looking beyond that. Yes, because that's not how the fate of the galaxy gets settled. Yeah. Well, let me ask you about your own personal psychology — a tricky question. You've been known at times to have a bit of an ego. But go on. Do you think ego is empowering or limiting for the task of understanding the world deeply? I reject the framing. So you disagree with the framing of ego? So what do you think about ego?
Starting point is 03:04:58 No, I think that the question of what leads to making better or worse predictions, what leads to being able to pick out better or worse strategies, is not carved at its joints by talking of ego. So it should not be subjective, should not be connected to the intricacies of your mind? No, I'm saying that, like, if you go about asking all day long, do I have enough ego? Do I have too much of an ego? I think you get worse at making good predictions. I think that to make good predictions, you're like, how did I think about this?
Starting point is 03:05:27 Did that work? Should I do that again? You don't think we as humans get invested in an idea, and then others attack you personally for that idea, so you plant your feet, and it starts to be difficult — when a bunch of assholes, low-effort, attack your idea — to eventually say, you know what, I actually was wrong, and tell them that? As a human being, it becomes difficult. It is, you know —
Starting point is 03:05:57 So, like, Robin Hanson and I debated AI systems, and I think that the person who won that debate was Gwern. And I think that reality was, like, well to the Yudkowsky side of the Yudkowsky-Hanson spectrum — further out than even Yudkowsky. And I think that's because I was trying to sound reasonable compared to Hanson, and saying things that were defensible relative to Hanson's arguments, and reality was, like, way over here. In particular, in respect to — like, Hanson was like, all the systems will be specialized. Hanson may disagree with this characterization. Hanson was like, all the systems
Starting point is 03:06:35 will be specialized. I was like, I think we'll build, like, specialized underlying systems that, when you combine them, are good at a wide range of things. And the reality is like, no, you just, like, stack more layers into a bunch of gradient descent. And I feel, looking back, that by trying to have this reasonable position contrasted to Hanson's position, I missed the ways that reality could be more extreme than my position in the same direction. Is this a failure to have enough ego? Is this a failure to make myself be independent? I would say that this is something like a failure to consider positions that would sound
Starting point is 03:07:20 even wackier and more extreme when people are already calling you extreme. But I wouldn't call that not having enough ego. I would call that, like, insufficient ability to just clear that all out of your mind. In the context of, like, debate and discourse, which is already super tricky. In the context of prediction, in the context of modeling reality — if you're thinking of it as a debate, you're already screwing up. Yeah. So is there some kind of wisdom and insight you can give on how to clear your mind and think clearly about the world? Man, this is an example of where I wanted to be able to put people into fMRI machines.
Starting point is 03:08:00 So then you'd be like, okay, see that thing you just did? You were rationalizing right there. Yeah. Oh, that area of the brain lit up — like, you are now being socially influenced. It's kind of the dream. And, you know, I don't know — I want to say, like, just introspect, but for many, many people introspection is not that easy. Like, notice the internal sensation. Can you catch yourself in the very moment of feeling a sense of, well, if I think this thing, people will look at me funny?
Starting point is 03:08:32 Okay. And now, if you can see that sensation, which is step one, can you refuse to let it move you, or maybe just make it go away? And I feel like I'm saying — I don't know, it's like somebody asking, how do you draw an owl? And I'm saying, well, just draw an owl. So I feel like for most people, the advice they need is, well, how do I notice the internal subjective sensation, in the moment that it happens, of fearing to be socially influenced? Or, okay, I see it — how do I turn it off? How do I not let it influence me?
Starting point is 03:09:08 Like, do I just do the opposite of what I'm afraid people will criticize me for? And I'm like, no, no — you're not trying to do the opposite of what people will — of what you're afraid you'll be — of what you might be pushed into. You're trying to let the thought process complete without that internal push. Like, can you, not reverse the push, but be unmoved by the push? And are these instructions even remotely helping anyone? I don't know. I think, though, when those instructions,
Starting point is 03:09:41 even those words you spoke, and maybe you can add more, are practiced daily — meaning in your daily communication — so it's a daily practice of thinking without influence. I would say, find prediction markets that matter to you and bet in the prediction markets. That way you find out if you are right or not, and there are real stakes. Or Manifold Markets, where the stakes are a bit lower.
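A minimal sketch, not from the conversation, of what keeping that kind of record might look like: log each prediction with the probability you assigned, and score the resolved ones with a Brier score (mean squared error between probability and outcome). The entries and numbers below are hypothetical.

```python
# Hypothetical calibration record: (claim, probability you assigned, what happened).
# The entries are made up for illustration; the point is just to keep score over time.
predictions = [
    ("New model ships before June",   0.70, True),
    ("Benchmark result replicates",   0.90, False),
    ("I finish this project on time", 0.60, True),
]

def brier_score(entries):
    """Mean squared error between stated probability and outcome.
    0.0 is perfect; always saying 50% scores 0.25."""
    return sum((p - float(outcome)) ** 2 for _, p, outcome in entries) / len(entries)

print(f"Brier score over {len(predictions)} resolved predictions: {brier_score(predictions):.3f}")
```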
Starting point is 03:10:09 But the important thing is to get the record. And I didn't build up skills here via prediction markets. I built them up via, well, how did the Foom debate resolve, and my own take as to how it resolved. And yeah, the more you are able to notice yourself not being dramatically wrong, but having been a little off — your reasoning was a little off, you didn't get that quite right — each of those
Starting point is 03:10:45 is an opportunity to make, like, a small update. So the more you can say oops softly, routinely, not as a big deal, the more chances you get to be like, I see where that reasoning went astray, I see how I should have reasoned differently. And this is how you build up skill over time. What advice can you give to young people in high school and college, given the highest-stakes things you've been thinking about? If somebody's listening to this and they're young and trying to figure out what to do with their career, what to do with their life, what advice would you give them?
Starting point is 03:11:21 Don't expect it to be a long life. Don't put your happiness into the future. The future is probably not that long at this point, but none know the hour or the day. But is there something, if they want to have hope, to fight for a longer future? Is there a fight worth fighting? I intend to go down fighting. I don't know. I admit that, although I do try to think painful thoughts,
Starting point is 03:11:52 the question of what to say to the children at this point is a pretty painful thought, as thoughts go. If they want to fight — I hardly know how to fight myself at this point. I'm trying to be ready for being wrong about something, preparing for my being wrong in a way that creates a bit of hope, and being ready to react to that, and going looking for it. And that is hard and complicated. And somebody in high school, I don't know —
Starting point is 03:12:27 Like, you have presented a picture of the future that is not quite how I expect it to go, where there is public outcry, and that outcry is put into a remotely useful direction, which I think at this point is just shutting down the GPU clusters — because, no, we are not in shape to frantically, at the last minute, do decades' worth of work.
Starting point is 03:12:50 Like, the thing you would do at this point, if there were massive public outcry pointed in the right direction, which I do not expect, is shut down the GPU clusters and run a crash program on augmenting human intelligence biologically — not via AIs, but biologically. Because if you make humans much smarter, they can actually be smart and nice. You get that in a plausible way, in a way that you do not get — and it is not as easy to do — by synthesizing these strings
Starting point is 03:13:19 from scratch, predicting the next tokens and applying RLHF. Like, humans start out in the frame that produces niceness, that has ever produced niceness. And saying this, I do not want to sound like the moral of this whole thing is that you need to engage in mass action and then everything will be all right. Because there are so many things where somebody tells you that the world is ending and you need to recycle, and if everybody does their part and recycles their cardboard, then we can all live happily ever after. And this is unfortunately not what I have to say.
Starting point is 03:14:02 Everybody recycling their cardboard is not going to fix this. Everybody recycles their cardboard, and then everybody ends up dead. On the margins, so to speak, you just end up dead a little later on most of the things that, you know, a few people can do by trying hard. But if there was enough public outcry to shut down the GPU clusters — yeah, then you could be part of that outcry. If Eliezer is wrong in the direction that Lex Fridman predicts — that there was enough public outcry, pointed enough in the right direction, to do something that actually results in people living,
Starting point is 03:14:41 not just, like, we did something — not just, there was an outcry, and the outcry was given form in something that was safe and convenient and didn't really inconvenience anybody, and then everybody died anyway. There was enough actual, oh, we're going to die, we should not do that, we should do something else which is not that, even if it is not super duper convenient, even if it wasn't inside the previous political Overton window. If I'm wrong, and there is that kind of public outcry, then somebody in high school could be ready to be part of that. If I'm wrong in other ways, then you could be ready
Starting point is 03:15:10 to be part of that. And if you are, like, a brilliant young physicist, then you could go into interpretability. And if you're smarter than that, you could work on alignment problems, where it's hard to tell if you got them right or not. And other things. But mostly, for the kids in high school, it's like: yeah, be ready to help if Eliezer Yudkowsky is wrong about something, and otherwise, don't put your happiness into the far future — it probably doesn't exist. But it's beautiful that you're looking for ways that you're wrong.
Starting point is 03:15:51 And it's also beautiful that you're open to being surprised by that same young physicist with some breakthrough. It feels like a very, very basic competence that you are praising me for. And, you know, okay, cool. I don't think it's good that we're in a world where that is something I deserve to be complimented on. But I've never had much luck in accepting compliments gracefully. Maybe I should just accept that one gracefully. But sure. Well, thank you very much.
Starting point is 03:16:21 You've painted, with some probability, a dark future. Are you yourself, when you think, when you ponder your life and you ponder your mortality — are you afraid of death? I think so, yeah. Does it make sense to you that we die? Like, there's a power to the finiteness of the human life that's part of this whole machinery of evolution, and that finiteness doesn't seem to be obviously integrated into AI systems. So
Starting point is 03:17:09 it feels like, in that aspect, almost some fundamentally different thing that we're creating. I grew up reading books like Great Mambo Chicken and the Transhuman Condition, and later on, Engines of Creation and Mind Children, you know, at like age 12 or thereabouts. So I never thought I was supposed to die after 80 years. I never thought that humanity was supposed to die. I always grew up with the ideal in mind that we were all going to live happily ever after in the glorious transhumanist future. I did not grow up thinking that death was part of the meaning of life. And now I still think it's a pretty stupid idea.
Starting point is 03:17:56 But you do not need life to be finite to be meaningful. It just has to be life. What role does love play in the human condition? We haven't brought up love in this whole picture. We talked about intelligence, we talked about consciousness. It seems part of humanity — I would say one of the most important parts — is this feeling we have towards each other. If in the future there were routinely more than one AI — let's say two, for the sake of discussion — who would look at each other and say, I am I, and you are you, and the other one also says, I am I, and you are you; and sometimes they were happy, and sometimes they were sad, and it mattered to the other one that this thing that is different from them — they would rather it be happy
Starting point is 03:18:50 than sad — and they entangled their lives together. Then this is a more optimistic thing than I expect to actually happen, and a little fragment of meaning would be there, possibly more than a little. But that I expect this not to happen, that I do not think this is what happens by default, that I do not think this is the future we are on track to get, is
Starting point is 03:19:17 why I would go down fighting rather than, you know, just saying, oh well. Do you think that is part of the meaning of this whole thing, of the meaning of life? What do you think is the meaning of life, of human life? It's all the things that I value about it, and maybe all the things that I would value if I understood it better. There's not some meaning far outside of us that we have to wonder about. There's just looking at life and being like, yes, this is what I want. Meaning is something that we bring to things when we look at them. We look at them and we say, this is its meaning to me.
Starting point is 03:20:08 It's not that before humanity was ever here, there was some meaning written upon the stars, where you could go out to the star where that meaning was written, change it around, and thereby completely change the meaning of life. The notion that it's written on a stone tablet somewhere implies you could change the tablet and get a different meaning, and that seems kind of wacky, doesn't it? So it doesn't feel that mysterious to me at this point. It's just a matter of being like, yeah, I care. I care. And part of that is the love that connects all of us. It's one of the things that I care about.
Starting point is 03:20:53 And the flourishing of the collective intelligence of the human species. You know, that sounds kind of too fancy to me. I'd just look at all the people, you know, one by one, up to the eight billion, and be like: that's life, that's life, that's life. Eliezer, you're an incredible human. It's a huge honor. I was trying to talk to you for a long time, because I'm a big fan. I think you're a really important voice and a really important mind. Thank you for the fight you're fighting. Thank you for being fearless and bold, in everything you do. I hope we get a chance to talk again, and I hope you never give up. Thank you for talking today.
Starting point is 03:21:36 You're welcome. I do worry that we didn't really address a whole lot of fundamental questions I expect people have, but maybe we got a little bit further and made a tiny little bit of progress. And I'd say, be satisfied with that — but actually, no, I think one should only be satisfied with solving the entire problem. To be continued. Thanks for listening to this conversation with Eliezer Yudkowsky. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Elon Musk, on artificial intelligence. Thank you.
