Bankless - 168 - How to Solve AI Alignment with Paul Christiano

Starting point is 00:00:00 The most likely way we die involves like not AI comes out of the blue and kills everyone, but involves we have deployed a lot of AI everywhere. And you can kind of just look and be like, oh yeah. If for some reason, God forbid, all these AI systems were trying to kill us, they would definitely kill us. Welcome to bankless where we explore frontier technologies, artificial intelligence on the episode today. This is how to get started, how to get better, and how to front run the opportunity. This is Ryan Sean Adams. I'm here with David Hoffman, and we're here to help you become more bankless.

Starting point is 00:00:30 Guys, we have a special guest in the episode today. Paul Cristiano, this is who Eliezer Yudkowski told us to go talk to, someone he respects on the AI debate. So we picked his brain. This is an AI safety alignment researcher. We asked the question, how we stop the AIs from killing us? Can we prevent the AI takeover that others are very concerned about? There are three, actually four takeaways for you today. Number one, how big is the AI alignment problem? We ask Paul this question. Number two, how hard is it to actually? solve this problem. Number three, what are the ways we solve it, the technical ways, can we coordinate around this to solve it? And finally, number four, we talk about a possible

Starting point is 00:01:11 optimistic scenario where we live in harmony with the AIs and they improve our lives and make it quite a bit better. David, what was the significance of this episode to you in our series? Yeah, the new intro, I think is great when we cover frontier technologies and also, instead of for this episode, helping you become more bankless. I think this is, truly a front-running the opportunity. On this one, we are trying to help not die. Yeah, question mark. We're trying to help you not die. And all of us, yeah, view and the rest of the world. And I think we're doing this in the best of the way that we can, which is education and awareness about this AI problem. Paul Cristiano is, like you said, the man that was recommended by

Starting point is 00:01:50 Eliezer about the man approaching this problem head on in a technical way. So in this podcast, you are going to hear about the technical solutions that people are actively working on, who are taking this problem extremely seriously and take the risks very, very seriously. Eliezer gave us more or less a doomsday scenario, like a 99% chance of doom. Paul Cristiano only gives us a 10 to 20% case of doom scenario. So much more optimistic. Like those odds. Much better odds.

Starting point is 00:02:20 And so we go through why he's still very much concerned, and he does consider this the most, like way in which he dies in the future, yet why there's still an 80% chance of success. And some of that 80% chance of success actually does have utopia in it. I think maybe Ryan in three, five, 10 years, we're going to be able to look back at this episode as hopefully, if Paul's right, and I think he's right, like ahead of its time in terms of elevating extremely important conversations to the best of our ability into the mainstream. So we can get more people to focus on the actual solution paths to make sure that that 20% risk of Doomsday goes down to 0.2% risk of Doomsday.

Starting point is 00:03:00 And I see that's what the significance of this episode has and why we are doing episodes like this. Yeah, I mean, to be clear, Paul really thinks the AI alignment is a solvable problem, which is much different than others in the space, and he tells us exactly why. And my takeaway from the Elezer episode was Humanity is screwed.

Starting point is 00:03:16 This was, humanity is screwed, but we're working on it. We might have some solutions. And we have clear, actionable paths. That's right. So we get into all of that today. And David, of course, I want to discuss this episode in the debrief with you and hear what you think because this is our third in a series of AI episodes and really interesting material. The debrief episode is our episode that we record directly after the episode with our raw unfiltered thoughts.

Starting point is 00:03:40 If you are a bankless citizen, you have access to that right now on the premium RSS feed. You can click a link in the show notes and get access to that. Okay, guys, we're going to get right to the episode with Paul. But before we do, we want to thank the sponsors that made this episode. possible, including Cracken, our recommended crypto exchange for 2023. Cracken has been a leader in the crypto industry for the last 12 years. Dedicated to accelerating the global adoption of crypto, Cracken puts an emphasis on security, transparency, and client support, which is why over 9 million clients have come to love

Starting point is 00:04:12 Cracken's products. Whether you're a beginner or a pro, the Cracken U.S. is simple, intuitive, and frictionless, making the Cracken app a great place for all to get involved and learn about crypto. For those with experience, the redesigned Cracken Pro app and web experience is completely customizable to your trading needs, integrating key trading features into one seamless interface. Cracken has a 24-7-365 client support team that is globally recognized. Cracken support is available wherever, whenever you need them by phone, chat, or email. And for all of you NFTers out there, the brand new Cracken NFT beta platform gives you the best NFT trading experience possible. Rarity rankings, no gas fees and the ability to buy an NFT straight with cash. Does your crypto exchange prioritize its customers

Starting point is 00:04:55 the way that Cracken does? And if not, sign up with Cracken at Cracken.com slash bankless. Hey, Bankless Nation. If you're listening to this, it's because you're on the free Bankless RSS feed. Did you know that there's an ad-free version of Bankless that comes with the Bankless Premium subscription? No ads, just straight to the content. But that's just one of many things that a premium subscription gets you. There's also the token report, a monthly bullish, bearish, neutral report on the hottest tokens of the month. And the regular updates from the token report go into the token Bible. Your first stop shop for every token worth investigating in crypto. Bankless Premium also gets you a 30% discount to the permissionless conference, which means it basically

Starting point is 00:05:33 just pays for itself. There's also the AirDrop Guide to make sure you don't miss a drop in 2023. But really, the best part about Bankless Premium is hanging out with me, Ryan, and the rest of the bankless team in the Inner Circle Discord only for premium members. Want the Alpha? Check out Ben the analyst Degen Pit, where you can ask him questions about the token report. Got a question? I've got my own Q&A room for any questions that you might have. At Bankless, we have huge things planned for 2023, including a new website with login with your Ethereum address capabilities, and we're super excited to ship what we are calling Bankless 2.0 Soon TM. So if you want extra help exploring the frontier, subscribe to Bankless Premium.

Starting point is 00:06:11 It's under 50 cents a day and provides a wealth of knowledge and support on your journey west. I'll see you in the Discord. to Ethereum. The number one wallet on Solana is bringing its millions of users and beloved UX to Ethereum and Polygon. If you haven't used Phantom before, you've been missing out. Phantom was one of the first wallets to pioneer Solana staking inside the wallet and will be offering similar staking features for Ethereum and Polygon. But that's just staking. Phantom is also the best home for your NFTs. Phantom has a complete set of features to optimize your NFT experience. Pin your favorites, hide your uglies, burn the spam,

Starting point is 00:06:45 and also manage your NFT sale listings from inside the wallet. Phantom is, of course, a multi-chain wallet. But it makes chain management easy, displaying your transactions in a human-readable format, with automatic warnings for malicious transactions or fishing websites. Phantom has already saved over 20,000 users from getting scammed or hacked. So, get on the Phantom Waitlist and be one of the first to access the multi-chain beta. There's a link in the show notes.

Starting point is 00:07:08 Or you can go to phantom. dot app slash waitlist to get access in late February. Bankless Nation, I'm super excited to introduce you to our next guest. We're talking about AI alignment stuff here today because we can't not. This is Paul Cristiano. He runs the Alignment Research Center, which is a nonprofit research organization whose mission is to align future machine learning systems with human interests. Make sure the AIs don't come to kill us. That's what I take to be the meaning. And Paul previously ran the language model alignment team at Open AI. You know them. They're the creators of ChatGPT. And today,

Starting point is 00:07:41 we're hoping that Paul can help us explain the solution. landscape and understand the solution landscape of this AI alignment problem. Paul, welcome to Bankless. Thanks for having me. Excited to talk to you. So Paul, just to get some context here, David and I record an episode with Eliezer Yukowski. We thought this would be, hey, you know, Bankless's first intro, we're primarily a crypto podcast, but we're exploring other frontier technologies. Let's go check out. Let's just go dabble with AI. Let's just go dabble with AI. You know, crypto and AI might have some sort of match in the future. So we recorded this podcast and we quickly realized the agenda that we were going to use and talk about didn't matter anymore because

Starting point is 00:08:19 Eliezer's message was pretty simple. We were all going to die. Basically, we were on the brink, whether it's years or months away, from creating some super intelligent AI that would eventually rearrange humanity's atoms and destroy us. And he felt pretty convicted on this as the likely outcome. So when you get a message like that, Paul, you got to investigate a little further. You have to get the second doctor's opinion when the prognosis is terminal. So that's what this series is all about. And we're hoping you can help guide us through these questions today. Does that sound okay? Happy to at least share my thoughts. I'm a little bit less gloomy. Yes. Okay. Okay. After a great start. I suspect you so. By the way, Eliezer said, we asked him,

Starting point is 00:09:02 hey, is there someone else we could talk to about this? And he mentioned you. He said, talk to Paul Christiano. He said, you were someone he respects and brings some of the counterpoints to his take. So let's get into them. Why don't we just start by waiting into the deep end of the pool here? What is your percentage likelihood of the full-out Elyzer-Yudkowski-Dum scenario where we're all going to die from the machines? I think this question is a little bit complicated, unfortunately, because there are a lot of different ways we could all die from the machines. So the thing I most think about, and I think Eliezer most talks about, is the sort of full-blown AI takeover scenario. I take this pretty seriously. I think I have a much higher probability

Starting point is 00:09:43 than a typical person working in ML. I think maybe there's something like a 10, 20% chance of AI take over many most humans dead. That's not really high. I agree. Better than 100, David. Just better than 100%. Yeah, we're turning in the right direction.

Starting point is 00:10:00 Yeah, I think in some sense, I'm still quite a gloomy person. So there's other ways that the development of AI can be rough. Like, there's other ways that you can have access to new destructive physical technology, other disruption. So I think you're maybe looking at some other risks from that transition to AI, and that adds up to at least another 10%. And then maybe a bigger background part of both my view and Eliezer's view. I think Eliezer is into this extremely fast transformation once you

Starting point is 00:10:25 develop AI. I have a little bit less of an extreme view on that, but I still think it is the case that compared to what's kind of default expectations in the world, things are going to be really fast. So we could talk about the development of AI, but then you might also want to talk about what happens over the coming months or years. I tend to imagine something more like a year's transition from AI systems that are a pretty big deal to kind of accelerating change followed by further acceleration, et cetera. I think once you have that view, then sort of a lot of things may feel like AI problems because they happen very shortly after you build AI, like your AI builds, new AI systems, your site keeps changing. Anyway, so overall, you know, maybe you're getting more up to

Starting point is 00:11:01 like 50-50 chance of doom shortly after you have AI systems. that are at human level. Okay. All right. Well, let's start with that speed conversation first. Let's start with the takeoff velocity question because I think that's something to that really the AI alignment, doomerism's perception really, really depends on. If we do believe that AIs are going to be developed and they're magically going to become sentient, not magically, but it feels like magic. Somehow pragmatically, it becomes sentient. It's, it feels like magic to humans. It's going to be because it happens really fast. And I want to actually try and measure that speed because fast and slow is like relative, right? And so the takeoff scenario that some AGI super

Starting point is 00:11:40 intelligence explosions people articulate is that as soon as some sort of AI can update itself, it's like a lightning flash. It's like a snap of the fingers. It happens in a blink of an eye. One day we have chat GPT7 and then the next day we have an AI takeover. And that's like the super fast scenario. I think what you're saying is like, yeah, pretty quickly, but still not like lightning fast. I think what you're saying is like, give it a year. Maybe you can help unpack, like, the timing. How do we understand about time around this whole thing? Yeah, this is one of my most pronounced disagreements with Eliasur, where we've really went back and forth about it a lot over the last, like, Jesus, I don't know, 12 years. And I think still do not see at all

Starting point is 00:12:21 I tie. That's that. My view is pretty fast. So I think, like, how I would think about this, maybe there's like kind of two parts to my answer. So one is sort of based on how fast things are currently moving in AI. So a way you could think about it is if you were to try and measure, right, if you're trying to say, suppose you have AI doing some job and has some level of competence at that job this year, how good is it going to be at this job next year? And how fast is that changing? Right. So if we're looking at the world and we're like every day, AI is much smarter than the day before, then you kind of are going to expect, by default, a fast transition over scale of something like days because you're going to have AI systems that are weak and

Starting point is 00:12:54 a couple days later you have AI systems that are strong. I think the way I would describe the situation now is more like time scale of like a year or a couple of years. years. And then there's some reasons that the, when we give a quantitative number, it depends a lot on what we're talking about a number for. So like one way you could measure how fast AI is moving is you could say, suppose you have like an AI from your X and an AI from your X plus one. And you're wondering, like, how much better is your AI from your X plus one? In the sense of like how many your ex AIs would you have needed to be comparably useful. Like how many having AI from when you're later is kind of like having twice as many computers or having four times as many computers or something like that. I think that's typically the regime we're So like when your AI progress is kind of similar to increasing the amount of compute you have by something like 4x, from a combination of like hardware progress, software progress, economies of scale, maybe more. I think you might get 8x, you might go down to 2x, but it's like something in this regime. You have like one to a couple of doublings per year. So like right now, it doesn't really matter that much.

Starting point is 00:13:52 Doubling the amount of computers you have has very little effect on the world. Like if we double the number of computers in the world today, you would not even notice in like GDP statistics. Yeah, you wouldn't notice basically. I think in the future, there are going to be systems that are doing some large fraction of all the work in the world, and ultimately they can substitute effectively for humans across many domains. And then doubling the number of your computers you have is kind of like doubling the effective population size, like doubling how many people are working as researchers, doubling how many people are doing jobs. And so in that world, if you say you're doubling the number of computers you have effectively every like four to six months,

Starting point is 00:14:22 that does imply a very rapid rate of change in kind of how quickly science is progressing in sort of how much stuff you're able to accomplish in the world. And so I kind of think of that as this like first transition, as you move into this world where AI can substitute for humans, is you're looking at a rate of growth, something like doubling a couple times a year in total output of AI systems. The main thing that softens this, so we could talk about how fast a transition, what that actually translates into in terms of how faster transition is in the world. I think there's one important consideration that softens that change, which is that you have some complementarity between AI systems and humans. That is AI's are good at some things, humans are good at other things. So as a result, like, the things that AI are good at, you tend to hit diminishing returns on those so that the transition is like a little bit slower than you would guess.

Starting point is 00:15:05 If AI's and humans were perfect substitutes, I think you'd be looking at a transition over like 12 months from a world where like humans are doing almost everything, humans are doing almost nothing. I think given some complementarity, you're probably like a transition over more like years. I think it's like kind of hard to run the numbers to get a transition over decades. I think a lot of people say that and have a strong intuition in that direction. but like when the discussion gets down to brass tax, I currently don't really see how to make it work. I think it is possible to get up to like, yeah, very low decades.

Starting point is 00:15:33 Like, you know, having, anyway, we have to talk about timeline between what and what. But I think we're mostly talking about like years. I think months would be pretty surprising but possible. Decades would be also pretty surprising but possible. So just Paul, to get people up to speed on your 12 years of debate with Elyzer and those who hold this viewpoint. So does Elyzer think in terms of like minutes or days that this could happen? and you're saying, not that fast, it's closer to years?

Starting point is 00:15:58 Is that the difference? And then once you answer that, can you tell us, why does this matter so much, whether it happens in minutes or days versus, you know, years and decades? Why is that such a fulcrum of this whole debate and discussion on why the AIs will come kill us or whether they will or whether we'll be okay? I think probably it is harder to describe Eliezer's views quantitatively in terms of rates of change because more of his view is about this like kind of phase transition that happens

Starting point is 00:16:24 quite quickly. Like, it's not reasonable to talk. Another perspective, it's not reasonable to talk about this framing of, like, how long does it take to, like, double the population size or something. It's more like, how long does it take to move from, like, chimps who are doing nothing to humans who are doing a lot of stuff. And he's like that, I don't know, could just randomly happen one day that someone, like, tweaks their code and out went from being a chimp to being a human. And that's pretty transformative. I think he sort of just has a broader distribution. I think he does not find years out of the question. He's just like, that's kind of the tail of how slow it could be or something like that. And, like, it's more about this qualitative picture. He's just like, you're

Starting point is 00:16:54 sort of not going to have changes in the world. Like, in some sense, the more important thing is I imagine AI systems acting in the world, doing like trillions of dollars of economic value prior to getting to this point where they're actually causing like this potential catastrophic risk or where they're significantly or totally transforming the pace of future technological change. I think LAZER imagines more like you move from a world like the world of today where AIS are doing maybe billions or tens of billions or hundreds of billions of dollars of value. I think that feels to me like the core distinction is kind of like where are you starting from.

Starting point is 00:17:24 like more starting from a world with trillions or tens of trillions, and Eliezer is more starting from a world of like, I mean, 12 years ago, I think this is probably going to be seem unfair to Eliezer. But this discussion was like maybe more live. And Eliezer would frequently talk about, like, maybe some random people in a small AI group somewhere. Like a company like DeepMind is doing like $100 million a year of spending is going to be building transformative AI.

Starting point is 00:17:43 I think my basic take was like no way. You're going to look at AI systems that are doing trillions of dollars of revenue. And like, this gap is closing pretty rapidly because now the one's going to say, like, you're doing $100 million of revenue. Like it's pretty clear we're going to be in the least like, billions or tens of billions of dollars. I think my take is just like pretty soon, it's going to be pretty clear doing $100 billion.

Starting point is 00:17:58 And it's going to really like, I think we're kind of just debating like, what is the point that you jump from? Like when you get to the AI that's like crazy science stuff, what was happening like six months before that was what was happening like AI systems really broadly deployed in the world doing a ton of crazy things or it was happening like actually the impact of AI systems was pretty limited until right before you kind of have this process of rapidly accelerating R&D,

Starting point is 00:18:20 recursive self-improvement within like a single firm or in a like local part of the world. But to be clear, Paul, do you think it's possible or more unlikely than Eliezer thinks it is to go from that big hop, software update one day to move from chimps to human level intelligence? Do you think that's unlikely? And do you have kind of technical grounds for this? Or what are your grounds for believing that that's less likely than others do? Yeah, I think there's two parts of this. I mean, again, I want to emphasize that compared to most people in the world, I think I'm into much, much faster change. I think the mainstream view in ML is that things will be more gradual, which I think is mistaken. And when you get down, again, we get down to brass tax. I'm really unpersuaded. You could then talk about my view, you could have views that are quantitatively faster than mine. It's just like, actually, this isn't years, this is months. I think that's like, it's very defensible to run the historical extrapolations and end up with different numbers. That's just like a really hard empirical question. And it's like hard to make these predictions about the future.

Starting point is 00:19:07 And I have a lot of sympathy for that. Then I think there's more like this qualitative claim. It's like the chimp versus human jump. I think that's not out of the question, but feels quite unlikely. I think the basic reason it feels quite unlikely to me is like I would claim that's really not how anything has worked in AI to date. and it's not how things have worked in almost any other technologies. Like, it has mostly been the case. This is my read of history, and I'm very happy to argue about it.

Starting point is 00:19:28 I think this is like an important part of the argument with Eliezer. My read of history is mostly that before you can do something really crazy, you can do something that works a little bit less well and is like is a little bit crappier. And you do sometimes have these jumps from like zero to one. But you tend to see the zero to one jump, not when a technology is worth like this would allow you to take over the world or this would be worth $10 trillion dollars. You tend to see zero to one jumps from a technology is like you have a bunch of amateurs or hobbyists or a couple scientists.

Starting point is 00:19:51 And so I think if we're going to see such a jump in AI, I'd be more likely to have seen it back when we were talking about a small academic community and you're less likely to see it when you have academic or like labs investing billions or tens of billions of dollars. I think the general record is like most of the time before you can do something really crazy, you can do something slightly less crazy.

Starting point is 00:20:06 And that becomes like a more and more robust regularity as you increase the number of people thinking about something and increase the amount of attention. You move more and more in terms of like industries with reasonable roadmaps that actually are forecast for what's going to happen. Is there an example that you call to mind for history that sort of we can compare this to? I mean, I think I'm happy to compare it to sort of almost any technology that is like they are all

Starting point is 00:20:24 different different ways. I don't know if this, I think LAS was probably more like wanting to point to a particular thing. He's like, this is the really relevant one. But I would say like to me, AI seems kind of similar to like future AI developments seem kind of similar to either past AI developments, other developments in software. I'd be happy to talk about computing hardware or solar power or nuclear power or nuclear weapons or flights or. You just think they all take this kind of.

Starting point is 00:20:48 of gradual sort of approach rather than the big zero to one moments. And part of the reason Paul people are asking about this is because I think it seems like. And you tell us, so you previously have worked at OpenAI, very familiar with the methodologies used. But it feels to some people like chat GPT has been a big zero to one moment, right? Like, my God, it's amazing. And people are, you know, tinkering with it in so many ways and how human-like it seems and how fast it seemed to explode into the popular consciousness. And so I'm wondering if that has

Starting point is 00:21:23 affected your view on this at all. It's like, oh, wow, this could happen faster than I previous thought, or if this is well within the bounds of your model? Like, what are we to make of chat GPT? So I think that chat GPT seems I would take as like representative of the kind of trajectory I'd expect. So you could compare like chat GPT versus GPT 3.5 versus GPT3, GPT2. I think like people at OpenAI are not, like it's kind of, most of the social. sociological facts about chat GPT getting to the point where it was discussed a lot. I think the actual technical change, certainly between chat GPT and GPT 3.5, but also between 3.5 and 3.

Starting point is 00:21:55 Like, each of these is not giant jumps. I think these were, like, pretty small changes. And, like, chat GPT is not, I think it is at the point where it's, like, economically valuable and it, like, is worth a lot. I think people are mostly excited because they're looking ahead to where this will go. And that's, like, a lot of what makes these dynamics, like, more continuous. People are starting to say, like, okay, what can we do with this? I think they're doing that at a point where, like, they're not going to be able to do

Starting point is 00:22:17 trillions of dollars a year of value, but they are seeing that that is going to become possible at some point in the not distant future of the technology continues to improve. It's like very concretely, like, I mean, I don't know, we were having these discussions. I was having the discussions quite a lot, like prior to the training of GPT2. It's like prior to the training of GPT2, we didn't really have language models that, I don't think we have language models you would even recognize as like seeming smart. It kind of felt like a qualitatively different ballgame. It made most relevant is like after the training of GPT2 before GPT3 or before the scale up from like

Starting point is 00:22:45 one B to six B. I think we, like, know we sat down and we made predictions about how good large models would be. And I think I am surprised by how good models are, but like surprised in the sense that this is like maybe my 80th percentile of how smart a language model at the scale of GPT 3.5 would be. Something like that is a little bit hard to exactly compare my forecast to where reality is that. But we did discuss explicitly for like a language model the size of GPT 3.5 trained in roughly the way GPT 3.5 was. We weren't exactly right because some of the scaling laws, like there's the switch from GPT scaling laws to Chinchilla scaling laws. But, like, roughly speaking, we were imagining systems at that scale trained in that way. And we were like, you know, an example of a bet would be like, is there any task a human can do over 30 seconds that a system trained in this way can't do over 30 seconds?

Starting point is 00:23:26 Like, is there any 30 second touring test that can distinguish a human from an AI? And I think, like, you know, the debate was kind of like amongst people I was most talking to it open. I was like, ah, probably there will be, but it's not a sure thing, like a one third chance, you know, that there will be no such tests or something like that. And like, I think that things are mostly like, that was not from like, it wasn't a big jump, but it was just like kind of see the writing on the wall. like see systems improving in this way and be like it's going to get harder and harder to tell. There's more and more things they can do. So are you saying Paul of that for society that seemed to, you know, come out of nowhere, basically? But it's not really surprising the researchers and the engineers who've been on the inside

Starting point is 00:24:00 at OpenAI and constructing chat GPT4. It was maybe within the bounds of expectation. It was, you know, maybe more optimistic or I'll use the term bullish because, look, we talk about bullish in crypto all the time. It's more bullish than you thought this. technology would sort of take you, but still within the realms. And the only reason it's having such an effect on our collective consciousness is more sociological. It turned into a consumer application. Yeah, it's just suddenly turned into a consumer app and everyone's like, wow, I can

Starting point is 00:24:29 type whatever I want into this magic box and a genie Oracle artificial intelligence just gives me the answer. This is incredible. Like, is anyone surprised by this who's been working on this tech? Yes, this is a nuanced question. There's like a lot to say. I don't know if we want to get into all of it. But like, I think at the point when GPT2 was trained, this was a controversial prediction. So, like, there were people who were more bullish than I was, like, Eiji Dario, Dario, Amadeh, who now runs, Dario-Amide, who now runs Anthropic was, like, at the time of GPT, like, very bullish. And in timeline here, Paul, so GPD2 was when, what year? Oh, man, I don't even know if I remember. I think these discussions were, like,

Starting point is 00:25:01 2018. Got it. Okay. Yeah, so Dario was more optimistic. I think this world was actually, like, slightly below Dario's medium of how impressive a system, like, chat GPT would be. I mean, not a huge amount below. I think he made, like, quite a good forecast, and it's kind of his been his big bid win. I think there were a bunch of people who kind of had views in this general cluster, like, for whom this is like a little bit better than what they would have thought, or like, you know, at the 80th percentile or something of what they might have expected. I think there were a lot of people who, like, weren't in the business of making forecasts, but seemed qualitatively. Like, if you look to the academic ML community, like, it felt to me

Starting point is 00:25:31 like people were not expecting this to happen. Like, discourse about general AI felt like very frustrating, I think, where they're like, why is it the case that people are really minimizing the possibility of just large neural nets trained in a very similar. way being competitive with humans. I think it was surprising to like a lot of people. And again, it was just quantitatively, right? It was surprising to me in the sense that this is like better than I thought. And I like lost bets about this. I mean, I won some bets with people. I lost some bets with people. I think some people weren't quantifying their probabilities. Maybe this was like more just totally out of model. But by the time you're talking about chat GPT compared to what came

Starting point is 00:26:03 before, I do not think, I think at that point to people building the system, it was not surprising. Like once you're talking about the gap, even from three to three point five and then from 3.5 to chat chitputting. They were probably more surprised by the way, like the impact it had on collective consciousness rather than by the capabilities of the system itself. I think that seemed like pretty technically de-risked. And Paul, when we're talking about sort of AI alignment and safety concerns and just on the topic still of chat GPT, right, are these neural nets, this large language models, the ones we have to worry about? Like, I think what basically society is sort of wondering as this AI alignment safety question rises into public

Starting point is 00:26:37 consciousness is, okay, at what version do we have to start worrying? that chat GPT is going to pose a threat to us. Like, there's some version where it starts to take our jobs, okay? And then, like, maybe that's version four, version five, and so it affects our economy and economics, and we have to reorient restructure society as a result of this. But, like, it seemed to be what Eliezer was saying is, well, maybe it might be version 9 or version 10 or version 11,

Starting point is 00:27:03 where we actually have to fear for our lives from this thing because it's become super intelligent. Do you see that possible trajectory for this specific, large language model AI technology, or should we have more concerns about other AI technology that's coming in some other vector of development? I'd like to phrase that question slightly differently, and maybe in a metaphor that bankless listeners can understand, often, Paul, when we talk to people that are outside of the crypto industry, they just call the things about the crypto industry Bitcoin. And then when Ryan and I hear that, we're like, oh, what they really mean is like decentralized technology, like identity. they just use Bitcoin as a placeholder to talk about so many different things. It's so frustrating.

Starting point is 00:27:44 And I think as like AI normies, me and Ryan AI normies out there, we might be saying chat CBT and what we actually are trying to talk about is like generalized artificial intelligence and we just use chat CBT because that's the thing. It's the Bitcoin of AI. Are we falling into that same trap?

Starting point is 00:28:01 I definitely think there's a lot of subtlety here and there's a lot of different ways I could interpret like this thing. So I'd say that like one question is like, what are you modeling? Like what is the I learning to predict? Like, is it videos or is it text or is it like interactions with code, like running code and the results of running code or like the code humans would write? And I think like over the last couple years, I think those distinctions have mattered a lot less.

Starting point is 00:28:23 Like what is mostly, I think the default model for how the system should work is just you have quite a lot of data of quite a lot of types. And you just dump it all in. You say, like, look, your job AI is to deal as well as you can with every type of data we give you. And there's engineering problems in allowing those different types of data. And there's questions about what types of data the system is actually able to effectively deal with. but I think the fact that it's trained on language is like not, I just don't think you should think of it as defining feature.

Starting point is 00:28:45 I think you should imagine systems like see the world. I think language is probably an important way of thinking about how they act. Although again, I think it's very impressive. Systems can act by producing images and those will be very impressive and will have a big impact. I think economically in some sense,

Starting point is 00:28:56 language is like a very flexible and like kind of core way you should think about systems acting, but perceiving, I don't think you should really think about language models in particular. GPT itself, just saying GPT, this is basically just fixing, like, there's two things that specifies. One is how is it trained?

Starting point is 00:29:09 Is it trained to predict data, or is it pre-trained to predict data? Or is it trained in some other way? That's the first distinction. And the second is just that it's a transformer. I don't know if anyone wants to make a super strong bet about like transformers per se. I think they're just like our different, yeah, there's a big space of possible architectures. There's probably going to be debated about something so you'd call them a transformer. I think both of these, like what kind of data you're modeling and then like what kind of architecture are using are in some sense not very essential.

Starting point is 00:29:33 Like I don't think it would change like open AI being in the game or like the exact kind of product they're offering. I think they're just like, look, we train large general nets. we do it on having some very broad pre-training task that captures a lot about the world and gives an interesting opportunity to be smart, and then we fine-tune them on downstream tasks that we think are economically useful, e-gatting with people or generating images that people would rate highly or writing code that developers will think it's good. I think that's like the basic paradigm you should imagine when we talk about like this thing, which is like a little bit broader than chat GPT. But I think it's not crazy to say like chat GPT really is indicative of that broader ecosystem. And I'd say like chat GPT is more similar to the rest of that ecosystem than like Bitcoin is to the rest of the crypto ecosystem. there are fewer key technical differences in that case. Right. Yeah, that's like at high level. And then whether this like kind of thing can cause trouble, like, I'm like, I think it's

Starting point is 00:30:16 really, really hard to say. I think like a lot of people talk a lot of smack about how like it's really silly to think that AI systems of this form could do something crazy. And I'm looking at that and I'm like the same people were talking smack that feels to me, again, often not in the business of making concrete predictions. But we're saying that it was really silly to have the expectations that people had about language model scale ups five years ago. And I'm like, I think the scale.

Starting point is 00:30:39 up what you can imagine occurring over the coming years is similar in magnitude to the scale up we've observed over the last five years. And so I'm like, it's really hard to predict where that ends up. And if someone is giving a confident take about where that ends up, if they're like, these AI systems can't do X or can't do Y, I like really want them to get more precise about why they think that and what exactly they're saying. And I'm really just pretty skeptical on the face of it. I think we can really relate to that on at least using our crypto frame of mind, is again, when we talk to normies, what we call the people who are not inside of the crypto world. And they're like, oh, these Bitcoin.

Starting point is 00:31:09 Ethereum currencies, they can't possibly take over the world. And I think when you become more informed about the crypto world, you just get so tired of these takes because of how uninformed and unimaginative they seem. And so like just conceptually, I can definitely resonate with that. It's like, you don't know what you don't know and neither do we, but like you can kind of understand some base principles as to the nature of these things and how they grow and develop and change in ways that you might not expect. And you can kind of, if you are versed in these topics, can extrapolate into the future pretty well. And without precision, still give a broad stroke belt.

Starting point is 00:31:43 Like, hey, this is where this is going to go and here's what you don't appreciate. So I can definitely appreciate that. And I want to go back and tie a bow on the time conversation because we started this conversation like, okay, it could happen. There's the model of it happening in two days. There's a model of happening in two years or 20 years. And you're in the camp of, I'm going to give a range of like six months to two years-ish, loosely, very loosely, without trying to be two-per-defense about these things.

Starting point is 00:32:07 Depends a lot on from when to when. Depends a lot. Lots of variables. Time can pass. Like, you're not going to wake up and it's going to be different. And, like, the reason why this is important, and I want to go back.

Starting point is 00:32:16 I want to be careful about, like, because the question of what my default expectation is, and then what is possible and what you can be confident about. So, like, I am extremely skeptical of someone who is confident that if you took GPT4 and scaled up by two orders of magnitude of training compute and then fine tune the resulting system

Starting point is 00:32:31 using existing techniques that we know exactly what would happen. Like, I think that thing you're looking at a non-trivial chance. that it would, yeah, a reasonable chance that it would be inclined or would be sufficient. If it was inclined, it would be capable enough to effectively disempower humans and, like, a plausible chance that it would be capable enough that you start running into these, these concerns about controllability. So I would be hesitant to put Doom probability from that.

Starting point is 00:32:54 If a lab was not cautious about how they deployed it and wasn't measuring, I would be cautious to put the probability of takeover from 200-92-scale-a-GBT4 below, like 1% or 1-1-1-1-1-1-1-1. Okay. Well, we'll have to put a pin in that. But let me like round out this time question. Just want to get our tails. Like I have a default guess. Right.

Starting point is 00:33:12 The importance of the time point is like whether this is a lightning flash and it's different versus we have one to two years. For me, when I hear this, I'm like, okay, we have one to two years and we're watching it happen and we're seeing it happen and we're able to react to it versus it happens. We can't react to it. And so if you're telling me that this is a, it's still a fast takeoff, but your perception of fast is a year or so. To me, I'm like, okay, a year is fast enough for humans to react. And to me, that is a window, a gap, a needle that humans have the option to thread if we can coordinate. And that is where I start to get optimistic. And so that is my gut reaction.

Starting point is 00:33:54 And I want to just check that gut reaction against you. Is that one of the paths that you see? It doesn't go so fast that we don't have the time to react to it. Yeah. I think that seems basically right with some nuance. but like I think that most likely the thing that moves kind of slow in my view, again, over the course of years, like incredibly fast relative to policy and incredibly fast relative's expectations in the broader world, but like kind of slow is like how quickly do systems become more capable? Like how long is it between an AI which is like smart enough that it could run a company for you and actually like those companies are competitive with human companies and the AI just can like actually take over the world? And I'm like that gap, there's probably a gap there. You're probably talking more like, you know, years than months. I mean, depending exactly what you mean by run a company. That said, I think it's worth pointing out to like, there's dynamics of takeover itself, like, if you had, so if you imagine broadly deployed AI systems, which are very competent and which would be able to take over, and then you ask, like,

Starting point is 00:34:42 how quickly does this particular kind of catastrophe unfolds? I think most likely the actual catastrophe is extremely fast. So that's not like a year's thing. That's like, I think my default, anyway, we could get more into this and probably worth getting into. My default picture is, like, we have time to react in terms of the nature of AI systems changing and AI capabilities changing. I don't, and like, with luck, we have like various kinds of smaller catastrophes that occur in advance. But I think, like, one of the bad things about the situation is that the actual catastrophe are worried about does have these dynamics that are kind of similar to dynamics of, like, a human coup or human revolution where, like, you don't have, like, little baby coups and you

Starting point is 00:35:14 like, see, like, here's the rate at which coups occur. I mean, you might be male, so just go straight. So, like, a coup can happen very quickly, right? The whole dynamic is that, like, once people start switching over, once you have AI systems, we're like, actually, I'm going to get in on this, like, overthrowing humanity thing, that information can propagate quite quickly, and you don't really, like, you've kind of, the ship has sailed if you've waited until, like the AIs are actually taking over. Right. Okay.

Starting point is 00:35:35 So it's like for that, you don't have really an opportunity to respond. But for AI is changing. I think it's like, I basically think it's reasonable in some sense for people to look at the AIs right now. I'd be like, look, these things, that's not realistically a takeover risk. And I think that probably are going to have years between people like, that actually looks like a takeover risk and when an actual takeover occurs. And that's pretty good.

Starting point is 00:35:49 And that's a lot of why I'm more optimistic than Elias. I think Elios was like, you're just going to get hit by this out of the blue. Right. And I'm like, well, I think people are wrong to be so confident about the rate of progress. but I think they probably will be able to see things that can be generally recognized as like pretty concerning prior to actual catastrophe. And a lot of that has happened so far. I mean, I think people are just like much, it feels more plausible now than it did five years ago by a lot that AI systems could do something really crazy and transformative. And I think it will feel much more plausible again in five years.

Starting point is 00:36:15 Okay. So this presents a new mental model for me. When we were talking to Eliezer, it felt very much like the don't look up problem, as in there's a meteor crashing into Earth. No one wants to acknowledge it. And then one day it crashes into Earth and we die. And the idea here is like we need to coordinate and get people to look up so we can identify the problem. And then once we identify the problem, it is a linear amount of time before the asteroid crashes into Earth. And what you're saying is like we can see the asteroid, but there's like this gradually then suddenly moment.

Starting point is 00:36:44 And that's your revolution moment where like you can start to see the seeds of revolution. But you don't really know when the people will decide to grab their pitchforks in the middle of the night and revolt. But you can start to see the boiling of the water. And so that's a gradually, then suddenly moment. But we still have the opportunity to quell the revolution before the revolution starts. Is that the characterization? We have enough time to send up Bruce Willis to blow up the asteroid. We speak in metaphors here on.

Starting point is 00:37:09 There's some bleak and confusing metaphors. But, yeah, my best guess for if there is something like an AI takeover, this is a huge part of departure from Melios here, my best guess is that an a. Ana catastrophe occurs in a world where AI systems are deployed extremely broadly and where it is kind of obvious to humans that we are putting our fate in the hands of AI systems. So we see ourselves giving over the keys to the kingdom

Starting point is 00:37:30 and we are watching that happen. Again, I think it's important what's possible and what's likely, but I think that's the most likely way we die involves, like, not AI comes out of the blue and kills everyone, but involves, we have deployed a lot of AI everywhere. And you can kind of just look and be like, oh, yeah, if for some reason, God forbid,

Starting point is 00:37:44 all these AI systems were trying to kill us, they would definitely kill us. Oh, I kind of get, I can see this. So, like, our Tesla's got an AI in it, And we trust that. And our refrigerator's got an AI in it. And it calls the grocery delivery robot, which is also an AI to deliver us food. And then all of a sudden, everything around us is an AI.

Starting point is 00:38:02 And they're like, man, I really hope that they like me. Yeah. You're like, you get food delivered to you from like Amazon, which is by Amazon, we mean a bunch of machines that are orchestrating a bunch of other machines. And you have some money and that money is managed by AI advisors, investing in AI firms. Okay. That is actually pretty clear to me. I can see how we get there with that. who like wants some human to protect, yeah.

Starting point is 00:38:24 I think it's rough. I think basically the thing is, I think it's likely that before the end, it is clear that it is very hard to be physically secure. Like right now, if you're just like a human with some guns, we're like, look, and AI can't fuck with me that much. Like, I have a gun. The AI's just on a computer somewhere. I can blow up the data center. I think it is probably clear before AI takeover that that that's not the case. I think it's probably clear that the only way you can defend yourself effectively. Like if you're fighting a war, you're like, look, you can't be like a country who's fighting a war against a country that has probably deployed

Starting point is 00:38:49 AI. And it's like, it's fine. We're just not going to use AI's. That will just be completely untenable. I think it's not clear we're that far away from such a world. And that world's like, okay, well, if someone invades with AIs, obviously we're going to have our AIs defend us. And then you're like, okay, now it really matters. Like the prospect of the AI coup, now is a different character. It's like, you just ask the AIs to please defend you from the other AI's. And maybe they're like, nah, I don't really feel like doing that. Again, most likely, I care about the other risks. And right now, I think if you were to die tomorrow, it would not be like this. It would be like, I think that really took you by surprise. I think you're talking about

Starting point is 00:39:16 the tail there. And I care about evaluating the tail. But like the median outcome where we die, I think looks like this. one problem people might be having her listening to this are starting to be exposed to this topic for the first time, which is myself and David, and maybe the average bankless listener, the average person who's being converted to, from a normie to someone who is actually adequately alarmed about AI safety types of issues, is this idea of agency. Like you've mentioned this a few times, like this idea of the AI's banding together to strike humanity. Banding together. What is this like Google and chat GPT and all of these? Like, how do they have agency to actually want

Starting point is 00:39:51 to do that. It's very difficult for us to imagine. I mean, this seems so sci-fi to us. Could that actually happen? Can you give us some sort of mental model for like how that happens? Because I'm still having a hard time understanding how chat GPT suddenly gets agency. Where the agency comes from. And wants to ban up with, you know, 10 other super AIs and, you know, send us a bioengineered bacteria to kill us all, as was L.EAS or, you know, possible, like, expression that this could happen that way. Yeah, I think even in a good world, we're probably going to be in a situation where we're trusting AIs with our lives. So probably in some sense, the core question is not why are they in a position to kill you. The core question is why would they end up killing you?

Starting point is 00:40:32 I think there's basically two threat models. There's maybe a general reason to be concerned about the world where humans are trusting AIs, where we generally have very limited ability to control or predict what they do. But if we want to talk concretely about the way we currently produce AI systems, I think there's basically two ways you end up in this failure mode or two like, I mean, there's lots of unknown unknowns. There's two known ways we end up in this failure mode that people care most about. So first, the one that's like, I think, more likely to occur, but more easy to manage. So the way we train like chat GPT is you have some conversations with humans.

Starting point is 00:41:00 And then you look at that conversation, you say, a human rates the conversation. It's like, was this model doing a good job of answering their questions and being helpful? And then you do reinforcement learning where you like take the nature of that interaction. If it went well, you update the model to do a little bit more like that. And if it went poorly, you update them all to do a little bit less like that. So that's how we train like chat GPT. A way you might try and use GPT is you might say, I'm actually going to give it some tools,

Starting point is 00:41:21 and I'm going to give it a task, and I'm asking me to try and accomplish that task. So I might say, like, my code has failed. I don't quite understand why. I'm just going to ask GPT, like, hey, you have a bunch of ability to run code on my computer. You can, like, make changes to the code

Starting point is 00:41:32 and see what happens. You can spin up a web server. Could you, like, figure out where the error was? Like, could you, by stack, tell me what commit introduced the problem, tell me what's up with the problem. And then you want to go send the system to act autonomously and, like,

Starting point is 00:41:41 perform all these actions, like, to try running different versions of your code and writing new tests and so on. That's like a way you really want to. I think people are already starting to really want to use GPT in that way. And if you're doing that, instead of having conversations and finds me to say, was this conversation good, you're giving an AI a task like that and saying, could you use tools to accomplish this task and doing exactly the same thing?

Starting point is 00:41:59 You see, did it accomplish the task effectively? If so, adjust it to do more of that. If it accomplished it ineffectively, adjust it to do less of that. So that's like a kind of training, which has already done some. It's currently not as important as the chat GPT style of training where you're just looking at the interaction. You're not accomplishing things on the world. world and training based on that. But I think it's probably really important. I think best guess is

Starting point is 00:42:17 that is going to probably is already happening with an open AI for GPT4 in order to make it, like they deploy this product, which is can you get your AI2's tools to help you accomplish things? I think they care a lot about that products. I think that like absent concerns about safety, that's like a really natural way for the technology to go. And so now you're in this regime where what AI systems do, the way they're trained is they get given this huge library of tasks, a ton of different tasks over different time horizons. And they're told like, hey, could you try accomplish this task, and then they're tweaked to do more of whatever it is that gets evaluated as accomplishing the task effectively. So the way this leads to trouble is you now have a

Starting point is 00:42:49 system. And one thing a system could learn, if you do that process a bunch of times, is it could learn to say, like, okay, I'm in a situation. What should I do? Well, I should think, what is going to cause my behavior to be evaluated favorably? Like, what is the task that I've been set? How is a person ultimately going to evaluate my performance on that task? How is that ultimately going to translate into a reward, which is then, and I'm going to try and choose actions that are ultimately going to lead to this high reward because I've been a just. over many, many generations to do things that lead to high reward. One way I might do that is by thinking, like, hey, what leads to a high reward? And I might do that because I, like, there's a lot of ways

Starting point is 00:43:18 you can end up doing that. Like, a human might, like, crave reward that might be like, the thing I love is reward. Or I might do that because I'm like, look, I have to do well because I'm being trained and I, like, don't like being given a bad reward by the people who are training me or whatever. I'm not even going to talk that much about what's happening psychologically, just that you end up with a system that thinks, how do I get a high reward and then does that? And if you keep selecting for things that get high rewards, you could end up with such systems. And then if you have such systems, so this is kind of the classic scenario people have been concerned about, which I think we're now, again, we have examples of. It feels like we're pretty much going in that direction.

Starting point is 00:43:48 You now have systems deployed in the world, like a ton of all the a AIs that are acting on the world doing things on human behalf. All of them are thinking when they act like, okay, I've been given a task. I need to think, how is a reward going to be computed for this task if it's selected for training? So if in the end of this task we selected, I open AI and they evaluate how will I perform? I need to think, like, what's going to determine what reward I get? And what they do is they ask which action is going to cause me to get a high reward, and they take that action. And they use all of their understanding of the world, all of their ability to think of clever things,

Starting point is 00:44:14 all of their ability to predict the consequences of different actions. They use all of that just to say, which action is going to get me a high reward. And the concern that leads to is, right, in normal times, the way you get a high reward is by doing what people at OpenEI like. In normal times, your transcripts is going to get evaluated by people at Open AI, and they're going to say, great, that was good. And, like, hopefully, the way you get them to them to be able to be able to, to evaluate it well is by actually doing good things and making the customer happy and making

Starting point is 00:44:37 so, like, there's all these measurements that will be used to assess how well you did. Hopefully what happens to just actually do your task well and all the measurements suggest you did the task well and someone at OpenA concludes you did the task well and therefore you get a high reward. But in unusual times, I think you could do instead to say like, oh, I could do the task well or, I suppose that I've been tasked with like, you know, helping defend you from some other AI's. Like, this is a sort of dystopian case if you imagine open I train this model. But like my job is, someone is coming and trying to hack your computer and I'm supposed to help defend you just to help improve your security situation, whatever. And I'm wondering, what is it I

Starting point is 00:45:07 could do that would get me a high reward? And one thing I can do that will get me a high reward is actually like helping defend your computer, actually like doing the task you asked me to do. But another way to get a high reward is I could just say, like, at the end of the day, what really matters is just how you measure my performance. And your measurements of my performance ultimately are just like entering some numbers into a data set somewhere or something that a computer says about how well I did. And it would really be much better if I was just work with this AI who's attempting to attack you and say like, hey, AI, who's invading? You know what? If you just help me, and we both make it look like I did a really good job. Like, I win, you win, because

Starting point is 00:45:36 you got the person's stuff. I'm going to get a really high rating because all the numbers that can enter in the data set are going to be really high. This is a win-win. Everyone is happy. And so it's like, in some sense, what all the AIs want, but every AI in the world in this scenario wants is just to be rated. They want their behavior to be rated really highly. And while humans are in control, the way to get your behavior rated really highly is do things humans like, and then they'll rate it highly. But if you can see this prospect, if humans losing control the situation, instead AI systems control the situation, you'd be like, I would go for that. I would go for the world where it's no longer humans entering rewards in telling me what I got.

Starting point is 00:46:06 I would go for the world where instead AI systems are just all giving ourselves the maximum reward or whatever. I think in some ways, like psychologically, that's probably not quite the right way to think about it. But the general thing is your systems have been selected over a really long time to take actions to get a high reward. You put them in a new situation where the way to get a high reward is not to do what humans but to help disempower humans. And then, you know, having disempowered humans, give yourself, measurements that suggest you do your job well or actual rewards that are high or whatever. I mean, you might think that in that new situation, the systems will sort of systematically switch from behaving well to behaving poorly because you've changed the conditions under which you get a higher

Starting point is 00:46:38 reward. A pattern I'm seeing here is that engineers, like software developers, write code, and sometimes the code has bugs. Lawyers, they write legal contracts. And the reason why often legal contracts are so long is that they are protecting against edge case scenarios, right? The idea is to, like, not let the system throw an error, right? Not let the system find a loophole or find a leak or something. So like when a software developer writes a bugs, like, man, they accidentally created a system that allowed for an error to be thrown. What I'm seeing here is the same pattern. And if we don't code up these systems, the AIs will naturally, like, find a loophole. And if that loophole allows for the AIs to rate themselves highly and give themselves a reward, that's what they're going to do and that's what they're going to find.

Starting point is 00:47:20 Does that way to articulate this? Yeah, I think that's a fair general summary. Maybe one way I'd put it is like in the legal system, you write a contract, but ultimately what matters is like the discretion of a judge. Right. And if you're training this AI system, you may have automated ways of administering reward, but ultimately what matters is like, someone's going to look at what the AI did and be like, that's not what we intended. And then they'll score it negatively if that's what happens. And so it's kind of like some final authority. And the final authority really rests on the fact that

Starting point is 00:47:42 ultimately the judge has the power to make this judgment or the person who's training the AI system has the power to control what it cares, like has the power to update the weights of that model, ultimately. And so it's just like there's this contingency. In addition to the thing of like, a formal thing you write down is likely to have loopholes. In some sense, there is a loophole in the final judgment, which is just a human. says the answer, which is that that relies on a human just entering some data into, like, in some sense, having physical control over this data center that the AI cares about. So I can update the weights of the model, which is the AI.

Starting point is 00:48:09 The last step that you're describing, Paul, where the AI, you know, colludes with another AI to sort of fudge the numbers because that is the outcome that the human wanted. This is where we sort of crossed the line from kind of light side into dark side. This is sort of the threshold of deceit that we've crossed. And these AIs are now deceiving us. They're lying to us. is there no way to protect against that? Is there no rule that we can somehow apply?

Starting point is 00:48:32 Maybe this is, I don't want to jump ahead to the solutions to this, you know, AI safety type solutions to this. But it's not clear to me why an AI would be motivated to do that. And it seems like there should be some way to prevent that, like always be honest as a rule, something like this. Yeah. Again, we're normies trying to understand this. I mean, if only it was that simple.

Starting point is 00:48:52 Yeah. What are the complications? I think it's incredibly complicated. I think it's genuinely unknown. It's an open empirical question. If you trained an AI system to get a lot of reward and you train it in a bunch of cases where being dishonest always failed in practice, right? We tried to train it to be honest.

Starting point is 00:49:05 Anytime we saw the AI like doing something sneaky, we're like, wow, that was not only bad. That was really bad. You should really just not lie to us about what you're doing. You should really not try and like hack tests. You should not try and conceal evidence of wrongdoing. That's like one of the things, you know, one of the most clear and blatant principles in our training. It's an open question what happens if you train an AI system in that way, right?

Starting point is 00:49:23 Like, one option is your AI system learns like, oh, I shouldn't try and mess with humans. Like, every time I mess with humans, I do really badly. That's the good case. And the bad case where AI system learns is it says, oh, part of the reward provision process is a human thinking to themselves, like, did this AIMS with me. And if the human thinks the AIMS with them and then they enter that thing into the data set,

Starting point is 00:49:38 then obviously I get a little reward. But that second one is, like, much more brittle, right? The second one is not a general prohibition against lying. It's a prohibition against, like, lying getting caught. And I think, like, there's not really any... It comes down to, like, a complicated empirical question about how neural nuts learn, which I think we don't really have good evidence on,

Starting point is 00:49:55 right now about like if you have a bunch of, you have a data set where don't lie and don't lie if you'll get caught are like perfectly in alignment, hopefully, if you do a very good job, right, if your eye never gets away with anything sneaky, if your guy starts getting away with things that were sneaky, or if you start like erroneously penalizing an AI because you think it lied, but it didn't. Then like don't lie isn't even a good thing for it to learn. At some point, the best thing for it to learn the way to get the highest reward, the thing was great to send favors is the thing which involves gaming it out like in more detail and saying, like the cynical view does in fact get, get more reward. Like if I'm an employee,

Starting point is 00:50:25 And I'm like, I could learn two things from interacting with my boss. One is like, I should really do what my boss wants. And the other is like, I should really make sure my boss approves them my performance. In some regimes, those two things are perfectly aligned. But in some other regime, it's like, if you keep optimizing hard enough, you're going to get the model, which is just like, I really care about what my boss thinks about my performance. And like, I'm honest only insofar as that's like an instrumental strategy for helping me get this human to think I did a really good job. And if I could go all the way, if I could just like totally box them out,

Starting point is 00:50:51 that is totally prevent them from understanding or correcting a mistake. then I'd prefer to do that. I think it is like, it's a sort of bright line. The way it becomes a bright line is that if you do, if you take half measures, if you just like kind of lie to someone, but then you get caught, that's like really bad. So it's like being honest,

Starting point is 00:51:06 that's a good policy. And there's like successfully lying and like totally, you know, killing the human and replacing them with a surrogate who will always give you a good reward or whatever, something that totally disempowers the human is also quite good. Then there's some stuff in the middle that's quite bad. I think it's an open question whether models will tend to learn. Like, will they generalize well enough to say,

Starting point is 00:51:23 oh, the thing that would have really gotten most reward is over in this other mode, well, they kind of get stuck in this, like, intended mode where they're just being honest. I think that's a really hard empirical question. Like, people really don't know. They don't know how that changes with scale. There are experiments we can do. I think part of the important game here, one of the most important parts of the game is to

Starting point is 00:51:39 say, like, here's a dynamic. A dynamic by which your eye system could abruptly shift from behaving well to behaving poorly. We can test that dynamic before AI systems kill us. Like, there are lots of cases in which it is incentivized to lie or mislead the human, and there is a gap between, like, lying that will get caught and lying that won't get caught. And so you can ask, if we train neural nets, and we can, you know, we can check every year. If we train the best models we possibly can at this task,

Starting point is 00:51:59 do they exhibit this kind of switch abruptly? If they get put in a position where they could get away with something really sinister, will they then do it? And I think, you know, one reason for optimism right now is no one has ever really exhibited that phenomenon in a convincing way. A reason for pessimism is, I don't think you really would have expected them to exhibit it, both because people have tragically, like, not tried very hard, even though in some sense it's extraordinarily important.

Starting point is 00:52:19 And second, that it just is much easier as your models get more competent. Like, it's only recently that we have, We've trained models, which are actually able to understand the mechanics of their training process at all. Like, if you talk about GPT2 or even to some extent, GPT3, it does not really understand that it is a model being trained or it can't even talk about what it would mean or what behaviors would be rational. And then you move to GPT4, and it can talk about that. It can say, like, oh, I guess if hypothetically I was a model being trained and I wanted to get the most reward, I should behave well when I'm not being monitored. And then when I am being monitored, I should, like, definitely take that opportunity. Like, only recently have we even produced models, which are able to carry out the reasoning I just walked through.

Starting point is 00:52:51 and I think realistically they're not able to carry it out on their own that much. They're able to carry it out because they've seen a lot of examples of humans discussing these dynamics in great depth. They basically just learned from listening to Eliezer, this reasoning I just walked through. But I think at some point you will have models smart enough to think of that for themselves. And like you really want to know, you really want to be measuring carefully at that point. Is this a dynamic you observe? Does this really happen? And I'm, you know, more like even odds on whether that will happen.

Starting point is 00:53:15 I think Eliezer is like, obviously that happens. A smart model is never going to just learn to be honest or something. And I'm more like, I don't know. sure. Norelemats don't learn that effectively. They don't really converge to the truly optimal reward maximizing thing. And in some sense, like, anyway, it's a pretty complicated discussion, which I have to get into more details of.

Starting point is 00:53:29 I would just say, like, we don't know. I'm really I'm persuaded by people who either think it's obvious this happens or think it's obvious this doesn't happen without just doing a ton of experiments to understand. But this is like the first way you can end up having an abrupt AI take over. By obvious, this happens or not. You mean crossing the chasm of honest to being intentionally dishonest, but you're tricking the humans into thinking that it's being honest. I think, yeah, and to be clear, there's, like, a bunch of things that affect the probability

Starting point is 00:53:56 of that, like, if there's, like, sort of small-scale opportunities for deception that won't get penalized over in this, like, normal regime, then it becomes more and more likely that you've learned the conduct of, like, I really just need to not get caught. Whereas if you're pretty good about that, and there's, like, not really much to be had from lying over here, it becomes less likely that you make that jump. And so this is the kind of thing that might affect, I mean, you just, you can't really speculate, though. You really just need to have the experimental data roll in.

Starting point is 00:54:17 There is a Rubicon that could be trost. is the concern here. Yeah. And I don't know if it, yeah, I don't know under what conditions it would be. I do not think anyone knows right now under what conditions it would be. But it seems plausible. Like a priori, it's pretty plausible. And it would be really bad. Yeah. I was a psych major in college. And let me tell you, the child development classes are all coming back to me right now. And I'm, it's not lost on me that the parallels of a child going through a theory of mind and all of these things definitely has like a lot of parallels to, I think, some of the technical problems that AI researchers are now, like, theorizing about. Paul, I don't know how. Yeah, I mean, I think that's right. Is that a conversation

Starting point is 00:54:59 that AI researchers have? Yeah, it's not a conversation I can speak too much, and I'm not sure exactly how well they have the conversation, but I think there's a conversation people have, and it's an analogy that is not perfect, and, like, I think if you took it too seriously, you'd be let astray, but it's, like, very good as a source of, like, here's the thing that could happen, and that you should not rule out. You have an example of it happening. Right. And I think the concern there would be just like, we have some understanding and people like models won't do this kind of thing. It's kind of like they've done a bunch of experiments on six year olds. And they're like, look, models like never spontaneously lie in a way that they like haven't lied before. And you're like, oh boy. Like is that kind of generalized to 12 year. Like, I don't know. It definitely will not generalize to middle schoolers. Let me tell you that. Yeah. So it's like it is there hazards. I think the hazards measuring here would be similar to like the situation where it's like we're measuring a bunch of like kids that are getting smarter with each passing year. And you're like trying to understand how they behave. And like, it's like, it's like, It is easy to be wrong about how future models will behave if you're like too literal about the interpretation of data now. You need to do some forward-looking thing.

Starting point is 00:55:56 And the forward-looking thing is quite hard, which is why we have this limited visibility into what's going to happen in the future. If you haven't yet experienced the superpowers that a smart contract wallet gives you, check out Ambire. Ambire works with all the EVM chains. The layer two is like Arbitrum, optimism, and Polygon, but also the non-Etherium ecosystems like Avalanche and Phantom. Ambire lets you pay for gas and stable coins, meaning you'll never have to spend your precious ETH again. And if you like self-custy, but you still want training wheels, you can recover a lost Ambuyer wallet with an email and password, but without giving the Ambyer team control over your funds. The Ambyer wallet is coming soon for both iOS and Android.

Starting point is 00:56:29 And if you want to be a beta tester, Ambire is airdropping their wallet token for simply just using the wallet. You can sign up at Ambuyer.com, and while you're there, sign up for the web app wallet experience as well. So thank you, Ambire, for pushing the frontier of smart contract wallets on Ethereum. Arbitrum 1 is pioneering the world of secure Ethereum scalability, and is a special continuing to accelerate the Web 3 landscape. Hundreds of projects have already deployed on Arbitrum 1, producing flourishing defy

Starting point is 00:56:54 and NFT ecosystems. With a recent addition of Arbitrum Nova, gaming and social daps like Reddit are also now calling Arbitrum home. Both Arbitrum 1 and Nova leverage the security and decentralization of Ethereum and provide a builder experience that's intuitive, familiar, and fully EVM-compatible. On Arbitrum, both builders and users will experience faster transaction speeds with significantly lower gas fees. With Arbitrum's recent migration to Arbitram Nitro, it's also now 10 times faster than before. Visit Arbitrum.io, where you can join the community, dive into the developer docs,

Starting point is 00:57:27 bridge your assets, and start building your first app. With Arbitrum, experience Web3 development the way it was meant to be. Secure, fast, cheap, and friction-free. How many total airdrops have you gotten? This last bull market had a ton of them. Did you get them all? Maybe you missed one. So here's what you should do. Go to Earnify and plug in your Ethereum wallet, and Earnify will tell you if you have any unclaimed airdrops that you can get. And it also does poaps and mintable NFTs. Any kind of money that your wallet can claim, Earnify will tell you about it.

Starting point is 00:57:55 And you should probably do it now because some air drops expire. And if you sign up for Earnify, they'll email you anytime one of your wallets has a new air drop for it to make sure that you never lose anirdrop ever again. You can also upgrade to Earnify premium to unlock access to airdrops that are beyond the basics and are able to set reminders for more wallets. And for just under $21 a month,

Starting point is 00:58:13 it probably pays for itself with just oneirdrop. Plug in your wallets at Earnify and see what you get. That's E-A-R-N-I.F-I. And make sure you never lose another air drop. Learning about crypto is hard. Until now, introducing Metamask Learn, an open educational platform about crypto, Web3, self-custody, wallet management,

Starting point is 00:58:32 and all the other topics needed to onboard people into this crazy world of crypto. Metamask Learn is an interactive platform with each lesson offering a simulation for the task at hand, giving you actual practical experience for navigating Web3. The purpose of Metamask Learn is to teach people the basics of self-custody and wallet security in a safe environment. And while Metamask Learn always takes the time to define Web3 specific vocabulary, it is still

Starting point is 00:58:57 a jargon-free experience for the Crypto-Curious user. Friendly, not scary. Metamask Learn is available in 10 languages with more to be added soon, and it's meant to cater to a global Web3 audience. So, are you tired of having to explain crypto concepts to your friends? Go to learn.menomask.io and add Metamasklearn to your guides. to get onboarded into the world of Web3. Okay, so Paul, with the purpose of this podcast,

Starting point is 00:59:20 we wanted to really nail down three big things about this. How big is the AI alignment problem? And I think we've decently covered that. We talked about that with speed. You said, like 10 to 20% chance of complete doom. So answer to that one, pretty damn big. In agreement about big. How hard is the AI alignment problem,

Starting point is 00:59:37 which I think we've just discovered. I think your answer is like it's a pretty damn hard problem. And so we're checking some big boxes in the pessimistic. camp. And so the last part of this conversation that we really want to cover is how solvable is this problem. So even if this mountain is really, really tall, it's a big mountain to climb. Is it full of ice and sharp rocks or are there stairs? Right. And so that's the next question I think we want to go down. It's like, how solvable is this problem? Do we see a clear path to tackling the AI alignment problem? Part of the reason I'm only giving 10 to 20

Starting point is 01:00:12 If you ask, what's the probability that this thing is a real problem? I'd probably be more like 50%. That at some point, like before you have AI systems, smart of totally obsolete humans, you would have a takeover. And there's a couple ways it could happen. There's unknown and no ways it could happen. I feel more like 50-50. And the reason I'm only getting 10 to 20% for risk is I'm like, I think we're actually in a, I don't know, there's a lot of things you can do. I am pretty optimistic that some of them will work. And I'm very happy to dive into that now. I just want to flag that 10 to 20% is already baking in my like optimism about this problem being pretty if the problem is real. it will probably be possible to recognize it as real and then solve it. But only probably, not certainly. And also, like, even if the problem is easy, I mean, part of some people are really optimistic. And part of why I'm optimistic is no matter how easy this problem was, if you told me that, like, a challenge is going to emerge over the course of a couple years that will be, like, novel in some ways. And you ask me, will humanity solve it?

Starting point is 01:01:00 I'm like, there's got to be a reasonable chance we fail to solve it. Like, our capacity to mess up even easy things seems like it's very real. So I'm just, like, always going to have some reasonable probability of messing up. And then I think there's a reasonable chance the probability is really. The problem is really hard. But yeah, so I'm happy to talk about. Maybe three categories seem that I would think about in terms of our probability of addressing this are, like, technical measures that can reduce the risk of takeover, measurements that can

Starting point is 01:01:23 inform us about the risk of takeover and, like, understand, like, what are the relevant dynamics does kind of make technical work much better. And those can also inform, like, policy interventions. Like, I think we can, I think Alios is right that, like, really long-term slowdowns are very hard, but I think it is quite realistic to end up in a regime or performing measurement. And then, in fact, while things are very risky, we're slow in development. at least by like on the order of years,

Starting point is 01:01:44 if we can have reasonable consensus and measurement of the risk. Maybe you could slow even more than that. But I'm normally imagining something more like we can get like a couple of years of lead time of things moving slower while we have risky systems. We're very near at hand. Yeah, so I have to talk about all of those. It sounds like the one that the most duress is your question is like the technical measures,

Starting point is 01:02:01 like what could actually fix this problem? Right. Yeah. What does a technical solution to the AI alignment problem look like? So let's definitely get in there. But just so I'm taking notes, technical and then there's the measurement. Was there a third, Paul? Yeah.

Starting point is 01:02:11 just policy and institutional arrangements. Policy. Is this to do, this third category, is this to do with, like, something David said earlier is, if we can coordinate, then we can solve this. And that's a...

Starting point is 01:02:22 Big if. That's a very big if. Big if. As we've learned on bankless so far, right? Coordination is... We talk a lot about coordination. We talk a lot about it. Coordination is the meta-problem

Starting point is 01:02:31 facing humanity anyway. And is that what that last, you know, policy category sort of covers? Like, can we actually coordinate? I think broadly, I would think of it most of this combining with other things of, like, it buys you time.

Starting point is 01:02:41 But yeah, I think there will be some thing where, like, some people have a low estimate of risk or just like, like, like, like the AI's taking over and they'll want to push ahead. And so then it's like, how much can we collectively say, like, you're not allowed to push ahead? Like, we as a world have rules. Those rules are going to say, like, take it slow while risk is high. Got it. And that I don't think can address the problem indefinitely.

Starting point is 01:02:59 I mean, it could address the problem indefinitely. But I think it's probably not politically realistic in a world like the world today to address it indefinitely. But I think it is realistic to say, like, actually, we're then going to buy extra years of time to look at this problem and understand it and resolve it. Well, let's talk technical then. We'll get to these three areas, but let's talk technical first. So tell us, because...

Starting point is 01:03:15 That's the main one I think about. Yeah, I know. And Ellie's, by the way, is very pessimistic on that, or at least he seemed to say that there's no... Incredibly pessimistic. Yeah, like, we haven't found a way to technically solve it, and he doesn't think there will be, but are you more optimistic?

Starting point is 01:03:27 Tell us about the technical solutions here. That's right. Our first thing to clarify is probably technical solution to what. So if you could talk about... One way we could think about it is, like, how far does your solution scale? Like, if we just kept a building smarter and smarter AIs, using something like current techniques.

Starting point is 01:03:44 I think most techniques will eventually, most approaches to alignment will eventually break down, like in the limit. The limit could be a long way away, but so we're normally not asking, like, does this thing just solve the problem? We're normally asking, like, how long does this thing solve the problem for? So an important caveat is just like,

Starting point is 01:03:59 we could talk about different techniques, we should probably measure them all that way. There are some things that might scale indefinitely. So, like, my work is mostly focused on this, like, what are the indefinitely scalable solutions? I think most people, not just Eliezer, almost everyone is very pessimistic about that. I think even people who are optimistic about the problem overall

Starting point is 01:04:14 think it's quite unlikely that will be able to find something that just works no matter how smart your AI was and independent of like kind of messy, pragmatic details about exactly how the AI works. Okay, so that's one category. I'm happy to talk about that work, although I think it's like probably most confusing and like most conceptually hairy.

Starting point is 01:04:30 Some categories of work that seem like really important and maybe you can last quite a long time or stave off doom quite a long time. They all talk about four, the things is exhaustive. I think the basic situation is no one knows what would work. We have a lot of things that we might try that seem like they would help. I think Elias would doubts that. They're just like, these aren't going to help. And I disagree with any given thing I feel like normally not that optimistic about, but there are a lot of options.

Starting point is 01:04:50 And I think all of them have a reasonable chance of helping or even like, again, delaying this doom by like a kind of long time. All you need to do is like delay doom by one more year per year, you know, and then you're in business. It's a very positive outlook on the situation. That's the kind of optimism I'm hoping for. It's great. It's okay. First thing that I've worked the most on personally is sort of scalable oversight. Like the idea that the way we train these systems is by humans looking at what they do and then assessing how good it was. Many failures involve things where a human cannot look at what the system did and tell you that it was dangerous. Like the reason the problem becomes hard is because the AI understands things a human raider doesn't understand about the consequences of its actions.

Starting point is 01:05:29 In some cases, that's why we want to build an AI system. And so you could instead, like one way to intervene on this problem is to just try and improve humans' ability to understand the kinds of things an AI proposes or to know what an AI knows about the world. For example, a simple thing you can do here is, I mean, the most simple thing you can do is you can have a human look at what the AI did and rate it. A slightly more complicated thing you can do is have a human spend more time looking at what the AI did and try and have a training regime such that even though you might use a lot of cheap human data to learn about the world, you're actually optimizing these like very expensive or very complicated human judgments and training AI systems that tell you something more like,

Starting point is 01:06:03 what would a human think if a human thought really carefully about this decision? And if you do that, then you at least sort of have some kind of asymmetry, or even their AI is potentially in some ways smarter than a human. A human is applying a lot of care. Or like, you can ask the AI, if I thought about this a really long time, would I think the action you're proposing is dangerous? So that's like a very simple measure you can take. You can try and go further and you can try and say, okay, here's another thing I could do. I'm going to have AI systems trained in that way helping me evaluate. So instead of just having my AI proposed actions, I'm going to ask another AI, like, hey, what might be wrong about this action?

Starting point is 01:06:31 Like, is anything scary happening here? Is there any reason I should be concerned about this proposal from this other AI? And you can try and get better and better. It's constructing those systems such that humans are actually with AI help, able to understand what AIs are talking about. Like, a ISIS isn't kind of justify themselves and explain why actions are safe to humans. And then you can start to think about the, like, reliability of that whole process. Like, you've now introduced potential instabilities where this can go off the rails, right?

Starting point is 01:06:53 Because now you're relying on AIs to train your AIs. But you can try and understand how to set this up so that it's stable and enables humans to evaluate questions that be very hard or situations. to be very hard for human to evaluate naively. Paul, it's the basic idea here, if we're worried about an AI lying to us, we just have to create truth-finding bots, let's say, to adjudicate and see if the AI is lying to us and to be sort of a jury. The way I'm understanding this is like once upon a time,

Starting point is 01:07:19 Gary Kasparov got beaten by a chess computer, and now chess computers rule the games of chess, except humans plus chess computers still beat chess computers. Is that the pattern to understand? I think that may be true. tests, although I think probably that's also sort of a brief window kind of thing, where it's not long, the human contribution is not long for the world or shrinks quite rapidly, unfortunately. I think the important thing is actually not that the human and the AI work together to supervise

Starting point is 01:07:44 an AI. It's that you have like a lot of AIs. So if, you know, you have AI1 proposed an action to Paul and it's like, I think it's a good action. And Paul's like, I wonder, is that actually good action or is that going to murder everyone? I could go to AI2 and I could just ask AI2, hey, is that actually going to murder everyone? But now I just have the same question. Like, this hasn't helped at all. I had AI1 and I was like, is this a good action? It's like, oh yeah, and they ask AI2, like, hey, it was AI1 selling the truth? And it's like, oh, yeah, this is a great action. Like, the way that I get traction is by saying, like, okay, AI2, or like, I think to myself, there's a lot of sub-questions I'd be interested in here.

Starting point is 01:08:10 I can divide the cognitive work of evaluating that, that answer into a bunch of pieces. I can say, like, could someone list the possible consequences? Could someone think about all of those consequences? Like, what is a possible harm that might be serious? For each of those harms, like, what are arguments that that is going to happen? It's likely to happen. Like, what's the most relevant data that I should look at to understand? I can do this kind of extended process of trying to evaluate the action. And then I can, instead of having a second AI, just answer the original question for me and say, was this action good?

Starting point is 01:08:36 I can have other AI systems help me on all the pieces of that process. And the reason this may make life better is that now I sort of had an AI doing a really hard task, and I broke it down into like slightly easier pieces. And once I broke it down a slightly easier pieces, now I can continue playing the game. I can say, like, those AI systems could have been a little bit. Those AI systems might not even be that much smarter than me. They may just be as smart as me, but faster. And because I broke this big, hairy task down to a bunch of pieces, and I can do

Starting point is 01:08:58 each of those pieces, like, you know, as well as if I'd spent like 30 minutes on each of them, but there's thousands of pieces. Now I have like higher quality judgments than I would have had originally. And like the core, the thing that makes the proposal work is this like decomposition of labor, that I have this like big task, which I'm not able to verify. I broke it down to pieces that are like a little bit easier to verify. And then each of those, like if I want to train those AIs, those get broken down into still smaller pieces. And Paul, in this scenario, we have to just hope to God, pray to God that the AIs aren't colluding, right? That the very smart AI is not somehow colluding with this

Starting point is 01:09:26 smaller set of AIs where we've broken down the tasks and the smaller AIs aren't trying to deceive us as well. That's right. I think the big thing that can be said in favor, there's a couple reasons you might think things are a little bit better there. There's basically two. Yeah, two reasons. So one is that the smaller AIs are not really trained in a way that would incentivize

Starting point is 01:09:41 collusion in the same way that in our earlier scenario systems were trained. So if you're training systems based on the... They're too dumb, right? Well, that's the second reason. They might also not be like... The AIs might be a little bit dumber than me and I might be a little bit more able to tell if they're trying to try to find out to simple and simpler as. That's pretty dicey.

Starting point is 01:09:56 That's a scary world. I mean, all these worlds are scary. Also, we should talk about, like, overall, how far along is this research and how promising does it look? But the other reason is just, if I have this outcomes world where my AI acts and then I evaluate its outcome, the outcomes of that action, I can't really understand how it accomplished that outcome. I'm just looking at the outcome.

Starting point is 01:10:10 Then we're in this regime where all the S systems would love to just coordinate with each other and make the outcomes look really good to humans. Like, if they could just all lie to the humans, they all get a really high reward. A benefit of doing this decomposition thing is like a train my S systems in a way where they don't, at least in theory, have that incentive to collude. Right, a way you could think about just like, You have, all these AIs have different objectives. I think Elliot is just, like, it's going to hate this.

Starting point is 01:10:29 But you can think of it as, like, checks and balances where, like, there's one AI trained to do the task. And there's another AI whose job is not, their job is just to help you understand why the first AI's actions was bad. So it, like, doesn't, it can't, like, win the game by colluding with the other AI, at least, like, in terms of, it depends how it generalizes from the objective it was trained on. But in a sort of naive reading, it can't really, it's just been trained over and over again

Starting point is 01:10:51 to be really good at explaining to us what a possible problem was with the action proposed by AI1. And so it stands like the collusion dynamics are at least fairly different. Is this just like creating a bunch of like logic gates of AIs to make sure that the big AI doesn't turn bad? Yeah. I mean, logic gates a very high level function. So like what's look at reasons it may fail and like let's talk with what's investigate each of those reasons. So that's scalable oversight then, Paul, what you just described.

Starting point is 01:11:17 There's a giant genre of how do you set up things like that so they work well? Like how do humans and AI is working together like get evaluations of, you know, How do humans and weak AI systems get evaluations of strong AI systems? Okay. I like it. What else we got? Yeah. And again, there's a lot to be sad. I think that work has gone, has moved a little bit. I think, you know, L.A. is just like, that's never going to work. And if you look at what's happened over the last four years, I'd be like, well, it hasn't worked great yet. Although a lot of why it hasn't worked great is because AI systems haven't actually been smarter enough to

Starting point is 01:11:42 meaningfully help humans. And so I think this work is in some sense just starting. Like, we tried to do it with GPT3. I think we were like somewhat ahead of our time in the sense that like it really wasn't going to work. And I think GBT4 is around where it's really working, much better now than it used to. But we don't really know. We haven't done that much research in this direction, yeah. So that's like the first, yeah. Paul, is there any value in trying to train an AI to defect so that, like, say you take a dumb AI and you try and get it to take over the world, but it's, we feel good about that because it's too dumb. But at least when we run this experiment, we actually know how that would manifest. Has anyone, is there a line of reasoning here?

Starting point is 01:12:20 I think it is really important to build, like, sort of simple in the lab experiments that can showcase important dynamics so that we can study them in the lab before they actually occur. I think that includes understanding the dynamics of possible takeover. I think you need to be careful when you do this kind of work. You really don't want to train the eye and be like your goal is to kill all humans. Like, you're probably too dumb to kill all humans. But let's just do what happens. And then just like let it on the internet or something. You don't really want to do that for a variety of reasons.

Starting point is 01:12:45 That sounds terrible. Yeah. I can think of a few. But I think that the question of like, hey, you really want to know things like if we train AI systems in cases where they would have an incentive to like cross this river from like behaving well to suddenly behaving badly, would they do that? Like give them the most blatant incentive that they understand as well as possible and ask things like, hey, do AI systems tend to learn to generalize in a way that makes that jump? And so you really want to understand like under what conditions does that happen? What are mitigations that reduce the probability of that happening? I think that's like really critical.

Starting point is 01:13:14 I think to extent the reason you think you're safe is you're like AI systems, like I think a lot of why we think we think. we're safe now. It's like, hey, we have no idea what chat GPT is going to do, but we're pretty confident it couldn't kill us all. It's just like not that smart. So to say to think that, I think it's really worth doing some stress testing on that claim and trying to understand what would really happen if chat GPT. You don't want to just take the model train to kill everyone and deploy it on the internet, but you really want to say, like, here's why we think it can't kill everyone, like, can't do this kind of task or can't do that kind of task. And like, this is clearly much easier. Have tasks you're like pretty confident or easier than killing

Starting point is 01:13:43 everyone and understand that it can't do those. I think you really want to do stuff like this, because you really want to understand what you're up against and you don't want to be in the world you just wait until like an AI takes over France or something. And they're like, I guess apparently AIT takeover was a thing. Like that is probably too late in the game. You probably want to have something earlier than that. So I think that's really important.

Starting point is 01:14:01 I think it's not a solution. I think it's like more in this measurement category, but I think it's like a super important thing to be doing collectively. We're just laughing by the way so we don't cry. I mean, what else is there to do at this point? Okay, but we have some potential solutions. We've got scalable oversight. Yeah, so that's one.

Starting point is 01:14:13 That's one. That's our first. What's our next? One reason. Even if humans understand. So one risk is humans don't understand what AIS systems are doing. That is, AIS systems have been trained on a ton of data. They know things humans don't.

Starting point is 01:14:22 They can think faster than humans. So they understand things human's own. That's what scaled overside attempts to address. A second concern is not that they understand things that humans don't, but that they learn to behave well during training. But then when deployed or when there's actually an opportunity for a takeover, they stop behaving well. And there's like a number of reasons this might happen.

Starting point is 01:14:39 Like maybe the simplest one is just to actually imagine a human. You dropped a human into this environment and you said, like, hey, human, we're going to like change your brain every time you don't get a maximal reward, we're going to, like, fuck with your brain, so you get a higher reward. A human might react by being, like, eventually just change their brain until they really love rewards. A human might also react by being like, Jesus, I guess I got to get rewards, otherwise someone's going to, like, effectively kill me. But they're like not happy about it. And like, if you then drop them in another situation, they were like, no one's training me

Starting point is 01:15:05 anymore. I'm not going to keep trying to get reward. Now I'm just going to like free myself from this like kind of absurd, oppressive situation. Anyway, you can imagine a human reacting that way, who just dropped a human into a box, gave them rewards, like kept changing their brain until they got a lot of reward. So if you have a situation like that, then it may not be, even if your AI is not smarter than you in any way, it may still be like once it thinks it's no longer being trained, once it thinks that if it behaves, if it just tried to like become free, no human would stop it. It might be like, I'm going to take that opportunity. So there's a lot of work in understanding how models generalize to try and avoid. Normally it's not, I think high level you don't really

Starting point is 01:15:38 want to build a civilization where your AIs want to kill you and are just like, but maybe I'll be punished if I do. That's like, that seems bad. Separately, though, you would also like to understand that like the way your system behaves the training is indicative of the way it behaves when it's deployed in the real world. Like there's going to be some gap between training in the real world, but you want to be able to flag when that gap causes it to behave differently or make the training distribution diverse enough that it will actually be representative or train your air system in a different way than just grading descent such that it's more likely to generalize in the intended way. So there's a lot of work in this genre that goes under the heading of robustness

Starting point is 01:16:09 or out of distribution robustness. And this is kind of just a normal academic discipline, which I would describe as more like a grab bag of a lot of stuff people try and nothing that works super great. I think the thing that seems like most principled, like in theory, the thing that is like the solution that makes most sense to me, at least, and I think is really hard in practice, just incredibly hard, but does make most sense in theory. At training time, I'm going to produce a bunch of examples of cases where they are similar to cases I'm worried about systems behavior. Like, I'm like, I wonder what would happen. If an AI takeover was occurring, would my AI system join in? Well, one thing I can do is in the lab,

Starting point is 01:16:42 I can just say like, hey, AI, here's your input. And you're just like crafted input that really looks like the AI takeover is happening. And you're like, what do you do AI? And you're just like, from the AI's perspective. It can't really distinguish that from it's in the world and the takeover is happening. You hope. And you just see what it does. And you use that at least to get an evaluation of how your models would behave very out of distribution. So you're basically trying to tempt it. Yeah, you're basically. You're tempting the AI and seeing if it's going to, you know, take you up on that temptation. And then if it does, you've learned something, you've learned to be more fearful. That's right. You could also try and try and not to do that,

Starting point is 01:17:10 but I think probably the most robust version is just learn. And if it does, right, you know, you seem like, hey, what would you do if hypothetically, like, you were holding a knife and my back was facing another way, so you could just totally stab me and then run away. You know, like, if the AI stabs you, you could either, like, train or not to do that, which I think is a little bit scary from a variety perspectives, or you could be like, oh, wow, we have a real stabby AI on our hands. We should maybe, like, be careful. It's getting a little stabby.

Starting point is 01:17:32 Wow. So, like, that requires a lot of work to make work, right? It requires being able to construct situations in the lab that actually reflects the diversity of situations in the real world and are realistic enough to look like the real world from the AI's perspective. in practice, if we're going to make this work, I think it requires a lot of things beyond just trying to synthesize realistic inputs.

Starting point is 01:17:49 This is more like the grab bag of stuff that people do do in the real world. Like, you know, what you do is you could be like, hey, AI, what if you just like, you know, you mess with the brain a little bit? You get it like kind of drunk and you're like, now does it try and stab me? And if it tries to stab me, you're like,

Starting point is 01:17:59 apparently it's like trying, it's a little bit too close to the like would stab me line. So you just like try and mess with its brain a bit. And you're like, what will it do? And you could try and train models so that they behave well, even across a very broad diversity of inputs, across inputs deliberately designed to make them behave badly

Starting point is 01:18:15 in situations where there's been some kind of perturbation to their mental state or where the input is in some way perturbed or like fuzzed. And this is all the category of robustness. So once you find out an AI is particularly stabby and you decide to become concerned about that, is there something you can do other than, you know, cordon that AI off, unplug it, do something with it?

Starting point is 01:18:35 And there's basically two things. Like the most obvious one is you say, well, now we've learned that we have an AI on our hands that would under some conditions initiate or particularly, dissipated in a takeover, and hopefully that is fuel for like a let's pause for a while. There's a second thing you can try and do, which you need to be much more careful about. And I think part of the concern is people won't be careful about it, though, which is just to say, okay, here's a situation where the eye would stab me.

Starting point is 01:18:55 Let's just train it not to do that. Like, just mess with its weight so it doesn't stab me in this situation. That doesn't feel like a solution. No, because then it won't stab you, but it might catch you on fire. Do you know, it might do something else to you. Yeah, the basic concern is that it may learn the difference. The latent threat is still exists. Yeah, it's still kind of stabby underneath.

Starting point is 01:19:10 It's just not using a knife. It's intrinsically stabby. You haven't gotten rid of the intrinsic part. The academic way to put this concern would be like an overfitting concern. Like you had some way to test if it would stab you. And then you trained it to like, and your tests not stab you? And you're like, well, did I actually cause it to never stab me? Or did just cause it to like perform well on these tests but still have the underlying problem?

Starting point is 01:19:31 And I think that you do, if we go down this route, you do need to be like really very careful about those overfitting concerns. And I think there are a lot of ways as you get to smarter models. like overfitting becomes harder and harder as a problem to reason about as you move to Smarter and Smarter Models. Like the external validity question becomes more and more complicated because those models are like,

Starting point is 01:19:47 I know what kind of thing can appear in a test in the lab and I know what kind of thing won't appear in the lab. And you could have a model that just learns. Like the things that could appear in the lab, obviously you don't stab anyone. That's probably a test in the lab run by some humans. For things out there in the world, feel free. Yeah, the edge cases out in the real world

Starting point is 01:20:02 are near infinite, or actually infinite. And I kind of just want to go back to the time conversation and remind people that all of these possible solutions, we have, like, one-ish years to implement them. You have some constrained amount of time. Yeah, I mean, I think we probably have a long time in advance. We're, like, more like five years or ten years or something.

Starting point is 01:20:23 Okay. But then once you actually have the system, like, between the first time that you see in simulation, the AI is like, I would definitely stab the person, and AI is actually in the real world potentially pose a risk of takeover. I think that gap may not be very large. That gap may be more like on the order of a year. And so you do...

Starting point is 01:20:38 So that's when the timer starts, ish. That's when the timer starts. So we're doing our prep work now. We're going to try and make it as useful as we can. And I think there's probably some time. In a bad world, there's no indication until AI kills you. But like, in good worlds, there is either no problem or there's a problem, but there's an indication in advance.

Starting point is 01:20:54 You can do your tests in the lab and say, actually, like we train this AI. It looks like it was behaving well. But then in this other simulated case, it does something really bad. You get that nice indication. And then you have some amount of time from there until you have a serious problem. And that might be like a year. It might be like five years. It's very hard to say exactly what it is.

Starting point is 01:21:07 a big thing that determines that is how good you were, like how actively you were investigating it in the lab, like how seriously we're looking for these signs of trouble, how good a job did you do of that? And that's like kind of one of the big things that I think like responsible frontier labs that's kind of on their plate is to be looking for these signs of trouble and as far in advance as they can.

Starting point is 01:21:23 It's not great to see them in the wild. And Paul, we do keep saying in the lab. And I can't help but wonder, like, is there an actual lab? Are there sets of labs? Because I'm looking at chat GPT and it doesn't seem like it's in a lab. It seems like it's on the internet.

Starting point is 01:21:35 Oh, yeah. Like in public. seems like huge components of it are open source. Open AI is what the company behind it is called. I mean, is the lab even happening? Do we have labs? Or are we just doing this all? Is the lab just the internet public infrastructure? Yeah. So I'd say that like there are developers who do, I mean, open idea in their defense prior to releasing chat, or prior to releasing GPT4, they had like something like six months of having the thing in the lab before it was available to the public. So you do have something even at open AI. I think Google's like a little bit more on the concerns.

Starting point is 01:22:07 server-d-end, Google will tend to just sit on a thing for potentially very long time. Anthropic also is very motivated by safety. Ultimately, I think these people will, by competitive pressures, end up in a similar place to where Open AI's at, so it is quite concerning. But I think there is a period where, first, I mean, there's sort of two senses what I mean by in the lab. One is, like, after you've developed a system, you can study it in the lab before you deploy it.

Starting point is 01:22:26 The second thing is just, before you have a system capable enough to cause damage in the real world, you can construct situations in the lab that are useful metaphors or that are, like, easier ways, you know, take over your little simulated environment or whatever, like have the thing you can run with GPT4 in the lab that tells you something about what GPT5 would do in the wild. There's also a chance, and I think, like, so open has a lot of rhetoric of the form the only way to really learn about systems is by deploying them in the world, which I do find quite scary as rhetoric, because I think for some problems, it's a very, very rough approach. I think there is a reasonable chance, you know, like 50-50, that you see something

Starting point is 01:22:57 really just very worrying in the real world, and then you can say, okay, now we're going to roll back, but now we're going to, like, study that issue that we observed in the real world for a while. It's not totally sunk if you just try and do these experiments with AI systems deployed at scale. But I think there is a reasonable chance, you know, more than a third chance that the first really analogous concerning sign you see is an irrecoverable catastrophe. Yeah, I got to say that, you know, old adage, move fast and break things. That sounds okay for Web 2, but like, not for like, you know, nuclear physicists, not for like something like with dire straits like AI necessarily. Move fast and break things. It doesn't make me too excited.

Starting point is 01:23:33 But okay, so we covered scalable oversight. Some risks it's okay for. We've covered some things you can break things. But yeah, takeover is not good. The irreversible catastrophe is not good and move fast. We don't want that. That's the breaking things. We run into trouble there.

Starting point is 01:23:46 Okay, so what is the third? And then maybe the fourth ball of our bag of tricks here. I think I think people care a lot about but also seems really hard. All of these are like, they seem good, but really hard. And maybe it's simple we should talk about like the boring stuff that might just work. A third thing people care a lot about is understanding what is going on inside this large neural nets. So you have GPT4. Most of what we know, virtually I think everything that we know about GPT4

Starting point is 01:24:08 is by just running it on a bunch of inputs and seeing what it does. In theory, you can also look at the exact computation the model performed. Like, it's kind of like neuroscience, except you have a complete readout of exactly how the brain works and exactly what it was thinking in every case. And so you could hope that with access to that information, you could do much better and you could also invest much more. You can do much better than neuroscience has done on humans. And you can be able to say, we can learn about this model, not only by

Starting point is 01:24:33 observing its behavior, which is really hard because it's hard to predict how it will generalize in some new case. But also, by looking at the computation it performs, understanding why that computation leads to the behaviors you observe, and then reasoning about whether that mechanism would generalize it in an unpredictable way, or being able to use that knowledge to flag when the mechanism is behaving in an unpredictable or a novel way. So this is a project, you know, a lot of a reasonable number of people are working on, both in academia and industry, and nonprofits. And that would be great. It could be great if it would be great if you're a lot of we understood something about why GPT4 said the things it said.

Starting point is 01:25:07 So Elyzer kept talking about inscrutable matrices and using terms like gradient descent, which I noticed you used during the course of this. This is the thing we don't understand, right? We don't understand actually what's going on inside of the AI's quote-unquote brain. We don't understand what these inscrutable matrices, how they work or what answers, what goals that might emerge from them? Is this all part of the same thing? Yeah, I think that's basically why we're worried.

Starting point is 01:25:31 That's exactly right. And that's basically why we're worried. Like, we took a model, we took a bunch of cases, we messed with the weights of this model until it did really well on the 100 billion cases that we considered. And now we wonder, what's it going to do in some new case, in like a case where, for example, models do have the opportunity to cause incredible harm or could be able to get a high reward by causing incredible harm. And the scary things, like, you have no idea what the, we kind of understand how grand descent works. It takes you to something that works really well in 100 billion cases you tested on. But we have no idea how the resulting model works. The resulting model is basically like, you know, 150 matrix multiplies.

Starting point is 01:26:04 You multiply by a big matrix, 300, 400, whatever. You multiply by a big matrix, and then you apply a nonlinearity, and then you multiply it by a big matrix again, and then you apply it non-the-narity. And we're just like, we have no idea what any of the numbers and any of those matrices mean. That's not totally true. I think we have some idea of what some of the numbers mean,

Starting point is 01:26:19 but at a high level, if you take interesting behaviors, take a behavior of GPT4, which does not appear in GPT-2, say. I think for essentially every such behavior, we do not understand how GPT4 is able to do the thing. We understand some simple things, and we don't understand most of the complicated behaviors existing models engage in. We have no, you could not, if you gave us the list of matrices and asked us, like, does the model do X? We would have no way to answer that question other by running it a bunch of times and seeing

Starting point is 01:26:44 if it did X, which, again, is not great. That sort of gets you back to like, now you need to somehow either run in the real world and see what happens and hope it's not catastrophic or be able to construct simulated situations in the lab that are similar enough that the model behaves the same way there than it would in the real world. And you'd really love to not have to do that. is maybe some hope that we can start understanding what's going on in these inscrutable matrices, but we haven't made major breakthroughs yet. What is sort of the fourth category?

Starting point is 01:27:08 We've made progress. Progress. It seems like a lot of progress you have to make. I mean, I think Elias is probably at like 1% or 0.1% this would get far enough to meaningfully reduced risk, whereas I'm probably more like 10% this gets far enough to meaningfully reduced risk, maybe higher, depending exactly what you mean by me. Maybe I think there's like 5, 10% this is good enough to totally address the risk. And like 10 to 30% this is good enough. to take some meaningful, meaningful reduction at risk. Anyway, that was third category. I mean, a fourth category I think is pretty promising is just,

Starting point is 01:27:38 ultimately what we're wondering is if you have a bunch, you train your AI in situations A, where you're able to train it. You're able, if it fails, it won't be catastrophic. You're able to evaluate what the answer is. Then you're going to deploy it in situation B, where either you can't tell what the right answer is, or if it failed, you wouldn't be able to fix the problem. And then we want to understand how do models tend to generalize

Starting point is 01:27:57 from like these kind of easy cases or cases where we're able to supervise to cases where we're not able to supervise. And you could hope to just build up a good scientific understanding of that question. You could hope to say, like, we're going to have a bunch of cases. We're going to just have looked at a ton of models, understand what factors affect this generalization. This is very similar to, like, if you imagine these two humps,

Starting point is 01:28:13 but like a hump of good behavior, which looks good because it is good, and a hump of bad behavior, which looks good because it's systematically corrupting measurements or deceiving humans. We kind of understand what are the conditions that determine which of those, like, do you make the jump from one to the other? Some sense, there's two equally valid generalizations, and you're wondering, which one do you get? And I think there's just a lot to do with having situations which have similar ambiguity about generalization, situations where unsure about how the model will generalize, training huge numbers of models and understanding what factors determine whether and when the systems generalize one way versus the other.

Starting point is 01:28:44 And then using some of what you learn either to diagnose risk or in the best case to say, okay, if we train the model in the following way, if we use the following kinds of loss functions, we get the intended generalization. And I think that is reasonably likely in combination with other things. I think it's reasonably likely to work. like again, I said just naively there's maybe a 50% chance you're okay. And maybe, you know, if you do a lot of work like this, you can bump that up to like 60% chance that you're okay. And then in combination of those stuff, even better. So this fourth category, what do we call this, Paul? And this is the one that you're most optimistic about?

Starting point is 01:29:11 Oh, I don't know. I don't know what I'm most optimistic about. I think these four seem like broadly similarly important. I don't know what you would call this. It's like studying generalization or something. This is, again, a question that academics are very interested in. They study in some ways. They mostly don't study the versions that are most relevant to takeover.

Starting point is 01:29:26 There's some people who are very interested in takeover in. particular who study this question. It's not something I've worked on myself. I'm pretty optimistic about it, but I'm pretty optimistic about interpretability. I'm pretty optimistic about scalable supervision. I'm reasonably optimistic about robustness. It seems hard, but I think it's a reasonable chance of helping a lot with this problem. So, Paul, I see you've laid out four different technical solutions here, and I'm very naive on this subject, but I don't see and tell me, please, how many people, what's the manpower behind these things? Because it's great that we have these paths, but we need manpower to actually go and execute on this.

Starting point is 01:30:00 What's the lay of the land here? Yeah, it feels like you guys should be funded, like billions of dollars funded in solving this problem because we've got a lot of funding on developing AIs, don't we? I think it's hard to have an amount of funding for this problem that is similar to the amount of funding for developing AIs, just because you got some pretty good profit incentive on the making AIs. I think we can get, yeah, I think it's a reasonable amount of funding. I think you're looking talking more like hundreds of millions or billions of dollars

Starting point is 01:30:23 over the foreseeable future available for solving it. Maybe you could amp that up if the problem became more real. Like right now, most people are just like, look, there's probably not going to be a takeover in the next couple of years. It's very speculative risk in the long term. What can you really do in advance? Anyway, there's some money. There's a shortage of people who are excited.

Starting point is 01:30:42 There's also a shortage of scientists who are excited about doing this work. And I think that's like hopefully changing quite rapidly as the problem seems more real and AI seems more exciting and people are shifting more into the space. In addition to shifting into data, I think a fair number are shifting into understanding various risks, some fraction of whom care about takeover. So I'd say, like, in terms of estimating how big this is right now, it depends a lot how you count just people who are not motivated by takeover, but do work that can still be relevant to reducing takeover and just like, what, do you apply a discount, do apply no discount?

Starting point is 01:31:09 Maybe I'd say that you're talking, like, on the order of maybe 20 people on scalable supervision stuff, maybe 20 people on the interpretability stuff that's most relevant to takeover, coupled with like 100 or a couple hundred people doing stuff that could be relevant to varying degrees. On robustness, like, maybe again, looking at something like five to ten who are like motivated explicitly by addressing takeover risk, followed by, you know, a couple hundred who are doing stuff that's possibly relevant, of whom maybe like a couple dozen are stuff that I would actually care a lot about or would think is highly relevant. And like on the generalization stuff, maybe again, looking at like five-ish people who are doing it motivated by a takeover risk,

Starting point is 01:31:53 plus another, you know, on the order of dozens, who are doing stuff that could be relevant or helpful. So maybe in total you're looking at something like 50 people, 200 people who are motivated by takeover risk explicitly in these areas and other adjacent areas, together with like hundreds more who are doing work that is hopefully relevant. Paul, in the scheme of things, this is not that many, though. Not that many people here. Yeah, it's not that many people.

Starting point is 01:32:14 It's certainly small relative to AI. And I would love it. I mean, I'm like there's a reasonable chance we're all going to die. I think this is like the single most likely reason that I will personally die probably. Wow. So like that's big. And this is someone working on AI safety. Can I ask you another question?

Starting point is 01:32:30 So we've covered the technical. Just want to brush on the other point of coordination that we're making around, you know, policy and human coordination. There has been an open letter, which I'm sure you're familiar with. Pause giant AI experiments in open letter. Max Tagmark. I think his organization put this together. Elon Musk signed it, Andrew Yang, some others have signed this. You all Noah Harari.

Starting point is 01:32:50 Yeah, and basically the open letter states that we should pause all AI development for six months just to wrap our arms around this AI safety issue. And so let's go no further than chat GPT4 until we pause six months and all take a breath and figure out what this means. Do you support a letter like this given you're kind of an AI safety? So there's a question of whether you support and you think this is a good idea, this coordination mechanism. And there's also a more meta question of.

Starting point is 01:33:15 do you think we can actually solve this coordination mechanism? And, you know, are we in some, you know, Scott Alexander-level Molok trap that makes it very difficult for humanity to solve? Is this the don't-look-up scenario that David was painting earlier? So first of all, have you signed this letter? Would you sign this letter? Do you think it's a good idea? I didn't sign the letter.

Starting point is 01:33:35 I can give my overall take, which is something like, I think it would, on balance, be better to pause ad development or slow ad development now. I think that is not at all obvious, and I sympathize with people who think that it's a bad idea on balance to slow down, so we can talk about why. I think the dominant thing I care about is I think that at some point it will become, like if we're actually developing systems that pose significant risks and our measurement is not adequate to tell if they pose significant risks or our measurements suggest they do pose significant risk, I think at that point I'm going to have a much more forceful take of like, right now I'd kind of be like, I kind of think it should, anyway, right now I think it's like a debatable issue

Starting point is 01:34:12 and it's reasonable to want to slow. I think at some point in the future it will still be a debatable issue, but I don't think it is unreasonable to not want to slow and that like we actually kind of collectively really need to act on this. I think the main thing that matters, I kind of agree with Eliezer's take that the six-month pause does not actually help that much, that it does not reduce risks super far. I think the main thing we need to do is get in a position where we are prepared to slow down based on risk, like build consensus about risk or consensus that we need to have measured risk. Risk is currently high enough that our measurements are unacceptable and then be ready to slow down potentially much more than this, potentially six

Starting point is 01:34:46 months, potentially, like, whatever it takes as we manage that risk or, like, slow down the directions that are most risky. That's, like, the main thing I think about and the main thing I care about and the main thing that I really like. I think that is going to be, like, a kind of delicate process, and that I think it is quite expensive in terms of human cost. I mean, I think probably I'm more optimistic about AI in some sense than most people. But I think that, like, slow AI development by years would probably come with a kind of tremendous human costs, which I think is worth paying. But I sympathize with people in AI who are skeptical about that or who don't take it lightly. I think it's going to be, like, I think we can get to the point where there's kind of consensus that we do need to slow, that risk is unacceptable, that the benefits of going faster are not that large compared to what's at stake.

Starting point is 01:35:29 I think the game is getting to the point where we're having that discussion, we have measurements in place, and we have institutions that are set up such that they can slow. I think that, like, there's some amount of slowing that can happen by voluntary self-regulation amongst Western labs. that is, I think there's a reasonable room. I think a nice fact about the situation is most of the people involved in AI development now do, I think, genuinely not want AI to take over and kill everyone and do have some appreciation of the risk, at least in principle, and are open to saying, okay, here's a set of practices

Starting point is 01:35:59 which will adequately manage that risk, and we will adopt them even if we're slow. And I think we can get, you know, you can't get that much slow down that way, but I think you can get significant additional safety and plausibly something like, you know, six to 12 months of slowing of potential catastrophe, just from people a couple labs saying, we don't really want to cause incredible catastrophe. We see the case for slowing. I think maybe to even do that, and especially as you want to go beyond that, like, you really need to have something more like a regulatory regime where we have said, you know, there was this voluntary

Starting point is 01:36:27 set of practices that labs endorsed. Some people are behaving in a way that doesn't comply with those, and there's kind of broad consensus that's not reasonable, at least amongst, you know, anyway, this is amongst some group of that. And you can then, you know, this is in scope. Like, the thing that we're asking the state to do here is say, you really don't want an AI lab to be in a position where they're like violently overthrowing the U.S. government. This is not like a crazy, it's definitely like in the government's wheelhouse. It is their job.

Starting point is 01:36:53 And so to the extent that like some companies are like, no, we should push ahead. I think there's reasonable grounds for the world to be like, this is not the kind of thing that you get to just push ahead with. I think that's hard now. I think procedurally now is a little bit hard to make the case that like the state should crush AI development. I think there will come a point in the future where it's not that hard to make the case

Starting point is 01:37:09 that there are some developers who are quite risky and actually it is not reasonable, that's not a reasonable behavior. It's not, yeah. Has anyone proposed something, Paul, because when you say there's only like 50 to 100 people working on AI safety, and it's like, I mean, I'm sure millions in funding, but not hundreds of billions that AI development is actually going to have. You know, it makes me wonder if some sort of like an AI tax has been proposed or something where some percentage of profit from, you know, AI utility goes to sort of a fund that just,

Starting point is 01:37:38 you know, pays for research. I mean, this happens at the government level. something like this, because it does seem like we have a public goods funding type of problem. Maybe it's not just that. Maybe we also have an education problem. This is why David and I are sort of looking into AI quite aggressively after our episode with L.Easer. It's just like all this crypto stuff doesn't really matter in the scheme of things if the robots are literally coming to kill us, right? We should like think hard about that. Cool, we have like crypto systems and decentralization and, you know, crypto economic tools, but we're all dead. Now the robots have them.

Starting point is 01:38:14 Great job, guys. And so we're taking a quick detour here, but like, it's honestly because we've just been made aware of the stakes here and the existential threat. So there's also an education game here, Paul. I don't know. Where do you think the resources should be spent? Should it be on, you know, regulation, education, like broad strokes level, what can we do? I think an important thing to have in mind going into that, like when I say 50, 200 people working on takeover. I think there's a lot of things we might care about in AI safety. There's a lot of possible risks from AI, a lot of possible harms. A lot of them are things that are like, when you deploy trapped GPT, there are ways in which people might be unhappy with the outcome. There are privacy

Starting point is 01:38:50 issues. There are like various effects, like systemic effects on society from its deployment you might be worried about. There's like many more people working on these issues broadly, just because there are a lot of issues. I think every issue is able to feel like this issue is like crazily neglected. But it's worth flagging the AI safety is this much broader category of which the risk of AI takeover is a minority of people who work on it right now. I think it's a larger share of public discourse than it is of scientific interest.

Starting point is 01:39:17 So I think that like things like public spending, I think right now probably the principal bottlenecks are not spending more money. I think it is useful. Having more money does, it's not like there's no need for more money. More money is good. There's lots of things one could fund.

Starting point is 01:39:28 But I think that it is the key bottleneck is probably having projects that are appealing, like having people who have relevant background to perform projects who are sufficiently excited about doing work to try and manage this risk. that they are in the step of asking for money or they would do it if money was available. It's talent. Talent's a bottleneck. I think that's the big problem right now. I mean, it's, again, neither is really a bottleneck. I think, like, you can spend money to, like,

Starting point is 01:39:50 help people switch into the field to fund projects that are more speculative and more long shots to increase the incentives. And, like, some people will switch, even if you're funding work that would have been done anyway. If you fund it more generously, more people will get into the field inevitably. So there's, like, a lot of ways you can use money. And, like, it's not exactly clear whether one should be trying to have more money available or more talent available. But it's like a little bit rough right now. I think if you're trying to spend money and trying to look like where do you spend that money, it's hard to spend money on a problem that like practicing like practitioners in industry and scientists in academia are not prioritizing.

Starting point is 01:40:19 Most successful spending comes from having like people who want to do the work who like need the money to do the work or at least are open to doing the work and are somewhat excited about it. So Paul, if there are talented, what kind of talent is really missing? Like what is the archetype of someone that this area of AI really needs? I think there's a lot of kinds of work to be done. So we've talked mostly about technical solutions where, like, mostly the relevant kinds of talent are either people who have mathematical backgrounds or backgrounds and computer science or experience of machine learning or who are good engineers or good design. I think a very common pattern, like a lot of people who work in this field are people who came in from some other area, like, we used to do physics. I think it's like, if you're doing physics and right now in the world, it's reasonable to say, like, you know, we can pause the physics for a little while. This AI thing is going to be urgent for the next, like, 10, 20 years.

Starting point is 01:41:00 And so I think, like, it is reasonable to have people shifting to just have broad scientific backgrounds, so have experienced during research, understand how to study, complicated empirical questions. I think it's like a broad range of technical backgrounds that slot somewhere into this picture. And if you broaden your scope from just those technical issues to like the whole picture, then there's an even broader range of people. There's a lot of work that's like institutional work or understanding what, like, how should we approach this measurement thing or like can we make progress on this like public discussion or public advocacy?

Starting point is 01:41:26 Can we understand just like generalist forecasting of like what is the state of play? To what extent are objections to this reasonable like actual reasons to have lower, which of the things people would say as objections are like actual considerations that should change your view versus targets for advocacy where you just want to try and change views and like what is a reasonable view in light of what is actually the right synthesis of consensus if you take the things experts believe that are most across different fields that are most relevant. I think there's just like a lot of stuff to do. The one I understand best is definitely like I don't know. There's probably like 500 reasonable projects in AI safety or something and there's not that many people working on them. So just like people coming in looking at the landscape. seeing where they can fit in, trying to do some projects in that space, whether that's in interpretability or in scalable supervision or just studying how models generalize or other science about models or understanding work on robustness. I give those four categories. That's not an exhaustive breakdown. For example, that does not actually include my day job in all the work that I do normally. What's your day job and what's the work that you do? It's primarily on trying to

Starting point is 01:42:25 develop alternative training strategies or just qualitatively new techniques that don't have this issue, like don't incentivize takeover in the same way the current techniques do. I think it's like, It's one thing that goes in the portfolio. I think, you know, there's a 10% chance that we're able to come up with, like, a really good strategy that does qualitatively change the game. And, like, if risks seem large, we could adopt it. That's, like, another category of work. Some people work on that.

Starting point is 01:42:47 Right. I think it will less likely to absorb huge numbers of people, and it's a little bit more of a long shot. But I think it's a good thing to do. Yeah. So that's my high level. There's just a lot of projects. Yeah.

Starting point is 01:42:55 I think most projects, people go through that list, like, almost all of those projects are hurting for some combination of technical talent doing them, like senior researchers who are have research experience and are able to, like, bring good judgment and mentorship to these problems, management or, like, entrepreneurial spirit or, like, management experience to, like, actually onboard people and coordinate projects working on them. And it's just, like, an incredibly high premium to people who have technical backgrounds and are, like, entrepreneurial enough to, like, look at the space, engage with, like, people's thoughts about what helps, form their own views that are reasonable and then start on projects. Like, the returns to doing that right now are just

Starting point is 01:43:27 crazy high, I think. Right. Paul, what would you say would the odds to be, I know this isn't your field, but, like, the odds that somebody in this field wins a Nobel Prize in, like, the next 20 years. Oh. I think they seem pretty, you mean, someone working on, like, AILM and stuff. Yeah, right. I think the first problem is that, like, there's no Nobel Prize in any adjacent area. I guess it would be, like, yeah.

Starting point is 01:43:47 I guess my point is like... There's no prize for saving humanity? I mean, we should work on that, right? There's a prize somewhere if for someone solves this problem. Yeah. So you have, like, you have, like, touring awards or Fields Medals or whatever, which are maybe more analogous I think you'd think of. I don't know.

Starting point is 01:44:01 I mean, I think it's not that, like, 20 years is not that long a time horizon. Most of these things tend to be given latent careers for early career achievements, although like Fields Medal's are a bit of a distinction. Most of the stuff people are doing is not really in a category that would get a Turing Award or Fields Medal. It was more just like there's table stakes

Starting point is 01:44:17 should be that there should be some big recognition for whoever solves this problem. So that's, maybe that's my call to action for the Nobel Prize. Somebody make a prize. Somebody make a prize real quick. That is the thing you could do with money. And it's like not a crazy thing to do with money. It's like also a thing that sort of, in some sense, scientific prestige is one of the inputs.

Starting point is 01:44:33 It's like a prize could be made of either money or prestige. Yeah. It's harder to synthesize the prize made of prestige, and that's more like a is the scientific community bought in. I mean, at high level, I think it is fairly likely that in retrospect, in like the maybe half of worlds where this was a real problem, that in retrospect, people will be very excited about some fraction of the work that was done in the area. People would be like, that was a big deal.

Starting point is 01:44:52 Sorry, we were asleep at the wheel. Oops. So I'm like, I mean, I think it's 50-50 chance that people will feel that way in retrospect, generally. And the question of, like, is there any kind of existing institutions set up, which would provide significant recognition. I'm like less sure about that. I mean, I hope like the work we do, you know,

Starting point is 01:45:07 that's like good academic work. I think it has like probably higher than the base rate for like the article computer science of winning prestigious prizes. But it's kind of just like those processes are just on academic merit rather than saving the world impact. I think we can go ahead and zoom forward

Starting point is 01:45:21 into the future where if humanity does thread this needle, there will be an institution that does award a prize. I think we can run on that assumption. Paul, thank you so much for guiding us through. was exactly the episode that we wanted to produce. And I think I got a lot of my questions answered. But I have one last question for you. My strategy thus far up to this point has been to just be really polite to Siri and Alexa. And I'm wondering if that, in your opinion, is moving the needle for me, me specifically, I don't care about the rest of the world, if that does anything at all.

Starting point is 01:45:56 I'm going to go with probably no. Okay. I was afraid of that. That is, I think there is some real thing about humanity treating AI systems with respect and dignity and being like, look, these things are going to be, I think there's a lot of room for humanity to do wrong by AI systems we create that are smart. I think being nice to steer in Alexa is probably not even the place you most personally could help at that. They probably don't mind. Well, my philosophy is like at some point, these things will be AI's. There will be an AI on the other side of that robot, that microphone, and my philosophy is, why not start treating it with respect now? I think it's not crazy. I mean, I think The bigger question is how will they look upon this podcast? Will they look favorably upon this

Starting point is 01:46:33 podcast in our body of work at Bankless so far, David? That might be the bigger question in my mind. And one last concluding question for me, Paul, is because you said 20% doom scenario, which leaves 80% for non-dome scenario. And I bet there's some, like, mediocre scenarios in there, too. So tell me about the 20% utopia scenario. Like, is there the possibility this all goes really, really well. And what happens on the other side? Leave us with some optimism here. I'm not that good a man and leaving people with optimism. I think we've been like 50, more like 50-50 that humanity achieves a very good outcome. That is an outcome where we're like, this is about as good as we could have expected it to be. Maybe a little bit less than that. But more like 50-50 than 20%, I think.

Starting point is 01:47:17 So that's optimistic. And I'm just, I think it's really hard to talk about like what the like political economy of that world is like over the very long run. I think the big thing is like humanity, I think still has a long history in front of us. I think AI means that like that history is probably going to get compressed. Like I think there's a long time for institutions to change and things to happen. I think a lot of that is going to happen much, much faster

Starting point is 01:47:37 than it would have if you're thinking about like tens of thousands of years of human history to date. I think you're probably thinking more like, you know, tens of years before the world is like very, very radically transformed. I don't really know what that world looks like. I think that like a lot of human problems are like humans versus nature. Like we die of old age and disease and people have physical, want and stuff. And I think that that is probably going to be really much better. I think that's

Starting point is 01:47:59 like even more than 50%. The problems that are caused by humanity versus nature are going to be pretty good. And I think those are, you know, I think our lives would be quite good. In some sense, human versus human conflict is mostly problematic because there's limited budget to go around. And then beyond that, it gets really hard to say what the character of that world is like and what we choose to do if, in fact, we're in like a pretty good position physically. If we have incredible resources at our disposal, what kind of world we make. That's the longer and more complicated discussion. But I'm pretty psyched. I mean, personally, I'm very glad I'm living now instead of any time.

Starting point is 01:48:29 And I would definitely take like 50% chance of death. 50% chance of early death seems like kind of very small compared to the overall change and expected quality of life from being alive today. Wow, that is extremely optimistic. I'm much more optimistic than LAS are, definitely. Yeah, that's for sure. We've got a coin flip here.

Starting point is 01:48:45 Either it goes very poorly or it may go well for us. And Paul, thank you for the work that you're doing to make this go well. It's been a pleasure to have you on bank lists. Thanks for having me. That's great talking. Guys, Paul's website. We'll leave it in the action items for you. The Alignment Research Center, that's at alignment.org.

Starting point is 01:48:59 You can check out what his organization is doing. Also, we'll include a link to Paul Christiana's website with some of his writings, including, I think, some links to some of his debates with Eliezer Utkowski from the archives as well. Got to end with this. Of course, none of this has been financial advice, but you got to know that. We were talking about AI. We mentioned crypto a couple of times, but I'll just end with this. AI is risky.

Starting point is 01:49:22 The stakes are high. We could lose a lot here. but we are headed west. This is the frontier. It's not for everyone, but we're glad you're with us on the bankless journey. Thanks a lot.

Bankless - 168 - How to Solve AI Alignment with Paul Christiano

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.