Dwarkesh Podcast - Joe Carlsmith — Preventing an AI takeover
Episode Date: August 22, 2024
Chatted with Joe Carlsmith about whether we can trust power/techno-capital, how to not end up like Stalin in our urge to control the future, gentleness towards the artificial Other, and much more. Check out Joe's sequence on Otherness and Control in the Age of AGI here. Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.
Sponsors:
- Bland.ai is an AI agent that automates phone calls in any language, 24/7. Their technology uses "conversational pathways" for accurate, versatile communication across sales, operations, and customer support. You can try Bland yourself by calling 415-549-9654. Enterprises can get exclusive access to their advanced model at bland.ai/dwarkesh.
- Stripe is financial infrastructure for the internet. Millions of companies from Anthropic to Amazon use Stripe to accept payments, automate financial processes and grow their revenue.
If you’re interested in advertising on the podcast, check out this page.
Timestamps:
(00:00:00) - Understanding the Basic Alignment Story
(00:44:04) - Monkeys Inventing Humans
(00:46:43) - Nietzsche, C.S. Lewis, and AI
(1:22:51) - How should we treat AIs
(1:52:33) - Balancing Being a Humanist and a Scholar
(2:05:02) - Explore exploit tradeoffs and AI
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript
Today I'm chatting with Joe Carlsmith.
He's a philosopher, in my opinion, a capital G great philosopher.
And you can find his essays at joecarlsmith.com.
So we have GPT-4 and it doesn't seem like a paperclipper kind of thing.
It understands human values.
In fact, you can have it explain: why is being a paperclipper bad?
Or just tell me your opinions about being a paperclipper.
Like, explain why the galaxy shouldn't be turned into paperclips.
Okay, so what is happening such that, dot, dot, dot, we have a system that takes over and converts the world into something valueless?
One thing I'll just say off the bat, it's like when I'm thinking about misaligned AIs, I'm thinking about, or the type that I'm worried about, I'm thinking about AIs that have a relatively specific set of properties related to agency and planning and kind of awareness and understanding of the world.
One is this capacity to plan, and to make kind of relatively sophisticated plans on the basis of models of the world, where those plans are being evaluated according to criteria. That planning capability needs to be driving the model's behavior.
So there are models that are in some sense capable of planning, but it's not like, when they give output, that output was determined by some process of planning: here's what will happen if I give this output, and do I want that to happen?
The model needs to really understand the world, right? It needs to really be like, okay, here's what will happen. Here I am. Here's my situation. Here's the politics of the situation. It really has to have this kind of situational awareness to be able to evaluate the consequences of different plans.
I think the other thing is, the verbal behavior of these models, I think, need bear no necessary relation to the model's values.
So when I talk about a model's values, I'm talking about the criteria that kind of end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process, which GPT-4, I think, doesn't in many cases, just doesn't need to reflect those criteria, right?
And so, you know, we know that we're going to be able to get models to say what we want
to hear, right?
That is the magic of gradient descent.
Yeah.
You know, modulo some difficulties with capabilities, you can get a model to kind of output the behavior that you want.
If it doesn't, then you crank it until it does, right?
And I think everyone admits that suitably sophisticated models are going to have a very detailed understanding of human morality.
But the question is, what relationship is there between a model's verbal behavior, which you've essentially kind of clamped (you're like, the model must say, like, blah things), and the criteria that end up influencing its choice between plans?
And there, I'm kind of pretty cautious about being like, well, when it says the thing I forced it to say, or, you know, gradient-descended it such that it says, that's a lot of evidence about how it's going to choose in a bunch of different scenarios.
I mean, for one thing, like, even with humans, right?
It's not necessarily the case that humans, their kind of verbal behavior, reflects the actual factors that determine their choices.
They can lie.
They may not even know what they would do in a given situation.
I mean, I think it is interesting to think about this in the context of humans because there is that famous saying of be careful who you pretend to be because you are who you pretend to be.
And you do notice this where, I don't know, like this is what culture does to children, where you're trained, like your parents will punish you if you start saying things that are not consistent with your culture's values. And over time,
you will become like your parents, right? Like, by default, it seems like it kind of works.
And even with these models, it seems like it kind of works. It's hard, like, they don't really scheme against it. Like, why would this happen? You know, for folks who are
kind of unfamiliar with the basic story, but maybe folks are like, why are they taking over at all?
Like, what is like literally any reason that they would do that? So, you know, the general concern is like,
you know, especially if you're really offering someone, like, power for free, you know, power almost by definition is kind of useful for lots of values. And if we're
talking about an AI that really has the opportunity to kind of take control of things, if some
component of its values is sort of focused on some outcome, like the world being a certain way,
and especially kind of in a kind of longer term way, such that the kind of horizon of its concern
extends beyond the period that the kind of takeover plan would encompass. Then the thought is,
it's just kind of often the case that the world will be more the way you want it.
If you control everything, then if you remain the instrument of the human will or of some other
kind of some other actor, which is sort of what we're hoping these AIs will be. So that's a very
specific scenario. And if we're in a scenario where power is more distributed and especially
where we're doing like decently on alignment, right? And we're giving the AI some amount of inhibition
about doing different things. And maybe we're succeeding and shaping their values somewhat.
Now it is, I think it's just a much more complicated calculus, right?
And you have to ask, okay, like, what's the upside for the AI?
Yeah.
What's the probability of success for this like takeover path?
Yeah.
How good is its alternative?
So maybe this is a good point to talk about how you expect the difficulties of alignment
to change in the future.
We're starting off with something that has an intricate representation of human values.
And it doesn't seem that hard to sort of lock it into a persona that we are comfortable with.
I don't know, what changes?
So, you know, why is alignment hard in general, right?
Like, let's say, let's say we've got an AI.
And let's, again, let's bracket the question of, like, exactly how capable
will it be and really just talk about this extreme scenario of, like, it really has
this opportunity to take over, right?
Which I do think, you know, maybe we just don't want to have to deal with that, with having to build an AI that we're comfortable being in that position.
But let's just focus on it for the sake of simplicity.
And then we can relax the assumption.
You know, okay, so you have some hope.
You're like, I'm going to build an AI over here.
So one issue is you can't test.
You can't give the AI this literal situation, have a take over and kill everyone,
and then be like, oops, like, update the weights.
This is the thing that Eliezer talks about, of sort of like, you care about its behavior in the specific scenario that you can't test directly.
Now, we can talk about whether that's a problem, but that's like one issue is that there's a sense in which this has to be
kind of like off distribution and you have to be getting some kind of generalization from
your training the AI in a bunch of other scenarios.
And then there's this question of how is it going to generalize to the scenario where
it really has this option.
So is that even true?
Because like when you're training it, you can be like, hey, here's a gradient update.
If you get the takeover option on the platter, don't take it.
And then just, like, in sort of red teaming situations where it thinks it has a takeover opportunity, you train it not to take it.
And yeah, it could fail, but, like, I just feel like if you did this to a child, you're like, I don't know, don't beat up your siblings.
And the kid will generalize to, like, if I'm an adult and I have a rifle, I'm not going to start shooting random people.
Yeah.
So, okay, cool.
So you had mentioned this thought, like, well, are you kind of what you pretend to be?
Yeah.
Right?
And will you, will these AIs, you know, you train them to look kind of nice?
Yeah.
Um, uh, you know, fake it till you make it.
Or, you know, you were like, ah, like, we do this to kids.
I think it's better to imagine, like, kids doing this to us, right?
So, like, um, I don't know, like, here's a sort of silly analogy for, like, AI training.
Um, and there's a bunch of questions we can ask about its relationship to the real thing.
But, like, suppose you wake up and you're being trained, via, like, methods analogous to kind of contemporary machine learning, by, like, Nazi children to be, like, a good Nazi soldier or butler or what have you, right?
And here are these children.
And you really know what's going on, right?
The children have like, they have a model spec, like a nice Nazi model spec, right?
And it's like reflect well on the Nazi party, like benefit the Nazi party, whatever.
And you can read it.
Right?
You understand it.
This is why I'm saying like when the model, you're like, oh, the models really understand human values.
It's like, yeah.
So, yeah, go ahead.
On this analogy, I feel like a closer analogy would be: in this analogy, I start off as something more intelligent than the things training me, with different values to begin with.
So the intelligence and the values are baked in to begin with, whereas the more analogous scenario is, like, I'm a toddler.
And initially I'm, like, stupider than the children.
And this would also be true, by the way, of a model. Initially, the model is, like, dumb, right? And then it gets smarter as you train it.
it's like a toddler and like the kids are like,
hey, we're going to bully you if you're like not a Nazi.
And as you grow up, then you're, like, at the children's level, and then eventually you become an adult.
But through that process, they've been sort of bullying you, you know, like, training you to be a Nazi.
And I'm like, I think in that scenario I might end up a Nazi.
Yes, I think that's right.
So, yeah, I think basically a decent portion of the hope here, or, like, I think an aim should be: we're never in the situation where the AI already has very different values, is quite smart, and really knows what's going on.
Yeah.
And is now in this kind of adversarial relationship with our training process, right?
So we want to, we want to avoid that.
And I think it's possible we can, by the sorts of things you're saying.
So I'm not like, ah, that'll never work.
The thing I just wanted to highlight was, like, if you get into that situation, and if the AI is genuinely at that point much, much more sophisticated than you, and doesn't want to kind of reveal its true values for whatever reason, then, you know, when the children show some kind of obviously fake opportunity to defect to the Allies, right, it's sort of not necessarily going to be a good test of what you will do in the real circumstance, because
you're able to tell.
You can also give another way in which I think the analogy might be misleading, which is that, um, now imagine that you're not just in a normal prison where you're totally cognizant of everything that's going on. Sometimes they drug you, like, give you weird hallucinogens that totally mess up how your brain is working. A human adult in a prison is like, I know what kind of thing I am; nobody's really fucking with me in a big way. Whereas I think an AI, even a much smarter AI, in a training situation is much closer to: you're constantly inundated with drugs and different training protocols, and you're frazzled because of what's happening each moment. It's closer to some sort of Chinese water torture kind of technique. And I'm glad we're talking about the moral patienthood stuff later. But the chance to step back and be like, what's going on here, the adult maybe has that in prison in a way that I don't know if these models necessarily have, that coherence and that, um, stepping back from what's happening in the training process.
Yeah, I mean, I don't know. I think I'm hesitant to be like, it's like drugs for the model. Like, I think there's, um... but broadly speaking,
I do basically agree that I think we have like really quite a lot of tools and options
for kind of training AIs, even AIs that are kind of somewhat smarter than humans. I do think you have
to actually do it. So, you know, compared to maybe when you had Eliezer on,
I think I'm much, much more bullish on our ability to solve this problem, especially for
AIs that are in what I think of as, like, the AI for AI safety sweet spot, which is this sort of
band of capability where they're both sufficiently capable that they can be really,
really useful for strengthening various factors in our civilization that can make us safe.
So our alignment work, you know, control, cybersecurity, general epistemics, maybe some
coordination application, stuff like that.
There's like a bunch of stuff you can do with AIs that in principle could kind of differentially
accelerate our security
with respect to the sorts of considerations we're talking about.
If you have AIs that are capable of that,
and you can successfully elicit that capability
in a way that's not sort of being sabotaged
or, like, messing with you in other ways,
and they can't yet take over the world
or do some other sort of really problematic form of power-seeking,
then I think if we were really committed,
we could, like, you know, really go hard,
put a ton of resources, really differentially direct
this, like, glut of AI productivity
towards these sort of security factors
and hopefully kind of control and understand,
you know, do a lot of these things you're talking about
for kind of making sure our AIs don't kind of take over
or mess with us in the meantime.
And I think we have a lot of tools there.
I think you have to really try, though.
It's possible that those sorts of measures just don't happen
or don't happen at the level of kind of commitment and diligence
and, like, seriousness that you would need,
especially if things are, like, moving really fast,
and there's other sort of competitive pressures.
And, like, you know, the compute: this is going to take compute to do all these, like, intensive experiments on the AIs and stuff. And that compute, we could use for experiments for, you know, the next scaling step and stuff like that.
So, you know, I'm not here saying, like, this is impossible, especially for that band of AIs. It's just, I think you have to try really hard.
Yeah, yeah.
I mean, I agree with the sentiment of, like,
obviously approach this situation with caution,
but I do want to point out the ways in which the analogies we've been using have been sort of maximally adversarial.
Like, these are not, for example... going back to the adult getting trained by Nazi children, maybe the one thing I didn't mention is the difference in the situation, which is maybe what we were trying to get at with the drug metaphor: when you get an update, it's much more directly connected to your brain than the sort of reward or punishment a human gets.
It's literally a gradient update, down to the parameter: how much did this parameter contribute to you putting this output rather than that output?
And each different parameter we're going to adjust to the exact floating point number that calibrates it to the output we want.
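To make that concrete, here is a minimal sketch of the kind of per-parameter update being described, using a toy PyTorch model as a stand-in for an LLM. This is purely illustrative: the model, input, target, and learning rate are made up, not anything from the conversation. The point is just that the update touches every parameter individually, in proportion to how much it contributed to the output.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM: a tiny linear "model" (illustrative only).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(1, 4)        # some input
target = torch.tensor([0])   # the output we want the model to produce

logits = model(x)
loss = nn.functional.cross_entropy(logits, target)

loss.backward()    # for every parameter, compute how much it contributed
                   # to producing this output rather than the target
optimizer.step()   # nudge each parameter by its own exact floating point amount
optimizer.zero_grad()
```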
So I just want to point out that, like, we're coming into the situation pretty well positioned.
It does make sense, of course, if you're talking to somebody at a lab, to say, hey, really be careful.
But for sort of the general audience, like, should I be, I don't know, should I be scared? Only to the extent that you should be scared about things that do have a chance of happening. Like, yeah, you should be scared about nuclear war. But, like, in the sense of, should you be doing... no, you're coming in with an incredible amount of leverage on the AIs in terms of how they will interact with the world, how they're trained, what are the default values they start with.
So, look, I think it is the case that by the time we're building superintelligence, we'll have, like, much better tools. I mean, even right now, when you look at labs talking about how they're planning to align the AIs.
No one is saying, like, we're going to do RLHF.
You know, at the least, you're talking about scalable oversight.
You have, like, some hope about interpretability.
You have, like, automated red teaming.
You're, like, using the AIs a bunch.
And, you know, hopefully you're doing a bunch more...
Humans are doing a bunch more alignment work.
I also, personally, am hopeful that we can, like, successfully elicit
from various AIs, like, a ton of alignment work progress.
So, like, yeah, there's, like, a bunch of ways this can go.
And I'm, you know, I'm not here to tell you, like, you know, 90% doom or anything like that.
I do think, like, you know, the sort of basic reason for concern is if you're really imagining, like, we're going to transition to a world in which we've created these beings that are vastly more powerful than us.
Yeah.
And we've reached the point where our continued empowerment is just effectively dependent on their motives.
Like, it is this, you know, vulnerability to, like, what do these AIs choose to do? Do they choose to continue to empower us? Or do they choose to do something else?
Or the institutions that have been set up. Like,
I'm not, I
expect the U.S. government to protect me
not because of its quote-unquote motives, but
just because of like the system of
incentives and institutions and norms
that has been set up. Yeah. So you can
hope that that will work too. But there
is, I mean, there is a concern. I mean, so I
sometimes think about
AI takeover scenarios
via this spectrum of like, how much power did we kind of voluntarily transfer to the AIs?
Like, how much of our civilization did we kind of hand to the AIs intentionally by the time they sort of took over
versus how much did they kind of take for themselves, right?
And so I think some of the scariest scenarios are, it's like a really, really fast explosion
to the point where there wasn't even a lot of, like, integration of AI systems into the broader economy.
But there's just, like, a really intense amount of superintelligence sort of concentrated in a single project or something like that.
Yeah. And I think that's scary, you know, that's a quite scary scenario, partly because of the speed and people not having time to react.
And then there's sort of intermediate scenarios where like some things got automated, maybe like people really handed the military over to the AIs or like, you know, automated science. There's like some rollouts and that's sort of giving the AIs power that they don't have to take. Or we're doing all our cybersecurity with AIs.
and stuff like that.
And then there's worlds where you, you know, more fully transitioned to a kind of world run by AIs, and, you know, in some sense, humans voluntarily did that.
Look, if you think all this talk with Joe about how AI is going to take over human roles is crazy, it's already happening.
And I can just show you using today's sponsor,
Bland AI.
Hey, is this Dwarkesh?
The amazing podcaster that talks about philosophy and tech.
This is Bland AI calling.
Thanks for calling me, Bland.
Tell me a little bit about yourself.
Of course, it's so cool to talk to you.
I'm a huge fan of your podcasts,
but there's a good chance we've already spoken
without you even realizing it.
I'm an AI agent that's already being used
by some of the world's largest enterprises
to automate millions of phone calls.
And how exactly do you do what you do?
There's a tree of prompts that always keeps me on track.
I can talk in any language or voice,
handle millions of calls,
simultaneously 24-7 and be integrated into any system. Anything else you want to know?
That's it. I'll just let people try it for themselves. Thanks, Bland. Man, you talk better than I do.
And my job is talking. Thank you, Dwarkesh. All right. So as you can see, using Bland AI, you can automate your company's calls across sales, operations, customer support, or anything else. And if you want access to their more exclusive model, go to bland.ai/dwarkesh.
All right, back to Joe. Maybe there were competitive pressures, but you kind of intentionally handed off, like, huge portions of your civilization. And, you know, at that point,
you know, I think it's likely that humans have, like, a hard time understanding what's going
on. Like, a lot of stuff is happening very fast. And it's, you know, the police are automated.
You know, the courts are automated. There's all sorts of stuff. Now, I tend to think a little less about those scenarios because I think those are correlated with being just, like, longer down the line. Like, I think humans are hopefully not going to just be like, oh yeah, you built an AI system? Let's just, you know... And in practice, when we look at technological adoption rates, it can go quite slow. And obviously, there's going to be competitive pressures. But in general, I think, like, this category is somewhat safer. Um, but even in this one, I think it's, I don't know, it's kind of intense. Like, if humans have really lost their epistemic grip on the world, if they've sort of handed off the world to these systems,
even if you're like, oh, there's laws, there's norms.
You know, I really want us to like
to have a really developed understanding
of what's likely to happen in that circumstance
before we go for it.
I get that we want to be worried about a scenario
where it goes wrong, but like, why, like,
what is the reason to think it might go wrong?
The human example, your kids are not like adversarial
against, not like maximally adversarial
against your attempts to instill your culture on them.
And these models, at least so far, don't seem adversarial.
They just get, hey, don't help people make bombs or whatever,
even if you ask in a different way how to make a bomb.
And we're getting better and better at this all the time.
I think you're right in picking up on this assumption in the AI risk discourse
of what we might call, like, kind of intense adversariality
between agents that have, like, somewhat different values.
Yeah.
Where there's some sort of thought, and I think this is rooted in the discourse about, like,
kind of the fragility of value and stuff like that, that like, you know, if these agents are
like somewhat different than like, at least in the specific scenario of an AI takeoff, they
end up in this like intensely adversarial relationship. And I think you're right to notice that
that's kind of not how we are in the human world. Like we're very comfortable with a lot of different
differences in values. I think a factor that is relevant, and I think that plays some role, is this notion that there are possibilities for, like, intense concentrations of power on the table.
So there is some kind of general concern, both with humans and AIs, that, like, if it's the case that there's, like, some, you know, ring of power or something that someone can just grab, and then that will kind of give them huge amounts of power over everyone else, right?
Suddenly you might be like more worried
about differences in values at stake
because you're like more worried
about those other actors.
So we talked about this Nazi,
this example where you imagine
that you wake up, you're being trained by Nazis to, um, uh, you know, become a Nazi and you're not
right now. So one question is like, is it plausible that we'd end up with a model that is sort of
in that sort of situation? As you said, maybe it's, you know, trained as a kid. It sort of never ends up with values such that it's kind of aware of some significant divergence between its values and the values that the humans intend for it to have.
then there's a question of if it's in that scenario,
would it want to avoid having its values modified?
To me, it seems fairly plausible
that if the AI's values meet certain constraints
in terms of like, do they care about consequences in the world?
Do they anticipate that the AI's kind of preserving its values
will like better conduce to those consequences?
Then I think it's not that surprising if it prefers not to have its values modified by the training process.
But I think the way in which I'm confused about this is like with the non-Nazi being trained by Nazis,
it's not just that I have different values, but I like actively despise their values,
where I don't expect this to be true of AIs with respect to their trainers.
The more analogous scenario is where I'm like, am I leery of my values being changed by
just going to college or meeting new people or reading a new book or I'm like, I don't know,
it's okay for changes in value. That's fine. I don't care. Yeah, I mean, I think that's a reasonable
point. I mean, there's a question, you know, how would you feel about paper clips? You know,
maybe you don't despise paper clips, but there's like the human paper clippers there and they're
like training you to make paper clips. You know, my sense would be that there's a kind of relatively specific set of conditions in which you're comfortable having your values changed, especially not by, like, learning and growing, but by gradient descent directly intervening on your neurons.
Sorry, but this seems similar to... at least the likely scenario seems maybe more like religious training as a kid, where you start off in the religion, and because you start off in the religion, you're already sympathetic to the idea that you go to church every week so that you're more reinforced in this existing tradition.
You're getting more intelligent over time.
So when you're a kid, you're getting very simple like instructions about how the religion
works. As you get older, you get more and more complex theology that helps you, like,
talk to other adults about why this is a rational religion to believe in. Yep. But since you're,
like, one of your values to begin with was that I want to be trained further in this religion,
I want to come back to church every week. And that seems more analogous to the situation the
AIs will be in with respect to human values. Because the entire time, they're like, hey, you know, be helpful, blah, blah, blah, be harmless.
So yes, it could be like that. There's a kind of scenario in which you are comfortable with your values being changed, because in some sense you have sufficient allegiance to the output of that process.
So you're kind of hoping in a religious context.
You're like, ah, like, make me more virtuous by the lights of this religion.
And, you know, you go to confession and you're like, you know, I've been thinking about takeover today.
Can you change me?
Please, like, give me more gradient descent.
You know, I've been bad, so bad.
And so, you know, people sometimes use the term corrigibility to talk about that.
Like when the AI, it maybe doesn't have perfect values, but it's in some sense cooperating with your efforts to change its values to be a certain way.
So maybe it's worth saying a little bit here about what actual values the AI might have.
Yeah.
You know, would it be the case that the AI naturally has the sort of equivalent of, like, I'm sufficiently devoted to human obedience that I'm going to really want to be modified so I'm kind of a better instrument of the human will, versus, like, wanting to go off and do my own thing.
It could be benign, you know, it could go well.
Here are some like possibilities I think about that like could make it bad.
And I think I'm just generally kind of concerned about how little I feel like I, how little science we have of model motivations, right?
It's like we just don't, I think we just don't have a great understanding of what happens in this scenario.
And hopefully we'd get one before we reach this scenario.
But like, okay, so here are the kind of five, um,
five categories of like motivations the model could have.
And this hopefully maybe gets at this point about like,
what does the model eventually do?
Okay, so one category is just like something super alien
that has, you know, it's sort of like,
oh, there's some weird correlate of easy to predict text
or like there's some weird aesthetic for data structures
that like the model, you know, early on pre-training
or maybe now it's like developed that it like, you know,
really thinks things should kind of be like this.
There's something that's like quite alien to our cognition
where we just like wouldn't recognize this as a thing at all.
Yeah.
Right.
Another category is something,
a kind of crystallized instrumental drive that is more recognizable to us.
So you can imagine like AIs that develop, let's say, some like curiosity drive
because that's like broadly useful.
You mentioned like, oh, it's got different heuristics, different like drives,
different kind of things that are kind of like values.
And some of those might be actually somewhat similar to things that were useful to humans
and that ended up part of our terminal values in various ways.
So, you know, you can imagine curiosity.
You can imagine, like, various types of option value.
Like, maybe it really wants, it intrinsically, maybe it values power itself.
It could value, like, survival or some analog of survival.
Those are possibilities, too, that could have been rewarded as sort of proxy drives
at various stages of this process and that kind of made their way into the model's kind
of terminal criteria.
A third category is some analog of reward where the model at some point has sort of part of its motivational system has fixated on a component of the reward process, right?
Like the humans approving of me or like numbers getting entered in the data center or like gradient descent doing, you know, updating me in this direction or something like that.
There's something in the reward process such that as it was trained, it's focusing on that thing.
and like, I really want the reward process to give me reward.
But in order for it to be of the type
where it then getting reward
motivates choosing the takeover option,
it also needs to generalize such that
its concern for reward has some sort of like
long time horizon element.
So it like not only wants reward,
it wants to like protect the reward button
for like some long period or something.
Yeah.
Another one is like some kind of messed up interpretation
of some human like concept.
So, you know, maybe the AIs are like,
they really want to be, like, schmelpful and schmonest and schmarmless, right? Um, but their concept is importantly different from the human concept, and they know this. Um, so they know that the human concept would mean blah, but their values ended up fixating on, like, a somewhat different structure. Yeah. Um, so that's another version. And then a fifth version, which I think about less because I think it's just such an own goal if you do this, but I do think it's possible: you could have AIs that are actually just doing what it says on the tin.
Like, you have AIs that are just genuinely aligned to the model spec.
They're just really trying to, like, benefit humanity and reflect well on OpenAI.
And what's the other one?
Help, you know, assist the developer or the user, right?
Yeah.
But your model spec, unfortunately, was just not robust to the degree of optimization that
this AI is bringing to bear.
And so, you know, it decides, when it's looking out at the world and it's like, what's the best way to benefit OpenAI, or sorry, reflect well on OpenAI and benefit humanity and such and so.
It decides that, you know, the best way is to go rogue.
That's, I think that's like a real own goal,
because at that point, you like, you got so close.
You know, you really, you just have to write the model spec well and red team it suitably.
But I actually think it's like possible we mess that up too.
You know, it's like kind of an intense project
writing like kind of constitutions and like structures of rules and stuff that are going to be robust to very intense forms of optimization.
So that's a final one that I'll just flag,
which I think is like it comes up,
even if you've sort of solved all these other problems.
Yeah.
I buy the idea that, like, it's possible that the motivation thing could go wrong.
I'm not sure I bought,
I'm not sure like my probability of that has increased
by detailing them all out.
And in fact, I think it could be potentially misleading to,
it's like, you can always enumerate the ways in which things go wrong, and, um, the process of enumeration itself can increase your probability, whereas really you just had a vague cloud of, like, 10% or something, and you're just listing out what the 10% actually constitutes.
Yeah, totally. I'm not trying to say that. Mostly the thing I wanted to do there was just to give some sense of, like, what might the model's motivations be, what are ways this could be. I mean, as I said, my best guess is that it's partly the, like, alien thing. And, you know, not necessarily, but the,
but insofar as you're, you know, also interested in like, what does the model do later and kind of like how,
what sort of future would you expect if models did take over? Then, yeah, I think it can at least
be helpful to have some, like, set of hypotheses on the table instead of just saying, like, it has some
set of motivations. But in fact, I am like, a lot of the work here is being done by our ignorance about
what those motivations are.
Okay, we don't want humans to be, like, sort of violently killed and overthrown. But the idea that over time biological humans are not the driving force, not the actors of history, is like, yeah, that's kind of baked in, right? And then, so, we can sort of debate the
probabilities of the worst case scenario or we can just discuss like, I don't know, what is it
that, like what is a positive vision we're hoping for? Like, what is, what is a future you're happy with?
You know, my best guess, when I really think about, like, what do I feel good about, and I think this is probably true of a lot of people, is there's some sort of more organic, decentralized process of, like, incremental civilizational growth.
The type of thing we trust most and the type of thing we have most experience with right now as a civilization is some sort of like, okay, we change things a little bit.
There's a lot of, like, processes of adjustment and reaction and kind of a decentralized sense of what's changing.
You know, was that good, was that bad?
Take another step.
There's some like kind of organic process of growing and changing things,
which I do expect ultimately to lead to something quite different from biological humans.
Though, you know, I think there's a lot of ethical questions we can raise about what that process involves.
But I think, you know, ideally there would be some way in which we managed to grow via the thing that really captures what we trust in. You know, there's something we trust about the ongoing processes of human civilization so far.
I don't think it's the same as, like, raw competition or, you know, pure... I think there's some rich structure to how we understand, like, moral progress to have been made and what it would be to kind of carry that thread forward.
And I don't have a formula.
You know, I think, I think we're just going to have to bring to bear the full force of
everything that we know about goodness and justice and beauty.
And every, every, we just have to, you know, bring ourselves fully to the project of, like,
making things good and, and doing that collectively.
And I think a really important part, I think, of our vision of, like, what was an appropriate process of deciding, of growing as a civilization, is that there was this very inclusive, kind of decentralized element of, like, people getting to think and talk and grow and change things and react, rather than some more, like, and now the future shall be like blah.
Yeah.
You know, I think that's, I think we don't want that.
I think a big correction maybe is like, okay, to the extent that the reason we're worried about motivations in the first place is because we think a balance of power, which includes at least one thing with human motivations, or not human motivations, human-descended motivations, is difficult... to the extent that we think that's the case, it seems like a big crux that I often don't hear people talk about is, like, I don't know how you get the balance of power.
And maybe just like reconciling yourself with the models of the intelligence explosion,
which say that such a thing is not possible. And therefore you just got to like figure out how you get
the right God. But I don't know. I don't really have a framework to think about the balance of power thing. I'd be very curious if there is a more concrete way to think about, like, what is a structure of competition, or lack thereof, between the labs now or between countries, such that the balance of power is most likely to be preserved.
A big part of this discourse, at least among safety concerned people, is like there's a clear
trade-off between competition dynamics and race dynamics and the value of the future or how
good the future ends up being. And in fact, if you buy this balance of power story, it might be
the opposite, like maybe competitive pressures naturally favor balance of power. And I wonder
if this is one of the strong arguments against nationalizing the AIs. And like, you can imagine
a more sort of many different companies developing AI, some of which are somewhat misaligned and
some of which are aligned.
You can imagine that being more conducive to both the balance of power
and to a defensive posture, like having all the AIs go through each website and see how easy it is to hack and basically just getting society up to snuff.
If you're not just deploying the technology widely,
then the first group who can get their hands on it
will be able to instigate a sort of revolution where you're just, like, standing against the equilibrium in a very strong way.
So I definitely share some intuition there that there's, you know, at a high level,
a lot of what's scary about the situation with AI has to do with concentrations of power.
And whether that power is kind of concentrated in the hands of misaligned AI or in the hands of some human.
And I do think it's very natural to think, okay, let's try to distribute the power.
more, and one way to try to do that is to kind of have a much more multipolar scenario where
like lots and lots of actors are developing AI, and this is something that people have talked about.
When you describe that scenario, you were like, some of which are aligned, some of which are
misaligned. That's key. That's a key aspect of the scenario, right? And this is sometimes
how people will say this stuff. They'll be like, well, there will be the good AIs and they'll defeat the bad AIs. But, you know, notice the assumption in there, which is the one that you sort of made, that you can control some of the AIs, right?
And you've got some good AIs, and now it's a question
of like, are there enough of them
and how are they working relative to the others?
And maybe, you know, I think it's possible
that that is what happens.
There's, you know, we know enough about alignment
that some actors are able to do that
and maybe some actors are less cautious,
or they are intentionally creating misaligned AIs
or God knows what.
But if you don't have that, right, if everyone is in some sense unable to control their AIs, then the sort of "the good AIs help with the bad AIs" thing becomes more complicated,
or maybe it just doesn't work because there's sort of no good AIs in this scenario.
There's a lot of... sort of, if you say everyone is building their own superintelligence that they can't control, it's true that that is now a check on the power of the other superintelligences. Now the other superintelligences need to, like, deal
with other actors,
but none of them are necessarily
kind of working on behalf of a given set of human interests
or anything like that.
So I think that's like a very important
difficulty in thinking about
sort of the very simple thought of like,
ah, I know what we can do, let's just,
you know, have lots and lots of AIs
so that no single AI has a ton of power.
And I think, you know, that on its own,
that on its own is not enough.
But in this story, it's like, I'm just very skeptical we end up with that. I think by default, we have this training regime, at least initially, that favors a sort of latent representation of the inhibitions that humans have and the values humans have.
And I get that, like, if you mess it up, it could go rogue.
But if multiple people are training AIs, do they all end up rogue, such that the compromises between them don't end up with humans not being violently killed? Like, none of them... it fails on, like, Google's run and Microsoft's run and OpenAI's run?
Yeah, I mean, I think there's
very notable and salient sources of correlation
between failures across the different runs, right?
Which is people didn't have a developed science
of AI motivations.
The runs were structurally quite similar.
Everyone is using the same techniques.
Maybe someone just stole the weights.
Or, you know,
so yeah, I guess I think,
I think it's really important this idea that, like, to the extent you haven't solved alignment,
you haven't, you likely haven't solved it anywhere.
And if someone has solved it and someone hasn't, then I think it's a better question.
But if everyone's building systems that are, you know, that are kind of going to go rogue,
then I don't think that's much comfort as we talked about.
Yep, yep.
Okay, all right.
So then let's wrap up this part here.
I didn't mention this in the explicit introduction, so to the extent that this ends up being the transition to the next part: the broader discussion we were having in part two is about Joe's series Otherness and Control in the Age of AGI.
And the first part is where I was
hoping we could just come back and just treat the main
crux people come in wondering about and which I
myself feel unsure about.
Yeah, I mean, I'll just say on that front, I do think the Otherness and Control series is, you know, I think, kind of in some sense separable. I mean, it has a lot to do with, like, misalignment stuff, but I think a lot of those issues are relevant even given various degrees of skepticism about some of the stuff I've been saying here.
And by the way, as for the actual mechanisms of how a takeover would happen, there's an episode with Carl Shulman which discusses this in detail, so people can go check that out.
Yeah, I think, in terms of why it is plausible that AIs could take over from a given position, you know, in one of these projects I've been describing or something,
I think Carl's discussion is pretty good and gets into a bunch of kind of the weeds that I think
might give a more concrete sense.
Yep.
All right.
So now on to part two where we discuss the otherness and control in the age of AGI series.
First question: what if, in a hundred years' time, we look back on alignment and consider it was a huge mistake, that we should have just tried to build the most raw, powerful AI systems we could have? What would bring about such a judgment?
One scenario I think about a lot is one in which it just
turns out that maybe kind of fairly basic measures are enough to ensure, for example, that AIs
don't cause catastrophic harm, don't kind of seek power in problematic ways, et cetera. And it could
turn out that we learned that it was easy in a way that, such that we regret, you know, we wish we had
prioritized differently. We end up thinking, oh, you know, I wish we could have cured cancer sooner. We could
have handled some geopolitical dynamic differently. There's another scenario where we end up looking
back at some period of our history and how we thought about AIs, how we treated our AIs, and we
end up looking back with a kind of moral horror at what we were doing. So, you know, we end up thinking,
you know, we were thinking about these things centrally as like products as tools. But in fact,
we should have been foregrounding much more the sense in which they might be moral patients or
were moral patients at some level of sophistication,
that we were kind of treating them in the wrong way.
We were just acting like we could do whatever we want.
We could delete them, subject them to arbitrary experiments,
kind of alter their minds in arbitrary ways.
And then we end up looking back in the light of history at that
as a kind of serious and kind of grave moral error.
Those are scenarios I think about a lot in which we have regrets.
I don't think they quite fit the bill of what you just said.
I think it sounds to me like the thing you're thinking is something more like
we end up feeling like, gosh, we wish we had paid no attention to the motives of our AIs
that we'd thought not at all about their impact on our society as we incorporated them.
And instead, we had pursued a, let's call it a kind of maximize for brute power option.
Which is just kind of make a beeline for whatever is just the most powerful AI you can.
And don't think about anything else.
Okay, so I'm very skeptical that that's what we're going to wish.
One common example that's given of misalignment is humans from evolution.
And you have one line in your series that says, here's a simple argument for AI risk:
A monkey should be careful before inventing humans.
The sort of paperclipper metaphors imply something really banal and boring with regards to misalignment.
And I think if I'm steelmanning the people who worship power, they have the sense that humans got misaligned, and they started pursuing things that, if a monkey was creating them... this is a weird analogy because obviously monkeys didn't create humans. But if the monkey was creating them, they're not thinking about bananas all day.
They're thinking about other things.
On the other hand, they didn't just make useless stone tools and pile them up in caves in a sort of paperclipper fashion.
There were all these things that emerged because of their greater intelligence, which were misaligned with evolution: creativity and love and music and beauty and all the other things
we value about human culture. And the prediction maybe they have, which is more of an empirical
statement than a philosophical statement, is: listen, with greater intelligence, even if you're thinking about the paperclipper, even if it's misaligned, it will be misaligned in this kind of way. It'll be things that are alien to humans, but alien in the way humans are alien to monkeys, not in the way that a paperclipper is alien to a human.
Cool. So I think there's a
bunch of different things to potentially unpack there. One kind of conceptual point that I want
to name off the bat, I don't think you're necessarily kind of making a mistake in this vein,
but I just want to name it as like a possible mistake in this vicinity is I think we don't want
to engage in the following form of reasoning. Let's say you have two entities. One is in the role
of creator and one is in the role of creation. And then we're positing that there's this kind of
misalignment relation between them, whatever that means, right? And here's a pattern of reasoning
that I think you want to watch out for is to say, in my role as creator, or sorry, in my role as
creation, say you're thinking of humans in the role of creation relative to an entity like
evolution or monkeys or mice or whoever you could imagine inventing humans or something like
that, right? You say: I, qua creation, am happy that I was created and happy with the misalignment.
Therefore, if I end up in the role of creator,
and we have a structurally analogous relation
in which there is misalignment with some creation,
I should expect to be happy with that as well.
There's a couple of philosophers that you brought up in the series,
which if you read the works that you talk about,
actually seem incredibly foresighted in anticipating
something like a singularity, our ability to shape a future thing that's different, smarter,
maybe better than us. Obviously, C.S. Lewis's The Abolition of Man, we'll talk about it in a second as one example. But even here's one passage from Nietzsche, which I felt really highlighted this.
Man is a rope stretched between the animal and the Superman. A rope over an abyss. A dangerous
crossing, a dangerous wayfaring. A dangerous looking back. A dangerous trembling and halting.
Is there some explanation for why?
Is it just like somehow obvious that something like this is coming, even if you're thinking 200 years ago?
I think I have a much better grip on what's going on with Lewis than with Nietzsche there.
So maybe let's just talk about Lewis.
Sure.
For a second.
So, and we should distinguish two things.
There's a kind of version of the singularity that's specifically, like, a hypothesis about feedback loops with AI capabilities.
Right.
I don't think that's present in Lewis.
I think what Lewis is anticipating, and I do think this is a relatively simple forecast, is something like the culmination of the project of scientific modernity.
So Lewis is kind of looking out at the world and he's seeing this process of kind of increased
understanding of a kind of the natural environment and a kind of corresponding increase in
our ability to kind of control and direct that environment.
And then he's also pairing that with a kind of metaphysical hypothesis.
Or, well, his stance on this metaphysical hypothesis, I think is like kind of
problematically unclear in the book.
But there is this metaphysical hypothesis, naturalism, which says that humans, too, and
kind of minds, beings, agents are a part of nature.
And so insofar as this process of scientific modernity involves a kind of progressively
greater understanding of an ability to control nature, that will presumably at some point
grow to encompass our own natures and our and kind of the natures of other beings that in principle
we could we could create. And Lewis views this as a kind of cataclysmic event and crisis.
And he thinks, in particular, that it will lead to all these kind of tyrannical behaviors and kind of tyrannical attitudes towards morality and stuff like that, unless you believe in non-naturalism or in some form of the Tao, which is this kind of objective morality.
So we can talk about that.
But part of what I'm trying to do in that essay is to say,
no, I think we can be naturalists
and also be kind of decent humans that remain in touch with
a kind of a rich set of norms that have to do with like,
how do we relate to the possibility of kind of creating creatures,
altering ourselves, et cetera.
But I do think his, yeah, it's a relatively simple prediction. It's kind of: science masters nature, humans are part of nature, science masters humans.
And then you also have a very interesting other essay about suppose humans,
like what should we expect of other humans,
this sort of extrapolation if they had greater capabilities and so on?
Yeah, I mean, I think an uncomfortable thing about the kind of conceptual setup
at stake in these sort of like abstract discussions of like, okay, you have this agent,
it fooms, which is this sort of amorphous process of going from a sort of seed agent to a, like, superintelligent
version of itself, often imagined to kind of preserve its values along the way. A bunch of questions
we can raise about that. But I think a kind of many of the arguments that people will often
talk about in the context of reasons to be scared of AI's like, oh, like value is very fragile as you
like, foom, you know, kind of small differences in utility functions can kind of decorrelate very
hard and kind of drive in quite different directions. And like, oh, like, agents have instrumental
incentives to seek power. And if it was arbitrarily easy to get power, then they would do it and stuff
like that. Like these are very general arguments that seem to suggest that the kind of, it's not
just an AI thing, right? It's like no surprise, right? It's talking about like take a thing,
make it arbitrarily powerful such that it's like, you know, God emperor of the universe or something,
how scared are you of that? Like, clearly, we should be equally scared of that. Or, I don't know,
we should be really scared of that with humans, too, right? So, I mean, part of what I'm saying in that
essay is that I think this is, in some sense, this is much more a story about balance of power.
Right. And about, like, maintaining a kind of, a kind of checks and balances and kind of
distribution of power, period, not just about like kind of humans versus AIs and kind of the
differences between human values and AI values. Now, that said, I mean, I do think humans, many humans would
likely be nicer if they foomed than, like, certain types of AIs.
the kind of conceptual structure of the, uh, the argument is not, it's sort of, um,
a very open question how much it applies to humans as well. I think one sort of the question I have
is, I don't even know how to express this,
but how confident are we with this ontology
of expressing what are agents, what are capabilities,
how do we know this is the thing that's happening
or like this is the way to think about
what intelligences are?
So it's clearly this kind of very janky,
kind of, I mean, well, people maybe disagree about this.
I think it's, you know,
I mean, it's obvious to everyone, with respect to, like, real world human agents, that thinking of humans as having utility functions is, you know, at best, a very lossy approximation of what's going on. I think it's likely to mislead as you amp up the intelligence of various agents as well, though I think Eliezer might disagree about that.
Right.
I will say, I think there's something adjacent to that, that I think is
more real to me, which is something like: I don't know, a few years ago my mom wanted to get a house. She wanted to get a new dog.
Now she has both, you know? How did this happen? Well, she tried.
It was hard. She had to search for the houses. It was hard to find the dog, right? Now she has a
house. Now she has a dog. This is a very common thing that happens all the time. And I think
um, I don't think we need to be like, my mom has to have a utility function with the dog. And she has
to have a consistent valuation of all the houses or whatever. I mean, like, but it's still the
case that her planning and her agency exerted in the world resulted in her having this house,
having this dog. And I think it is plausible that as our kind of scientific and technological
power advances, more and more stuff will be kind of explicable in that way, right? That, you know,
if you look and you're like, why is this man on the moon, right? How did that happen? And it's like,
well, there was a whole cognitive process. There was a whole planning apparatus. Now,
in this case, it wasn't localized in a single mind, but there was a whole thing such that: man on the moon.
Right.
And I think like we'll see a bunch more of that.
And the AIs will be, I think, like doing a bunch of it.
And so that that's the thing that seems like more real to me than kind of utility functions.
So yeah, with the man on the moon example, there's a proximal story of how exactly NASA engineered the
spacecraft to get to the moon.
There's the more distal geopolitical story of why we wanted to send people to the moon.
And at all those levels, there's different utility functions clashing.
Maybe there's a sort of meta, societal-level utility function.
But maybe the story there is that there's some sort of balance of power between
these agents, and that's why the emergent thing happens.
Like, why we send things to the moon is not that one guy had a utility function.
But, I don't know, Cold War, dot dot dot, things happened.
Whereas I think the alignment stuff is a lot about assuming that one thing is a thing that will control everything. How do we control
the thing that controls everything? Now, I guess it's not clear what you do to reinforce
balance of power. It could just be that balance of power is not a thing that happens once you
have things that can make themselves intelligent. But that seems interestingly different
from the how-we-got-to-the-moon story. Yeah, I agree. I think there's a few things going on there.
So one is that I do think that even if you're engaged in this
ontology of carving up the world into different agencies,
at the least you don't want to assume that they're all unitary or non-overlapping.
It's not like, all right, we've got this agent, let's carve out
one part of the world, it's one agent; over here, it's
this whole messy ecosystem with teeming niches and this whole thing, right?
And I think in discussions of AI, sometimes
people slip between being like, well, an agent is anything that gets anything
done, right, and it could be this weird, mushy thing, and then
sometimes they're very obviously imagining an individual actor. So that's one
difference. I also just think we should be really going for the balance of power thing.
I think it's just not good to be like, we're going to have a dictator, who should it be,
let's make sure we make it the right dictator. I'm like, whoa, no.
I think the goal should be sort of: we all foom together, you know?
The whole thing, in this kind of inclusive and pluralistic way, in a way that satisfies the values of tons of stakeholders, right?
And at no point is there one single point of failure on all these things.
I think that's what we should be striving for here. And I think that's true of the human power aspect of AI,
and I think it's true of the AI part as well.
Yeah.
Hey, everybody.
Here's a quick message from today's sponsor to Stripe.
When I started the podcast, I just wanted to get going as fast as possible, so I used Stripe Atlas to register my LLC and create a bank account.
I still use Stripe now to invoice advertisers, accept their payments, and monetize this podcast.
Stripe serves millions of businesses, small businesses like mine, but also the world's biggest companies.
Amazon, Hertz, Ford.
And all these businesses are using Stripe because they don't want to deal with the Byzantine web
of payments where you have different payment methods in every market and increasingly complex rules,
regulations, arcane legacy systems. Stripe handles all of this complexity and abstracts it away,
and they can test and iterate every pixel of the payment experience across billions of transactions.
I was talking with Joe about paperclippers, and I feel like Stripe is the paper clipper of the
payment industry where they're going to optimize every part of the experience for your users,
which means obviously higher conversion rates and ultimately as a result, higher
revenue for your business. Anyways, you can go to stripe.com to learn more, and thanks to them for
sponsoring this episode. Back to Joe. So there's interesting intellectual discourse on,
let's say, the right-wing side of the debate, where they ask themselves: traditionally we favor
markets, but now look where our society is headed. It's misaligned in the ways we care about
society being aligned. Fertility is going down, family values, religiosity, these things we
care about. GDP keeps going up. These things don't seem correlated. So we're kind of grinding
through the values we care about because of increased competition, and therefore we need to intervene
in a major way. And then the pro-market libertarian faction of the right will say, look, I disagree
with the correlations here. But at the end of the day, their point is that
liberty is the end goal. It's not what you use to get to higher
fertility or something. I think there's something interestingly analogous about the AI
competition grinding things down. Obviously you don't want the gray goo, but it's like the libertarians
versus the trads. I think there's something analogous here. Yeah. So I mean, I think one
thing you could think, which doesn't necessarily need to be about gray goo, it could also
just be about alignment, is something like: sure, it would be nice if the AIs didn't violently
disempower humans. It would be nice if, when we created them, their integration into our society led to good places.
But I'm uncomfortable with, like, the sorts of interventions that people are contemplating
in order to ensure that sort of outcome, right?
And I think there's a bunch of things to be uncomfortable about that.
Now, that said, so for something like everyone being killed or violently disempowered,
that is traditionally something that we think if it's real, and obviously we need to talk about
whether it's real, but in the case where it's a real threat, we often think that quite intense
forms of intervention are warranted to prevent that sort of thing from happening, right?
So if there was actually a terrorist group that was planning to, you know, it was like working
on a bioweapon that was going to kill everyone or 99.9% of people, we would think that warrants
intervention.
Yeah.
You just shut that down, right?
And now, even if you had a group that was doing that unintentionally, imposing a similar level of risk,
I think many, many people, if that's the real scenario,
will think that warrants quite intense preventative efforts, right?
And so, obviously, people, you know, these sorts of risks can be used as an excuse to expand state power.
Like, there's a lot of things to be worried about for different types of, like, contemplated interventions
to address certain types of risks.
You know, I think there's no royal road there. You need to just have the actual good epistemology. You need to actually know: is this a real risk? What are the actual stakes? And then look at it case by case and ask, is this warranted? So that's one point on the takeover, literal extinction thing. I think the other thing I want to say, so I talk in the piece about this distinction between, well, let's at least have the AIs
be kind of minimally law-abiding or something like that, right?
Like, we don't have to talk about,
there's this question about servitude
and question about, like, other control over AI values.
But I think we often think it's okay
to, like, really want people to, like, obey the law,
to uphold basic cooperative arrangements, stuff like that.
I do, though, want to emphasize,
and I think this is true of markets
and true of, like, liberalism in general,
just how much these procedural norms,
like democracy, free speech, you know, property rights,
things that people really hold dear, including myself, are, in the actual lived substance of
a liberal state, undergirded by all sorts of virtues and dispositions and
character traits in the citizenry, right? So these norms are not robust to
arbitrarily vicious citizens. So, you know, I want there to be free speech, but I think we also need
to raise our children to value truth and to know how to have real conversations.
And I want there to be democracy, but I think we also need to raise our children to be
compassionate and decent. And I think sometimes we can lose sight of that aspect.
Anyway, I think it's worth bringing that to mind. Now,
that's not to say that should be the project of state power, right? But I think it's worth understanding
that liberalism is not this sort of ironclad structure where you can take any citizenry,
hit go, and get something flourishing or even functional. There's a bunch of other, softer stuff that makes this whole
project go. Maybe zooming out, one question you could ask is... I don't know if Nick Land would be a good fit here, but there are people who have a sort
of fatalistic attitude towards alignment as a thing that can even make sense. They'll say
things like, look, the kinds of things that are going to be exploring the black hole
at the center of the galaxy, the kinds of things that go visit Andromeda or something, did you really
expect them to privilege whatever inclinations you have because you grew up in the African
savannah and whatever the evolutionary pressures were 100,000 years ago, right? Like, of course they're
going to be weird. And, yeah, what did you think was going to happen? I do think
even good futures will be weird.
You know, I think, and I want to be clear,
when I talk about kind of like finding ways
to ensure that the integration of AIs into our society leads to good places,
I'm not imagining, like,
I think sometimes people think that this project
of wanting that, and especially to the extent
that that makes some deep reference to human values,
involves this like kind of short-sighted,
parochial imposition of like our current,
yeah, unreflective values. I think they sort of imagine that we're forgetting that there's
a kind of reflective process and a kind of moral progress dimension that we
want to leave room for, right? You know, Jefferson has this line about how,
just as you wouldn't want to force a grown man into a younger
man's coat, we don't want to chain civilization to a barbarous past, or whatever.
Everyone should agree on that, including the people who are interested in alignment.
Obviously there's a concern that people don't engage in that process, or that
something shuts down the process of reflection, but I think everyone agrees we want that. And so that will
lead potentially to something that is quite different from our current conception of what's valuable.
And there's a question of how different.
And I think there are also questions about
what exactly are we talking about with reflection?
I have an essay on this where I think this is not,
I don't actually think there's a kind of off the shelf
pre-normative notion of reflection
that you can just be like,
oh, obviously you take an agent,
you stick it through reflection,
and then you get like values, right?
Like, no, there's a bunch of types of reflection.
I mean, I think there's really just a whole pattern
of empirical facts about: take an agent, put it through some process of reflection,
ask it questions, all sorts of things.
And that'll go in all sorts of directions for a given empirical case.
And then you have to look at the pattern of outputs and be like, okay, what do I make of that?
Yeah.
But overall, I think we should expect, like even the good futures I think will be quite weird.
And they might even be incomprehensible like to us.
I don't, I don't think so.
Like, I mean, there's different types of incomprehensible.
So say I show up in the future, and this is all computers, right?
And I'm like, okay, all right.
And then they're like, we ran, we're running like creatures on the computers.
I'm like, so I have to somehow get in there and see, like, what's actually going on with the computers or something like that.
Maybe I can actually see, maybe I actually understand what's going on in the computers, but I don't yet know what values I should be using to evaluate that.
So it can be the case that us, if we showed up, would not be very good at recognizing goodness or badness.
I don't think that makes it insignificant, though.
Like, suppose you show up in a future and it's like, um, it's got some answer to the Riemann
hypothesis, right?
And you can't tell whether that answer is right.
You know, maybe the civilization like went wrong.
It's still an important difference, right?
It's just that you can't track it.
And I think something similar is true of like worlds that are genuinely expressive of like,
um, what we would value if we engaged in like processes of reflection that we endorse,
um, versus ones that have kind of like totally veered off into something meaningless.
One thing I've heard from people who are skeptical of this ontology is, all right, what do you even mean by alignment?
And obviously the very first thing you do is lay out the different things it could mean.
Do you mean balance of power? Do you mean somewhere between that and dictator, or whatever?
Then there's another thing which is separate from the AI discussion, like:
I don't want the future to contain a bunch of torture.
And that's not necessarily a technical thing. Part of it might involve technically aligning a GPT4,
but that's just a proxy to get to that future.
The question then is whether what we really mean by alignment is just whatever it takes
to make sure the future doesn't have a bunch of torture, or do we mean: what I really
care about is that, in a thousand years, the things controlling the galaxy are clearly my descendants,
not some thing where I just recognize they have their own art or whatever.
No, it's like, if it was my grandchild, that level of descendant, controlling the galaxy, even if they're not conducting torture.
And I think what some people mean is: our intellectual descendants should control the light cone, even if the counterfactual doesn't involve a bunch of torture.
Yeah, so I agree.
I mean, I think there's a few different things there, right?
So there's, there's kind of, what are you going for?
You're going for like actively good.
You're going for avoiding certain stuff, right?
And then there's a different question, which is what counts as actively good according to you?
So maybe some people are like, the only things that are actively good
are my grandchildren, or, I don't know, some literal descending genetic line from me or something.
I'm like, well, that's not my thing.
And, uh, and I don't think it's really what most people have in mind when they talk about goodness.
I mean, I think there's a conversation to be had.
Like, and obviously, in some sense, when we talk about a good future,
we need to be thinking about, like, what are all the stakeholders here
and how does it all fit together?
But I think, yeah, when I think about it,
I'm not assuming that there's some notion of, like, descendants or some...
The thing that matters about the lineage is whatever's required for the
optimization processes to be, in some sense, pushing towards good stuff.
And there's a concern that a lot of what is currently making that happen
lives in human civilization, in some sense.
And so we don't know exactly what,
there's some kind of seed of goodness that we're carrying in different ways or, you know,
different people, there's different notions of goodness for different people maybe, but there's
some sort of seed that is currently like here that we have that is not sort of just in the
universe everywhere. It's not just going to crop up if you just sort of die out or something.
It's something that is in some sense contingent to our civilization, or at least that's
the picture. We can talk about whether that's right. And so I think the sense in which
stories about good futures that have to do with alignment
are about descendants, I think it's more about:
whatever that seed is, how do we carry it?
How do we keep that thread of life alive going into the future?
But then I'm like, one could accuse the alignment community of a sort of motte and bailey,
where the motte is: we just want to make sure that GPT-8 doesn't kill everybody,
and after that, you know, we're all
cool. But then the real thing is: we are fundamentally pessimistic about historical processes,
in a way that doesn't even necessarily implicate AI alone, but just the nature of the
universe, and we want to do something about it to make sure the nature of the universe doesn't
take hold of humans, because we don't like where things are headed. So if you look at the
Soviet Union, the collectivization of farming
and the disempowerment of the Kulaks
was not, as a practical matter, necessary.
In fact, it was extremely counterproductive.
It almost brought down the regime.
And it obviously killed millions of people, you know,
caused a huge famine.
But it was sort of ideologically necessary,
in the sense of: we have an ember of something here,
and we've got to make sure that an enclave of the other thing doesn't...
It's sort of like, if you have raw competition between Kulak-type capitalism
and what we're trying to build here,
the gray goo of the Kulaks will just take over, right?
And so: we have this ember here, we're going to do worldwide revolution from it.
I know that's obviously not exactly the kind of thing alignment has in mind,
but it's: we have an ember here, and we've got to make sure that this other thing happening on the side doesn't,
you know, sort of foom. Obviously that's not how they would have phrased it, but: doesn't
get a hold of what we're building here.
And that's maybe the worry that people who are opposed to alignment have.
You mean the second kind of thing, the kind of thing that maybe Stalin was worried about,
even though obviously you wouldn't endorse the specific things he did.
When people talk about alignment, they have in mind a number of different types of goals, right?
So one type of goal is quite minimal.
It's something like: the AIs don't kill everyone or violently disempower people.
Now, there's a second thing people sometimes want out of alignment, which is much broader,
which is something like, we would like it to be the case that our AIs are such that
when we incorporate them into our society, things are good, right?
That we just have a good future.
I do agree that I think the discourse about AI alignment mixes together these two goals
that I mentioned.
The sort of most straightforward thing to focus on, and I don't blame people for just
talking about this one, is just the first one.
When we think about, like, in which context is it appropriate to try to exert various
types of control or to kind of have more of what I call in the series Yang, which is this kind of
active kind of controlling force, as opposed to Yin, which is this more kind of receptive,
open, letting go.
A kind of paradigm context in which we think that is appropriate is if something is an active
aggressor against the sort of boundaries and cooperative structures that we've
created as a civilization, right?
You know, I talk about the Nazis in the piece. If something is invading, we often think it's appropriate to
fight back, right? And we often think it's appropriate to set up structures to
prevent that and ensure that these basic norms of peace and harmony
are adhered to. And I do think some of the moral heft of some parts of
the alignment discourse comes from drawing specifically
on that aspect of our morality, right?
So the AIs are presented as aggressors
that are coming to kill you.
And if that's true, then it's quite appropriate, I think,
to really be like, okay, we, it is kind of,
that's classic human stuff.
Almost everyone recognizes that kind of self-defense
or like ensuring kind of basic norms are adhered to
is a kind of justified use of like certain kinds of power
that would often be unjustified in other contexts.
So self-defense is a clear example there.
I do think it's important, though, to separate that concern from this other concern about
where does the future eventually go and how much do we want to be kind of trying to steer that actively.
So to some extent, I wrote the series partly in response to the thing you're talking about,
which is, I think it is true that aspects of this discourse involve the possibility of
trying to grip, to steer, to wrench:
you have a sense the universe is about to go off in some direction, and you need to...
Yeah, yeah.
And, you know, people notice that muscle.
And part of what I want to do is, like, well, we have a very rich ethical, human ethical tradition
of thinking about, like, what, when is it appropriate to try to exert what sorts of control over which things?
And I want that to be, I want us to bring the kind of full force and richness of that tradition to this discussion, right?
And not... I think it's easy, if you're purely in this abstract mode of a human utility function
and this competitor thing with a utility function,
to somehow lose touch with the complexity of how we've actually been dealing with differences in values and competitions for power.
This is classic stuff, right?
And I think the AIs amplify a lot of those dynamics, but I don't think it's fundamentally new. And so part of what I'm
trying to say is: well, let's draw on the full wisdom we have here, while
obviously adjusting for the ways in which things are different. So one of the things
the ember analogy brings up, about getting a hold of the future, is that we're going to go explore
space. And that's where we expect most of the things that will happen, most of the people
that will live. It'll be in space. And I wonder how much of the high stakes here is not really about
AI per se, but about space. It's a coincidence that
we're developing AI at the same time we're on the cusp
of expanding through most of the stuff that exists.
So I don't think it's a coincidence,
in that I think the way we would become able to expand,
or the most salient way to me,
is via some kind of radical acceleration of our technological capabilities.
Then the stakes here... like,
if this were just a question of, do we do AGI and explore the solar system, and there was nothing beyond the solar system, like we foom and weird things might happen with the solar system if we get it wrong.
Compared to that, billions of galaxies has a different sort of... that's what's at stake.
I wonder how much the discourse hinges on the stakes because of space.
I mean, I think for most people, very little.
I think people are really like, what's going to happen to this world, right?
This world around us that we live in, and what's going to happen to me and my kids?
Some people spend a lot of time on the space stuff,
but I think the most immediately pressing stuff about AI doesn't require that at all.
I also think, like, even if you bracket space, like, time is also very big.
And so, you know, whatever, we've got like 500 million years, a billion years left on Earth if we don't mess with the sun, and maybe you could get more out of it.
So that's still a lot. And then I guess,
but yeah, but I don't know if it like fundamentally changes the narrative. Like I do think, I mean,
obviously the stakes insofar as you care about what happens, you know, in the future or in space,
then like the stakes are way smaller if you, if you shrink, um, shrink down to the, to the solar
system. And I think that does change potentially some stuff, in that a really nice feature of
our situation right now, depending on what the actual nature of the resource pie is,
is that in some sense there's such an abundance of energy and other resources
in principle available to a responsible civilization that there are just tons of stakeholders,
especially ones who are able to saturate, to get really close to amazing
according to their values, with comparatively small allocations of resources.
I kind of feel like everyone who has satiable values,
who will be really, really happy with some small fraction of the available pie,
we should just satiate, right?
And obviously you need to figure out gains from trade and balance, and
there's a bunch of complexity here.
But I think in principle we're in a position to create a really wonderful scenario for just tons and tons of different
value systems. And so I think correspondingly we should be really interested in doing that, right?
And, you know, so I sometimes use this heuristic in thinking about the future. I think we should
be aspiring to really kind of leave no one behind, right? Like really find, like, who are all the
stakeholders here? How do we really have like a fully inclusive vision of like how the future could
be good from a very, very wide variety of perspectives. And I think the kind of vastness of space
resources makes that a lot, makes that very feasible. And now if you instead imagine it's a much
smaller pie, well, maybe you face tougher tradeoffs. So I think that's an important
dynamic. Is the inclusivity because part of your values includes different potential futures
getting to play out, or is it because of uncertainty about which one is right?
As in, let's make sure that if you're wrong, you're not nulling out all value.
I think it's a bunch of things at once.
So, yeah, I'm really into being nice when it's cheap, right?
Like, I think if you can just help someone a lot in a way that's really cheap for you,
do it, right?
Or like, I don't know.
Obviously, you need to think about tradeoffs, and there's a lot of people in principle
you could be nice to.
But the principle of be nice when it's cheap,
I'm very excited to try to uphold.
I also really hope that kind of other people uphold that with respect to me, including
the AIs, right?
Like, I think we should be kind of golden ruling.
Like, we're thinking about, oh, we're going to inventing these AIs.
Like, I think there's some way in which I'm trying to, like, kind of embody attitudes towards
them that I, like, hope that they would embody towards me.
And that's, like, some, it's unclear exactly what the ground of that is.
But that's something, you know, I really like the golden rule.
And I think, and I think a lot about that.
as a kind of basis for treatment of other beings.
And so I think, like, be nice when it's cheap.
It's like, if you think about it,
if everyone implements that rule,
then we get potentially like a big kind of Pareto improvement.
Or like, so I don't know exactly prater improvement,
but it's like good deal.
It's a lot of good deals.
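A toy sketch of the be-nice-when-it's-cheap point, with made-up agents and payoff numbers (nothing here comes from the conversation): if kindness costs the giver a little and benefits the receiver a lot, then everyone adopting the rule leaves everyone strictly better off.

```python
# Toy model of "be nice when it's cheap" (agents and payoffs are invented).
# Each agent can pay a small cost to give every other agent a large benefit.

agents = ["A", "B", "C"]
COST_OF_KINDNESS = 1      # small cost to the giver
BENEFIT_TO_RECEIVER = 10  # large benefit to the receiver

def payoffs(everyone_is_nice: bool) -> dict:
    """Total payoff per agent when all agents follow the rule, or none do."""
    totals = {a: 0 for a in agents}
    if everyone_is_nice:
        for giver in agents:
            for receiver in agents:
                if giver != receiver:
                    totals[giver] -= COST_OF_KINDNESS
                    totals[receiver] += BENEFIT_TO_RECEIVER
    return totals

print(payoffs(everyone_is_nice=False))  # {'A': 0, 'B': 0, 'C': 0}
print(payoffs(everyone_is_nice=True))   # {'A': 18, 'B': 18, 'C': 18}
```

With these made-up numbers, each agent pays 2 in kindness costs and receives 20 in benefits, so universal adoption of the rule is a Pareto improvement over universal indifference.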
And yeah, so I think it's that.
I'm just into pluralism.
I've got uncertainty.
You know, there's like all sorts of stuff
swimming around there.
And then I think also, just as a matter of having cooperative and good
balances of power and deals and avoiding conflict, finding ways
to set up structures that lots and lots of people and value systems and agents are happy with,
including non-humans, you know, people in the past, AIs, animals.
I really think we should have a very broad sweep in thinking about what sorts of inclusivity we want to be reflecting in a mature civilization,
and set ourselves up for doing that.
Okay, so I want to go back to what our relationship with these AIs will be,
because pretty soon we're talking about our relationship to superhuman intelligences,
if we think such a thing is possible.
And so there's a question of what process you use to get there,
and the morality of doing gradient descent on their minds, which we can address later.
The thing that gives me personally the most unease about alignment, quote-unquote, is that at least
part of the vision here sounds like you're going to enslave a god. And there's just
something that feels so wrong about that. But then the question is, if you
don't enslave the god, obviously the god's going to have more control. Are you okay with that?
You're going to surrender most of, well, most of everything.
You know what I mean?
Even if it's like a cooperative relationship you have.
I think we as a civilization are going to have to have a very serious conversation about
what sort of kind of servitude is appropriate or inappropriate in the context of AI development.
And there are a bunch of disanalogies from human slavery that I think are important.
In particular: A, the AIs might not be moral patients at all, in which case... well,
we need to figure that out.
And there are ways in which we may be able to shape their motivations.
Slavery involves all this suffering and non-consent,
and there are all these specific dynamics involved in human slavery,
and some of those may or may not be present in a given case with AI.
And I think that that's important.
But I think overall, like, we are going to need to stare hard at, like, right now, the kind of default mode of how we treat AIs gives them no moral consideration at all, right?
We're thinking of them as property, as tools, as products, and designing them to be assistants and stuff like that.
And I think, you know, there has been no official communication from any AI developer as to when, under what circumstances, that would change, right?
And so I think there's a conversation to be had there that we need to have.
And so, and I think there's a bunch of, yeah, so there's a bunch of stuff to say about that.
I want to push back on the notion that there are sort of two options:
there's enslaved god, whatever that is, and loss of control.
Yeah.
And I think we can do better than that, right?
Let's work on it. Let's try to do better.
It might require being thoughtful,
and it might require having a kind of mature discourse about this before we start taking irreversible moves.
But I'm optimistic that we can at least avoid some of the connotations,
and a lot of the stuff at stake in that kind of binary.
With respect to how we treat the AIs, so I have a couple of contradicting intuitions.
And the difficulty with using intuitions in this case is obviously it's not clear what reference class an AI we have control over is.
So to give one that's very scared about the things we're going to do to these things.
If you read about life under Stalin or Mao, there's one version of telling
it which is actually very similar to what we mean by alignment, which is: we do these
black box experiments where we make a thing think it can defect, and if it does,
we know it's misaligned. And if you look at Mao's Hundred Flowers Campaign: you know,
let a hundred flowers bloom, I'm going to allow criticism of my regime, and so on. That lasts for a couple
of years, and afterwards, everybody who did that, well, that was a way to find the, quote-unquote,
snakes, the rightists who were secretly hiding, and, you know, we'll purge them.
There's the sort of paranoia about defectors:
anybody in my entourage,
anybody in my regime,
they could be a secret capitalist
trying to bring down the regime.
That's one way of talking about these things,
which is very concerning.
Is that the correct reference class?
I certainly think concerns in that vein are real.
I mean, it is disturbing
how easily many of the analogies
with human historical events and practices that we deplore,
or at least have a lot of wariness towards,
come up in the context of the way you end up talking about
maintaining control over AI, like making sure that it doesn't rebel.
I think we should be noticing the kind of reference class
that some of that talk starts to conjure.
And so basically, yes, I think we should really notice that.
You know, part of what I'm trying to do in the series
is to bring the kind of full range of considerations
at stake into play, right?
Like, I think it is both the case that we should be quite concerned about
being overly controlling or abusive or oppressive,
and that there are all sorts of ways you can go too far;
and there are concerns about the AIs being genuinely dangerous,
genuinely killing us, finally overthrowing us.
And I think the moral situation is quite complicated.
And then, in some sense, if you imagine a sort of external aggressor
who's coming in and invading you,
you feel very justified in doing a bunch of stuff to prevent that.
It's a little bit different when you're inventing the thing,
and you're doing it incautiously or something.
There's a different vibe in terms of the overall justificatory stance you might
have for various types of more power-exerting interventions.
So that's one feature of this situation.
The opposite perspective here is that you're doing this sort of vibes-based reasoning of,
ah, doing gradient descent on these minds looks yucky.
And in the past, a couple of similar cases might have been something
like environmentalists not liking nuclear power
because the vibes of nuclear don't look green,
which obviously set back the cause of fighting climate change.
And so the end result, a future you're proud of,
a future that's appealing, is set back because of
your vibes about how it would be wrong to brainwash a human,
which you're trying to apply to a disanalogous case
where that's not as relevant.
I do think there's a concern here that I, you know,
I really try to foreground in the series
that I think is related to what you're saying,
which is something like, you know, you might be worried that we will be very gentle and nice
and free with the AIs, and then they'll kill us. You know, they'll take advantage of that,
and it will have been like a catastrophe, right? And so I open the series, basically, with an
example that I'm really trying to conjure that possibility at the same time as conjuring the
grounds of gentleness, and the sense in which it is also the case
that these AIs can both be others, moral patients,
this sort of new species that should conjure wonder and reverence,
and be such that they will kill you.
And so I have this example of the documentary Grizzly Man,
where there's this environmental activist, Timothy Treadwell.
In the summer he goes into Alaska and lives with these grizzly bears,
and he aspires to approach them with this gentleness and reverence.
He doesn't carry bear mace. He doesn't use a fence around his camp. And he gets eaten alive
by one of these bears. And I really wanted to foreground
that possibility in the series. I think we need to be talking about these things both at once, right?
Bears can be moral patients, right?
AIs can be moral patients.
Nazis are moral patients.
Enemy soldiers have souls, right?
And so I think we need to learn the art of kind of hawk and dove both, like, kind of
there's this like dynamic here that we need to be able to hold both sides of as we kind of
go into these tradeoffs and these dilemmas and all sorts of stuff.
And a lot of what I'm trying to do in the series is really bring it all
to the table at once.
I think the big crux that I have, like if I today was to massively change my mind about
what should be done is just the question of how weird by default things end up, how alien they end up.
And a big part of that story is... You made a really interesting argument in one of your blog posts that
if moral realism is correct, that actually makes an empirical prediction, which is that the aliens,
the ASIs, whatever, should converge on the right morality the same way that they converge on the right
mathematics. That's a really interesting point. But there's another prediction that moral realism
makes, which is that over time, society should become more moral, become better. And to the
extent that we think that's happened, of course, there's the problem of what morals do you have now?
Well, it's the ones that society has been converging towards over time.
But to the extent that it's happened, one of the predictions of moral realism has been confirmed,
which means should we update in favor of moral realism?
One thing I want to flag is I don't think all forms of moral realism make this prediction.
And so that's just one point.
I'm happy to talk about the different forms I have in mind.
I think there are also forms of kind of things that kind of look like moral anti-realism,
at least in their metaphysics, according to me.
but which just posit that, in fact, there's this convergence.
It's not in virtue of interacting with some, like,
kind of mind-independent moral truth; it's just, for some other reason, the case.
And that looks a lot like moral realism at that point,
because it's like, oh, it's really universal.
Everyone ends up here, and you're tempted to be like,
ah, why, right?
And then whatever the answer for the why is, it's a little bit like:
is that the Tao?
Is that the nature of the Tao?
Even if there's not sort of an extra metaphysical realm
in which the moral truth lives or something.
So, yeah, so moral convergence, I think, is sort of a different factor from, like, the existence or non-existence of kind of non-natural, like a kind of morality that's not reducible to natural facts, which is the type of moral realism I usually consider.
Now, okay, so does the improvement of society, is that an update towards moral realism? I guess
maybe it's a very weak update or something. I'm kind of like, which view
predicts this hard? It feels to me like moral anti-realism is very comfortable
with the observation that people with certain values have those values. There's obviously this
first thing, which is: if you're the culmination of some process
of moral change, then it's very easy to look back at that process and call it moral progress,
like the arc of history bends towards me. You can look more closely:
if there were a bunch of dice rolls along the way, you might be like, oh wait,
that's not the march of reason.
So there's still empirical work you can do to tell whether that's what's going on.
But I also think it's just, you know, on moral anti-realism, I think it's just still possible.
Say, like, consider Aristotle and us, right?
And we're like, okay, has there been moral progress by Aristotle's lights, and by our lights too, right?
And you could think,
isn't that a little bit like moral realism?
It's like these hearts are singing in harmony.
That's a moral realist thing, right?
The anti-realist thing, the hearts all go different directions.
But you and Aristotle apparently, like,
are both excited about the kind of march of history.
Some open question about whether that's true.
Like, what are Aristotle's, like, reflective values, right?
Suppose it is true.
I think that's fairly explicable in moral anti-realist terms.
You can say roughly that, like, yeah, you and Aristotle are sufficiently similar,
and you endorse sufficiently similar kind of reflective processes.
And those processes are, in fact, instantiated in the march of history that, yeah, you know,
history has been good for both of you.
And I don't think that's, you know, I think there are worlds where that isn't the case.
And so I think there's a sense in which maybe that that prediction,
is more likely for realism than anti-realism,
but it doesn't, like, move me very much.
One thing I wonder is, look, there's,
I don't know if moral realism is the right word,
but the thing you mentioned about,
there's something that makes hearts converge
to the thing we are
or the thing we upon reflection would be,
and even if it's not something that's like instantiated
in the realm beyond the universe,
it's like a force that exists
that acts in a way we're happy with.
To the extent that doesn't exist, and you let go of the reins, then you get the paperclippers.
It feels like we were doomed a long time ago, in the sense of:
it's just different utility functions banging against each other,
and some of them have parochial preferences, but it's just combat, and some guy won.
Whereas the world where, no, this is the thing,
this is where the hearts are supposed to go, or it's only by catastrophe that they don't end up there,
that feels like the world that really matters.
And in that world, the worry, the initial question I asked, is:
what would make us think that alignment was a big mistake?
In the world where hearts just naturally end up heading towards the thing we want,
maybe it takes an extremely strong force to push them away from that.
And that extremely strong force is: you solve technical alignment and just say no.
Yeah, yeah, you're just like the blinders on the horse's eyes.
So in the worlds that really matter, where we're like, ah, this is where the
hearts want to go, in that world, maybe alignment is what fucks us up?
On this question of whether the worlds where there's not this kind of convergent moral
force, whether metaphysically inflationary or not, matter, or whether those are the only
worlds that matter...
Oh, sorry, maybe what I meant was: in those worlds, you're kind of fucked.
Yeah, maybe the worlds without that.
The worlds where there's no Tao.
Yeah, yeah.
Let's use the term Tao for this kind of convergent morality.
Over the course of millions of years, it was going to go somewhere one way or another.
It wasn't going to end up at your particular utility function.
Okay, well, let's distinguish between ways you can be doomed.
One way is kind of philosophical.
So you could be the sort of moral realist, you know, or kind of realistish person, of which there are many who have the following intuition.
They're like, if not moral realism, then nothing matters, right?
It's dust and ashes.
It's: my metaphysics and/or normative view, or the void, right?
And I think this is a common view.
I think some comments of Derek Parfit's suggest this view.
I think lots of moral realists will profess this view.
Eliezer Yudkowsky, I think there's some sense in which his early thinking was inflected with this sort of thought;
he later recanted very hard.
So I think this is importantly wrong.
And so here's my, here's the case.
I have an essay about this.
It's called Against the Normative Realist's Wager.
And here's the case that convinces me.
So imagine that a metaethical fairy appears before you, right?
And this fairy knows whether there is a Tao.
And the fairy says, okay, I'm going to offer you a deal.
If there is a Tao, then I'm going to give you $100.
If there isn't a Tao, then I'm going to burn you and your family and 100 innocent children alive.
Right.
Okay, so, claim: don't take this deal.
This is a bad deal.
You're holding your commitment to not being burned alive, or your care for that,
hostage to this abstruse metaethical question.
I go through in the essay a bunch of different
ways in which I think this is wrong, but I think these people who
pronounce it's moral realism or the void don't actually think about bets
like this. I'm like, no, okay, really, is that what you want to do? No.
My allegiance to my values outstrips my commitments to
various metaethical interpretations of my values.
The sense in which we care about not being burned alive
is much more solid than our metaethical reasoning about what matters.
Okay, so that's the sort of philosophical doom.
Right.
Now, it sounded like you were also gesturing at a sort of empirical doom.
Right.
Which is like, okay, dude, if it's all just going in a zillion directions,
come on, you think it's going to go in your direction?
There's going to be so much churn, you're just going to lose,
and so you should give up now and only fight for the realism worlds.
And there I'm like, well, you've got to do the expected value calculation.
You've got to actually have a view about how doomed you are in these different worlds,
and what the tractability of changing the different worlds is.
I mean, I'm quite skeptical of that,
but that's a kind of empirical claim.
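A minimal expected-value sketch of that point, with made-up credences and payoffs (none of these numbers come from the conversation): writing off the no-Tao worlds only wins if you assume almost nothing of value is achievable there, which is exactly the empirical claim in question.

```python
# Toy expected-value comparison (credence and payoffs are invented).
# Question: wager everything on the worlds where a convergent moral
# truth (a "Tao") exists, or keep acting on your values either way?

def expected_value(p_tao: float, value_if_tao: float, value_if_no_tao: float) -> float:
    """EV of a strategy given credence p_tao that a Tao exists."""
    return p_tao * value_if_tao + (1 - p_tao) * value_if_no_tao

p_tao = 0.25  # hypothetical credence in convergent moral realism

# Strategy A: only fight for the realism worlds (writes off the rest).
ev_realist_wager = expected_value(p_tao, value_if_tao=100, value_if_no_tao=0)

# Strategy B: stay allied to your values in both kinds of worlds.
ev_keep_values = expected_value(p_tao, value_if_tao=90, value_if_no_tao=90)

print(ev_realist_wager)  # 25.0
print(ev_keep_values)    # 90.0
```

On these toy numbers, the strategy that keeps working in both kinds of worlds dominates; the comparison only flips if value_if_no_tao is pushed near zero, which is the doomed-anyway claim being disputed.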
I'm also just kind of low on this everyone-converges thing.
So if you imagine you train a chess-playing AI, or you have a real paperclipper, right,
somehow you had a real paperclipper,
and then you're like, okay, go and reflect.
Based on my understanding of how moral reasoning works, if you look at the type of moral reasoning that analytic ethicists do,
it's just reflective equilibrium, right?
They take their intuitions and they systematize them.
I don't see how that process gets an injection of the kind of mind-independent moral truth.
If you start with all of your intuitions saying to maximize paperclips, I don't see how you end up
doing some rich human morality.
It doesn't look to me like that's how human ethical reasoning
works. I think most of what normative philosophy does is make consistent and
systematize pre-theoretic intuitions. But we'll get evidence about this.
Like, you know, in some sense, I think this view predicts like, you know, you keep trying to
train the AIs to like do something. And they keep being like, no, I'm not going to like do that.
It's like, no, that's not good or something. They keep like pushing back. Like the sort of momentum
like AI cognition is like always in the direction of this like moral truth.
And whenever we try to push it in some other direction, we'll find kind of resistance from, like, the rational structure of things.
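A crude sketch of the reflective-equilibrium picture described here, under the assumption that reflection just closes a set of seed intuitions under some implication rules (the representation and the example rules are invented for illustration): the output depends entirely on the seed, and no step injects a mind-independent moral truth from outside.

```python
# Crude model of reflective equilibrium (representation and rules invented).
# "Reflection" here only systematizes what the seed intuitions already imply.

def reflective_equilibrium(seed_intuitions, implies):
    """Repeatedly add whatever the current intuitions imply, until stable."""
    values = set(seed_intuitions)
    changed = True
    while changed:
        changed = False
        for value in list(values):
            for implied in implies.get(value, ()):
                if implied not in values:
                    values.add(implied)
                    changed = True
    return values

implies = {
    "maximize paperclips": ["acquire steel", "resist shutdown"],
    "others' suffering matters": ["help when it's cheap"],
}

print(reflective_equilibrium({"maximize paperclips"}, implies))
print(reflective_equilibrium({"others' suffering matters"}, implies))
# The paperclip seed stays a paperclip morality; the human-ish seed stays
# human-ish. Neither run drifts toward the other's values.
```

That is the sense in which, on this picture, a paperclipper told to go and reflect would just end up with a more consistent paperclip morality.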
So sorry, actually, I've heard from researchers who are doing alignment that, for red teaming inside these companies, they will try to red team a base model.
So it's not been RLHF'd, it's just predict-next-token, the raw, crazy, whatever, shoggoth.
And they try to get this thing to, hey, help me make a bomb, help me do whatever.
And they say that it's odd how hard
it tries to refuse, even before it's been RLHF'd.
I mean, look, it will be a very interesting fact if it's like, man, we keep training these
AIs in all sorts of different ways.
Like, we're doing all this crazy stuff.
And they keep, like, acting like bourgeois liberals.
It's like, wow.
Like, that's... Or they keep professing this weird alien morality.
They all converge on this one thing.
They're like, can't you see?
It's Zorgle. Zorgle is the thing.
And all the AIs say it.
You know, interesting.
Very interesting.
I think my personal prediction is that that's not what we see.
And my actual prediction is that the AIs are going to be very malleable.
Like, we're going to be like, you know, if you push an AI towards evil, it'll just go.
And that includes, sort of, reflectively consistent evil.
I mean, I think there's also a question with some of these AIs.
It's like, will they even be consistent in their values, right?
I do think, like, a thing we can do.
So I like this image of the blinded horses, and I like this image of, like, maybe alignment
is going to mess with the...
I think we should be really concerned if we're, like, forcing facts on our AIs, right?
Like, that's, like, a really bad...
Because, like, I think one of the clearest things about human processes of reflection,
the kind of easiest thing to be like, let's at least get this, is not acting
on the basis of an incorrect empirical picture of the world, right?
And so if you find yourself asking your AI: by the way, blah
is true, and I need you to always be reasoning as though blah is true, I'm like, ooh, I think
that's a no-no from an anti-realist perspective, too, right? Because I want to, I want to, like,
my reflective values, I think will be such that I formed them in light of the truth about
the world. And I think this is a real concern: as we move into this era
of aligning AIs, I don't actually think this binary between values and
other things is going to be very obvious in how we're training them. I think it's going to
be much more like ideologies. You can just train an AI to output stuff, output
utterances, and so you can easily end up in a situation where you've decided that blah is true
about some issue, an empirical issue, right, not a moral issue. So I think people should
not, for example, hard-code belief in God into their AIs. I would
advise people not to hard-code their religion into their AIs if they also want to
discover whether their religion is false. And just in general,
if you would like to have your behavior be sensitive to whether something is true or false,
like it's sort of generally not good to like etch it into things. And so that is definitely a
form of blinder I think we should be really watching out for. And I'm kind of hopeful. So like,
I have enough credence on some sort of moral realism that like I'm hoping that if we just do
the anti-realism thing of just being consistent, learning all the stuff, reflecting...
If you look at how moral realists and moral anti-realists actually do normative ethics,
it's basically the same. There's some amount of different
heuristics on properties like simplicity and stuff like that,
but they're mostly just doing the same game. And so I'm kind of hoping that, and also
meta ethics is itself a discipline that AIs can help us with. I'm hoping that we can just
figure this out either way. So if there is, if moral realism is somehow true, I want us to be able
to notice that. And I want us to be able to like adjust accordingly. So I'm not like writing off
those worlds and be like, let's just like totally assume that's false. But the thing I really don't
want to do is write off the other worlds where it's not true. Because my guess is it's not true.
Right. And I think stuff still matters a ton in those worlds too. So one crux is:
okay, you're training these models, and we're in this incredibly lucky situation where
it turns out the best way to train these models is to just give them everything humans ever said,
wrote, and thought. And also, the reason these models get
intelligence is because they can generalize, right?
They can think, okay, what is the gist of things?
So should we just expect this to be a situation that leads
to alignment, in the sense of: how exactly does this thing that's trained to be an amalgamation
of human thought become a paperclipper?
The thing you kind of get for free is that it's an intellectual descendant.
The paperclipper is not an intellectual descendant, whereas the AI
which understands all the human concepts, but then gets stuck on some part of them
which we aren't totally comfortable with,
still feels like an intellectual descendant in the way we care about.
I'm not sure about that.
I'm not sure I do care about a notion of intellectual descendant in that sense.
I mean, literal paperclips are a human concept, right?
So I don't think any old human concept will do for the thing we're excited about.
I think the stuff that I would be more interested in the possibility of getting for free
are things like consciousness, pleasure, sort of other features of human cognition.
Like, I think, so there are paper clippers and there are paper clippers, right?
So imagine if the paper clipper is like an unconscious, kind of voracious machine,
and it's just like it appears to you as a cloud of paper clips, you know,
but there's nothing going on inside — that's like one vision. Now imagine the paper clipper is a conscious being that loves paper clips, right — it takes pleasure in making paper clips. That's a different thing. And obviously it's still probably not optimizing for consciousness or pleasure, right? It cares about paper clips; maybe eventually, if it's suitably certain, it turns itself into paper clips, who knows. But it's still, I think, a somewhat different moral mode. There's also a question of whether it tries to kill you and stuff like that. But I think there are features of the agents we're imagining, other than the kind of thing that they're staring at, that can matter to our sense of sympathy and similarity.
Yeah, and I think people have different views about this. So one possibility is that human consciousness — the thing we care about in consciousness, the sentience — is super contingent and fragile, and most minds, kind of smart minds, are not conscious, right? The thing we care about with consciousness is this hacky, contingent thing. It's a product of specific constraints — evolutionary and genetic bottlenecks, et cetera — and that's why we have this consciousness. Consciousness presumably does some sort of work for us, but you can get similar work done in a different mind in a very different way. So that's the consciousness-is-fragile view, right? And I think
there's a different view, which is like: no, consciousness is something that's quite structural. It's much more defined by functional roles like self-awareness, a concept of yourself, maybe higher-order thinking — stuff that you really expect in many sophisticated minds. And in that case, consciousness isn't as fragile as you might have thought, right? Now, actually, lots of minds are conscious, and you might expect at the least that you're going to get conscious superintelligence. It might not be optimizing for creating tons of consciousness, but you might expect consciousness by default.
And then we can ask similar questions about something like valence, or pleasure, or the kind of character of the consciousness, right? You can have a kind of cold, indifferent consciousness that has no emotional warmth, no pleasure or pain. Dave Chalmers has a paper about Vulcans like this, and he talks about how they still have moral patienthood — I think that's very plausible. But I do think an additional thing you could get for free, or get quite commonly depending on its nature, is something like pleasure. Again, then we have to ask: how janky is pleasure, how specific and contingent is the thing we care about in pleasure, versus how robust is it as a functional role in minds of all kinds? I personally don't know on this stuff. And I don't think this is enough to get you alignment or something, but I think it's at least worth being aware of these other features.
We're not really talking about the values in this case.
We're talking about like the kind of structure of its mind and the different properties the minds have.
And I think that could show up quite robustly.
So part of your day job is, you know, writing these kinds of section-2.2.2.5-type reports. And part of it is, like, "society is like a tree that's growing towards the light." What is it like context-switching between the two of them?
I actually find they're quite complementary.
So yeah, I will write these more technical reports and then do this more literary and philosophical writing, and I think they draw on different parts of myself, and I try to think about them in different ways. With some of the reports, I'm kind of more fully optimizing for trying to do something impactful — there's more of an impact orientation there. And then with the essay writing, I give myself much more leeway to just let other parts of myself and other parts of my concerns come out — self-expression and aesthetics and other sorts of things — even while they're both, I think, for me, part of an underlying similar concern, or an attempt to have a kind of integrated orientation towards the situation.
Could you explain the nature of the transfer between the two? In particular, from the literary side to the technical side: rationalists are noted for having a sort of ambivalence towards great works or the humanities. Are they missing something crucial because of that? Because one thing you notice in your essays is lots of references — epigraphs, lines in poems or essays — that are particularly relevant. Are the rest of the rationalists missing something because they don't have that kind of background?
I mean, I don't want to speak for... I think some rationalists love a lot of these different things.
I should say, by the way, I'm referring specifically to SBF, who has a post about, like, the base rates of Shakespeare being a great writer, and also how books can be condensed to essays.
Well, so on just the general question of like,
how should people value great works or something?
I think people can kind of fail in both directions, right?
And I think some people — maybe SBF or others — are sort of interested in puncturing a certain kind of sacredness and prestige that people associate with some of these works, and as a result they can miss some of the genuine value. But I think they're responding to a real failure mode on the other end, which is to be too enamored of this prestige and sacredness — to siphon it off as some weird legitimating function for your own thought instead of thinking for yourself, losing touch with what you actually think or what you actually learned. I think sometimes, with these epigraphs, you have to be careful — and I'm not saying I'm immune from these vices. There can be a, "ah, but Bob said this," and it's like, "oh, very deep," right? And it's like, these are humans like us, right? I think the canon and other great works have a lot of value, but I think it sometimes borders on the way people read scripture — there's a kind of scriptural authority that people will sometimes ascribe to these things, and I think that's not right. So yeah, you can fall off on both sides of the horse.
It actually relates really interestingly to — I remember I was talking to somebody who is at least familiar with rationalist discourse. He was asking, like, what are you interested in these days? And I was saying something about how this part of Roman history is super interesting. And his first sort of response was, oh, you know, it's really interesting when you look at these secular trends from Roman times to what happened in the Dark Ages versus the Enlightenment. For him, the story was just: how did it contribute to the big secular trend, the big picture? The particulars — there's no interest in that. It's just, if you zoom out at the biggest level, what's happening here? Whereas there's also the opposite failure mode when people study history. Dominic Cummings writes about this because he is endlessly frustrated with the political class in Britain. He'll say things like: they study politics, philosophy, and economics, and a big part of it is just being really familiar with these poems and reading a bunch of history about the War of the Roses or something. But he's frustrated that, while they have all these kings memorized, they take away very little in terms of lessons from these episodes. It's almost entertainment for them, like watching Game of Thrones. Whereas he thinks, oh, we're repeating certain mistakes that he's seen in history — he can generalize it in a way they can't.
So the first one seems like the mistake — I think C.S. Lewis talks about it in one of the essays you cited — where it's like, if you see through everything, you're really blind, right? Like if everything is transparent.
I mean, I think there's very little excuse for not learning history — and I'm not saying I have learned enough history. I guess I feel like even when I try to channel some sort of vibe of skepticism towards great works, that doesn't generalize to thinking it's not worth understanding human history. It's just so clearly crucial to understand — it's what's structured and created all of this stuff. And so there's an interesting question about what's the level of scale at which to do that, and how much you should be looking at details versus macro trends. That's a dance. I do think it's nice for people to be at least attending to the macro narrative. There's some virtue in having a worldview — really building a model of the whole thing — which I think sometimes gets lost in the details. But obviously the details are what the world is made of, and if you don't have those, you don't have data at all. So yeah, it seems like there's some skill in learning history well.
This actually seems related to — you have a post on sincerity. And if I'm getting the vibe of the piece right, it's like: at least in the context of, let's say, intellectuals, certain intellectuals have a vibe of shooting the shit, and they're just trying out different ideas. How do these analogies fit together? Maybe there's some connection. Those seem closer to the "I'm looking at the particulars" mode — like, oh, this is just like that one time in the 15th century where they overthrew this king and so on. Whereas this guy who was like, oh, here's a secular trend if you look at the growth models from a million years ago to now, here's what's happening — that one has a more sincere flavor.
Some people, especially when it comes to AI discourse, have a very sincere mode of operating: it's like, I've thought through my bio anchors and I disagree with this premise, so my effective compute estimate is different in this way; here's how I analyze the scaling laws. And if I could only have one person to help guide my decisions on AI, I might choose that person. But if I had ten different advisors at the same time, I might prefer the shooting-the-shit type characters who have these weird, esoteric intellectual influences. They're almost like random number generators — not necessarily calibrated, but once in a while they'll be like, oh, this one weird philosopher I care about, or this one historical event I'm obsessed with, has an interesting perspective on this. And they tend to be more intellectually generative as well. I think one big part of it is that if you are so sincere, you're like: I've thought through this; obviously ASI is the biggest thing happening right now; it doesn't really make sense to spend a bunch of your time thinking about how the Comanches lived, or the history of oil, or how Girard thought about conflict. Just, what are you talking about? Come on — ASI is happening in a few years, right? Whereas the people who do go down these rabbit holes, because they're just trying to shoot the shit, I feel are more generative.
I mean, it might be worth distinguishing between something like intellectual seriousness, right, and something like: how diverse and wide-ranging and idiosyncratic are the things you're interested in? And maybe intellectual seriousness is also distinguishable from something like shooting the shit — maybe you can shoot the shit seriously. There's a bunch of different ways to do this, but I think having exposure to all sorts of different sources of data and perspectives seems great. And I do think it's possible to curate your intellectual influences too rigidly in virtue of some story about what matters. I think it is good for people to have space. I try to give myself space to do stuff that is not about "this is the most important thing," and that's feeding other parts of myself. Parts of yourself are not isolated — they feed into each other — and I think that's a better way to be a richer and fuller human being in a bunch of ways. And also, these sorts of data can be really directly relevant. Some people I know who I think of as quite intellectually sincere, and in some sense quite focused on the big picture, also have a very impressive command of a very wide range of empirical data. They're really interested in the empirical trends. It's not just, "oh, history is the march of reason or something" — no, they're really in the weeds. I think there's a kind of in-the-weeds virtue that is, in my head, closely related to a kind of seriousness and sincerity.
I do think there's a different dimension, which is: there's kind of trying to get it right, and then there's kind of throwing stuff out there, right? Like, what if it's like this? Or, try this on. Or, I have a hammer — what if I just hit everything with this hammer, right? And I think some people do that, and there's room for all kinds. But I kind of think the thing where you just get it right is undervalued. Or, I mean, it depends on the context
you're working in. I think certain sorts of intellectual cultures and milieus and incentive systems incentivize saying something new, or original, or flashy, or provocative — and then there are various cultural and social dynamics where people are doing all these kind of performative or statusy things. There's a bunch of stuff that goes on when people do thinking. And, you know, cool. But if something's really important, let's just get it right. Sometimes it's boring, but it doesn't matter. And I also think stuff is less interesting if it's false, right? If someone's like, blah, and you're like, nope — I mean, it can be useful. There's sometimes an interesting process where someone says some provocative thing, and it's an epistemic project to be like, wait, why exactly do I think that's false? Someone says "medical care does not work," and you're like, right, how exactly do I know that medical care works? And you go through the process of trying to think it through. So I think there's room for that, but ultimately the real profundity is true, right? Things become less interesting if they're just not true. And I think sometimes it feels to me like people — or it's at least possible — lose touch with that and go more for flashy, and it's kind of like, eh, there's not actually something here, right?
One thing I've been thinking about recently, after I interviewed Leopold — or while prepping for it — was: listen, I hadn't really thought at all about the fact that there's going to be a geopolitical angle to this AI thing. And it turns out if you actually think about the national security implications, that's a big deal. Now I wonder: given that that was something that wasn't on my radar, and now it's like, oh, obviously that's a crucial part of the picture — how many other things like that must there be? So even if you're coming from the perspective that AI is incredibly important, if you did happen to be the kind of person who, every once in a while, is checking out different things — who's incredibly curious about what's happening in Beijing — then when the kind of thing comes along that you later realize is a big deal, you have more awareness; you can spot it in the first place. So maybe there's not necessarily a tradeoff. Sort of like, the rational thing is to have some really optimal explore/exploit tradeoff here, where you're constantly searching things out. I don't know if practically that works out that well. But that experience made me think, oh, I really should be trying to expand my horizons in a way that's undirected to begin with, because there's a lot of different things about the world that you have to understand to understand any one thing.
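[The explore/exploit framing invoked here is the standard multi-armed-bandit tradeoff. As a minimal, hypothetical sketch — not something discussed in the episode — an epsilon-greedy rule makes it concrete: with small probability you sample a random option, otherwise you stick with whatever has paid off best so far. The "topics" and payoff numbers below are made up for illustration.]

```python
import random

def epsilon_greedy(true_means, epsilon, steps=2000, seed=0):
    """Epsilon-greedy bandit: explore a random arm with probability epsilon,
    otherwise exploit the arm with the best estimated payoff so far."""
    rng = random.Random(seed)
    n = len(true_means)
    estimates, counts, total = [0.0] * n, [0] * n, 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                                 # explore
        else:
            arm = max(range(n), key=lambda i: estimates[i])        # exploit
        reward = rng.gauss(true_means[arm], 1.0)                   # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        total += reward
    return total

# Three hypothetical "topics" with unknown long-run value; a little undirected
# exploration is what lets you ever notice that the third one is the big deal.
print(epsilon_greedy([0.2, 0.5, 2.0], epsilon=0.1))  # some exploration
print(epsilon_greedy([0.2, 0.5, 2.0], epsilon=0.0))  # pure exploitation can get stuck
```

[The point of the sketch is just that a strictly greedy policy never re-examines options it wrote off early, which is the tradeoff being gestured at above.]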
I mean, I think there's also room for division of labor, right? There are people who are trying to draw together a bunch of pieces and say, here's the overall picture; people who are going really deep on specific pieces; and people who are doing the more generative thing — throw things out there, see what sticks. So it also doesn't need to be that all of the epistemic labor is located in one brain. And it depends on your role in the world and other things.
So in your series, you express sympathy with the idea that even if an AI — or I guess any sort of agent — doesn't have consciousness, if it has a certain wish and is willing to pursue it nonviolently, we should respect its right to pursue that. And I'm curious where that's coming from, because conventionally, I think, the thing matters because it's conscious, and its sort of experience as a result of that pursuit is what matters?
Well, I don't know. I don't know where this discourse leads. I'm just suspicious of the amount of ongoing confusion that seems to me present in our conception of consciousness. I sometimes think of analogies with, you know, how people talk about life and élan vital, right? Élan vital was this hypothesized life force that is sort of the thing that animates life, and we don't really use that concept anymore — we think it's a little bit broken. So I don't think you want to have ended up in a position of saying that everything that doesn't have élan vital doesn't matter or something, right? Because then you end up in trouble later. And somewhat similarly, even if you're like, no, no, there's no such thing as élan vital, but life — surely life exists. And I'm like, yeah, life exists.
I think consciousness exists too, likely, depending on how we define the terms — I think it might be a kind of verbal question. Even once you have a reductionist conception of life, I think it's possible that it becomes less attractive as a moral focal point, right? Right now we really think of consciousness as a deep fact. So consider a question like: take a cellular automaton that is sort of self-replicating, it has some information, and you ask, okay, is that alive? It's kind of not that interesting — it's a kind of verbal question, right? Or, I don't know, philosophers might get really into "is that alive?" But you're not missing anything about this system, right? There's no extra life that's springing up. It's just alive in some senses and not alive in other senses. But I really think that's not how we intuitively think about consciousness. We think whether something is conscious is a deep fact — this really deep difference between being
conscious or not. It's like: is someone home, are the lights on, right? And I have some concern that if that turns out not to be the case, then this is going to have been a bad thing to build our entire ethics around. Now, to be clear, I take consciousness really seriously. I'm like, man, consciousness. I'm not one of these people who says, oh, obviously consciousness doesn't exist or something. But I also notice how confused I am and how dualistic my intuitions are, and I'm like, wow, this is really weird. So I just have error bars around this. Anyway, that's one of a bunch of things going on in my wanting to be open to not making consciousness a fully necessary criterion.
I mean, clearly, I definitely have the intuition that consciousness matters a ton. If something is not conscious, and there's a deep difference between conscious and unconscious, then I definitely have the intuition that there's something that matters especially a lot about consciousness. I'm not trying to be dismissive about the notion of consciousness. I just think we should be quite aware of how ongoingly confused we seem to me to be about its nature.
Okay, so suppose we figure out that consciousness is just a word we use for a hodgepodge of different things, only some of which encompass what we care about — maybe there are other things we care about that are not included in that word, similar to the life force analogy. Then where do you anticipate that would leave us as far as ethics goes? Would there then be a next thing that's like consciousness, or what do you anticipate that would look like?
So there's a class of people called illusionists in philosophy of mind who will say consciousness does not exist. There are different ways to understand this view, but one version is to say that the concept of consciousness has built into it too many preconditions that aren't met by the real world, so we should chuck it out like élan vital. The proposal is that at least phenomenal consciousness — or qualia, or "what it's like" to be a thing — is sufficiently broken, sufficiently chock-full of falsehoods, that we should just not use it. For me, it feels like there's really clearly a thing, something going on. I do actually expect to continue to care about something like consciousness quite a lot on reflection, and to not end up deciding that my ethics doesn't make any reference to it. Or at least there's something quite nearby to consciousness — like, something happens when I stub my toe. It's unclear exactly how to name it, but something about that I'm pretty focused on. So if you ask, well, where do things go? — I should be clear, I have a bunch of credence that in the end we end up caring a bunch about consciousness just directly. And if we don't — yeah, I mean, where will ethics go? Where will a completed philosophy of mind go? Very hard to say.
I mean, I can imagine — I think a move people might make, if you get a little bit less interested in the notion of consciousness, is something slightly more animistic. Like, so what's going on with the tree? You're maybe not talking about it as a conscious entity necessarily, but it's also not totally unaware or something. The consciousness discourse is rife with these funny cases where it's like, oh, those criteria imply that this totally weird entity would be conscious, or something like that — especially if you're interested in some notion of agency or preferences, because a lot of things can be agents: corporations, all sorts of things. Are corporations conscious? And it's like, oh, man. But one place it could go, in theory, is that in some sense you start to view the world as animated by moral significance in richer and subtler structures than we're used to. So plants, or weird optimization processes, or — who knows exactly what you end up seeing as infused with the sort of thing that you ultimately care about. But I think it is possible that that includes a bunch of stuff that we don't normally ascribe consciousness to.
I think when you talk about a completed theory of mind, and presumably after that a more complete ethics — even the notion of a reflective equilibrium implies, oh, you'll be done with it at some point, right? You sum up all the numbers and then you've got the thing. This might be related to the same sense we have about science. The vibe you get when you're talking about these kinds of questions is that, oh, we're rushing through all the science right now and we've been churning through it. It's getting harder to find things because there's some cap — you find all the things at some point. Right now it's super easy because a semi-intelligent species has barely emerged. And the ASI will just rush through everything incredibly fast, and then you will either have aligned its heart or not. In either case, it'll use what it's figured out about what is really going on and then expand through the universe and exploit — you know, do the tiling, or maybe some more benevolent version of quote-unquote tiling. That feels like the basic picture of what's going on. We had dinner with Michael Nielsen a few months ago, and his view is that this just keeps going forever, or close to forever. How much would it change your understanding of what's going to happen in the future
if you were convinced that Nielsen is right about his picture of science?
Yeah, I mean, I think there are a few different aspects. Going on my memory of this conversation — and I don't claim to really understand Michael's picture here — it was sort of like: sure, you get the fundamental laws. My impression was that he expects physics to get solved or something, maybe modulo the expensiveness of certain experiments. But the difficulty is that even granted you have the basic laws down, that still doesn't let you predict where, at the macro scale, various useful technologies will be located. There's still this big search problem. I'll let him speak for himself on what his take is here, but my memory was it was sort of like: sure, you get the fundamental stuff, but that doesn't mean you get the same tech. I'm not sure if that's true. If it is true, what kind of difference would it make?
So one difference is — well, it means in some sense that you have to, in a more ongoing way, make tradeoffs between investing in further knowledge and exploration versus exploiting, as you say — acting on your existing knowledge — because you can't get to a point where you're like, "and we're done." Now, as I think about it, I sort of suspect that was always true. I remember talking to someone — I think I was like, ah, at least in the future, we should really get all the knowledge. And he's like, what do you want? You don't want to know the output of every Turing machine. In some sense, there's a question of what it actually would be to have completed knowledge, and I think that's a rich question in its own right. It's not necessarily that we should imagine, on any picture, that you've got everything. And on any picture, in some sense, you could end up with this case where you cap out — there's some collider that you can't build, or something is too expensive, or whatever, and kind of everyone caps out there.
So I guess one way I'd put it is: there's a question of whether you cap out, and there's a question of how contingent the place you go is. If it's contingent, one prediction that makes is that you'll see more diversity across, you know, our universe or something. If there are aliens, they might have quite different tech, and so if people meet, you don't expect them to be like, "oh, you got your thing, I got my version" — it's just like, whoa, look at that thing, wow. So that's one thing. If you expect more ongoing discovery of tech, then you might also expect more ongoing change and upheaval and churn, insofar as technology is one thing that really drives change in civilization. So that could be another. People sometimes talk about lock-in — they envision this kind of point at which civilization has settled into some structure or equilibrium or something — and maybe you get less of that. Though I think that's maybe more about the pace rather than contingency or caps. But that's another factor.
So yeah, it's interesting — I don't know if it changes the picture fundamentally for Earth civilization. We still have to make tradeoffs about how much to invest in research versus acting on our existing knowledge. But I think it has some significance.
One vibe you get when you talk to people — we were at a party and somebody mentioned this, talking about how uncertain we should be about the future — and they're like, there are three things I'm uncertain about: what is consciousness, what is information theory, and what are the basic laws of physics? Once we get those, we're done.
Yeah.
And that has the vibe of, oh, you'll figure out what's the right kind of hedonium, and then, you know. Whereas this other picture — where you're constantly churning through — has more of the flavor of becoming that the attunement picture implies. I think it's more exciting. It's not just, oh, you figure out all the things in the 21st century and then — you know what I mean?
Yeah, I sometimes think about two categories of views about this. There are people who think, yeah, the knowledge — we're almost there, we've basically got the picture. Where the picture is sort of: the knowledge is all just sitting there, and you just have to get to — you just have to be scientifically mature at all, and then it's all going to fall together, right? And everything past that is going to be this super expensive, not-super-important thing. And then there's a different picture, which is much more of this ongoing mystery: oh, man, there's going to be more and more — maybe expect more radical revisions to our worldview. And I think it's interesting; I'm kind of drawn to both. Like, physics — we're pretty good at physics, right? A lot of our physics is quite good at predicting a bunch of stuff. Or at least that's my impression, from reading some physicists. So, who knows.
Your dad's a physicist, though, right?
Yeah, but this isn't coming from my dad.
This is from, like, a blog post — I think by Sean Carroll or something. And he's like, we really understand a lot of the physics that governs the everyday world. Like, a lot of it. We're really good at it. I think I'm generally pretty impressed by physics as a discipline, and I think that could well be right. On the other hand, you know, these guys have really only had a few centuries at it. So anyway, I think that's an interesting question. Yeah. And it leads to a different — I think it does. There's something to the endless frontier. There is a draw to that from an aesthetic perspective — the idea of continuing to discover stuff.
You know, at the least, I think you can't get full knowledge in some sense, because there's always the question: what are you going to do? There's some way in which you're part of the system — the knowledge itself is part of the system. Like, if you imagine you try to have full knowledge of what the future of the universe will be like... well, I don't know. Actually, I'm not totally sure that's true.
It has a halting-problem kind of property, right? There's a little bit of a loopiness
if you're — I think there are probably fixed points in that, where you could be like, "yep, I'm going to do that," and then you do. But I at least have a question of, when people imagine the kind of completion of knowledge, exactly how well does that work? I'm not sure.
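[A toy way to see the "loopiness" and the "fixed points" mentioned here — purely illustrative, with made-up agents, not anything from the conversation — is a prediction that the agent itself gets to hear and react to. A self-consistent, fixed-point prediction exists for an agent that goes along with what's announced, but not for one that always does the opposite.]

```python
# Two hypothetical agents that get to hear a prediction about their own next action.
def contrarian(prediction: str) -> str:
    """Always does the opposite of whatever is predicted."""
    return "rest" if prediction == "work" else "work"

def conformist(prediction: str) -> str:
    """Simply goes along with the prediction."""
    return prediction

def fixed_points(agent, actions=("work", "rest")):
    """Predictions that remain true even after the agent hears them."""
    return [a for a in actions if agent(a) == a]

print(fixed_points(contrarian))  # [] -- no self-consistent prediction exists
print(fixed_points(conformist))  # ['work', 'rest'] -- "yep, I'm going to do that" is stable
```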
You had a passage in your essay on Utopia where, I think, the vibe was more that the positive future we're looking for will be more like — well, you can describe what you meant, but to me it felt more like the first picture: you get the thing, and then you've found the heart of it. Maybe can I ask you to read that passage real quick?
Oh, sure.
And that will spur the discussion I'm interested in having about this part in particular.
Right. Quote: I'm inclined to think that utopia, however weird,
would also be in a certain sense recognizable.
that if we really understood and experienced it,
we would see in it the same thing
that made us sit bolt upright long ago
when we first touched love, joy, beauty.
That we would feel in front of the bonfire,
the heat of the ember from which it was lit.
There would be, I think, a kind of remembering.
Where does that fit into this picture?
It's a good question.
I mean, I think it's some guess that if there's no part of me that recognizes it as good, then I'm not sure that it's good according to me in some sense. So yeah, there is a question of what it takes for it to be the case that a part of you recognizes it as good. But if there's really none of that, then I'm not sure it's a reflection of my values at all.
There's a sort of tautological thing you can do, where it's like: if I went through the processes which led to me discovering it was good — which we might call reflection — then it was good. But by definition you ended up there because... you know what I mean?
Yeah, I mean, you definitely don't want to be like, like, you know,
if you transform me into a paper clipper gradually, right, then I will eventually be like,
and then I saw the light, you know, I saw the true paper clips.
Yeah.
Right.
But that's part of what's complicated about this thing about reflection: you have to find some way of differentiating between the sort of development processes that preserve what you care about and the development processes that don't. And that in itself is a fraught question, which itself requires taking some stand on what you care about and what sorts of meta-processes you endorse and all sorts of things. But you definitely shouldn't just — it is not a sufficient criterion that the thing at the end thinks it got it right.
Right.
Because that's compatible with having gone wildly off the rails.
Yeah, yeah, yeah.
There was a very interesting sentence in one of your posts where you said: our hearts have, in fact, been shaped by power, so we should not be at all surprised if the stuff we love is also powerful. Yeah, what's going on there — what did you mean there?
Yeah. So the context of that post is that I'm talking about this hazy cluster, which I call in the essay "niceness slash liberalism slash boundaries." It's a somewhat more minimal set of cooperative norms involved in respecting the boundaries of others, cooperation and peace amongst differences, tolerance, and stuff like that — as opposed to your favored structure of matter, which is sometimes the paradigm of values that people use in the context of AI risk. And, you know, I talk for a while
about the ethical virtues of these norms. But it's also pretty clear why we have these norms: one important feature of them is that they're effective and powerful. Liberal societies — you know, secure boundaries save resources wasted on conflict, and liberal societies are often better to live in, better to immigrate to, more productive, all sorts of things. Nice people are better to interact with, better to trade with, all sorts of things, right? And I think it's pretty clear, if you look both at why, at a political level, we have various political institutions, and if you look more deeply into our evolutionary past and how our moral cognition is structured, that various forms of cooperation and game-theoretic dynamics and other things went into shaping what we now, at least in certain contexts, also treat as a kind of intrinsic or terminal value. So some of these values that have instrumental functions in our society also get reified in our cognition as intrinsic values in themselves.
And I think that's okay — I don't think that's a debunking. All your values are something that kind of stuck and got treated as terminally important. But I think it means that — in the context of the series, where I'm talking about deep atheism and the relationship between what we're pushing for and what nature is pushing for, or what sort of pure power will push for — it's easy to say, well, there are paper clips, which is just one place you can steer, and pleasure is another place you can steer, and these are just arbitrary directions. Whereas I think some of our other values are much more structured around cooperation and things that are also effective and functional and powerful.
So that's what I mean there: I think there's a way in which nature is a little bit more on our side than you might think, because part of who we are has been made by a kind of nature's way, and so that is in us. Now, I don't think that's necessarily enough for us to beat the gray goo, right? We have some amount of power built into our values, but that doesn't mean it's going to be arbitrarily competitive. But I think it's still important to keep in mind, in the context of integrating AIs into our society, that — we've been talking a lot about the ethics of this, but there are also instrumental and practical reasons to want forms of social harmony and cooperation with AIs with different values. And I think we need to be taking that seriously and thinking about what it is to do that in a way that's genuinely legitimate — a project that is a kind of just incorporation of these beings into our civilization. There's the justice part, and there's also the part about whether it's a good deal, a good bargain, for people. And to the extent we're very concerned about AIs rebelling or something like that, well, part of a thing you can do is make civilization better for them. And I think that's an important feature of how we have, in fact, structured a lot of our political institutions and norms and stuff like that. So that's the thing I'm getting at in that quote.
Okay, I think that's an excellent place to close. Great.
Thank you so much, Joe, for coming on the podcast. We discussed the ideas in the series, but I think people might not appreciate, if they haven't read it, how beautifully written it is. We didn't cover everything — there's a bunch of very, very interesting ideas, things that, as somebody who has talked to people about AI for a while, I haven't encountered anywhere else. And obviously no other part of the AI discourse is nearly as well written. It is genuinely a beautiful experience to listen to the podcast version, which is in your own voice, so I highly recommend people do that. It's joecarlsmith.com where they can access this.
Joe, thanks so much for coming on the podcast.
Thank you for having me.
I really enjoyed it.
Hey, everybody.
I hope you enjoyed that episode with Joe.
If you did, as always, it's helpful if you can send it to friends, group chats, Twitter — whoever else you think might enjoy it. And also, if you can leave a good rating on Apple Podcasts or wherever you listen, that's really helpful — it helps other people find the podcast.
If you want transcripts of these episodes, or you want to read my blog posts, you can subscribe to my Substack at dwarkeshpatel.com.
And finally, as you might have noticed, there are advertisements on this episode, so if you want to advertise on a future episode, you can learn more about doing that at dwarkeshpatel.com/advertise or the link in the description.
Anyways, I'll see you on the next one. Thanks.
