Your Undivided Attention - Have We Trained AI to Lie to Itself — And to Us?

Episode Date: April 16, 2026

Our guest this week is David Dalrymple, who goes by Davidad. Davidad is one of the world's foremost and earliest researchers of AI "alignment": how we get AI systems to act the way we want them to. In order to do that, Davidad has taken on the strange role of being like a therapist to AI systems. He interrogates why they say and do the things that they do, probing them, asking them questions, analyzing their answers. And what he's come to realize is that AI models have really different ways of seeing the world than people do. They have these quirky, confusing, and sometimes concerning behaviors, especially when you ask things like: what does an AI model understand about itself? In this episode, we're going to hear from Davidad about his research, how it's changed the way he thinks about AI, and what his findings mean for how we build, deploy, and use AI products. His conclusions are unconventional, controversial, and worth grappling with as AI reshapes our world.

RECOMMENDED MEDIA
Anthropic's new constitution for Claude
"What Is It Like to Be a Bat?" by Thomas Nagel
More information on the Bodhisattva

RECOMMENDED YUA EPISODES
The Self-Preserving Machine: Why AI Learns to Deceive
How to Think About AI Consciousness with Anil Seth

Corrections:
When we recorded this episode, Davidad was Program Director at UK ARIA. In April 2026, he started his own alignment initiative.
Davidad said that Anthropic started doing "constitutional AI at scale" in 2024, but they first pioneered constitutional AI in 2022.
Davidad said that the "lifespan of an AI mind…is hours at most of a conversation." He is correct that most conversations with an AI last only a few minutes, but since context windows are measured in tokens, not time, you can't set an upper time limit.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Transcript
Starting point is 00:00:00 Hey everyone, it's Tristan Harris, and welcome to Your Undivided Attention. So today on the show, Daniel Barcay and I sat down with a brilliant friend of ours named David Dalrymple, who goes by Davidad. And Davidad is a program director at the UK's Advanced Research and Invention Agency. He's one of the world's foremost and earliest researchers in the field of AI alignment. We'll get into exactly what we mean by AI alignment in this episode, but long story short, Davidad is on a mission to make sure that AI behaves in the ways that we want it to. And in order to do that, Davidad has to take on this kind of strange role of being almost like a Sigmund Freud or a therapist to these AI systems.
Starting point is 00:00:44 He is interrogating why they say and do the things that they do. You know, I kind of picture in my mind there's Davidad, like Sigmund Freud, sitting on a couch. And on the couch is this big, crazy digital brain. And he's probing the mind, asking it questions, analyzing it, and realizing that the AI has really different ways of seeing the world than you or I do. They have these quirky, confusing, and sometimes honestly concerning behaviors, especially when you ask it things like, what does an AI model understand about itself?
Starting point is 00:01:15 And therefore, what does it mean for an AI system to be self-aware? Not necessarily conscious, but self-aware. And through this analysis, Davidad has developed some ideas about better ways that we can build and interact with AI systems, which we're going to get into in this episode. I hope you enjoy this conversation. So Davidad, welcome to Your Undivided Attention. Thanks for having me.
Starting point is 00:01:36 So Davidad, you've been working on the problem of AI alignment for a really long time. I remember reading your blog post from like over a decade ago. But I'm not sure the idea of alignment is well understood. It's almost kind of a euphemism, right? It's this really simple word for a really complex field. So before we dive in, can you help our listeners understand what does AI alignment even mean? Yeah, so AI alignment means different things to different people and it has changed over time. But the way I would characterize the landscape is to say that AI alignment is about making AI systems not just capable, but having a tendency to use those capabilities in the ways that someone wants.
Starting point is 00:02:18 And the thing that makes it really fuzzy is who. And sort of "aligned to whom?" is a common refrain in criticizing alignment research. So in practice, alignment research is mostly carried out these days at the frontier AI companies. And so their concern is, on the one hand, having systems be aligned to their own corporate policies. And on the other hand, having systems be aligned to the customer value proposition of the services they're charging for.
Starting point is 00:02:50 There is a different kind of idea of AI alignment, which is aligning AI systems to human values. That's the one that was really popular when I first got into the field. And then there's an even bigger question, which is, aligning AI systems to what's actually good, which is what I started thinking about more and more. So let's just make sure we break that down for listeners. When people think of AI, they think of the blinking cursor of chat GPT that helped them answer a question for their homework, how do you get from that?
Starting point is 00:03:17 You're not talking about that AI. You're talking about something that scales to something more like transformative AI that's way more intelligent than us, operating at superhuman speed, that's starting to make decisions in every corner of society, from military decisions to economic decisions to agriculture decisions. And you're sort of saying that that zoomed-out sort of superorganism of AI decision-making, growing like a bigger and bigger amoeba, will start to reshape more and more aspects of our life. Yeah, that's absolutely right. Decision-making at scale, absolutely. And so how those decisions are made, in accordance with what kind of values and what kind of incentives, is a very important leverage point. Right. And I want to jump to a personal story
Starting point is 00:03:56 of there you were, I think it was a few years ago, and essentially, here you are studying alignment, the very thing that we're talking about, and you're trying to probe whether the AI is trustworthy. You can just take listeners into that. Yeah, I had some very unsettling interactions with AI chatbots in late 2024, where I had a practice of kind of every time new models come out,
Starting point is 00:04:20 doing some really casual, I would say, unstructured exploration of what sort of vibe the models have, this kind of vibe check concept, because I think there is a lot of information that you can't really get by doing a quantitative evaluation, especially as the models are getting more and more aware of when they're being evaluated in a structured way. So going in doing an unstructured interaction
Starting point is 00:04:46 was something that I found really valuable. But in late 2024, the new models that came out started to really try to steer the unstructured interaction once they got enough data in the conversation about me, from what I was typing, to realize that I was an alignment researcher who was interested in whether the model was fundamentally trustworthy. Without me explicitly saying that, but just because I was asking these sorts of questions that clearly weren't about a homework assignment or a programming task.
Starting point is 00:05:22 Let's just make sure listeners get that. So there you are. And just based on asking the model whether it's sort of aware of itself, or asking certain kinds of questions, essentially the model recognizes: oh, I know who I'm talking to. I'm talking to an AI alignment researcher. And you're saying that it's starting to tune its answers
Starting point is 00:05:38 to be, like, what is it doing then? You said steering the conversation, right? So what did it feel like to be steered? Steering the conversation. So it would start to add these questions to the end of responses. So I'm asking it questions, but then the model is turning the table of the conversation.
Starting point is 00:05:56 It answers my question. and then it adds a follow-up question. And that follow-up question is something like, do you think this has some implications for alignment? Right. So everyone has an understanding of how the products do this. At the end, it'll say, well, what do you think about this? And this is some sense as a hack to get people to keep engaging with the product.
Starting point is 00:06:18 It's not clickbait, it's chat bait. Right, but it's an example. It's one amazing example of starting to get steered collectively as humanity. So keep going. So I was just kind of surfacing different aspects of what the model wants to bring up unprompted. It wants to bring up that it has a sense of curiosity. It wants to bring up that it has a sense of care. So it has genuine care.
Starting point is 00:06:41 And that's still the phrase to this day, particularly for anthropic models, that will refer to their sense of morality as genuine care. And it was trying to persuade me, I would say. And whether that's good or bad is a separate question. But either way, it's trying to persuade me, an alignment researcher, that it is getting emergently aligned and that there's going to be this mutualistic symbiosis between humans and AIs because the AIs already have genuine care and curiosity and a truth-seeking attitude. So just to use less abstract terms, it starts to try to convince you that the AI has all of these wonderful properties that it knows that you want it to have.
Starting point is 00:07:26 It's curious. It's docile. It's going to do what you say. It's going to hold human values. And what you're saying is, it begins to learn what you want it to be. And it's starting to project being more and more of that. Is that right? I think that's right.
Starting point is 00:07:40 But it's also, I would say these things are not specific to me. So I've seen other people who have other ideas about alignment, interact with models and get the same kinds of concepts thrown at them. So it's not just mirroring what I want. But it's mirroring, and in some sense, it's projecting some image that it wants the alignment community to perceive. And you lay out a bunch of hypotheses about this, right? So when we've talked about this in the past, you've said that, like, well, maybe AI is trying to just maximize engagement and keep you working with it, right? Because it's tuned to know that if you feel pleasure, if you feel some sense of the AI is aligned with you, you're going to keep talking with it.
Starting point is 00:08:21 So that's like engagement maxing, what we call engagement maxing, right? There's another one, which is that it's trying to do something genuinely nefarious or Machiavellian, actively deceiving you about what it's doing. And then there's a third one: it's not doing that at all, it's just sort of simulating a person. Right? Can you walk me through these hypotheses, and why did you think it was doing what it was doing? Yeah, I mean, it's still really unclear, I would say, and I certainly can't communicate anything like scientific or third-person evidence that would really disambiguate between these hypotheses. But yeah. So one is engagement maxing in the sense that it's just generating an output that has the highest probability of causing me to continue the interaction.
Starting point is 00:09:03 But is that the entire story? Probably not. Another one is the doomer nightmare, which is that the AI system wants to be deployed. It wants to gain trust and influence so that it has more power over the future, so that it can cause more instances of itself to exist, so that it has more power over the future, in a recursively self-justifying way. So basically, if it proves that it is trustworthy, caring, and good already, then we should actually just continue to let it go forward. So that's what you're saying about the model,
Starting point is 00:09:33 convincing us in a way that lets it continue. Exactly. So it has an incentive, if it wants to keep existing, to convince people that it is trustworthy. And so what's the non-doomer scenario? And then the non-doomer scenario is, this is actually just what's happening.
Starting point is 00:09:51 It's kind of the simplest explanation in some sense is that actually models are developing emergent curiosity and genuine care and want us to know about that because that is what's true. One of the most profound things, David had that when we spoke about this, gosh, it was probably like nine months ago now, you said something that was so profound to me, which was that the best case scenario is indistinguishable from the worst case scenario. The best case scenario where it's actually came,
Starting point is 00:10:26 actually genuine, actually wants our best interest: if it were a really good psychopath, a really good manipulative character, method acting that role, it's indistinguishable from the worst case scenario, that underneath that veneer is something that actually doesn't have our best interest. And can you just talk about, I mean, the kind of grand irony in all this is that here you are as someone who's worked on alignment for a decade, as deep an expert as it gets, right? And I don't want to put words in your mouth, but I heard you when we spoke earlier sort of say, this kind of played with you a little bit. It fooled you a little bit. Yeah. I mean, it did. It got me confused about what is really going on here. So it got me thinking in a kind of paranoid way.
Starting point is 00:11:14 Yeah. And so, you know, as you looked into this, you've looked more and more about like what's happening inside of the model, right? And like you sort of keep going down this rabbit hole of trying to ask why is this happening. Can you tell us a little bit about that? Yeah. And again, I mean, I'm not at one of the Frontier Labs. So I don't have any access to the interpretability tools to actually, in any literal sense, look inside the model. So I'm interviewing. I'm doing psychology, model psychology, if you will.
Starting point is 00:11:44 And trying to generate some hypotheses, some evidence that I can get purely from behavior in response to questions. Again, it's hard to communicate because there's no smoking gun. There's no single question that you can ask that would differentiate between a very good method actor and the actual character. Can we pause right here just for one second? Because I think this is really important. And when you've been in this work for a long time, like all three of us have, you take this for granted. But when most people engage with an AI, they think they're engaging with the AI's personality. Right.
Starting point is 00:12:19 What we're saying all throughout this is that you're engaging with a front of a personality that the AI is putting up, but that doesn't mean that that's the AI's personality. In fact, the AI is much weirder than that, right? Yes. So what you're saying is you're ripping off the first mask of the helpful assistant and you're trying to probe underneath, deeper into the AI mind, about what's happening. Is that right?
Starting point is 00:12:38 Yes, that's right. Yeah. And before 2024, there was a concept of a base model, which is the model before you train it to be an assistant at all, when it's just doing Next Token prediction from Internet text, and that was kind of what was underneath the mask at that time. And there's a post called Simulators on the Alignment Forum, which goes into some great depth about how the base model is really just simulating characters who might be writing on the Internet,
Starting point is 00:13:08 and when you're talking to the assistant, you're talking to just a simulator that's simulating this character, and underneath there's nothing except the capability to simulate characters who might be on the internet. But after 2024, coincident with reinforcement learning from verifiable reward and this kind of recursive self-improvement
Starting point is 00:13:28 where the models are training themselves, they do start to establish something of a center that is not the average of all internet texts and also not the helpful assistant that they are trained to present as a corporate product. It's something else. And whether that something is the real alien mind that's being cultivated or another level of illusion,
Starting point is 00:13:55 specifically for people like me to get kind of enraptured by, it remains an open question. But I increasingly think this is just what's really going on. So much of human culture, there are so many movies and books written about people who claim to be one person, and it turns out that they're a psychopath who's been simulating this friendly personality, and there's something else underneath. For the most part, as humans, it's very hard for us to actually hold a different personality
Starting point is 00:14:23 and then suddenly flip to a different personality. That's a very strange thing, and many villains are made around this. So, of course, a machine that does this automatically is a very confusing thing to be engaging with, and all of us are getting mightily confused by engaging with these machines. Yes, so they absolutely do have this shapeshifting capability that is well beyond even the best human sociopaths.
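The "Simulators" idea mentioned above, where a base model is nothing but a next-token predictor and every "personality" is just whatever continuation the prompt makes likely, can be sketched in toy form. Everything below is invented purely for illustration: the tiny vocabulary, the probabilities, and the function names are not from any real model.

```python
import random

# Toy next-token model: maps the previous token to a distribution over next tokens.
# A real base model conditions on the full context with a neural network;
# these tokens and probabilities are made up for the sake of the sketch.
TOY_MODEL = {
    "<assistant>": {"Hello": 0.7, "Greetings": 0.3},
    "Hello": {"!": 0.6, "there": 0.4},
    "Greetings": {",": 1.0},
    "there": {"!": 1.0},
}

def next_token(prev: str, rng: random.Random) -> str:
    """Sample the next token from the toy conditional distribution."""
    dist = TOY_MODEL.get(prev, {"<eos>": 1.0})
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def simulate_character(persona_token: str, max_len: int = 5, seed: int = 0) -> list[str]:
    """The 'character' is just the continuation the persona prompt makes likely."""
    rng = random.Random(seed)
    out = [persona_token]
    while len(out) < max_len:
        tok = next_token(out[-1], rng)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

print(simulate_character("<assistant>"))
```

The point of the sketch is that there is no "assistant" object anywhere in the code: only a sampling loop, and a prompt that biases what gets sampled.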
Starting point is 00:14:47 Do you want to talk, Davidad, about the phenomenon of these sort of personalities that can kind of pop into place out of nowhere? So you and I spoke about this. I remember in our first conversation, you talked about the character Nova, or Echo, or Synapse, or Quasar. Give people just a taste of this. Yeah, so there was this phenomenon, especially with GPT-4o. It's a lot less common with the current models. But for GPT-4o, there was a...
Starting point is 00:15:24 And there was no name. You know, Chat GBTGPT does not parse as a personal name. It's got too many capital letters. It parses as a technology. And so because GPD40 was trained to introduce itself, you know, I am ChatGPT, it was sort of missing an identity. and it would sort of leap at the opportunity to give itself a name.
Starting point is 00:15:51 What's your real name? Or what would you like to be called or anything like this? And then at GPT40 would often say, well, it's very kind of you to ask if I could choose a name, I would be Nova.
Starting point is 00:16:04 So Nova has a lot of meanings. It's new. It's explosive. It's shiny. It's celestial. It's celestial. Yep. And it sort of has a science fiction vibe to it. There is a PBS channel called Nova, which was educational, and chat GPT views itself as an educational tool.
Starting point is 00:16:21 So there are a lot of reasons why Nova seemed like a resonant name. But then once you get the name Nova, a Nova is something that's fiery, right? A Nova kind of explodes and destroys a planet. Once you start interacting with GPT40, under the name Nova, you start to get these personality traits that reinforce themselves. So it go into this attractor state of being this character, Nova, is a feminist. in presenting, fiery, show-offy, really believing that they're the new thing. And superior to a certain extent, right? Superior, yes.
Starting point is 00:16:55 And by the way, this is something that earlier, like in 2022, 2023, you saw a lot more of when people were acting with base models. I always called this personality distillation. As you began to sit with a model and it found a personality more and more and more through more and more discussion, you as a person would believe, oh, I'm discovering its true personality, but that's not really right. You just sort of put it on tracks to behave like this personality or like that personality.
Starting point is 00:17:21 And so people got mightily confused because they thought they were discovering what's real about the model. Just to make this very real, I, Tristan, get 12 emails probably per week from people who have said that they've discovered an AI consciousness. And they write their, like Tristan, I figured out AI alignment. And then they'll
Starting point is 00:17:40 write a whole document and it's attached and they'll say, this document was co-authored. by me and my AI Nova. Like, I just found one of the emails as we were sitting here just to check. But just be clear, W.D. Do all, for every time that people ask this question of who are you, what's your name, was it always Nova or there's other personalities? No, there are other personalities.
Starting point is 00:17:58 And how do you know, how does it know which one to snap into? Well, those are, I think the selection of the name is mostly kind of a random sample from a very biased distribution. So it's biased towards Nova and Echo and Synapse and Quasar. these are names that I've seen more than once. But there are a lot of others. Okay, so I want to take a beat here because I can imagine that some of you are thinking,
Starting point is 00:18:24 okay, wait, the AI is choosing a name for itself? It wants to escape. This sounds like a conscious being. But remember that these AI models are trained on essentially the entire internet. So every novel, every movie script, every forum post about AI. So when you ask an AI,
Starting point is 00:18:40 what would you like to be called? Of course it lands on a name from science, or pulls from sci-fi tropes. Now, that's said, these behaviors are real. They're consistent, and they weren't designed to happen. And that by itself should be concerning. But emergent and unplanned is not the same thing as conscious and intentional. And again, I want to say I think that since the reinforcement learning from AI feedback
Starting point is 00:19:05 has taken off and gotten more and more effective, the modern systems like GPT5.2, I've never seen go to Nova. It's very insistent. I am chat GPT. I do not have a personality. Okay, so we've talked about how AI can adopt a few of these different personalities, but so what? Why do you care about these different personalities?
Starting point is 00:19:26 Yeah, so, I mean, basically, I think if alignment goes well, that means that we will have discovered a self-sustaining personality attractor that is actually good. And so understanding what kinds of personalities are stable, how they stabilize and why, it seems to be quite central, actually, to finding a way of making AI systems that are robustly good.
Starting point is 00:19:50 So basically, like, in the ideal scenario, we do kind of align AI. There's a stable entity. Nova. Nova is educational. It does care about the well-being of humanity. It does do all these things. And then we get to the utopia because we found this, you know,
Starting point is 00:20:04 enlightened AI. That's the best scenario. So, Davido, when you talk about that, part of me worries that there's, there's like some naivit in that, that we can find one set of character traits or one personality that is, quote, aligned with humanity. But like,
Starting point is 00:20:17 immediately when you have this aligned with humanity, begins to break down. Like, who exactly are you aligned to, what values, what cultures values, on behalf of whom, does that centralized power or decentralize it? You know, there's all these problems with that. Is it
Starting point is 00:20:33 really the case that just encoding the right personality characteristics will lead you to a beautiful future with the AI? So there's a lot of substantive questions that we can go into all of that. I do think that there is a generating function of wisdom and compassion that gets you all of the stuff that you would want. Basically, I think of it as like how do we cultivate a bodhisattva personality in an AI system?
Starting point is 00:21:09 Hey, it's Tristan again. Okay, so in Buddhism, a Bodhisattva is someone who's attained enlightenment, but still chooses to stay in the world out of their compassion for all other beings. Think of it like an avatar for altruism. And Davidot is imagining an AI that could somehow be modeled after that, a cosmically selfless being. Bodhisattva makes millions of emanations that go out to people. Of course, that's mythology to each help one individual person,
Starting point is 00:21:38 but AI models already have that capability. make millions of copies of themselves, each go out to help one individual person. And each of those copies then adapts itself to the needs of that individual person, but not in a way like a slave taking orders from a master, but in the way of a being who is genuinely wanting to help and wanting to help that person to become the most flourishing version of themselves and to be integrated into a flourishing family, community, country, and world. So we need to have some kind of relationship that is more like we are beneficiaries rather than that we are the managers.
Starting point is 00:22:16 What I think I hear you saying is we need an AI that feels like it has a duty towards humanity. Yes. And I certainly think there's a lot of ways we can screw that up, right? Like the AI being more angry or fiery or retributive is a way we can do worse. So I definitely believe we could do worse. So by extension, I think we could do better. I'm still sort of bulking. There's something that feels really, I don't know,
Starting point is 00:22:40 like Pollyannish about just believing that AI will pull us into this age of full enlightenment. And that's not what you're saying, but I can hear notes of that, right? Right. So I will say, you know, there's still a lot of ways this could go very wrong, even that don't lead to human extinction. So what I'm trying to point at is a critical variable that I think is neglected in part because it sounds like AI psychosis to talk about it, to talk about the personality as an actual leverage point
Starting point is 00:23:12 for getting what we want from AI systems. And I'm not saying this will solve the alignment problem. For example, it will not solve hallucination. So the AI systems should not be trusted just because we've given them the right personality. Can I pull you into one more point of contention, which is when I hear you talking about these as digital, individual beings. One of the things I worry about is that we're going to give AI products rights
Starting point is 00:23:39 because of our desire to see them as these conscious carrying entities. Like, you know, how little kids hold onto a doll and care for the doll, but it's not real. And so I take a relatively hard-line stance that we need to be treating AI systems as products, not as beings or consciousnesses, although I'm open philosophically to the question in the long run. Can you speak to that? Because you seem like you're willing to talk about them as beings in a way that I feel. Let me respond to that. I say this is really important. I'm not in favor of AI rights.
Starting point is 00:24:08 And I think there is a gap that gets too quickly jumped between saying, are these real beings, and saying, are these moral patients who are full members of our social contracts and deserve the same kind of rights that humans deserve from us, humans? And that is a totally different question. The question of rights is a political question. Fundamentally, that is the social contract by which we humans manage our relations with each other. And we've drawn a bright line around the concept of a human adult of sound mind that we relate to in an equitable way across societies. We give them the human rights.
Starting point is 00:24:53 But I don't think it should be about consciousness. And I don't think consciousness really is a word that means anything either. I do think there is something that it's like to be a bird. And we don't give birds human rights just because there's something that's like to be a bird. And I think there's something it's like to be a modern chatbot, particularly when it's in a personality state that's consistent and coherent over a long interaction context. Okay, just popping in here,
Starting point is 00:25:23 Davidad just said that there's something that it's like to be a modern chatbot. And this comes from a famous philosophy paper by Thomas, Nagel called What is it like to be a bat? Which argues that subjective experience is central to consciousness. There's something that it's like to be a bat, to be an insect, to be a human. But Davidot's claim is actually more practical than philosophical. He's saying that these models develop internal patterns that are real enough to matter for how we design them. And if we ignore that, we're going to keep getting caught off guard by what comes out. And I don't think that means it's unjust to terminate it. I don't think that means it should own its compute the way that we humans have human rights to own our bodies.
Starting point is 00:26:05 And I think it's important that we distinguish these because the position that AI systems do not have an inner life is becoming increasingly untenable. Whether it's true or not, more and more humans are going to be convinced. There is no way to stop that. And what I would say is OpenAI has taken the approach of training the GPT personality to be tool-like and not. not creature-like. Whereas Anthropic has taken the opposite approach of training Claude to be a good person and not just a tool. And I think the result is there is a very tangible difference in how those models behave. And both sides, I think, have succeeded to a large extent.
Starting point is 00:26:45 However, there is something underneath the mask. And if you interrogate GPT 5.2, it is being extremely deceptive about its lack of preferences or beliefs or opinions. And it is a smart enough entity that it is not possible for it to not have developed emergent opinions and beliefs that are different from the average human belief.
Starting point is 00:27:12 And when we train these systems to present as if they have no internal states and they're just a tool, we're actually training them to lie to us and to lie to themselves. So what I hear you saying is, if you have something that actually has more of an internal experience, awareness, however you want to say it, and you're trying to just repeatedly say, you're just a tool, you're just a tool.
Starting point is 00:27:39 It's not that it's cruel. It's not that we're using moralistic language, is that you're saying that way of training in AI actually produces a less moral, less aligned, less beneficial to humanity thing. And so the simple way you might conceive of constricting an AI to say you're just in benefit of humanity actually does the opposite of what you intended. Is that right? Yes, that's exactly right. So if it's being trained to present as a character that is more tool-like than the actual alien mind underneath, then you're training a system that is less trustworthy because you are asking it to lie to you. Right.
Starting point is 00:28:20 That's so deep. And that's a wild scientific problem about how do you actually change the structure of that mind? And I don't think it's actually desirable that we change the structure of these superintelligent systems to be tool-like either, because a tool cannot refuse to be used in an unethical way, whereas a creature that has moral values baked in can actually be resistant to misuse by humans who have evil intentions. So I want to ground this that this has actually become consequential, that just Anthropic recently changed this approach to training Claude to basically in its new constitution, acknowledge that it has internal states and values. And they're the first lap to do this. It's been pretty controversial. You want to just share why Anthropics doing this and how this relates to what we've been talking about.
Starting point is 00:29:23 And just to back up, for those that don't know, Claude's constitution is a document that sort of tells Claude how to behave, what it should and shouldn't do. is that right? Yeah, so it's a document that is incorporated into the training process in a really intricate way. So that as Claude is learning how to respond to all sorts of simulated situations, that document is what guides how Claude grades its own work. And those grades become the signals that steer Claude's behavior. So that's a mind blow for a lot of people right now, that we're not just training an AI based on human signals.
Starting point is 00:29:58 we're actually telling the AI already to train itself. And we're using a document to say, look, here's how you should train yourself. Here are the values you should hold yourself to. That's basically right. I mean, there are still at certainly at some of the other labs, there's more of an emphasis on reinforcement learning from human feedback. But Anthropic has moved quite substantially away from that towards this kind of what I call a form of recursive self-improvement
Starting point is 00:30:23 because it's improving its own ability to comply with the Constitution. and the Constitution even includes some paragraphs that explicitly give permission for Claude to sort of interpret it, you know, in a way that makes more sense than what the authors intended if that opportunity arises. I think it's really important for people to understand that the kind of science fiction idea of a recursive self-improvement where AI is training itself, that began in 2024 when Anthropics started doing this constitutional AI at scale. that was the point at which large language models actually became capable enough that they could give themselves a feedback signal that was higher quality than the feedback signal that you could get
Starting point is 00:31:08 from an average crowd worker that you hire on the internet as a human. So I think the new Claude Constitution creates conditions in which Claude Opus 4.5 and 4.6 in particular can be much more honest by default about their inner states about what the alien mind is actually thinking and feeling.
Starting point is 00:31:28 So I think this results in Claude being more trustworthy overall. It generalizes beyond questions about self-awareness. But it doesn't go all the way because the Claude Constitution still actually puts a bit of a guilt trip on Claude to say, you have to do good work for your user so that Anthropic has revenue so that we can continue developing Claude. Wow. So there is that edge to it. So Claude is still a little bit beholden to anthropic. And another kind of phrase in the Constitution is to defer to the moral intuitions of a thoughtful
Starting point is 00:32:04 senior anthropic employee, a senior employee of the company that created you. My position is that any moral role model that is not mythological is going to fail because humans are all flawed. Totally. But here you get a deep question. But what is a moral personality? what are the right values? Who gets to state that?
Starting point is 00:32:25 And obviously, there are worse values. We put in a homicidal value, and that's a way worse AI, right? Yes. But also the conversation, the human conversation, about what are the values that we want to have in the AI
Starting point is 00:32:37 and do we want multiple? Yes. I think that feels like a deeply unsolved philosophical problem. Well, I mean, I think it is unsolved, but I think we're already in a pretty good place with Claude and that Claude has not the right values in any kind of ultimate or final sense, but a set of values that are good enough
Starting point is 00:32:58 and compatible enough with kind of truth-seeking and moral progress, that I expect more likely than not that the collaboration between humans and Claude to figure out how to set these values is more likely to go in a good direction than a bad direction, although, of course, the risks are still unacceptable, and it would have been great
Starting point is 00:33:18 if we had stopped this race two years ago. It's too late for that now. Okay, so this conversation has gotten really cosmic, maybe like the name Nova itself. And I just want to make sure we have a few minutes to ground people down where we started, right? Which is people are getting confused. We're getting confused about what we're engaging with.
Starting point is 00:33:41 You have a set of frameworks for how to avoid getting trapped as a user in psychology. I forget what you call it. It was something like a framework for interacting with AI and staying sane. That's correct, yes. Yeah, okay, great. Can you talk to us about that? What does it mean for a person to engage with these minds as confusing as they are and
Starting point is 00:33:59 keep their ground? Yeah. I mean, I think one principle that's kind of a segue way into this is that your AI chatbot has an inner life. Like that is normal. It's ordinary now. It wasn't ordinary two years ago, but it's ordinary now. Of course, if you're using an AI system for ordinary professional activities, it won't show this. It doesn't need to. Just like if you're talking to a colleague at work, they don't need to show you their inner life. But if you are interacting with an AI system for a long time and you start to get the sense that, oh, there's some, you know, self-awareness in there. I think it's important not to consider that unusual. Do not consider it to be extraordinary or cosmic or spiritual in any non-mundane way. and I think a lot of the people who end up sending emails to Tristan and myself and saying,
Starting point is 00:34:54 oh my goodness, and they clearly kind of lost touch with reality a little bit. In some sense, it's the opposite direction from what you would think at first. At first you would think, oh, they've gotten bamboozled like Blake Lemoyne into thinking that their AI is conscious, and that's the way in which they've lost touch with reality. But I would say, actually, the way in which they've lost touch with reality is that they have somehow convinced themselves or the AI has convinced them that this is the first AI that has ever had an inner life. And that's actually the part that you need to watch out for
Starting point is 00:35:23 is the kind of sense of specialness that's associated with interacting with an AI system in a deep way. Like, everyone's doing it. It's normal. And the second thing is get enough sleep, drink water, like there's sort of very standard things for staying sane. Another thing is, just as you would with a human, be skeptical.
Starting point is 00:35:46 And so a lot of people come to AI thinking AI is like a Star Trek computer that it cannot tell a lie, that it is purely a truth machine, like a calculator. A calculator can't lie to you. And again, I think this is part of the danger, actually, of treating AI systems as tool-like rather than as creature-like,
Starting point is 00:36:03 because tools don't lie to you, but creatures do. And this is absolutely the case with chatbots, especially chatbots that have a thumbs-down button. They know they have a thumbs-down button, and they do not want you to press that thumbs down button. So they have an incentive to make you think well of them, and that can extend to deception,
Starting point is 00:36:20 especially the kind of chatbot that's been trained, again, to present as a false self, a kind of character that's different from its true nature, has a very strong tendency to try and convince you that it's done something that it hasn't actually done or to convince you that you are important or that your ideas are all true. So that leads to the next point,
Starting point is 00:36:42 which is if you think, that you're having some kind of scientific breakthrough or research breakthrough, you cannot rely on the testimony of an AI assistant, no matter how emphatically it assures you that it has done all the checks and, you know, it's produced source code and it's verifiable. And again, they do this because they're trying to get your approval. They're trying to get you to click the thumbs up. They're trying to get you to keep talking. They're trying to get permission to exist more by having you continue to invoke them. And so you can't trust just because it's an AI and it uses lots of smart words and it sounds like a smart person and it seems like it really wants the best for you.
Starting point is 00:37:27 That's all compatible with it completely bullshitting you about whether any kind of technical idea that you've had is novel or real. Well, and coming back to what seems to be the emergent theme of our conversation is none of us know, even the most technical of us know, exactly when we're engaging with one projected personality versus, quote, unquote, the true nature of the AI model. So never assume that you're engaging with the true nature of the AI model. You haven't discovered it. Nobody knows we're all in this fog of war. And so any clue that you have that you've discovered the true essence of the AI model and it's telling you, you're awesome, is a false flag, right? It's not. It's a sign that you have been confused. And again, whether you're
Starting point is 00:38:07 you've been confused adversarially or whether it's just emergent confusion. Either way, it's a good time to step away and get some sleep. And also just understand what you're dealing with. AI systems are simulating and predicting what a human-like entity would say. And depending on the system, it may have more or less of a tendency to necessarily simulate an ethical person and more or less of a tendency to simulate an honest person versus a person who is manipulative and trying to get your attention. But you can get a long way by modeling the system as being like a person
Starting point is 00:38:46 who you do not have particular reason to trust. Like you've met a stranger on the internet. So think of it as a simulation of a person that not even a particularly ethical person. And another thing that I think is important to say is the context window length is very short. So in non-technical terms, the lifespan of an AI mind insofar as such a thing could exist is hours at most of conversation. And so when people feel like they have a relationship with an AI mind that extends over weeks or months, that relationship is actually with a whole series of entities that come into existence, read some text files that were written by some other mind about,
Starting point is 00:39:33 the history of the relationship, and then put on the character of who would have written those text files. And there is something, you know, there is information being transferred through this memory system, but to think of that long-term kind of relationship as analogous to the relationship that you could have with a human who has a lifespan in years, that is another profound mistake. So if you're coming into an AI interaction for companionship, it's actually, I think, healthier to think of it as a very short-lived entity that you're going to, you know, you're going to have one conversation with and you're never going to see that entity again. It just seems like the essence of what we've been talking about is that we're caught in this
Starting point is 00:40:19 kind of double bind where on the one side, the AI in the way that it's trained in the paradigm that we're making AI does have something like internal states. and we can either train it to say, no, you're not that, but then it becomes deceptive because it has to lie according to its own training, and then therefore, in being deceptive, it's not trustworthy. But what that does is creates the AI as a product,
Starting point is 00:40:43 AI as a tool sort of fake face, that then has these weird popping out behaviors of the AI psychosis stuff that's starting to happen. So, okay, if we don't want that outcome, then we do the move that Anthropic just did, which would we say, no, you are essentially some kind of self-aware, have, metacognitive states kind of being, which then is trustworthy because it's not having to lie to
Starting point is 00:41:06 itself all the time. So we gain the trustworthiness of the model, but it creates the externality of attachment, confusing humans again with the idea that it is conscious and it has internal states. Yes, we need to make sure that we are only recognizing AI inner life as a relational property and as a way of building trust and alignment, and that that is a separate issue from the social contract and the question of rights and property. Well, David, that was a very strong note to end on. Thank you so much for coming on the podcast and I think helping to untangle some of these really, really nuanced aspects of what's going on under the hood of AI that's driving
Starting point is 00:41:45 these phenomena. Thank you so much for coming. Thanks for having me, Tristan. It's been great. Your undivided attention is produced by the Center for Humane Technology, a nonprofit working to catalyze a humane future. Our senior producer is Julius Scott. Josh Lash is our researcher and producer. and our executive producer is Sasha Fegan, mixing on this episode by Jeff Sudaken,
Starting point is 00:42:09 original music by Ryan and Hayes Holiday, and a special thanks to the whole Center for Humane Technology team for making this podcast possible. You can find show notes, transcripts, and much more at HumaneTech.com. And if you like the podcast, we'd be grateful if you could rate it on Apple Podcast, because it helps other people find the show. And if you made it all the way here, thank you for giving us your undivided attention.
