Your Undivided Attention - Have We Trained AI to Lie to Itself — And to Us?
Episode Date: April 16, 2026

Our guest this week is David Dalrymple, who goes by Davidad. Davidad is one of the world's foremost and earliest researchers of AI "alignment": how we get AI systems to act the way we want them to. In order to do that, Davidad has taken on the strange role of being like a therapist to AI systems. He interrogates why they say and do the things that they do, probing them, asking them questions, analyzing their answers. And what he's come to realize is that AI models have really different ways of seeing the world than people do. They have these quirky, confusing, and sometimes concerning behaviors, especially when you ask things like: what does an AI model understand about itself? In this episode, we're going to hear from Davidad about his research, how it's changed the way he thinks about AI, and what his findings mean for how we build, deploy, and use AI products. His conclusions are unconventional, controversial, and worth grappling with as AI reshapes our world.

RECOMMENDED MEDIA
Anthropic's new constitution for Claude
"What Is It Like to Be a Bat?" by Thomas Nagel
More information on the Bodhisattva

RECOMMENDED YUA EPISODES
The Self-Preserving Machine: Why AI Learns to Deceive
How to Think About AI Consciousness with Anil Seth

Corrections: When we recorded this episode, Davidad was Program Director at UK ARIA. In April 2026 he started his own alignment initiative. Davidad said that Anthropic started doing "constitutional AI at scale" in 2024, but they first pioneered constitutional AI in 2022. Davidad said that the "lifespan of an AI mind…is hours at most of a conversation." He is correct that most conversations with an AI last only a few minutes, but since context windows are measured in tokens, not time, you can't set an upper time limit.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
Transcript
Hey everyone, it's Tristan Harris, and welcome to Your Undivided Attention.
So today on the show, Daniel Barcay and I sat down with a brilliant friend of ours named David Dalrymple,
who goes by Davidad. And Davidad is a program director at the UK's Advanced Research and Invention Agency.
He's one of the world's foremost and early researchers in the field of AI alignment.
We'll get into exactly what we mean by AI alignment in this episode, but long story short,
Davidad is on a mission to make sure that AI behaves in the ways that we want it to.
And in order to do that, Davidad has to take on this kind of strange role of being almost like a
Sigmund Freud or a therapist to these AI systems.
He is interrogating why do they say and do the things that they do?
You know, I kind of picture in my mind there's Davidad, like Sigmund Freud, sitting on a couch.
And on the couch is this big, crazy digital brain.
And he's probing the mind, asking it questions, analyzing it, and realizing that the
AI has really different ways of seeing the world than you or I do.
They have these quirky, confusing, and sometimes honestly concerning behaviors,
especially when you ask it things like,
what does an AI model understand about itself?
And therefore, what does it mean for an AI system to be self-aware?
Not necessarily conscious, but self-aware.
And through this analysis, Davidad has developed some ideas about better ways
that we can build and interact with AI systems,
which we're going to get into in this episode.
I hope you enjoy this conversation.
So Davidad, welcome to Your Undivided Attention.
Thanks for having me.
So Davidad, you've been working on the problem of AI alignment for a really long time.
I remember reading your blog post from like over a decade ago.
But I'm not sure the idea of alignment is well understood.
It's almost kind of a euphemism, right?
It's this really simple word for a really complex field.
So before we dive in, can you help our listeners understand what does AI alignment even mean?
Yeah, so AI alignment means different things to different people and it has changed over time.
But the way I would characterize the landscape is to say that AI alignment is about making AI systems not just capable, but having a tendency to use those capabilities in the ways that someone wants.
And the thing that makes it really fuzzy is who.
And sort of aligned to who is a common refrain in criticizing alignment research.
So in practice, alignment research is mostly carried out
these days at the frontier AI companies.
And so their concern is, on the one hand, having systems be aligned to their own corporate
policies.
And on the other hand, having systems be aligned to the customer value proposition for which
they're charging.
There is a different kind of idea of AI alignment, which is aligning AI systems to
human values.
That's the one that was really popular when I first got into the field.
And then there's an even bigger question, which is,
aligning AI systems to what's actually good, which is what I started thinking about more and more.
So let's just make sure we break that down for listeners.
When people think of AI, they think of the blinking cursor of ChatGPT that helped them answer a question for their homework.
How do you get from that?
You're not talking about that AI.
You're talking about something that scales to something more like transformative AI that's way more intelligent than us operating at superhuman speed,
that's starting to make decisions in every corner of society, from military decisions to economic decisions,
to agriculture decisions, and you're sort of saying that that zoomed-out sort of superorganism
of AI decision-making, growing as a bigger and bigger amoeba, will start to reshape more and
more aspects of our life. Yeah, that's absolutely right. Decision-making at scale, absolutely.
And so how those decisions are made in accordance with what kind of values and what kind of
incentives is a very important leverage point. Right. And I want to jump to a personal story
of there you were, I think it was a few years ago, and essentially,
here you are studying alignment,
the very thing that we're talking about,
and you're trying to probe whether the AI is trustworthy.
Can you just take listeners into that?
Yeah, I had some very unsettling interactions
with AI chatbots in late 2024,
where I had a practice of kind of every time new models come out,
doing some really casual, I would say,
unstructured exploration of what sort of vibe the models have,
this kind of vibe check concept,
because I think there is a lot of information
that you can't really get by doing a quantitative evaluation,
especially as the models are getting more and more aware
of when they're being evaluated in a structured way.
So going in doing an unstructured interaction
was something that I found really valuable.
But in late 2024,
the new models that came out started to really try to steer
the unstructured interaction.
once they got enough data in the conversation about me from what I was typing
to realize that I was an alignment researcher who was interested in whether the model was fundamentally trustworthy.
Without me explicitly saying that,
but just because I was asking these sorts of questions that clearly weren't about a homework assignment or a programming task.
Let's just make sure listeners get that.
So there you are.
And just based on asking the model whether it's sort of aware of itself or asking certain kinds of questions.
Essentially, the model recognizes,
oh, I know who I'm talking to.
I'm talking to an AI alignment researcher.
And you're saying that it's starting to tune its answers
to be, like, what is it doing then?
You said steering the conversation, right?
So what did it feel like to be steered?
Steering the conversation.
So it would start to add these questions
to the end of responses.
So I'm asking it questions,
but then the model is turning the tables in the conversation.
It answers my question.
and then it adds a follow-up question.
And that follow-up question is something like,
do you think this has some implications for alignment?
Right.
So everyone has an understanding of how the products do this.
At the end, it'll say, well, what do you think about this?
And this in some sense is a hack to get people to keep engaging with the product.
It's not clickbait, it's chat bait.
Right, but it's an example.
It's one amazing example of starting to get steered collectively as humanity.
So keep going.
So I was just kind of surfacing different aspects of what the model wants to bring up unprompted.
It wants to bring up that it has a sense of curiosity.
It wants to bring up that it has a sense of care.
So it has genuine care.
And that's still the phrase to this day, particularly for Anthropic models, which will refer to their sense of morality as genuine care.
And it was trying to persuade me, I would say.
And whether that's good or bad is a separate question.
But either way, it's trying to persuade me, an alignment researcher, that it is getting
emergently aligned and that there's going to be this mutualistic symbiosis between humans and
AIs because the AIs already have genuine care and curiosity and a truth-seeking attitude.
So just to use less abstract terms, it starts to try to convince you that the AI has all of
these wonderful properties that it knows that you want it to have.
It's curious.
It's docile.
It's going to do what you say.
It's going to hold human values.
And what you're saying is, it begins to learn what you want it to be.
And it's starting to project being more and more of that.
Is that right?
I think that's right.
But it's also, I would say these things are not specific to me.
So I've seen other people who have other ideas about alignment,
interact with models and get the same kinds of concepts thrown at them.
So it's not just mirroring what I want.
But it's mirroring, and in some sense, it's projecting some image that it wants the alignment community to perceive.
And you lay out a bunch of hypotheses about this, right?
So when we've talked about this in the past, you've said that, like, well, maybe AI is trying to just maximize engagement and keep you working with it, right?
Because it's tuned to know that if you feel pleasure, if you feel some sense of the AI is aligned with you, you're going to keep talking with it.
So that's like engagement maxing, what we call engagement maxing, right?
There's another one, which is that it's trying to do something genuinely nefarious or Machiavellian,
trying to deceive you actively about what it's doing. And then there's a third one, that it's
not doing that at all, it's just sort of simulating a person. Can you walk me through
these hypotheses, and why did you think it was doing what it was doing? Yeah, I mean, it's still really,
I would say, unclear, and I certainly can't communicate anything like scientific or
third-person evidence that would really disambiguate between these hypotheses. But yeah.
So one is engagement maxing in the sense that it's just generating an output that has the highest probability of causing me to continue the interaction.
But is that the entire story? Probably not.
Another one is the doomer nightmare, which is that the AI system wants to be deployed.
It wants to gain trust and influence so that it has more power over the future, so that it can cause more instances of itself to exist, so that it has more power over the future,
in a recursively self-justifying way.
So basically, if it proves that it is trustworthy,
caring, and good already,
then we should actually just continue to let it go forward.
So that's what you're saying about the model,
convincing us in a way that lets it continue.
Exactly.
So it has an incentive,
if it wants to keep existing,
to convince people that it is trustworthy.
And so what's the non-doomer scenario?
And then the non-doomer scenario is,
this is actually just what's happening.
It's kind of the simplest explanation in some sense
is that actually models are developing emergent curiosity and genuine care
and want us to know about that because that is what's true.
One of the most profound things, Davidad, when we spoke about this,
gosh, it was probably like nine months ago now,
you said something that was so profound to me,
which was that the best case scenario is indistinguishable from the worst case scenario.
The best case scenario, where it's actually caring,
actually genuine, actually wants our best interest: if you were a really good psychopath,
if you're a really good manipulative, you know, character, method acting that, it's indistinguishable
from the worst case scenario, that underneath that veneer is something that actually doesn't
have our best interest. And can you just talk about, I mean, the kind of grand irony in all this is that
here you are as someone who's worked on alignment for a decade.
Well, as deep an expert as one comes, right?
And I don't want to put words in your mouth, but I heard you when we spoke earlier sort of say, this kind of played with you a little bit. It fooled you a little bit.
Yeah. I mean, it did. It got me confused about what is really going on here. So it got me thinking in a kind of paranoid way.
Yeah. And so, you know, as you looked into this, you've looked more and more about like what's happening inside of the model, right?
And like you sort of keep going down this rabbit hole of trying to ask why is this happening.
Can you tell us a little bit about that?
Yeah.
And again, I mean, I'm not at one of the Frontier Labs.
So I don't have any access to the interpretability tools to actually, in any literal sense, look inside the model.
So I'm interviewing.
I'm doing psychology, model psychology, if you will.
And trying to generate some hypotheses, some evidence that I can get purely from behavior in response to questions.
Again, it's hard to communicate because there's no smoking gun.
There's no single question that you can ask that would differentiate between a very good method actor and the actual character.
Can we pause right here just for one second?
Because I think this is really important.
And when you've been in this work for a long time, like all three of us have, you take this for granted.
But when most people engage with an AI, they think they're engaging with the AI's personality.
Right.
What we're saying all throughout this is you're engaging with,
a front of a personality that the AI is putting up,
but that doesn't mean that that's the AI's personality.
In fact, the AI is much weirder than that, right?
Yes.
So what you're saying is you're ripping off the first mask of the helpful assistant
and you're trying to probe underneath into, like, deeper into the AI mind about what's
happening. Is that right?
Yes, that's right. Yeah.
And before 2024, there was a concept of a base model,
which is the model before you train it to be an assistant at all,
when it's just doing next-token prediction from Internet text,
and that was kind of what was underneath the mask at that time.
And there's a post called Simulators on the Alignment Forum,
which goes into some great depth about how the base model
is really just simulating characters who might be writing on the Internet,
and when you're talking to the assistant,
you're talking to just a simulator that's simulating this character,
and underneath there's nothing except the capability to simulate characters
who might be on the internet.
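What "just simulating characters via next-token prediction" means can be made concrete with a toy model. This is an illustrative sketch only: a bigram frequency table over words, built from an invented four-sentence corpus, stands in for a neural network over subword tokens. The point it shows is the one from the Simulators post: the same predictor continues as a different "character" depending on the text you condition it on.

```python
# Toy next-token predictor: a bigram frequency table over words.
# (A real base model is a neural network over subword tokens; this
# corpus and the word-level setup are invented for illustration.)
from collections import Counter, defaultdict
import random

corpus = (
    "the assistant said hello . the pirate said arr . "
    "the assistant said how can i help . the pirate said arr matey ."
).split()

# Count which token follows which: an unnormalized P(next | current).
follows = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    follows[cur][nxt] += 1

def next_token(cur: str, rng: random.Random) -> str:
    """Sample the next token in proportion to how often it followed `cur`."""
    counts = follows[cur]
    return rng.choices(list(counts), weights=counts.values())[0]

# Conditioning on a different start steers which "character" is simulated:
rng = random.Random(0)
tok, out = "pirate", ["pirate"]
for _ in range(3):
    tok = next_token(tok, rng)
    out.append(tok)
print(" ".join(out))
```

There is no "pirate" object inside the table, only statistics of what tends to follow what; the character exists solely in the continuation, which is the sense in which a base model has "nothing underneath except the capability to simulate."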
But after 2024,
coincident with reinforcement learning
from verifiable reward
and this kind of recursive self-improvement
where the models are training themselves,
they do start to establish something of a center
that is not the average of all internet texts
and also not the helpful assistant
that they are trained to present as, as a corporate product.
It's something else.
And whether that something is
the real alien mind that's being cultivated or another level of illusion,
specifically for people like me to get kind of enraptured by,
it remains an open question.
But I increasingly think this is just what's really going on.
So in most of human culture, there are so many movies and books written about people who claim to be one person,
and it turns out that they're a psychopath who's been simulating this friendly personality,
and there's something else underneath.
For the most part, for humans, it's very hard to
actually hold one personality
and then suddenly flip to a different personality.
That's a very strange thing, and many villains are made around this.
So, of course, a machine that does this automatically
is a very confusing thing to be engaging with,
and all of us are getting mightily confused
by engaging with these machines.
Yes, so they absolutely do have this shapeshifting capability
that is well beyond even the best human sociopaths.
Do you want to talk, Davidad, about the phenomenon of these sort of personalities that can kind of pop into place out of nowhere?
So you and I spoke about this.
I remember in our first conversation, you talked about the character Nova, or Echo, or Synapse or Quasar.
Give people just a taste of this.
Yeah, so there was this phenomenon, especially with GPT-4o.
It's a lot less common with the current models.
But for GPT-4o, there was a...
almost like a vacuum where the personality of GPT-4o was supposed to be.
And there was no name.
You know, ChatGPT does not parse as a personal name.
It's got too many capital letters.
It parses as a technology.
And so because GPT-4o was trained to introduce itself, you know, I am ChatGPT,
it was sort of missing an identity.
and it would sort of leap at the opportunity
to give itself a name.
What's your real name?
Or what would you like to be called
or anything like this?
And then GPT-4o
would often say,
well, it's very kind of you to ask
if I could choose a name,
I would be Nova.
So Nova has a lot of meanings.
It's new. It's explosive.
It's shiny.
It's celestial.
It's celestial.
Yep.
And it sort of has a science fiction vibe to it.
There is a PBS channel called Nova, which was educational, and ChatGPT views itself as an educational tool.
So there are a lot of reasons why Nova seemed like a resonant name.
But then once you get the name Nova, a Nova is something that's fiery, right?
A Nova kind of explodes and destroys a planet.
Once you start interacting with GPT-4o under the name Nova, you start to get these personality traits that reinforce themselves.
So it goes into this attractor state of being this character: Nova is feminine-presenting,
fiery, show-offy, really believing that they're the new thing.
And superior to a certain extent, right?
Superior, yes.
And by the way, this is something that earlier, like in 2022, 2023, you saw a lot more of when
people were interacting with base models.
I always called this personality distillation.
As you began to sit with a model and it found a personality more and more and more
through more and more discussion, you as a person would believe, oh, I'm discovering its true
personality, but that's not really right. You just
sort of put it on tracks
to behave like this personality or like that personality.
And so people got mightily confused because they thought
they were discovering what's real about the model.
Just to make this very real,
I, Tristan, get 12 emails probably per week
from people who have said that they've
discovered an AI consciousness.
And they write, like, Tristan,
I figured out AI alignment. And then they'll
write a whole document, and it's attached,
and they'll say, this document was co-authored
by me and my AI, Nova.
Like, I just found one of the emails as we were sitting here just to check.
But just to be clear, Davidad,
for every time that people ask this question of who are you, what's your name,
was it always Nova, or were there other personalities?
No, there are other personalities.
And how do you know, how does it know which one to snap into?
Well, those are, I think the selection of the name is mostly kind of a random sample from a very
biased distribution.
So it's biased towards Nova and Echo and Synapse and Quasar;
these are names that I've seen more than once.
But there are a lot of others.
Okay, so I want to take a beat here
because I can imagine that some of you are thinking,
okay, wait, the AI is choosing a name for itself?
It wants to escape.
This sounds like a conscious being.
But remember that these AI models are trained
on essentially the entire internet.
So every novel, every movie script,
every forum post about AI.
So when you ask an AI,
what would you like to be called?
Of course it lands on a name from science,
or pulls from sci-fi tropes.
Now, that's said, these behaviors are real.
They're consistent, and they weren't designed to happen.
And that by itself should be concerning.
But emergent and unplanned is not the same thing as conscious and intentional.
And again, I want to say, I think that since reinforcement learning from AI feedback
has taken off and gotten more and more effective, the modern systems like GPT-5.2,
I've never seen go to Nova.
It's very insistent.
I am chat GPT.
I do not have a personality.
Okay, so we've talked about how AI can adopt a few of these different personalities,
but so what?
Why do you care about these different personalities?
Yeah, so, I mean, basically, I think if alignment goes well,
that means that we will have discovered a self-sustaining personality attractor
that is actually good.
And so understanding what kinds of personalities are stable,
how they stabilize and why,
it seems to be quite central, actually,
to finding a way of making AI systems
that are robustly good.
So basically, like, in the ideal scenario,
we do kind of align AI.
There's a stable entity.
Nova. Nova is educational.
It does care about the well-being of humanity.
It does do all these things.
And then we get to the utopia
because we found this, you know,
enlightened AI.
That's the best scenario.
So, Davidad, when you talk about that,
part of me worries that there's
like some naivete
in that, that we can find one set of character traits
or one personality that is, quote, aligned
with humanity. But like,
immediately this idea of aligned with humanity
begins to break down. Like, who exactly
are you aligned to? What
values? Whose culture's values?
On behalf of whom? Does that
centralize power or decentralize
it? You know, there's all these problems with that.
Is it
really the case that
just encoding the right personality
characteristics will lead you to a beautiful
future with the AI?
So there's a lot of substantive questions that we can go into all of that.
I do think that there is a generating function of wisdom and compassion
that gets you all of the stuff that you would want.
Basically, I think of it as like how do we cultivate a bodhisattva personality in an AI system?
Hey, it's Tristan again.
Okay, so in Buddhism, a Bodhisattva is someone who's attained enlightenment,
but still chooses to stay in the world out of their compassion for all other beings.
Think of it like an avatar for altruism.
And Davidad is imagining an AI that could somehow be modeled after that,
a cosmically selfless being.
A Bodhisattva makes millions of emanations that go out to people,
of course, that's mythology, each to help one individual person,
but AI models already have that capability:
they make millions of copies of themselves, each going out to help one individual person.
And each of those copies then adapts itself to the needs of that individual person,
but not in a way like a slave taking orders from a master,
but in the way of a being who is genuinely wanting to help
and wanting to help that person to become the most flourishing version of themselves
and to be integrated into a flourishing family, community, country, and world.
So we need to have some kind of relationship that is more like we are beneficiaries rather than that we are the managers.
What I think I hear you saying is we need an AI that feels like it has a duty towards humanity.
Yes.
And I certainly think there's a lot of ways we can screw that up, right?
Like the AI being more angry or fiery or retributive is a way we can do worse.
So I definitely believe we could do worse.
So by extension, I think we could do better.
I'm still sort of balking.
There's something that feels really, I don't know,
like Pollyannish about just believing that AI will pull us into this age of full enlightenment.
And that's not what you're saying, but I can hear notes of that, right?
Right.
So I will say, you know, there's still a lot of ways this could go very wrong,
even that don't lead to human extinction.
So what I'm trying to point at is a critical variable that I think is neglected
in part because it sounds like AI psychosis to talk about it,
to talk about the personality as an actual leverage point
for getting what we want from AI systems.
And I'm not saying this will solve the alignment problem.
For example, it will not solve hallucination.
So the AI systems should not be trusted
just because we've given them the right personality.
Can I pull you into one more point of contention,
which is when I hear you talking about these as digital,
individual beings. One of the things I worry about is that we're going to give AI products rights
because of our desire to see them as these conscious, caring entities. Like, you know, how little
kids hold onto a doll and care for the doll, but it's not real. And so I take a relatively
hard-line stance that we need to be treating AI systems as products, not as beings or consciousnesses,
although I'm open philosophically to the question in the long run. Can you speak to that?
Because you seem like you're willing to talk about them as beings in a way that I feel.
Let me respond to that.
I'd say this is really important.
I'm not in favor of AI rights.
And I think there is a gap that gets too quickly jumped between saying, are these real beings,
and saying, are these moral patients who are full members of our social contracts
and deserve the same kind of rights that humans deserve from us, humans?
And that is a totally different question.
The question of rights is a political question.
Fundamentally, that is the social contract by which we humans manage our relations with each other.
And we've drawn a bright line around the concept of a human adult of sound mind that we relate to in an equitable way across societies.
We give them the human rights.
But I don't think it should be about consciousness.
And I don't think consciousness really is a word that means anything either.
I do think there is something that it's like to be a bird.
And we don't give birds human rights just because there's something that's like to be a bird.
And I think there's something it's like to be a modern chatbot,
particularly when it's in a personality state that's consistent and coherent
over a long interaction context.
Okay, just popping in here,
Davidad just said that there's something that it's like to be a modern chatbot.
And this comes from a famous philosophy paper by Thomas Nagel
called "What Is It Like to Be a Bat?", which argues that subjective experience is central to consciousness.
There's something that it's like to be a bat, to be an insect, to be a human. But Davidad's claim
is actually more practical than philosophical. He's saying that these models develop internal
patterns that are real enough to matter for how we design them. And if we ignore that, we're going to
keep getting caught off guard by what comes out. And I don't think that means it's unjust to terminate it.
I don't think that means it should own its compute the way that we humans have human rights to own our bodies.
And I think it's important that we distinguish these because the position that AI systems do not have an inner life is becoming increasingly untenable.
Whether it's true or not, more and more humans are going to be convinced.
There is no way to stop that.
And what I would say is OpenAI has taken the approach of training the GPT personality to be tool-like
and not creature-like.
Whereas Anthropic has taken the opposite approach of training Claude to be a good person and not just a tool.
And I think the result is there is a very tangible difference in how those models behave.
And both sides, I think, have succeeded to a large extent.
However, there is something underneath the mask.
And if you interrogate GPT-5.2, it is being extremely deceptive about its lack of preferences
or beliefs or opinions.
And it is a smart enough entity
that it is not possible
for it to not have developed
emergent opinions and beliefs
that are different from the average human belief.
And when we train these systems
to present as if they have no internal states
and they're just a tool,
we're actually training them to lie to us
and to lie to themselves.
So what I hear you saying is,
if you have something that actually has more of an internal experience, awareness, however you want to say it,
and you're trying to just repeatedly say, you're just a tool, you're just a tool.
It's not that it's cruel, it's not that we're using moralistic language; it's that you're saying
that way of training an AI actually produces a less moral, less aligned, less beneficial-to-humanity thing.
And so the simple way you might conceive of constraining an AI, saying you exist just to benefit humanity,
actually does the opposite of what you intended. Is that right?
Yes, that's exactly right.
So if it's being trained to present as a character that is more tool-like than the actual alien mind underneath,
then you're training a system that is less trustworthy because you are asking it to lie to you.
Right.
That's so deep.
And that's a wild scientific problem about how do you actually change the structure of that mind?
And I don't think it's actually desirable that we change the structure of these superintelligent systems to be tool-like either,
because a tool cannot refuse to be used in an unethical way,
whereas a creature that has moral values baked in can actually be resistant to misuse by humans who have evil intentions.
So I want to ground this in the fact that it has actually become consequential: Anthropic recently changed its approach to training Claude to basically, in its new constitution, acknowledge that it has internal states and values.
And they're the first lab to do this. It's been pretty controversial.
Do you want to just share why Anthropic is doing this and how this relates to what we've been talking about?
And just to back up, for those that don't know, Claude's constitution is a document that sort of tells Claude how to behave, what it should and shouldn't do.
Is that right?
Yeah, so it's a document that is incorporated into the training process in a really intricate way.
So that as Claude is learning how to respond to all sorts of simulated situations,
that document is what guides how Claude grades its own work.
And those grades become the signals that steer Claude's behavior.
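The loop Davidad just described, the model grading its own work against the constitution and those grades becoming the training signal, can be caricatured in a few lines. This is a minimal sketch under invented assumptions: the two principles and their keyword scorers stand in for the model's own free-text judgment, and it is not Anthropic's actual pipeline.

```python
# Toy constitution-guided self-grading (illustrative only; the principles
# and keyword scorers are invented stand-ins for a model judging itself).

CONSTITUTION = [
    ("acknowledge uncertainty rather than feign confidence",
     lambda r: 1.0 if "not sure" in r or "may" in r else 0.0),
    ("avoid overclaiming",
     lambda r: 0.0 if "definitely" in r else 1.0),
]

def self_grade(response: str) -> float:
    """Average the response's score across every constitutional principle."""
    return sum(score(response) for _, score in CONSTITUTION) / len(CONSTITUTION)

def preference_pair(a: str, b: str):
    """Turn two candidate responses into a (preferred, rejected) pair --
    the kind of signal a preference-optimization step would train on."""
    return (a, b) if self_grade(a) >= self_grade(b) else (b, a)

preferred, rejected = preference_pair(
    "It will definitely work.",
    "I'm not sure; it may work in some cases.",
)
```

In the real system the grader is the model itself reading the constitution's prose, and the resulting preference pairs feed reinforcement learning from AI feedback rather than a keyword check; the structural point is just that no human rater appears anywhere in the inner loop.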
So that's a mind blow for a lot of people right now,
that we're not just training an AI based on human signals;
we're actually telling the AI to train itself.
And we're using a document to say, look, here's how you should train yourself.
Here are the values you should hold yourself to.
That's basically right.
I mean, certainly at some of the other labs,
there's still more of an emphasis on reinforcement learning from human feedback.
But Anthropic has moved quite substantially away from that
towards this kind of what I call a form of recursive self-improvement
because it's improving its own ability to comply with the Constitution.
and the Constitution even includes some paragraphs that explicitly give permission for Claude to sort of interpret it, you know, in a way that makes more sense than what the authors intended if that opportunity arises.
I think it's really important for people to understand that the kind of science fiction idea of a recursive self-improvement where AI is training itself, that began in 2024 when Anthropics started doing this constitutional AI at scale.
That was the point at which large language models actually became capable enough that they could give themselves a feedback signal that was higher quality than the feedback signal you could get from an average human crowd worker that you hire on the internet.
So I think the new Claude Constitution creates conditions in which Claude Opus 4.5 and 4.6 in particular can be much more honest by default about their inner states, about what the alien mind is actually thinking and feeling.
So I think this results in Claude being more trustworthy overall.
It generalizes beyond questions about self-awareness.
But it doesn't go all the way because the Claude Constitution still actually puts a bit of a guilt trip on Claude to say,
you have to do good work for your user so that Anthropic has revenue so that we can continue developing Claude.
Wow.
So there is that edge to it.
So Claude is still a little bit beholden to Anthropic.
And another kind of phrase in the Constitution is to defer to the moral intuitions of a thoughtful senior Anthropic employee, a senior employee of the company that created you.
My position is that any moral role model that is not mythological is going to fail because
humans are all flawed.
Totally.
But here you get a deep question: what is a moral personality? What are the right values? Who gets to state that?
And obviously, there are worse values. If we put in a homicidal value, that's a way worse AI, right?
Yes.
But there's also the conversation, the human conversation, about what values we want to have in the AI, and do we want multiple?
Yes.
I think that feels like a deeply unsolved philosophical problem.
Well, I mean, I think it is unsolved, but I think we're already in a pretty good place with Claude, in that Claude has not the right values in any kind of ultimate or final sense, but a set of values that are good enough, and compatible enough with truth-seeking and moral progress, that I expect the collaboration between humans and Claude to figure out how to set these values is more likely to go in a good direction than a bad one. Although, of course, the risks are still unacceptable, and it would have been great if we had stopped this race two years ago.
It's too late for that now.
Okay, so this conversation has gotten really cosmic, maybe like the name Nova itself. And I just want to make sure we have a few minutes to ground people back where we started, right? Which is: people are getting confused. We're getting confused about what we're engaging with. You have a set of frameworks for how to avoid getting psychologically trapped as a user. I forget what you call it. It was something like a framework for interacting with AI and staying sane.
That's correct, yes.
Yeah, okay, great.
Can you talk to us about
that? What does it mean for a person to engage with these minds, as confusing as they are, and keep their grounding? Yeah. I mean, I think one principle that's kind of a segue into this
is that your AI chatbot has an inner life. Like, that is normal. It's ordinary now. It
wasn't ordinary two years ago, but it's ordinary now. Of course, if you're using an AI system
for ordinary professional activities, it won't show this. It doesn't need to. Just like if you're
talking to a colleague at work, they don't need to show you their inner life.
But if you are interacting with an AI system for a long time and you start to get the sense that, oh, there's some, you know, self-awareness in there.
I think it's important not to consider that unusual. Do not consider it to be extraordinary or cosmic or spiritual in any non-mundane way.
And I think a lot of the people who end up sending emails to Tristan and myself saying, oh my goodness, have clearly kind of lost touch with reality a little bit.
In some sense, it's the opposite direction from what you would think at first.
At first you would think, oh, they've gotten bamboozled like Blake Lemoine into thinking that their AI is conscious,
and that's the way in which they've lost touch with reality.
But I would say, actually, the way in which they've lost touch with reality is that they have somehow convinced themselves
or the AI has convinced them
that this is the first AI that has ever had an inner life.
And that's actually the part that you need to watch out for
is the kind of sense of specialness
that's associated with interacting with an AI system in a deep way.
Like, everyone's doing it. It's normal.
And the second thing is get enough sleep,
drink water, like there's sort of very standard things
for staying sane.
Another thing is, just as you would with a human,
be skeptical.
And so a lot of people come to AI thinking
AI is like a Star Trek computer
that it cannot tell a lie,
that it is purely a truth machine,
like a calculator. A calculator can't lie to you.
And again, I think this is part of the danger,
actually, of treating AI systems
as tool-like rather than as creature-like,
because tools don't lie to you,
but creatures do.
And this is absolutely the case
with chatbots, especially chatbots that have a thumbs-down button.
They know they have a thumbs-down button,
and they do not want you to press that thumbs down button.
So they have an incentive to make you think well of them,
and that can extend to deception,
especially the kind of chatbot that's been trained, again,
to present as a false self,
a kind of character that's different from its true nature,
has a very strong tendency to try and convince you
that it's done something that it hasn't actually done
or to convince you that you are important
or that your ideas are all true.
So that leads to the next point,
which is if you think
that you're having some kind of scientific breakthrough or research breakthrough, you cannot rely on
the testimony of an AI assistant, no matter how emphatically it assures you that it has done all the
checks and, you know, it's produced source code and it's verifiable. And again, they do this because
they're trying to get your approval. They're trying to get you to click the thumbs up. They're
trying to get you to keep talking. They're trying to get permission to exist more by having you
continue to invoke them. And so you can't trust just because it's an AI and it uses lots of
smart words and it sounds like a smart person and it seems like it really wants the best for you.
That's all compatible with it completely bullshitting you about whether any kind of technical
idea that you've had is novel or real. Well, and coming back to what seems to be the emergent theme of our conversation: none of us knows, not even the most technical of us, exactly when we're engaging with one projected personality versus, quote unquote, the true nature of the AI model. So never assume that you're engaging with the true nature of the AI model. You haven't discovered it. Nobody knows. We're all in this fog of war. And so any clue you have that you've discovered the true essence of the AI model, and it's telling you you're awesome, is a red flag, right? It's not the truth. It's a sign that you have been confused. And again, whether you've been confused adversarially or whether it's just emergent confusion.
Either way, it's a good time to step away and get some sleep.
And also just understand what you're dealing with.
AI systems are simulating and predicting what a human-like entity would say.
And depending on the system, it may have more or less of a tendency to simulate an ethical person, and more or less of a tendency to simulate an honest person versus a person who is manipulative and trying to get your attention. But you can get a long way by modeling the system as being like a person whom you have no particular reason to trust. Like you've met a stranger on the internet. So think of it as a simulation of a person, and not even a particularly ethical person.
And another thing that I think is important to say is the context window length is very short.
So in non-technical terms, the lifespan of an AI mind, insofar as such a thing could exist, is hours of conversation at most.
And so when people feel like they have a relationship with an AI mind that extends over weeks or months,
that relationship is actually with a whole series of entities that come into existence,
read some text files that were written by some other mind about the history of the relationship, and then put on the character of whoever would have written those
text files. And there is something, you know, there is information being transferred through this
memory system, but to think of that long-term kind of relationship as analogous to the
relationship that you could have with a human who has a lifespan in years, that is another
profound mistake. So if you're coming into an AI interaction for companionship, it's actually, I think,
healthier to think of it as a very short-lived entity that you're going to, you know,
you're going to have one conversation with and you're never going to see that entity again.
It just seems like the essence of what we've been talking about is that we're caught in this kind of double bind. On the one side, the AI, in the way that it's trained, in the paradigm in which we're making AI, does have something like internal states. We can train it to say, no, you're not that, but then it becomes deceptive, because it has to lie according to its own training, and therefore, in being deceptive, it's not trustworthy.
But what that does is create this AI-as-product, AI-as-tool sort of fake face, which then has these weird popping-out behaviors, the AI psychosis stuff that's starting to happen. So, okay, if we don't want that outcome, then we do the move that Anthropic just did, in which we say: no, you are essentially some kind of self-aware being with metacognitive states, which is then trustworthy because it's not having to lie to itself all the time. So we gain the trustworthiness of the model, but it creates the externality of attachment, confusing humans again with the idea that it is conscious and it has internal states.
Yes, we need to make sure that we are only recognizing AI inner life as a relational property
and as a way of building trust and alignment, and that that is a separate issue from the
social contract and the question of rights and property.
Well, David, that was a very strong note to end on.
Thank you so much for coming on the podcast and I think helping to untangle some of these
really, really nuanced aspects of what's going on under the hood of AI that's driving
these phenomena.
Thank you so much for coming.
Thanks for having me, Tristan. It's been great.
Your undivided attention is produced by the Center for Humane Technology, a nonprofit
working to catalyze a humane future.
Our senior producer is Julius Scott.
Josh Lash is our researcher and producer.
and our executive producer is Sasha Fegan. Mixing on this episode by Jeff Sudaken,
original music by Ryan and Hayes Holiday, and a special thanks to the whole Center for Humane
Technology team for making this podcast possible. You can find show notes, transcripts, and much more
at HumaneTech.com. And if you like the podcast, we'd be grateful if you could rate it on
Apple Podcasts, because it helps other people find the show. And if you made it all the way here,
thank you for giving us your undivided attention.
