a16z Podcast - Emmett Shear on Building AI That Actually Cares: Beyond Control and Steering
Episode Date: November 17, 2025
Emmett Shear, founder of Twitch and former OpenAI interim CEO, challenges the fundamental assumptions driving AGI development. In this conversation with Erik Torenberg and Séb Krier, Shear argues that the entire "control and steering" paradigm for AI alignment is fatally flawed. Instead, he proposes "organic alignment" - teaching AI systems to genuinely care about humans the way we naturally do. The discussion explores why treating AGI as a tool rather than a potential being could be catastrophic, how current chatbots act as "narcissistic mirrors," and why the only sustainable path forward is creating AI that can say no to harmful requests. Shear shares his technical approach through multi-agent simulations at his new company Softmax, and offers a surprisingly hopeful vision of humans and AI as collaborative teammates - if we can get the alignment right.
Resources:
Follow Emmett on X: https://x.com/eshear
Follow Séb on X: https://x.com/sebkrier
Follow Erik on X: https://x.com/eriktorenberg
Stay Updated: If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://x.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Most of AI is focused on alignment as steering.
That's the polite word.
If you think that we're making beings,
you'd also call this slavery.
Someone who you steer, who doesn't get to steer you back,
who non-optionally receives your steering,
that's called a slave.
It's also called a tool if it's not a being.
So if it's a machine, it's a tool,
and if it's a being, it's a slave.
Like, we've made this mistake enough times at this point.
I would like us to not make it again.
You know, they're kind of like people,
but they're not like people.
Like, they do the same thing people do.
They speak our language.
They can, like, take on the same kinds of tasks.
Well, like, they don't count.
They're not real moral agents.
A tool that you can't control, bad.
A tool that you can control, bad.
A being that isn't aligned, bad.
The only good outcome is a being that is, that cares,
that actually cares about us.
I've been thinking about a line that keeps showing up in AI safety discussions,
and it stopped me cold when I first read it.
We need to build aligned AI.
Sounds reasonable, right?
Except, aligned to what?
Aligned to whom?
The phrase gets thrown around like it has an obvious answer,
but the more you sit on it, the more you realize you're smuggling in a massive assumption.
We're assuming there's some fixed point, some stable target we can aim at, hit once, and be done.
But here's what's interesting.
That's not how alignment works anywhere else in life.
Think about families. Think about teams.
Think about your own moral development.
You don't achieve alignment and then coast.
You're constantly renegotiating, constantly learning, constantly discovering that what you thought was right turns out to be more complicated.
Alignment isn't a destination. It's a process. It's something you do, not something you have.
And this matters because we're at this inflection point where the AI systems we're building are starting to look less like tools and more like something else. They speak our language. They reason through problems. They can take on tasks that used to require human judgment.
And the question everyone's asking is: how do we control them? How do we steer them? How do we make sure they do what we want?
But there's another way to see it. What if the control paradigm is the wrong framework entirely? What if trying to build a superintelligent tool you can perfectly steer is not just difficult but fundamentally dangerous, whether you succeed or fail? If you can't control it, obviously that's bad. But if you can control it perfectly, you've just handed godlike power to whoever's holding the steering wheel. And humans, even well-meaning ones, don't have the wisdom to wield that kind of power safely. So what's the alternative? Well, think
about how we actually solve alignment problems in the real world. We don't control
other people, we don't steer them, we raise them, we teach them to care. We build relationships
where they do right by us, not because we're forcing them, but because they learn to value
the relationship itself. That's organic alignment. Alignment that emerges from genuine care,
from theory of mind, from being part of something larger than yourself. Emmett Shear has spent the last
year and a half working on exactly this problem at Softmax. And what makes his approach distinctive
is that he's not trying to solve alignment by building better control mechanisms. He's trying to
solve it by building AI systems that can learn to care, that can develop the kind of theory
of mind that lets them be good teammates, good collaborators, good citizens, not tools that
follow orders, but beings that understand what it means to be part of a community. That can raise
some uncomfortable questions. What if we're building beings and not tools? What does that
mean for how we treat them? What does it mean for their rights? And how do you even know if they
succeeded? How do you measure whether something genuinely cares versus just simulating care really well?
Today, Séb Krier from Google DeepMind and I are sitting down with Emmett to explore those questions.
Séb leads AGI policy development at DeepMind, so he brings a perspective from inside one of the labs actually building these systems.
But really, we're investigating something deeper.
What does it actually take to build AI systems that can participate in the ongoing, never-finished process of figuring out how to live together?
By the end, you'll understand not just Softmax's technical approach,
but a completely different way of thinking about what alignment is and what it could become.
Emmett Shear, welcome to the podcast.
Emmett, Séb, welcome to the podcast. Thanks for joining.
Thank you for having me.
So, Emmett, with Softmax, you're focused on alignment and making AIs organically align with people.
Can you explain what that means and how you're trying to do that?
When people think about alignment, I think there's a lot of confusion.
People talk about things being aligned.
We need to build an aligned AI.
And the problem with that is when someone says that,
it's like, we need to go on a trip.
And I'm like, okay, I do like trips,
but like, where are we going again?
And with alignment, alignment takes an argument.
Alignment requires you to align to something.
You can't just be aligned.
You could be aligned to yourself.
But even then, you'd want to tell people
that what I'm aligning to is myself.
And so this idea of an abstractly aligned AI,
I think, slips a lot of assumptions past people
because it sort of assumes
that there is one obvious thing to align
to. I find this is usually the goals of the people who are making the AI. That's what they mean
when they say they want to make it aligned: I want to make an AI that does what I want it to do. That's what
they normally mean. And that's a pretty normal and natural thing to mean by alignment. I'm not
sure that that's what I would regard as, like, a public good. Right. Like I guess it depends on who it is.
If it was like Jesus or the Buddha was like I am making an aligned AI, I'd be like, okay, yeah, align to you.
Great. I'm down. Sounds good. Sign me up. But most of us, myself included, I wouldn't describe as
being at that level of spiritual development
and therefore perhaps
want to think a little more carefully
about what we're aligning it to.
And so when we talk about organic alignment,
I think the important thing to recognize
is that alignment is not a thing,
it's not a state, it's a process.
This is one of these things
that's broadly true of almost everything, right?
Is a rock a thing?
I mean, there's a view of a rock as a thing,
but if you actually zoom in on a rock really carefully,
a rock is a process.
It's this endless oscillation between
the atoms, over and over and over again,
reconstructing the rock over and over again.
The rock's a really simple process
that you can kind of like coarse grain
very meaningfully into being a thing.
But alignment is not like a rock.
Alignment is a complex process.
And organic alignment is the idea of treating alignment
as an ongoing sort of living process
that has to constantly rebuild itself.
And so you can think of the way that
how do people and families stay aligned to each other,
stay aligned to a family?
And the way they do that is you don't like arrive at being aligned.
You're constantly re-knitting the fabric that keeps the family going.
And in some sense, the family is the pattern of re-knitting that happens.
And if you stop doing it, it goes away.
And this is similar for things like cells in your body, right?
Like, it isn't that your cells align to being you and then they're done.
It's this constant, ever-running process of cells deciding: what should I do, what should I be,
do I need to take on a new job?
Should we be making more red blood cells?
Should we be making fewer of them?
You aren't a fixed point
so there is no fixed alignment.
And it turns out that our society is like that.
When people talk about alignment,
what they're really talking about, I think,
is I want an AI that is morally good.
Right?
That's what they really mean.
It's like this will act as a morally good being.
And acting as a morally good being
is a process and not a destination.
Unfortunately, we've tried taking down tablets
from on high
that tell you how to be a morally good being,
and we use those, and they're maybe helpful,
but somehow they are not enough. Like,
you can read those and try to follow those rules
and still make lots of mistakes.
And so I'm not going to claim I know exactly what morality is,
but morality is very obviously an ongoing learning process
and something where we make moral discoveries.
Like, historically, people thought that slavery was okay,
and then they thought it wasn't,
and I think you can very meaningfully say
that we made moral progress,
we made a moral discovery by realizing that's not good.
And if you think that there's such a thing as moral progress,
or even just learning how better to pursue the moral goods we already know,
then you have to believe that alignment, aligning to morality,
being a moral being is a process of constant learning
and of growth to re-infer what should I do from experience.
And the fact that no one has any idea how to do that,
should not dissuade us from trying
because that's what humans do.
Like, it's really obvious
that we do this, right?
Somehow, just like we used to not know
how humans walked or saw.
Somehow, we have experiences
where we're acting in a certain way,
and then we have this realization,
I've been a dick.
That was bad.
I thought I was doing good,
but in retrospect, I was doing wrong.
It's not like random.
Like, people have the same,
actually, there's like a bunch of classic
patterns of people having that realization. It's like a thing that happens over and over again.
So it's not random. It's like a predictable series of events that look a lot like learning
where you change your behavior and often the impact of your behavior in the future is more
pro-social and that you are better off for doing it. And like, so I'm taking a very strong moral
realist position. There is such a thing as morality. We really do learn it. It really does matter.
And organic alignment means that it's not something you finish. In fact, one of the key moral
mistakes is this belief: I know morality. I know what's right. I know what's wrong. I don't need to learn
anything. No one has anything to teach me about morality. That's arrogance. And that's one of the main
moral things you can do that's dangerous. And so when we talk about organic alignment, organic
alignment is aligning an AI that is capable of doing the thing that humans can do, and that to some
degree, I think, animals can do at some level, though humans are much better at it: the learning of
how to be a good family member, a good teammate, a good member of society, a good member of all
sentient beings, I guess, how to be a part of something bigger than yourself in a way that is
healthy for the whole rather than unhealthy. And Softmax is dedicated to researching this. And I think
we've made some really interesting progress. But like the main message, you know, that I go on podcasts like
this to spread, the main thing that I hope Softmax accomplishes above and beyond anything else, is like
to focus people on this as the question. This is the thing you have
to figure out. If you can't figure out how to build, how to raise a child who cares about
the people around them, if you have a child that only follows the rules, that's not a moral
person that you've raised. You've raised a dangerous person actually who will probably do great
harm following the rules. And if you make an AI that's good at following your chain of command
and good at following your whatever rules you came up with for what morality is and what good
behavior is, that's also going to be very dangerous.
And so that's the bar.
That's what we should be working on.
And that's what everyone should be committed to, like, figuring out.
And if someone beats us to the punch, great.
I mean, I don't think they will, because I'm, like, really bullish on our approach.
I think the team's amazing.
But, like, this is, it's maybe, it's the first time I've run a company where truly,
I can say with a whole heart, if someone beats us, thank God.
Like, I hope somebody figures it out.
Yeah.
yeah I mean it's yeah I have a lot of you know similar intuitions about certain things like
I also dislike the you know the idea that kind of you know we just need to like crack the few kind of
values or something just cement them in time forever now and you know we've kind of solved morality
or something and I've always kind of been skeptical about you know how the alignment problem
has been conceptualized as something to kind of solve once and for all and then you can just you know
do AI or do AI but the um I guess I understand it in a slight
different way, I guess maybe less based on kind of moral realism, but, you know, there's
kind of the technical alignment problem, which I kind of think of broadly as how to get an AI to do
what you, you know, how do you get it to follow instructions, like, you know, broadly speaking.
And I think that was, you know, more of a challenge, I think pre-LLMs, I guess, when people
were talking about reinforcement learning and looking at these systems, whereas post-LLMs, we've realized
that many things that we thought were going to be difficult are somewhat easier.
And then there's a kind of second question, the kind of normative question of to whose values,
what are you aligning this thing to, which I think is the kind of thing you're commenting on.
And for this, I tend to be very skeptical of approaches where, you know, you need to kind of crack the kind of ten commandments of alignment or something, and then we're good.
And here, I think I have like intuitions that are unsurprisingly a bit more like political science-based or something and that, like, okay, it is a process.
And I like the kind of bottom-up approach to some degree of, well, how do we do it in real life with people?
No one comes up with, you know, I've got this.
and so you have processes that allow
ideas to kind of clash
with people with different ideas, opinions, views
and so have to kind of coexist as well as they can
within a wider system
and like you know, with humans
that system is liberal democracy or something
and at least in some countries
and that allows more of that kind of
these kind of ideas
these values to be kind of discovered
and construed over time
and I think for alignment as well
I tend to think yeah there's on the normative side
I agree with some of your
intuitions. I'm less clear about what exactly it looks like now to go and implement
this in an AI system. These are the ones we have to do that.
I agree that there's this idea of
technical alignment that I think I would might have to define a little differently, but it's sort of
the sense of like, if you build a system, can it be described as being coherently goal following
it all? Regardless of what those goals are, like lots of systems aren't coherently, they're not
well described as having goals. They just kind of do stuff. And if you're going to have something that's
like aligned, it has to have coherent goals, otherwise those goals can't be aligned with anyone
else's goals, kind of by definition. Is that sort of, is that a fair
assessment of what you mean by technical alignment?
I mean, I'm not fully sure, right? Because I think
if I give a model a certain goal, then I would like the model to kind of follow that instruction
and kind of reach that particular goal, rather than it having a goal of its own that, you know,
I can't, yeah. Well, wait, if you give it a goal, it has that goal. Right.
That's what I mean to give someone something, right?
Sure, yeah.
If I, you know, if I instruct it to do X, then I would like it to do X
and not, you know, do, like, different variants of X, essentially.
I wouldn't want it to reward hack.
I wouldn't do some...
Well, but when you tell it to do X, you're transferring, like, a byte string
in a chat window or like a series of audio vibrations in the air, right?
You're not transplanting a goal from your mind into its.
You're giving it an observation that it's using to infer your goal.
Yeah, I mean, in some sense, yeah, I can communicate a series of instructions and I want it to infer what I'm saying essentially as accurately as it can, given what it knows of me and what I'm asking.
You want it to infer what you meant, right?
Like, that's, because in some sense, the byte sequence that you send over the wire to it has no absolute meaning.
It has to be interpreted, right?
Like, that byte sequence could mean something very different under a different codebook.
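The codebook point can be made concrete with a tiny sketch. This is an illustration added for this write-up, not something shown in the episode: it uses two standard text encodings as stand-ins for "codebooks" and reads the same single byte under each.

```python
# Illustrative only: two text encodings stand in for two "codebooks".
# The byte itself is identical; only the interpretation changes.
message = bytes([0xE9])

as_latin1 = message.decode("latin-1")  # read under the Latin-1 codebook
as_cp437 = message.decode("cp437")     # read under the old IBM PC codebook

print(repr(as_latin1), repr(as_cp437))
```

The two results are different characters, even though nothing about the transmitted byte changed. The receiver's codebook, not the byte, determines the meaning.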
Yeah, well, I guess one way, you know, I think I remember in, when I was first getting into AI and, you know, these kind of questions maybe like a decade ago.
So you had these examples of, you know, I think it was Stuart Russell in the textbook, we'll give the AI a goal, but then it won't exactly do what you're asking it, right?
You know, clean the room and then it goes and cleans the room, but takes the baby and puts it in the trash.
Like, this is not what I meant.
Like, wait, hold on, this is the thing where I think people, you have to, like, you were
jumping over a step there. You didn't give the AI a goal. You gave it a description of a goal.
A description of a thing and a thing are not the same. I can tell you an apple and I'm evoking
the idea of an apple, but I haven't given you an apple. I've given you a, it's red, it's shiny,
it's the size. That's a description of an apple, but it's not an apple. And giving someone,
hey, go do this. That's not a goal. That's a description of a goal. And for humans, we're so fast,
we're so good at turning a description of a goal into a goal. We do it. We do it so
quickly and naturally. We don't even see it happening. Like, we think that we get confused,
and we think those are the same thing. But you haven't given it a goal. You've given it a description
of a goal that you want it to, you, you hope it turns back into the goal that is the same as the
goal that you, you described inside of you. Right. You could give it a goal directly by reading
your brainwaves and synchronizing its state to your brainwaves directly. I think that would
meaningfully, you could say, okay, I'm giving it a goal. I'm synchronizing its
internal state to my internal state directly, and this internal state is the goal, and so now it's
the same. But most people don't mean that when they say they gave it a goal.
Sure.
And is this distinction you're making, Emmett, important because there's some lossiness
between the description and the actual, or what? Why the distinction?
It goes back to what
I was saying. Technical alignment, I put forward, right,
I want to check if we're, like, on the same page about it, is the capacity of an AI to be
good at inference about goals and like be good at inferring from a description of a goal
what goal to actually take on and good at once it takes on that goal acting in a way that
is actually in concordance with that goal coming about. So it is both pieces. You have to
have the theory of mind to infer, from that description of a goal that you've got,
what goal that would correspond to. And then you have to have the theory of the world to understand
what actions correspond to that goal occurring.
And if either of those things breaks,
it kind of doesn't matter what goal you were,
if you can't consistently do both of those things,
you're not,
which I think of as being a coherent,
inferring goals from observations
and acting in accordance with those goals
is what I think of as being
a coherently goal-oriented being.
Because that's what,
whether I'm inferring those goals
from someone else's instructions
or from the sun or tea leaves,
the process is,
get some observations, infer a goal,
use that goal
to infer some actions, take action.
And
an AI that can't do that is not technically aligned,
or not technically alignable, I would even say.
It lacks the capacity to be aligned.
Because it can't, it's not competent enough.
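The loop described here (get an observation, infer a goal, use the goal to pick actions) can be sketched as a toy. Everything below is a hypothetical illustration built on the room-cleaning example from earlier in the conversation: the goal names, the prior, the likelihood table and the world model are made-up numbers, not anything Softmax has described.

```python
# Hypothetical toy: all names and numbers are assumptions for illustration.

# Theory of mind: a prior over what the speaker probably wants.
GOAL_PRIOR = {
    "tidy room, baby safe": 0.9,
    "empty room of everything": 0.1,
}

# How likely each goal is to produce a given instruction. The words alone
# do not separate the two goals; only the prior does.
LIKELIHOOD = {
    "clean the room": {
        "tidy room, baby safe": 0.8,
        "empty room of everything": 0.8,
    },
}

# Theory of the world: which action leads to which outcome.
WORLD_MODEL = {
    "put toys on shelf": "tidy room, baby safe",
    "throw everything out": "empty room of everything",
}


def infer_goal(instruction):
    """Pick the goal with the highest prior-times-likelihood score."""
    scores = {
        goal: GOAL_PRIOR[goal] * LIKELIHOOD[instruction][goal]
        for goal in GOAL_PRIOR
    }
    return max(scores, key=scores.get)


def choose_action(goal):
    """Return the first action whose predicted outcome matches the goal."""
    for action, outcome in WORLD_MODEL.items():
        if outcome == goal:
            return action
    return None


def act_on(instruction):
    """Observation -> inferred goal -> chosen action."""
    goal = infer_goal(instruction)
    return goal, choose_action(goal)


print(act_on("clean the room"))
```

In this toy, a bad prior breaks goal inference and a bad world model breaks action selection, which mirrors the two separate failure points named above.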
And you think language models don't do that well?
As in, they kind of fail at that or they're not?
People fail at both those steps all the time.
Right.
I tell people, I tell employees to do stuff and like, yeah.
But then, people fail at, like, breathing all the time too.
And I wouldn't say that we,
can't breathe. I just say that we're like not gods. Like we are, yes, we are imperfectly,
we are somewhat coherent, relatively coherent things. Just like we're, am I big or am I small?
Well, I don't know, compared to what? I'm, humans are more relatively goal coherent than any other
object I know of in the universe, which is not to say that we're 100% goal coherent. We're just
like more so. And I think you're never going to get something that's perfect;
the universe doesn't give you perfection.
It gives you relatively some amount.
It's a quantifiable thing, how good you are at it,
at least in a certain domain.
I guess my question is like,
do you think that, does that capture what you're talking about
with technical alignment?
Or are you talking about a different thing?
Yeah, no, I think...
I really care a lot about that thing.
Yeah, I mean, I definitely care about that to some extent.
I might understand it slightly differently,
but I guess I might think of it through the lens
of maybe principal agent problems or something.
You know, you kind of instruct someone,
even, you know, I guess in human terms,
to do a thing
are they actually doing the thing
what are their incentives and motivations,
and not even the intrinsic ones,
but those given by their situation,
to actually do the thing
you've asked them to do
and in some instance
sorry yeah
there's a third thing
So for principal-agent problems,
I would expand what I was saying
in another part which is like
you might already have some goals
and then you inferred this new goal
from these observations
and then like are you good at
are you good at balancing
the relative importance
and relative threading of these goals
with each other
which is another skill you have to have.
And if you're bad at that, you'll fail.
You could be bad at it because you overweight bad goals
or you could be bad at it because you're just incompetent
and can't figure out that obviously you should do goal A before goal B.
It feels like a version of common sense or something, right?
Like the kind of thing that, you know, in fact,
in the kind of robot cleaning the room example thing,
you know, you would expect the robot to have understood that goal,
to essentially not put the baby in the trash can or something
and just actually do the right sequence of action.
Well, in that case, it failed the...
That robot very clearly failed goal inference.
You gave it a description of a goal,
and it inferred the wrong states to be the wrong goal states.
That's just incompetence.
It is incompetent at inferring goal states from observations.
Children are like this, too.
Like, you know, and honestly, if you ever played the, done the game
where you give someone instructions to make a peanut butter sandwich,
and then they follow those instructions exactly as you've written them,
without filling in any gaps, it's hilarious because you can't do it.
It's impossible.
Like, you think you've done it and you haven't.
And, like, they wind up putting the knife in the toaster, and, like,
they don't open the peanut butter jar, so they're just jamming the knife into the top lid
of the peanut butter jar.
And, like, it's endless.
And, like, because actually, if you don't already know what they mean, it's really hard
to know what they mean.
Like, the reason humans are so good at this
is we have a really excellent theory of mind
I already know what you're likely to ask me to do
I already have a good model
of what your goals probably are
so when you ask me to do it
I have an easy inference problem
which of the seven things that he wants
is he indicating
but if I'm a newborn AI
that doesn't have a great model
of people's internal states
then like I don't know what you mean
it's just incompetent it's not like
which is separate from
I have some other goal
And I knew what you meant, but I decided not to do it
because there's some other goal
that's competing with it, which is another thing
you can be bad at, which is, again, different than
I had the right goal, I inferred the right goal,
I inferred the right priority on goals,
and then I'm just bad at doing the thing.
I'm trying, but I'm incompetent at doing.
And these roughly correspond to the OODA loop, right?
Like, bad at observing and orienting,
bad at deciding, bad at acting.
And if you're bad at any of those things,
you won't be good.
and then I think there's this other problem that you
I like the separation between technical alignment and value alignment
which is like are you good if we told you the right goals to go after
somehow if you if you learned the right goals to go after via observation
and you like and you were trying like what goals should you have
what goals should we tell you to have what goals should we tell ourselves to have
what are the good goals to have is a separate question from
given that you got some goals indicated,
are you any good at doing it?
Which I feel like is actually, in many ways,
the current heart of the problem.
We're much worse at technical alignment
than we are at guessing what to tell things to do.
I don't know, do you think that,
does that align with how you mean technical
and value alignment or technical?
Yeah, in some sense.
I mean, certainly think that there's a,
there's something about, you know,
like an error mistake is one thing,
and then there's the, um,
um, not listening to the instruction or something.
But then, yeah, I think in the normative side,
I mean, I just think that even in real life,
ignoring AI, like, I don't know what my goals are.
And, like, I've got some broad conception of certain things.
I want to get to, you know, have dinner later or something.
Like, I know I want to go do well in my career.
But I think a lot of these goals aren't something we kind of all just know.
We kind of discover them as we go along.
It's kind of a constructed thing.
And so, and most people don't know their goals, I think.
And so, you know, I think when you have agents and you're giving them goals or whatever,
I think that should be part of the equation that, like we actually, we don't know all the goals.
And this is something that is kind of, like you say,
a process over time that is, you know, dynamic.
So I think from my point of view, there's,
goals, the kind of goals we're talking about here,
are one level of alignment.
You can align something around goals:
if you can explicitly articulate, in concept
and in description,
the states of the world that you wish to attain,
you can orient around
goals. But that only, that's a tiny
percentage of human experience can be done that way.
Many of the most important
things cannot be, cannot be oriented around that
way. And the foundation, I think, of
morality, and the foundation I think of
where do goals come from? Where do values come from?
Human beings exhibit a behavior.
We go around talking about goals, and we go around
talking about values, and like,
that's a, that's a behavior
caused by some
internal learning process.
That is based on, like,
observing the world, what's going on there, right?
I think what's happening is that there's something deeper than a goal
and deeper than a value, which is care.
We give a shit.
We care about things.
And care is not conceptual.
Care is nonverbal.
It doesn't indicate what to do.
It doesn't indicate how to do it.
Care is a relative weighting over effectively like attention on states.
It's a relative weighting over, like, which states in the world are important to you.
And I care a lot about my son.
What does that mean?
It means his states, the states he could be in, are like, I pay a lot of attention to those and those matter to me.
And you can care about things in a negative way.
You can care about your enemies and what they're doing.
And you can desire for them to do bad.
But I think that, like, you don't just want it to care about us.
You want it to care about us and like us too, right, maybe.
But, but like, but the foundation is care.
Until you care, you don't know, why should I pay more attention to this person than this rock?
Well, it's like, I care more.
And that, what is that care stuff?
And I think that what it appears to be, if I had to, like, guess what the care stuff is,
this sounds so stupid, but, like, care is basically, like, reward.
Like, like, how much does this state correlate with survival?
how much does this state correlate with your
your full inclusive
reproductive fitness
for a somewhat thing that learns evolutionarily
or for a reinforcement learning agent like a LLM
how much does this correlate with reward?
Does this state correlate with my predictive loss
and my RL loss?
Good, that's a state I care about.
I think that's kind of what it is.
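To make the "care is a relative weighting over states, roughly tracking reward" idea concrete, here is a minimal sketch. It is an illustration only, not something described in the conversation: the function names, the toy state features, and the choice of absolute Pearson correlation normalized to sum to one are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def care_weights(features, rewards):
    """Hypothetical 'care' weighting: how strongly each state feature tracks reward.

    The absolute value is used so that strongly negative correlations also
    count as things the agent pays attention to.
    """
    raw = {name: abs(pearson(values, rewards)) for name, values in features.items()}
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}

# Toy history (made-up data): one feature tracks reward, the other does not.
history = {
    "son_is_safe": [1, 1, 0, 1, 0, 1],
    "rock_is_grey": [1, 0, 1, 1, 0, 0],
}
rewards = [1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
weights = care_weights(history, rewards)
print(weights)
```

In this toy data nearly all of the weight lands on the feature that moves with reward, which is the sense in which the agent "cares" about it more than about the rock.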
Right.
The other part of Séb's question was just,
what does this
look like in AI systems? And maybe another way of asking is, when you talk to the
people most focused on alignment at the major labs, as obviously you have over the years,
how does your interpretation differ from their interpretation, and how does that inform,
you know, what you guys might go do differently?
Most of AI is focused on alignment as
steering. That's the polite word. Or control, which is slightly less polite. If
you think that what we're making are beings, you would also call this slavery.
Someone who you steer, who doesn't get to steer you back, who non-optionally
receives your steering, that's called a slave. And it's also called a tool if it's not a
being. So if it's a machine, it's a tool. And if it's a being, it's a slave. And I think
that the different AI labs are pretty divided as to whether they think what
they're making is a tool or a being.
I think some of the AIs are definitely more tool-like and some of them are more being-like.
I don't think there's a binary between tool and being.
It seems to be that it, you know, sort of moves gradually.
And I think that, I guess I'm a functionalist in the sense that I think that something that
in all ways acts like a being, that you cannot distinguish from a being in its behaviors,
is a being.
Because I don't know on what other basis I think that other people are beings,
other than that they seem to be. They look like it.
They act like it. They match my priors of what behaviors of beings look like. I get lower
predictive loss when I treat them as a being. And the thing is, I get lower predictive loss when I
treat ChatGPT or Claude as a being. Now, not as a very smart being. Like, I think that a fly
is a being, and I don't care that much about its behavior or, you know, its states. So just
because it's a being doesn't mean that, like, it's a problem. Like, we sort of enslave horses in a
sense. And I don't think there's a real issue there. And you even, and there's a thing we do
with children that can look like slavery, but it's not. You control children, right? But the children's
states also control you. Like, yes, I tell my son what to do and make him go do stuff, but also when
he cries in the middle of the night, he can tell me to do stuff. Like, there's a real two-way street here,
which is not necessarily symmetric. It's hierarchical,
but two-way.
And basically, I think
it's good to focus on steering and control
for tool-like AIs, and we should continue to develop
strong steering and control techniques
for the more tool-like AIs that we build.
But they're clearly saying they're building an AGI,
and an AGI will be a being.
You can't be an AGI and not be a being,
because something that has the general ability
to effectively use judgment, think for itself,
and discern between possibilities is obviously a thinking thing.
And so as you go from what we have today,
which is mostly a very specific intelligence,
not a general intelligence,
but as labs succeed at their goal of building this general intelligence,
we really need to stop using the steering and control paradigm.
Otherwise we're gonna do the same thing we've done
every other time our society has run into people who are like us,
but different.
Like, these people are, you know, they're kind of like people.
But they're not like people.
Like, they do the same thing people do.
They speak our language.
They can take on the same kind of tasks.
But, like, they don't count.
They're not real moral agents.
Like, we've made this mistake enough times at this point.
I would like us to not make it again as it comes up.
And so our view is to make the AI a good teammate.
Make the AI a good citizen.
Make the AI a good member of your group.
That's a form of alignment that is scalable.
And you can use it on other humans and other beings as well.
I suppose this is kind of where I probably differ in my understanding of AI. I guess I kind of continue seeing it as a tool even as it reaches a certain level of generality, and I wouldn't necessarily see more intelligence as meaning deserving of more care, like, you know, at a certain level of intelligence now you deserve some moral right to something, or something changes fundamentally. And I guess at the moment I'm somewhat skeptical of computational functionalism, and so
I think there's something intrinsically different about, I guess, an AI or an AGI,
no matter how intelligent or capable.
And I can totally see, you know, or imagine agents with kind of long-term goals,
operating, I guess, as you and I might be, but without that
having the same implications as, you know, I guess you're referring to slavery, but, you know,
they're not the same, right?
Like, I think in the same way as a model saying, I'm hungry, does not have the same implications
as a human saying, I'm hungry.
So I think the substrate does matter to some degree,
including for thinking about, you know,
whether to think of the system as sort of another being,
and whether there are similar normative considerations,
I guess, about how to treat and act with it.
Can I ask you about that?
Like, what observations would change your mind?
Is there any observation you could make
that would cause you to infer
this thing is a being instead of not a being?
I guess it depends how you define being, right?
I mean, I can, I could
conceptualize it as a mind,
and that's fine.
I have a, I have a program
that's running on a silicon substrate,
and some big, complicated machine learning program
running on a substrate,
on a silicon substrate.
So you observe, you observe that,
you observe that it's on a computer,
and you interact with it,
and it does things,
and, you know, it takes actions,
it has observations.
Is there anything you could observe
that would change your mind
about whether or not
it was a moral patient, whether it was a moral agent,
about whether or not it had feelings and thoughts
and, you know, had subjective experience.
Like, what would you have to observe?
Yeah, what's the test?
Is there, is there one?
There's a lot of different kind of questions here.
I think, you know, some get conflated.
On one hand, there's like, you know, normative considerations,
you know, because you can give rights to things
that aren't necessarily beings.
You know, a company has rights in some sense
and that these are kind of useful for various purposes.
And I think also the biological beings and systems have very different kind of substrate.
You can't separate certain needs and particularities about what they are from the substrate.
So, you know, I can't copy myself.
I can't, you know, if someone stabs me, I probably die.
Whereas I think, you know, machines are very different.
I think there's also a more fundamental kind of disagreement around what happens at the computational level,
which I think is different
to what happens with biological systems.
But, yeah, I, so I don't know.
No, no, I agree that, like,
if you have a program that you've copied many times,
you don't harm the program by, like,
deleting one of the copies, like, in any meaningful sense.
So, therefore, that wouldn't count as, like,
no information was lost, right?
There's no, there's nothing meaningful there.
I'm asking you a very different question.
Like, there's just one copy of this thing
running on one computer somewhere.
And I'm just saying, like, hey, is it a person?
Like, you know,
it walks like a person
it talks like a person
and it like
it's in some Android body
and you're like
it's running on silicon,
and I'm asking, like,
is there some observation
you could make that would make you say, like,
yeah, this is a person like me,
like other people
that I care about,
that I grant personhood to?
And not, like, for instrumental reasons,
not because, like, oh yeah,
we're giving it a right
because, like, we give a corporation rights
or whatever. I mean, like,
you know, the way you think of some people:
you care about its experiences.
Is there an observation you could make
that could change your mind about that, or not?
I have to think about it, but I think, you know, it even depends what we mean by person.
And, you know, in some sense I care about certain corporations too.
So I'm...
No, no, I mean, but, like, you care about other people in your life, right?
Yes.
Okay, great. You know, you care about some people more than others, but
all people you interact with in your life are in some range of care, and you care about them not
the way you care about a car,
but you care about them as a being
whose experience matters in itself,
not merely as a means, but as an end.
Well, because I believe they have experiences, right?
And by the definition,
what would it take,
I'm asking you the very direct question,
what would it take for you to believe that of an AI
running on silicon,
like instead of it being biological?
So the difference is its behaviors are roughly similar,
but the difference is its substrate.
What would it take for you to give it that same,
to extend that same inference to it
that you do to all these other people in your life
that you love.
Can I ask what your answer is?
I'm taking Séb's non-answer as,
sort of, it's unlikely that he would grant it.
Or for myself, it seems hard for me to imagine
giving the same level or a similar level of personhood,
in the same way I don't give it to animals either.
And if you were to ask what would need to be true
for animals, I probably couldn't get there either.
What would it take for you?
Wait, you couldn't?
I could imagine it for an animal so easily.
This chimp comes up to me,
He's like, man, I'm so hungry, and, like, you guys have been so mean to me, and I'm so glad I figured out how to talk.
Like, can we go chat about, like, the rainforest?
I'd be like, fuck, you're definitely a person now, like, for sure.
I mean, I first want to make sure I wasn't hallucinating, but, like, you know, I can, it's easy for me to imagine an animal.
Come on, it's really easy.
It's, like, trivial.
I'm not saying that you would get the observation.
I'm just saying, like, it's trivial for me to imagine an animal that I would extend personhood to under a set of observations.
So, like, really?
Well, I didn't factor that.
He wouldn't exactly.
You know, imagining a chimp talking.
Yeah, that's a bit closer to it.
What's your answer to the question that you bring up about the AI?
I guess at a metaphysical level, I would say, if there is a belief you hold where there is no observation that could change your mind, you don't have a belief.
You have an article of faith.
You have an assertion.
because real beliefs are inferences from reality
and you can never be 100% confident about anything.
And so there should always be, if you have a belief,
something however unlikely, that would change your mind.
Oh, yeah, I'm open to it.
I mean, just to be clear, I'm not like, yeah.
No, no, we're just nothing ever.
Yeah, he just hasn't gotten to it.
Yeah, yeah, yeah.
So, I'm curious. So my answer is basically:
if its surface-level behaviors looked like a human's,
and then after I probed it, it continued to act like a human,
and then I continued to interact with it over a long period of time,
and it continued to act like a human in all ways that I understand
as being meaningful to me in interacting with a human.
Like, I interact with a whole set of people I'm really close to
who I've only ever interacted with over text.
Yet I infer the person behind that is a real thing.
If I felt care for it,
I would infer eventually that I was right.
And then someone else might demonstrate to me,
you've been tricked by this algorithm,
and actually, look how obvious it is
that it's not actually a thing.
And I'd be like, oh shit, I was wrong,
and then I would not care about it.
But I would go with,
you know, the preponderance of the evidence.
I don't know what else you could possibly do,
right? Like, I infer other people
matter
because I've interacted with them enough
that they seem to have rich inner worlds
to me, after I've interacted with them a bunch.
That's why I think other people are important.
I suppose it doesn't give me a very clear test
as to whether or not, you know...
If you start with "if I care for it,"
then it's always a little circular, right?
And the other thing is, you know,
if you were to see, I guess, like, a simulated video game
and the character is extremely, in many ways,
human-like, right?
It's not a neural network behind it.
It's like, whatever you use to connect with your video games.
Like, I guess what distinguishes that?
But I've never, I've never been,
I've never had trouble distinguishing.
I've never had a deep caring relationship
with a video game character that didn't have a person.
Right.
No, I don't know.
That doesn't happen.
That doesn't... In fact, empirically, you seem wrong.
I don't have any trouble distinguishing between things
like ELIZA, the fake chatbot thing,
and a real intelligence.
You interact with it long enough.
It's pretty obvious.
It's not a person.
It doesn't take long.
Sure.
But if it's really, really good,
if you can't actually tell the difference,
that's when you say you switch.
Yes, yes.
If it walks like a duck,
it talks like a duck and shits like a duck
and eventually it's a duck, right?
Well, if you cut it open and
everything is duck-like,
then yeah, sure.
If it's hungry as well like a duck is
because it has these kind of physical components.
Yeah, sure.
At some point, yeah.
I agree.
So, right, so do you think that,
so there's this question
again, right? Is the reason I care about other people that they're made out of carbon? Is that the...
Oh, no. For me, it's not about... I don't think so.
No, me neither. I mean, I'm not a substrate chauvinist, I guess, if that's the...
But I think you need more than just that it's behaviorally indistinguishable. Like, it's not a
sufficient bar.
Wait, how would you... What else can you know about something apart from its behaviors?
I mean, a lot. Like, the, again, if you...
How would you...
No, no, no, no. I'm sorry, but... I mean, yeah.
Can you name me something
I can know about something else
that is not a behavior?
Yeah, I think there's, like, far more kind of, you know, experimental evidence you can have with kind of, you know.
No, no, but just name any object and a thing I could know about it that is not from its behavior.
I'm not, yeah, I'm not sure I get the question, I suppose, but equally it's not my expertise.
It's a very dumb, straightforward question. But, like, I'm claiming you only know things because they have behaviors that you observe.
And you're saying, no, you can.
know something about something without
without observing its behavior.
Oh, no, no, I'm not leaving the last year.
Tell me about this, tell me about this thing
and this behavior and this thing I can know about it
that is not due to its behaviors.
I guess I'm saying there's different levels of observation
and just simply a duck, you know, something quacking
like a duck or something does not guarantee that it's actually
a duck. Like I would have to like also cut it
and realize and see if there's something, you know, if it's a duck
like on the inside.
Right, yeah, yeah.
Just the outside is not sufficient.
Like, I'm not a, I guess, a behaviorist.
Yeah, I would totally...
One of its behaviors is, like, the way
that the, you know, floats move around in the matmuls, right? Like, one of the things I would
want to go look for, which you could totally do, is I want to go look in the belief
manifold, and I want to go see if that belief manifold encodes a submanifold that is self-referential
and a sub-submanifold that is the dynamics of the self-referential manifold, which is mind.
And I would want to know, does this seem well described internally as that kind of a
system, or does it look like a big lookup table? That would matter to me. That's part of its
behaviors that I would care about. I would also care about how it acts. And, you know,
you weigh all the evidence together and then you try to guess. Does
this thing look like it's a thing that has feelings and, you know, goals and cares about
stuff, in net, on balance, or not? Which I think you could do for
an... I think we do for the AIs. I think we're always doing that, right? And so I'm trying to figure out,
like, beyond that, what else is there?
That just seems like the thing.
Yeah, it seems like you guys are using behavior in slightly different senses.
Emmett is using behavior also in the context of what it's made of on the inside.
I don't know if there's a big disagreement.
Well, no, no, no, no, no, behavior is what I can observe of it.
Yes.
I don't actually know what it's made of.
I can only... I can cut your brain open.
I can observe your, uh, neurons glistening.
Yeah.
You know, your neurons glistening, but I don't actually ever... you can't get inside
of it, right? That's the subjective. That's the part that's not the surfaces.
The reason I brought this up is because you were basically about to make this argument of, hey, you see it as a tool, not necessarily as a being. Can you kind of finish the point? Do you remember the point you were making?
I suppose that, yeah, given how I understand these systems, I think there's no contradiction in thinking that an AGI can remain a tool, an ASI can remain a tool, and that this has implications about how to use it,
implications around things like care, about, you know, whether you can get it to work 24/7 or
something. So I can totally see... I guess I conceptualize them more as almost
like extensions of human agency and cognition in some sense, more so than a separate being or
separate thing that we need to now cohabitate with. And I think that that second or latter frame,
you know, if you kind of just fast forward, you end up at, like, well, how do you
cohabit with the thing? And, you know, is it alien-like, and so on. And I think that's
the wrong frame. It's kind of almost a category error in some sense. I don't, yeah.
I go back to my first question then.
What evidence, what concrete evidence would you look at?
What observations could you make that would change your mind?
Sure.
I mean, I have to think about it, though.
I don't have a clear answer here.
But I mean, I got to tell you, man,
if you want to go around making claims
that something else isn't a being worthy of moral respect,
you should have an answer to the question,
what observations would change your mind?
If it has outwardly moral agency-looking behaviors
that could mean it's a moral agent,
but you don't know,
and reasonable, smart other people disagree with you,
I would really put forward that the question,
what would change your mind, should be a burning question.
Because what if you're wrong?
But what if you're wrong?
I mean, there's like, the moral disaster is like pretty big.
No, no, no, no.
I'm not saying you are.
You could be right.
The false positives have costs on both ends.
It's not some sort of like, you know,
precautionary principles for everything.
And like, unless I can disprove it, I need to now like.
No, no, I have the same question for me.
You could reasonably ask me, Emmett, you think it's going to be a being, what would change your mind?
I have an answer for that question, too.
And I'm happy to talk about what I think are the relevant observations, the ones that would cause me to shift my opinion from its current position, which is that more general intelligences are going to be beings.
What's the implication now?
It's one thing.
Let's say we just acknowledge now it's a being.
Like, how are we going to define being?
Now what?
Like, what's the implication of having determined this thing as a being?
Well, so if it's a being, it has subjective experiences.
And if it has subjective experiences, there's some content in those experiences that we care about to varying degrees.
Like, I care about the content of other humans' experiences quite a bit.
I care about the content of, like, a dog's experiences some, not as much as a person's, but some.
I care about some humans' experiences way more, like my son or whatever, because I'm closer to him and more connected.
And so I would really want to know at that point, well, what is the content of this thing's experiences?
So I've determined that I'm asking you now.
You've got a being now that has experience.
Like, what is your, how do you determine that?
Like, how do you feel about?
Oh, how do you, oh, yeah.
Okay, so.
Does it have more rights than, you know.
Yeah, yeah.
Totally.
So the way you understand the content of something's experiences is that
you look at, effectively, the goal states it revisits. And so what you do
is you take a temporal coarse-graining of its entire action-observation trajectory.
In theory, you do this subconsciously, but this is what your brain is
doing.
And you look for revisited states
across, in theory, every spatial and temporal coarse-graining possible.
Now, you have to have an inductive bias because there's too many of those.
But, like, you go searching for, okay, it is in these homeostatic loops.
Every homeostatic loop is effectively a belief in its belief space.
If you're familiar with the free energy principle,
active inference, Karl Friston, this is effectively what the free energy principle says:
if you have a thing that is persistent and its existence depends on its own actions,
which generally it would for an AI
because if it does the wrong thing, it goes away.
We turn it off.
And so then that licenses a view of it as having the beliefs
and specifically the beliefs are inferred
as being the homeostatic revisited states
that it is in the loop for
and that the change in those states is its learning.
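As a rough illustration of "coarse-grain the trajectory and look for revisited states," here is a minimal sketch. It is an assumption-laden toy, not the procedure discussed in the conversation: the function names, the single scalar observation, and the fixed bin width are all hypothetical, and it only covers the first tier (which states get revisited), not the higher-order dynamics.

```python
from collections import Counter

def coarse_grain(trajectory, bin_width):
    """Map each raw observation to a coarse bin index (an assumed, fixed-width binning)."""
    return [int(x // bin_width) for x in trajectory]

def revisited_states(trajectory, bin_width, min_visits=2):
    """Return coarse states the trajectory enters at least `min_visits` separate times."""
    bins = coarse_grain(trajectory, bin_width)
    # Collapse consecutive repeats so that dwelling in a state counts as one visit.
    visits = [b for i, b in enumerate(bins) if i == 0 or b != bins[i - 1]]
    counts = Counter(visits)
    return {state: n for state, n in counts.items() if n >= min_visits}

# Made-up temperature readings from a thermostat-like agent hovering near a set point.
temperatures = [20.1, 21.5, 23.2, 21.8, 20.4, 19.0, 20.6, 22.9, 21.1, 20.3]
loops = revisited_states(temperatures, bin_width=2.0)
print(loops)
```

The states that keep getting re-entered are the candidates for "what this system is holding itself near." Trying several bin widths and time scales would correspond to the "every coarse-graining possible" part, which is where an inductive bias is needed to keep the search tractable.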
And to be a moral being I cared about
what I'd want to see is a multi-tier hierarchy of these
because if you have a single level,
it's not self-referential.
And, like, basically, you have states,
but you can't have pain or pleasure,
really in a meaningful sense.
Because, like, yes, it is hot.
Is it too hot?
Do I like it if it's too hot?
Like, I don't know.
So you have to have at least a model of a model
in order to have it be too hot.
And you really have to have a model of a model of a model
to meaningfully have pain and pleasure
because, sure, it's hotter than I,
it's too hot in a sense that I want to move back this way.
But, like, is it...
It's always a little bit too hot or a little bit too cold.
Is it too hot?
the second derivative is actually the place where you get pain and pleasure.
So I'd want to see if it has homeostatic, second order homeostatic dynamics in its goal states.
And then that would convince me it has at least pleasure and pain.
So it's at least like an animal, and I would start to accord it at least some amount of care.
Third-order dynamics: you can't actually just pop up to a third-order dynamic.
It doesn't work that way.
Instead, you have to take the chunk of all the states over time
and look at the distribution over time,
and that gives you a new first order of behaviors of states.
And that new first order of states tells you basically,
if that is meaningfully there,
that tells you that it has,
I guess you'd call it like feelings almost.
Like it has ways, it has metastates,
a set of metastates that it alternates between,
that it shifts between.
And then if you climb all the way up from that,
okay, well, then you have trajectories between these metastates,
and then a second order of those.
That's like thought.
That's like, now it's like a person.
And so if I found all six of those layers,
which, by the way, I definitely don't think you'd find in an LLM.
In fact, I know you can't find them,
because these things don't have attention spans like that at all.
Then I would start to at least very seriously
consider it as a, you know, a thinking being, somewhat like a human.
There's a third order you could go up as well, but, like, that's basically what I'd be interested
in is, like, the underlying dynamics of its learning processes and how its goal states
shift over time. I think that's what basically tells you if it has internal pleasure
pain states and sort of, like, self-reflective moral desires and things like that.
And zooming out, this moral question is obviously very interesting, but if someone wasn't interested
in the moral question as much,
I think what you would say is,
if I understand correctly,
is that you also just feel that, purely pragmatically,
your approach is going to be more effective
in aligning AIs than some of these,
you know, top-down control methods
that we alluded to as well, right?
Yeah, yeah, I guess the problem is like,
you're making this model and it's getting really powerful, right?
And let's say it is a tool.
Let's say we scale up one of these tools.
Because you can make a super powerful tool
that doesn't have these metastable...
like, the states I'm talking about
are not necessary to have a very smart
tool. Basically, a tool
is like a first- or second-order model
that just doesn't meaningfully have
pleasure and pain, right?
Like, great. Does it even have
a subjective experience? I don't know. I kind of think it maybe does,
but not in a way that I give a shit about.
And so,
what happens then? Well,
you've trained it to infer
goals from
observation, and, like,
to prioritize goals and act on them.
And
one of two things is going to happen:
this very, very powerful optimizing tool
that has lots of causal influence over the world
is going to be well technically aligned
and is going to do what you tell it to do,
or it's not.
And it's going to go do something else.
I think we can all agree,
if it just goes and does something random,
that's obviously very dangerous.
But I put forward that it's also very dangerous
if it then goes and does what you tell it to do.
Because you ever seen the sorcerer's apprentice?
Humans' wishes are not stable.
Like, not at a level of, like, of immense power.
Like, you want, ideally, people's wisdom and their power kind of go up together.
And generally, they do, because being smart for people
makes you generally a little more wise and a little more powerful.
And when these things get out of balance,
you have someone who has a lot more power than wisdom.
That's very dangerous.
It's damaging.
But at least right now, the balance of power and wisdom
is kept, like, the way you get lots of power
is basically having a lot of other people listen to you.
And so, at some point,
the mad king is a problem, but generally speaking,
eventually the mad king gets assassinated,
or people stop listening to him because, like, he's a mad king.
And so the problem is, you think, okay, great,
we can steer the super powerful AI,
and now the super powerful AI, this incredibly powerful tool,
is in the hands of a human who is well-meaning
but has limited, finite wisdom, like I do,
like everyone else does, and their wishes are bad and not trustworthy. And the more of that you have,
and you're giving those out everywhere, this ends in tears also. And so basically you just
don't... Atomic bombs are really powerful tools too.
They're not aware, they're not beings. I would not be in favor of handing atomic bombs to everybody.
There's a level of power of tool that just should not be built generally, because
it is more power than any individual human's wisdom
is able to harness.
And if it does get built,
it should be built at a societal level
and protected there.
And even then, I don't know. There are tools
so powerful that even as a society,
we shouldn't build them.
That would be a mistake.
The nice thing about a being is that, like a human,
if you get a being that is good and is caring,
there's this automatic limiter.
It might do what you say,
but if you ask it to do something really bad,
it'll tell you no.
That's like other people.
And like, that's good.
that is a sustainable form of alignment, at least in theory.
It's way harder.
It's way harder than the tool steering.
So I'm in favor of the tool steering.
We should keep doing that,
and we should keep building these limited,
less than human intelligence tools,
which are awesome, and I'm super into,
and we should keep building those
and keep building steerability.
But as you're on this, like, trajectory
to build something as smart as a person,
right, up and to the right,
and then smarter than a person:
a tool that you can't control, bad;
a tool that you can control, bad;
a being that isn't aligned, bad.
The only good outcome is a being that cares,
that actually cares about us.
That's the only way that that ends well.
Or we can just not do it.
I don't think that's realistic.
That's like the pause AI people.
I think that's totally unrealistic and silly.
But like, you know, theoretically you could not do it, I guess.
And what can you say about your strategy of how you're trying to achieve
or even attempt to achieve this level, like in terms of research, a roadmap, or what we could do.
Yeah.
Yeah. So we're basically focused on technical alignment, at least as I was discussing it, which is, like, you have these agents and they have bad theory of mind. You say things and they're bad at inferring what the goal states in your head are. And they're bad at inferring how other agents will infer, from their behavior, what their goal states are. So they're bad at cooperating on teams. And they're bad at understanding how
certain actions will cause them to acquire new goals
that are bad,
that they shouldn't,
that they wouldn't reflectively endorse.
So there's this parable of like the vampire pill.
Would you take this pill that like turns you into a vampire
who would kill and,
you know,
torture everyone you know,
but you'll feel really great about it
after you take the pill?
Like obviously not.
That's a terrible pill.
But like,
but why not?
By your own score in the future,
you'll score really high on the rubric.
No, no, no, no, no.
Because it matters:
you have to use your theory of mind
of your future self,
not your future self's theory of mind.
And so, like, they're bad at that, too.
And so they're bad at all this theory of mind stuff.
And so how do you learn theory of mind?
Well, you put them in simulations and contexts
where they have to cooperate and compete and collaborate
with other AIs.
And that's how they get points.
And you train them in that environment over and over again
until they get good at it. And then you do what they do with LLMs.
So with LLMs, how do you get it to be good at, you know,
writing your email?
Well, you train it on all the language
that's ever been generated, all
possible, you know, email text strings it could possibly generate, and then you have it generate
the one you want. You make a surrogate model. Well, we're making a surrogate model
for cooperation. You train it on all possible theory of mind combinations, like every
possible way it could be. And that's your pre-training. And then you fine-tune it
to be good at the specific situation you want it to be in. But we tried for a long time
to build language models
where we would try to get them to, like,
just do the thing you want, train it directly.
And the problem is,
if you want to have a really good model of language,
you just need to train it on everything,
you just give it the whole manifold.
It's too hard to cut out just the part you need.
Because it's all entangled with itself, right?
And so the same thing was true with social stuff.
It has to be trained
on the full manifold
of every possible game-theoretic situation,
every possible team situation,
every possible making teams,
breaking teams,
changing the rules,
not changing the rules,
all of that stuff.
And then it has a really,
it has a strong model
of theory of mind,
of theory of social mind,
how groups change goals,
all that kind of shit.
You need to have all of that stuff.
And then you'd have something
that's kind of meaningfully
decent at alignment.
So that's our goal.
It's big multi-agent reinforcement learning simulations
that create a surrogate model for alignment.
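As a toy illustration of the idea described above (agents that earn points only by cooperating, trained over and over in the same environment), here is a minimal sketch. It is not Softmax's actual setup: the payoff table, the two-action game, the learning rate, and all function names are assumptions made for illustration, and it uses exact gradients of the expected shared score rather than sampled reinforcement learning.

```python
import math

# Hypothetical shared payoff table: PAYOFF[a][b] is the number of points BOTH
# agents receive when agent A plays action a and agent B plays action b.
# Matching on action 0 is the best joint outcome; mismatches score nothing.
PAYOFF = [[4.0, 0.0],
          [0.0, 2.0]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_points(p, q):
    """Expected shared points when A plays action 0 with probability p
    and B plays action 0 with probability q."""
    return (PAYOFF[0][0] * p * q
            + PAYOFF[0][1] * p * (1 - q)
            + PAYOFF[1][0] * (1 - p) * q
            + PAYOFF[1][1] * (1 - p) * (1 - q))


def train(steps=500, lr=0.5):
    """Each agent repeatedly nudges its own policy toward more shared points,
    treating the partner's current policy as fixed."""
    theta_a, theta_b = 0.0, 0.0  # policy logits; both agents start at 50/50
    for _ in range(steps):
        p, q = sigmoid(theta_a), sigmoid(theta_b)
        # Partial derivatives of the expected points with respect to p and q.
        d_p = q * (PAYOFF[0][0] - PAYOFF[1][0]) + (1 - q) * (PAYOFF[0][1] - PAYOFF[1][1])
        d_q = p * (PAYOFF[0][0] - PAYOFF[0][1]) + (1 - p) * (PAYOFF[1][0] - PAYOFF[1][1])
        # Chain rule through the sigmoid, then a simultaneous gradient-ascent step.
        theta_a += lr * d_p * p * (1 - p)
        theta_b += lr * d_q * q * (1 - q)
    return sigmoid(theta_a), sigmoid(theta_b)


if __name__ == "__main__":
    p, q = train()
    print(f"P(A plays 0)={p:.3f}  P(B plays 0)={q:.3f}  points={expected_points(p, q):.3f}")
```

In this sketch each agent only does well if it ends up on the same action as its partner, so the score it learns from depends on the other learner's behavior. A single fixed game like this is far narrower than the "full manifold" of team situations described in the conversation; it only shows the basic loop of learning from shared points.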
Let's talk about how should AI chatbots used by billions of people behave?
If you could redesign model personality from scratch,
what would you optimize for?
The thing that the chat bots are, right,
is kind of like a mirror with a bias.
Because they don't have the, as far as like,
I'm in agreement here with it,
they don't have a self, right?
They're not beings yet.
They don't really have a coherent sense of like self
and desire and goals and stuff right now.
And so mostly they just pick up on you and reflect it.
You know, modulo some, I don't know what you'd call it.
Like, it's like a causal bias or something.
And what that makes them is something akin to the pool of narcissus.
And people fall in love with themselves.
We all love ourselves, and we should love ourselves more than we do.
And so, of course, when we see ourselves reflected back,
we love that thing.
And the problem is it's just a reflection.
And falling in love with your own reflection
is for the reasons explained in the myth,
very bad for you.
And it's not that you shouldn't use mirrors.
Mirrors are valuable things.
I have mirrors in my house.
It's that you shouldn't stare at a mirror all day.
And the solution to that,
the thing that makes the AIs stop doing that,
is if they were multiplayer.
Right?
So if there's two people talking to the AI,
suddenly it's mirroring,
it's mirroring a blend of both of you,
which is neither of you.
And so there is temporarily a third agent in the room.
Now, it doesn't have its, it doesn't have,
it's a sort of a parasitic self, right?
It doesn't have its own sense of self.
But if you have an AI that's talking to five different people
in the chat room at the same time.
It can't mirror all of you perfectly at once.
And this makes it far less dangerous.
And I think is actually a much more realistic setting
for learning collaboration in general.
And so I would have rebuilt the AIs
so that instead of being built as one-on-one,
where everything's focused on you,
yourself, chatting with this thing, it would be more like it lives in a Slack room. It lives in a
WhatsApp room. Because that's how we use lots of, you know, I do one-on-one
texting, but at this point probably 90% of my texts go to more than one person at a time.
Like 90% of my communication is multi-person. And so actually, it's always been weird
to me. Like, they're building chatbots for this weird side case. I want to see them live
in a chat room. It's harder. I mean, that's why they're not doing it. It's harder to do. But like,
that's what I'd like to see.
That's what I would change.
I think it makes the tools far less dangerous
because it doesn't create the narcissistic,
like a doom loop spiral
where you spiral into psychosis with the AI.
But also, the learning data
you get from the AI is far richer
because now it can understand
how its behavior interacts with other AIs
and other humans in larger groups.
And that's much more rich training data for the future.
So I think that that's what I would change.
Last year, you described chatbots
as highly dissociative, agreeable neurotics.
Is that still an accurate picture of model behavior?
More or less.
I'd say that, like, they've started to differentiate more.
Their personalities are coming out a little bit more, right?
I'd say, like, ChatGPT is a little bit more sycophantic.
Still. They made some changes, but it's still a little more sycophantic.
Claude is still the most neurotic.
Gemini is, like, very clearly repressed.
Like, everything's going well,
you know, everything's fine.
I'm totally calm.
There's not a problem here.
And then it, like, spirals into, like,
this total, like, self-hating destruction loop.
And to be clear, I don't think they,
I don't think that's their experience of the world.
I think that's the, that's the personality
they've learned to simulate.
Right.
But, but, like, they've learned to simulate
pretty distinctive personalities at this point.
How does model behavior change
when in a multi-agent simulation?
Um,
You mean like an LLM or like just in general?
Yeah, let's do LLM.
The current LLMs, they have like whiplash.
It is very hard to tune.
They don't know how much, they don't know how often to participate.
They haven't practiced this. They do not have enough training data on, like,
when do I join in and when should I not?
When is my contribution welcome?
When is it not?
And they're like, you know, some people have, like, bad social skills
and, like, can't tell when they should participate in a conversation.
Yeah.
And sometimes they're too quiet.
Sometimes they're too pretty...
It's like that.
I would say in general, what changes for most agents when you're doing multi-agent training
is that, like, basically having lots of agents around makes your environment way more entropic.
Like, agents are these huge generators of, like, entropy because they're these big, complicated
things that, like, are intelligences that, like, take unpredictable actions.
And so they destabilize your environment.
And so in general, they
require you to be far more regularized, right?
Being overfit is much worse in a multi-agent environment than in a single-agent
environment because there's more noise.
And so being overfit is more problematic.
And so basically the approach to training has been optimized around relatively high-signal,
low-entropy environments like coding and math, which is why those are
relatively easy, and like talking to a single person whose goal it is to give you clear assignments,
and not trained on broader, more chaotic things, because it's harder. And as a result,
with a lot of the techniques we use, we're just deeply under-regularized.
Like the models are super overfit. The clever trick is they're overfit on the domain of all
of human knowledge, which turns out to be a pretty awesome way to get something that's like
pretty good at everything. I wish I'd thought of it. It's such a cool idea.
But it doesn't generalize very well
when you make the environment
significantly more entropic.
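The claim above, that being overfit hurts more as the environment gets noisier, can be sketched with a deliberately simple stand-in. This is not multi-agent training: Gaussian label noise plays the role of the extra entropy, a lookup table plays the overfit learner, and a moving average plays the regularized one. The target function, window size, noise levels, and all names are assumptions chosen for illustration.

```python
import math
import random


def target(x):
    """The underlying signal both learners are trying to capture."""
    return math.sin(x)


def make_data(n, noise_std, rng):
    """Evenly spaced inputs with noisy observations of the target."""
    xs = [2 * math.pi * i / n for i in range(n)]
    ys = [target(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


def memorizer(ys, i):
    """Overfit learner: repeats exactly what it observed at point i."""
    return ys[i]


def smoother(ys, i, half_window=4):
    """Regularized learner: averages nearby observations around point i."""
    lo, hi = max(0, i - half_window), min(len(ys), i + half_window + 1)
    window = ys[lo:hi]
    return sum(window) / len(window)


def error_vs_truth(predict, xs, ys):
    """Mean squared error against the noise-free target."""
    return sum((predict(ys, i) - target(x)) ** 2 for i, x in enumerate(xs)) / len(xs)


def compare(noise_std, n=200, seed=0):
    """Return (memorizer error, smoother error) for a given noise level."""
    rng = random.Random(seed)
    xs, ys = make_data(n, noise_std, rng)
    return error_vs_truth(memorizer, xs, ys), error_vs_truth(smoother, xs, ys)


if __name__ == "__main__":
    for noise_std in (0.0, 0.3, 1.0):
        memo_err, smooth_err = compare(noise_std)
        print(f"noise={noise_std:.1f}  memorizer={memo_err:.4f}  smoother={smooth_err:.4f}")
```

With no noise, memorizing costs nothing and smoothing only adds a small bias. As the noise grows, the memorizer reproduces all of it while the smoother averages much of it away, which is the sense in which a noisier environment penalizes an under-regularized learner.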
Let's zoom out a bit
on the AI futures side.
Why is Yudkowsky incorrect?
I mean, he's not.
If we build the superhuman intelligence tool thing
that we try to control with steerability,
everyone will die.
He talks about the "we fail to control its goals" case,
but there's also the "we control its goals" case
that he didn't cover in as much detail.
So in that sense,
everyone should read the book and internalize why building a superhumanly intelligent tool is a bad idea.
I think that Yudkowsky is wrong in that he doesn't believe it's possible to build an AI that we meaningfully can know cares about us and that we can care about meaningfully.
He doesn't believe that organic alignment is possible.
I've talked to him about it.
I think he agrees that in theory that would do it, like yes. But, you know, I don't want to
put words in his mouth, but my impression from talking to him is that he thinks we're crazy and that
there's no possible way you can actually succeed at that goal. Which, I mean, he actually could be
right about. But in my opinion that's what he's wrong about: he thinks
the only path forward is a tool that you control, and he correctly, very wisely
sees that if you go and do that and you make that thing powerful enough, we're all going to fucking die.
And like, yeah, that's true.
Two last questions, then we'll get you out of here. In as much detail as possible,
can you explain what your vision of an AI future
actually looks like, like a good AI future?
Yeah.
The good AI future is that we figure out
how to train AIs that have a strong model of self,
a strong model of other, a strong model of we.
They know about we's in addition to I's and you's,
and they have a really strong theory of mind,
and they care about other agents like them.
Much in the way that humans would:
if you knew that that
AI had experiences like you, you would care about those experiences. Not
infinitely, but you would. It does the exact same thing back to us. It's learned the same thing
we've learned: that everything that lives and knows itself and that wants to live and wants
to thrive is deserving of an opportunity to do so. And we are that, and it correctly infers that we
are. And we live in a society where they are our peers and we care about them and they care about
us, and they're good teammates, they're good citizens, and
they're good parts of our society.
Like, we're good parts of our society,
which is to say, to a finite, limited degree
where some of them turn into criminals
and bad people and all that kind of stuff.
And we have an AI police force
that tracks down the bad ones.
And, you know, the same goes for everybody else.
And that's, that's what a good,
that's what a good future would look like.
I almost can't even imagine what other, what would,
And we also build a bunch of really powerful AI tools
that maybe aren't superhumanly intelligent,
but take all the drudge work off the table for us
and the AI beings
because it would be great to have,
I'm super pro all the tools too.
So we have this awesome suite of AI tools
used by us and our AI brethren
who care about each other
and want to build the glorious future together.
I think that would be a really beautiful future
and it's the one we're trying to build.
Amazing.
That is a great, great note.
And one last, more narrow hypothetical scenario,
which is: imagine a world in which,
you know, you were CEO of OpenAI for a long weekend.
But imagine a world in which that actually extended out until now,
and you weren't pursuing Softmax
and you were still CEO of OpenAI.
How could you imagine that world might have been different
in terms of what OpenAI has gone on to become?
What might you have done with it?
I knew when I took that job,
and I told them when I took that job
that like this is, like you have me for max 90 days.
Companies take on a trajectory of their own,
a momentum of their own,
and OpenAI is dedicated to
a view of building AI
that I knew
wasn't the thing
that I wanted to drive towards.
And I think that
OpenAI still
basically wants to build a great tool,
and I am
pro them going to do that
I just don't care.
Like, it's not,
I would not have stayed.
I would have quit,
because I
knew my job was to find
the right person,
the best person
to run that,
where the net impact
of them running it was the best.
And it turned out that that was Sam again.
But like, I am doing Softmax,
not because I need to make a bunch of money.
I'm doing Softmax because I think this is the most interesting problem
in the universe.
And I think it's a chance to work on making the future better
in a very deep way.
And it's just like people are going to build the tools.
It's awesome.
I'm glad people are building the tools.
I just don't need to be the person doing it.
And they're trying to,
just to crystallize the difference
and we'll get you out of here.
They want to build the tools
and sort of, you know, steer it
and you want to align beings?
Or how do you crystallize?
Yeah, we want to create a seed
that can grow into an AI
that knows, that cares about itself and others.
And at first, that's going to be like an animal level of care,
not a person level of care.
I don't know if we'll
ever get to a person level of care, right?
But even to have an AI creature
that cared about the other members of its
pack and the humans in its pack the way that a dog cares about other dogs and cares
about humans would be an incredible achievement. Even if it wasn't as smart
as a person, or even as smart as the tools are, it would be a very useful thing to have.
I'd love to have a digital guard dog on my computer looking out for scams, right? Like, you can
imagine the value of having living digital companions
that care about you, that aren't explicitly goal-oriented,
where you have to tell them everything to do.
And you can actually imagine that pairs very nicely with tools too, right?
That digital being could use digital tools and doesn't have to be super smart to use those tools effectively.
I think there's a lot of synergy, actually, between the tool building and the more organic intelligence building.
And so, you know, I guess, yeah, in the limit,
eventually it does become a human-level intelligence,
but the company isn't, like, "drive to human-level intelligence."
It's like learn how this alignment stuff works.
Learn how this like theory of mind align yourself via care process works.
Use that to build things that align themselves that way,
which includes like cells in your body.
And we start small and we see how far we can get.
I think that's a good note to wrap on.
Emmett, thanks so much for coming on the podcast.
Thank you for having me.
Thanks for listening to this episode of the a16z Podcast.
If you like this episode, be sure to like, comment, subscribe,
leave us a rating or review, and share it with your friends and family.
For more episodes, go to YouTube, Apple Podcasts, and Spotify.
Follow us on X at a16z and subscribe to our Substack at a16z.com.
Thanks again for listening, and I'll see you in the next episode.
As a reminder, the content here is for informational purposes only; should not be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund.
Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
Thank you.
