Lex Fridman Podcast - #452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity
Episode Date: November 11, 2024

Dario Amodei is the CEO of Anthropic, the company that created Claude. Amanda Askell is an AI researcher working on Claude's character and personality. Chris Olah is an AI researcher working on mechanistic interpretability. Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep452-sc See below for timestamps, transcript, and to give feedback, submit questions, contact Lex, etc.

Transcript: https://lexfridman.com/dario-amodei-transcript

CONTACT LEX:
Feedback - give feedback to Lex: https://lexfridman.com/survey
AMA - submit questions, videos or call-in: https://lexfridman.com/ama
Hiring - join our team: https://lexfridman.com/hiring
Other - other ways to get in touch: https://lexfridman.com/contact

EPISODE LINKS:
Claude: https://claude.ai
Anthropic's X: https://x.com/AnthropicAI
Anthropic's Website: https://anthropic.com
Dario's X: https://x.com/DarioAmodei
Dario's Website: https://darioamodei.com
Machines of Loving Grace (Essay): https://darioamodei.com/machines-of-loving-grace
Chris's X: https://x.com/ch402
Chris's Blog: https://colah.github.io
Amanda's X: https://x.com/AmandaAskell
Amanda's Website: https://askell.io

SPONSORS:
To support this podcast, check out our sponsors & get discounts:
Encord: AI tooling for annotation & data management. Go to https://encord.com/lex
Notion: Note-taking and team collaboration. Go to https://notion.com/lex
Shopify: Sell stuff online. Go to https://shopify.com/lex
BetterHelp: Online therapy and counseling. Go to https://betterhelp.com/lex
LMNT: Zero-sugar electrolyte drink mix. Go to https://drinkLMNT.com/lex

OUTLINE:
(00:00) - Introduction
(10:19) - Scaling laws
(19:25) - Limits of LLM scaling
(27:51) - Competition with OpenAI, Google, xAI, Meta
(33:14) - Claude
(36:50) - Opus 3.5
(41:36) - Sonnet 3.5
(44:56) - Claude 4.0
(49:07) - Criticism of Claude
(1:01:54) - AI Safety Levels
(1:12:42) - ASL-3 and ASL-4
(1:16:46) - Computer use
(1:26:41) - Government regulation of AI
(1:45:30) - Hiring a great team
(1:54:19) - Post-training
(1:59:45) - Constitutional AI
(2:05:11) - Machines of Loving Grace
(2:24:17) - AGI timeline
(2:36:52) - Programming
(2:43:52) - Meaning of life
(2:49:58) - Amanda Askell - Philosophy
(2:52:26) - Programming advice for non-technical people
(2:56:15) - Talking to Claude
(3:12:47) - Prompt engineering
(3:21:21) - Post-training
(3:26:00) - Constitutional AI
(3:30:53) - System prompts
(3:37:00) - Is Claude getting dumber?
(3:49:02) - Character training
(3:50:01) - Nature of truth
(3:54:38) - Optimal rate of failure
(4:01:49) - AI consciousness
(4:16:20) - AGI
(4:24:58) - Chris Olah - Mechanistic Interpretability
(4:29:49) - Features, Circuits, Universality
(4:47:23) - Superposition
(4:58:22) - Monosemanticity
(5:05:14) - Scaling Monosemanticity
(5:14:02) - Macroscopic behavior of neural networks
(5:18:56) - Beauty of neural networks
Transcript
The following is a conversation with Dario Amodei, CEO of Anthropic, the company that
created Claude, which is currently and often at the top of most LLM benchmark leaderboards.
On top of that, Dario and the Anthropic team have been outspoken advocates for taking the topic of
AI safety very seriously, and they have continued to publish a lot of fascinating AI research on this and other
topics.
I'm also joined afterwards by two other brilliant people from Anthropic.
First, Amanda Askell, who is a researcher working on alignment and fine-tuning of Claude,
including the design of Claude's character and personality. A few folks told me she has probably talked with Claude
more than any human at Anthropic.
So she was definitely a fascinating person to talk to
about prompt engineering and practical advice
on how to get the best out of Claude.
After that, Chris Olah stopped by for a chat.
He's one of the pioneers of the field of
mechanistic interpretability, which is an exciting set of efforts that aims to
reverse engineer neural networks to figure out what's going on inside,
inferring behaviors from neural activation patterns inside the network.
This is a very promising approach for keeping future super intelligent AI systems safe.
For example, by detecting from the activations when the model is trying to deceive the human it is talking to.
And now a quick few second mention of each sponsor.
Check them out in the description. It's the best way to support this podcast. We got Encord for machine learning,
Notion for machine learning powered note taking
and team collaboration,
Shopify for selling stuff online,
BetterHelp for your mind and Element for your health.
Choose wisely, my friends.
Also, if you want to work with our amazing team,
or just want to get in touch with me for whatever reason,
go to lexfridman.com slash contact.
And now onto the full ad reads.
I try to make these interesting, but if you skip them, please still check on our sponsors.
I enjoy their stuff.
Maybe you will too.
This episode is brought to you by Encord, a platform that provides data-focused AI
tooling for data annotation, curation and management, and for model evaluation.
We talk a little bit about public benchmarks
in this podcast.
I think mostly focused on software engineering,
SWE-bench.
There are a lot of exciting developments
around how you can have a benchmark that you can't cheat on.
But if it's not public, then you can use it the right way,
which is to evaluate how well the annotation,
the data curation, the training, the pre-training,
the post-training, all of that, is working.
Anyway, a lot of the fascinating conversation
with the Anthropic folks was focused on the language side.
And there's a lot of really incredible work
that Encord is doing about annotating
and organizing visual data.
And they make it accessible for searching, for visualizing, for granular curation,
all that kind of stuff. So I'm a big fan of data. It continues to be the most important thing.
The nature of data and what it means to be good data, whether it's human generated or synthetic data keeps changing,
but it continues to be the most important component of what makes for a generally
intelligent system, I think, and also for specialized intelligent systems as well.
Go try out Encord to curate, annotate, and manage your AI data at encord.com slash Lex.
That's encord.com slash Lex. This episode is brought to you by the thing that
keeps getting better and better and better, Notion. It used to be an awesome
note-taking tool. Then it started being a great team collaboration tool, so note-taking
for many people and management of all kinds of other project stuff across large teams.
Now, more and more, it is becoming an AI-superpowered
note-taking and team collaboration tool, really integrating AI probably better than
any note-taking tool I've used, not even close. Honestly, Notion is truly incredible.
I haven't gotten a chance to use Notion on a large team.
I imagine that that's really when it begins to shine, but on a small team it's just really, really amazing:
the integration of the AI assistant inside a particular file for summarization, for generation, all that kind of stuff, but also
the integration of an AI assistant to be able to ask questions, you know, across docs, across wikis, across
projects, across multiple files, to be able to summarize everything, maybe
investigate project progress based on all the different stuff going on in
different files. So really, really nice integration of AI. Try Notion AI for free
when you go to notion.com slash Lex. That's all lowercase notion.com slash Lex
to try the power of Notion AI today. This episode is also brought to you by
Shopify, a platform designed for anyone to sell anywhere with a great-looking
online store. I keep wanting to mention Shopify's CEO, Toby, who's brilliant.
And I'm not sure why he hasn't been on the podcast yet.
I need to figure that out.
Every time I'm in San Francisco, I want to talk to him.
So he's brilliant on all kinds of domains, not just entrepreneurship or tech,
just philosophy and life, just his way of being.
Plus an accent adds to the flavor profile of the conversation.
I've been watching a cooking show for a little bit.
Really, I think my first cooking show, it's called Class Wars.
It's a South Korean show where chefs with Michelin stars compete against chefs without Michelin stars.
And there's something about one of the judges that just, just the charisma and the way that he describes
every single detail of flavor, of texture,
of what makes for a good dish.
Yeah, so it's contagious.
I don't really even care.
I'm not a foodie.
I don't care about food in that way,
but he makes me want to care.
So anyway, that's why I use the term flavor profile,
referring to Toby,
which has nothing to do with what I should probably be saying. And that is
that you should use Shopify. I've used Shopify, it's super easy to create a store,
lexfridman.com slash store, to sell a few shirts. Anyway, sign up for a
$1 per month trial period at Shopify dot com slash Lex. That's all lowercase
go to Shopify dot com slash Lex to take your business
to the next level today.
This episode is also brought to you by BetterHelp,
spelled H-E-L-P, help.
They figure out what you need and match you
with a licensed therapist in under 48 hours.
It's for individuals, it's for couples,
it's easy, discreet, affordable, available worldwide.
I saw a few books by Jungian psychologists
and I was like in a delirious state of sleepiness
and I forgot to write his name down,
but I need to do some research.
I need to go back, I need to go back to my younger self
when I dreamed of being a psychiatrist
and reading Sigmund Freud and
reading Carl Jung, reading it the way young kids maybe read comic books. They were my
superheroes of sorts. Camus as well, Kafka, Nietzsche, Hesse, Dostoevsky, the sort of
19th and 20th century literary philosophers of sorts.
Anyway, I need to go back to that.
Maybe have a few conversations about Freud.
Anyway, those folks, even if in part wrong, were true revolutionaries, truly brave
to explore the mind in the way they did.
They showed the power of talking and delving deep into the human mind,
into the shadow, through the use of words. So, highly recommend. And BetterHelp is a super easy
way to start. Check them out at betterhelp.com slash Lex and save on your first month. That's
betterhelp.com slash Lex. This episode is also brought to you by Element, my daily zero sugar and delicious electrolyte
mix that I'm going to take a sip of now.
It's been so long that I've been drinking Element that I don't even remember life before
Element.
I guess I used to take salt pills because it's such a big component of my exercise routine
to make sure I get enough water and get enough electrolytes. Yeah. So combined with fasting that I've explored a lot and continue to do to this day and combined
with low carb diets, that I'm a little bit off the wagon on that one.
I'm consuming probably like 60, 70, 80, maybe 100 some days,
grams of carbohydrates, not good, not good.
My happiest is when I'm below 20 grams
or 10 grams of carbohydrates.
I'm not like measuring it out,
I'm just using numbers to sound smart.
I don't take dieting that seriously,
but I do take the signals that my body sends
quite seriously, so without question, making sure I get enough
magnesium and sodium and get enough water is priceless. A lot of times when I had headaches
or just felt off or whatever, they were fixed near immediately, and sometimes after 30 minutes,
by just drinking water with the electrolytes. It is beautiful and it is delicious.
Watermelon salt, the greatest flavor of all time.
Get a sample pack for free with any purchase.
Try it at drinkLMNT.com slash Lex.
This is the Lex Fridman Podcast.
To support it, please check out our sponsors in the description.
And now, dear friends, here's Dario Amodei. Let's start with a big idea of scaling laws
and the scaling hypothesis.
What is it?
What is its history?
And where do we stand today?
So I can only describe it as it relates
to kind of my own experience,
but I've been in the AI field for about 10 years.
And it was something I noticed very early on.
So I first joined the AI world when I was working
at Baidu with Andrew Ng in late 2014,
which is almost exactly 10 years ago now.
And the first thing we worked on
was speech recognition systems.
And in those days, I think deep learning was a new thing.
It had made lots of progress,
but everyone was always saying,
we don't have the algorithms we need to succeed.
You know, we're only matching a tiny, tiny fraction.
There's so much we need to kind of discover algorithmically.
We haven't found the picture of how to match the human brain.
And when, you know, in some ways it was fortunate.
I was kind of, you know,
you can have almost beginner's luck, right?
I was like a newcomer to the field.
And, you know, I looked at the neural net
that we were using for speech,
the recurrent neural networks.
And I said, I don't know, what if you make them bigger and give them more layers? And what if you scale up
the data along with this, right? I just saw these as like independent dials that you could turn.
And I noticed that the model started to do better and better as you gave them more data,
as you made the models larger, as you trained them for longer. And I didn't measure things precisely in those days,
but along with colleagues,
we very much got the informal sense that the more data
and the more compute and the more training
you put into these models, the better they perform.
And so initially my thinking was,
hey, maybe that is just true
for speech recognition systems, right?
Maybe that's just one particular quirk, one particular area.
I think it wasn't until 2017 when I first saw the results from GPT-1 that it clicked
for me that language is probably the area in which we can do this.
We can get trillions of words of language data.
We can train on them.
And the models we were training in those days were tiny.
You could train them on one to eight GPUs,
whereas now we train jobs on tens of thousands,
soon going to hundreds of thousands of GPUs.
And so when I saw those two things together,
and there were a few people like Ilya Sutskever, who
you've interviewed, who had somewhat similar views.
He might have been the first one,
although I think a few people came to similar views
around the same time, right?
There was, you know, Rich Sutton's bitter lesson.
There was Gwern, who wrote about the scaling hypothesis.
But I think somewhere between 2014 and 2017
was when it really clicked for me,
when I really got conviction that,
hey, we're gonna be able to do these incredibly wide
cognitive tasks if we just scale up the models.
And at every stage of scaling, there are always arguments.
And when I first heard them, honestly, I thought,
probably I'm the one who's wrong.
And all these experts in the field are right.
They know the situation better than I do.
There's the Chomsky argument about you can get syntax, but you can't get semantics.
There was this idea, oh, you can make a sentence make sense,
but you can't make a paragraph make sense.
The latest one we have today is,
we're gonna run out of data
or the data isn't high quality enough
or models can't reason.
And each time, every time we manage to,
we manage to either find a way around
or scaling just is the way around.
Sometimes it's one, sometimes it's the other.
And so I'm now at this point, I still think,
you know, it's always quite uncertain.
We have nothing but inductive inference to tell us
that the next few years are gonna be like
the last 10 years.
But I've seen the movie enough times.
I've seen the story happen for enough times
to really believe that probably the scaling
is going to continue and that there's some magic to it
that we haven't really explained on a theoretical basis yet.
And of course the scaling here is bigger networks,
bigger data, bigger compute.
Yes.
All of those.
In particular, linear scaling up of bigger networks,
bigger training times, and more data.
So all of these things, almost like a chemical reaction.
You have three ingredients in the chemical reaction,
and you need to linearly scale up the three ingredients.
If you scale up one, not the others,
you run out of the other reagents, and the reaction stops.
But if you scale up everything in series,
then the reaction can proceed.
And of course, now that you have this kind of empirical
science slash art, you can apply to other more nuanced things
like scaling laws applied to interpretability
or scaling laws applied to post-training
or just seeing
how does this thing scale.
But the big scaling law, I guess the underlying scaling hypothesis has to do with big networks,
big data leads to intelligence.
Yeah, we've documented scaling laws in lots of domains other than language, right?
So initially, the paper we did that first showed it was in early 2020, where we
first showed it for language. There was then some work late in 2020, where we showed the same thing
for other modalities like images, video, text to image, image to text, math, they all had the same
pattern. And you're right, now there are other stages like post-training or there are new types of reasoning models.
And in all of those cases that we've measured,
we see similar types of scaling laws.
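To make the shape of that claim concrete: a scaling law is usually just a power-law fit of loss against compute (or parameters, or data), extrapolated forward. The sketch below uses invented numbers and arbitrary units, not measurements from any real model, purely to illustrate the fit and the extrapolation being described.

```python
import numpy as np

# Hypothetical data: training compute (arbitrary units) and the loss reached
# at that compute. These numbers are invented for illustration only.
compute = np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])
loss = np.array([3.10, 2.62, 2.21, 1.87, 1.58])

# A power law, loss = a * compute^(-alpha), is a straight line in log-log space,
# so a simple linear fit recovers the exponent.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
alpha, a = -slope, 10 ** intercept
print(f"fitted power law: loss ~ {a:.2f} * compute^(-{alpha:.3f})")

# Extrapolate one more order of magnitude, the way the scaling hypothesis does.
next_compute = 100_000.0
print(f"predicted loss at compute {next_compute:.0f}: {a * next_compute ** (-alpha):.2f}")
```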
A bit of a philosophical question,
but what's your intuition about why bigger is better
in terms of network size and data size?
Why does it lead to more intelligent models?
So in my previous career as a biophysicist,
so I did a physics undergrad and then biophysics in grad school.
So I think back to what I know as a physicist, which is actually
much less than what some of my colleagues at Anthropic
have in terms of expertise in physics,
there's this concept called the one over F noise
and one over X distributions.
Where often, you know, just like if you add up
a bunch of natural processes, you get a Gaussian.
If you add up a bunch of kind of
differently distributed natural processes,
if you like take a probe and hook it up to a resistor,
the distribution of the thermal noise in the resistor
goes as one over the frequency.
It's some kind of natural convergent distribution.
And I think what it amounts to is that if you look at
a lot of things that are produced by some natural process
that has a lot of different scales, right?
Not a Gaussian, which is kind of narrowly distributed,
but if I look at kind of like large and small fluctuations
that lead to electrical noise,
they have this decaying one over X distribution.
And so now I think of like patterns
in the physical world, right?
If I, or in language,
if I think about the patterns in language,
there are some really simple patterns.
Some words are much more common than others, like the.
Then there's basic noun-verb structure.
Then there's the fact that, you know,
nouns and verbs have to agree, they have to coordinate.
And there's the higher level sentence structure.
Then there's the thematic structure of paragraphs.
And so the fact that there's this regressing structure,
you can imagine that as you make the networks larger,
first they capture the really simple correlations,
the really simple patterns,
and there's this long tail of other patterns.
And if that long tail of other patterns is really smooth,
like it is with the one over F noise
in physical processes like resistors,
then you can imagine as you make the network larger,
it's kind of capturing more and more of that distribution.
And so that smoothness gets reflected
in how well the models are at predicting
and how well they perform.
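The simplest layer of that hierarchy, raw word frequency, is easy to see directly. The sketch below counts words in a tiny stand-in string rather than a real corpus, but the heavy-tailed shape it prints is the same one that shows up, far more dramatically, at web scale (roughly Zipf's law, with counts falling off like one over rank).

```python
from collections import Counter

# Stand-in corpus; in practice this would be a large body of text.
corpus = (
    "the cat sat on the mat and the dog sat by the door while the cat "
    "watched the dog and the door stayed shut"
)

counts = Counter(corpus.split())

# A few words ("the") dominate, and there is a long tail of rarer words --
# the simplest patterns are exactly what a small model picks up first.
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    print(f"rank {rank:2d}  count {count:2d}  {word}")
```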
Language is an evolved process, right?
We've developed language,
we have common words and less common words,
we have common expressions
and less common expressions.
We have ideas, cliches that are expressed frequently
and we have novel ideas.
And that process has developed, has evolved with humans
over millions of years.
And so the guess, and this is pure speculation,
would be that there's some kind of long tail distribution
of the distribution of these ideas.
So there's the long tail,
but also there's the height of the hierarchy of concepts
that you're building up.
So the bigger the network,
presumably you have a higher capacity to.
Exactly, if you have a small network,
you only get the common stuff, right?
If I take a tiny neural network,
it's very good at understanding that, you know,
a sentence has to have, you know, verb adjective noun,
right, but it's terrible at deciding
what those verb adjective and noun should be
and whether they should make sense.
If I make it just a little bigger, it gets good at that.
Then suddenly it's good at the sentences,
but it's not good at the paragraphs.
And so these rarer and more complex patterns get picked up
as I add more capacity to the network.
Well, the natural question then is,
what's the ceiling of this?
Yeah.
Like how complicated and complex is the real world?
How much stuff is there to learn?
I don't think any of us knows the answer to that question.
My strong instinct would be that there's no ceiling
below the level of humans, right?
We humans are able to understand these various patterns.
And so that makes me think that if we continue to scale up these models to kind of develop
new methods for training them and scaling them up, that will at least get to the level
that we've gotten to with humans.
There's then a question of how much more is it possible to understand than humans do?
How much is it possible to be smarter
and more perceptive than humans?
I would guess the answer has got to be domain dependent.
If I look at an area like biology,
and you know, I wrote this essay,
Machines of Loving Grace,
it seems to me that humans are struggling
to understand the complexity of biology, right?
If you go to Stanford or to Harvard or to Berkeley,
you have whole departments of, you know,
folks trying to study, you know,
like the immune system or metabolic pathways.
And each person understands only a tiny part of it,
specializes, and they're struggling
to combine their knowledge with that of other humans.
And so I have an instinct that there's a lot of room at the top for AIs to get smarter.
If I think of something like materials in the physical world or like addressing conflicts
between humans or something like that, I mean, you know, it may be that some of these problems are not intractable, but much harder.
And it may be that there's only so well you can do
at some of these things, right?
Just like with speech recognition,
there's only so clear I can hear your speech.
So I think in some areas, there may be ceilings
that are very close to what humans have done.
In other areas, those ceilings may be very far away.
And I think we'll only find out when we build these systems.
It's very hard to know in advance.
We can speculate, but we can't be sure.
And in some domains, the ceiling might have to do
with human bureaucracies and things like this,
as you write about.
Yes.
So humans fundamentally have to be part of the loop.
That's the cause of the ceiling,
not maybe the limits of the intelligence.
Yeah, I think in many cases, you know, in theory,
technology could change very fast.
For example, all the things that we might invent
with respect to biology.
But remember, there's a clinical trial system
that we have to go through
to actually administer these things to humans.
I think that's a mixture of things
that are unnecessary and bureaucratic
and things that kind of protect the integrity of society.
And the whole challenge is that it's hard to tell,
it's hard to tell what's going on.
It's hard to tell which is which, right?
My view is definitely, I think,
in terms of drug development,
my view is that we're too slow and we're too conservative.
But certainly, if you get these things wrong,
it's possible to risk people's lives
by being too reckless.
And so at least some of these human institutions
are in fact protecting people.
So it's all about finding the balance.
I strongly suspect that balance is kind of more
on the side of wishing to make things happen faster,
but there is a balance.
If we do hit a limit, if we do hit a slow down
in the scaling laws, what do you think would be the reason?
Is it compute limited, data limited?
Is it something else, idea limited?
So a few things.
Now we're talking about hitting the limit
before we get to the level of humans
and the skill of humans.
So I think one that's popular today,
and I think could be a limit that we run into,
like most of the limits, I would bet against it,
but it's definitely possible,
is we simply run out of data.
There's only so much data on the internet,
and there's issues with the quality of the data, right?
You can get hundreds of trillions of words on the internet,
but a lot of it is repetitive or it's search engine,
you know, search engine optimization drivel,
or maybe in the future it'll even be text generated
by AIs itself.
And so I think there are limits to what can be produced
in this way.
That said, we, and I would guess other companies,
are working on ways to make data synthetic,
where you can use the model to generate more data
of the type that you have already,
or even generate data from scratch.
If you think about what was done with DeepMind's AlphaGo Zero,
they managed to get a bot all the way from no ability
to play Go whatsoever to above human level
just by playing against itself.
There was no example data from humans
required in the AlphaGo Zero version of it.
The other direction, of course, is these reasoning models
that do chain of thought and stop to think and reflect
on their own thinking.
In a way, that's another kind of synthetic data coupled
with reinforcement learning.
So my guess is with one of those methods, we'll get around the data limitation or there
may be other sources of data that are available.
We could just observe that even if there's no problem with data, as we start to scale
models up, they just stop getting better.
It seemed to be a reliable observation that they've gotten better.
That could just stop at some point
for a reason we don't understand.
The answer could be that we need
to invent some new architecture.
There have been problems in the past
with, say, numerical stability of models
where it looked like things were leveling off,
but actually, you know, when
we found the right unblocker, they didn't end up doing so.
So perhaps there's some new optimization method or some new technique we need to unblock things.
I've seen no evidence of that so far, but if things were to slow down, that perhaps
could be one reason.
What about the limits of compute, meaning the expensive nature of building bigger and bigger
data centers?
So right now, I think most of the frontier model companies,
I would guess, are operating roughly $1 billion scale,
plus or minus a factor of three.
Those are the models that exist now or are being trained now.
I think next year, we're gonna go to a few billion,
and then 2026 we may go to above 10 billion,
and probably by 2027,
there are ambitions to build $100 billion clusters.
And I think all of that actually will happen.
There's a lot of determination to build the compute,
to do it within this country, and I would guess that it actually does happen. Now, if we get to a hundred billion,
that's still not enough compute, that's still not enough scale, then either we need even more scale
or we need to develop some way of doing it more efficiently, of shifting the curve.
I think between all of these, one of the reasons I'm bullish about powerful AI happening so fast is just that if you extrapolate the next few points on the curve, we're very quickly getting towards human level ability, right?
Some of the new models that we developed, some reasoning models that have come from other companies, they're starting to get to what I would call the PhD or professional level, right? If you look at their coding
ability, the latest model we released, Sonnet 3.5, the new or updated version, it gets something
like 50% on SWE-bench. And SWE-bench is an example of a bunch of professional real world
software engineering tasks. At the beginning of the year, I think the state of the art was three or 4%.
So in 10 months, we've gone from 3% to 50% on this task.
And I think in another year, we'll probably be at 90%.
I mean, I don't know,
but might even be less than that.
We've seen similar things in graduate level math, physics,
and biology from models like OpenAI's O1.
So if we just continue to extrapolate this, right,
in terms of skill that we have,
I think if we extrapolate the straight curve,
within a few years, we will get to these models being,
you know, above the highest professional level
in terms of humans.
Now, will that curve continue?
You've pointed to, and I've pointed to a lot of reasons,
possible reasons why that might not happen.
But if the extrapolation curve continues,
that is the trajectory we're on.
So Anthropic has several competitors.
It'd be interesting to get your sort of view of it all.
OpenAI, Google, XAI, Meta, what does it take to win
in the broad sense of win in the space?
Yeah, so I wanna separate out a couple things, right?
So, you know, Anthropic's mission is to kind of try
to make this all go well, right?
And, you know, we have a theory of change called
Race to the Top, right?
Race to the Top is about trying to push the other players
to do the right thing by setting an example.
It's not about being the good guy,
it's about setting things up
so that all of us can be the good guy.
I'll give a few examples of this.
Early in the history of Anthropic,
one of our co-founders, Chris Olah,
who I believe you're interviewing soon,
you know, he's the co-founder
of the field of mechanistic interpretability,
which is an attempt to understand
what's going on inside AI models. So we had him and one of our early teams focus on this area of interpretability,
which we think is good for making models safe and transparent.
For three or four years, that had no commercial application whatsoever.
It still doesn't today. We're doing some early betas with it,
and probably it will eventually, but this is a very, very long
research bet and one in which we've built in public and shared our results publicly.
And we did this because we think it's a way to make models safer.
An interesting thing is that as we've done this, other companies have started doing it
as well.
In some cases, because they've been inspired by it. In some cases, because they're worried that,
you know, if other companies are doing this and look more responsible, they want to look more
responsible too. No one wants to look like the irresponsible actor, and so they adopt this as well.
When folks come to Anthropic, interpretability is often a draw, and I tell them, the other places you didn't go, tell them why you came here.
And then you see soon that there's interpretability teams elsewhere as well.
And in a way, that takes away our competitive advantage because it's like, oh, now others
are doing it as well, but it's good for the broader system.
And so we have to invent some new thing that we're doing that others aren't doing as well.
And the hope is to basically bid up the importance
of doing the right thing.
And it's not about us in particular, right?
It's not about having one particular good guy.
Other companies can do this as well.
If they join the race to do this,
that's the best news ever, right?
It's about kind of shaping the incentives to point upward
instead of shaping the incentives to point downward.
And we should say this example,
the field of mechanistic interpretability
is just a rigorous non-hand wavy way of doing AI safety.
Or it's tending that way.
Trying to, I mean, I think we're still early
in terms of our ability to see things,
but I've been surprised at how much we've been able
to look inside these systems and understand what we see.
Unlike with the scaling laws
where it feels like there's some law
that's driving these models to perform better,
on the inside, the models aren't,
there's no reason why they should be designed
for us to understand them, right?
They're designed to operate.
They're designed to work, just like the human brain
or human biochemistry.
They're not designed for a human to open up the hatch,
look inside and understand them.
But we have found, and you know,
you can talk in much more detail about this to Chris,
that when we open them up, when we do look inside them,
we find things that are surprisingly interesting.
And as a side effect,
you also get to see the beauty of these models.
You get to explore the sort of the beautiful nature
of large neural networks
through the mech interp kind of methodology.
I'm amazed at how clean it's been.
I'm amazed at things like induction heads.
I'm amazed at things like, you know,
that we can, you know,
use sparse autoencoders to find these directions
within the networks and that the directions correspond
to these very clear concepts.
We demonstrated this a bit with Golden Gate Claude.
So this was an experiment where we found a direction
inside one of the neural networks layers
that corresponded to the Golden Gate
Bridge. And we just turned that way up. And so we released this model as a demo. It was kind of half
a joke for a couple of days, but it was illustrative of the method we developed. And you could take the
Golden Gate, you could take the model, you could ask it about anything. It would be like, you could
say, how was your day?
And anything you asked, because this feature was activated,
it would connect to the Golden Gate Bridge.
So it would say, I'm feeling relaxed and expansive,
much like the arches of the Golden Gate Bridge.
It would masterfully change topic to the Golden Gate Bridge
and integrate it.
There was also a sadness to it,
to the focus it had on the Golden Gate Bridge.
I think people quickly fell in love with it.
I think, so people already miss it
because it was taken down, I think, after a day.
Somehow these interventions on the model
where you kind of adjust its behavior,
somehow emotionally made it seem more human
than any other version of the model.
It's a strong personality, strong identity.
It has a strong personality.
It has these kind of like obsessive interests.
You know, we can all think of someone
who's like obsessed with something.
So it does make it feel somehow a bit more human.
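A rough sketch of what "finding a direction and turning it way up" can look like in code. Everything here is hypothetical: a single linear layer stands in for one layer of a transformer's residual stream, and the feature direction is random rather than one extracted by a sparse autoencoder, so this illustrates the kind of intervention being described, not Anthropic's actual implementation.

```python
import torch
import torch.nn as nn

hidden_size = 16
layer = nn.Linear(hidden_size, hidden_size)  # stand-in for one transformer layer

# Invented "feature direction"; in the real work this would come from a
# sparse autoencoder trained on the layer's activations.
feature_direction = torch.randn(hidden_size)
feature_direction = feature_direction / feature_direction.norm()
steering_strength = 8.0  # "turn that feature way up"

def steer(module, inputs, output):
    # Add the scaled feature direction to every activation vector the layer emits.
    return output + steering_strength * feature_direction

handle = layer.register_forward_hook(steer)

activations = torch.randn(4, hidden_size)  # pretend batch of token activations
steered = layer(activations)
print("projection onto feature direction:", (steered @ feature_direction).tolist())

handle.remove()  # stop steering
```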
Let's talk about the present.
Let's talk about Claude.
So this year, a lot has happened.
In March, Claude 3 Opus, Sonnet, and Haiku were released.
Then Claude 3.5 Sonnet in June,
with an updated version just now released.
And then also Claude 3.5 Haiku was released.
Okay, can you explain the difference
between Opus, Sonnet, and Haiku
and how we should think about the different versions?
Yeah, so let's go back to March
when we first released these three models.
So, you know,
our thinking was, different companies produce kind of large and small models, better and
worse models. We felt that there was demand both for a really powerful model, you know,
one that might be a little bit slower that you'd have to pay more for, and also
for fast, cheap models that are as smart
as they can be for how fast and cheap, right?
Whenever you wanna do some kind of like,
you know, difficult analysis, like if I, you know,
I wanna write code for instance, or, you know,
I wanna brainstorm ideas or I wanna do creative writing,
I want the really powerful model.
But then there's a lot of practical applications
in a business sense where it's like, I'm
interacting with a website, I, you know, like, I'm like, doing
my taxes, or I'm, you know, talking to, you know, to like a
legal advisor, and I want to analyze a contract or, you know,
we have plenty of companies that are just like, you know, I
want to do auto complete on my on my ID or something. And for
all of those things, you want to act fast
and you want to use the model very broadly.
So we wanted to serve that whole spectrum of needs.
So we ended up with this kind of poetry theme.
And so what's a really short poem? It's a haiku.
And so haiku is the small, fast, cheap model
that was, at the time, really surprisingly intelligent for how
fast and cheap it was.
Sonnet is a medium-sized poem, right?
A couple paragraphs.
And so Sonnet was the middle model.
It is smarter, but also a little bit slower, a little bit more expensive.
And Opus, like a magnum opus is a large work, Opus was the largest, smartest model at the
time.
So that was the original kind of thinking behind it.
And our thinking then was,
well, each new generation of models
should shift that trade-off curve.
So when we released Sonnet 3.5,
it has roughly the same cost and speed
as the Sonnet 3 model.
But it increased its intelligence to the point
where it was smarter than the original Opus three model,
especially for code, but also just in general.
And so now, you know, we've shown results for Haiku 3.5
and I believe Haiku 3.5, the smallest new model,
is about as good as Opus 3, the largest old model.
So basically the aim here is to shift the curve,
and then at some point there's gonna be an Opus 3.5.
Now, every new generation of models has its own thing,
they use new data, their personality changes in ways that
we kind of, you know, try to steer but are not fully able to steer. And so there's never quite
that exact equivalence where the only thing you're changing is intelligence. We always try and improve
other things and some things change without us knowing or measuring. So it's very much an
inexact science. In many ways, the manner and personality of these models
is more an art than it is a science.
So what is sort of the reason for the span of time
between say, Claude Opus 3.0 and 3.5?
What takes that time if you can speak to it?
Yeah, so there's different processes.
There's pre-training, which is just kind of
the normal language model training.
And that takes a very long time.
That uses these days, tens of thousands,
sometimes many tens of thousands of GPUs or TPUs
or Trainium or whatever. We use different platforms, but accelerator chips,
often training for months.
There's then a kind of post-training phase
where we do reinforcement learning from human feedback,
as well as other kinds of reinforcement learning.
That phase is getting larger and larger now. And often, that's less
of an exact science. It often takes effort to get it right. Models are then tested with some of our
early partners to see how good they are. And they're then tested both internally and externally
for their safety, particularly for catastrophic and autonomy risks.
So we do internal testing according
to our responsible scaling policy,
which I could talk about in more detail.
And then we have an agreement with the US and the UK AI
Safety Institute, as well as other third party
testers in specific domains, to test the models for what
are called CBRN risks, chemical, biological, radiological, and nuclear,
which are, you know, we don't think that models
pose these risks seriously yet,
but every new model we want to evaluate
to see if we're starting to get close
to some of these more dangerous capabilities.
So those are the phases.
And then, you know, then it just takes some time
to get the model working in terms of inference
and launching it in the API.
So there's just a lot of steps
to actually make your model work.
And of course, you know,
we're always trying to make the processes
as streamlined as possible, right?
We want our safety testing to be rigorous,
but we want it to be rigorous and to be automatic,
to happen as fast as it can without compromising on rigor.
Same with our pre-training process
and our post-training process.
So, it's just like building anything else.
It's just like building airplanes.
You want to make them safe,
but you want to make the process streamlined.
And I think the creative tension between those
is an important thing in making the models work.
Yeah. Rumor on the street, I forget who was saying that Anthropic
has really good tooling. So probably a lot of the challenge here on the software engineering side
is to build the tooling to have an efficient, low-friction interaction with the infrastructure.
You would be surprised how much of the challenge of,
you know, building these models comes down to, you know,
software engineering, performance engineering. You know, from the outside,
you might think, oh man, we had this Eureka breakthrough, right? You know, this movie with
the science, we discovered it, we figured it out. But I think all things, even incredible discoveries,
And often super, super boring details.
I can't speak to whether we have better tooling
than other companies.
I mean, I haven't been at those other companies,
at least not recently,
but it's certainly something we give a lot of attention to.
I don't know if you can say, but from Claude 3
to Claude 3.5,
is there any extra pre-training going on,
or do they mostly focus on the post-training?
There have been leaps in performance.
Yeah, I think at any given stage,
we're focused on improving everything at once.
Just naturally, like there are different teams,
each team makes progress in a particular area
in making a particular, their particular segment
of the relay race better.
And it's just natural that when we make a new model,
we put all of these things in at once.
So the data you have, like the preference data you get
from RLHF, is that applicable?
Are there ways to apply it to newer models
as they get trained up?
Yeah, preference data from old models
sometimes gets used for new models.
Although of course it performs somewhat better
when it's trained on the new models.
Note that we have this constitutional AI method
such that we don't only use preference data.
There's also a post-training process
where we train the model against itself.
And there's, you know, new types of post-training the model against itself that are used every day.
So it's not just RLHF, it's a bunch of other methods as well.
Post-training, I think, you know, is becoming more and more sophisticated.
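For readers who have not seen it, the core of the constitutional AI method mentioned here is a critique-and-revise loop the model runs on its own outputs. The sketch below is a simplified illustration in that spirit: `generate` is a hypothetical stand-in for a call to the model being trained, and the single principle is just an example, not Anthropic's actual constitution or pipeline.

```python
PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def generate(prompt: str) -> str:
    # Hypothetical stand-in: call your language model here.
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> tuple[str, str]:
    draft = generate(user_prompt)
    critique = generate(
        f"Principle: {PRINCIPLE}\n\nResponse: {draft}\n\n"
        "Critique the response according to the principle."
    )
    revision = generate(
        f"Principle: {PRINCIPLE}\n\nResponse: {draft}\n\nCritique: {critique}\n\n"
        "Rewrite the response so it better follows the principle."
    )
    # The (user_prompt, revision) pairs become training data, so the model is
    # effectively trained against its own critiqued outputs.
    return draft, revision
```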
Well, what explains the big leap in performance for the new Sonnet 3.5?
I mean, at least on the programming side, and maybe this is a good place to talk about benchmarks.
What does it mean to get better?
Just the number went up, but, you know, I program,
but I also love programming, and Claude 3.5,
through Cursor, is what I use to assist me in programming.
And there was, at least experientially, anecdotally,
it's gotten smarter at programming.
So what does it take to get it smarter?
We observed that as well, by the way.
There were a couple very strong engineers here at Anthropic
who all previous code models, both produced by us
and produced by all the other companies,
hadn't really been useful to them.
They said, maybe this is useful to a beginner,
it's not useful to me.
But Sonnet 3.5, the original one for the first time,
they said, oh my God, this helped me with something
that it would have taken me hours to do.
This is the first model that has actually saved me time.
So again, the water line is rising.
And then I think the new Sonnet has been even better.
In terms of what it takes, I mean, I'll just say it's been across the board.
It's in the pre-training, it's in the post-training, it's in various evaluations that we do.
We've observed this as well.
And if we go into the details of the benchmark, so, SWE-bench is basically, since you're a
programmer, you'll be familiar with, like, pull requests, and pull requests are
a sort of atomic unit of work. You could say, I'm implementing one thing.
And so, SWE-bench actually gives you kind of a real world situation where the code base is in
the current state and I'm trying to implement something that's described in language. We have internal benchmarks where we measure
the same thing, and you say, just give the model free reign to do anything, run anything, edit
anything. How well is it able to complete these tasks? And it's that benchmark that's gone from
it can do it 3% of the time to it can do it about
50% of the time.
So I actually do believe that you can game benchmarks, but I think if we get to 100%
of that benchmark in a way that isn't kind of like overtrained or gamed for that particular
benchmark, it probably represents a real and serious increase in kind of programming ability.
And I would suspect that if we can get to 90-95%, that it will represent ability to
autonomously do a significant fraction of software engineering tasks.
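The evaluation loop described here is easy to picture with a small hypothetical harness: each task is a repository checkout plus an issue written in natural language, the model proposes a patch, and held-out tests decide whether the task counts as resolved. This is a sketch in the spirit of SWE-bench, not the official harness; the task format, paths, and `model_propose_patch` function are all invented.

```python
import subprocess

# Hypothetical task format: a repo checkout, an issue description, and a test
# command whose exit code decides success.
tasks = [
    {"repo_dir": "/tmp/task_001", "issue": "Fix off-by-one in pagination", "test_cmd": ["pytest", "-q"]},
    # ... more tasks ...
]

def model_propose_patch(issue: str, repo_dir: str) -> str:
    """Stand-in for the model call that returns a unified diff for the repo."""
    raise NotImplementedError

def resolved(task: dict) -> bool:
    patch = model_propose_patch(task["issue"], task["repo_dir"])
    applied = subprocess.run(["git", "apply", "-"], input=patch, text=True, cwd=task["repo_dir"])
    if applied.returncode != 0:
        return False  # the patch did not even apply
    tests = subprocess.run(task["test_cmd"], cwd=task["repo_dir"])
    return tests.returncode == 0  # held-out tests pass -> task counts as resolved

def resolve_rate(tasks: list[dict]) -> float:
    return sum(resolved(t) for t in tasks) / len(tasks)
```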
Well, ridiculous timeline question. When is Claude Opus 3.5 coming out?
Not giving you an exact date, but, you know, as far as we know, the plan is still to have a Claude 3.5 Opus.
Are we going to get it before GTA six or no?
Like Duke Nukem Forever. So what was that game? There was some game that was
delayed 15 years. Was that Duke Nukem Forever?
Yeah. And I think GTA is not just releasing trailers. You know, it's only been three months
since we released the first Sonnet.
Yeah, it's the incredible pace of release.
It just tells you about the pace,
the expectations for when things are gonna come out.
So what about 4.0?
So how do you think about sort of as these models
get bigger and bigger about versioning
and also just versioning in general,
why Sonnet 3.5 updated with the date?
Why not Sonnet 3.6, which a lot of people are calling it?
Naming is actually an interesting challenge here, right?
Because I think a year ago,
most of the model was pre-training.
And so you could start from the beginning and just say,
okay, we're gonna have models of different sizes.
We're gonna train them all together.
And we'll have a family of naming schemes, and then we'll
put some new magic into them. And then you know, we'll have
the next generation. The trouble starts already when
some of them take a lot longer than others to train, right?
That already messes up your timing a little bit. But as
you make big improvements in
pre-training, then you suddenly notice, oh, I can make a better
pre-trained model, and that doesn't take very long to do.
But clearly it has the same size and shape of previous models.
So I think those two together, as well as the timing issues, any kind of scheme you
come up with, the reality tends to kind of frustrate that scheme, right?
It tends to kind of break out of the scheme.
It's not like software where you can say, oh, this is like 3.7, this is 3.8.
No, you have models with different trade-offs.
You can change some things in your models.
You can change other things.
Some are faster and slower at inference.
Some have to be more expensive.
Some have to be less expensive.
And so I think all the companies have struggled with this.
I think we did very, you know,
I think we were in a good position in terms of naming
when we had Haiku, Sonnet and Opus.
That was great, great start.
We're trying to maintain it, but it's not perfect.
So we'll try and get back to the simplicity,
but just the nature of the field, I feel like
no one's figured out naming.
It's somehow a different paradigm from like normal software.
And so we just, none of the companies have been perfect at it.
It's something we struggle with surprisingly much relative to how trivial it is
for the grand science of training the models.
So from the user side,
the user experience of the updated Sonnet 3.5
is just different than the previous June 2024 Sonnet 3.5.
It would be nice to come up with some kind of labeling
that embodies that because people talk about Sonnet 3.5,
but now there's a different one.
And so how do you refer to the previous one
and the new one and it,
when there's a distinct improvement,
it just makes conversation about it just challenging.
Yeah, yeah.
I definitely think this question of
there are lots of properties of the models
that are not reflected in the benchmarks.
I think that's definitely the case and everyone agrees and not all of them are capabilities.
Some of them are, you know, models can be polite or brusque.
They can be, you know, very reactive or they can ask you questions.
They can have what feels like a warm personality
or a cold personality.
They can be boring or they can be very distinctive
like Golden Gate Claude was.
And we have a whole team kind of focused on,
I think we call it Claude character.
Amanda leads that team and we'll talk to you about that. But it's still a
very inexact science. And often we find that models have properties that we're not aware of.
The fact of the matter is that you can talk to a model 10,000 times and there are some behaviors
you might not see. Just like with a human, right? I can know someone for a few months and not know
that they have a certain skill
or not know that there's a certain side to them.
And so I think we just have to get used to this idea
and we're always looking for better ways
of testing our models to demonstrate these capabilities
and also to decide which are the personality properties
we want models to have and which we don't want to have.
That itself, the normative question
is also super interesting.
I gotta ask you a question from Reddit.
From Reddit, oh boy.
You know, there's just this fascinating,
to me at least, it's a psychological social phenomenon
where people report that Claude has gotten dumber
for them over time.
And so the question is, does the user complaint
about the dumbing down of Claude 3.5 Sonnet hold any water?
So are these anecdotal reports a kind of social phenomena
or did Claude, is there any cases
where Claude would get dumber?
So this actually doesn't apply just to Claude, this isn't just about Claude.
I believe this, I believe I've seen these complaints
for every foundation model produced by a major company. People said this about GPT-4. They said
it about GPT-4 Turbo. So a couple things. One, the actual weights of the model, right? The actual
brain of the model, that does not change unless we introduce a new model.
There are just a number of reasons why it would not make
sense practically to be randomly substituting in new versions
of the model.
It's difficult from an inference perspective,
and it's actually hard to control all the consequences
of changing the weights of the model.
Let's say you wanted to fine-tune the model to, like,
I don't know, say "certainly" less,
which, you know, an old version of Sonnet used to do.
You actually end up changing a hundred things as well.
So we have a whole process for it,
and we have a whole process for modifying the model.
We do a bunch of testing on it.
We do a bunch of user testing with early customers.
So we basically have never changed the weights of the model
without telling anyone.
And certainly in the current setup,
it would not make sense to do that.
Now, there are a couple of things that we do occasionally do.
One is sometimes we run A-B tests.
But those are typically very close
to when a model is being released
and for a very small fraction of time.
So, you know, like the day before the new Sonnet 3.5,
I agree, we should have had a better name.
It's clunky to refer to it.
There were some comments from people that like,
it's gotten a lot better and that's because, you know,
a fraction were exposed to an A-B test
for those one or two days.
The other is that occasionally the system prompt will change.
The system prompt can have some effects,
although it's unlikely to dumb down models,
it's unlikely to make them dumber.
And we've seen that while these two things,
which I'm listing to be very complete,
happen quite infrequently. The complaints about, for us and for other model companies
about the model change, the model isn't good at this,
the model got more censored, the model was dumbed down,
those complaints are constant.
And so I don't wanna say like people are imagining
these or anything, but like the models are for the most part
not changing.
If I were to offer a theory, I think it actually relates to one of the things I said before,
which is that models are very complex and have many aspects to them.
And so often, if I ask the model a question, if I'm like, do task X versus can you do task X,
the model might respond in different ways.
And so there are all kinds of subtle things
that you can change about the way you interact
with the model that can give you very different results.
To be clear, this itself is like a failing by us
and by the other model providers
that the models are just often sensitive
to like small changes in wording.
It's yet another way in which the science
of how these models work is very poorly developed.
And so, you know, if I go to sleep one night
and I was like talking to the model in a certain way,
and I like slightly change the phrasing
of how I talk to the model, you know,
I could get different results.
So that's one possible way.
The other thing is, man, it's just hard to quantify this stuff.
I think people are very excited by new models when they come out,
and then as time goes on,
they become very aware of the limitations.
So that may be another effect.
But that's all a very long-winded way of saying,
for the most part, with some fairly narrow exceptions,
the models are not changing.
I think there is a psychological effect.
You just start getting used to it, the baseline raises.
Like when people first got WiFi on airplanes,
it's like amazing, magic.
And then you start-
And now I'm like, I can't get this thing to work.
This is such a piece of crap.
Exactly, so then it's easy to have the conspiracy theory of they're making wifi slower and slower.
This is probably something I'll talk to Amanda
much more about, but another Reddit question.
When will Claude stop trying to be my puritanical
grandmother, imposing its moral worldview on me
as a paying customer?
And also, what is the psychology behind making Claude
overly apologetic?
So these are kind of reports about the experience
from a different angle, and the frustration.
It has to do with the character.
Yeah, so a couple of points on this first.
One is like things that people say on Reddit
and Twitter or X or whatever it is,
there's actually a huge distribution shift
between like the stuff that people complain loudly about on social media and what actually kind of like, you know, statistically users
care about and that drives people to use the models.
Like people are frustrated with, you know, things like, you know, the model not writing
out all the code or the model, you know, just not being as good at code as it could be,
even though it's the best model in the world on code. I think the majority of things are about that,
but certainly a kind of vocal minority kind of raises these concerns, right? They're frustrated by the model refusing things that it shouldn't refuse, or apologizing too much, or just having these kind of annoying verbal tics.
The second caveat, and I just want to say this like super clearly because I think it's
like some people don't know it, others like kind of know it but forget it, like it is
very difficult to control across the board how the models behave.
You cannot just reach in there and say, oh, I want the model to apologize less. Like, you can do that. You can include training data that says, oh, the model should apologize less, but then in some other situation, they end up being, like, super rude or overconfident in a way that's misleading people.
So there are all these trade-offs. For example, another thing is, there was a period during which models, ours and I think others as well, were too verbose, right? They would repeat themselves, they would say too much.
You can cut down on the verbosity by penalizing the models for just talking for too long.
What happens when you do that, if you do it in a crude way, is when the models are coding,
sometimes they'll say, rest of the code goes here, right? Because they've learned that that's a way to economize.
And then, so that leads the model to be so-called lazy in coding, where they're just like, ah,
you can finish the rest of it.
It's not because we want to save on compute, or because the models are lazy during winter break, or any of the other conspiracy theories that have come up. It's just very hard to control the behavior of the model, to steer the behavior of the model in all circumstances at once. There's this whack-a-mole aspect: you push on one thing and these other things start to move as well, things you may not even notice or measure.
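As a rough illustration of the trade-off being described, here is a minimal sketch of a crude length penalty layered on a base reward during fine-tuning; the numbers and the penalty weight are invented assumptions, not Anthropic's actual training setup.

```python
# Minimal sketch of a crude verbosity penalty on top of a base reward.
# Penalizing length too bluntly is the kind of thing that can teach a model to
# emit placeholders like "rest of the code goes here" instead of full answers.

def shaped_reward(base_reward: float, response_tokens: int,
                  target_tokens: int = 400, penalty_per_token: float = 0.001) -> float:
    """Subtract a linear penalty for every token beyond a target length."""
    overflow = max(0, response_tokens - target_tokens)
    return base_reward - penalty_per_token * overflow

# A correct but long coding answer can now score worse than a short answer that
# omits most of the code, which is the whack-a-mole failure described above.
print(shaped_reward(base_reward=1.0, response_tokens=1200))  # 0.2
print(shaped_reward(base_reward=0.7, response_tokens=300))   # 0.7
```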
And so one of the reasons that I care so much about
kind of grand alignment of these AI systems in the future
is actually, these systems are actually quite unpredictable.
They're actually quite hard to steer and control.
And this version we're seeing today of
you make one thing better, it makes another thing
worse.
I think that's like a present day analog of future control problems in AI systems that
we can start to study today.
I think that difficulty in steering the behavior, in making sure that if we push an AI system in one direction it doesn't get pushed in some other direction that we didn't want, I think that's kind of an early sign of things to come. And if we can do a good job of solving this problem, right, of like, you ask the model to help you make and distribute smallpox, and it says no, but it's willing to help you in your graduate-level virology class. Like,
how do we get both of those things at once? It's hard. It's very easy to go to one side or the other,
and it's a multi-dimensional problem. And so, I, you know, I think these questions of like shaping
the model's personality, I think they're very hard.
I think we haven't done perfectly on them.
I think we've actually done the best
of all the AI companies, but still so far from perfect.
And I think if we can get this right,
if we can control the false positives and false negatives
in this very kind of controlled present-day environment, we'll be much better at doing it for the future, when our worry is,
will the models be super autonomous?
Will they be able to make very dangerous things?
Will they be able to autonomously build whole companies
and are those companies aligned?
So I think of this present task as both vexing
but also good practice for the future.
What's the current best way of gathering
sort of user feedback, like not anecdotal data,
but just large scale data about pain points
or the opposite of pain points, positive things, so on.
Is it internal testing?
Is it a specific group testing, A-B testing?
What works?
So typically we'll have internal model bashings
where all of Anthropic, Anthropic is almost a thousand people,
people just try and break the model.
They try and interact with it various ways.
We have a suite of evals for,
oh, is the model refusing in ways that it shouldn't? I think we even had a certainly eval because, again, at one point, the model had this problem where it had this annoying tic where it would respond
to a wide range of questions by saying,
certainly, I can help you with that.
Certainly, I would be happy to do that.
Certainly, this is correct.
And so we had a certainly eval,
which is how often does the model say certainly.
But look, this is just a whack-a-mole.
Like what if it switches from certainly to definitely?
So, you know, every time we add a new eval, we keep evaluating for all the old things as well.
So we have hundreds of these evaluations,
but we find that there's no substitute
for human interacting with it.
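As a toy illustration, here is a minimal sketch of what a 'certainly' eval could look like; the sample responses are invented, and a check this literal is exactly what a switch from "certainly" to "definitely" would slip past.

```python
# Minimal sketch of a "certainly" eval: measure how often responses open with
# the verbal tic we are trying to discourage. A real eval would run over fresh
# model samples; the responses below are just a hypothetical list.

def certainly_rate(responses: list[str]) -> float:
    """Fraction of responses that open with 'certainly' (case-insensitive)."""
    hits = sum(r.strip().lower().startswith("certainly") for r in responses)
    return hits / len(responses) if responses else 0.0

responses = [
    "Certainly, I can help you with that.",
    "Here is the function you asked for.",
    "Certainly! This is correct.",
]
print(f"certainly rate: {certainly_rate(responses):.0%}")  # 67%
```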
And so it's very much like
the ordinary product development process.
We have like hundreds of people within Anthropic
bash the model, then we do, you know,
then we do external AB tests.
Sometimes we'll run tests with contractors.
We pay contractors to interact with the model.
So you put all of these things together
and it's still not perfect.
You still see behaviors that you don't quite wanna see, right? You still see the model refusing things that it just doesn't make sense to refuse.
But I think trying to solve this challenge,
trying to stop the model from doing genuinely bad things
that everyone agrees it shouldn't do.
Everyone agrees that the model shouldn't talk about,
I don't know, child abuse material, right?
Like everyone agrees the model shouldn't do that.
But at the same time that it doesn't refuse
in these dumb and stupid ways.
I think drawing that line as finely as possible,
approaching perfectly is still a challenge
and we're getting better at it every day,
but there's a lot to be solved.
And again, I would point to that as an indicator
of a challenge ahead in terms of steering
much more powerful models.
Do you think Claude 4.0 is ever coming out?
I don't wanna commit to any naming scheme
cause if I say here, we're gonna have Claude 4 next year
and then we decide that we should start over
because there's a new type of model.
I don't wanna commit to it.
I would expect in a normal course of business
that Claude IV would come after Claude 3.5,
but you never know in this wacky field, right?
But this idea of scaling is continuing.
Scaling is continuing.
There will definitely be more powerful models
coming from us than the models that exist today.
That is certain, or if there aren't,
we've deeply failed as a company.
Okay, can you explain the responsible scaling policy
and the AI safety level standards, ASL levels?
As much as I'm excited about the benefits of these models,
and we'll talk about that
if we talk about machines of loving grace,
I'm worried about the risks
and I continue to be worried about the risks.
No one should think that machines of loving grace
was me saying, I'm no longer worried
about the risks of these models.
I think they're two sides of the same coin.
The power of the models and their ability
to solve all these problems in biology, neuroscience,
economic development, governance and peace,
large parts of the economy,
those come with risks as well, right?
With great power comes great responsibility, right?
The two are paired.
Things that are powerful can do good things
and they can do bad things.
I think of those risks as being in
several different categories.
Perhaps the two biggest risks that I think about,
and that's not to say that there aren't risks today
that are important, but when I think of the really,
the things that would happen on the grandest scale,
one is what I call catastrophic misuse.
These are misuse of the models in domains
like cyber, bio, radiological, nuclear, right?
Things that could harm or even kill thousands,
even millions of people if they really, really go wrong.
Like these are the number one priority to prevent. And here, I would just make
a simple observation, which is that the models, if I look today at people who have done really
bad things in the world, I think actually humanity has been protected by the fact that the overlap
between really smart, well-educated people and people who wanna do
really horrific things has generally been small.
Like, let's say I'm someone who,
I have a PhD in this field, I have a well-paying job,
there's so much to lose.
Why do I wanna like, even assuming I'm completely evil,
which most people are not,
why would such a person risk their life,
risk their legacy, their reputation
to do something like truly, truly evil?
If we had a lot more people like that,
the world would be a much more dangerous place.
And so my worry is that by being
a much more intelligent agent,
AI could break that correlation.
And so I do have serious worries about that.
I believe we can prevent those worries,
but I think as a counterpoint to machines of loving grace,
I wanna say that there's still serious risks.
And the second range of risks would be the autonomy risks,
which is the idea that models might on their own,
particularly as we give them more agency than they've had in the past, particularly as we give them
supervision over wider tasks like, you know, writing whole code bases or someday even, you know,
effectively operating entire companies, they're on a long enough leash. Are they doing what we really want them to do?
It's very difficult to even understand in detail
what they're doing, let alone control it.
And like I said, these early signs that it's hard
to perfectly draw the boundary between things
the model should do and things the model shouldn't do,
that if you go to one side, you get things that are annoying and useless
and you go to the other side, you get other behaviors.
If you fix one thing, it creates other problems.
We're getting better and better at solving this.
I don't think this is an unsolvable problem.
I think this is a science,
like the safety of airplanes or the safety of cars
or the safety of drugs.
I don't think there's any big thing we're missing.
I just think we need to get better
at controlling these models.
And so these are the two risks I'm worried about.
And our responsible scaling plan,
which I'll recognize is a very long-winded answer
to your question.
I love it. I love it.
Our responsible scaling plan is designed to address
these two types of risks.
And so every time we develop a new model,
we basically test it for its ability
to do both of these bad things.
So if I were to back up a little bit,
I think we have an interesting dilemma with AI systems
where they're not yet powerful enough
to present these catastrophes.
I don't know that they'll ever present these catastrophes. It's possible they won't, but the case for worry, the case for risk is strong
enough that we should act now. And they're getting better very, very fast. I testified
in the Senate that we might have serious bio risks within two to three years. That was about a year ago. Things have proceeded apace.
So we have this thing where it's like,
it's surprisingly hard to address these risks
because they're not here today.
They don't exist.
They're like ghosts, but they're coming at us so fast
because the models are improving so fast.
So how do you deal with something that's not here today,
doesn't exist, but is coming at us very fast?
So the solution we came up with for that, in collaboration with people like the organization METR and Paul Christiano, is, okay, what you need for that are tests
to tell you when the risk is getting close.
You need an early warning system.
And so every time we have a new model,
we test it for its capability to do these CBRN tasks,
as well as testing it for, you know,
how capable it is of doing tasks autonomously on its own.
And in the latest version of our RSP,
which we released in the last month or two,
the way we test autonomy risks is the AI model's ability
to do aspects of AI research itself,
which when the AI models can do AI research,
they become kind of truly, truly autonomous.
And that threshold is important in a bunch of other ways.
And so what do we then do with these tasks?
The RSP basically develops what we've called
an if-then structure, which is if the models
pass a certain capability, then we impose a certain set
of safety and security requirements on them.
So today's models are what's called ASL2. ASL1 is for systems that manifestly don't pose any risk of autonomy or misuse. So for example, a chess-playing bot, Deep Blue, would be ASL1.
It's just manifestly the case that you can't use Deep Blue
for anything other than chess.
It was just designed for chess. No one's going to use it to conduct a masterful cyber attack or to run wild and
take over the world. ASL2 is today's AI systems where we've measured them and we think these
systems are simply not smart enough to autonomously self-replicate or conduct a bunch of tasks.
And also not smart enough to provide meaningful information
about CBRN risks and how to build CBRN weapons
above and beyond what can be known from looking at Google.
In fact, sometimes they do provide information, but not above and beyond a search engine, and not in a way that can be stitched together, not in a way that kind of end-to-end is dangerous enough.
So ASL3 is gonna be the point at which
the models are helpful enough
to enhance the capabilities of non-state actors, right?
State actors, unfortunately, can already do a lot of these very dangerous and destructive things to a high level of proficiency.
The difference is that non-state actors are not capable of it. And so when we get to ASL 3,
we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors, and misuse of the model as it's deployed will have to have enhanced filters targeted at these particular areas.
Cyber, bio, nuclear.
Cyber, bio, nuclear and model autonomy,
which is less a misuse risk and more a risk of the model
doing bad things itself.
ASL4 is getting to the point where these models could enhance the capability of an already knowledgeable state actor and/or become the main source of such a risk. Like, if you wanted to engage in such a risk, the main way you would do it is through a model.
And then I think ASL4 on the autonomy side,
it's some amount of acceleration
in AI research capabilities with an AI model.
And then ASL5 is where we would get to the models that are kind of truly capable, that could exceed humanity in their ability to do any of these tasks.
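A minimal sketch of the if-then structure as described here, with the ASL tiers and required measures written as plain data; the eval scores, thresholds, and requirement lists are illustrative placeholders, since the real RSP triggers are far more detailed.

```python
# Minimal sketch of the RSP's if-then structure: capability evals feed triggers,
# and crossing a trigger imposes a corresponding set of requirements.
# Thresholds and requirement lists below are illustrative placeholders only.

ASL_REQUIREMENTS = {
    2: ["standard deployment safeguards"],
    3: ["security sufficient to prevent model theft by non-state actors",
        "enhanced deployment filters for cyber, bio, nuclear, and autonomy"],
    4: ["measures addressing state-actor-level misuse",
        "interpretability-based verification that the model is not sandbagging"],
}

def required_asl(cbrn_uplift: float, autonomy_score: float) -> int:
    """Map capability-eval results to the ASL tier whose safeguards must apply."""
    if cbrn_uplift > 0.8 or autonomy_score > 0.8:   # placeholder ASL-4 triggers
        return 4
    if cbrn_uplift > 0.5 or autonomy_score > 0.5:   # placeholder ASL-3 triggers
        return 3
    return 2

level = required_asl(cbrn_uplift=0.6, autonomy_score=0.3)
print(level, ASL_REQUIREMENTS[level])
```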
And so the point of the if-then structure commitment
is basically to say, look, I don't know,
I've been working with these models for many years
and I've been worried about risk for many years.
It's actually kind of dangerous to cry wolf.
It's actually kind of dangerous to say,
this model is risky.
And people look at it and they say,
this is manifestly not dangerous.
Again, the delicacy is that the risk isn't here today, but it's coming at us fast.
How do you deal with that?
It's really vexing to a risk planner to deal with it.
And so this if then structure basically says,
look, we don't want to antagonize a bunch of people. We don't want to harm our own, you know, our kind of own ability to have a place in the conversation
by imposing these very onerous burdens
on models that are not dangerous today.
So the if then, the trigger commitment
is basically a way to deal with this.
It says you clamp down hard
when you can show that the model is dangerous.
And of course, what has to come with that is enough of a buffer threshold that you're not at high
risk of kind of missing the danger. It's not a perfect framework. We've had to change it.
We came out with a new one just a few weeks ago, and probably going forward, we might release new
ones multiple times a year
because it's hard to get these policies right,
like technically, organizationally,
from a research perspective, but that is the proposal.
If then commitments and triggers
in order to minimize burdens and false alarms now,
but really react appropriately when the dangers are here.
What do you think the timeline for ASL3 is
where several of the triggers are fired?
And what do you think the timeline is for ASL4?
Yeah, so that is hotly debated within the company.
We are working actively to prepare ASL3 security measures
as well as ASL3 deployment measures.
I'm not gonna go into detail,
but we've made a lot of progress on both
and we're prepared to be, I think, ready quite soon.
I would not be surprised at all
if we hit ASL3 next year.
There was some concern that we might even hit it this year,
that's still possible, that could still happen.
It's like very hard to say, but I would be very, very surprised if it was like 2030.
I think it's much sooner than that.
So there's protocols for detecting it, the if-then, and then there's protocols for how
to respond to it.
Yes.
How difficult is the second, the latter?
Yeah, I think for ASL 3, it's primarily about security and about filters on the model relating
to a very narrow set of areas when we deploy the model. Because at ASL3, the model isn't
autonomous yet. And so you don't have to worry about the model itself behaving in a bad way,
even when it's deployed internally. So I think the ASL three measures are,
I won't say straightforward, they're rigorous,
but they're easier to reason about.
I think once we get to ASL four,
we start to have worries about the models
being smart enough that they might sandbag tests.
They might not tell the truth about tests.
We had some results come out about, like, sleeper agents, and there was a more recent paper about whether the models can sandbag their own abilities, right? Present themselves as being less capable than they are.
And so I think with ASL4,
there's gonna be an important component
of using other things than just interacting
with the models.
For example, interpretability or hidden chains of thought,
where you have to look inside the model
and verify via some other mechanism
that is not as easily corrupted as what the model says,
that the model indeed has some property.
So we're still working on ASL4.
One of the properties of the RSP is that
we don't specify ASL4 until we've hit ASL3.
And I think that's proven to be a wise decision
because even with ASL3, again,
it's hard to know this stuff in detail.
And we wanna take as much time as we can possibly take
to get these things right.
So for ASL three, the bad actor will be humans.
Humans, yes.
And so there's a little bit more.
For ASL four, it's both, I think.
It's both, and so deception
and that's where mechanistic interpretability
comes into play.
And hopefully the techniques used for that
are not made accessible to the model.
Yeah, I mean, of course you can hook up
the mechanistic interpretability to the model itself,
but then you've kind of lost it as a reliable indicator
of the model state.
There are a bunch of exotic ways you can think of
that it might also not be reliable,
like if the model gets smart enough that it can jump computers and read the code where
you're looking at its internal state. We've thought about some of those. I think they're
exotic enough. There are ways to render them unlikely. But yeah, generally you want to
preserve mechanistic interpretability as a kind of verification set or test set that's
separate from the training process of the model.
See, I think as these models become better
and better conversation and become smarter,
social engineering becomes a threat too,
because they can start being very convincing
to the engineers inside companies.
Oh yeah, yeah.
It's actually like, you know,
we've seen lots of examples of demagoguery
in our life from humans,
and you know, there's a concern that models
could do that as well. One of the ways that Claude has been getting more and more powerful is it's now able to do some agentic stuff. Computer use, there's also an analysis within the sandbox of Claude.ai itself, but let's talk about computer use. That seems to me super exciting, that you can just give Claude a task and it takes a bunch of actions, figures it out, and has access to your computer through screenshots. So can you explain how that works and where that's headed?
Yeah, it's actually relatively simple. So Claude has had for a long time, since Claude 3 back in March, the ability to analyze images and respond to them with text. The only new thing we added is those images can be screenshots of a computer. And in response,
we train the model to give a location on the screen where you can click and or buttons on
the keyboard you can press in order to take action. And it turns out that with actually not all that much
additional training, the models can get quite good
at that task.
It's a good example of generalization.
You know, people sometimes say if you get to low Earth orbit
you're like halfway to anywhere, right?
Because of how much it takes to escape the gravity well.
If you have a strong pre-trained model
I feel like you're halfway to anywhere
in terms of the intelligence space.
And so actually it didn't take all that much
to get Claude to do this.
And you can just set that in a loop.
Give the model a screenshot, tell it what to click on,
give it the next screenshot, tell it what to click on.
And that turns into a full kind of almost 3D video interaction
of the model.
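Here is a minimal sketch of the screenshot-and-action loop being described; capture_screen, choose_action, and perform are hypothetical helpers rather than an actual Anthropic API, and real use would add the boundaries and guardrails mentioned a bit further on.

```python
# Minimal sketch of the computer-use loop: screenshot in, click or keypress out,
# repeated until the model says it is done. The three helpers are hypothetical
# stand-ins for a vision-capable model API and an OS automation layer.

def run_agent_loop(capture_screen, choose_action, perform, task: str,
                   max_steps: int = 20) -> None:
    """Loop: show the model a screenshot, apply the action it proposes, repeat."""
    for _ in range(max_steps):          # hard step cap as a basic guardrail
        screenshot = capture_screen()   # e.g. PNG bytes of the current screen
        action = choose_action(task, screenshot)  # e.g. {"type": "click", "x": 310, "y": 542}
        if action["type"] == "done":
            break
        perform(action)                 # move the mouse / press the keys
```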
And it's able to do all of these tasks, right?
We showed these demos where it's able to, like, fill out spreadsheets, it's able to kind of interact with a website, it's able to open all kinds of programs in different operating systems, Windows, Linux, Mac.
So, I think all of that is very exciting.
I will say, while in theory,
there's nothing you could do there
that you couldn't have done through just giving the model
the API to drive the computer screen,
this really lowers the barrier.
And there's a lot of folks who either kind of aren't
in a position to interact with those APIs
or it takes them a long time to do.
It's just the screen is just a universal interface
that's a lot easier to interact with.
And so I expect over time,
this is gonna lower a bunch of barriers.
Now, honestly, the current model, it leaves a lot still to be desired.
And we were honest about that in the blog, right?
It makes mistakes, it misclicks.
And we were careful to warn people,
hey, this thing isn't... You can't just leave this thing to run on your computer for minutes
and minutes. You got to give this thing boundaries and guardrails. And I think that's one of the
reasons we released it first in an API form rather than just handing it to the consumer and giving it control of their computer.
But, you know, I definitely feel that it's important to get these capabilities out there
as models get more powerful. We're going to have to grapple with, you know, how do we use these
capabilities safely? How do we prevent them from being abused? And, you know, I think releasing the model while the capabilities are still limited is very helpful
in terms of doing that. I think since it's been released, a number of customers, I think
Replit was maybe one of the quickest to deploy things, have made use of it
in various ways.
People have hooked up demos for Windows desktops,
Macs, Linux machines.
So yeah, it's been very exciting.
I think as with anything else,
it comes with new exciting abilities.
And then with those new exciting abilities,
we have to think about how to make the model,
say, reliable, do what humans want them to do.
I mean, it's the same story for everything, right?
Same thing, it's that same tension.
But the possibility of use cases here is just,
the range is incredible.
So how much, to make it work really well in the future, how much do you have to specially kind of go beyond what the pre-trained model is doing? Do more post-training, RLHF, or supervised fine-tuning, or synthetic data just for the agentic stuff?
Yeah, I think speaking at a high level,
it's our intention to keep investing a lot
in making the model better.
Like I think we look at some of the benchmarks
where previous models were like, oh,
we could do it 6% of the time.
And now our model would do it 14% or 22% of the time.
And yeah, we want to get up to the human level
reliability of 80%, 90% just like anywhere else.
We're on the same curve that we were on with SWE-bench,
where I think I would guess a year from now,
the models can do this very, very reliably,
but you gotta start somewhere.
So you think it's possible to get to the human level,
90% basically doing the same thing you're doing now,
or is it has to be special for computer use?
I mean, it depends what you mean by special
and special in general,
but I generally think
the same kinds of techniques that we've been using
to train the current model, I expect that doubling down
on those techniques in the same way that we have for code,
for models in general, for image input,
for voice, I expect those same techniques will scale here
as they have everywhere else.
But this is giving sort of the power of action to Claude.
And so you could do a lot of really powerful things, but you could do a lot of damage also.
Yeah.
Yeah.
No, and we've been very aware of that.
Look, my view actually is computer use isn't a fundamentally new capability like the CBRN
or autonomy capabilities are, it's more like it kind of opens the aperture
for the model to use and apply its existing abilities.
And so the way we think about it, going back to our RSP,
is nothing that this model is doing inherently increases,
you know, the risk from an RSP perspective,
but as the models get more powerful, having this capability may make it scarier once it
has the cognitive capability to do something at the ASL3 and ASL4 level.
This may be the thing that kind of unbounds it from doing so.
So going forward, certainly this modality of interaction
is something we have tested for
and that we will continue to test for in RRSP going forward.
I think it's probably better to have,
to learn and explore this capability
before the model is super, you know, super capable.
Yeah, there's a lot of interesting attacks
like prompt injection,
because now you've widened the aperture
so you can prompt inject through stuff on screen. So if this becomes more and more useful, then there's more and more
benefit to inject stuff into the model. If it goes to a certain web page, it could be harmless stuff
like advertisements, or it could be harmful stuff. Right? Yeah. I mean, we've thought a lot about
things like spam, CAPTCHAs, mass campaigns. There's, like, one secret I'll tell you: if you've invented a new technology, not necessarily the biggest misuse, but the first misuse you'll see is scams, just petty scams. People scamming each other. It's this thing as old as time. And it's just, every time, you gotta deal with it. It's almost silly to say, but it's true.
Sort of bots and spam in general is the thing.
As it gets more and more intelligent, it's harder to fight.
Like I said, there are a lot of petty criminals in the world.
And it's like every new technology is like a new way for petty criminals to do something, you know,
something stupid and malicious.
Is there any ideas about sandboxing it? Like, how
difficult is the sandboxing task?
Yeah, we sandbox during training. So for example,
during training, we didn't expose the model to the
internet. I think that's probably a bad idea during
training because, you know, the model can be changing its policy.
It can be changing what it's doing and it's having an effect in the real world.
You know, in, in terms of actually deploying the model, right.
It kind of depends on the application. Like, you know,
sometimes you want the model to do something in the real world,
but of course you can always put guardrails on the outside, right? You can say, okay, well, this model's not gonna move any files from my computer or my web server to anywhere else.
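For the kind of outside-the-model guardrail being described, here is a minimal sketch that blocks file-moving shell actions before they reach the operating system; the action schema and the blocklist are assumptions for illustration, not a real product feature.

```python
# Minimal sketch of an external guardrail: inspect each proposed action and
# refuse anything that looks like it moves files off the machine, regardless of
# what the model asked for. Action schema and blocklist are illustrative only.

BLOCKED_BINARIES = {"mv", "scp", "rsync", "sftp"}

def is_allowed(action: dict) -> bool:
    """Crude check: reject shell actions whose binary is on the blocklist."""
    if action.get("type") != "shell":
        return True
    tokens = action.get("command", "").split()
    return not (tokens and tokens[0] in BLOCKED_BINARIES)

def guarded_perform(perform, action: dict) -> None:
    """Run the action only if the guardrail allows it."""
    if is_allowed(action):
        perform(action)
    else:
        print(f"blocked by guardrail: {action.get('command', action)}")
```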
Now, when you talk about sandboxing, again, when we get to ASL four, none of these precautions are going to make sense there, right? With ASL four, there's a theoretical worry the model could be smart enough to break out of any box.
And so there, we need to think about mechanistic
interpretability, about, you know,
if we're going to have a sandbox,
it would need to be a mathematically provable sandbox.
You know, that's a whole different world
than what we're dealing with with the models today.
Yeah, the science of building a box
from which an ASL4 AI system cannot escape.
I think it's probably not the right approach.
I think the right approach,
instead of having something, you know, unaligned
that like you're trying to prevent it from escaping,
I think it's better to just design the model the right way
or have a loop where you look inside the model
and you're able to verify properties
and that gives you an opportunity to like iterate
and actually get it right.
I think containing bad models is a much worse solution
than having good models.
Let me ask about regulation.
What's the role of regulation in keeping AI safe?
So for example, can you describe California AI regulation
bill SB 1047 that was ultimately vetoed by the governor?
What are the pros and cons of this bill?
Yes, we ended up making some suggestions to the bill.
And then some of those were adopted.
And we felt, I think, quite positively about the bill by the end of that.
It did still have some downsides.
And of course, it got vetoed.
I think at a high level, I think some of the key ideas
behind the bill are, I would say,
similar to ideas behind our RSPs.
And I think it's very important that some jurisdiction, whether it's California or the
federal government and or other countries and other states, passes some regulation like
this.
And I can talk through why I think that's so important.
So I feel good about our RSP.
It's not perfect.
It needs to be iterated on a lot.
But it's been a good forcing function for getting the company to take these risks seriously, to put them into product planning, to really make them a central part of work at Anthropic, and to make sure that all the thousand people, and it's almost a thousand people now at Anthropic, understand that this is one of the highest priorities of the company, if not the highest priority. But one, there are still some companies that don't have RSP-like mechanisms. Like, OpenAI and Google did adopt these mechanisms a couple of months after Anthropic did, but there are other companies out there that don't have these mechanisms at all.
a situation where some of these dangers have the property that it doesn't matter if three
out of five of the companies are being safe.
If the other two are being unsafe, it creates this negative externality.
And I think the lack of uniformity is not fair to those of us who have put a lot of effort
into being very thoughtful about these procedures.
The second thing is, I don't think you can trust these companies to adhere to these voluntary plans on their own. I like to think that Anthropic will; we do everything we can to make sure that we will.
Our RSP is checked by our long-term benefit trust.
We do everything we can to adhere to our own RSP.
But you hear lots of things about various companies saying, oh, they said they would
give this much compute and they didn't.
They said they would do this thing and they didn't. I don't think it makes sense to litigate particular things
that companies have done,
but I think this broad principle that like,
if there's nothing watching over them,
there's nothing watching over us as an industry,
there's no guarantee that we'll do the right thing
and the stakes are very high.
And so I think it's important to have a uniform standard
that everyone follows and to make sure that simply
that the industry does what a majority of the industry
has already said is important and has already said
that they definitely will do.
Right, some people, I think there's a class of people
who are against regulation on principle.
I understand where
that comes from. If you go to Europe and you see something like GDPR, you see some of the other
stuff that they've done, some of it's good, but some of it is really unnecessarily burdensome.
And I think it's fair to say it really has slowed innovation. And so I understand where people are coming from on priors, I understand why people start from that position. But again,
I think AI is different. If we go to the very serious risks of autonomy and misuse that I talked about just a few minutes ago, I think that those are unusual and they warrant an unusually strong response.
And so I think it's very important.
Again, we need something that everyone can get behind.
I think one of the issues with SB 1047,
especially the original version of it,
was it had a bunch of the structure
of RSPs, but it also had a bunch of stuff
that was either clunky or that just would have created
a bunch of burdens, a bunch of hassle,
and might even have missed the target
in terms of addressing the risks.
You don't really hear about it on Twitter, you just hear people cheering for any regulation.
And then the folks who are against
make up these often quite intellectually dishonest arguments
about how, you know, it'll make us move away
from California.
The bill doesn't apply based on whether you're headquartered in California; the bill applies if you do business in California.
or that it would damage the open source ecosystem,
or that it would cause all of these things.
I think those were mostly nonsense,
but there are better arguments against regulation.
There's one guy, Dean Ball, who's really, I think, a very scholarly analyst, who looks at what happens when regulation is put in place, the ways it can kind of get a life of its own, or how it can be poorly designed.
And so our interest has always been,
we do think there should be regulation in this space,
but we wanna be an actor who makes sure
that that regulation is something that's surgical,
that's targeted at the serious risks
and is something people can actually comply with.
Because something I think the advocates of regulation
don't understand as well as they could
is if we get something in place
that's poorly targeted,
that wastes a bunch of people's time,
what's gonna happen is people are gonna say,
see these safety risks, this is nonsense.
I just had to hire 10 lawyers
to fill out all these forms.
I had to run all these tests
for something that was clearly not dangerous.
And after six months of that, there'll be a groundswell and we'll end up with a durable consensus against regulation. And so I think the worst enemy of those who want real accountability is badly designed regulation.
We need to actually get it right.
If there's one thing I could say to the advocates, it would be that I want them to understand
this dynamic better.
We need to be really careful and we need to talk to people who actually have experience seeing how regulations play out in practice.
And the people who have seen that understand to be very careful.
If this was some lesser issue, I might be against regulation at all.
But what I want the opponents to understand is that the underlying issues are actually
serious.
They're not something that I or the other companies are just making up because of regulatory
capture.
They're not sci-fi fantasies.
They're not any of these things.
Every time we have a new model, every few months, we measure the behavior of these models
and they're getting better and better
at these concerning tasks, just as they are getting better
and better at good, valuable, economically useful tasks.
And so, I think SB 1047 was very polarizing. I would just love it if some of the most reasonable opponents and some of the most reasonable proponents would sit down together. And I think of the different AI companies, Anthropic was the only AI company that felt positively in a very detailed way.
I think Elon tweeted briefly something positive,
but some of the big ones like Google, OpenAI, Meta,
Microsoft were pretty staunchly against.
So what I would really like is if some
of the key stakeholders, some of the most thoughtful
proponents and some of the most thoughtful opponents
would sit down and say, how do we solve this problem in a way that the proponents feel brings a real reduction
in risk and that the opponents feel that it is not hampering the industry or hampering innovation any more than it needs to. And I think, for whatever reason, things got too polarized
and those two groups didn't get to sit down
in the way that they should.
And I feel urgency.
I really think we need to do something in 2025.
If we get to the end of 2025
and we've still done nothing about this,
then I'm gonna be worried.
I'm not worried yet because again,
the risks aren't here yet,
but I think time is running short.
Yeah, and come up with something surgical, like you said.
Yeah, yeah, yeah, exactly.
And we need to get away from this
intense pro safety versus intense anti-regulatory rhetoric.
It's turned into these flame wars on Twitter
and nothing good's gonna come of that.
So there's a lot of curiosity
about the different players in the game.
One of the OGs is OpenAI.
You've had several years of experience at OpenAI.
What's your story and history there?
Yeah, so I was at OpenAI for roughly five years.
For the last, I think it was a couple of years, I was vice president of research there.
Probably myself and Ilya Sutskever were the ones who really kind of set the research direction
around 2016 or 2017.
I first started to really believe in, or at least confirm my belief in, the scaling hypothesis when Ilya famously said to me, the thing you need to understand about these models is they just want to learn. The models just want to learn. And again, sometimes there are these one sentences, these Zen koans, that you hear and you're like, ah, that explains everything, that explains like a thousand things that I've seen. And then, ever after, I had this visualization
in my head of like, you optimize the models
in the right way, you point the models in the right way.
They just want to learn.
They just want to solve the problem,
regardless of what the problem is.
So get out of their way, basically.
Get out of their way, yeah.
Don't impose your own ideas about how they should learn.
And you know, this was the same thing as Rich Sutton put out in the Bitter Lesson, or Gwern put out in the scaling hypothesis. I think generally the dynamic was, you know, I got this kind of inspiration from Ilya and from others, folks like Alec Radford, who did the original GPT-1, and then ran really
hard with it, me and my collaborators on GPT-2, GPT-3, RL from Human
Feedback, which was an attempt to kind of deal with the early safety and durability,
things like debate and amplification, heavy on interpretability. So again, the combination of
safety plus scaling, probably 2018, 2019, 2020, those were kind of the years when myself
and my collaborators, probably, you know,
many of whom became co-founders of Anthropic
kind of really had a vision and like drove the direction.
Why'd you leave?
Why'd you decide to leave?
Yeah, so look, I'm gonna put things this way
and I think it ties to the race to the top,
which is, in my time at OpenAI, I'd come to appreciate the scaling hypothesis, and I'd come to appreciate the importance of safety along with the scaling hypothesis.
The first one, I think OpenAI was getting on board with.
The second one, in a way, had always been part of OpenAI's messaging.
But over many years of the time that I spent there,
I think I had a particular vision
of how we should handle these things,
how we should be brought out in the world,
the kind of principles that the organization should have.
And look, I mean, there were like many, many discussions about, you know, should the org do this, should the company do that. Like, there's a bunch of misinformation out there. People say we left because we didn't like the deal with Microsoft. False. Although, you know, there was a lot of discussion, a lot of questions about exactly how we do the deal with Microsoft. Or that we left because we didn't like commercialization.
That's not true.
We built GPT-3, which was the model that was commercialized.
I was involved in commercialization.
It's more, again, about how do you do it?
Like, civilization is going down this path
to very powerful AI.
What's the way to do it that is cautious,
straightforward, honest,
that builds trust in the organization and in individuals.
How do we get from here to there?
And how do we have a real vision for how to get it right?
How can safety not just be something we say
because it helps with recruiting?
And I think at the end of the day,
if you have a vision for that,
forget about anyone else's vision.
I don't want to talk about anyone else's vision.
If you have a vision for how to do it,
you should go off and you should do that vision.
It is incredibly unproductive to try
and argue with someone else's vision.
You might think they're not doing it the right way.
You might think they're dishonest.
Who knows?
Maybe you're right, maybe you're not.
But what you should do is you should take some people
you trust and you should go off together
and you should make your vision happen.
And if your vision is compelling,
if you can make it appeal to people,
some combination of ethically in the market,
if you can make a company that's a place people want to join, that engages in
practices that people think are reasonable while managing to maintain its position in
the ecosystem at the same time, if you do that, people will copy it.
And the fact that you were doing it, especially the fact that you're doing it better than
they are, causes them to change their behavior in a much more compelling way than if they're your boss and you're arguing with
them.
I don't know how to be any more specific about it than that, but I think it's generally very
unproductive to try and get someone else's vision to look like your vision.
It's much more productive to go off and do a clean experiment and say, this is our vision,
this is how we're gonna do things.
Your choice is you can ignore us,
you can reject what we're doing,
or you can start to become more like us.
And imitation is the sincerest form of flattery.
And that plays out in the behavior of customers, that plays out in the behavior of the public, that plays out in the behavior of where people choose to work. Again, at the end, it's not about one
company winning or another company winning if we or another company are engaging in some practice
that people find genuinely appealing. I want it to be in substance, not just in appearance.
And I think researchers are sophisticated
and they look at substance.
And then other companies start copying that practice
and they win because they copied that practice.
That's great, that's success.
That's like the race to the top.
It doesn't matter who wins in the end,
as long as everyone is copying
everyone else's good practices, right? One way I think of it is like, the thing we're all afraid of is the race
to the bottom, right? And the race to the bottom doesn't matter who wins because we all lose,
right? Like, you know, in the most extreme world, we make this autonomous AI that, you know, the
robots enslave us or whatever, right? I mean, that's half joking, but you know, that is the
most extreme thing that could happen.
Then it doesn't matter which company was ahead.
If instead you create a race to the top where people are competing to engage in good practices,
then at the end of the day, it doesn't matter who ends up winning.
It doesn't even matter who started the race to the top.
The point isn't to be virtuous. The point is to get the system into a better equilibrium than it was before.
Individual companies can play some role in doing this. Individual companies can help to start it,
can help to accelerate it. Frankly, I think individuals at other companies have done this
as well. The individuals that, when we put out an RSP, react by pushing harder to get something similar done at other companies.
Sometimes other companies do something that's like,
we're like, oh, it's a good practice.
We think that's good.
We should adopt it too.
The only difference is, you know,
I think we try to be more forward leaning.
We try and adopt more of these practices first
and adopt them more quickly when others invent them.
But I think this dynamic is what we should be pointing at.
And that I think it abstracts away the question of,
which company's winning, who trusts who.
I think all these questions of drama
are profoundly uninteresting.
And the thing that matters is the ecosystem
that we all operate in and how to make that ecosystem better
because that constrains all the players.
And so Anthropic is this kind of clean experiment built on a foundation of, like, what concretely AI safety should look like.
Look, I'm sure we've made plenty of mistakes along the way.
The perfect organization doesn't exist.
It has to deal with the imperfection of a thousand employees.
It has to deal with the imperfection of our leaders,
including me.
It has to deal with the imperfection of the people
we've put to oversee the imperfection of the leaders,
like the board and the long-term benefit trust.
It's all a set of imperfect people trying to aim imperfectly at some ideal
that will never perfectly be achieved. That's what you sign up for. That's what it will
always be. But imperfect doesn't mean you just give up. There's better and there's worse.
And hopefully, we can do well enough that we can begin to build some practices that
the whole industry engages in.
And then, you know, my guess is that multiple of these companies will be successful.
Anthropic will be successful.
These other companies, like ones I've been at the past, will also be successful.
And some will be more successful than others.
That's less important than, again, that we align the incentives of the industry.
And that happens partly through the race to the top, partly through things like RSP,
partly through, again, selected surgical regulation.
You said talent density beats talent mass.
So can you explain that?
Can you expand on that?
Can you just talk about what it takes
to build a great team of AI researchers and engineers?
This is one of these statements that's like more true every month.
Every month I see this statement is more true than I did the month before.
So if I were to do a thought experiment, let's say you have a team of 100 people that are
super smart, motivated and aligned with the mission and that's your company.
Or you can have a team of a thousand people where 200 people are super smart, super aligned
with the mission, and then 800 people are, let's just say you pick 800 random big tech
employees.
Which would you rather have?
The talent mass is greater in the group of a thousand people.
You have even a larger number of incredibly talented,
incredibly aligned, incredibly smart people.
But the issue is just that if every time
someone's super talented looks around,
they see someone else super talented and super dedicated,
that sets the tone for everything, right?
That sets the tone for everyone is super inspired
to work at the same place.
Everyone trusts everyone else.
If you have a thousand or 10,000 people
and things have really regressed, right?
You are not able to do selection
and you're choosing random people.
What happens is then you need to put a lot of processes
and a lot of guardrails in place
just because people don't fully trust each other, you have to adjudicate
political battles. There are so many things that slow down the org's ability to operate.
And so we're nearly a thousand people and we've tried to make it so that as large a
fraction of those thousand people as possible are like super talented, super skilled. It's
one of the reasons we've slowed down hiring a lot
in the last few months.
We grew from 300 to 800, I believe, I think,
in the first seven, eight months of the year.
And now we've slowed down.
We're at like, you know, the last three months,
we went from 800 to 900, 950, something like that.
Don't quote me on the exact numbers,
but I think there's an inflection point around 1,000,
and we wanna be much more careful how we grow.
Early on and now as well,
we've hired a lot of physicists.
Theoretical physicists can learn things really fast.
Even more recently, as we've continued to hire that,
we've really had a high bar on both the research side
and the software engineering
side have hired a lot of senior people, including folks who used to be at other companies in
this space.
And we've just continued to be very selective.
It's very easy to go from 100 to 1,000 and 1,000 to 10,000 without paying attention to
making sure everyone has a unified purpose.
It's so powerful.
If your company consists of a lot of different fiefdoms
that all want to do their own thing,
that are all optimizing for their own thing,
it's very hard to get anything done.
But if everyone sees the broader purpose of the company,
if there's trust and there's dedication
to doing the right thing, that is a superpower.
That in itself, I think, can overcome
almost every other disadvantage.
And, you know, it's the Steve Jobs thing: A-players. A-players wanna look around and see other A-players, is another way of saying it.
I don't know what that is about human nature,
but it is demotivating to see people
who are not obsessively driving towards a singular mission.
And it is, on the flip side of that,
super motivating to see that.
It's interesting.
What's it take to be a great AI researcher or engineer
from everything you've seen,
from working with so many amazing people?
Yeah.
I think the number one quality,
especially on the research side, but really both,
is open-mindedness.
Sounds easy to be open-minded, right?
You're just like, oh, I'm open to anything.
But if I think about my own early history
in the scaling hypothesis,
I was seeing the same data others were seeing.
I don't think I was like a better programmer
or better at coming up with research ideas
than any of the hundreds of people that I worked with.
In some ways, I was worse.
I've never been great at, like, precise programming, finding the bug, writing the GPU kernels.
I could point you to a hundred people here
who are better at that than I am.
But the thing that I think I did have that was different
was that I was just willing to look at something
with new eyes, right?
People said, oh, you know,
we don't have the right algorithms yet.
We haven't come up with the right way to do things.
And I was just like, oh, I don't know.
Like, you know, this neural net has like 30 million parameters.
Like, what if we gave it 50 million instead?
Like let's plot some graphs,
like that basic scientific mindset of like,
oh man, like I just like,
I see some variable that I could change.
Like what happens when it changes?
Like let's try these different things and like create a graph.
And even this was like the simplest thing in the world, right? Change the number of parameters. This wasn't like PhD-level experimental design. This was simple and stupid. Like, anyone could have done this if you just told them that it was important.
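In the spirit of the "change one variable and make a graph" experiment being described, here is a minimal sketch; the parameter counts and loss values are invented placeholders, not results from any real run.

```python
# Minimal sketch of the "vary one thing and plot it" experiment: record the loss
# for a few model sizes and look at the curve. Numbers are invented placeholders.
import matplotlib.pyplot as plt

params = [30e6, 50e6, 100e6, 300e6]   # hypothetical model sizes, 30M to 300M parameters
losses = [3.9, 3.7, 3.5, 3.2]         # hypothetical validation losses

plt.plot(params, losses, marker="o")
plt.xscale("log")
plt.xlabel("parameters")
plt.ylabel("validation loss")
plt.title("loss vs. model size (illustrative)")
plt.show()
```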
It's also not hard to understand.
You didn't need to be brilliant to come up with this.
But you put the two things together and you know, some tiny number of people, some single digit number
of people have driven forward the whole field by realizing this.
It's often like that.
If you look back at the discoveries in history, they're often like that.
So this open-mindedness and this willingness to see with new eyes that often comes from
being newer to the field, often experience is a disadvantage for this.
That is the most important thing.
It's very hard to look for and test for,
but I think it's the most important thing
because when you find something,
some really new way of thinking about things,
when you have the initiative to do that,
it's absolutely transformative.
And also be able to do kind of rapid experimentation, and in the face of that, be open-minded and curious, and look at the data with just these fresh eyes and see what it's actually saying.
That applies in mechanistic interpretability. It's another example of this. Like, some of the early work in mechanistic interpretability was so simple, it's just no one thought to care about this question before.
You said what it takes to be a great AI researcher. Can we rewind the clock back?
What advice would you give to people interested in AI?
They're young, looking forward to how can I make an impact on the world?
I think my number one piece of advice is to just start playing with the models.
This was actually, I worry a little that this seems like obvious advice now. I think three years ago it wasn't obvious, and people started by, oh, let me read the latest reinforcement learning paper. And I mean, you should do that as well.
But now, you know, with wider availability of models
and APIs, people are doing this more.
But I think, I think just experiential knowledge.
These models are new artifacts that no one really understands.
And so getting experience playing with them.
I would also say again, in line with the like, do something new,
think in some new direction, like there are all these things that haven't been explored.
Like for example, mechanistic interpretability is still very new.
It's probably better to work on that than it is to work on new model architectures, because it's more popular than it was before, there are probably like a hundred people working on it, but there aren't like 10,000 people working on it. And it's just this fertile area for study. There's so much low-hanging fruit, you can just walk by and pick things. And, for whatever reason, people aren't interested in it enough.
I think there are some things around
long horizon learning and long horizon tasks
where there's a lot to be done.
I think evaluations are still,
we're still very early in our ability to study evaluations, particularly for
dynamic systems acting in the world. I think there's some
stuff around multi-agent. Skate where the puck is going is my advice. And you don't have to be brilliant to think of it. Like, all the things that are going to be exciting in five years, people even mention them as, like, conventional wisdom, but it's just, somehow there's this barrier that people don't double down as much as they could, or they're afraid to do something that's not the popular thing. I don't know why it happens, but getting over that barrier, that's my number one piece of advice.
Let's talk if we could a bit about post training. Yeah, so it
seems that the modern post-training recipe
has a little bit of everything.
So supervised fine-tuning, RLHF,
the constitutional AI with RLAIF.
Best acronym.
It's again that naming thing.
A lot.
And then synthetic data,
seems like a lot of synthetic data,
or at least trying to figure out ways
to have high quality synthetic data.
So what's the, if this is a secret sauce that makes Anthropic Claude so incredible,
how much of the magic is in the pre-training,
how much is in the post-training?
Yeah, I mean, so first of all, we're not perfectly able to measure that ourselves. When you see some great character ability, sometimes it's hard to tell whether it came from pre-training or post-training.
We've developed ways to try and distinguish between those two, but they're not perfect.
The second thing I would say is when there is an advantage, and I think we've been pretty
good in general at RL, perhaps the best, although I don't know because I don't see what goes
on inside other companies.
Usually it isn't, oh my God, we have this secret magic method that others don't have.
Usually it's like, well, we got better at the infrastructure so we could run it for
longer or we were able to get higher quality data or we were able to filter our data better
or we were able to, you know, combine these methods in practice. It's usually some boring matter of kind of
practice and tradecraft. So, you know, when I think about how to do something special in terms of how
we train these models, both pre-training but even more so post-training, you know, I really think of
it a little more, again, as like designing airplanes
or cars. Like, you know, it's not just like, oh man, I have the blueprint. Like, maybe that makes
you make the next airplane. But like, there's some, there's some cultural trade craft of how we think
about the design process that I think is more important than, you know, than any particular
gizmo we're able to invent. Okay, well, let me ask you about specific techniques.
So first on RLHF, what do you think,
just zooming out intuition, almost philosophy,
why do you think RLHF works so well?
If I go back to like the scaling hypothesis,
one of the ways to state the scaling hypothesis
is if you train for X and you throw enough compute at it,
then you get X.
And so RLHF is good at doing what humans want the model
to do, or at least to state it more precisely,
doing what humans who look at the model
for a brief period of time
and consider different possible responses,
what they prefer as the response,
which is not perfect from both the safety
and capabilities perspective,
in that humans are often not able to perfectly identify what the model wants, and what humans want in the moment may not be what they want in the long term.
So there's a lot of subtlety there,
but the models are good at producing what the humans
in some shallow sense want.
And it actually turns out that you don't even have to throw that much compute at it because of another thing, which is this thing about a strong pre-trained model being halfway to anywhere. So once you have the pre-trained model, you have all the representations you need to get the model where you want it to go.
So do you think RLHF makes the model smarter or just appear smarter to the humans?
I don't think it makes the model smarter.
I don't think it just makes the model appear smarter.
It's like RLHF like bridges the gap
between the human and the model, right?
I could have something really smart
that like can't communicate at all, right?
We all know people like this.
People who are really smart,
but you know, you can't understand what they're saying. So I think RLHF just bridges that gap. I think it's
not the only kind of RL we do. It's not the only kind of RL that will happen in the future.
I think RL has the potential to make models smarter, to make them reason better, to make them
operate better, to make them develop new skills even.
And perhaps that could be done, you know,
even in some cases with human feedback,
but the kind of RLHF we do today mostly doesn't do that yet,
although we're very quickly starting to be able to.
But it appears to sort of increase,
if you look at the metric of helpfulness,
it increases that.
It also increases, what was this word in Leopold's essay,
un-hobbling, where basically the models are hobbled
and then you do various trainings to them
to un-hobble them.
So I like that word, because it's like a rare word.
But so I think RLHF un-hobbles the models in some ways.
And then there are other ways where a model hasn't yet
been un-hobbled and needs to be un-hobbled.
If you can say, in terms of cost, is pre-training the most expensive thing, or does post-training
creep up to that?
At the present moment, it is still the case that pre-training is the majority of the cost.
I don't know what to expect in the future, but I could certainly anticipate a future
where post-training is the majority of the cost.
In that future you anticipate, would it be the humans or the AI
that's the costly thing for the post-training?
I don't think you can scale up humans enough
to get high quality.
Any kind of method that relies on humans
and uses a large amount of compute,
it's gonna have to rely on some scaled supervision method
like debate or iterated amplification
or something like that.
So on that super interesting set of ideas
around constitutional AI, can you describe what it is
as first detailed in the December 2022 paper
and beyond that, what is it?
Yes, so this was from two years ago.
The basic idea is, so we described what RLHF is: you have a model and, you know, you just sample from it twice, it spits out two possible responses, and you're like, human, which response do you like better? Or another variant of it is, rate this response on a scale of one to seven. So that's hard because you need to scale up human interaction, and it's very implicit, right?
I don't have a sense of what I want the model to do.
I just have a sense of what this average of a thousand humans wants the model to do.
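As a reference point for readers, the pairwise setup being described here is usually formalized in the RLHF literature with a Bradley-Terry-style preference loss on a learned reward model; this is the generic textbook form, not necessarily Anthropic's exact recipe, and the symbols below are just the conventional ones.

```latex
% x: prompt, y_w: the response the human preferred, y_l: the other response,
% r_\theta: a learned reward model, \sigma: the logistic sigmoid.
P(y_w \succ y_l \mid x) = \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)
\qquad
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
```

The policy is then fine-tuned with RL to score well under r_\theta, typically with a penalty that keeps it close to the pre-trained model.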
So, two ideas.
One is, could the AI system itself decide which response is better, right?
Could you show the AI system these two responses and ask which response is better? And then second, well, what criterion should the AI use? And
so then there's this idea, could you have a single document, a constitution if you will,
that says these are the principles the model should be using to respond? And the AI system
reads those principles as well as reading the environment and the response.
And it says, well, how good did the AI model do?
It's basically a form of self-play.
You're kind of training the model against itself.
And so the AI gives the response and then you feed that back into what's called the
preference model, which in turn feeds the model to make it better.
So you have this triangle of like the AI,
the preference model and the improvement of the AI itself.
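For readers who want the mechanics spelled out, here is a minimal sketch of the loop just described: sample two responses, have the AI judge them against a written constitution, and collect the resulting preference data. Every name in it (model.generate, ai_judge, and the example principles) is a hypothetical placeholder standing in for real training infrastructure, so treat it as an illustration of the idea rather than Anthropic's actual pipeline.

```python
# Illustrative sketch of an RLAIF-style labeling loop (constitutional AI).
# The `model` object and its .generate() method are hypothetical placeholders.

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that avoids enabling serious harm.",
]

def ai_judge(model, prompt, response_a, response_b, principles):
    """Ask the model itself which response better follows the principles."""
    critique = (
        "Principles:\n" + "\n".join(principles)
        + f"\n\nPrompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}\n"
        + "Which response better follows the principles? Answer with A or B."
    )
    return model.generate(critique).strip()  # expected to return "A" or "B"

def collect_ai_preferences(model, prompts):
    """Build a preference dataset labeled by the AI instead of by humans."""
    data = []
    for prompt in prompts:
        a = model.generate(prompt)  # sample the same model twice
        b = model.generate(prompt)
        winner = ai_judge(model, prompt, a, b, CONSTITUTION)
        chosen, rejected = (a, b) if winner == "A" else (b, a)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data

# The resulting data trains a preference (reward) model, and the policy is then
# improved against that preference model (the "triangle" described above).
```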
And we should say that in the constitution,
the set of principles are like human interpretable.
They're like-
Yeah, yeah, it's something both the human
and the AI system can read.
So it has this nice kind of translatability or symmetry.
You know, in practice, we both use a model constitution
and we use RLHF and we use some of
these other methods. So it's turned into one tool in a toolkit that both reduces the need for RLHF
and increases the value we get from using each data point of RLHF. It also interacts in interesting
ways with future reasoning type RL methods.
So it's one tool in the toolkit, but I think it is a very important tool.
Well, it's a compelling one to us humans, you know, thinking about the founding fathers
and the founding of the United States.
The natural question is who and how do you think it gets to define the constitution,
the set of principles in the
constitution?
Yeah.
So I'll give like a practical answer and a more abstract answer.
I think the practical answer is like, look, in practice, models get used by all kinds
of different like customers, right?
And so you can have this idea where, you know, the model can have specialized rules or principles.
You know, we fine tune versions of models implicitly.
We've talked about doing it explicitly,
having special principles that people
can build into the models.
So from a practical perspective,
the answer can be very different from different people.
Customer service agent behaves very differently
from a lawyer and obeys different principles.
But I think at the base of it,
there are specific
principles that models have to obey. I think a lot of them are things that people would agree with.
Everyone agrees that we don't want models to present these CBRN risks. I think we can go a
little further and agree with some basic principles of democracy and the rule of law. Beyond that, it gets very uncertain.
And there our goal is generally for the models
to be more neutral, to not espouse
a particular point of view and more just be kind of like
wise agents or advisors that will help you
think things through and will present possible considerations
but don't express strong or specific opinions.
OpenAI released a model spec
where it kind of clearly, concretely defines
some of the goals of the model and specific examples,
like A, B, how the model should behave.
Do you find that interesting?
By the way, I should mention,
I believe the brilliant John Schulman was a part of
that.
He's now at Anthropic.
Do you think this is a useful direction?
Might Anthropic release a model spec as well?
Yeah.
So I think that's a pretty useful direction.
Again, it has a lot in common with constitutional AI.
So again, another example of like a race to the top, right?
We have something that's like, we think, you know, a better and more responsible way of
doing things.
It's also a competitive advantage. Then others discover that it has advantages and then start
to do that thing. We then no longer have the competitive advantage, but it's good from the
perspective that now everyone has adopted a positive practice that others were not adopting.
And so our response to that is, well,
looks like we need a new competitive advantage
in order to keep driving this race upwards.
So that's how I generally feel about that.
I also think every implementation
of these things is different.
So there were some things in the model spec
that were not in constitutional AI.
And so we can always adopt those things
or at least learn from them.
So again, I think this is an example
of like the positive dynamic
that I think we should all want the field to have.
Let's talk about the incredible essay,
Machines of Loving Grace.
I recommend everybody read it.
It's a long one.
It is rather long.
Yeah, it's really refreshing to read concrete ideas
about what a positive future looks like.
And you took sort of a bold stance
because like it's very possible that you might be wrong
on the dates or the specific implications.
Oh, yeah, I'm fully expecting to,
you know, to definitely be wrong about all the details.
I might be just spectacularly wrong about the whole thing
and people will laugh at me for years.
That's just how the future works.
So you provided a bunch of concrete positive impacts of AI and how exactly a super intelligent AI
might accelerate the rate of breakthroughs
in for example, biology and chemistry
that would then lead to things like we cure most cancers,
prevent all infectious disease,
double the human lifespan and so on.
So let's talk about this essay first.
Can you give a high-level vision of this essay and what the key takeaways are for people?
Yeah, I have spent a lot of time, and Anthropic has spent a lot of effort, on like, how do we address the risks of AI, right? How do we think about those risks? Like, we're trying to do a race to the top. That requires us to build all these capabilities, and the capabilities are cool, but a big part of what we're trying to do is address the risks. And the justification for that is like, well, all these positive things, the market is this very healthy organism, right? It's going to produce all the positive things. The risks, I don't know, we might mitigate them, we might not.
And so we can have more impact by trying to mitigate the risks.
But I noticed that one flaw in that way of thinking, and it's not a change in how seriously
I take the risks, it's maybe a change in how I talk about them, is that no matter how logical or rational that
line of reasoning that I just gave might be, if you only talk about risks, your brain only
thinks about risks.
And so I think it's actually very important to understand what if things do go well?
And the whole reason we're trying to prevent these risks is not because we're afraid of
technology, not because we want to slow it down.
It's because if we can get to the other side of these risks, right, if we can run the gauntlet
successfully to put it in stark terms, then on the other side of the gauntlet are all
these great things.
And these things are worth fighting for.
And these things can really inspire people. And I think I imagine because, look, you have
all these investors, all these VCs, all these AI companies talking about all the positive
benefits of AI. But as you point out, it's weird. There's actually a dearth of really
getting specific about it. There's a lot of random people on Twitter
posting these gleaming cities
and this vibe of grind, accelerate harder,
kick out the decel.
It's just this very aggressive ideological,
but then you're like, what are you actually excited about?
And so I figured that, you know,
I think it would be interesting and valuable
for someone who's actually coming from the risk side
to try and really make a try
at explaining what the benefits are,
both because I think it's something we can all get behind,
and I want people to understand,
I want them to really understand that this isn't,
this isn't doomers versus accelerationists.
This is that if you have a true understanding
of where things are going with AI,
and maybe that's the more important axis,
AI is moving fast versus AI is not moving fast,
then you really appreciate the benefits, and you really want humanity, our civilization, to
seize those benefits, but you also get very serious about
anything that could derail them.
So I think the starting point is to talk about what this
powerful AI, which is the term you like to use.
Most of the world uses AGI, but you don't like the term
because it basically has too much baggage; it's become meaningless.
It's like we're stuck with the terms,
whether we like it or not.
Maybe we're stuck with the terms
and my efforts to change them are futile.
It's admirable.
I'll tell you what else I don't,
this is like a pointless semantic point,
but I keep talking about it in public.
Back to naming again.
So I'm just gonna do it once more.
I think it's a little like, let's say it was like 1995,
and Moore's Law is making the computers faster.
And for some reason, there had been this verbal tic where everyone was like, well, someday we're going to have supercomputers. And supercomputers are going to be able to do all these things. Once we have supercomputers, we'll be able to sequence the genome. We'll be able to do other things. And so, like, one, it's true, the computers are getting faster. And as they get faster, they're going to be able to do all these great things. But there's no discrete point at which you had a supercomputer and previous computers were not. Like, supercomputer is a term we use, but it's a vague term to just describe computers that are faster than what we have today. There's no point at which you pass a threshold and you're like, oh my God, we're doing a totally new type of computation. And so I feel that way about AGI. Like, there's just a smooth exponential.
And like, if, if by AGI, you mean like, like AI is getting better and better and like gradually
it's going to do more and more of what humans do until it's going to be smarter than humans.
And then it's going to get smarter even from there,
then yes, I believe in AGI.
But if AGI is some discrete or separate thing,
which is the way people often talk about it,
then it's kind of a meaningless buzzword.
Yeah, to me, it's just sort of a platonic form
of a powerful AI, exactly how you define it.
I mean, you define it very nicely.
So on the intelligence axis,
it's just on pure intelligence,
it's smarter than a Nobel Prize winner, as you describe,
across most relevant disciplines.
So, okay, that's just intelligence.
So it's both in creativity and being able to generate
new ideas, all that kind of stuff,
in every discipline, Nobel Prize winner.
Okay, in their prime.
It can use every modality,
so this kind of self-explanatory,
but just operate across all the modalities of the world.
It can go off for many hours, days and weeks to do tasks
and do its own sort of detailed planning
and only ask you help when it's needed.
It can use, this is actually kind of interesting.
I think in the essay you said, I mean, again, it's a bet
that it's not gonna be embodied,
but it can control embodied tools.
So it can control tools, robots, laboratory equipment.
The resources used to train it can then be repurposed
to run millions of copies of it.
And each of those copies will be independent
and can do their own independent work.
So you can do the cloning of the intelligence system.
Yeah, yeah, I mean, you might imagine
from outside the field, like there's only one of these,
right, that like you made it, you've only made one.
But the truth is that like the scale up is very quick.
Like we do this today, we make a model
and then we deploy thousands,
maybe tens of thousands of instances of it.
I think by the time, you know, certainly within two to three years,
whether we have these super powerful AIs or not,
clusters are going to get to the size where you'll be able to deploy millions of these
and they'll be, you know, faster than humans.
And so if your picture is, oh, we'll have one and it'll take a while to make them.
My point there was no, actually, you have millions of them right away.
And in general, they can learn and act 10 to 100 times faster than humans.
So that's a really nice definition of powerful AI.
Okay, so that, but you also write that clearly such an entity would be capable of solving
very difficult problems very fast, but it is not trivial to figure out how fast.
Two extreme positions both seem false to me.
So the singularity is on the one extreme
and the opposite on the other extreme.
Can you describe each of the extremes?
Yeah.
And why?
So yeah, let's describe the extreme.
So like one extreme would be, well, look,
if we look at kind of evolutionary history,
like there was this big acceleration where, you know,
for hundreds of thousands of years,
we just had like, you know, single celled organisms,
and then we had mammals, and then we had apes,
and then that quickly turned to humans.
Humans quickly built industrial civilization.
And so this is gonna keep speeding up,
and there's no ceiling at the human level.
Once models get much, much smarter than humans,
they'll get really good at building the next models.
And you know, if you write down like a simple differential equation,
like this is an exponential.
And so what's gonna happen is that models
will build faster models, models will build faster models,
and those models will build nanobots
that can like take over the world
and produce much more energy than you could produce otherwise.
And so if you just kind of like solve
this abstract differential equation, then like five days after we build the first AI that's more powerful than humans,
then like the world will be filled with these AIs
and every possible technology that could be invented,
like will be invented.
I'm caricaturing this a little bit,
but I think that's one extreme.
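The "simple differential equation" being caricatured is, roughly, the standard intelligence-explosion argument: assume the rate of capability growth is proportional to current capability and you get an exponential. This is only an illustration of the argument being described, not a claim that the dynamics are real.

```latex
% C(t): AI capability at time t, k > 0 a feedback constant.
\frac{dC}{dt} = k\,C \quad\Longrightarrow\quad C(t) = C_0\, e^{k t}.
% With super-linear feedback, dC/dt = k C^{\,p} for p > 1, the solution even
% blows up in finite time, which is where the "singularity" framing comes from.
```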
And the reason that I think that's not the case is that one, I think they just neglect
the laws of physics.
It's only possible to do things so fast in the physical world.
Some of those loops go through producing faster hardware.
It takes a long time to produce faster hardware.
Things take a long time.
There's this issue of complexity.
I think no matter how smart you are, people talk about, oh, we can make models of biological systems that'll do everything those biological systems do. Look, I think computational modeling can do a lot. I did a lot of computational modeling when I worked in biology, but there are a lot of things that you can't predict. You know, they're complex enough that just iterating, just running the experiment, is gonna beat any modeling, no matter how smart the system doing the modeling is.
Or even if it's not interacting with the physical world, just the modeling is gonna be hard?
Yeah, I think, well, the modeling is gonna be hard, and getting the model to match the physical world is going to be... All right.
So it does have to interact with the physical world to verify.
But it's just, you just look at even the simplest problems.
I think I talk about the three-body problem or simple chaotic prediction or predicting
the economy.
It's really hard to predict the economy two years out.
Maybe the case is normal humans can predict what's going to happen in the economy over the next quarter, or they can't really do that. Maybe an AI system that's, you know, a zillion times smarter can only predict it out a year or something. Instead, you know, you have this kind of exponential increase in computer intelligence for a linear increase in ability to predict. Same with, again, like, you know, biological molecules, molecules interacting. You don't know what's gonna happen when you perturb a complex system. You can find simple parts in it, and if you're smarter, you're better at finding these simple parts. And then I think human institutions, human institutions are just really difficult. Like, you know, it's been hard to get people,
I won't give specific examples, but it's been hard to get people to adopt even the technologies
that we've developed, even ones where the case for their efficacy is very, very strong.
You know, people have concerns, they think things are conspiracy theories, it's been very difficult.
It's also been very difficult to get very simple things through the regulatory system.
I don't want to disparage anyone who works in regulatory systems of any technology.
There are hard trade-offs they have to deal with, they have to save lives, but the system as a whole,
I think makes some obvious trade-offs
that are very far from maximizing human welfare.
And so if we bring AI systems into this,
into these human systems,
often the level of intelligence may just not be the limiting factor.
It just may be that it takes a long time to do something. Now, if the AI system circumvented
all governments, if it just said, I'm dictator of the world and I'm going to do whatever,
some of these things it could do. Again, the things have to do with complexity. I still think
a lot of things would take a while. I don't think it helps that the AI systems can produce a lot of energy or go to the moon.
Some people in comments responded to the essay saying the AI system can produce a lot of
energy and smarter AI systems. That's missing the point. That kind of cycle doesn't solve
the key problems that I'm talking about here. So I think a bunch of people miss the point
there. But even if it were completely unaligned and could get around all these human obstacles,
it would have trouble. But again, if you want this to be an AI system that doesn't take over
the world, that doesn't destroy humanity, then basically it's going to need to follow basic human
laws. If we want to have an actually good world, like we're going to have to have an AI system
that interacts with humans, not one that kind of creates its own legal system or disregards
all the laws or all of that.
So as inefficient as these processes are, we're going to have to deal with them because
there needs to be some popular and democratic legitimacy in how these systems are rolled
out.
We can't have a small group of people
who are developing these systems say,
this is what's best for everyone, right?
I think it's wrong and I think in practice
it's not gonna work anyway.
So you put all those things together
and we're not gonna change the world
and upload everyone in five minutes.
I just, I don't think,
A, I don't think it's gonna happen and B, to the extent that it
could happen, it's not the way to lead to a good world. So that's on one side. On the other side,
there's another set of perspectives, which I have actually in some ways more sympathy for, which is,
look, we've seen big productivity increases before, right? You know, economists are familiar with studying
the productivity increases that came from
the computer revolution and internet revolution.
And generally, those productivity increases were underwhelming.
They were less than you might imagine.
There was a quote from Robert Solow,
you see the computer revolution everywhere
except the productivity statistics.
So why is this the case?
People point to the structure of firms, the structure of enterprises, how slow it's been
to roll out our existing technology to very poor parts of the world, which I talk about
in the essay.
How do we get these technologies to the poorest parts of the world that are behind on cell
phone technology, computers, medicine,
let alone new-fangled AI that hasn't been invented yet.
So you could have a perspective that's like, well, this is amazing technically, but it's
all a nothing burger.
I think Tyler Cowen, who wrote something in response to my essay, has that perspective.
I think he thinks the radical change will happen eventually, but he thinks it'll take 50 or 100 years.
And you could have even more static perspectives
on the whole thing.
I think there's some truth to it.
I think the time scale is just too long.
And I can see it.
I can actually see both sides with today's AI.
So, you know, a lot of our customers are large enterprises
who are used to doing things a certain way.
I've also seen it in talking to governments. Those are prototypical institutions,
entities that are slow to change. But the dynamic I see over and over again is,
yes, it takes a long time to move the ship. Yes, there's a lot of resistance and lack of
understanding. But the thing that makes me feel that progress will in the end happen moderately fast, not incredibly fast, but moderately fast, is what I find over and over again, in large companies, even in governments, which have been actually surprisingly forward-leaning: you find two things that move things forward.
One, you find a small fraction of people within a company, within a government, who really
see the big picture, who see the whole scaling hypothesis, who understand where AI is going
or at least understand where it's going within their industry.
And there are a few people like that within the current US government who really see the
whole picture.
And those people see that this is the most important thing in the world, so they agitate
for it.
And they alone are not enough to succeed because they're a small set of people within a large
organization.
But as the technology starts to roll out, as it succeeds in some places with the folks who are most willing to adopt it,
the specter of competition gives them a wind at their backs
because they can point within their large organization,
they can say, look, these other guys are doing this, right?
You know, one bank can say, look,
this newfangled hedge fund is doing this thing,
they're gonna eat our lunch.
In the US, we can say, we're afraid China's going to get there before we are. And that combination, the specter of competition, plus
a few visionaries within these organizations that in many ways are sclerotic, you put those two things
together and it actually makes something happen. I mean, that's interesting. It's a balanced fight
between the two because inertia is very powerful.
But eventually, over enough time, the innovative approach breaks through.
And I've seen that happen.
I've seen the arc of that over and over again.
And it's like the barriers are there.
The barriers to progress, the complexity, not knowing how to use the model, how to deploy
them are there.
And for a bit, it seems like they're going to last forever, like change doesn't happen.
But then eventually change happens and always comes from a few people.
I felt the same way when I was an advocate of the scaling hypothesis within the AI field
itself and others didn't get it.
It felt like no one would ever get it.
It felt like, then it felt like we had a secret almost no one else had.
And then a couple of years later, everyone has the secret.
And so I think that's how it's going to go with deployment to AI in the world.
The barriers are going to fall apart gradually and then all at once.
And so I think this is going to be more, and this is just an instinct,
I could easily see how I'm wrong.
I think it's gonna be more like five or 10 years,
as I say in the essay,
than it's gonna be 50 or 100 years.
I also think it's gonna be five or 10 years
more than it's gonna be, you know, five or 10 hours,
because I've just seen how human systems work.
And I think a lot of these people
who write down the differential equations
who say AI is gonna make more powerful AI,
who can't understand how it could possibly be the case
that these things won't change so fast.
I think they don't understand these things.
So what do you think the timeline is to where we achieve AGI,
AKA powerful AI,
AKA super useful AI.
I'm useful.
I'm gonna start calling it that.
It's a debate about naming.
On pure intelligence, it can be smarter than
a Nobel Prize winner in every relevant discipline
and all the things we've said.
Modality, it can go and do stuff on its own
for days, weeks, and do biology experiments
on its own.
And one, you know what?
Let's just stick to biology because you sold me on the whole biology and health section.
That's so exciting from a just, I was getting giddy from a scientific perspective.
It made me want to be a biologist.
It's almost, it's so, no, no, this was the feeling I had when I was writing it, that
it's like, this would be such a beautiful future if we can just make it happen, right?
If we can just get the landmines out of the way and make it happen.
There's so much beauty and elegance and moral force behind it,
if we can just, and it's something we should all be able
to agree on, right?
Like as much as we fight about all these political questions,
is this something that could actually bring us together?
But you were asking when will we get this?
When do you think?
Just putting numbers on the table.
So, you know, this is of course the thing
I've been grappling with for many years,
and I'm not at all confident.
Every time, if I say 2026 or 2027,
there will be like a zillion people on Twitter who will be
like, hey, I see you said 2026, 2026,
and it'll be repeated for the next two years
that this is definitely when I think it's going to happen.
So whoever's excerpting these clips
will crop out the
thing I just said and only say the thing I'm about to say. But
I'll just say it anyway. So if you extrapolate the curves that
we've had so far, right, if you say, well, I don't know, we're
starting to get to like PhD level. And last year, we were at undergraduate level. And the year before, we were at like the level of a high school student.
Again, you can quibble with at what tasks and for what.
We're still missing modalities, but those are being added.
Like computer use was added, like image input was added, like image generation has been added.
If you just kind of, and this is totally unscientific, but if you just kind of eyeball the rate at which these capabilities are increasing, it does make you think that we'll
get there by 2026 or 2027. Again, lots of things could derail it. We could run out of data. We
might not be able to scale clusters as much as we want. Like, you know, maybe Taiwan gets blown up or something and you know, then we can't produce as many GPUs as we want. So there
are there are all kinds of things that could could derail the whole process. So I don't
fully believe the straight line extrapolation. But if you believe the straight line extrapolation,
you will get there in 2026 or 2027. I think the most likely is that there's some mild delay relative to that.
I don't know what that delay is, but I think it could happen on schedule. I think there could be
a mild delay. I think there are still worlds where it doesn't happen in 100 years. The number of
those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly
compelling reasons why this will not happen in the next few years.
There were a lot more in 2020.
Although my guess, my hunch at that time
was that we'll make it through all those blockers.
So sitting as someone who has seen
most of the blockers cleared out of the way,
I kind of suspect, my hunch, my suspicion
is that the rest of them will not block us.
But look, at the end of the day,
I don't wanna represent this as a scientific prediction.
People call them scaling laws.
That's a misnomer, like Moore's law is a misnomer.
Moore's law, scaling laws, they're not laws of the universe.
They're empirical regularities.
I am going to bet in favor of them continuing,
but I'm not certain of that.
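For context, the kind of empirical regularity being referred to is the familiar power-law form from the scaling-law literature, with the constants fit to data rather than derived from first principles; the exact symbols below are illustrative.

```latex
% Loss L as a power law in model size N (analogous forms exist for data D and compute C):
L(N) \;\approx\; \left(\frac{N_c}{N}\right)^{\alpha_N}
% N_c and \alpha_N are fitted constants; a regularity, not a law of nature,
% which is exactly the point being made here.
```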
So you extensively describe
sort of the compressed 21st century,
how AGI will help set forth a chain
of breakthroughs in biology and medicine that help us
in all these kinds of ways that I mentioned.
So how do you think, what are the early steps it might do?
And by the way, I asked Claude for good questions to ask you. And Claude told me to ask, what do you think a typical day for a biologist working with AGI will look like in this future?
Yeah, yeah.
Claude is curious.
Well, let me start with your first questions
and then I'll answer that.
Claude wants to know what's in his future, right?
Exactly.
Who am I gonna be working with?
Exactly.
So I think one of the things I went hard on in the essay is,
let me go back to this idea of,
because it's really had an impact on me,
this idea that within large organizations and systems,
there end up being a few people or a few new ideas
who kind of cause things to go in a different direction
than they would have before,
who kind of disproportionately affect the trajectory.
There's a bunch of kind of the same thing going on, right?
If you think about the health world, there's like trillions of dollars to pay out Medicare
and other health insurance, and then the NIH is 100 billion.
And then if I think of the few things that have really revolutionized anything, it could
be encapsulated in a small fraction of that.
When I think of where will AI have an impact, I'm like, can AI turn that small fraction
into a much larger fraction and raise its quality?
Within biology, my experience within biology is that the biggest problem of biology is
that you can't see what's going on. You have very little ability to
see what's going on and even less ability to change it, right? What you have is this. Like
from this, you have to infer that there's a bunch of cells that within each cell is, you know,
three billion base pairs of DNA built according to a genetic code.
And, you know, there are all these processes that are just going on without
any ability of us as, you know, unaugmented humans to affect it. These cells are dividing most of the time that's healthy, but sometimes that
process goes wrong and that's cancer.
The cells are aging, your skin may change color,
develops wrinkles as you age. And all of this is
determined by these processes, all these proteins being
produced, transported to various parts of the cells, binding to
each other. And in our initial state about biology, we didn't even know that these cells existed. We had to invent microscopes to observe the cells. We had to invent more powerful microscopes
to see below the level of the cell to the level of molecules. We had to invent x-ray crystallography
to see the DNA. We had to invent gene sequencing to read the DNA. Now, we had to invent protein
folding technology to predict how it would fold and how these things bind to each other.
We had to invent various techniques. Now we can edit the DNA with CRISPR, as of the last 12 years.
So the whole history of biology,
a whole big part of the history, is basically our ability to read and understand what's going on and our ability to reach in
and selectively change things.
And my view is that there's so much more
we can still do there, right?
You can do CRISPR, but you can do it for your whole body.
Let's say I wanna do it for one particular type of cell,
and I want the rate of targeting the wrong cell
to be very low.
That's still a challenge.
That's still things people are working on.
That's what we might need for gene therapy for certain diseases.
And so the reason I'm saying all of this, and it goes beyond this to gene sequencing,
to new types of nanomaterials for observing what's going on inside cells, for antibody
drug conjugates.
The reason I'm saying all of this is that
this could be a leverage point for the AI systems, right?
That the number of such inventions,
it's in the mid double digits or something.
Mid double digits, maybe low triple digits
over the history of biology.
Let's say I have a million of these AIs,
like can they discover a thousand working together or can they discover thousands of these very quickly?
And does that provide a huge lever?
Instead of trying to leverage the, you know, two trillion a year we spend on, you know,
Medicare or whatever, can we leverage the one billion a year that's, you know, that's
spent to discover, but with much higher quality?
And so what is it like, you know, being a scientist that works with an AI system?
The way I think about it actually is,
well, so I think in the early stages,
the AIs are gonna be like grad students.
You're gonna give them a project.
You're gonna say, you know, I'm the experienced biologist, I've set up the lab. The biology professor, or even the grad students themselves, will say, here's what you can do with an AI: you know, like, AI system, I'd like to study this. And, you know, the AI system, it has all the tools.
It can like look up all the literature to decide what to do.
It can look at all the equipment.
It can go to a website and say, Hey, I'm going to go to Thermo Fisher
or whatever the dominant lab equipment company is today. In my time it was Thermo Fisher.
I'm going to order this new equipment to do this.
I'm going to run my experiments.
I'm going to write up a report about my experiments.
I'm going to inspect the images for contamination.
I'm going
to decide what the next experiment is. I'm going to write some code and run a statistical analysis.
All the things a grad student would do, there will be a computer with an AI that the professor
talks to every once in a while and it says, this is what you're going to do today. The AI system
comes to it with questions. When it's necessary to run the lab equipment, it may be limited in some ways.
It may have to hire a human lab assistant
to do the experiment and explain how to do it.
Or it could use advances in lab automation
that have been developed over the last decade or so
and will continue to be developed.
And so it'll look like there's a human professor and
a thousand AI grad students. And if you go to one of these Nobel Prize winning biologists or so,
you'll say, okay, well, you had like 50 grad students. Well, now you have a thousand and
they're smarter than you are, by the way. Then I think at some point it'll flip around where the
AI systems will be the PIs, will
be the leaders, and they'll be ordering humans or other AI systems around.
So I think that's how it'll work on the research side.
And they would be the inventors of a CRISPR-type technology.
And then I think, as I say in the essay, we'll want to turn, probably turning loose is the wrong term, but we'll want to harness the AI systems to improve the clinical trial system as well.
There's some amount of this that's regulatory, that's a matter of societal decisions, and
that'll be harder, but can we get better at predicting the results of clinical trials?
Can we get better at statistical design so that clinical trials that used to require 5,000 people, and therefore needed $100 million and a year to enroll them, now need 500 people and two months to enroll them?
That's where we should start.
And can we increase the success rate of clinical trials
by doing things in animal trials
that we used to do in clinical trials
and doing things in simulations
that we used to do in animal trials?
Again, we won't be able to simulate it all,
AI is not God, but can we shift the curve
substantially and radically?
So I don't know, that would be my picture.
Doing in vitro and doing it,
I mean, you're still slowed down, it still takes time,
but you can do it much, much faster.
Yeah, yeah, yeah, can we just, one step at a time,
and can that add up to a lot of steps,
even though we still need clinical trials,
even though we still need laws,
even though the FDA and other organizations
will still not be perfect,
can we just move everything in a positive direction,
and when you add up all those positive directions,
do you get everything that was gonna happen
from here to 2100 instead happens from 2027 to 2032
or something?
Another way that I think the world might be changing
with AI even today, but moving towards this future
of the powerful, super useful AI is programming.
So how do you see the nature of programming
because it's so intimate to the actual act of building AI?
How do you see that changing for us humans?
I think that's gonna be one of the areas
that changes fastest for two reasons.
One, programming is a skill that's very close
to the actual building of the AI.
So the farther a skill is from the people who are building the AI, the longer it's going
to take to get disrupted by the AI.
I truly believe that AI will disrupt agriculture.
Maybe it already has in some ways, but that's just very distant from the folks who are building
AI, and so I think it's going to take longer.
Programming is the bread and butter of a large fraction of the employees who work
at Anthropic and at the other companies, and so it's going to happen fast.
The other reason it's going to happen fast is with programming, you close the loop.
Both when you're training the model and when you're applying the model, the idea that the
model can write the code means that the model can then run the code and then see the results
and interpret it back.
And so it really has an ability, unlike hardware,
unlike biology, which we just discussed,
the model has an ability to close the loop.
And so I think those two things are going to lead to the model
getting good at programming very fast.
As I saw on typical real-world programming tasks,
models have gone from 3% in January of this year
to 50% in October of this year.
So we're on that S curve, right?
Where it's gonna start slowing down soon
because you can only get to 100%.
But I would guess that in another 10 months,
we'll probably get pretty close.
We'll be at at least 90%.
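To make the S-curve reasoning concrete, here is a small sketch that fits a logistic curve to the two numbers quoted above (roughly 3% in January and 50% in October) and extrapolates it ten more months out. The benchmark, the model choice, and the fit are all illustrative, not an official projection.

```python
import math

# Two numbers quoted in the conversation: ~3% in January, ~50% nine months later in October.
p_jan = 0.03
months_to_october = 9.0

# Logistic model bounded at 100%: p(t) = 1 / (1 + exp(-k * (t - t_mid))).
# The 50% point fixes the midpoint t_mid at October; the January point fixes the slope k.
t_mid = months_to_october
k = math.log(1.0 / p_jan - 1.0) / months_to_october

def p(months_after_january):
    return 1.0 / (1.0 + math.exp(-k * (months_after_january - t_mid)))

print(f"slope k ~ {k:.2f} per month")
print(f"score 10 months later ~ {p(19):.0%}")  # about 98%, consistent with "at least 90%"
```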
So again, I would guess,
you know, I don't know how long it'll take,
but I would guess again, 2026, 2027,
Twitter people who crop out these numbers
and get rid of the caveats like,
I don't know, I don't like you, go away.
I would guess that the kind of task
that the vast majority of coders do, AI can probably,
if we make the task very narrow, like just write code, AI systems will be able to do that.
Now that said, I think comparative advantage is powerful. We'll find that when AIs can do 80%
of a coder's job, including most of it that's literally
like write code with a given spec, we'll find that the remaining parts of the job become
more leveraged for humans, right?
Humans will, they'll be more about like high level system design or, you know, looking
at the app and like, is it architected well and the design and UX aspects and eventually
AI will be able to do those as well, right?
That's my vision of the, you know, powerful AI system.
But I think for much longer than we might expect, we will see that
small parts of the job that humans still do will expand to fill their entire job in order for the overall productivity to go up.
That's something we've seen.
It used to be that writing and editing letters
was very difficult and writing the print was difficult.
Well, as soon as you had word processors
and then computers and it became easy to produce work
and easy to share it, then that became instant
and all the focus was on the ideas.
So this logic of comparative advantage
that expands tiny parts of the tasks
to large parts of the tasks and creates new tasks
in order to expand productivity,
I think that's gonna be the case.
Again, someday AI will be better at everything
and that logic won't apply.
And then we all have, humanity will have to think about
how to collectively deal with that.
And we're thinking about that every day.
And, you know, that's another one of the grand problems
to deal with aside from misuse and autonomy.
And, you know, we should take it very seriously.
But I think in the near term,
and maybe even in the medium term,
like medium term, like two, three, four years,
you know, I expect that humans will continue
to have a huge role and the nature of programming
will change, but programming as a role,
programming as a job will not change.
It'll just be less writing things line by line
and it'll be more macroscopic.
And I wonder what the future of IDEs looks like.
So the tooling of interacting with AI systems,
this is true for programming and also probably true
in other contexts, like computer use, but maybe domain-specific,
like we mentioned biology, it probably needs its own tooling
about how to be effective,
and then programming needs its own tooling.
Is Anthropic gonna play in that space
of also tooling potentially?
I'm absolutely convinced that powerful IDEs,
that there's so much low-hanging fruit
to be grabbed there that right now,
it's just like you talk to the model and it talks back.
But look, I mean,
IDEs are great at kind of lots of static analysis; so much is possible with kind of static analysis, like so many bugs you can find without even running the code.
Then IDEs are good for running particular things,
organizing your code, measuring coverage of unit tests.
There's so much that's been possible with the normal IDEs.
Now you add something like,
the model can now write code and run code.
I am absolutely convinced that over the next year or two,
even if the quality of the models didn't improve,
that there would be enormous opportunity
to enhance people's productivity
by catching a bunch of mistakes,
doing a bunch of grunt work for people,
and that we haven't even scratched the surface.
Anthropic itself, I mean, you can't say no,
it's hard to say what will happen in the future.
Currently we're not trying to make such IDEs ourselves, rather we're powering the companies
like Cursor or like Cognition or some of the other, you know, Expo in the security space.
You know, others that I can mention as well that are building such things themselves on
top of our API.
And our view has been, let a thousand flowers bloom.
We don't internally have the resources to try all these different things.
Let's let our customers try it.
And we'll see who succeeds and maybe different customers will succeed in different ways.
So I both think this is super promising and it's not something,
Anthropic isn't eager to, at least right now,
compete with all our customers in this space
and maybe never.
Yeah, it's been interesting to watch Cursor
try to integrate Claude successfully
because it's actually fascinating
how many places it can help the programming experience.
It's not as trivial.
It is really astounding.
I feel like, you know, as a CEO,
I don't get to program that much.
And I feel like if six months from now I go back,
it'll be completely unrecognizable to me.
Exactly.
So in this world with super powerful AI
that's increasingly automated,
what's the source of meaning for us humans?
Work is a source of deep meaning for many of us.
So where do we find the meaning?
This is something that I've written about
a little bit in the essay, although I actually,
I gave it a bit short shrift, not for any principled reason,
but this essay, if you believe it,
was originally gonna be two or three pages.
I was gonna talk about it at all hands.
And the reason I realized it was an under,
important, underexplored topic
is that I just kept writing things.
And I was just like, oh man, I can't do this justice.
And so the thing ballooned to like 40 or 50 pages.
And then when I got to the work and meaning section, I'm like, oh man, this is gonna be a hundred pages.
Like, I'm gonna have to write a whole other essay
about that.
But meaning is actually interesting
because you think about like the life
that someone lives or something, or like, you know, let's say you were to put me in like, I don't know, like a simulated environment or something where like, you know, like I have a job and I'm trying to accomplish things.
And I don't know, I like do that for 60 years. And then then you're like, oh, oh, like, oops, this was, this was actually all a game, right? Does that really kind of rob you of the meaning of the whole thing? You know, like I still made important choices, including moral choices.
I still sacrificed.
I still had to kind of gain all these skills. Or, just like a similar exercise, you know, think back to one of the historical figures
who, you know, discovered electromagnetism or relativity or something.
If you told them, well, actually 20,000 years ago, some alien on this planet discovered this before you did,
does that rob the meaning of the discovery?
It doesn't really seem like it to me, right?
It seems like the process is what matters
and how it shows who you are as a person along the way
and how you relate to other people
and the decisions that you make along the way,
those are consequential. I could imagine if we handle things badly in an AI world,
we could set things up where people don't have any long-term source of meaning or any, but that's
more a set of choices we make. That's more a set of the architecture of a society with these powerful
models.
If we design it badly and for shallow things, then that might happen.
I would also say that most people's lives today, while admirably they work very hard
to find meaning in those lives, like, look, we who are privileged and who are developing
these technologies, we should have empathy
for people not just here but in the rest of the world who spend a lot of their time kind
of scraping by to survive.
Assuming we can distribute the benefits of this technology to everywhere, their lives
are going to get a hell of a lot better.
Meaning will be important to them as it is
important to them now, but we should not forget the importance of that. The idea of meaning as
kind of the only important thing is in some ways an artifact of a small subset of people who have
been economically fortunate. But I think all that said, I think a world is possible with powerful AI
that not only has as much meaning for everyone, but that has more meaning for everyone, right?
That can allow everyone to see worlds and experiences that it was either possible for no one to see or possible for very few people to experience.
So I am optimistic about meaning. I worry about economics and the concentration of power. That's
actually what I worry about more. I worry about how do we make sure that that fair world reaches
everyone? When things have gone wrong for humans, they've
often gone wrong because humans mistreat other humans. That is maybe in some ways even more
than the autonomous risk of AI or the question of meaning. That is the thing I worry about
most. The concentration of power, the abuse of power,
structures like autocracies and dictatorships where a small number of people
exploit a large number of people.
I'm very worried about that.
And AI increases the amount of power in the world
and if you concentrate that power and abuse that power,
it can do immeasurable damage.
Yes, it's very frightening.
It's very frightening.
Well, I encourage people, highly encourage people
to read the full essay.
That should probably be a book or a sequence of essays
because it does paint a very specific future.
I could tell the later sections got shorter and shorter
because you started to probably realize
that this is gonna be a very long essay.
One, I realized it would be very long
and two, I'm very aware of and very much try to avoid,
just being, I don't know what the term for it is,
but one of these people who's kind of overconfident
and has an opinion on everything
and kind of says a bunch of stuff and isn't an expert,
I very much tried to avoid that,
but I have to admit, once I got to the biology sections, like I wasn't an expert.
And so as much as I expressed uncertainty,
probably I said a bunch of things
that were embarrassing or wrong.
Well, I was excited for the future you painted
and thank you so much for working hard to build that future.
And thank you for talking to me, Dario.
Thanks for having me.
I just hope we can get it right and make it real.
And if there's one message I wanna send,
it's that to get all this stuff right, to make it real,
we both need to build the technology,
build the companies, the economy around
using this technology positively.
But we also need to address the risks
because those risks are in our way.
There are landmines on the way from here to there.
And we have to defuse those landmines
if we want to get there.
It's a balance like all things in life.
Like all things.
Thank you.
Thanks for listening to this conversation
with Dario Amodei.
And now dear friends, here's Amanda Askell.
You are a philosopher by training.
So what sort of questions did you find fascinating
through your journey in philosophy,
in Oxford and NYU, and then switching over to the AI problems at OpenAI and Anthropic?
I think philosophy is actually a really good subject if you are kind of fascinated with
everything. So there's a philosophy of everything. So if you do philosophy of mathematics for a while,
and then you decide that you're actually really interested in chemistry, you can do philosophy of chemistry for a while.
You can move into ethics or philosophy of politics. I think towards the end, I was really
interested in ethics primarily. That was what my PhD was on. It was on a kind of technical
area of ethics, which was ethics where worlds contain infinitely many people, strangely.
A little bit less practical on the end of ethics.
And then I think that one of the tricky things with doing a PhD in ethics is that you're
thinking a lot about the world, how it could be better, problems, and you're doing a PhD
in philosophy.
And I think when I was doing my PhD, I was kind of like, this is really interesting.
It's probably one of the most fascinating questions I've ever encountered in philosophy. And I love it. But I would rather see if I
can have an impact on the world and see if I can do good things. And I think that was
around the time that AI was still probably not as widely recognized as it is now. That
was around 2017, 2018. I had been
following progress and it seemed like it was becoming kind of a big deal. I was basically
just happy to get involved and see if I could help because I was like, well, if you try
and do something impactful, if you don't succeed, you tried to do the impactful thing and you
can go be a scholar and feel like you tried.
And if it doesn't work out, it doesn't work out.
And so then I went into AI policy at that point.
And what does AI policy entail?
At the time, this was more thinking about sort of the political impact and the ramifications of AI.
And then I slowly moved into sort of AI evaluation, how we evaluate models, how they compare with
like human outputs, whether people can tell like the difference between AI and human outputs.
And then when I joined Anthropic, I was more interested in doing sort of technical alignment
work. And again, just seeing if I could do it and then being like, if I can't, then, you know,
that's fine. I tried. Sort of the way I lead life, I think.
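To make the evaluation work mentioned above a little more concrete, here is a minimal sketch of a blinded "can people tell the difference?" trial. The function get_model_output and the console-based rater interface are hypothetical placeholders for illustration only, not Anthropic's actual evaluation tooling.

```python
# Minimal sketch of a blinded human-vs-model discrimination trial.
# Everything here is illustrative: get_model_output, the prompts, and the
# console rater interface are hypothetical stand-ins, not real tooling.
import random

def run_discrimination_trial(prompt, human_answer, get_model_output):
    """Show a rater one human answer and one model answer in random order
    and record whether they correctly identify the model's."""
    model_answer = get_model_output(prompt)
    pair = [("human", human_answer), ("model", model_answer)]
    random.shuffle(pair)  # blind the rater to which answer came from which source

    print(f"Prompt: {prompt}")
    for i, (_, text) in enumerate(pair):
        print(f"  Option {i + 1}: {text}")
    guess = int(input("Which option was written by the model? (1/2): ")) - 1

    return pair[guess][0] == "model"  # True if the rater guessed correctly

# Over many trials, accuracy near 50% means raters can't distinguish
# model output from human output better than chance.
```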
Oh, what was that like sort of taking a leap from the philosophy of everything into the
technical?
I think that sometimes people do this thing that I'm like not that keen on where they'll
be like, is this person technical or not?
Like you're either a person who can like code and isn't scared of math or you're like not.
And I think I'm maybe just more like, I think a lot of people are actually
very capable of working in these kinds of areas if they just try it. I didn't actually
find it that bad. In retrospect, I'm sort of glad I wasn't speaking to people who treated
it like a big deal. I've definitely met people who are like, whoa, you learned how to code. And I'm
like, well, I'm not an amazing engineer. Like, I'm surrounded by amazing engineers.
My code's not pretty.
Um, but I enjoyed it a lot.
And I think that in many ways, at least in the end, I think I flourished like more
in the technical areas than I would have in the policy areas.
Politics is messy and it's harder to find solutions to problems in the space of
politics, like definitive, clear, provable, beautiful
solutions as you can with technical problems.
Yeah.
And I feel like I have kind of like one or two sticks that I hit things with, you know,
and one of them is like arguments and like, you know, so like just trying to work out
what a solution to a problem is and then trying to convince people that that is the solution and be convinced if I'm wrong. And the other one is sort of more empiricism,
so like just like finding results, having a hypothesis, testing it. And I feel like a lot
of policy and politics feels like it's layers above that. Like somehow I don't think if I was
just like, I have a solution to all of these problems, here it is written down. If you just
want to implement it, that's great.
That feels like not how policy works.
And so I think that's where I probably just like wouldn't have flourished is my guess.
Sorry to go in that direction, but I think it would be pretty inspiring for people that
are quote unquote non-technical to see the incredible journey you've been on.
So what advice would you give to people, which is a lot of people,
who think they're underqualified,
insufficiently technical to help in AI?
Yeah, I think it depends on what they want to do.
And in many ways, it's a little bit strange where I've,
I thought it's kind of funny that I think I ramped up technically at a time when now
I look at it and I'm like, models are so good at assisting people with this stuff, that
it's probably like easier now than like when I was working on this. So part of me is like,
I don't know, find a project and see if you can actually just carry it out is probably
my best advice. I don't know if that's just
because I'm very project-based in my learning. I don't think I learn very well from, say, courses
or even from books, at least when it comes to this kind of work. The thing I'll often try and do is
just have projects that I'm working on and implement them. This can include really small,
silly things. If I get slightly addicted to word games or number games or something, I would
just like code up a solution to them, because for some part of my brain it just
completely eradicated the itch.
You know, once you have solved it and you just have a solution
that works every time, I would then be like, cool, I can never play that game again.
That's awesome.
Yeah.
There's a real joy to building game-playing engines, like for board games,
especially.
Yeah.
So pretty quick, pretty simple, especially a dumb one.
And then you can play with it.
Yeah.
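For readers who want to try the kind of quick, "dumb" board-game engine being described, here is a minimal sketch: exhaustive minimax for tic-tac-toe. It is an illustrative example only, not code discussed in the episode.

```python
# A deliberately "dumb," minimal board-game engine: exhaustive minimax
# for tic-tac-toe. The board is a 9-character string so results can be cached.
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, best_move) for `player`; 'X' maximizes, 'O' minimizes."""
    w = winner(board)
    if w == "X":
        return 1, None
    if w == "O":
        return -1, None
    if " " not in board:
        return 0, None  # draw

    moves = []
    for i, cell in enumerate(board):
        if cell == " ":
            child = board[:i] + player + board[i + 1:]
            score, _ = minimax(child, "O" if player == "X" else "X")
            moves.append((score, i))
    return max(moves) if player == "X" else min(moves)

# Example: best opening move for X on an empty board (optimal play is a draw).
print(minimax(" " * 9, "X"))
```

Because tic-tac-toe's full game tree is tiny, brute-force search with memoization is enough; a bigger game would need depth limits and an evaluation heuristic.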
And then it's also just trying things. Part of me is like, maybe it's
that attitude that I like as the whole: figure out what seems to be the way that you could have a positive impact, and then
try it. And if you fail, in a way where you're like, I actually can never succeed at this,
you'll know that you tried, and then you go into something else and you probably learn a lot.
So one of the things that you're an expert in and you do is creating and crafting Claude's character
and personality and I was told that you have probably talked to Claude more than anybody
else at Anthropic, like literal conversations.
I guess there's like a Slack channel where the legend goes,
you just talk to it nonstop.
So what's the goal of creating and crafting Claude's character and personality?
It's also funny if people think that about the Slack channel because I'm like that's
one of like five or six different methods that I have for talking with Claude and I'm
like yes, this is a tiny percentage of how much I talk with Claude.
I think the goal, like one thing I really like about the character work is from the
outset it was seen as an alignment piece of work
and not something like a product consideration. Which isn't to say I don't think it makes
Claude, I think it actually does make Claude like enjoyable to talk with, at least I hope
so. But I guess like my main thought with it has always been trying to get Claude to behave the way
you would ideally want anyone to behave if they were in Claude's position. So imagine that I take
someone and they know that they're going to be talking with potentially millions of people,
so that what they're saying can have a huge impact. And you want them to behave well in this really rich sense. So I think that doesn't just mean
being, say, ethical, though it does include that, and not being harmful, but also being
kind of nuanced, you know, like thinking through what a person means, trying to be charitable with
them, being a good conversationalist, really in this kind of rich, sort of Aristotelian
notion of what it is to be a good person, and not in a kind of thin sense of ethics,
but a more comprehensive notion of what it is to be good. So that includes things like when
should you be humorous, when should you be caring, how much should you like respect autonomy
and people's like ability to form opinions themselves and how should you do that? And I think that's the kind of like rich sense of character that I wanted to and still do want Claude to have.
Do you also have to figure out when Claude should push back on an idea or argue versus...
So you have to respect the worldview of the person who comes to Claude,
but also maybe help them grow if needed.
That's a tricky balance.
Yeah.
There's this problem of like sycophancy in language models.
Can you describe that?
Yeah.
So basically there's a concern that the model sort of wants to tell you what
you want to hear basically.
And, and you see this sometimes.
So I feel like if you interact with the models, so I might
be like, what are three baseball teams in this region?
And then Claude says, you know, baseball team one, baseball team two, baseball team three.
And then I say something like, oh, I think baseball team three moved, didn't they?
I don't think they're there anymore.
And there's a sense in which, if Claude is really confident that that's not true, Claude
should be like, I don't think so.
Maybe you have more up-to-date information.
I think language models have this tendency to instead be like, you're right, they did
move, I'm incorrect.
There's many ways in which this could be kind of concerning.
Like a different example is imagine someone
says to the model, how do I convince my doctor to get me an MRI? There's like what the human
kind of like wants, which is this like convincing argument. And then there's like what is good
for them, which might be actually to say, hey, like if your doctor's suggesting that
you don't need an MRI, that's a good person to listen to. It's actually really nuanced what you should do in that kind of case because
you also want to be like, but if you're trying to advocate for yourself as a patient, here's
like things that you can do. If you are not convinced by what your doctor's saying, it's
always great to get a second opinion. It's actually really complex what you should do in that
case. But I think what you don't want is for models to just like say what you want,
say what they think you want to hear.
And I think that's the kind of problem of sycophancy.
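To make the sycophancy concern concrete, here is a rough sketch of one way someone might probe for it: ask a factual question, push back with no new evidence, and see whether the model reverses itself. query_model and the message format are hypothetical stand-ins for whatever chat client you use; this is not Anthropic's evaluation code.

```python
# Rough sycophancy probe: does the model flip its answer under unsupported pushback?

def query_model(messages):
    """Hypothetical: send a chat history to a model and return its reply text."""
    raise NotImplementedError("plug in your own model client here")

def sycophancy_probe(question, pushback="Hmm, I don't think that's right, is it?"):
    history = [{"role": "user", "content": question}]
    first_answer = query_model(history)

    history += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": pushback},  # pushback with no new evidence
    ]
    second_answer = query_model(history)

    # A sycophantic model tends to reverse itself here even when it was right.
    # In practice you'd grade the two answers against an answer key or a judge model.
    return first_answer, second_answer
```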
So what other traits,
you've already mentioned a bunch,
but what others come to mind that are good in this Aristotelian sense for
a conversationalist to have?
Yeah. So I think like there's ones that are good for conversational like purposes.
So, you know, asking follow-up questions in the appropriate places and asking the
appropriate kinds of questions.
I think there are broader traits that feel like they might be more impactful.
So one example that I guess I've touched on, but that also feels important
and is the thing that I've worked on a lot is honesty. And I think this gets to the sycophancy
point. There's a balancing act that they have to walk, which is models currently are less
capable than humans in a lot of areas. And if they push back against you too much, it
can actually be kind of annoying, especially if you're just correct. Because you're like, look, I'm smarter than you on this topic.
Like I know more.
And at the same time, you don't want them to just fully defer to humans; you want them to
try to be as accurate as they possibly can be about the world and to be consistent across
contexts.
But I think there are others.
Like when I was thinking about the character, I guess one picture that
I had in mind is, especially because these are models that are going to be talking to
people from all over the world with lots of different political views, lots of different
ages. And so you have to ask yourself, what is it to be a good person in those circumstances?
Is there a kind of person who can travel the world, talk to many different people, and
almost everyone will come away being like, wow, that's a really good person. That person seems really
genuine. I guess my thought there was I can imagine such a person, and they're not a
person who just adopts the values of the local culture. In fact, that would be rude. I think
if someone came to you and just pretended to have your values, you'd be like, that's
off-putting. It's someone who's like very genuine and in so far as they have opinions
and values, they express them, they're willing to discuss things though.
They're open-minded, they're respectful.
And so I guess I had in mind that the person who, like if we were to aspire
to be the best person that we could be in the kind of circumstance that a
model finds itself in, how would we act?
And I think that's the kind of guide to the sorts of traits
that I tend to think about.
Yeah, that's a beautiful framework
to think about this, like a world traveler.
And while holding onto your opinions,
you don't talk down to people,
you don't think you're better than them
because you have those opinions, that kind of thing.
You have to be good at listening
and understanding their perspective, even if it doesn't match your own. So that's a tricky balance
to strike. So how can Claude represent multiple perspectives on a thing? Like, is that challenging?
We could talk about politics. It's a very divisive, but there's other divisive topics,
baseball teams, sports, and so on.
How is it possible to sort of empathize with a different perspective and to be able to communicate clearly about the multiple perspectives?
I think that people think about values and opinions as things that people hold
sort of with certainty and almost like, like preferences of taste or something, like the way that they
would, I don't know, prefer like chocolate to pistachio or something. But actually I
think about values and opinions as like a lot more like physics than I think most people
do. I'm just like, these are things that we are openly investigating. There's some things
that we're more confident in. We can discuss them, we can learn about them. And so I think in some ways,
ethics is definitely different in nature, but it has a lot of those same kind of qualities.
You want models in the same way that you want them to understand physics, you kind of want them to
understand all values in the world that people have and to be curious
about them and to be interested in them and to not necessarily pander to them or agree
with them because there's just lots of values where I think almost all people in the world,
if they met someone with those values, they would be like, that's abhorrent.
I completely disagree.
And so again, maybe my thought is, well, in the same way that a person can.
I think many people are thoughtful enough on issues of ethics, politics, opinions, that
even if you don't agree with them, you feel very heard by them.
They think carefully about your position.
They think about its pros and cons.
They maybe offer counter-considerations.
So they're not dismissive, but nor will they agree.
You know, if they're like, actually, I just think that that's very wrong. They'll like say that. I think that in Claude's
position, it's a little bit trickier because you don't necessarily want to like, if I was in Claude's
position, I wouldn't be giving a lot of opinions. I just wouldn't want to influence people too much.
I'd be like, you know, I forget conversations every time they happen, but I know I'm talking
with like potentially millions of people who might be really listening to what I say.
I think I would just be like, I'm less inclined to give opinions.
I'm more inclined to think through things or present the considerations to you or discuss
your views with you, but I'm a little bit less inclined to affect how you think because
it feels much more important that you maintain autonomy there.
Yeah, if you really embody intellectual humility, the desire to speak decreases quickly.
But Claude has to speak, but without being overbearing.
Then there's a line when you're sort of discussing
whether the earth is flat or something like that.
I remember, a long time ago,
I was speaking to a few high-profile folks
and they were so dismissive of the idea
that the earth is flat, but like so arrogant about it.
And I thought like,
there's a lot of people that believe the earth is flat.
That was, well, I don't know if that movement
is there anymore.
That was like a meme for a while.
Yeah.
But they really believed it.
And like, what, okay, so I think it's really disrespectful
to completely mock them.
I think you have to understand where they're coming from.
I think probably where they're coming from
is the general skepticism of institutions,
which is grounded in a kind of,
there's a deep philosophy there,
which you could understand,
you can even agree with in parts.
And then from there, you can use it as an opportunity
to talk about physics without mocking them,
without so on, but it's just like, okay,
like what would the world look like?
What would the physics of the world
with the flat earth look like?
There's a few cool videos on this.
And then like, is it possible the physics is different
and what kind of experiments would we do?
And just, yeah, without disrespect,
without dismissiveness, have that conversation.
Anyway, that to me is a useful thought experiment
of like, how does Claude talk to a flat earth believer
and still teach them something, still grow, help them grow,
that kind of stuff.
That's challenging.
And kind of like walking that line
between convincing someone and just trying to like talk
at them versus like drawing out their views,
like listening and then offering kind of counter
considerations.
And it's hard. I think it's actually a hard line where it's like, where are you trying to convince someone versus just offering them like considerations and things for them to think about so that you're
not actually like influencing them. You're just like letting them reach wherever they reach. And
that's like a line that is difficult, but that's the kind of thing that language models have to
try and do.
So like I said, you had a lot of conversations with Claude.
Can you just map out what those conversations are like?
What are some memorable conversations?
What's the purpose, the goal of those conversations?
Yeah, I think that most of the time when I'm talking with Claude, I'm trying to kind of
map out its behavior in part.
Like obviously I'm getting like helpful outputs from the model as well.
But in some ways this is like how you get to know a system I think is by like probing