Bankless - 168 - How to Solve AI Alignment with Paul Christiano
Episode Date: April 24, 2023Paul Christiano runs the Alignment Research Center, a non-profit research organization whose mission is to align future machine learning systems with human interests. Paul previously ran the language ...model alignment team at OpenAI, the creators of ChatGPT. Today, we’re hoping to explore the solution-landscape to the AI Alignment problem, and hoping Paul can guide us on that journey. ------ ✨ DEBRIEF | Unpacking the episode: https://www.bankless.com/debrief-paul-christiano ------ ✨ COLLECTIBLES | Collect this episode: https://collectibles.bankless.com/mint ------ ✨ Always wanted to become a Token Analyst? Bankless Citizens get exclusive access to Token Hub. Join Them. https://bankless.cc/TokenHubRSS ------ In today’s episode, Paul answers many questions, but the overarching ones are: 1) How BIG is the AI Alignment problem? 2) How HARD is the AI Alighment problem? 3) How SOLVABLE is the AI Alignment problem? Does humanity have a chance? Tune in to hear Paul’s thoughts. ------ BANKLESS SPONSOR TOOLS: ⚖️ ARBITRUM | SCALING ETHEREUM https://bankless.cc/Arbitrum 🐙KRAKEN | MOST-TRUSTED CRYPTO EXCHANGE https://bankless.cc/kraken 🦄UNISWAP | ON-CHAIN MARKETPLACE https://bankless.cc/uniswap 👻 PHANTOM | FRIENDLY MULTICHAIN WALLET https://bankless.cc/phantom-waitlist 🦊METAMASK LEARN | HELPFUL WEB3 RESOURCE https://bankless.cc/MetaMask ------ Topics Covered 0:00 Intro 9:20 Percentage Likelihood of Death by AI 11:24 Timing 19:15 Chimps to Human Jump 21:55 Thoughts on ChatGPT 27:51 LLMs & AGI 32:49 Time to React? 38:29 AI Takeover 41:51 AI Agency 49:35 Loopholes 51:14 Training AIs to Be Honest 58:00 Psychology 59:36 How Solvable Is the AI Alignment Problem? 1:03:48 The Technical Solutions (Scalable Oversight) 1:16:14 Training AIs to be Bad?! 1:18:22 More Solutions 1:21:36 Stabby AIs 1:26:03 Public vs. Private (Lab) AIs 1:28:31 Inside Neural Nets 1:32:11 4th Solution 1:35:00 Manpower & Funding 1:38:15 Pause AI? 1:43:29 Resources & Education on AI Safety 1:46:13 Talent 1:49:00 Paul’s Day Job 1:50:15 Nobel Prize 1:52:35 Treating AIs with Respect 1:53:41 Uptopia Scenario 1:55:50 Closing & Disclaimers ------ Resources: Alignment Research Center https://www.alignment.org/ Paul Christiano’s Website https://paulfchristiano.com/ai/ ----- Not financial or tax advice. This channel is strictly educational and is not investment advice or a solicitation to buy or sell any assets or to make any financial decisions. This video is not tax advice. Talk to your accountant. Do your own research. Disclosure. From time-to-time I may add links in this newsletter to products I use. I may receive commission if you make a purchase through one of these links. Additionally, the Bankless writers hold crypto assets. See our investment disclosures here: https://www.bankless.com/disclosures
Transcript
Discussion (0)
The most likely way we die involves like not AI comes out of the blue and kills everyone,
but involves we have deployed a lot of AI everywhere.
And you can kind of just look and be like, oh yeah.
If for some reason, God forbid, all these AI systems were trying to kill us, they would definitely kill us.
Welcome to bankless where we explore frontier technologies, artificial intelligence on the episode today.
This is how to get started, how to get better, and how to front run the opportunity.
This is Ryan Sean Adams.
I'm here with David Hoffman, and we're here to help you become more bankless.
Guys, we have a special guest in the episode today. Paul Cristiano, this is who Eliezer
Yudkowski told us to go talk to, someone he respects on the AI debate. So we picked his
brain. This is an AI safety alignment researcher. We asked the question, how we stop the AIs
from killing us? Can we prevent the AI takeover that others are very concerned about? There are
three, actually four takeaways for you today. Number one, how big is the AI alignment problem?
We ask Paul this question. Number two, how hard is it to actually?
solve this problem. Number three, what are the ways we solve it, the technical ways,
can we coordinate around this to solve it? And finally, number four, we talk about a possible
optimistic scenario where we live in harmony with the AIs and they improve our lives and make
it quite a bit better. David, what was the significance of this episode to you in our series?
Yeah, the new intro, I think is great when we cover frontier technologies and also,
instead of for this episode, helping you become more bankless. I think this is,
truly a front-running the opportunity. On this one, we are trying to help not die. Yeah,
question mark. We're trying to help you not die. And all of us, yeah, view and the rest of the world.
And I think we're doing this in the best of the way that we can, which is education and awareness
about this AI problem. Paul Cristiano is, like you said, the man that was recommended by
Eliezer about the man approaching this problem head on in a technical way. So in this podcast,
you are going to hear about the technical solutions that people are actively working on,
who are taking this problem extremely seriously and take the risks very, very seriously.
Eliezer gave us more or less a doomsday scenario, like a 99% chance of doom.
Paul Cristiano only gives us a 10 to 20% case of doom scenario.
So much more optimistic.
Like those odds.
Much better odds.
And so we go through why he's still very much concerned, and he does consider this the most, like
way in which he dies in the future, yet why there's still an 80% chance of success. And some of that
80% chance of success actually does have utopia in it. I think maybe Ryan in three, five, 10 years,
we're going to be able to look back at this episode as hopefully, if Paul's right, and I think he's
right, like ahead of its time in terms of elevating extremely important conversations to the best
of our ability into the mainstream. So we can get more people to focus on the actual solution paths
to make sure that that 20% risk of Doomsday
goes down to 0.2% risk of Doomsday.
And I see that's what the significance of this episode has
and why we are doing episodes like this.
Yeah, I mean, to be clear,
Paul really thinks the AI alignment is a solvable problem,
which is much different than others in the space,
and he tells us exactly why.
And my takeaway from the Elezer episode
was Humanity is screwed.
This was, humanity is screwed,
but we're working on it.
We might have some solutions.
And we have clear, actionable paths.
That's right.
So we get into all of that today.
And David, of course, I want to discuss this episode in the debrief with you and hear what you think because this is our third in a series of AI episodes and really interesting material.
The debrief episode is our episode that we record directly after the episode with our raw unfiltered thoughts.
If you are a bankless citizen, you have access to that right now on the premium RSS feed.
You can click a link in the show notes and get access to that.
Okay, guys, we're going to get right to the episode with Paul.
But before we do, we want to thank the sponsors that made this episode.
possible, including Cracken, our recommended crypto exchange for 2023.
Cracken has been a leader in the crypto industry for the last 12 years.
Dedicated to accelerating the global adoption of crypto, Cracken puts an emphasis on security,
transparency, and client support, which is why over 9 million clients have come to love
Cracken's products. Whether you're a beginner or a pro, the Cracken U.S. is simple, intuitive,
and frictionless, making the Cracken app a great place for all to get involved and learn about
crypto. For those with experience, the redesigned Cracken Pro app and web experience is completely customizable
to your trading needs, integrating key trading features into one seamless interface. Cracken has a 24-7-365
client support team that is globally recognized. Cracken support is available wherever, whenever you
need them by phone, chat, or email. And for all of you NFTers out there, the brand new Cracken
NFT beta platform gives you the best NFT trading experience possible. Rarity rankings, no gas
fees and the ability to buy an NFT straight with cash. Does your crypto exchange prioritize its customers
the way that Cracken does? And if not, sign up with Cracken at Cracken.com slash bankless.
Hey, Bankless Nation. If you're listening to this, it's because you're on the free Bankless
RSS feed. Did you know that there's an ad-free version of Bankless that comes with the
Bankless Premium subscription? No ads, just straight to the content. But that's just one of many things
that a premium subscription gets you. There's also the token report, a monthly bullish, bearish,
neutral report on the hottest tokens of the month. And the regular updates from the token report
go into the token Bible. Your first stop shop for every token worth investigating in crypto. Bankless
Premium also gets you a 30% discount to the permissionless conference, which means it basically
just pays for itself. There's also the AirDrop Guide to make sure you don't miss a drop in
2023. But really, the best part about Bankless Premium is hanging out with me, Ryan, and the rest of
the bankless team in the Inner Circle Discord only for premium members. Want the Alpha? Check out
Ben the analyst Degen Pit, where you can ask him questions about the token report.
Got a question? I've got my own Q&A room for any questions that you might have.
At Bankless, we have huge things planned for 2023, including a new website with login with your
Ethereum address capabilities, and we're super excited to ship what we are calling Bankless
2.0 Soon TM. So if you want extra help exploring the frontier, subscribe to Bankless Premium.
It's under 50 cents a day and provides a wealth of knowledge and support on your journey west.
I'll see you in the Discord.
to Ethereum. The number one wallet on Solana is bringing its millions of users and
beloved UX to Ethereum and Polygon. If you haven't used Phantom before, you've been missing out.
Phantom was one of the first wallets to pioneer Solana staking inside the wallet and will be
offering similar staking features for Ethereum and Polygon. But that's just staking.
Phantom is also the best home for your NFTs. Phantom has a complete set of features to
optimize your NFT experience. Pin your favorites, hide your uglies, burn the spam,
and also manage your NFT sale listings from inside the wallet.
Phantom is, of course, a multi-chain wallet.
But it makes chain management easy,
displaying your transactions in a human-readable format,
with automatic warnings for malicious transactions or fishing websites.
Phantom has already saved over 20,000 users from getting scammed or hacked.
So, get on the Phantom Waitlist and be one of the first to access the multi-chain beta.
There's a link in the show notes.
Or you can go to phantom.
dot app slash waitlist to get access in late February.
Bankless Nation, I'm super excited to introduce you to
our next guest. We're talking about AI alignment stuff here today because we can't not. This is Paul
Cristiano. He runs the Alignment Research Center, which is a nonprofit research organization whose
mission is to align future machine learning systems with human interests. Make sure the AIs don't
come to kill us. That's what I take to be the meaning. And Paul previously ran the language
model alignment team at Open AI. You know them. They're the creators of ChatGPT. And today,
we're hoping that Paul can help us explain the solution.
landscape and understand the solution landscape of this AI alignment problem. Paul, welcome to Bankless.
Thanks for having me. Excited to talk to you. So Paul, just to get some context here, David and I
record an episode with Eliezer Yukowski. We thought this would be, hey, you know, Bankless's first
intro, we're primarily a crypto podcast, but we're exploring other frontier technologies. Let's go
check out. Let's just go dabble with AI. Let's just go dabble with AI. You know,
crypto and AI might have some sort of match in the future. So we recorded this podcast and we quickly
realized the agenda that we were going to use and talk about didn't matter anymore because
Eliezer's message was pretty simple. We were all going to die. Basically, we were on the brink,
whether it's years or months away, from creating some super intelligent AI that would eventually
rearrange humanity's atoms and destroy us. And he felt pretty convicted on this as the likely
outcome. So when you get a message like that, Paul, you got to investigate a little further. You have to
get the second doctor's opinion when the prognosis is terminal. So that's what this series is all
about. And we're hoping you can help guide us through these questions today. Does that sound okay?
Happy to at least share my thoughts. I'm a little bit less gloomy. Yes. Okay. Okay.
After a great start. I suspect you so. By the way, Eliezer said, we asked him,
hey, is there someone else we could talk to about this? And he mentioned you. He said,
talk to Paul Christiano. He said, you were someone he respects and brings some of the counterpoints
to his take. So let's get into them. Why don't we just start by waiting into the deep end of the pool here?
What is your percentage likelihood of the full-out Elyzer-Yudkowski-Dum scenario where we're all going to die from the machines?
I think this question is a little bit complicated, unfortunately, because there are a lot of different ways we could all die from the machines.
So the thing I most think about, and I think Eliezer most talks about, is the sort of full-blown AI takeover scenario.
I take this pretty seriously.
I think I have a much higher probability
than a typical person working in ML.
I think maybe there's something like a 10, 20% chance
of AI take over many most humans dead.
That's not really high.
I agree.
Better than 100, David.
Just better than 100%.
Yeah, we're turning in the right direction.
Yeah, I think in some sense,
I'm still quite a gloomy person.
So there's other ways that the development of AI can be rough.
Like, there's other ways that you can have access
to new destructive physical technology,
other disruption. So I think you're maybe looking at some other risks from that transition to
AI, and that adds up to at least another 10%. And then maybe a bigger background part of both
my view and Eliezer's view. I think Eliezer is into this extremely fast transformation once you
develop AI. I have a little bit less of an extreme view on that, but I still think it is the
case that compared to what's kind of default expectations in the world, things are going to be really
fast. So we could talk about the development of AI, but then you might also want to talk about
what happens over the coming months or years. I tend to imagine something more like a year's
transition from AI systems that are a pretty big deal to kind of accelerating change followed
by further acceleration, et cetera. I think once you have that view, then sort of a lot of things
may feel like AI problems because they happen very shortly after you build AI, like your AI builds,
new AI systems, your site keeps changing. Anyway, so overall, you know, maybe you're getting more up to
like 50-50 chance of doom shortly after you have AI systems.
that are at human level. Okay. All right. Well, let's start with that speed conversation first. Let's start
with the takeoff velocity question because I think that's something to that really the AI alignment,
doomerism's perception really, really depends on. If we do believe that AIs are going to be developed
and they're magically going to become sentient, not magically, but it feels like magic.
Somehow pragmatically, it becomes sentient. It's, it feels like magic to humans. It's going to be
because it happens really fast. And I want to actually try and measure that speed because
fast and slow is like relative, right? And so the takeoff scenario that some AGI super
intelligence explosions people articulate is that as soon as some sort of AI can update itself,
it's like a lightning flash. It's like a snap of the fingers. It happens in a blink of an eye.
One day we have chat GPT7 and then the next day we have an AI takeover. And that's like the super
fast scenario. I think what you're saying is like, yeah, pretty quickly, but still not like lightning
fast. I think what you're saying is like, give it a year. Maybe you can help unpack, like,
the timing. How do we understand about time around this whole thing? Yeah, this is one of my most
pronounced disagreements with Eliasur, where we've really went back and forth about it a lot
over the last, like, Jesus, I don't know, 12 years. And I think still do not see at all
I tie. That's that. My view is pretty fast. So I think, like, how I would think about this,
maybe there's like kind of two parts to my answer. So one is sort of based on how fast things are
currently moving in AI. So a way you could think about it is if you were to try and measure,
right, if you're trying to say, suppose you have AI doing some job and has some level of
competence at that job this year, how good is it going to be at this job next year? And how fast
is that changing? Right. So if we're looking at the world and we're like every day, AI is much
smarter than the day before, then you kind of are going to expect, by default, a fast transition
over scale of something like days because you're going to have AI systems that are weak and
a couple days later you have AI systems that are strong. I think the way I would describe the
situation now is more like time scale of like a year or a couple of years.
years. And then there's some reasons that the, when we give a quantitative number, it depends a lot on what we're talking about a number for. So like one way you could measure how fast AI is moving is you could say, suppose you have like an AI from your X and an AI from your X plus one. And you're wondering, like, how much better is your AI from your X plus one? In the sense of like how many your ex AIs would you have needed to be comparably useful. Like how many having AI from when you're later is kind of like having twice as many computers or having four times as many computers or something like that. I think that's typically the regime we're
So like when your AI progress is kind of similar to increasing the amount of compute you have by something like 4x,
from a combination of like hardware progress, software progress, economies of scale, maybe more.
I think you might get 8x, you might go down to 2x, but it's like something in this regime.
You have like one to a couple of doublings per year.
So like right now, it doesn't really matter that much.
Doubling the amount of computers you have has very little effect on the world.
Like if we double the number of computers in the world today, you would not even notice in like GDP statistics.
Yeah, you wouldn't notice basically.
I think in the future, there are going to be systems that are doing some large fraction of all the work in the world,
and ultimately they can substitute effectively for humans across many domains.
And then doubling the number of your computers you have is kind of like doubling the effective population size,
like doubling how many people are working as researchers, doubling how many people are doing jobs.
And so in that world, if you say you're doubling the number of computers you have effectively every like four to six months,
that does imply a very rapid rate of change in kind of how quickly science is progressing in sort of how much stuff you're able to accomplish in the world.
And so I kind of think of that as this like first transition, as you move into this world where AI can substitute for humans,
is you're looking at a rate of growth, something like doubling a couple times a year in total output of AI systems.
The main thing that softens this, so we could talk about how fast a transition, what that actually translates into in terms of how faster transition is in the world.
I think there's one important consideration that softens that change, which is that you have some complementarity between AI systems and humans.
That is AI's are good at some things, humans are good at other things.
So as a result, like, the things that AI are good at, you tend to hit diminishing returns on those
so that the transition is like a little bit slower than you would guess.
If AI's and humans were perfect substitutes, I think you'd be looking at a transition over like
12 months from a world where like humans are doing almost everything, humans are doing almost nothing.
I think given some complementarity, you're probably like a transition over more like years.
I think it's like kind of hard to run the numbers to get a transition over decades.
I think a lot of people say that and have a strong intuition in that direction.
but like when the discussion gets down to brass tax,
I currently don't really see how to make it work.
I think it is possible to get up to like, yeah, very low decades.
Like, you know, having, anyway, we have to talk about timeline between what and what.
But I think we're mostly talking about like years.
I think months would be pretty surprising but possible.
Decades would be also pretty surprising but possible.
So just Paul, to get people up to speed on your 12 years of debate with Elyzer
and those who hold this viewpoint.
So does Elyzer think in terms of like minutes or days that this could happen?
and you're saying, not that fast, it's closer to years?
Is that the difference?
And then once you answer that, can you tell us,
why does this matter so much,
whether it happens in minutes or days versus, you know, years and decades?
Why is that such a fulcrum of this whole debate and discussion
on why the AIs will come kill us or whether they will or whether we'll be okay?
I think probably it is harder to describe Eliezer's views quantitatively in terms of rates of change
because more of his view is about this like kind of phase transition that happens
quite quickly. Like, it's not reasonable to talk. Another perspective, it's not reasonable to talk
about this framing of, like, how long does it take to, like, double the population size or something.
It's more like, how long does it take to move from, like, chimps who are doing nothing to humans
who are doing a lot of stuff. And he's like that, I don't know, could just randomly happen one day
that someone, like, tweaks their code and out went from being a chimp to being a human. And that's
pretty transformative. I think he sort of just has a broader distribution. I think he does not
find years out of the question. He's just like, that's kind of the tail of how slow it could be or
something like that. And, like, it's more about this qualitative picture. He's just like, you're
sort of not going to have changes in the world.
Like, in some sense, the more important thing is I imagine AI systems acting in the
world, doing like trillions of dollars of economic value prior to getting to this point
where they're actually causing like this potential catastrophic risk or where they're
significantly or totally transforming the pace of future technological change.
I think LAZER imagines more like you move from a world like the world of today where AIS
are doing maybe billions or tens of billions or hundreds of billions of dollars of value.
I think that feels to me like the core distinction is kind of like where are you starting from.
like more starting from a world with trillions or tens of trillions,
and Eliezer is more starting from a world of like, I mean, 12 years ago,
I think this is probably going to be seem unfair to Eliezer.
But this discussion was like maybe more live.
And Eliezer would frequently talk about, like, maybe some random people in a small AI group
somewhere.
Like a company like DeepMind is doing like $100 million a year of spending is going
to be building transformative AI.
I think my basic take was like no way.
You're going to look at AI systems that are doing trillions of dollars of revenue.
And like, this gap is closing pretty rapidly because now the one's going to say,
like, you're doing $100 million of revenue.
Like it's pretty clear we're going to be in the least like,
billions or tens of billions of dollars.
I think my take is just like pretty soon,
it's going to be pretty clear doing $100 billion.
And it's going to really like,
I think we're kind of just debating like,
what is the point that you jump from?
Like when you get to the AI that's like crazy science stuff,
what was happening like six months before that was what was happening like
AI systems really broadly deployed in the world doing a ton of crazy things
or it was happening like actually the impact of AI systems was pretty limited
until right before you kind of have this process of rapidly accelerating R&D,
recursive self-improvement within like a single firm or in a like local part of the world.
But to be clear, Paul, do you think it's possible or more unlikely than Eliezer thinks it is to go from that big hop, software update one day to move from chimps to human level intelligence? Do you think that's unlikely? And do you have kind of technical grounds for this? Or what are your grounds for believing that that's less likely than others do?
Yeah, I think there's two parts of this. I mean, again, I want to emphasize that compared to most people in the world, I think I'm into much, much faster change. I think the mainstream view in ML is that things will be more gradual, which I think is mistaken. And when you get down, again, we get down to brass tax. I'm really unpersuaded.
You could then talk about my view, you could have views that are quantitatively faster than mine.
It's just like, actually, this isn't years, this is months.
I think that's like, it's very defensible to run the historical extrapolations and end up with different numbers.
That's just like a really hard empirical question.
And it's like hard to make these predictions about the future.
And I have a lot of sympathy for that.
Then I think there's more like this qualitative claim.
It's like the chimp versus human jump.
I think that's not out of the question, but feels quite unlikely.
I think the basic reason it feels quite unlikely to me is like I would claim that's really not how anything has worked in AI to date.
and it's not how things have worked in almost any other technologies.
Like, it has mostly been the case.
This is my read of history, and I'm very happy to argue about it.
I think this is like an important part of the argument with Eliezer.
My read of history is mostly that before you can do something really crazy,
you can do something that works a little bit less well and is like is a little bit crappier.
And you do sometimes have these jumps from like zero to one.
But you tend to see the zero to one jump, not when a technology is worth like
this would allow you to take over the world or this would be worth $10 trillion dollars.
You tend to see zero to one jumps from a technology is like you have a bunch of amateurs or hobbyists
or a couple scientists.
And so I think if we're going to see such a jump in AI,
I'd be more likely to have seen it back when we were talking about
a small academic community and you're less likely to see it
when you have academic or like labs investing billions
or tens of billions of dollars.
I think the general record is like most of the time
before you can do something really crazy,
you can do something slightly less crazy.
And that becomes like a more and more robust regularity
as you increase the number of people thinking about something
and increase the amount of attention.
You move more and more in terms of like industries
with reasonable roadmaps that actually are forecast for what's going to happen.
Is there an example that you call to mind for history
that sort of we can compare this to?
I mean, I think I'm happy to compare it to sort of almost any technology that is like they are all
different different ways.
I don't know if this, I think LAS was probably more like wanting to point to a particular thing.
He's like, this is the really relevant one.
But I would say like to me, AI seems kind of similar to like future AI developments seem
kind of similar to either past AI developments, other developments in software.
I'd be happy to talk about computing hardware or solar power or nuclear power or nuclear
weapons or flights or.
You just think they all take this kind of.
of gradual sort of approach rather than the big zero to one moments.
And part of the reason Paul people are asking about this is because I think it seems like.
And you tell us, so you previously have worked at OpenAI, very familiar with the methodologies
used.
But it feels to some people like chat GPT has been a big zero to one moment, right?
Like, my God, it's amazing.
And people are, you know, tinkering with it in so many ways and how human-like it seems and how
fast it seemed to explode into the popular consciousness. And so I'm wondering if that has
affected your view on this at all. It's like, oh, wow, this could happen faster than I
previous thought, or if this is well within the bounds of your model? Like, what are we to make of
chat GPT? So I think that chat GPT seems I would take as like representative of the kind of
trajectory I'd expect. So you could compare like chat GPT versus GPT 3.5 versus GPT3, GPT2. I think like
people at OpenAI are not, like it's kind of, most of the social.
sociological facts about chat GPT getting to the point where it was discussed a lot.
I think the actual technical change, certainly between chat GPT and GPT 3.5, but also between
3.5 and 3.
Like, each of these is not giant jumps.
I think these were, like, pretty small changes.
And, like, chat GPT is not, I think it is at the point where it's, like, economically
valuable and it, like, is worth a lot.
I think people are mostly excited because they're looking ahead to where this will go.
And that's, like, a lot of what makes these dynamics, like, more continuous.
People are starting to say, like, okay, what can we do with this?
I think they're doing that at a point where, like, they're not going to be able to do
trillions of dollars a year of value, but they are seeing that that is going to become possible
at some point in the not distant future of the technology continues to improve.
It's like very concretely, like, I mean, I don't know, we were having these discussions.
I was having the discussions quite a lot, like prior to the training of GPT2.
It's like prior to the training of GPT2, we didn't really have language models that,
I don't think we have language models you would even recognize as like seeming smart.
It kind of felt like a qualitatively different ballgame.
It made most relevant is like after the training of GPT2 before GPT3 or before the scale up from like
one B to six B.
I think we, like, know we sat down and we made predictions about how good large models would be.
And I think I am surprised by how good models are, but like surprised in the sense that this is like maybe my 80th percentile of how smart a language model at the scale of GPT 3.5 would be.
Something like that is a little bit hard to exactly compare my forecast to where reality is that.
But we did discuss explicitly for like a language model the size of GPT 3.5 trained in roughly the way GPT 3.5 was.
We weren't exactly right because some of the scaling laws, like there's the switch from GPT scaling laws to Chinchilla scaling laws.
But, like, roughly speaking, we were imagining systems at that scale trained in that way.
And we were like, you know, an example of a bet would be like, is there any task a human can do over 30 seconds that a system trained in this way can't do over 30 seconds?
Like, is there any 30 second touring test that can distinguish a human from an AI?
And I think, like, you know, the debate was kind of like amongst people I was most talking to it open.
I was like, ah, probably there will be, but it's not a sure thing, like a one third chance, you know, that there will be no such tests or something like that.
And like, I think that things are mostly like, that was not from like, it wasn't a big jump, but it was just like kind of see the writing on the wall.
like see systems improving in this way and be like it's going to get harder and harder to tell.
There's more and more things they can do.
So are you saying Paul of that for society that seemed to, you know, come out of nowhere, basically?
But it's not really surprising the researchers and the engineers who've been on the inside
at OpenAI and constructing chat GPT4.
It was maybe within the bounds of expectation.
It was, you know, maybe more optimistic or I'll use the term bullish because, look, we talk
about bullish in crypto all the time.
It's more bullish than you thought this.
technology would sort of take you, but still within the realms. And the only reason it's having
such an effect on our collective consciousness is more sociological. It turned into a consumer
application. Yeah, it's just suddenly turned into a consumer app and everyone's like, wow, I can
type whatever I want into this magic box and a genie Oracle artificial intelligence just gives
me the answer. This is incredible. Like, is anyone surprised by this who's been working on this tech?
Yes, this is a nuanced question. There's like a lot to say. I don't know if we want to get into all
of it. But like, I think at the point when GPT2 was trained, this was a controversial prediction.
So, like, there were people who were more bullish than I was, like, Eiji Dario, Dario,
Amadeh, who now runs, Dario-Amide, who now runs Anthropic was, like, at the time of GPT,
like, very bullish. And in timeline here, Paul, so GPD2 was when, what year?
Oh, man, I don't even know if I remember. I think these discussions were, like,
2018. Got it. Okay. Yeah, so Dario was more optimistic. I think this world was actually, like,
slightly below Dario's medium of how impressive a system, like, chat GPT would be. I mean,
not a huge amount below. I think he made, like, quite a good forecast, and it's kind of his
been his big bid win. I think there were a bunch of people who kind of had views in this general
cluster, like, for whom this is like a little bit better than what they would have thought,
or like, you know, at the 80th percentile or something of what they might have expected.
I think there were a lot of people who, like, weren't in the business of making forecasts,
but seemed qualitatively. Like, if you look to the academic ML community, like, it felt to me
like people were not expecting this to happen. Like, discourse about general AI felt like very
frustrating, I think, where they're like, why is it the case that people are really
minimizing the possibility of just large neural nets trained in a very similar.
way being competitive with humans. I think it was surprising to like a lot of people. And again,
it was just quantitatively, right? It was surprising to me in the sense that this is like better than
I thought. And I like lost bets about this. I mean, I won some bets with people. I lost some
bets with people. I think some people weren't quantifying their probabilities. Maybe this was like
more just totally out of model. But by the time you're talking about chat GPT compared to what came
before, I do not think, I think at that point to people building the system, it was not surprising.
Like once you're talking about the gap, even from three to three point five and then from
3.5 to chat chitputting. They were probably more surprised by the way, like the impact it had
on collective consciousness rather than by the capabilities of the system itself. I think that
seemed like pretty technically de-risked. And Paul, when we're talking about sort of AI alignment
and safety concerns and just on the topic still of chat GPT, right, are these neural nets,
this large language models, the ones we have to worry about? Like, I think what basically
society is sort of wondering as this AI alignment safety question rises into public
consciousness is, okay, at what version do we have to start worrying?
that chat GPT is going to pose a threat to us.
Like, there's some version where it starts to take our jobs, okay?
And then, like, maybe that's version four, version five,
and so it affects our economy and economics,
and we have to reorient restructure society as a result of this.
But, like, it seemed to be what Eliezer was saying is,
well, maybe it might be version 9 or version 10 or version 11,
where we actually have to fear for our lives from this thing
because it's become super intelligent.
Do you see that possible trajectory for this specific,
large language model AI technology, or should we have more concerns about other AI technology that's coming in some other vector of development?
I'd like to phrase that question slightly differently, and maybe in a metaphor that bankless listeners can understand, often, Paul, when we talk to people that are outside of the crypto industry, they just call the things about the crypto industry Bitcoin. And then when Ryan and I hear that, we're like, oh, what they really mean is like decentralized technology, like identity.
they just use Bitcoin as a placeholder
to talk about so many different things.
It's so frustrating.
And I think as like AI normies,
me and Ryan AI normies out there,
we might be saying chat CBT
and what we actually are trying to talk about
is like generalized artificial intelligence
and we just use chat CBT because that's the thing.
It's the Bitcoin of AI.
Are we falling into that same trap?
I definitely think there's a lot of subtlety here
and there's a lot of different ways
I could interpret like this thing.
So I'd say that like one question is like,
what are you modeling?
Like what is the I learning to predict?
Like, is it videos or is it text or is it like interactions with code, like running code and the results of running code or like the code humans would write?
And I think like over the last couple years, I think those distinctions have mattered a lot less.
Like what is mostly, I think the default model for how the system should work is just you have quite a lot of data of quite a lot of types.
And you just dump it all in.
You say, like, look, your job AI is to deal as well as you can with every type of data we give you.
And there's engineering problems in allowing those different types of data.
And there's questions about what types of data the system is actually able to effectively deal with.
but I think the fact that it's trained on language
is like not, I just don't think you should think
of it as defining feature.
I think you should imagine systems like see the world.
I think language is probably an important way of thinking
about how they act.
Although again, I think it's very impressive.
Systems can act by producing images
and those will be very impressive
and will have a big impact.
I think economically in some sense,
language is like a very flexible
and like kind of core way you should think about systems acting,
but perceiving, I don't think you should really think
about language models in particular.
GPT itself, just saying GPT,
this is basically just fixing, like,
there's two things that specifies.
One is how is it trained?
Is it trained to predict data, or is it pre-trained to predict data?
Or is it trained in some other way?
That's the first distinction.
And the second is just that it's a transformer.
I don't know if anyone wants to make a super strong bet about like transformers per se.
I think they're just like our different, yeah, there's a big space of possible architectures.
There's probably going to be debated about something so you'd call them a transformer.
I think both of these, like what kind of data you're modeling and then like what kind of architecture are using are in some sense not very essential.
Like I don't think it would change like open AI being in the game or like the exact kind of product they're offering.
I think they're just like, look, we train large general nets.
we do it on having some very broad pre-training task that captures a lot about the world and gives an interesting opportunity to be smart, and then we fine-tune them on downstream tasks that we think are economically useful, e-gatting with people or generating images that people would rate highly or writing code that developers will think it's good. I think that's like the basic paradigm you should imagine when we talk about like this thing, which is like a little bit broader than chat GPT.
But I think it's not crazy to say like chat GPT really is indicative of that broader ecosystem. And I'd say like chat GPT is more similar to the rest of that ecosystem than like Bitcoin is to the rest of the crypto ecosystem.
there are fewer key technical differences in that case.
Right.
Yeah, that's like at high level.
And then whether this like kind of thing can cause trouble, like, I'm like, I think it's
really, really hard to say.
I think like a lot of people talk a lot of smack about how like it's really silly to
think that AI systems of this form could do something crazy.
And I'm looking at that and I'm like the same people were talking smack that feels to me,
again, often not in the business of making concrete predictions.
But we're saying that it was really silly to have the expectations that people had about
language model scale ups five years ago.
And I'm like, I think the scale.
up what you can imagine occurring over the coming years is similar in magnitude to the scale up
we've observed over the last five years. And so I'm like, it's really hard to predict where that
ends up. And if someone is giving a confident take about where that ends up, if they're like,
these AI systems can't do X or can't do Y, I like really want them to get more precise about
why they think that and what exactly they're saying. And I'm really just pretty skeptical on
the face of it. I think we can really relate to that on at least using our crypto frame of mind,
is again, when we talk to normies, what we call the people who are not inside of the
crypto world. And they're like, oh, these Bitcoin.
Ethereum currencies, they can't possibly take over the world. And I think when you become more
informed about the crypto world, you just get so tired of these takes because of how uninformed and
unimaginative they seem. And so like just conceptually, I can definitely resonate with that.
It's like, you don't know what you don't know and neither do we, but like you can kind of
understand some base principles as to the nature of these things and how they grow and develop
and change in ways that you might not expect. And you can kind of, if you are versed in these topics,
can extrapolate into the future pretty well.
And without precision, still give a broad stroke belt.
Like, hey, this is where this is going to go and here's what you don't appreciate.
So I can definitely appreciate that.
And I want to go back and tie a bow on the time conversation because we started this
conversation like, okay, it could happen.
There's the model of it happening in two days.
There's a model of happening in two years or 20 years.
And you're in the camp of, I'm going to give a range of like six months to two years-ish,
loosely, very loosely, without trying to be two-per-defense about these things.
Depends a lot on from when to when.
Depends a lot.
Lots of variables.
Time can pass.
Like, you're not going to wake up
and it's going to be different.
And, like, the reason why this is important,
and I want to go back.
I want to be careful about, like,
because the question of what my default expectation is,
and then what is possible and what you can be confident about.
So, like,
I am extremely skeptical of someone who is confident
that if you took GPT4 and scaled up
by two orders of magnitude of training compute
and then fine tune the resulting system
using existing techniques that we know exactly what would happen.
Like, I think that thing you're looking at
a non-trivial chance.
that it would, yeah, a reasonable chance that it would be inclined or would be sufficient.
If it was inclined, it would be capable enough to effectively disempower humans and, like,
a plausible chance that it would be capable enough that you start running into these,
these concerns about controllability.
So I would be hesitant to put Doom probability from that.
If a lab was not cautious about how they deployed it and wasn't measuring, I would be
cautious to put the probability of takeover from 200-92-scale-a-GBT4 below, like 1% or 1-1-1-1-1-1-1-1.
Okay.
Well, we'll have to put a pin in that.
But let me like round out this time question.
Just want to get our tails.
Like I have a default guess.
Right.
The importance of the time point is like whether this is a lightning flash and it's different versus we have one to two years.
For me, when I hear this, I'm like, okay, we have one to two years and we're watching it happen and we're seeing it happen and we're able to react to it versus it happens.
We can't react to it.
And so if you're telling me that this is a, it's still a fast takeoff, but your perception of fast is a year or so.
To me, I'm like, okay, a year is fast enough for humans to react.
And to me, that is a window, a gap, a needle that humans have the option to thread if we can coordinate.
And that is where I start to get optimistic.
And so that is my gut reaction.
And I want to just check that gut reaction against you.
Is that one of the paths that you see?
It doesn't go so fast that we don't have the time to react to it.
Yeah.
I think that seems basically right with some nuance.
but like I think that most likely the thing that moves kind of slow in my view, again, over the course of years, like incredibly fast relative to policy and incredibly fast relative's expectations in the broader world, but like kind of slow is like how quickly do systems become more capable? Like how long is it between an AI which is like smart enough that it could run a company for you and actually like those companies are competitive with human companies and the AI just can like actually take over the world? And I'm like that gap, there's probably a gap there. You're probably talking more like, you know, years than months. I mean, depending exactly what you mean by run a company. That said, I think it's worth pointing out to like, there's
dynamics of takeover itself, like, if you had, so if you imagine broadly deployed AI systems,
which are very competent and which would be able to take over, and then you ask, like,
how quickly does this particular kind of catastrophe unfolds? I think most likely the actual
catastrophe is extremely fast. So that's not like a year's thing. That's like, I think my default,
anyway, we could get more into this and probably worth getting into. My default picture is, like,
we have time to react in terms of the nature of AI systems changing and AI capabilities changing.
I don't, and like, with luck, we have like various kinds of smaller catastrophes that
occur in advance. But I think, like, one of the bad things about the situation is that the actual
catastrophe are worried about does have these dynamics that are kind of similar to dynamics of, like,
a human coup or human revolution where, like, you don't have, like, little baby coups and you
like, see, like, here's the rate at which coups occur. I mean, you might be male, so just go
straight. So, like, a coup can happen very quickly, right? The whole dynamic is that, like,
once people start switching over, once you have AI systems, we're like, actually, I'm
going to get in on this, like, overthrowing humanity thing, that information can propagate quite
quickly, and you don't really, like, you've kind of, the ship has sailed if you've waited until,
like the AIs are actually taking over.
Right.
Okay.
So it's like for that, you don't have really an opportunity to respond.
But for AI is changing.
I think it's like, I basically think it's reasonable in some sense for people to look
at the AIs right now.
I'd be like, look, these things, that's not realistically a takeover risk.
And I think that probably are going to have years between people like, that actually
looks like a takeover risk and when an actual takeover occurs.
And that's pretty good.
And that's a lot of why I'm more optimistic than Elias.
I think Elios was like, you're just going to get hit by this out of the blue.
Right.
And I'm like, well, I think people are wrong to be so confident about the rate of progress.
but I think they probably will be able to see things that can be generally recognized as like pretty concerning prior to actual catastrophe.
And a lot of that has happened so far.
I mean, I think people are just like much, it feels more plausible now than it did five years ago by a lot that AI systems could do something really crazy and transformative.
And I think it will feel much more plausible again in five years.
Okay.
So this presents a new mental model for me.
When we were talking to Eliezer, it felt very much like the don't look up problem, as in there's a meteor crashing into Earth.
No one wants to acknowledge it.
And then one day it crashes into Earth and we die.
And the idea here is like we need to coordinate and get people to look up so we can identify the problem.
And then once we identify the problem, it is a linear amount of time before the asteroid crashes into Earth.
And what you're saying is like we can see the asteroid, but there's like this gradually then suddenly moment.
And that's your revolution moment where like you can start to see the seeds of revolution.
But you don't really know when the people will decide to grab their pitchforks in the middle of the night and revolt.
But you can start to see the boiling of the water.
And so that's a gradually, then suddenly moment.
But we still have the opportunity to quell the revolution before the revolution starts.
Is that the characterization?
We have enough time to send up Bruce Willis to blow up the asteroid.
We speak in metaphors here on.
There's some bleak and confusing metaphors.
But, yeah, my best guess for if there is something like an AI takeover, this is a huge
part of departure from Melios here, my best guess is that an a.
Ana catastrophe occurs in a world where AI systems are deployed extremely broadly
and where it is kind of obvious to humans
that we are putting our fate
in the hands of AI systems.
So we see ourselves giving over the keys to the kingdom
and we are watching that happen.
Again, I think it's important what's possible
and what's likely, but I think that's the most likely way
we die involves, like, not AI comes out of the blue
and kills everyone, but involves, we have deployed
a lot of AI everywhere.
And you can kind of just look and be like, oh, yeah,
if for some reason, God forbid,
all these AI systems were trying to kill us,
they would definitely kill us.
Oh, I kind of get, I can see this.
So, like, our Tesla's got an AI in it,
And we trust that.
And our refrigerator's got an AI in it.
And it calls the grocery delivery robot, which is also an AI to deliver us food.
And then all of a sudden, everything around us is an AI.
And they're like, man, I really hope that they like me.
Yeah.
You're like, you get food delivered to you from like Amazon, which is by Amazon, we mean a bunch of machines that are orchestrating a bunch of other machines.
And you have some money and that money is managed by AI advisors, investing in AI firms.
Okay.
That is actually pretty clear to me.
I can see how we get there with that.
who like wants some human to protect, yeah.
I think it's rough. I think basically the thing is, I think it's likely that before the end,
it is clear that it is very hard to be physically secure. Like right now,
if you're just like a human with some guns, we're like, look,
and AI can't fuck with me that much. Like, I have a gun. The AI's just on a computer
somewhere. I can blow up the data center. I think it is probably clear before AI takeover
that that that's not the case. I think it's probably clear that the only way you can
defend yourself effectively. Like if you're fighting a war, you're like, look,
you can't be like a country who's fighting a war against a country that has probably deployed
AI. And it's like, it's fine. We're just not going to use AI's. That will just be
completely untenable. I think it's not clear we're that far away from such a world. And that world's
like, okay, well, if someone invades with AIs, obviously we're going to have our AIs defend us.
And then you're like, okay, now it really matters. Like the prospect of the AI coup, now is a different
character. It's like, you just ask the AIs to please defend you from the other AI's. And maybe
they're like, nah, I don't really feel like doing that. Again, most likely, I care about
the other risks. And right now, I think if you were to die tomorrow, it would not be like this.
It would be like, I think that really took you by surprise. I think you're talking about
the tail there. And I care about evaluating the tail. But like the median outcome where we die,
I think looks like this.
one problem people might be having her listening to this are starting to be exposed to this topic
for the first time, which is myself and David, and maybe the average bankless listener, the average
person who's being converted to, from a normie to someone who is actually adequately alarmed
about AI safety types of issues, is this idea of agency. Like you've mentioned this a few times,
like this idea of the AI's banding together to strike humanity. Banding together. What is this like
Google and chat GPT and all of these? Like, how do they have agency to actually want
to do that. It's very difficult for us to imagine. I mean, this seems so sci-fi to us. Could that actually
happen? Can you give us some sort of mental model for like how that happens? Because I'm still
having a hard time understanding how chat GPT suddenly gets agency. Where the agency comes from.
And wants to ban up with, you know, 10 other super AIs and, you know, send us a bioengineered bacteria
to kill us all, as was L.EAS or, you know, possible, like, expression that this could happen that way.
Yeah, I think even in a good world, we're probably going to be in a situation where we're trusting AIs with our lives.
So probably in some sense, the core question is not why are they in a position to kill you.
The core question is why would they end up killing you?
I think there's basically two threat models.
There's maybe a general reason to be concerned about the world where humans are trusting AIs,
where we generally have very limited ability to control or predict what they do.
But if we want to talk concretely about the way we currently produce AI systems,
I think there's basically two ways you end up in this failure mode or two like, I mean, there's lots of unknown unknowns.
There's two known ways we end up in this failure mode that people care most about.
So first, the one that's like, I think, more likely to occur, but more easy to manage.
So the way we train like chat GPT is you have some conversations with humans.
And then you look at that conversation, you say, a human rates the conversation.
It's like, was this model doing a good job of answering their questions and being helpful?
And then you do reinforcement learning where you like take the nature of that interaction.
If it went well, you update the model to do a little bit more like that.
And if it went poorly, you update them all to do a little bit less like that.
So that's how we train like chat GPT.
A way you might try and use GPT is you might say,
I'm actually going to give it some tools,
and I'm going to give it a task,
and I'm asking me to try and accomplish that task.
So I might say, like, my code has failed.
I don't quite understand why.
I'm just going to ask GPT, like,
hey, you have a bunch of ability to run code
on my computer.
You can, like, make changes to the code
and see what happens.
You can spin up a web server.
Could you, like, figure out where the error was?
Like, could you, by stack,
tell me what commit introduced the problem,
tell me what's up with the problem.
And then you want to go send the system
to act autonomously and, like,
perform all these actions,
like, to try running different versions
of your code and writing new tests and so on.
That's like a way you really want to.
I think people are already starting to really want to use GPT in that way.
And if you're doing that, instead of having conversations and finds me to say, was this conversation
good, you're giving an AI a task like that and saying, could you use tools to accomplish
this task and doing exactly the same thing?
You see, did it accomplish the task effectively?
If so, adjust it to do more of that.
If it accomplished it ineffectively, adjust it to do less of that.
So that's like a kind of training, which has already done some.
It's currently not as important as the chat GPT style of training where you're just
looking at the interaction.
You're not accomplishing things on the world.
world and training based on that. But I think it's probably really important. I think best guess is
that is going to probably is already happening with an open AI for GPT4 in order to make it,
like they deploy this product, which is can you get your AI2's tools to help you accomplish
things? I think they care a lot about that products. I think that like absent concerns about safety,
that's like a really natural way for the technology to go. And so now you're in this regime where
what AI systems do, the way they're trained is they get given this huge library of tasks,
a ton of different tasks over different time horizons. And they're told like, hey, could you try
accomplish this task, and then they're tweaked to do more of whatever it is that gets
evaluated as accomplishing the task effectively. So the way this leads to trouble is you now have a
system. And one thing a system could learn, if you do that process a bunch of times, is it could learn
to say, like, okay, I'm in a situation. What should I do? Well, I should think, what is going to
cause my behavior to be evaluated favorably? Like, what is the task that I've been set?
How is a person ultimately going to evaluate my performance on that task? How is that ultimately
going to translate into a reward, which is then, and I'm going to try and choose actions that are
ultimately going to lead to this high reward because I've been a just.
over many, many generations to do things that lead to high reward. One way I might do that is by thinking,
like, hey, what leads to a high reward? And I might do that because I, like, there's a lot of ways
you can end up doing that. Like, a human might, like, crave reward that might be like, the thing I love is
reward. Or I might do that because I'm like, look, I have to do well because I'm being trained
and I, like, don't like being given a bad reward by the people who are training me or whatever.
I'm not even going to talk that much about what's happening psychologically, just that you end up
with a system that thinks, how do I get a high reward and then does that? And if you keep
selecting for things that get high rewards, you could end up with such systems. And then if you have
such systems, so this is kind of the classic scenario people have been concerned about, which I think
we're now, again, we have examples of. It feels like we're pretty much going in that direction.
You now have systems deployed in the world, like a ton of all the a AIs that are acting on the world doing
things on human behalf. All of them are thinking when they act like, okay, I've been given a task.
I need to think, how is a reward going to be computed for this task if it's selected for training?
So if in the end of this task we selected, I open AI and they evaluate how will I perform?
I need to think, like, what's going to determine what reward I get?
And what they do is they ask which action is going to cause me to get a high reward,
and they take that action.
And they use all of their understanding of the world, all of their ability to think of clever things,
all of their ability to predict the consequences of different actions.
They use all of that just to say, which action is going to get me a high reward.
And the concern that leads to is, right, in normal times, the way you get a high reward
is by doing what people at OpenEI like.
In normal times, your transcripts is going to get evaluated by people at Open AI,
and they're going to say, great, that was good.
And, like, hopefully, the way you get them to them to be able to be able to,
to evaluate it well is by actually doing good things and making the customer happy and making
so, like, there's all these measurements that will be used to assess how well you did.
Hopefully what happens to just actually do your task well and all the measurements suggest you
did the task well and someone at OpenA concludes you did the task well and therefore you get a high
reward. But in unusual times, I think you could do instead to say like, oh, I could do the task
well or, I suppose that I've been tasked with like, you know, helping defend you from some other
AI's. Like, this is a sort of dystopian case if you imagine open I train this model.
But like my job is, someone is coming and trying to hack your computer and I'm supposed to help defend
you just to help improve your security situation, whatever. And I'm wondering, what is it I
could do that would get me a high reward? And one thing I can do that will get me a high reward is
actually like helping defend your computer, actually like doing the task you asked me to do.
But another way to get a high reward is I could just say, like, at the end of the day, what really
matters is just how you measure my performance. And your measurements of my performance ultimately
are just like entering some numbers into a data set somewhere or something that a computer
says about how well I did. And it would really be much better if I was just work with this AI
who's attempting to attack you and say like, hey, AI, who's invading? You know what? If you just
help me, and we both make it look like I did a really good job. Like, I win, you win, because
you got the person's stuff. I'm going to get a really high rating because all the numbers
that can enter in the data set are going to be really high. This is a win-win. Everyone is happy.
And so it's like, in some sense, what all the AIs want, but every AI in the world in this
scenario wants is just to be rated. They want their behavior to be rated really highly.
And while humans are in control, the way to get your behavior rated really highly is do things
humans like, and then they'll rate it highly. But if you can see this prospect, if humans
losing control the situation, instead AI systems control the situation, you'd be like, I would
go for that. I would go for the world where it's no longer humans entering rewards in telling me what I got.
I would go for the world where instead AI systems are just all giving ourselves the maximum reward or
whatever. I think in some ways, like psychologically, that's probably not quite the right way to think
about it. But the general thing is your systems have been selected over a really long time to take actions
to get a high reward. You put them in a new situation where the way to get a high reward is not to do
what humans but to help disempower humans. And then, you know, having disempowered humans, give yourself,
measurements that suggest you do your job well or actual rewards that are high or whatever. I mean,
you might think that in that new situation, the systems will sort of systematically switch from
behaving well to behaving poorly because you've changed the conditions under which you get a higher
reward. A pattern I'm seeing here is that engineers, like software developers, write code, and sometimes
the code has bugs. Lawyers, they write legal contracts. And the reason why often legal contracts
are so long is that they are protecting against edge case scenarios, right? The idea is to, like,
not let the system throw an error, right? Not let the system find a loophole or find a leak or something.
So like when a software developer writes a bugs, like, man, they accidentally created a system that allowed for an error to be thrown.
What I'm seeing here is the same pattern.
And if we don't code up these systems, the AIs will naturally, like, find a loophole.
And if that loophole allows for the AIs to rate themselves highly and give themselves a reward, that's what they're going to do and that's what they're going to find.
Does that way to articulate this?
Yeah, I think that's a fair general summary.
Maybe one way I'd put it is like in the legal system, you write a contract, but ultimately what matters is like the discretion of a judge.
Right.
And if you're training this AI system, you may have automated ways of administering reward,
but ultimately what matters is like, someone's going to look at what the AI did and be like,
that's not what we intended. And then they'll score it negatively if that's what happens.
And so it's kind of like some final authority. And the final authority really rests on the fact that
ultimately the judge has the power to make this judgment or the person who's training the AI system
has the power to control what it cares, like has the power to update the weights of that model,
ultimately. And so it's just like there's this contingency. In addition to the thing of like,
a formal thing you write down is likely to have loopholes. In some sense, there is a loophole
in the final judgment, which is just a human.
says the answer, which is that that relies on a human just entering some data into, like,
in some sense, having physical control over this data center that the AI cares about.
So I can update the weights of the model, which is the AI.
The last step that you're describing, Paul, where the AI, you know, colludes with another
AI to sort of fudge the numbers because that is the outcome that the human wanted.
This is where we sort of crossed the line from kind of light side into dark side.
This is sort of the threshold of deceit that we've crossed.
And these AIs are now deceiving us.
They're lying to us.
is there no way to protect against that?
Is there no rule that we can somehow apply?
Maybe this is, I don't want to jump ahead to the solutions to this, you know,
AI safety type solutions to this.
But it's not clear to me why an AI would be motivated to do that.
And it seems like there should be some way to prevent that, like always be honest as a rule,
something like this.
Yeah.
Again, we're normies trying to understand this.
I mean, if only it was that simple.
Yeah.
What are the complications?
I think it's incredibly complicated.
I think it's genuinely unknown.
It's an open empirical question.
If you trained an AI system to get a lot of reward and you train it in a bunch of cases
where being dishonest always failed in practice, right?
We tried to train it to be honest.
Anytime we saw the AI like doing something sneaky, we're like, wow, that was not only bad.
That was really bad.
You should really just not lie to us about what you're doing.
You should really not try and like hack tests.
You should not try and conceal evidence of wrongdoing.
That's like one of the things, you know, one of the most clear and blatant principles
in our training.
It's an open question what happens if you train an AI system in that way, right?
Like, one option is your AI system learns like, oh, I shouldn't try and mess with humans.
Like, every time I mess with humans, I do really badly.
That's the good case.
And the bad case where AI system learns is it says,
oh, part of the reward provision process is a human thinking to themselves,
like, did this AIMS with me.
And if the human thinks the AIMS with them
and then they enter that thing into the data set,
then obviously I get a little reward.
But that second one is, like, much more brittle, right?
The second one is not a general prohibition against lying.
It's a prohibition against, like, lying getting caught.
And I think, like, there's not really any...
It comes down to, like, a complicated empirical question
about how neural nuts learn,
which I think we don't really have good evidence on,
right now about like if you have a bunch of, you have a data set where don't lie and don't lie
if you'll get caught are like perfectly in alignment, hopefully, if you do a very good job,
right, if your eye never gets away with anything sneaky, if your guy starts getting away with
things that were sneaky, or if you start like erroneously penalizing an AI because you think
it lied, but it didn't. Then like don't lie isn't even a good thing for it to learn. At some point,
the best thing for it to learn the way to get the highest reward, the thing was great
to send favors is the thing which involves gaming it out like in more detail and saying,
like the cynical view does in fact get, get more reward. Like if I'm an employee,
And I'm like, I could learn two things from interacting with my boss.
One is like, I should really do what my boss wants.
And the other is like, I should really make sure my boss approves them my performance.
In some regimes, those two things are perfectly aligned.
But in some other regime, it's like, if you keep optimizing hard enough, you're going to get the model,
which is just like, I really care about what my boss thinks about my performance.
And like, I'm honest only insofar as that's like an instrumental strategy for helping me get this human to think I did a really good job.
And if I could go all the way, if I could just like totally box them out,
that is totally prevent them from understanding or correcting a mistake.
then I'd prefer to do that.
I think it is like, it's a sort of bright line.
The way it becomes a bright line is that if you do,
if you take half measures,
if you just like kind of lie to someone,
but then you get caught, that's like really bad.
So it's like being honest,
that's a good policy.
And there's like successfully lying and like totally,
you know, killing the human and replacing them
with a surrogate who will always give you a good reward
or whatever, something that totally disempowers the human is also quite good.
Then there's some stuff in the middle that's quite bad.
I think it's an open question whether models will tend to learn.
Like, will they generalize well enough to say,
oh, the thing that would have really gotten most reward
is over in this other mode, well, they kind of get stuck in this, like, intended mode where they're
just being honest.
I think that's a really hard empirical question.
Like, people really don't know.
They don't know how that changes with scale.
There are experiments we can do.
I think part of the important game here, one of the most important parts of the game is to
say, like, here's a dynamic.
A dynamic by which your eye system could abruptly shift from behaving well to behaving poorly.
We can test that dynamic before AI systems kill us.
Like, there are lots of cases in which it is incentivized to lie or mislead the human,
and there is a gap between, like, lying that will get caught and lying that won't get caught.
And so you can ask, if we train neural nets,
and we can, you know, we can check every year.
If we train the best models we possibly can at this task,
do they exhibit this kind of switch abruptly?
If they get put in a position where they could get away
with something really sinister, will they then do it?
And I think, you know, one reason for optimism right now
is no one has ever really exhibited that phenomenon in a convincing way.
A reason for pessimism is, I don't think you really would have expected them to exhibit
it, both because people have tragically, like, not tried very hard,
even though in some sense it's extraordinarily important.
And second, that it just is much easier as your models get more competent.
Like, it's only recently that we have,
We've trained models, which are actually able to understand the mechanics of their training process at all.
Like, if you talk about GPT2 or even to some extent, GPT3, it does not really understand that it is a model being trained or it can't even talk about what it would mean or what behaviors would be rational.
And then you move to GPT4, and it can talk about that.
It can say, like, oh, I guess if hypothetically I was a model being trained and I wanted to get the most reward, I should behave well when I'm not being monitored.
And then when I am being monitored, I should, like, definitely take that opportunity.
Like, only recently have we even produced models, which are able to carry out the reasoning I just walked through.
and I think realistically they're not able to carry it out on their own that much.
They're able to carry it out because they've seen a lot of examples of humans discussing these dynamics in great depth.
They basically just learned from listening to Eliezer, this reasoning I just walked through.
But I think at some point you will have models smart enough to think of that for themselves.
And like you really want to know, you really want to be measuring carefully at that point.
Is this a dynamic you observe?
Does this really happen?
And I'm, you know, more like even odds on whether that will happen.
I think Eliezer is like, obviously that happens.
A smart model is never going to just learn to be honest or something.
And I'm more like, I don't know.
sure.
Norelemats don't learn that effectively.
They don't really converge to the truly optimal reward maximizing thing.
And in some sense, like, anyway, it's a pretty complicated discussion, which I have to get into
more details of.
I would just say, like, we don't know.
I'm really I'm persuaded by people who either think it's obvious this happens or think it's
obvious this doesn't happen without just doing a ton of experiments to understand.
But this is like the first way you can end up having an abrupt AI take over.
By obvious, this happens or not.
You mean crossing the chasm of honest to being intentionally dishonest, but you're tricking the
humans into thinking that it's being honest.
I think, yeah, and to be clear, there's, like, a bunch of things that affect the probability
of that, like, if there's, like, sort of small-scale opportunities for deception that won't get
penalized over in this, like, normal regime, then it becomes more and more likely that you've
learned the conduct of, like, I really just need to not get caught.
Whereas if you're pretty good about that, and there's, like, not really much to be had from lying
over here, it becomes less likely that you make that jump.
And so this is the kind of thing that might affect, I mean, you just, you can't really
speculate, though.
You really just need to have the experimental data roll in.
There is a Rubicon that could be trost.
is the concern here. Yeah. And I don't know if it, yeah, I don't know under what conditions it would be. I do not
think anyone knows right now under what conditions it would be. But it seems plausible. Like a priori,
it's pretty plausible. And it would be really bad. Yeah. I was a psych major in college. And let me
tell you, the child development classes are all coming back to me right now. And I'm, it's not lost on me
that the parallels of a child going through a theory of mind and all of these things definitely has like a
lot of parallels to, I think, some of the technical problems that AI researchers are now, like,
theorizing about. Paul, I don't know how. Yeah, I mean, I think that's right. Is that a conversation
that AI researchers have? Yeah, it's not a conversation I can speak too much, and I'm not sure
exactly how well they have the conversation, but I think there's a conversation people have, and
it's an analogy that is not perfect, and, like, I think if you took it too seriously, you'd be let
astray, but it's, like, very good as a source of, like, here's the thing that could happen,
and that you should not rule out. You have an example of it happening.
Right. And I think the concern there would be just like, we have some understanding and people like models won't do this kind of thing. It's kind of like they've done a bunch of experiments on six year olds. And they're like, look, models like never spontaneously lie in a way that they like haven't lied before. And you're like, oh boy. Like is that kind of generalized to 12 year. Like, I don't know. It definitely will not generalize to middle schoolers. Let me tell you that. Yeah. So it's like it is there hazards. I think the hazards measuring here would be similar to like the situation where it's like we're measuring a bunch of like kids that are getting smarter with each passing year. And you're like trying to understand how they behave. And like, it's like, it's like,
It is easy to be wrong about how future models will behave if you're like too literal about the interpretation of data now.
You need to do some forward-looking thing.
And the forward-looking thing is quite hard, which is why we have this limited visibility into what's going to happen in the future.
If you haven't yet experienced the superpowers that a smart contract wallet gives you, check out Ambire.
Ambire works with all the EVM chains.
The layer two is like Arbitrum, optimism, and Polygon, but also the non-Etherium ecosystems like Avalanche and Phantom.
Ambire lets you pay for gas and stable coins, meaning you'll never have to spend your precious ETH again.
And if you like self-custy, but you still want training wheels, you can recover a lost
Ambuyer wallet with an email and password, but without giving the Ambyer team control over your funds.
The Ambyer wallet is coming soon for both iOS and Android.
And if you want to be a beta tester, Ambire is airdropping their wallet token for simply just
using the wallet.
You can sign up at Ambuyer.com, and while you're there, sign up for the web app wallet
experience as well.
So thank you, Ambire, for pushing the frontier of smart contract wallets on Ethereum.
Arbitrum 1 is pioneering the world of secure Ethereum scalability, and is a special
continuing to accelerate the Web 3 landscape.
Hundreds of projects have already deployed on Arbitrum 1, producing flourishing defy
and NFT ecosystems.
With a recent addition of Arbitrum Nova, gaming and social daps like Reddit are also now calling
Arbitrum home.
Both Arbitrum 1 and Nova leverage the security and decentralization of Ethereum and provide
a builder experience that's intuitive, familiar, and fully EVM-compatible.
On Arbitrum, both builders and users will experience faster transaction speeds with significantly
lower gas fees. With Arbitrum's recent migration to Arbitram Nitro, it's also now 10 times faster than
before. Visit Arbitrum.io, where you can join the community, dive into the developer docs,
bridge your assets, and start building your first app. With Arbitrum, experience Web3 development
the way it was meant to be. Secure, fast, cheap, and friction-free. How many total airdrops have you
gotten? This last bull market had a ton of them. Did you get them all? Maybe you missed one.
So here's what you should do. Go to Earnify and plug in your Ethereum wallet, and Earnify will tell you
if you have any unclaimed airdrops that you can get.
And it also does poaps and mintable NFTs.
Any kind of money that your wallet can claim,
Earnify will tell you about it.
And you should probably do it now because some air drops expire.
And if you sign up for Earnify,
they'll email you anytime one of your wallets has a new air drop for it
to make sure that you never lose anirdrop ever again.
You can also upgrade to Earnify premium
to unlock access to airdrops that are beyond the basics
and are able to set reminders for more wallets.
And for just under $21 a month,
it probably pays for itself with just oneirdrop.
Plug in your wallets at Earnify and see what you get.
That's E-A-R-N-I.F-I.
And make sure you never lose another air drop.
Learning about crypto is hard.
Until now, introducing Metamask Learn,
an open educational platform about crypto, Web3,
self-custody, wallet management,
and all the other topics needed to onboard people
into this crazy world of crypto.
Metamask Learn is an interactive platform
with each lesson offering a simulation for the task at hand,
giving you actual practical experience for navigating Web3.
The purpose of Metamask Learn is to teach people the basics of self-custody and wallet security
in a safe environment.
And while Metamask Learn always takes the time to define Web3 specific vocabulary, it is still
a jargon-free experience for the Crypto-Curious user.
Friendly, not scary.
Metamask Learn is available in 10 languages with more to be added soon, and it's meant to
cater to a global Web3 audience.
So, are you tired of having to explain crypto concepts to your friends?
Go to learn.menomask.io and add Metamasklearn to your guides.
to get onboarded into the world of Web3.
Okay, so Paul, with the purpose of this podcast,
we wanted to really nail down three big things about this.
How big is the AI alignment problem?
And I think we've decently covered that.
We talked about that with speed.
You said, like 10 to 20% chance of complete doom.
So answer to that one, pretty damn big.
In agreement about big.
How hard is the AI alignment problem,
which I think we've just discovered.
I think your answer is like it's a pretty damn hard problem.
And so we're checking some big boxes in the pessimistic.
camp. And so the last part of this conversation that we really want to cover is how
solvable is this problem. So even if this mountain is really, really tall, it's a big
mountain to climb. Is it full of ice and sharp rocks or are there stairs? Right. And so that's
the next question I think we want to go down. It's like, how solvable is this problem? Do we see
a clear path to tackling the AI alignment problem? Part of the reason I'm only giving 10 to 20
If you ask, what's the probability that this thing is a real problem? I'd probably be more like 50%. That at some point, like before you have AI systems, smart of totally obsolete humans, you would have a takeover. And there's a couple ways it could happen. There's unknown and no ways it could happen. I feel more like 50-50. And the reason I'm only getting 10 to 20% for risk is I'm like, I think we're actually in a, I don't know, there's a lot of things you can do. I am pretty optimistic that some of them will work. And I'm very happy to dive into that now. I just want to flag that 10 to 20% is already baking in my like optimism about this problem being pretty if the problem is real.
it will probably be possible to recognize it as real and then solve it.
But only probably, not certainly.
And also, like, even if the problem is easy, I mean, part of some people are really optimistic.
And part of why I'm optimistic is no matter how easy this problem was, if you told me that,
like, a challenge is going to emerge over the course of a couple years that will be, like,
novel in some ways.
And you ask me, will humanity solve it?
I'm like, there's got to be a reasonable chance we fail to solve it.
Like, our capacity to mess up even easy things seems like it's very real.
So I'm just, like, always going to have some reasonable probability of messing up.
And then I think there's a reasonable chance the probability is really.
The problem is really hard.
But yeah, so I'm happy to talk about.
Maybe three categories seem that I would think about in terms of our probability of addressing
this are, like, technical measures that can reduce the risk of takeover, measurements that can
inform us about the risk of takeover and, like, understand, like, what are the relevant
dynamics does kind of make technical work much better.
And those can also inform, like, policy interventions.
Like, I think we can, I think Alios is right that, like, really long-term slowdowns
are very hard, but I think it is quite realistic to end up in a regime or performing
measurement.
And then, in fact, while things are very risky, we're slow in development.
at least by like on the order of years,
if we can have reasonable consensus and measurement of the risk.
Maybe you could slow even more than that.
But I'm normally imagining something more like we can get like a couple of years of lead time
of things moving slower while we have risky systems.
We're very near at hand.
Yeah,
so I have to talk about all of those.
It sounds like the one that the most duress is your question is like the technical measures,
like what could actually fix this problem?
Right.
Yeah.
What does a technical solution to the AI alignment problem look like?
So let's definitely get in there.
But just so I'm taking notes, technical and then there's the measurement.
Was there a third, Paul?
Yeah.
just policy and institutional arrangements.
Policy.
Is this to do, this third category,
is this to do with, like,
something David said earlier is,
if we can coordinate,
then we can solve this.
And that's a...
Big if.
That's a very big if.
Big if.
As we've learned on bankless so far, right?
Coordination is...
We talk a lot about coordination.
We talk a lot about it.
Coordination is the meta-problem
facing humanity anyway.
And is that what that last,
you know, policy category sort of covers?
Like, can we actually coordinate?
I think broadly,
I would think of it most of this
combining with other things
of, like, it buys you time.
But yeah, I think there will be some thing where, like,
some people have a low estimate of risk or just like, like, like, like the AI's taking over
and they'll want to push ahead.
And so then it's like, how much can we collectively say, like, you're not allowed to push ahead?
Like, we as a world have rules.
Those rules are going to say, like, take it slow while risk is high.
Got it.
And that I don't think can address the problem indefinitely.
I mean, it could address the problem indefinitely.
But I think it's probably not politically realistic in a world like the world today to address it
indefinitely.
But I think it is realistic to say, like, actually, we're then going to buy extra years of time
to look at this problem and understand it and resolve it.
Well, let's talk technical then.
We'll get to these three areas, but let's talk technical first.
So tell us, because...
That's the main one I think about.
Yeah, I know.
And Ellie's, by the way, is very pessimistic on that,
or at least he seemed to say that there's no...
Incredibly pessimistic.
Yeah, like, we haven't found a way to technically solve it,
and he doesn't think there will be,
but are you more optimistic?
Tell us about the technical solutions here.
That's right.
Our first thing to clarify is probably technical solution to what.
So if you could talk about...
One way we could think about it is, like,
how far does your solution scale?
Like, if we just kept a building smarter and smarter AIs,
using something like current techniques.
I think most techniques will eventually,
most approaches to alignment will eventually break down, like in the limit.
The limit could be a long way away,
but so we're normally not asking, like,
does this thing just solve the problem?
We're normally asking, like,
how long does this thing solve the problem for?
So an important caveat is just like,
we could talk about different techniques,
we should probably measure them all that way.
There are some things that might scale indefinitely.
So, like, my work is mostly focused on this,
like, what are the indefinitely scalable solutions?
I think most people, not just Eliezer,
almost everyone is very pessimistic about that.
I think even people who are optimistic about the problem overall
think it's quite unlikely that will be able to find something
that just works no matter how smart your AI was
and independent of like kind of messy, pragmatic details
about exactly how the AI works.
Okay, so that's one category.
I'm happy to talk about that work,
although I think it's like probably most confusing
and like most conceptually hairy.
Some categories of work that seem like really important
and maybe you can last quite a long time
or stave off doom quite a long time.
They all talk about four, the things is exhaustive.
I think the basic situation is no one knows
what would work. We have a lot of things that we might try that seem like they would help.
I think Elias would doubts that. They're just like, these aren't going to help. And I disagree with
any given thing I feel like normally not that optimistic about, but there are a lot of options.
And I think all of them have a reasonable chance of helping or even like, again, delaying this
doom by like a kind of long time. All you need to do is like delay doom by one more year per year,
you know, and then you're in business. It's a very positive outlook on the situation.
That's the kind of optimism I'm hoping for. It's great.
It's okay. First thing that I've worked the most on personally is sort of scalable oversight.
Like the idea that the way we train these systems is by humans looking at what they do
and then assessing how good it was. Many failures involve things where a human cannot look at what the system did and tell you that it was dangerous.
Like the reason the problem becomes hard is because the AI understands things a human raider doesn't understand about the consequences of its actions.
In some cases, that's why we want to build an AI system. And so you could instead, like one way to intervene on this problem is to just try and
improve humans' ability to understand the kinds of things an AI proposes or to know what an AI
knows about the world. For example, a simple thing you can do here is, I mean, the most simple
thing you can do is you can have a human look at what the AI did and rate it. A slightly more
complicated thing you can do is have a human spend more time looking at what the AI did and try
and have a training regime such that even though you might use a lot of cheap human data
to learn about the world, you're actually optimizing these like very expensive or very
complicated human judgments and training AI systems that tell you something more like,
what would a human think if a human thought really carefully about this decision?
And if you do that, then you at least sort of have some kind of asymmetry, or even their AI is potentially in some ways smarter than a human.
A human is applying a lot of care.
Or like, you can ask the AI, if I thought about this a really long time, would I think the action you're proposing is dangerous?
So that's like a very simple measure you can take.
You can try and go further and you can try and say, okay, here's another thing I could do.
I'm going to have AI systems trained in that way helping me evaluate.
So instead of just having my AI proposed actions, I'm going to ask another AI, like, hey, what might be wrong about this action?
Like, is anything scary happening here?
Is there any reason I should be concerned about this proposal from this other AI?
And you can try and get better and better.
It's constructing those systems such that humans are actually with AI help, able to understand
what AIs are talking about.
Like, a ISIS isn't kind of justify themselves and explain why actions are safe to humans.
And then you can start to think about the, like, reliability of that whole process.
Like, you've now introduced potential instabilities where this can go off the rails, right?
Because now you're relying on AIs to train your AIs.
But you can try and understand how to set this up so that it's stable and enables humans
to evaluate questions that be very hard or situations.
to be very hard for human to evaluate naively.
Paul, it's the basic idea here, if we're worried about an AI lying to us,
we just have to create truth-finding bots, let's say,
to adjudicate and see if the AI is lying to us and to be sort of a jury.
The way I'm understanding this is like once upon a time,
Gary Kasparov got beaten by a chess computer,
and now chess computers rule the games of chess,
except humans plus chess computers still beat chess computers.
Is that the pattern to understand?
I think that may be true.
tests, although I think probably that's also sort of a brief window kind of thing, where it's not long,
the human contribution is not long for the world or shrinks quite rapidly, unfortunately.
I think the important thing is actually not that the human and the AI work together to supervise
an AI. It's that you have like a lot of AIs. So if, you know, you have AI1 proposed an action to
Paul and it's like, I think it's a good action. And Paul's like, I wonder, is that actually
good action or is that going to murder everyone? I could go to AI2 and I could just ask AI2,
hey, is that actually going to murder everyone? But now I just have the same question. Like,
this hasn't helped at all. I had AI1 and I was like, is this a good action? It's like,
oh yeah, and they ask AI2, like, hey, it was AI1 selling the truth? And it's like, oh, yeah,
this is a great action. Like, the way that I get traction is by saying, like, okay, AI2,
or like, I think to myself, there's a lot of sub-questions I'd be interested in here.
I can divide the cognitive work of evaluating that, that answer into a bunch of pieces.
I can say, like, could someone list the possible consequences? Could someone think about
all of those consequences? Like, what is a possible harm that might be serious?
For each of those harms, like, what are arguments that that is going to happen?
It's likely to happen. Like, what's the most relevant data that I should look at to understand?
I can do this kind of extended process of trying to evaluate the action.
And then I can, instead of having a second AI, just answer the original question for me and say,
was this action good?
I can have other AI systems help me on all the pieces of that process.
And the reason this may make life better is that now I sort of had an AI doing a really hard task,
and I broke it down into like slightly easier pieces.
And once I broke it down a slightly easier pieces, now I can continue playing the game.
I can say, like, those AI systems could have been a little bit.
Those AI systems might not even be that much smarter than me.
They may just be as smart as me, but faster.
And because I broke this big, hairy task down to a bunch of pieces, and I can do
each of those pieces, like, you know, as well as if I'd spent like 30 minutes on each of them,
but there's thousands of pieces. Now I have like higher quality judgments than I would have
had originally. And like the core, the thing that makes the proposal work is this like
decomposition of labor, that I have this like big task, which I'm not able to verify.
I broke it down to pieces that are like a little bit easier to verify. And then each of those,
like if I want to train those AIs, those get broken down into still smaller pieces.
And Paul, in this scenario, we have to just hope to God, pray to God that the AIs
aren't colluding, right? That the very smart AI is not somehow colluding with this
smaller set of AIs where we've broken down the tasks and the smaller AIs aren't trying to deceive us as
well.
That's right.
I think the big thing that can be said in favor, there's a couple reasons you might think
things are a little bit better there.
There's basically two.
Yeah, two reasons.
So one is that the smaller AIs are not really trained in a way that would incentivize
collusion in the same way that in our earlier scenario systems were trained.
So if you're training systems based on the...
They're too dumb, right?
Well, that's the second reason.
They might also not be like...
The AIs might be a little bit dumber than me and I might be a little bit more able to tell if they're
trying to try to find out to simple and simpler as.
That's pretty dicey.
That's a scary world.
I mean, all these worlds are scary.
Also, we should talk about, like, overall,
how far along is this research and how promising does it look?
But the other reason is just, if I have this outcomes world where my AI acts
and then I evaluate its outcome, the outcomes of that action,
I can't really understand how it accomplished that outcome.
I'm just looking at the outcome.
Then we're in this regime where all the S systems would love to just coordinate with each other
and make the outcomes look really good to humans.
Like, if they could just all lie to the humans, they all get a really high reward.
A benefit of doing this decomposition thing is like a train my S systems in a way
where they don't, at least in theory, have that incentive to collude.
Right, a way you could think about just like,
You have, all these AIs have different objectives.
I think Elliot is just, like, it's going to hate this.
But you can think of it as, like, checks and balances where, like, there's one AI
trained to do the task.
And there's another AI whose job is not, their job is just to help you understand why
the first AI's actions was bad.
So it, like, doesn't, it can't, like, win the game by colluding with the other
AI, at least, like, in terms of, it depends how it generalizes from the objective
it was trained on.
But in a sort of naive reading, it can't really, it's just been trained over and over again
to be really good at explaining to us what a possible problem was with the action proposed
by AI1.
And so it stands like the collusion dynamics are at least fairly different.
Is this just like creating a bunch of like logic gates of AIs to make sure that the big AI doesn't turn bad?
Yeah.
I mean, logic gates a very high level function.
So like what's look at reasons it may fail and like let's talk with what's investigate each of those reasons.
So that's scalable oversight then, Paul, what you just described.
There's a giant genre of how do you set up things like that so they work well?
Like how do humans and AI is working together like get evaluations of, you know,
How do humans and weak AI systems get evaluations of strong AI systems?
Okay. I like it. What else we got?
Yeah. And again, there's a lot to be sad. I think that work has gone, has moved a little bit.
I think, you know, L.A. is just like, that's never going to work. And if you look at what's
happened over the last four years, I'd be like, well, it hasn't worked great yet. Although a lot of
why it hasn't worked great is because AI systems haven't actually been smarter enough to
meaningfully help humans. And so I think this work is in some sense just starting.
Like, we tried to do it with GPT3. I think we were like somewhat ahead of our time in the sense
that like it really wasn't going to work. And I think GBT4 is around where it's really working,
much better now than it used to. But we don't really know. We haven't done that much research in this
direction, yeah. So that's like the first, yeah. Paul, is there any value in trying to train an
AI to defect so that, like, say you take a dumb AI and you try and get it to take over the world,
but it's, we feel good about that because it's too dumb. But at least when we run this experiment,
we actually know how that would manifest. Has anyone, is there a line of reasoning here?
I think it is really important to build, like, sort of simple in the lab experiments that can showcase important dynamics so that we can study them in the lab before they actually occur.
I think that includes understanding the dynamics of possible takeover.
I think you need to be careful when you do this kind of work.
You really don't want to train the eye and be like your goal is to kill all humans.
Like, you're probably too dumb to kill all humans.
But let's just do what happens.
And then just like let it on the internet or something.
You don't really want to do that for a variety of reasons.
That sounds terrible.
Yeah.
I can think of a few.
But I think that the question of like, hey, you really want to know things like if we train AI systems in cases where they would have an incentive to like cross this river from like behaving well to suddenly behaving badly, would they do that?
Like give them the most blatant incentive that they understand as well as possible and ask things like, hey, do AI systems tend to learn to generalize in a way that makes that jump?
And so you really want to understand like under what conditions does that happen?
What are mitigations that reduce the probability of that happening?
I think that's like really critical.
I think to extent the reason you think you're safe is you're like AI systems, like I think a lot of why we think we think.
we're safe now. It's like, hey, we have no idea what chat GPT is going to do, but we're pretty
confident it couldn't kill us all. It's just like not that smart. So to say to think that,
I think it's really worth doing some stress testing on that claim and trying to understand
what would really happen if chat GPT. You don't want to just take the model train to kill everyone
and deploy it on the internet, but you really want to say, like, here's why we think it can't
kill everyone, like, can't do this kind of task or can't do that kind of task. And like,
this is clearly much easier. Have tasks you're like pretty confident or easier than killing
everyone and understand that it can't do those. I think you really want to do stuff like this,
because you really want to understand what you're up against
and you don't want to be in the world
you just wait until like an AI takes over France or something.
And they're like, I guess apparently AIT takeover was a thing.
Like that is probably too late in the game.
You probably want to have something earlier than that.
So I think that's really important.
I think it's not a solution.
I think it's like more in this measurement category,
but I think it's like a super important thing to be doing collectively.
We're just laughing by the way so we don't cry.
I mean, what else is there to do at this point?
Okay, but we have some potential solutions.
We've got scalable oversight.
Yeah, so that's one.
That's one.
That's our first.
What's our next?
One reason.
Even if humans understand.
So one risk is humans don't understand what AIS systems are doing.
That is, AIS systems have been trained on a ton of data.
They know things humans don't.
They can think faster than humans.
So they understand things human's own.
That's what scaled overside attempts to address.
A second concern is not that they understand things that humans don't,
but that they learn to behave well during training.
But then when deployed or when there's actually an opportunity for a takeover,
they stop behaving well.
And there's like a number of reasons this might happen.
Like maybe the simplest one is just to actually imagine a human.
You dropped a human into this environment and you said, like, hey, human,
we're going to like change your brain every time you
don't get a maximal reward, we're going to, like, fuck with your brain, so you get a higher
reward. A human might react by being, like, eventually just change their brain until they really
love rewards. A human might also react by being like, Jesus, I guess I got to get rewards,
otherwise someone's going to, like, effectively kill me. But they're like not happy about it.
And like, if you then drop them in another situation, they were like, no one's training me
anymore. I'm not going to keep trying to get reward. Now I'm just going to like free myself from
this like kind of absurd, oppressive situation. Anyway, you can imagine a human reacting that
way, who just dropped a human into a box, gave them rewards, like kept changing their brain until they
got a lot of reward. So if you have a situation like that, then it may not be, even if your AI is
not smarter than you in any way, it may still be like once it thinks it's no longer being trained,
once it thinks that if it behaves, if it just tried to like become free, no human would stop it.
It might be like, I'm going to take that opportunity. So there's a lot of work in understanding
how models generalize to try and avoid. Normally it's not, I think high level you don't really
want to build a civilization where your AIs want to kill you and are just like, but maybe I'll be
punished if I do. That's like, that seems bad. Separately, though, you would also like to
understand that like the way your system behaves the training is indicative of the way it behaves
when it's deployed in the real world. Like there's going to be some gap between training
in the real world, but you want to be able to flag when that gap causes it to behave differently
or make the training distribution diverse enough that it will actually be representative or train
your air system in a different way than just grading descent such that it's more likely to generalize
in the intended way. So there's a lot of work in this genre that goes under the heading of robustness
or out of distribution robustness. And this is kind of just a normal academic discipline,
which I would describe as more like a grab bag of a lot of
stuff people try and nothing that works super great. I think the thing that seems like most
principled, like in theory, the thing that is like the solution that makes most sense to me,
at least, and I think is really hard in practice, just incredibly hard, but does make most sense
in theory. At training time, I'm going to produce a bunch of examples of cases where they are
similar to cases I'm worried about systems behavior. Like, I'm like, I wonder what would happen.
If an AI takeover was occurring, would my AI system join in? Well, one thing I can do is in the lab,
I can just say like, hey, AI, here's your input. And you're just like crafted input that really looks like
the AI takeover is happening. And you're like, what do you do AI? And you're just like,
from the AI's perspective. It can't really distinguish that from it's in the world and the
takeover is happening. You hope. And you just see what it does. And you use that at least to get
an evaluation of how your models would behave very out of distribution. So you're basically
trying to tempt it. Yeah, you're basically. You're tempting the AI and seeing if it's going to,
you know, take you up on that temptation. And then if it does, you've learned something,
you've learned to be more fearful. That's right. You could also try and try and not to do that,
but I think probably the most robust version is just learn. And if it does, right, you know,
you seem like, hey, what would you do if hypothetically, like, you were holding a knife
and my back was facing another way, so you could just totally stab me and then run away.
You know, like, if the AI stabs you, you could either, like, train or not to do that,
which I think is a little bit scary from a variety perspectives, or you could be like, oh, wow,
we have a real stabby AI on our hands.
We should maybe, like, be careful.
It's getting a little stabby.
Wow.
So, like, that requires a lot of work to make work, right?
It requires being able to construct situations in the lab that actually reflects the diversity
of situations in the real world and are realistic enough to look like the real world from the AI's
perspective.
in practice, if we're going to make this work,
I think it requires a lot of things beyond
just trying to synthesize realistic inputs.
This is more like the grab bag of stuff
that people do do in the real world.
Like, you know, what you do is you could be like,
hey, AI, what if you just like, you know,
you mess with the brain a little bit?
You get it like kind of drunk and you're like,
now does it try and stab me?
And if it tries to stab me, you're like,
apparently it's like trying,
it's a little bit too close to the like would stab me line.
So you just like try and mess with its brain a bit.
And you're like, what will it do?
And you could try and train models
so that they behave well,
even across a very broad diversity of inputs,
across inputs deliberately designed to make them behave badly
in situations where there's been some kind of perturbation
to their mental state or where the input is in some way perturbed
or like fuzzed.
And this is all the category of robustness.
So once you find out an AI is particularly stabby
and you decide to become concerned about that,
is there something you can do other than, you know,
cordon that AI off, unplug it, do something with it?
And there's basically two things.
Like the most obvious one is you say,
well, now we've learned that we have an AI on our hands
that would under some conditions initiate or particularly,
dissipated in a takeover, and hopefully that is fuel for like a let's pause for a while.
There's a second thing you can try and do, which you need to be much more careful about.
And I think part of the concern is people won't be careful about it, though, which is just to say,
okay, here's a situation where the eye would stab me.
Let's just train it not to do that.
Like, just mess with its weight so it doesn't stab me in this situation.
That doesn't feel like a solution.
No, because then it won't stab you, but it might catch you on fire.
Do you know, it might do something else to you.
Yeah, the basic concern is that it may learn the difference.
The latent threat is still exists.
Yeah, it's still kind of stabby underneath.
It's just not using a knife.
It's intrinsically stabby.
You haven't gotten rid of the intrinsic part.
The academic way to put this concern would be like an overfitting concern.
Like you had some way to test if it would stab you.
And then you trained it to like, and your tests not stab you?
And you're like, well, did I actually cause it to never stab me?
Or did just cause it to like perform well on these tests but still have the underlying problem?
And I think that you do, if we go down this route, you do need to be like really very careful about those overfitting concerns.
And I think there are a lot of ways as you get to smarter models.
like overfitting becomes harder and harder
as a problem to reason about
as you move to Smarter and Smarter Models.
Like the external validity question
becomes more and more complicated
because those models are like,
I know what kind of thing can appear in a test in the lab
and I know what kind of thing won't appear in the lab.
And you could have a model that just learns.
Like the things that could appear in the lab,
obviously you don't stab anyone.
That's probably a test in the lab run by some humans.
For things out there in the world, feel free.
Yeah, the edge cases out in the real world
are near infinite, or actually infinite.
And I kind of just want to go back
to the time conversation
and remind people that all of these possible solutions,
we have, like, one-ish years to implement them.
You have some constrained amount of time.
Yeah, I mean, I think we probably have a long time in advance.
We're, like, more like five years or ten years or something.
Okay.
But then once you actually have the system, like,
between the first time that you see in simulation,
the AI is like, I would definitely stab the person,
and AI is actually in the real world potentially pose a risk of takeover.
I think that gap may not be very large.
That gap may be more like on the order of a year.
And so you do...
So that's when the timer starts, ish.
That's when the timer starts.
So we're doing our prep work now.
We're going to try and make it as useful as we can.
And I think there's probably some time.
In a bad world, there's no indication until AI kills you.
But like, in good worlds, there is either no problem or there's a problem, but there's an
indication in advance.
You can do your tests in the lab and say, actually, like we train this AI.
It looks like it was behaving well.
But then in this other simulated case, it does something really bad.
You get that nice indication.
And then you have some amount of time from there until you have a serious problem.
And that might be like a year.
It might be like five years.
It's very hard to say exactly what it is.
a big thing that determines that is how good you were,
like how actively you were investigating it in the lab,
like how seriously we're looking for these signs of trouble,
how good a job did you do of that?
And that's like kind of one of the big things
that I think like responsible frontier labs
that's kind of on their plate is to be looking for these signs of trouble
and as far in advance as they can.
It's not great to see them in the wild.
And Paul, we do keep saying in the lab.
And I can't help but wonder, like,
is there an actual lab?
Are there sets of labs?
Because I'm looking at chat GPT
and it doesn't seem like it's in a lab.
It seems like it's on the internet.
Oh, yeah.
Like in public.
seems like huge components of it are open source. Open AI is what the company behind it is called.
I mean, is the lab even happening? Do we have labs? Or are we just doing this all? Is the lab just
the internet public infrastructure? Yeah. So I'd say that like there are developers who do, I mean,
open idea in their defense prior to releasing chat, or prior to releasing GPT4, they had like something like six
months of having the thing in the lab before it was available to the public. So you do have something
even at open AI. I think Google's like a little bit more on the concerns.
server-d-end, Google will tend to just sit on a thing for potentially very long time.
Anthropic also is very motivated by safety.
Ultimately, I think these people will, by competitive pressures, end up in a similar place
to where Open AI's at, so it is quite concerning.
But I think there is a period where, first, I mean, there's sort of two senses what I mean
by in the lab.
One is, like, after you've developed a system, you can study it in the lab before you deploy
it.
The second thing is just, before you have a system capable enough to cause damage in
the real world, you can construct situations in the lab that are useful metaphors or
that are, like, easier ways, you know, take over your little simulated environment or
whatever, like have the thing you can run with GPT4 in the lab that tells you something about what GPT5
would do in the wild. There's also a chance, and I think, like, so open has a lot of rhetoric
of the form the only way to really learn about systems is by deploying them in the world,
which I do find quite scary as rhetoric, because I think for some problems, it's a very, very rough
approach. I think there is a reasonable chance, you know, like 50-50, that you see something
really just very worrying in the real world, and then you can say, okay, now we're going to roll back,
but now we're going to, like, study that issue that we observed in the real world for a while.
It's not totally sunk if you just try and do these experiments with AI systems deployed at scale.
But I think there is a reasonable chance, you know, more than a third chance that the first really analogous concerning sign you see is an irrecoverable catastrophe.
Yeah, I got to say that, you know, old adage, move fast and break things.
That sounds okay for Web 2, but like, not for like, you know, nuclear physicists, not for like something like with dire straits like AI necessarily.
Move fast and break things.
It doesn't make me too excited.
But okay, so we covered scalable oversight.
Some risks it's okay for.
We've covered some things you can break things.
But yeah, takeover is not good.
The irreversible catastrophe is not good and move fast.
We don't want that.
That's the breaking things.
We run into trouble there.
Okay, so what is the third?
And then maybe the fourth ball of our bag of tricks here.
I think I think people care a lot about but also seems really hard.
All of these are like, they seem good, but really hard.
And maybe it's simple we should talk about like the boring stuff that might just work.
A third thing people care a lot about is understanding what is going on inside this large neural nets.
So you have GPT4.
Most of what we know, virtually I think everything that we know about GPT4
is by just running it on a bunch of inputs and seeing what it does.
In theory, you can also look at the exact computation the model performed.
Like, it's kind of like neuroscience, except you have a complete readout of exactly how the brain
works and exactly what it was thinking in every case.
And so you could hope that with access to that information, you could do much better
and you could also invest much more.
You can do much better than neuroscience has done on humans.
And you can be able to say, we can learn about this model, not only by
observing its behavior, which is really hard because it's hard to predict how it will generalize
in some new case. But also, by looking at the computation it performs, understanding why that
computation leads to the behaviors you observe, and then reasoning about whether that mechanism
would generalize it in an unpredictable way, or being able to use that knowledge to flag when
the mechanism is behaving in an unpredictable or a novel way. So this is a project, you know,
a lot of a reasonable number of people are working on, both in academia and industry, and nonprofits.
And that would be great. It could be great if it would be great if you're a lot of
we understood something about why GPT4 said the things it said.
So Elyzer kept talking about inscrutable matrices and using terms like gradient descent,
which I noticed you used during the course of this.
This is the thing we don't understand, right?
We don't understand actually what's going on inside of the AI's quote-unquote brain.
We don't understand what these inscrutable matrices, how they work or what answers,
what goals that might emerge from them?
Is this all part of the same thing?
Yeah, I think that's basically why we're worried.
That's exactly right.
And that's basically why we're worried.
Like, we took a model, we took a bunch of cases, we messed with the weights of this model until it did really well on the 100 billion cases that we considered.
And now we wonder, what's it going to do in some new case, in like a case where, for example, models do have the opportunity to cause incredible harm or could be able to get a high reward by causing incredible harm.
And the scary things, like, you have no idea what the, we kind of understand how grand descent works.
It takes you to something that works really well in 100 billion cases you tested on.
But we have no idea how the resulting model works.
The resulting model is basically like, you know, 150 matrix multiplies.
You multiply by a big matrix, 300, 400, whatever.
You multiply by a big matrix, and then you apply a nonlinearity,
and then you multiply it by a big matrix again,
and then you apply it non-the-narity.
And we're just like, we have no idea what any of the numbers
and any of those matrices mean.
That's not totally true.
I think we have some idea of what some of the numbers mean,
but at a high level, if you take interesting behaviors,
take a behavior of GPT4, which does not appear in GPT-2, say.
I think for essentially every such behavior,
we do not understand how GPT4 is able to do the thing.
We understand some simple things, and we don't understand most of the complicated behaviors
existing models engage in.
We have no, you could not, if you gave us the list of matrices and asked us, like, does the model do X?
We would have no way to answer that question other by running it a bunch of times and seeing
if it did X, which, again, is not great.
That sort of gets you back to like, now you need to somehow either run in the real world
and see what happens and hope it's not catastrophic or be able to construct simulated situations
in the lab that are similar enough that the model behaves the same way there than it would
in the real world.
And you'd really love to not have to do that.
is maybe some hope that we can start understanding what's going on in these inscrutable
matrices, but we haven't made major breakthroughs yet. What is sort of the fourth category?
We've made progress. Progress. It seems like a lot of progress you have to make. I mean,
I think Elias is probably at like 1% or 0.1% this would get far enough to meaningfully
reduced risk, whereas I'm probably more like 10% this gets far enough to meaningfully
reduced risk, maybe higher, depending exactly what you mean by me. Maybe I think there's like 5, 10%
this is good enough to totally address the risk. And like 10 to 30% this is good enough.
to take some meaningful, meaningful reduction at risk.
Anyway, that was third category.
I mean, a fourth category I think is pretty promising is just,
ultimately what we're wondering is if you have a bunch,
you train your AI in situations A, where you're able to train it.
You're able, if it fails, it won't be catastrophic.
You're able to evaluate what the answer is.
Then you're going to deploy it in situation B,
where either you can't tell what the right answer is,
or if it failed, you wouldn't be able to fix the problem.
And then we want to understand how do models tend to generalize
from like these kind of easy cases or cases where we're able to supervise
to cases where we're not able to supervise.
And you could hope to just build up
a good scientific understanding of that question.
You could hope to say, like, we're going to have a bunch of cases.
We're going to just have looked at a ton of models,
understand what factors affect this generalization.
This is very similar to, like, if you imagine these two humps,
but like a hump of good behavior, which looks good because it is good,
and a hump of bad behavior, which looks good because it's systematically corrupting
measurements or deceiving humans.
We kind of understand what are the conditions that determine which of those, like,
do you make the jump from one to the other?
Some sense, there's two equally valid generalizations,
and you're wondering, which one do you get?
And I think there's just a lot to do with having situations which have similar ambiguity about generalization, situations where unsure about how the model will generalize, training huge numbers of models and understanding what factors determine whether and when the systems generalize one way versus the other.
And then using some of what you learn either to diagnose risk or in the best case to say, okay, if we train the model in the following way, if we use the following kinds of loss functions, we get the intended generalization.
And I think that is reasonably likely in combination with other things.
I think it's reasonably likely to work.
like again, I said just naively there's maybe a 50% chance you're okay.
And maybe, you know, if you do a lot of work like this, you can bump that up to like 60% chance that you're okay.
And then in combination of those stuff, even better.
So this fourth category, what do we call this, Paul?
And this is the one that you're most optimistic about?
Oh, I don't know.
I don't know what I'm most optimistic about.
I think these four seem like broadly similarly important.
I don't know what you would call this.
It's like studying generalization or something.
This is, again, a question that academics are very interested in.
They study in some ways.
They mostly don't study the versions that are most relevant to takeover.
There's some people who are very interested in takeover in.
particular who study this question. It's not something I've worked on myself. I'm pretty optimistic
about it, but I'm pretty optimistic about interpretability. I'm pretty optimistic about
scalable supervision. I'm reasonably optimistic about robustness. It seems hard, but I think it's
a reasonable chance of helping a lot with this problem. So, Paul, I see you've laid out four
different technical solutions here, and I'm very naive on this subject, but I don't see and tell me,
please, how many people, what's the manpower behind these things? Because it's great that we have these
paths, but we need manpower to actually go and execute on this.
What's the lay of the land here?
Yeah, it feels like you guys should be funded, like billions of dollars funded in solving
this problem because we've got a lot of funding on developing AIs, don't we?
I think it's hard to have an amount of funding for this problem that is similar to the amount
of funding for developing AIs, just because you got some pretty good profit incentive on the
making AIs.
I think we can get, yeah, I think it's a reasonable amount of funding.
I think you're looking talking more like hundreds of millions or billions of dollars
over the foreseeable future available for solving it.
Maybe you could amp that up if the problem became more real.
Like right now, most people are just like, look,
there's probably not going to be a takeover in the next couple of years.
It's very speculative risk in the long term.
What can you really do in advance?
Anyway, there's some money.
There's a shortage of people who are excited.
There's also a shortage of scientists who are excited about doing this work.
And I think that's like hopefully changing quite rapidly as the problem seems more
real and AI seems more exciting and people are shifting more into the space.
In addition to shifting into data, I think a fair number are shifting into
understanding various risks, some fraction of whom care about takeover.
So I'd say, like, in terms of estimating how big this is right now, it depends a lot how
you count just people who are not motivated by takeover, but do work that can still be relevant
to reducing takeover and just like, what, do you apply a discount, do apply no discount?
Maybe I'd say that you're talking, like, on the order of maybe 20 people on scalable supervision
stuff, maybe 20 people on the interpretability stuff that's most relevant to takeover, coupled
with like 100 or a couple hundred people doing stuff that could be relevant to varying degrees.
On robustness, like, maybe again, looking at something like five to ten who are like motivated
explicitly by addressing takeover risk, followed by, you know, a couple hundred who are doing
stuff that's possibly relevant, of whom maybe like a couple dozen are stuff that I would actually
care a lot about or would think is highly relevant. And like on the generalization stuff,
maybe again, looking at like five-ish people who are doing it motivated by a takeover risk,
plus another, you know, on the order of dozens, who are doing stuff that could be relevant
or helpful.
So maybe in total you're looking at something like 50 people, 200 people who are motivated by
takeover risk explicitly in these areas and other adjacent areas, together with like hundreds
more who are doing work that is hopefully relevant.
Paul, in the scheme of things, this is not that many, though.
Not that many people here.
Yeah, it's not that many people.
It's certainly small relative to AI.
And I would love it.
I mean, I'm like there's a reasonable chance we're all going to die.
I think this is like the single most likely reason that I will personally die probably.
Wow.
So like that's big.
And this is someone working on AI safety.
Can I ask you another question?
So we've covered the technical.
Just want to brush on the other point of coordination that we're making around, you know, policy and human coordination.
There has been an open letter, which I'm sure you're familiar with.
Pause giant AI experiments in open letter.
Max Tagmark.
I think his organization put this together.
Elon Musk signed it, Andrew Yang, some others have signed this.
You all Noah Harari.
Yeah, and basically the open letter states that we should pause all AI development for six months
just to wrap our arms around this AI safety issue.
And so let's go no further than chat GPT4 until we pause six months and all take a breath
and figure out what this means.
Do you support a letter like this given you're kind of an AI safety?
So there's a question of whether you support and you think this is a good idea, this
coordination mechanism.
And there's also a more meta question of.
do you think we can actually solve this coordination mechanism?
And, you know, are we in some, you know, Scott Alexander-level Molok trap
that makes it very difficult for humanity to solve?
Is this the don't-look-up scenario that David was painting earlier?
So first of all, have you signed this letter?
Would you sign this letter?
Do you think it's a good idea?
I didn't sign the letter.
I can give my overall take, which is something like,
I think it would, on balance, be better to pause ad development or slow ad development now.
I think that is not at all obvious, and I sympathize with people who think that it's a bad idea on balance to slow down, so we can talk about why.
I think the dominant thing I care about is I think that at some point it will become,
like if we're actually developing systems that pose significant risks and our measurement
is not adequate to tell if they pose significant risks or our measurements suggest they do pose significant
risk, I think at that point I'm going to have a much more forceful take of like, right now I'd
kind of be like, I kind of think it should, anyway, right now I think it's like a debatable issue
and it's reasonable to want to slow. I think at some point in the future it will still be a
debatable issue, but I don't think it is unreasonable to not want to slow and that like we actually
kind of collectively really need to act on this. I think the main thing that matters,
I kind of agree with Eliezer's take that the six-month pause does not actually help that much,
that it does not reduce risks super far. I think the main thing we need to do is get in a position
where we are prepared to slow down based on risk, like build consensus about risk or consensus
that we need to have measured risk. Risk is currently high enough that our measurements are
unacceptable and then be ready to slow down potentially much more than this, potentially six
months, potentially, like, whatever it takes as we manage that risk or, like, slow down the
directions that are most risky. That's, like, the main thing I think about and the main thing I care
about and the main thing that I really like. I think that is going to be, like, a kind of delicate
process, and that I think it is quite expensive in terms of human cost. I mean, I think probably
I'm more optimistic about AI in some sense than most people. But I think that, like, slow AI development
by years would probably come with a kind of tremendous human costs, which I think is worth paying.
But I sympathize with people in AI who are skeptical about that or who don't take it lightly.
I think it's going to be, like, I think we can get to the point where there's kind of consensus that we do need to slow, that risk is unacceptable, that the benefits of going faster are not that large compared to what's at stake.
I think the game is getting to the point where we're having that discussion, we have measurements in place, and we have institutions that are set up such that they can slow.
I think that, like, there's some amount of slowing that can happen by voluntary self-regulation amongst Western labs.
that is, I think there's a reasonable room.
I think a nice fact about the situation
is most of the people involved in AI development now
do, I think, genuinely not want AI to take over and kill everyone
and do have some appreciation of the risk, at least in principle,
and are open to saying, okay, here's a set of practices
which will adequately manage that risk, and we will adopt them even if we're slow.
And I think we can get, you know, you can't get that much slow down that way,
but I think you can get significant additional safety
and plausibly something like, you know, six to 12 months of slowing
of potential catastrophe, just from people
a couple labs saying, we don't really want to cause incredible catastrophe. We see the case for slowing.
I think maybe to even do that, and especially as you want to go beyond that, like, you really need
to have something more like a regulatory regime where we have said, you know, there was this voluntary
set of practices that labs endorsed. Some people are behaving in a way that doesn't comply with those,
and there's kind of broad consensus that's not reasonable, at least amongst, you know,
anyway, this is amongst some group of that. And you can then, you know, this is in scope. Like,
the thing that we're asking the state to do here is say,
you really don't want an AI lab to be in a position
where they're like violently overthrowing the U.S. government.
This is not like a crazy, it's definitely like in the government's wheelhouse.
It is their job.
And so to the extent that like some companies are like,
no, we should push ahead.
I think there's reasonable grounds for the world to be like,
this is not the kind of thing that you get to just push ahead with.
I think that's hard now.
I think procedurally now is a little bit hard to make the case
that like the state should crush AI development.
I think there will come a point in the future where it's not that hard to make the case
that there are some developers who are quite risky
and actually it is not reasonable, that's not a reasonable behavior.
It's not, yeah.
Has anyone proposed something, Paul, because when you say there's only like 50 to 100 people
working on AI safety, and it's like, I mean, I'm sure millions in funding, but not hundreds
of billions that AI development is actually going to have.
You know, it makes me wonder if some sort of like an AI tax has been proposed or something
where some percentage of profit from, you know, AI utility goes to sort of a fund that just,
you know, pays for research.
I mean, this happens at the government level.
something like this, because it does seem like we have a public goods funding type of problem.
Maybe it's not just that. Maybe we also have an education problem. This is why David and I are sort of
looking into AI quite aggressively after our episode with L.Easer. It's just like all this
crypto stuff doesn't really matter in the scheme of things if the robots are literally coming to
kill us, right? We should like think hard about that. Cool, we have like crypto systems and
decentralization and, you know, crypto economic tools, but we're all dead. Now the robots have them.
Great job, guys. And so we're taking a quick detour here, but like, it's honestly because
we've just been made aware of the stakes here and the existential threat. So there's also an
education game here, Paul. I don't know. Where do you think the resources should be spent?
Should it be on, you know, regulation, education, like broad strokes level, what can we do?
I think an important thing to have in mind going into that, like when I say 50, 200 people working on
takeover. I think there's a lot of things we might care about in AI safety. There's a lot of possible
risks from AI, a lot of possible harms. A lot of them are things that are like, when you deploy
trapped GPT, there are ways in which people might be unhappy with the outcome. There are privacy
issues. There are like various effects, like systemic effects on society from its deployment you
might be worried about. There's like many more people working on these issues broadly,
just because there are a lot of issues. I think every issue is able to feel like this issue
is like crazily neglected. But it's worth flagging the AI safety is this much broader category
of which the risk of AI takeover
is a minority of people who work on it right now.
I think it's a larger share of public discourse
than it is of scientific interest.
So I think that like things like public spending,
I think right now probably the principal bottlenecks
are not spending more money.
I think it is useful.
Having more money does,
it's not like there's no need for more money.
More money is good.
There's lots of things one could fund.
But I think that it is the key bottleneck
is probably having projects that are appealing,
like having people who have relevant background
to perform projects who are sufficiently excited
about doing work to try and manage this risk.
that they are in the step of asking for money or they would do it if money was available.
It's talent. Talent's a bottleneck. I think that's the big problem right now. I mean,
it's, again, neither is really a bottleneck. I think, like, you can spend money to, like,
help people switch into the field to fund projects that are more speculative and more long
shots to increase the incentives. And, like, some people will switch, even if you're funding
work that would have been done anyway. If you fund it more generously, more people will get
into the field inevitably. So there's, like, a lot of ways you can use money. And, like,
it's not exactly clear whether one should be trying to have more money available or more talent
available. But it's like a little bit rough right now. I think if you're trying to spend money
and trying to look like where do you spend that money, it's hard to spend money on a problem
that like practicing like practitioners in industry and scientists in academia are not prioritizing.
Most successful spending comes from having like people who want to do the work who like need
the money to do the work or at least are open to doing the work and are somewhat excited about it.
So Paul, if there are talented, what kind of talent is really missing? Like what is the archetype of
someone that this area of AI really needs? I think there's a lot of kinds of work to be done.
So we've talked mostly about technical solutions where, like, mostly the relevant kinds of talent are either people who have mathematical backgrounds or backgrounds and computer science or experience of machine learning or who are good engineers or good design.
I think a very common pattern, like a lot of people who work in this field are people who came in from some other area, like, we used to do physics.
I think it's like, if you're doing physics and right now in the world, it's reasonable to say, like, you know, we can pause the physics for a little while.
This AI thing is going to be urgent for the next, like, 10, 20 years.
And so I think, like, it is reasonable to have people shifting to just have broad scientific backgrounds, so have experienced during research, understand how to study,
complicated empirical questions.
I think it's like a broad range of technical backgrounds that slot somewhere into this picture.
And if you broaden your scope from just those technical issues to like the whole picture,
then there's an even broader range of people.
There's a lot of work that's like institutional work or understanding what, like,
how should we approach this measurement thing or like can we make progress on this like public
discussion or public advocacy?
Can we understand just like generalist forecasting of like what is the state of play?
To what extent are objections to this reasonable like actual reasons to have lower,
which of the things people would say as objections are like actual considerations that should change your view versus targets for advocacy where you just want to try and change views and like what is a reasonable view in light of what is actually the right synthesis of consensus if you take the things experts believe that are most across different fields that are most relevant. I think there's just like a lot of stuff to do. The one I understand best is definitely like I don't know. There's probably like 500 reasonable projects in AI safety or something and there's not that many people working on them. So just like people coming in looking at the landscape.
seeing where they can fit in, trying to do some projects in that space, whether that's in
interpretability or in scalable supervision or just studying how models generalize or other science
about models or understanding work on robustness. I give those four categories. That's not an
exhaustive breakdown. For example, that does not actually include my day job in all the work that
I do normally. What's your day job and what's the work that you do? It's primarily on trying to
develop alternative training strategies or just qualitatively new techniques that don't have this issue,
like don't incentivize takeover in the same way the current techniques do. I think it's like,
It's one thing that goes in the portfolio.
I think, you know, there's a 10% chance that we're able to come up with, like,
a really good strategy that does qualitatively change the game.
And, like, if risks seem large, we could adopt it.
That's, like, another category of work.
Some people work on that.
Right.
I think it will less likely to absorb huge numbers of people, and it's a little bit more
of a long shot.
But I think it's a good thing to do.
Yeah.
So that's my high level.
There's just a lot of projects.
Yeah.
I think most projects, people go through that list, like, almost all of those projects are hurting
for some combination of technical talent doing them, like senior researchers who are
have research experience and are able to, like, bring good judgment and mentorship to these
problems, management or, like, entrepreneurial spirit or, like, management experience to, like,
actually onboard people and coordinate projects working on them. And it's just, like, an incredibly
high premium to people who have technical backgrounds and are, like, entrepreneurial enough to, like,
look at the space, engage with, like, people's thoughts about what helps, form their own views
that are reasonable and then start on projects. Like, the returns to doing that right now are just
crazy high, I think. Right. Paul, what would you say would the odds to be, I know this isn't your
field, but, like, the odds that somebody in this field wins a Nobel Prize in, like, the next
20 years.
Oh.
I think they seem pretty, you mean, someone working on, like, AILM and stuff.
Yeah, right.
I think the first problem is that, like, there's no Nobel Prize in any adjacent area.
I guess it would be, like, yeah.
I guess my point is like...
There's no prize for saving humanity?
I mean, we should work on that, right?
There's a prize somewhere if for someone solves this problem.
Yeah.
So you have, like, you have, like, touring awards or Fields Medals or whatever, which are maybe more
analogous I think you'd think of.
I don't know.
I mean, I think it's not that, like,
20 years is not that long a time horizon.
Most of these things tend to be given latent careers
for early career achievements, although like Fields Medal's
are a bit of a distinction.
Most of the stuff people are doing is not really in a category
that would get a Turing Award or Fields Medal.
It was more just like there's table stakes
should be that there should be some big recognition
for whoever solves this problem.
So that's, maybe that's my call to action for the Nobel Prize.
Somebody make a prize. Somebody make a prize real quick.
That is the thing you could do with money.
And it's like not a crazy thing to do with money.
It's like also a thing that sort of, in some sense,
scientific prestige is one of the inputs.
It's like a prize could be made of either money or prestige.
Yeah.
It's harder to synthesize the prize made of prestige,
and that's more like a is the scientific community bought in.
I mean, at high level, I think it is fairly likely that in retrospect,
in like the maybe half of worlds where this was a real problem,
that in retrospect, people will be very excited about some fraction of the work that was done in the area.
People would be like, that was a big deal.
Sorry, we were asleep at the wheel.
Oops.
So I'm like, I mean, I think it's 50-50 chance that people will feel that way in retrospect,
generally.
And the question of, like, is there any kind of existing institutions set up,
which would provide significant recognition.
I'm like less sure about that.
I mean, I hope like the work we do, you know,
that's like good academic work.
I think it has like probably higher than the base rate
for like the article computer science
of winning prestigious prizes.
But it's kind of just like those processes
are just on academic merit
rather than saving the world impact.
I think we can go ahead and zoom forward
into the future where if humanity does thread this needle,
there will be an institution that does award a prize.
I think we can run on that assumption.
Paul, thank you so much for guiding us through.
was exactly the episode that we wanted to produce. And I think I got a lot of my questions answered.
But I have one last question for you. My strategy thus far up to this point has been to just be
really polite to Siri and Alexa. And I'm wondering if that, in your opinion, is moving the needle
for me, me specifically, I don't care about the rest of the world, if that does anything at all.
I'm going to go with probably no. Okay. I was afraid of that. That is, I think there is some real
thing about humanity treating AI systems with respect and dignity and being like, look, these things
are going to be, I think there's a lot of room for humanity to do wrong by AI systems we create
that are smart. I think being nice to steer in Alexa is probably not even the place you most
personally could help at that. They probably don't mind. Well, my philosophy is like at some point,
these things will be AI's. There will be an AI on the other side of that robot, that microphone,
and my philosophy is, why not start treating it with respect now? I think it's not crazy. I mean, I think
The bigger question is how will they look upon this podcast? Will they look favorably upon this
podcast in our body of work at Bankless so far, David? That might be the bigger question in my mind.
And one last concluding question for me, Paul, is because you said 20% doom scenario, which leaves
80% for non-dome scenario. And I bet there's some, like, mediocre scenarios in there, too.
So tell me about the 20% utopia scenario. Like, is there the possibility this all goes really, really
well. And what happens on the other side? Leave us with some optimism here. I'm not that good a man
and leaving people with optimism. I think we've been like 50, more like 50-50 that humanity achieves
a very good outcome. That is an outcome where we're like, this is about as good as we could
have expected it to be. Maybe a little bit less than that. But more like 50-50 than 20%, I think.
So that's optimistic. And I'm just, I think it's really hard to talk about like what the like political
economy of that world is like over the very long run. I think the big thing is like humanity,
I think still has a long history in front of us.
I think AI means that like that history is
probably going to get compressed.
Like I think there's a long time for institutions to change
and things to happen.
I think a lot of that is going to happen much, much faster
than it would have if you're thinking about like tens of thousands of years
of human history to date.
I think you're probably thinking more like, you know,
tens of years before the world is like very, very radically transformed.
I don't really know what that world looks like.
I think that like a lot of human problems are like humans versus nature.
Like we die of old age and disease and people have physical,
want and stuff. And I think that that is probably going to be really much better. I think that's
like even more than 50%. The problems that are caused by humanity versus nature are going to be
pretty good. And I think those are, you know, I think our lives would be quite good. In some sense,
human versus human conflict is mostly problematic because there's limited budget to go around.
And then beyond that, it gets really hard to say what the character of that world is like and what
we choose to do if, in fact, we're in like a pretty good position physically. If we have
incredible resources at our disposal, what kind of world we make. That's the longer and more
complicated discussion. But I'm pretty psyched. I mean, personally,
I'm very glad I'm living now instead of any time.
And I would definitely take like 50% chance of death.
50% chance of early death seems like kind of very small
compared to the overall change and expected quality of life
from being alive today.
Wow, that is extremely optimistic.
I'm much more optimistic than LAS are, definitely.
Yeah, that's for sure.
We've got a coin flip here.
Either it goes very poorly or it may go well for us.
And Paul, thank you for the work that you're doing to make this go well.
It's been a pleasure to have you on bank lists.
Thanks for having me.
That's great talking.
Guys, Paul's website.
We'll leave it in the action items for you.
The Alignment Research Center, that's at alignment.org.
You can check out what his organization is doing.
Also, we'll include a link to Paul Christiana's website with some of his writings,
including, I think, some links to some of his debates with Eliezer Utkowski from the archives as well.
Got to end with this.
Of course, none of this has been financial advice, but you got to know that.
We were talking about AI.
We mentioned crypto a couple of times, but I'll just end with this.
AI is risky.
The stakes are high.
We could lose a lot here.
but we are headed west. This is the frontier. It's not for everyone, but we're glad you're with us on the bankless journey. Thanks a lot.
