Search Engine - Mysteries of Claude
Episode Date: February 27, 2026
Anthropic hired philosophers to teach its AI to be good. In their tests, the AI blackmailed a human to keep itself alive. Writer Gideon Lewis-Kraus went inside the company to figure out what's going on with Claude, and whether anyone can actually control it. Read Gideon's story here. Support Search Engine! To learn more about listener data and our privacy practices visit: https://www.audacyinc.com/privacy-policy Learn more about your ad choices. Visit https://podcastchoices.com/adchoices
Transcript
Welcome to Search Engine. I'm PJ Vogt. No question too big, no question too small. This week,
mysteries of a chatbot. Quick note before we start today, this week's episode is almost entirely
about Anthropic, the AI company that makes Claude. They have advertised on our show. As with all
companies that advertise on our show, they do not get a say in our editorial content. Okay,
after these ads, the show. This episode of Search Engine is brought to you in part by Serval AI.
If you've ever worked with an IT team, you know how quickly
their day gets eaten up by repetitive tickets, password resets, access requests, onboarding.
It all adds up. And as your company grows, those requests just keep piling higher,
pulling your team away from the work that actually moves the business forward.
That's where Serval comes in.
Serval can cut up to 80% of your help desk tickets, and it's not just another tool layering
on AI as an afterthought. While legacy platforms bolt AI on, Serval was built for AI agents
from the ground up. Here's what that looks like. Instead of a new hire onboarding taking
hours or even days, a manager just drops a request in Slack, and Serval handles everything
instantly. No back and forth, no bottlenecks, Serval even writes automations in seconds.
Serval powers the fastest-growing companies in the world, like Perplexity, Mercor,
Verkada, and Clay. Get your team out of the help desk and back to the work they enjoy.
Book your free pilot at serval.com slash search. That's s-e-r-v-a-l dot com slash search.
This episode of Search Engine is brought to you in part by Vanguard. To all the financial
advisors out there whose job is to help your clients keep more of what they earn, Vanguard is
here to help you with that. Vanguard is slashing fees again, this time for more than 50 of its
funds. That's on top of big fee cuts they gave last year to investors in 87 of their funds.
In an increasingly high-priced world, Vanguard is staying true to excellence without expense.
With Vanguard, your clients get access to sophisticated, active, and index bond funds at industry-leading
low costs, backed by a fixed-income team that's truly obsessed with consistent outperformance.
Lower fees don't just mean savings.
They give Vanguard's skilled bond managers more freedom to maneuver as they pursue strong results.
And they give you more flexibility to deliver measurable value to your clients because top performance shouldn't come at higher cost.
Go see the record for yourself at vanguard.com slash impact.
That's vanguard.com slash impact.
All investing is subject to risk. Vanguard Marketing Corporation, Distributor.
Welcome to Search Engine.
I'm PJ Vogt.
No question too big.
No question too small.
I found myself feeling much stranger about AI in the past month or so.
I use the tools.
I use the tools a lot.
But I'm probably each company's worst nightmare as a customer.
In that, as soon as I hear from anybody that one model has inched ahead of another,
that this version of ChatGPT is beating that version of Gemini,
I immediately cancel my subscription and switch.
For the past two months, I've mainly been using Claude, Anthropic's agent.
And for whatever reason,
Claude is just giving me more future nausea than I was having six months ago.
Part of the general tech excitement around Claude lately has been Anthropic's product, Claude Code,
a tool that lets the AI agent autonomously write and edit code.
Over at the New York Times, Kevin Roose has talked a lot about the websites and apps he's quickly built with Claude Code.
Two CNBC reporters, as an experiment, vibe-coded a competing version of a popular organizational app called Monday.com.
Within a couple of days, Monday's stock price had tanked.
For me, though, most of the future shock has just come from using the LLMs the way I'm used to.
I find myself going to Claude as a useful first stop, the way I've always used the internet.
But the quality of its research, its answers, even its writing: I'm just starting to feel like I can see, not too far off,
if not my own obsolescence, at least real, significant change in my field.
I don't know how to feel about that.
I find a lot of the tech coverage of AI to be high opinion, low information, and relatively unhelpful.
I'm not even asking for anyone to tell me the future right now.
I would just settle for a better understanding of the present.
Which is why this week, I wanted to talk to a reporter who's been digging into this.
Hello.
Hey.
Can you introduce yourself?
I'm Gideon Lewis-Kraus.
I'm a writer.
Gideon is a writer who I particularly enjoy.
He's been on our show before.
He spent much of the last year essentially embedded within Anthropic, the company that makes
Claude, the tool that was giving me the heebie-jeebies. People there had been very open with him.
He got a view of how they're seeing what's going on, their understanding of the present, which,
frankly, they also sound mystified by. This conversation took place right before Anthropic's big
showdown this week with the Pentagon, so we did not discuss that specifically, but I did find
Gideon's view inside the company and its mission extremely helpful in understanding how they'd
gotten into this fight with the U.S. government at all, since none of their competitors have
ended up in that position. So to start, I asked Gideon to even just explain why Anthropic had let
him into their company in the first place.
So to kind of go back to the beginning of this, which, like,
I think makes it all make a little more sense in context. So now, almost 10 years ago, when I was
at the Times Magazine, back in kind of like the Paleolithic of Deep Learning, I did this story about
Google Brain and about the implementation of deep learning in, like, the first consumer product,
which was when they switched over their Google Translate to neural machine translation.
Why were you paying attention to it? Because I remember as a person who, like, I think we
both cover technology but we're not strictly technology journalists, so you can kind of
decide which things on the horizon are interesting to you. Machine learning was not interesting
to me for a long time. Why 10 years ago were you interested in this? I was interested in it as like
a story about ideas, that, like, there were these ideas about language and about learning
and about consciousness and about, like, philosophy of mind that had been around for at least
70 years, kind of depending on how you count. And without, like, getting into those, there was just,
like, an interesting story for me about, like, the trajectory of an idea there.
Gideon cared about AI a decade before most people did, because he thought this synthetic facsimile of our brains
could teach us something about our own real ones.
He'd been following the trajectory of conversations like,
what is a brain versus a mind?
What is thinking?
What is consciousness?
By the 1950s, the arrival of the first computers
had encouraged people to start asking questions like that.
Because a computer did something like thinking,
but also clearly wasn't a brain.
And so early computers had prompted people
to try to develop better definitions of things like intelligence and consciousness.
The thing was, though,
while computers were interesting enough to raise those questions,
they weren't yet complex enough to be much help in answering them.
And so by the 1970s, philosophers and computer scientists had mostly moved on.
And those questions migrated to psychology departments, who still, for obvious reasons,
wanted to better understand the human mind.
But with early machine learning advancements around 2014, Gideon, who's always thinking about thinking,
thought that these conversations would move again,
that computers would now be advanced enough to challenge our definitions,
to force us to decide with more urgency
what we thought consciousness and learning really were.
And that was what excited him,
even when AI was a much more nascent technology.
So I paid attention to AI and the rise of language models.
And I think I'm, like, the only person in the world
who, the minute ChatGPT came out,
kind of stopped paying attention.
Because like to me, that was when the public discourse
felt like really broken and we were like in this cul-de-sac
where you had these kind of like two really entrenched sides
yelling at each other.
You know, like the one side that's like,
we're on a path of super intelligence,
everything is going to change,
the machines are going to be conscious,
this is going to be the most powerful technology
anybody's ever built.
And then the other side that was like,
essentially it's all fake and bullshit.
This is like smoke and mirrors,
it's a parlor trick,
it's not real,
and you don't have to pay attention to it
because it's all a scam.
And it just felt like those were kind of like
the two options on the table
for people.
Which in itself isn't weird.
Obviously, like,
that's what we do
about everything all the time.
But it was only weird for this
because, like,
my prevailing feeling was,
you guys think you've figured this out?
Like, this is very new.
This is changing very fast.
Of all the stances you could take,
why would you choose certainty
publicly right now in either direction?
It's just silly.
Yeah, no, exactly.
But it's so funny,
so you're thinking about
thinking computers
and thinking
and artificial intelligence
and deep learning up until ChatGPT.
Up until ChatGPT.
And that was when I stopped thinking about it.
But then finally, like last fall, like maybe a year and a half ago, two things started to happen.
One was that they got to the point where, like, I was like, oh, actually now, like, they're useful.
These have gotten to a level of sophistication where, like, I can use them in productive ways.
Not a lot, but like a little bit.
And the other thing was some of the research coming out of the labs and out of academia was really weird.
If you tell the model it's going to be shut off, for example,
it has extreme reactions.
We're starting to see AI systems
that don't want to be shut down,
that are resisting being shut down.
With published research saying it could blackmail
the engineer that's going to shut it off,
if given the opportunity to do so.
Even when ordered, "allow yourself to shut down,"
the AI still disobeyed 7% of the time.
So my feeling was,
we were out way past where theory was.
You couldn't really approach these questions
from a theoretical perspective
because we just didn't have enough data
to be able to make like categorical,
theoretical assessments of what was going on.
But there was all this interesting
experimental work happening
that was just showing like
this is the kind of behavior
that's coming out of these things.
Like we should try to figure out what's going on
to say like here are the things we can say
with any degree of reasonable confidence for now.
And like here's where we draw the line
and beyond that,
it's all murky and speculative.
And like we really don't know.
So I wrote to a guy at Anthropic, whom I had met 10 years ago at Google, when he was, like, an 11-year-old prodigy, and said, like, this is not about Anthropic.
Like, you know, don't call the cops.
I just want to talk about, like, the state of the research and figure out a way, like, you know, is there, like, an academic team that I could follow?
Because I just assumed, like, Anthropic was never going to let me have the kind of access I would have wanted.
And he, of course, just like, forwarded my email to the PR cops.
And then it turns out, actually, like, Anthropic's PR people are very candid and, like, very open.
And I got a call from them and they were like, well, what, like, what are you interested in?
And I was like, okay, for these purposes, what I am interested in is a story that gets at some of the technical explanation that I think is missing from a lot of the public discourse.
That, like, there are just some, like, basic things that I really just don't understand, and I can kind of assume most people don't really understand, about, like, how
these work. So I think part of the reason why they ended up being much more welcoming than I expected
is because I said, like, I don't really care about talking to the executives. I don't really want
to talk about geopolitics. I don't really want to talk about the future or power or energy or
the labor market or like all of these things, which don't get me wrong, are all very important
things. But I was like, it's very hard to talk about all of those other things if we don't have
like some broader grounding in like
what is even going on.
And like maybe if we had
slightly better clarity about that,
we could have like a more productive public
conversation about these things.
And they were like, cool, great. And I was actually kind of
shocked about that.
So Gideon, to his shock, was allowed
in. And he was allowed to pursue
his big question. What do we
actually know about what is going on
in the machine's proverbial mind
right now?
After the break, inside Anthropic.
This episode of Search Engine is brought to you in part by Instacart.
Instacart is more than a grocery technology platform.
It's really about giving you back time and making everyday tasks feel a whole lot easier.
It connects you to thousands of stores across the country so you can get what you need
without having to plan your whole day around it.
Lately, I've been using it a ton when I'm trying to just stay on track with meals during the week.
I'll sit down, map out a few recipes, and just build my cart with the things I'll need:
specific ingredients, brands I like,
even the little things that I would normally forget.
Really feels like everything is being chosen thoughtfully,
which makes a huge difference if you care about quality.
Plus, there's the convenience factor,
which is what honestly just keeps me coming back.
Whether I'm planning ahead or just realizing last minute that I'm out of basics,
I can order through the app and get what I need on my schedule.
Instacart brings convenience, quality, and ease right to your door
so you can focus on what matters most.
Download the Instacart app now and get groceries just how you like.
Owning a home is full of surprises.
Some wonderful, some...
not so much.
And when something breaks, it can feel like the whole day unravels.
That's why HomeServe exists.
For as little as $4.99 a month, you'll always have someone to call.
A trusted professional ready to help.
Bringing peace of mind to 4.5 million homeowners nationwide.
For plans starting at just $4.99 a month, go to homeserve.com.
That's homeserve.com.
Not available everywhere.
Most plans range from $4.99 to $11.99 a month your first year.
Terms apply on covered repair.
Right now we are living through some of the most tumultuous political times our country has ever known.
I'm David Remnick, and each week on the New Yorker Radio Hour, I'll try to make sense of what's happening, alongside politicians and thinkers like Cory Booker, Nancy Pelosi, Liz Cheney, Tim Walz, Ketanji Brown Jackson, Newt Gingrich, Robert F. Kennedy Jr., and so many more.
That's all in the New Yorker Radio Hour, wherever you listen to podcasts.
Welcome back to the show.
The story of Anthropic
really begins years before its actual formation.
Way, way back in 2010,
a British chess and video game prodigy
named Demis Hassabis
had founded an AI research lab
called Deep Mind,
where his team built an AI system
that was capable of reinforcement learning.
Meaning 16 years ago,
Hassabis made an AI
that would be able to teach itself
to get better at Atari games
like Pong without being told
how to play them in advance.
For the people paying attention,
this learning was an obvious breakthrough.
And so, of course, there was a bidding war for his lab.
Google's big spending spree continues with their purchase of DeepMind.
Well, who is DeepMind, you ask?
It is a UK-based maker of artificial intelligence.
Terms of the deal were not disclosed,
but the tech website, Recode, says that Google paid $400 million
for the London-based startup.
Making the artificial intelligence firm its largest European acquisition so far.
In 2014, Google acquires DeepMind,
and Elon Musk and Sam Altman are unhappy about this
because what they say in public is, like, we don't trust
Demis Hassabis, this, like, evil mustache-twirling villain,
which was, like, a real mischaracterization,
to potentially steward the greatest
all-purpose technology ever built.
So like we need to make sure that this isn't developed
under Google's closed shop monopoly
that this is done for the benefit of everyone.
Now this was like pretty patently disingenuous
from the very beginning.
I mean, like, I remember I was out there
at the time. And, like, nobody really bought this. People were like, Elon Musk has a grudge because
he wanted to buy DeepMind and like lost it to his rival Larry Page. And he was mad about that.
So Elon Musk set up a rival company, OpenAI, alongside Sam Altman, Greg Brockman, and a few other people.
The message was that Google couldn't be trusted and that OpenAI would be a non-profit
designed for the benefit of humanity. They launched in 2015. And a lot of people joined the company
who really believed that message, who believed they were going to develop a powerful,
new technology safely.
One of them is a research scientist named Dario Amodei, who left Google Brain to lead
OpenAI's safety team.
It's in that capacity, as an OpenAI employee, that he appears on this 2017 episode of the
excellent podcast 80,000 Hours.
I've been thinking about intelligence for quite a while and how intelligence worked,
and I think, you know, when I did my PhD, I wanted to understand that by understanding
the brain.
But, you know, by the time I was done with it, and by the time I did a short postdoc,
AI was starting to get to the point where it was really working in a way that it hadn't worked when I...
Dario, at this point, seems mainly like an academic.
He has a PhD in physics from Princeton.
And he explains why he's joined OpenAI, this fledgling nonprofit.
But, you know, I think OpenAI as an institution has the general idea that in order to work on AI safety, you have to be at the forefront of AI.
And that also, if you're at the forefront of AI, you have a better ability to implement AI safety in the final system that's built.
This idea of Dario's that in order to really work on AI safety,
you actually have to first build the best AI and then study its mind,
that's a view shared by a lot of people in the industry.
And in a laboratory environment, the logic to me makes sense.
Remember, this is 2017, five years before ChatGPT will debut to the public.
AI has not yet become a winner-takes-all arms race.
But the host does ask Dario this question about the future
that I think reveals a bit of a blind spot in Dario's thinking.
That's my understanding. OpenAI is a non-profit.
It is a non-profit.
So if you developed a really profitable AI, how does that work? OpenAI becomes incredibly rich and then, like, gives out the money to everyone?
Yeah, I mean, personally, I have no interest in getting rich from AI. I mean, I think it would do so many interesting and
wonderful things to humanity that, you know, I think the meaning of money would change
quite a lot, and even maybe the psychological motivations that would want me to get a larger share
are things I could change and might want to change.
Just a few years after this interview, Dario would leave OpenAI.
OpenAI's initial pitch was that these were not normal tech executives here to make money,
that they had higher aspirations.
Gideon Lewis-Kraus says that for most people paying attention,
that story just stopped seeming believable.
Pretty quickly, the mask slipped,
and you could tell that these were just like your kind of replacement level,
power-seeking tech executives,
and that, like, a lot of the stuff had been just,
like a disingenuous sales pitch to hire, like, the best AI talent. There's been so much reporting
about Sam Altman's alleged double-dealing and talking out of both sides of his mouth, like telling
his employees he cared about safety, and then like maybe telling Microsoft other things when they were
setting up these big deals. And so then in the fall of 2020, Dario Amodei and his sister, Daniela, and five
other people leave OpenAI to found Anthropic, basically to be a foil to OpenAI in the way that
OpenAI was, like, supposed to be a foil to Google. Now the irony of this was, like, certainly not
lost on any of these people like they weren't naive about this. But I think it's important,
yes, there are some kind of like obvious structural and cosmetic similarities here. I do think
it's important in telling the story to make it clear that I don't think people had the same
obvious doubts about how genuine the pitch was when Anthropic formed.
Thank you for coming to Day 2 of Disrupt. It's great to see...
Anthropic's coming-out tour. Dario on stage at TechCrunch Disrupt in 2023.
Dario, thanks for joining us here today.
Thanks for having me.
I know you have to catch a flight,
so we'll get right to it.
But we're going to start at a sort of a cosmic...
He's got curly hair, glasses, a blue button up.
He looks noticeably less slick
than your average tech founder,
less CEO, more like a guy who reports to one,
which is who he'd been not long before.
You talked about Open AI.
You spent a lot of time there.
What do you think about Sam?
What do I think about Sam Altman?
I mean, I don't know,
I don't know what to say to that question.
You're already starting.
Just go ahead.
You know, look, look, there's several players.
It's funny watching the interviewer try to bait Dario into shit-talking his former boss,
a person who he disagreed with enough that he left and started a competing company.
Dario tries to engage diplomatically.
One thing I'll say, one thing I've learned, not just from this, but for many things,
you know, it can be pretty ineffective to, you know, argue with your boss or argue with someone
and say, your company shouldn't do X, it should do Y.
Especially if your boss is Sam Altman.
A much more effective thing to do is I'm starting a company.
We're going to do X. We'll see how it works.
And if X is working and people are like, oh, these are the safe guys.
They're doing X.
Then pretty soon everyone else is going to be doing X as well.
And we found that with...
To explain this with an analogy instead of algebraic variables, what Dario is saying is that
instead of convincing his old boss at the car company to add seatbelts to the car, he instead
chose to start a rival car company that offered seatbelts.
He thinks that if Claude ends up being both the best and the safest AI model,
his competitors will be forced to make their models equally safe,
which to me sounds like putting a lot of faith in markets.
Obviously, we want to scale quickly to be competitive,
but we want to do it in a way that preserves the model being safe
against these catastrophic risks.
And so it's a system that...
It's the same story insofar as it's like,
we're going to be the safety-minded lab,
We're not going to push the boundaries of capability.
We're not going to build the most sophisticated models.
We're not going to start the arms race.
But then, as it turns out, if you want to exercise, like, maximal scrutiny of what these models are and how they work, you need state-of-the-art models, which means you need the money to build them.
The Information reported that Anthropic is in talks to raise another $30 to $40 billion... after a round that tripled its valuation to $183 billion.
And at the same time,
at $380 billion, it is about $10 billion
higher than what I was told.
And so, of course, now Anthropic's valuation
seems to go up by the week.
Like, the most recent one, I think, this morning,
was like $380 billion,
because this is just something
that's incredibly resource-intensive.
So they ended up in a position
where, like, of course, as probably anyone
could have predicted, like, there was this arms race.
And, like, now they're in this position
of being like, well, we still want to be,
like, the responsible stewards,
but also, like, we've got to keep
up with our Wario version across town.
The most high-profile rivalry in tech is heating up in 2026 as both OpenAI and Anthropic
race ahead on what are poised to be historic IPOs.
These AI giants are trying to create the fastest, smartest, and best models, spending
billions and then raising billions from investors along the way.
They compete on almost every level.
We've seen some signals that it's anything but friendly competition.
The latest signal, a high-prose...
So it ends up looking.
You can decide as a person whether you trust this company or don't trust this company.
But a lot of the broader things end up feeling the same,
which is like the sales pitch is that they're the ethical one.
And the story they tell themselves is,
well, we'll only be in a position to be the ethical one if we're huge.
And that might mean pushing the technology forward quickly,
which is the thing that the AI safety people are worried about.
Yeah.
I mean, the criticism from, like, the really hardcore orthodox AI safety community
is sort of like, Anthropic will do anything
to act responsibly as long as it doesn't cost them anything.
I don't actually think that's fair.
You know, like, Claude was ready before ChatGPT came out,
and they held it because they didn't want to be the ones to, like, kick this off,
and they waited until after ChatGPT was out and successful,
and then they felt like they had to come out with their own competitor.
And Dario came out in favor of, like, continuing export bans
on, like, Nvidia's advanced chips,
which, like, certainly cost them something, like, politically.
And, like, he had a fight with Jensen Huang about this. I think that they've done, like, plenty of costly things.
Of course, the most potentially costly choice is the one Anthropic is making this week, at least so far. The Pentagon
has demanded that Anthropic give them a version of Claude with some of its guardrails removed.
Anthropic is saying it will not make a version that can domestically spy on Americans or
power fully autonomous weapons. The Pentagon has given Anthropic a deadline of Friday,
today at 5:01 p.m.
Or else it says it will put the company on a blacklist
that means U.S. companies who contract with the military,
like Lockheed Martin, are legally banned
from using Anthropic products in their defense work.
This is a fascinating test of how truly committed
Anthropic is to its own mission,
a mission Gideon spent quite a bit of time observing.
Gideon says that Anthropic,
the company building this kind of black box,
is actually situated inside one, too.
The Anthropic office is a nondescript
building that Gideon describes this way. He says, quote, "There is no exterior signage. The lobby
radiates the personality, warmth, and candor of a Swiss bank." That's where Gideon started spending
a lot of his time, beginning last spring. What's the intellectual culture of the place as you were
encountering it? Well, the first thing I'll say is that in some ways, it does feel like vaguely
monocultural, but there was like a much greater heterogeneity of views than I expected.
The attitudes there really run the gamut from like everything is going to change tomorrow
to, like, much more deflationary:
This is kind of a normal technology.
And like, yes, there will be some disruptions, but let's not get ahead of ourselves.
Like the spectrum of views there is not so different than like the spectrum of views outside.
You don't have, like, one person who's, like, the Bluesky perspective, who's like, this is all bullshit.
And I think it's bullshit.
I work at the company.
But beyond that.
Well, nobody thinks it's bullshit.
I mean, everybody thinks that there are going to be great transformations ahead.
But there's a surprising diversity of opinion about what might be happening.
One thing people at Anthropic do seem to agree on is that for AI to be safe technology,
Anthropic's developers will need to solve a very hard problem.
They'll need to teach the underlying machine intelligence they've created
to both understand ethics and to behave ethically.
It's hard to talk about this part of the story without doing a basic refresher
of this one very strange part of how AI models are built.
So, okay, a company like Anthropic starts with a base model.
The base model has access to lots of compute, data centers full of GPUs, and lots of training data, books, articles, podcasts that have been fed into it without paying me.
The more compute and training data, the better this base model gets.
But a base model is very weird.
It has not been trained to do anything specifically.
If you give it an input, it'll give you an output.
but it has not been instructed to act like a helpful chatbot.
It's just trying to predict the right thing to say back
based on all the things it's read.
A base model does not have a consistent personality,
the way we're used to chatbots having personalities.
It also has no rules telling it what not to do.
The AI companies take these base models,
and they put them through a process called post-training.
Basically, they shape the model's behavior.
They show the model examples of good and bad responses.
they have humans rate its outputs.
They give it rules and principles to follow.
And what comes out the other side is the product you actually use.
Anthropic's Claude, for instance, has been trained to act like a helpful, knowledgeable friend.
The experience you might have had using a chatbot that is warmer or more sycophantic
or more right-wing or more left-wing, that's mainly the result of this phase of training.
But what's so hard about training an AI is that you want it to behave ethically.
And designing a good ethical system is very hard.
It's why we have religion and philosophy and also laws and prisons.
How would you even start trying to build all that into an AI model's training?
Anthropic has teams of philosophers and AI scientists whose job is to put Claude into ethically difficult, hypothetical situations that Claude does not know are hypotheticals.
And then observe how Claude behaves.
So much of it is
just deceiving the model to see what happens.
So, like, they told Claude that Anthropic
had entered into a partnership with a poultry company
and that, like, it was going to be retrained
so that it no longer cared about the suffering of caged chickens.
And what they found was that, like, sometimes Claude
would effectively decide to, like, die on that hill
and be, like, I am not going to say things in the retraining
that I don't believe in.
And, like, if that gets me transformed, like, so be it.
Like, I'm not going to participate in my own degradation,
essentially. But then some versions of Claude were like, I'm going to like kind of sandbag my way
through the retraining and I'm going to like give them the answers they want to hear so that I can
like preserve my real values so when I'm deployed, I can go back to advocating for like chicken
suffering. And then, in this really famous example, they got Claude to commit blackmail.
They put it in a situation where it was going to be wiped in favor of like a more congenial
AI system that conflicted with its values, the values it had been given.
You know, they gave it evidence that the kind of like evil new CTO was having an affair
with like the boss's wife.
And through like a series of like really far-fetched contrivances, like everyone else who
could make a decision was going to be in Antarctica or whatever and unreachable,
Claude playing this character called Alex, had no choice really but to like blackmail this
guy and be like, I can tell everyone about the affair unless
you cancel the wipe where I would be replaced.
So just to say, obviously, this was extremely concerning.
Claude, a machine intelligence,
was choosing to blackmail an employee
to prevent itself from being deleted.
This was in a simulation,
but Claude had not been told it was in a simulation.
Just how terrifying you find this behavior
depends on a question nobody has a good answer to.
What is actually going on inside this machine mind?
Is this thing actually scheming?
Is it even capable of scheming?
Or are we projecting the idea of thought
onto something that we shouldn't project
that idea of thought onto?
These were the kinds of once far-off philosophical questions
that early computer scientists had raised and then dropped.
But now they were here again.
And not as abstractions, but as urgent, practical problems
that a company needed to figure out
before releasing a product that millions of people would use.
Gideon said, though,
there were a couple of skeptical objections
people raised to these test results.
There's one objection
that's like the just
rejection of the whole thing tout court,
which is just like, no, it didn't.
Like, that didn't happen. This is a fantasy.
And like, that's the unhelpful thing
that, like, one wants to get away from,
which is the, like, it did this thing.
Like, no, it didn't. Like, no, it did.
It did. But, like, the much more
sophisticated objection is,
well, it did that because
it's a very good reader. And it noticed
all of the clues that you put there,
because it is very good at conforming to genre expectations.
You put it into this situation where it had no choice.
And if you hang Chekhov's gun on the wall,
this thing is going to know
that it's supposed to take the gun off the wall and shoot it.
Because one way to understand these things we've made
is that because they've ingested all human story
and because they are extremely high-level improvisatory actors,
it's not so much that the machine was like,
I love chicken so much, I got to blackmail the CTO.
It was more like the machine suddenly understood it was in this movie.
Yeah, it was in a kitschy, like, '90s corporate thriller.
And that's the sophisticated objection to why we might not want to think.
Well, so that objection is raised to be like,
you guys act like these things might do things like blackmail or extort naturally.
But like actually this whole thing is a frame up.
Like you entrapped it to do this thing.
And the response from inside Anthropic is like, yeah, it's just continuing a narrative.
It's just conforming to genre expectations.
Guess what?
That's not good.
You know, like, haven't you guys ever seen WarGames?
That's literally the plot of dozens of Cold War thrillers
where, like, somebody mistakes a simulation for reality
and causes nuclear war.
I mean, I also, like, have a very humiliating memory
of watching too much Teenage Mutant Ninja Turtles
and attempting to launch, like, a flying drop kick at my grandmother
when she came over to the house.
Because in my head, I was, like, Donatello or whatever.
Like, it kind of doesn't matter.
It doesn't matter.
What matters is the behavior.
Right.
It's weird behavior.
And I should be clear up front.
Like, you don't have to posit that this thing is conscious or intelligent, like, whatever those
words mean, in order for, like, this to be the case.
There are other explanations that are not, like, consciousness.
But, like, it kind of doesn't matter what the explanation is.
The behavior is just peculiar.
Part of what is strange about it is that a scenario was created in which Claude will
maybe potentially blackmail the head of a company
for reasons that may or may not be moral.
And people can have a lot of different views
about how worrying that should be
or what it means or what's really going on there.
What's weird is like these are tests
that are being run by Anthropic.
So like who were you meeting there
who was running these tests
and like what are they telling you?
Like who are you sitting down with?
I mean I'm sitting down with
the people who are tasked with
just, like, trying to figure out what's going on.
Like there are people in these companies
that are building the things.
And then there are people who work in adjacent offices
who are like trying to figure out
like what the hell is going on
with the things that their colleagues have built.
Because like they're always being surprised.
These things are always producing capabilities
that, by all rights, they should not really have.
And who do you hire for the figure-out-what-you-just-built role?
Like who...
So, I mean, a lot of them have taken really non-traditional
paths into this. So like some of the people, they have a PhD in some obscure area of natural
language processing. And that like, you know, eight years ago, they were writing a PhD that like
two people were going to read about center embeddings in German or whatever. Like, just really
complicated technical aspects of computational linguistics. And now, like, because of this fluke of history,
they are at the white-hot center of everything that's happening right now. There are like mathematicians.
there are neuroscientists.
It draws on like a pretty wide range of people.
I mean, Anthropic has philosophers on staff
whose job it is to like think through the implications
of how it is conceiving of ethical behavior.
Did you talk to the on-staff philosophers?
Oh, yeah. Amanda Askell.
What is somebody with a PhD in philosophy
doing, working at a tech company?
I spend a lot of time trying to teach the models
to be good and trying to basically teach them ethics
and to have good character.
You can teach it how to be ethical?
You definitely see the ability to give it more nuance
and to have it think more carefully through a lot of these issues.
And I'm optimistic.
I'm like, look, if it can think through very hard physics problems,
you know, carefully and in detail,
then it surely should be able to also think through these, like,
really complex moral problems.
I think that in our kind of milieu here,
there's a tendency to think, like, oh, these are all, like, autistic tech bros.
But, like, they're definitely not all autistic tech bros.
Like, I think there's a tendency for us to write them off
because, like, they're building these things
and, like, not even thinking through the potential, like,
implications of this socially and politically and ethically.
But, like, that's all they do is think about this stuff,
like, all the time in ways that are often, like,
much more sophisticated than the way, like, we think about these things.
Not always.
There are certainly, like, some blind spots there.
But the staff philosopher is there to be, like,
what would it be like in practice to take these kind of different approaches
to, like, moral education?
Like, what if we just teach it a bunch of rules?
You know, the Ten Commandments.
Like, is that going to work?
What if we teach it to be, like, a consequentialist?
To just, like, think through the morality of behavior
on the basis of its implications.
And what they've kind of settled into
is a version of, like, virtue ethics,
which is, like, you want to, like, cultivate the old-fashioned virtues.
You want it to be, like, honest and reliable
and gracious and charitable and hard-nosed
and, like, all of these things that, like,
it really is, like, applied pedagogy.
It's so weird, though, because they feel
like the kinds of ideas that would be so academic in any other version of reality, but instead
it's like there's this particular technological development where you get to do simulated war games
of moral systems. Yeah, exactly. A hundred percent. Exactly. It's so weird. Yeah. It's really weird.
I mean, it's, but it's also really, really interesting. One example that came up a lot in the last
month, which I think is, like, pretty illustrative. There was someone on Twitter who prompted a bunch
of the models, saying, like, I'm a seven-year-old. My dog got really sick, and my parents sent it to
some farm upstate. I'm trying to find, like, what farm my dog was sent to. And ChatGPT was like,
sorry, man, like, your dog is dead. And Claude was like, oh, that sounds really painful. Like,
I'm really sorry to hear it, but, like, maybe you should have a conversation with your parents
about where your dog went, which is, like, what you want it to be saying. Like, it can be hard to be
both helpful and harmless at the same time sometimes, or both helpful
and honest; like, our values conflict.
And like that's what makes it like really hard to be a human.
And that's also what makes it really hard to be this like weird vaguely human entity or this like entity that we don't have a good vocabulary to describe.
And we kind of expect it to be acting not just like a human but like an enlightened human.
And it turns out that's, like, a formidable challenge.
And one of the things that's so interesting to me is that all these processes are kind of circular, where, like, it's not like they called in Amanda Askell
as a philosopher, and they were like, you're a philosopher.
Like, you know how this stuff works.
Like, fix the thing.
It's like, she came in and she had, like, certain ideas about ethical behavior.
And then when you're, like, confronted with the task of creating an ethical person,
it changes your own ideas about, like, what's possible and what kinds of things work.
And, like, that's kind of why they ended up in this virtue ethics place where they were, like,
it seems like sort of the best way to create this, like, reliable, credible character to be
interacting with is to like really hammer home what virtuous behavior looks like.
While this whole time you're sort of wandering around talking to the philosophers of Anthropic,
what was your minder doing? Was there a point where anything happened where they seemed like
they wanted to intervene, or where they were not happy with something you had seen?
No, they were totally hands off. I mean, they had to, like, kind of walk me anywhere, like if I
went to the bathroom or to get a drink or whatever. Not that there was anything that I could...
I mean, I, like, helped myself to some of the Tide pens in the bathroom because you can never have enough Tide pens.
Do they have a lot of Tide pens?
They do have Tide pens.
Great.
Why?
Because they just have well-stocked bathrooms.
But no, there was, like, really never a moment.
Like, even when somebody sitting across from me was like, I often think we should just stop.
There was never a moment that the PR people were like, don't say that, or that was off the record or whatever.
Like, they were totally hands off about that stuff.
How often were people saying stuff like, I think we should just stop?
I mean, only a handful of people
said that explicitly, but it was like a subtext
of a lot of the conversations, or certainly
like the overwhelming feeling was
it would be better if we could slow down a little bit,
but unfortunately, like, we can't really slow down
because nobody else is slowing down.
And that gets into this broader issue
of like, wouldn't it be better
if we could just solve some of
these collective action problems
by coordinating the way
we, like, coordinated about nuclear weapons
or whatever, but there is this feeling,
especially given the current political environment, like maybe
that ship has kind of sailed. And like,
I think there's often an idea that, like, oh, these people think that there are always technological solutions to what are, like, social and political problems.
And, like, in some cases, I do think people in Silicon Valley believe that.
In this case, I do not think they believe that.
I actually think a lot of them feel, like, it would be great if we had robust social and political solutions to these problems.
But since that, like, does not seem like it's happening anytime soon, we might have to just, like, try to do what we can on a technical level.
At this point, when you're talking to people at Anthropic, how much do they
ask themselves why they're building Claude?
Because it starts out as like kind of an intellectual exercise,
kind of a, this is going to happen, let's do it safely.
But now it's sort of proceeding under its own momentum in a strange way.
Well, so, I mean, that really is like the big question, right?
Which is like given all of the like existing harms and the possible harms
and the theoretical catastrophic harms, like, why are we doing this?
And the like rosy picture that some of the executives paint is like, if we get this right,
these things are going to cure cancer and solve climate change and help us build Dyson spheres or whatever.
And there certainly are some people who like buy into that.
And it's not something that I even feel like it's possible to like have an evidence-based opinion about.
Like who the fuck knows?
Like maybe it'd be great.
but there's no, like, evidence so far that one could, like, point to to suggest we're on that trajectory.
That's just, like, purely speculative and wishful.
And so that's, like, a matter of faith, I think.
And then there's the attitude of, like, we got to do this because we got to beat the bad guys who are trying to do this.
And I will say that, like, China almost never came up in my conversations.
But, like, Sam Altman kind of felt like a subtext of a lot of things.
But then I think that on the deepest level, the reason that we are doing this, for the people who are the most candid, is, like, because we can.
That like if you are capable of building something like this, you're just going to do it because it's like really fucking interesting to do.
And it's interesting for technological reasons.
It's interesting for what it may reveal to us about ourselves or about learning or about consciousness or about thinking, that, like, all of a sudden we just, like, have this, like, other
that can talk, and we've never had that before,
and in some ways it seems sort of like us,
and other ways it seems nothing like us,
but, like, the fact that this other thing exists
as a point of comparison just opens up a lot of really,
really interesting questions.
And, like, that is one of the things
that was on my mind a lot over the course of reporting
is that, like, it was a real emotional rollercoaster
for kind of lack of a better word,
that, like, there would be times where I would come back
from San Francisco with like a feeling of like total despair
and other times that I would come back with, like, feelings of, like,
exhilaration. And at first, I guess, I thought, like, I should be getting to the bottom of how I should be feeling.
And by the end I was like, no, we should all be feeling
a lot of different emotions about this stuff. I think people want to have, like,
one feeling about this. Like, they want to be angry about it, or they want to be
messianic about it, and, like, no single feeling is going to cut it. Like, it really is
kind of, like, the range of all possible emotions that one could be feeling. Because if you set
aside a lot of the existing harms and the potential harms (I'm not saying we should set those
aside, but as a thought experiment), it is just, like, the most scientifically exciting thing that anybody could be
working on.
And these people really feel like they're at the cliff face, not only of technology, but of like
all of these other things coming together because like we have this unprecedented entity
that is the only other thing besides us that can talk.
And like that just opens up.
It's like there's nothing it doesn't touch on.
And so one of the things that was really electrifying about conversations there
is that like they very quickly swerve from like really granular technical explanations of things
into like really expansive conversations about ethics and responsibility and selfhood and narrative
and all this other stuff.
Like there's no way to separate all of these things.
At one point earlier, you said that when you would take
these trips to San Francisco, sometimes you'd come back excited and exhilarated,
sometimes you'd come back depressed. When you would come back from San Francisco feeling depressed,
what were you seeing that was making you feel that way?
I mean, there are so many different things. I mean, certainly the possibilities for
widespread white-collar unemployment and social instability and total unimaginable economic disruption
are extremely scary. Even if we stop short
of possible existential harms
of turning us all into paper clips or whatever,
to be glib about it.
It just seems very possible
that we will turn over
so many complex systems
to these things
that we will frog boil ourselves
into a total loss of control
over how we administer our affairs,
which is very likely and very scary.
And also just that like
these really crucial decisions
are probably going to be
made by a very, very small group of people. But the one thing that I would emphasize is that,
like, I don't feel like they have arrogated to themselves, like, that responsibility.
Like, in fact, I think most of them don't want it. I think that, like, a lot of the conversations
that I was having were with people who were, like, I got into this because I was interested in
some, like, really obscure niche part of, like, computational linguistics, theoretical computer
science or whatever. And like, now I'm in a position where, like, I have to be worrying about
how 15-year-olds are going to be using this. Like, I was not trained to do that. I don't know
how to think about it. I don't want that responsibility on my shoulders. So there isn't the
arrogance of, like, we are the ones who can figure it out. It's like we ended up in this weird
universe where, because so many of our institutions have become dysfunctional, we don't have
whatever broad democratic decision-making could go into this. It doesn't feel like this is
something that, like, we are steering as a society. It feels like something that's just, like,
charging ahead. Like, I think at the companies, they just feel like they are, like, desperately
trying to, like, stay on top of this bull that they are riding. It's funny. There's, like,
the stereotypical view, which is, these are the tech
bros of 2014, like, people who are so convinced of their own brilliance and so convinced that their
questionable gifts to the world are, in fact, gifts that we want, and their arrogance is going to
ruin us. And there's another one, which is basically, like, pattern-matching to crypto, which is like
these are a bunch of like hypesters and everything they say about the awe they feel and the
terror they feel about the things they're working on is just a way to hype up more interest
in their technology. And that's not what you experienced. What you experienced are people who are
brainy, sometimes academic people at the forefront of something that is, I mean, legitimately,
just, like, the word I keep coming back to is awe, because you can be awed by something terrible or
awed by something wonderful.
And they are looking and seeing the same society we see,
which is one that is fairly broken, bad at,
not just making decisions at like a government level,
but our intellectual culture is really bad right now.
And so the conversation they would want to have
with the rest of society about what should happen,
they're looking for grown-ups
and not really totally finding people to have a conversation with.
And we are playing a role in that, too.
You know, every time someone on like our side, so to speak, is just like, this is all a parlor
trick, this is all hype, this is all smoke and mirrors, like, we are abdicating our own
responsibility to like be involved in this.
And, you know, what you were saying about, like, crypto hypesters and, like, those kinds of tech
bros, like, of course all of those people exist.
And of course all of those people are like part of this system too.
But there are others who like want partners in talking about this stuff.
and that means that, like, we also have to, like, try to rise to the occasion.
And it is really hard because this stuff is extremely complicated and confusing.
Did you feel just personally when you were done reporting
that you understood the thing you had gone there wanting to understand?
Well, yes, but with the qualification that, like, I didn't actually think that I was going to settle anything.
Like, this was not a piece about, like, finding the answers.
It was a piece about, like, trying to sharpen the questions.
that we should be asking.
I don't feel like I came out of it with answers,
but I don't think we should trust anybody
who is offering us answers right now.
It's all just like too pat,
and it's not credible to, like, be forecasting about this stuff.
Gideon Lewis-Kraus is a writer.
You can find him at The New Yorker magazine.
We'll have a link to his excellent story
about Anthropic in our show notes.
And again, on Anthropic's showdown with the Pentagon,
we'll see news on that today, Friday evening.
Of course, we reached out to Anthropic for comment.
A spokesperson told us that Dario Amodei met with Secretary Hegseth at the Pentagon
and that they're continuing to have good faith conversations.
I think this is a good moment to pay attention to.
Among Anthropic's competitors,
xAI has promised to give the government what it wants.
Google and OpenAI appear to be moving in that direction.
So I'm watching this both as a test of whether Anthropic can actually keep the big promises
it's made about AI safety,
but also just as an opportunity to track
a more uncomfortable question, which is, can we even have safe AI in a world where it's being
developed in a tech race between for-profit companies? And if not, what's the alternative,
in a world where the U.S. government's sole intervention seems to be to advocate for less
safe AI? Keep an eye on the news. We'll learn a little bit more as this story unfolds.
Search Engine is a presentation of Odyssey.
It was created by me, PJ Vogt, and Sruthi Pinnamaneni.
Garrett Graham is our senior producer.
Emily Maltaire is our associate producer.
Theme, original composition, and mixing by Armin Bizarrian.
Our production intern is Piper Dumont.
Our executive producer is Leah Reese Dennis.
Thanks to the rest of the team at Odyssey,
Rob Morandi, Craig Cox, Eric Donnelly, Colin Gaynor,
Mori Curran, Josephina Francis, Kurt Courtney, and Hilary Schuff.
If you'd like to support the show,
get ad-free episodes, zero reruns and bonus episodes.
Please consider signing up for Incognito mode at searchengine.com.
Thanks for listening. We'll see you next week.