Hard Fork - Big Tech's Tariff Chaos + A.I. 2027 + Llama Drama
Episode Date: April 11, 2025

This week, with the tech world in chaos over President Trump's tariffs, we look at how four specific companies are navigating the new day-to-day reality. Then, the A.I. researcher Daniel Kokotajlo returns to the show to discuss a new set of predictions for how artificial intelligence could transform the world in just the next few years, and how we avoid the most dystopian outcomes. Finally, we explore whether Meta cheated on an important A.I. benchmark with its new Llama model.

Guest: Daniel Kokotajlo, executive director of the AI Futures Project

Additional Reading:
Inside Trump's Reversal on Tariffs: From 'Be Cool!' to 'Getting Yippy'
A.I. 2027
Meta Gets Rebuked Over Benchmark Gaming

We want to hear from you. Email us at hardfork@nytimes.com. Find "Hard Fork" on YouTube and TikTok. Unlock full access to New York Times podcasts and explore everything from politics to pop culture. Subscribe today at nytimes.com/podcasts or on Apple Podcasts and Spotify.
Transcript
What's new with you?
Oh, you know, just binge buying cheap Chinese stuff online
to beat the tariffs.
Making your final Shein purchases
before that company shuts down.
Yes, no, I actually did buy a bunch of stuff over the weekend
because I thought this might be my last chance.
Yeah.
Casey, what cheap overseas good are you going to miss most
after the tariffs kick in?
Oh, I feel...
The thing is, I was never a big, like, oh,
I gotta go on to Temu and get, like, a pressure cooker for six dollars or whatever. That was never my
journey. But I know that, you know, it's a major pastime for a lot of people. Yeah. Yeah. Yeah.
Well, for me, it's like, the ability to buy cheap crap for my kid has been revolutionary.
My kid the other day, starts saying the phrase,
dinosaur unicorn.
And I thought, that's not real.
And he says, I want a dinosaur unicorn.
And I said, well, that's not a thing, we can't have that.
But then this little bell goes off in my mind that says,
someone out there has made a dinosaur unicorn something.
Almost certainly.
My wife finds like eight different dinosaur unicorn t-shirts and buys one of them, and now he's got this dinosaur unicorn t-shirt that he absolutely loves.
That would not happen in a tariffs world. As of today, that shirt costs over $400.
Yeah, well, I mean, I'm sure Jude looks great in that.
He does. Yeah, he does. And he's going to have to wear it for 10 years. I hope it stretches.
I'm Kevin Roose, a tech columnist at The New York Times. I'm Casey Newton from Platformer.
And this is Hard Fork. This week, the tech world is in chaos over Trump's tariffs. Then,
AI researcher Daniel Kokotajlo returns to the show to discuss a fascinating new set
of predictions for how AI could transform the world in just the next few years.
And finally, did Meta cheat on an important AI benchmark? Well, Casey, for the second week in a row, we have been interrupted by news about these
Trump tariffs.
Now, there was a time in the history of the Hard Fork podcast where the only thing that
would cause us to rip up a segment and re-record it was if Sam Altman had been fired or rehired.
But now we live in this new reality where news can change on a dime.
And over the past few days, that is exactly what we've seen.
I think it's fair to say Hard Fork has been hit
harder by the tariffs than any other company.
That's true.
That's true.
We are bracing ourselves for massive impact
and getting ready for the new reality.
So Casey, every great era deserves a name.
And I think we should call this era in the technology industry
the chaos meta.
Nothing to do with meta the company, but in video gaming
metas are sort of like the overall set of conditions
that the players have to navigate.
And I think it's fair to say that chaos
and the lack of certainty surrounding what Donald Trump is going to do on any given day
is the new meta for Silicon Valley's largest companies.
Yeah, remember how when we were talking about whether or not
TikTok would be banned, which also had a lot to do with what
Trump wanted, we talked about how it was kind of simultaneously
alive and dead at the same time?
Now that's just the entire US economy, Kevin.
Yes.
So as of early this week, it looked like we were going to get these massive tariffs on goods
imported to the United States from many, many countries all over the world, larger than any
tariffs we've seen in the recent history of this country. Then on Wednesday, as we were taping our
episode, we got the news that the Trump administration was pushing pause on most of them.
Most of these reciprocal tariffs on countries like Vietnam and India were going to be delayed for 90 days,
and there would be a baseline 10% tariff rate applied, but not the much higher rates that people had been fearing,
except for China, which would have its tariffs increased. And on Thursday,
we learned that those tariffs would actually be 145% on Chinese goods entering the US.
The problem is with a podcast, we can't just have a little ticker on the bottom that shows
you what the current tariff is.
Yes. But what we saw early this week was that the stock prices of all the biggest US tech
companies took a dramatic nosedive that was in response to these fears
about these very high reciprocal tariffs.
Now, after the news that these tariffs are
going to be placed on a 90-day hold, except for China,
some of these stock prices have rebounded.
Apple, in particular, had its biggest trading day
in many years after the news of these tariffs being delayed
came out. So the stock market whiplash is part of the setting for the tech companies that
they have to deal with now. But the bigger picture scenario is that doing
business in Trump's America is turning out to be very difficult. Not because the
administration is necessarily unfriendly to these businesses, but because there's
just so much fast-moving news that
it is hard for businesses to do any kind of planning or strategy
at all.
Well, I mean, I wouldn't say this
is a particularly business friendly set of announcements
that have been made.
I mean, sure, I guess it's friendlier
to pause the tariffs than to continue them.
But the general chaos, Kevin, I think
has been really bad for American companies.
Yeah, so even beyond the tariffs,
there are a bunch of things that the Trump administration has been doing that have impacted the tech industry, restrictions on immigration, cuts
to science funding, these antitrust cases, many of which are still going forward.
So I wanted to kind of give our listeners a sense of how this instability feels on the
ground in Silicon Valley to the biggest tech companies.
And you had a really smart idea, which was to look at the new chaos meta of Trump's second term
through the lens of four tech companies.
So today we're going to take a look at how Trump's new policies and these tariffs
have affected four companies.
Apple, Nintendo, TikTok, and Meta.
All of which have faced significant challenges since Trump took office,
and all of which are now trying to figure out, how do we go forward?
What do we do? How do we navigate this new uncertain climate?
Yeah. So let's start with Apple. Casey, what is going on with Apple?
Well, look, of all of the tech companies, Apple has long been the most dependent on China.
That is where 90% of iPhones are made.
The company is just heavily dependent
on its supply chain relationships
that it has in that country.
So the fact that these tariffs are now 145%
on goods coming out of China has just really sent a shiver
through that company earlier this week.
Apple had its worst four day trading period
since the year 2000.
Once the pause was announced, its stock started to come back, but this is a very volatile
situation for them.
And the underlying dynamics are the same, which is that it is simply going to be much
more expensive for Apple to sell goods made in China here in the United States, Kevin.
Yeah.
And obviously, one of the hopes of these tariffs is that it will drive manufacturing
back to the United States.
There's some hope among members of the Trump administration that this could even force
Apple to consider making the iPhone in the United States.
Do you think that is likely and why?
No, and in fact, I think it's almost sort of worse, Kevin, because this week, the president's press secretary
said that the president believes that iPhones can
be made in the United States, despite the fact
that we know that it is much more expensive to manufacture
things here in this country, right?
It's very important to remember that whatever
the Trump administration might hope
that these tariffs accomplish, they have not
accompanied it with any plan
to increase the manufacturing capacity in this country.
The whole thing is just a wish and a prayer
that at some point in the future,
Apple might have a magical iPhone factory
stocked with Americans who wanna do those jobs.
As it stands now, that doesn't exist.
Yeah, so I would say Apple is somewhat unique
among tech companies because it has also been thinking
about tariffs and the effect of Trump's
policies on their business for longer than many of their competitors.
If you'll remember during the first Trump term, there was some talk about tariffs on
Chinese goods.
Apple successfully negotiated its way out of those, sort of got an exemption.
In part, they did that by cozying up to the Trump administration by promising to build
and assemble
some of their products in the United States.
There was this famous tour that Tim Cook gave Donald Trump
of this facility in Austin, Texas,
where he said they were gonna start making a bunch of stuff.
So they sort of managed to get the tariffs off their back
during the first term, but in the second term,
it's not at all clear that they are going to have
the same kind of success.
So Casey, how is Apple dealing with the new chaos meta? Well, they are trying to get as many
devices as they can out of China and into places where it's going to be much
less expensive to export them to the United States. So there was a great story
this week in the Times of India that according to senior Indian officials,
Apple transported five cargo
planes full of iPhones and other products from India to the United States, which sort
of calls to mind those scenes at the end of the Vietnam War when you see the last helicopter
leaving Saigon, except it's full of iPhones. Actually, Katie Notopoulos had a great joke
on Threads today. She said that this whole thing is like the movie Dunkirk, but for iPhones.
Reuters reported that Apple transported 600 tons
of iPhones, Kevin, which would have been
about 1.5 million devices.
And look, you know, those iPhones will pad Apple's profits
a little bit more, but pretty soon,
there's gonna be no more planes out of no more countries
to escape these tariffs.
It is just gonna be a really expensive ass iPhone.
Do you think the iPhone 16 Pro Maxes get to sit in first class on the plane?
They put them up front in the lie flat seats?
Yeah, they should definitely get the upgrade with what they're paying for those things.
Yeah, so, okay, let's move to our next case study of a company trying to deal with the uncertainty and chaos
of the Trump administration, Nintendo.
Casey, what is going on with Nintendo?
Well, so Kevin, as a hardcore gamer,
obviously you know that the Switch 2
is coming out this year.
This is the sequel to Nintendo's best-selling console
of all time, and it was supposed to become available for pre-orders
on this very Wednesday.
But then tariff chaos started happening,
and Nintendo said, we are going to pause pre-orders
because we don't know what it's actually gonna cost
to sell a Switch 2 in America anymore.
Yeah, and now that Trump has paused these tariffs
on most countries other than China,
have they
said that actually they're going to start shipping the Switch 2 on time after all?
Well, what they've said is that they're not planning to change the launch date, which
is June 5th.
And it does seem like because they are a Japanese company and make the Switch 2 in Vietnam,
they are going to be able to avoid the really tough tariffs that Apple is facing, right?
Before Trump initiated the pause,
there was gonna be a 46% tariff on the Switch 2.
Now it's back down to that 10%.
But look, the Switch 2 is already planning
to go on sale for $450,
which is $150 more than the original Switch sold at launch.
So I think there's a very real question here of whether the price of this console goes up over time,
which would be a reversal of the usual trend, which is a console goes on sale for a high price
and that price comes down over time. So once again, Kevin, there's just real chaos here as
we await probably the most highly anticipated piece of hardware to launch, I would say,
in the United States this year.
Yeah, now are they bringing in planes
full of Switch 2s from Vietnam
or wherever they're manufacturing them?
They were actually able to put them in one of those pipes
and they just sort of warped down.
It's kind of a really cool little thing they have there.
I got it.
Okay, next company on our list, TikTok.
Casey, this is a company we have talked about a lot on this show.
They were going to be banned. The deadline for banning them got pushed out by another 75 days last week.
Casey, what is the latest on TikTok and how it is coping with this escalating trade war between China and the US?
Well, Kevin, what is going on with TikTok is of course,
the question asked most in the history of hard fork
and what was going on with it until tariff chaos
was that it looked like we might have a deal.
There was some great reporting in the Times this week
that ByteDance with the support of the Chinese government
had reached the rough outlines of an agreement
in which TikTok would create
a new American entity, American investors would own the majority of it, Chinese owners
would have about a 20% stake, and the American company would essentially rent the algorithm
from ByteDance.
And so by Thursday of last week, there was this draft executive order that outlined the
deal and then Trump did the thing with the tariffs.
And all of a sudden, ByteDance has to call up the White House and say,
that deal that you just helped us negotiate, it's off the table because the Chinese government
isn't going to support the deal anymore. Right. So this was a pretty dramatic reversal. And
it does seem like they got very close to a deal before these tariffs. What is happening now that these tariffs are on? Does TikTok have any options left?
Well, Kevin, along with a 90-day tariff pause, we also now have a 75-day extension that comes
after the original 75-day extension that Trump gave in order to force ByteDance to divest TikTok.
This man loves extensions.
Let's just say it.
This man loves to come up right against a deadline
and say, you know what?
You got a little more time.
Yeah, well, look, you know,
I don't know what's gonna happen over these next 75 days.
I imagine that if the tariffs against China stand at 145%,
there is no way the Chinese government
is going to support the sale of TikTok.
And I just want to say how self-defeating this is because it was barely more than a
week ago that Trump was telling reporters that Beijing, if they would simply go along
with his plan to force the divestiture of TikTok, then he would go easy on them on tariffs.
This was his big bargaining chip of if you don't want high tariffs, you have to let the Americans have TikTok.
And to my surprise, it seemed like the Chinese government
was actually going to go along with that.
And then before they could even get that deal out,
Trump seemingly out of nowhere
announces a brand new set of tariffs
that completely scuttles the deal.
So it is as if the president was essentially
negotiating against himself
and lost the deal that he had won.
Yeah, it does seem strange that he would not wait
until after the TikTok deal was finalized
and approved by all the relevant officials
to then issue these tariffs if he was actually interested
in getting a deal done.
Yeah, I think that's right.
So, okay, TikTok is still in this frustrating state
of superposition where they are both dead
and alive at the same time.
Do we think that this resolves before the end of the next 75-day extension? Or do we think we will
need yet another extension to figure out what we're doing with TikTok? My assumption is that on the
day that Donald Trump leaves office, we will still be in the middle of one of these extensions. It'll
be sort of like the 15th extension, you know, or the 23rd extension. But no, until this tariff situation gets resolved,
I do not expect TikTok's fate to be resolved.
It is just going to continue to exist in its weird limbo.
All right, so that is TikTok.
Our last company on this list of case studies is Meta.
Casey, how is Meta dealing with this new uncertain reality?
Well, I would say that things turn out a little bit better
for them this week than maybe it looked like
things were going because tariffs were gonna be
a huge problem for them too.
They are a digital advertising business
and a huge number of their advertisers are small
and medium sized businesses that buy ads outside
the United States to export goods from foreign countries
into the United States.
Mike Isaac at the Times had a great piece on this this week.
There's one analyst who estimates that about $10 billion
of Meta's revenue from ads originates
from outside the United States.
So in a world where everyone was facing
these massive tariffs, we were just expecting Meta
to get hit really hard on the ads front.
Well, now that has mostly gone away,
at least for the next 90 days, so it seems like
Meta is going to get some breathing room. But there is this one other outstanding question,
Kevin, which is that next week, Meta's antitrust case is going to trial, right? So in 2020,
during the first Trump administration, the Federal Trade Commission filed an antitrust lawsuit to
try to break off Instagram and WhatsApp from Meta. It has been winding toward trial
ever since. And on Monday, the case is set to go to trial. So what does all of this have
to do with Trump? Well, Mark Zuckerberg has been giving Trump the full-court press, going
so far as to buy a $23 million house in Washington, DC recently, just to get closer to and spend more
time with the president.
There's been some reporting that Zuckerberg was in the White House trying to negotiate a settlement
with Trump just within the past few days. So there's a lot of questions right now about whether
Zuckerberg will be able to use this relationship that he's apparently been building with Trump
in order to get rid of this case, which is in some ways an existential threat to his business.
Yeah.
And we should also just say like, this shouldn't be possible, right?
The FTC is supposed to be an independent agency that has its own enforcement agenda and brings
its own cases that are independent from the president.
But of course, nothing is truly independent from the president in Trump's Washington.
He recently announced that he was getting rid of the two Democratic commissioners on
the Federal Trade Commission.
That is historically quite unusual for a president to intervene in FTC commissioner staffing
at that level.
But now it is sort of going to be staffed with people who are friendly to the Trump
administration.
And so presumably, if he were to go to them and say, hey, let's back off this Meta case,
I don't actually think we need to proceed with this.
They might listen.
And we should say that another way that Meta tried to ensure
that this happened is that after the events of January 6th,
Meta suspended Trump from its platform for three years
and Trump sued them over that.
And so after he won the presidency,
Zuckerberg came along and said,
hey, why don't we settle
this too?
And paid Trump $25 million.
And I have to say, Meta was completely within its rights to suspend an account.
They're allowed to suspend whatever account they want.
It's a private company with a private platform.
But still, just as a little gesture of goodwill, hey, Trump, here's $25 million.
So if this actually happens and this lawsuit just goes away, it will just frankly be an example of open corruption.
Okay, so that is our four-company case study of how tech companies are trying to do business and survive in
this new uncertain environment. I have to ask, after going through all these examples,
which of these companies would you be, if you could be one?
Which do you think is in the best position
in this new chaotic environment?
Well, you know, until maybe Wednesday,
I think I would have said Apple, right?
Apple makes the iPhone,
the iPhone is the most lucrative product
in the history of the technology industry.
And even despite some of the tariffs that we were seeing,
it seemed like they
were still going to be in a good position to navigate them. I was seeing analysis that they
were only going to lose maybe seven points of profitability from all of this. But the world
looks really different with 145% tariff and in a world where Trump just keeps escalating this fight
more and more. And so I actually do think that the picture for Apple just looks really strange.
So look, I feel a little crazy saying this, but maybe I actually would just rather be
Meta.
Their hardware business is still a relatively small part of what they do.
Mostly what they do is a digital services business.
And it seems like Zuckerberg has been able to make at least some inroads with the Trump
administration.
Maybe they're about to get rid of this lawsuit against them.
So God, I don't know.
Maybe I actually want to be Meta.
How about you?
Yeah, I think, I mean, as venal and corrupt as it would be
for these naked attempts at flattery and persuasion
to actually work and pay off,
I would not underestimate how well this stuff works
with Donald Trump.
And I think that Mark Zuckerberg's, you know,
motive here is to win at all costs.
And if he needs to buy a $23 million mansion
or spend time in the White House,
or even make some policy adjustments
to appease the Trump administration
and get what he wants,
I think he's demonstrated very clearly
that he's willing to do that.
My last question on this, Casey,
is about this idea of the tech capitulation to Trump.
In the past few months, we've observed,
we've talked about the fact that a lot of these tech companies
have been really falling all over themselves
to appease the Trump administration.
Many of them gave to the inaugural.
Many of them showed up at inauguration.
Their CEOs were seated just behind the president's
own family.
The amount of flattery and ass kissing going on here for months now has been,
I would say, notable and historic. Do you think that any of that has worked to the degree
that these executives thought it would? Did the tech leaders get what they wanted out
of Donald Trump?
I think that until the tariffs, the answer was basically yes. And the tariffs are what have changed that equation, right?
If you look at how JD Vance was talking when he went to Europe,
he was echoing a lot of tech company talking points.
You know, he and Trump have criticized European fines against tech companies,
saying like, we need to protect and defend our American tech companies
against these European fines,
which was something that the Biden administration never ever did.
They've talked about getting rid of AI guardrails and just letting these companies do whatever
they want with AI, which is like music to Mark Zuckerberg's ears.
But look, these companies just rely on stable, normal governance to be able to conduct their
business around the world.
They are as plugged into the interconnected global economy as anyone else,
arguably more than many companies.
And Trump just came along and blew that up.
And I think that it is probably dawning on them that they are probably just going to
be living in chaos for the foreseeable future.
And it is just going to make their lives much, much more difficult.
Yeah, I think that's right.
And I think that a lot of these executives
have underappreciated how important stability
and predictability are in their business models.
I mean, these were companies, many of them,
that had issues with the Biden administration.
The Biden administration had issues with them.
But at least with the Biden administration,
these companies knew where they stood.
There was not this sort of day-to-day whiplash of stock price moving up 10%, down 10%, tariffs going up
to 145%, and then down to 10%. It just was not the kind of frenetic environment that
we're seeing today. And so I wonder if any of them are starting to appreciate how good
they had it during the Biden years,
where for as much as the Biden administration may have gone after them for various things,
including antitrust violations, at least they could wake up every day and understand what
the world was going to look like for the next 24 hours.
Yeah, I think that's true. I think that most of them would probably still be loath to
admit it, but let's give it another few weeks, Kevin, and another few tariffs. And then let's check back in with them.
Sounds good.
Well, that's enough about tariffs, Casey. When we come back,
we're going to talk about a terrifying new report about what AI could look like in 2027. Well, Casey, today we're going to talk about a forecast.
And that's separate from a Forkcast, which is something different.
Yeah.
That's what we call our end of the year predictions episode, isn't it?
I think so.
But today we're talking about something different, which is this new report called AI 2027.
This is a report that I wrote about last week and that has gotten a lot of attention
in AI circles and policy circles this week. It was produced by the AI Futures Project,
a Berkeley-based nonprofit led by Daniel Kokotajlo, who listeners of this show may remember was
a former OpenAI employee who left the company last year and became something of a whistleblower,
warning about their reckless culture, as he called it,
and is now spending his time trying
to predict the future of AI.
Yeah, and of course, lots of people are trying
to predict the future of AI,
but what gives Daniel a lot of credibility here
is that in 2021, he tried to predict
what things would look like about now. And he just got a lot
of things right. And so when Daniel said, Hey, I'm putting
together a new report on what I think AI is going to look like
in 2027. A lot of close AI observers said, Oh, this is
really something to read. Yeah. And he didn't just do this
alone. He also partnered with a guy named Eli Lifland,
who is an AI researcher and a very accomplished forecaster.
He's won some forecasting competitions in the past.
And the two of them, along with the rest of their group,
and Scott Alexander, who writes the very popular Astral
Codex Ten blog, put together this very detailed,
what they call a scenario forecast.
Essentially it's a big report, a website.
It's got some sort of research backing it up,
and it basically represents their best attempt
to kind of synthesize everything they think is likely
to happen in AI over the next few years
into a readable narrative.
Yeah, and if that sounds a little dull to you,
I'm telling you, you should just go check this thing out.
It's at ai-2027.com, and it's just super readable.
And it blows through stuff that feels very familiar right now,
like just sort of basic extrapolation from where we are today,
and then, you know, six months, a year from now,
the world starts to look very, very different.
And there is a lot of research that they have to support
why they think that is plausible.
Yeah, and I can imagine people reading this report
or listening to us talking about it and say,
well, that sounds like science fiction to me.
And we should be clear, it is science fiction.
This is a fictionalized narrative
that they have put together.
But I would say it is also grounded
in a lot of empirical predictions
that can be tested and confirmed or verified.
It's also true that some science fiction
ends up becoming reality, right?
If you look at movies about AI from past decades,
a lot of the things in those movies
did end up actually being built.
So I think this report,
while it may not be 100% accurate, at least
represents a very rigorous and methodical attempt to sketch out what the future of AI might look like.
And here's my bet. If you put this conversation into a time capsule and revisited it in two years,
in 2027, my guess is we're going to find that a good number of things in that scenario actually
did come true. I hope we're still doing a podcast in two years.
That'd be good.
That'd be great.
Yeah.
So my forecast is that this is gonna be
a good conversation.
Let's bring in Daniel Kokotajlo.
Daniel Kokotajlo, welcome back to Hard Fork.
Thank you, happy to be here.
So you have just led this group that put together
this giant scenario forecast, AI 2027.
What was your goal?
So our goal was to predict
the future using the medium of a concrete scenario.
There is a small but exciting literature of
attempts to predict the future of AI that use other methods,
which is also very important.
Things like defining a capabilities milestone.
Like, here's my definition of AGI, here is my forecast for how long we'll have until
AGI based on these reasons and stuff.
And that's great, and we've done that stuff before.
We did a lot of that in the run-up to this scenario.
But we thought it would be helpful to have an actual concrete story that you can read.
And part of the reason why we think this is important
is that it forces you to like think about everything
and integrate it all into a coherent picture.
Well, I wanna ask you a bit more about that.
So, I mean, the first thing I wanna say about AI 2027
is it's an extremely entertaining read.
Like it is as entertaining as most of the sci-fi
that I have read.
By the end of it, you get into scenarios
where humanity's survival is threatened.
And so whether you think it's true or false,
it is like really engaging to read.
But my understanding of your aim here
is that there was something practical
about what you were trying to do, right?
Can you tell us about sort of the practical idea of going through this exercise?
Yeah, well, I mean, important background context, the CEOs of OpenAI, Anthropic, and Google
DeepMind have all publicly stated that they're building AGI and even that they're building
super intelligence and that they think that they can succeed by the end of this decade.
And that's a really big deal,
and everyone needs to be paying attention to that.
I think a lot of people dismiss that as hype,
and it's a reasonable reaction to say,
oh, they're just hyping their product.
But it's not just the CEOs saying this,
it's also the actual researchers at the companies,
and it's not just people at the companies,
it's also various independent people
in academia and so forth.
And then also, like, you don't just have to trust
people's word for it, if you actually look at the evidence,
it really does seem strikingly plausible
that this could happen by the end of this decade.
And then if it does happen,
things are gonna go crazy in some way or other.
We like, it's hard to predict exactly how,
but obviously if we do get super intelligent AGI,
what happens next is going to look like sci-fi.
It will be straight out of a sci-fi book,
except that it will be actually happening.
You mentioned that if what the CEOs of tech companies say
comes true, we will be living in a sci-fi world.
And I think for a lot of people,
they're content to sort of stop thinking there, right?
They might be willing to admit, okay, yeah, if you invent super intelligence, things will
probably be crazy, but like, I'll cross that bridge when we come to it.
You're sort of taking a different approach and saying like, no, you're going to want
to start thinking right now about what it would be like if some of these claims start
to come true.
So maybe we could get into what some of those claims are.
Sketch out for us what you think is very likely to happen just within the next couple of years.
Well, I wouldn't say very likely.
I should express my uncertainty, right?
So past discussion often focuses on a single milestone like artificial general intelligence
or super intelligence.
We broke it down into a couple of different milestones, which we call superhuman coders, superhuman AI researchers,
superintelligent AI researchers, and then broad superintelligence.
So we sort of like make our predictions for each of these stages.
Even the very first one, I'm only like 50% confident that it will happen by the end of
2027.
So I have 50% chance that 2027 will end
and there still won't be any autonomous
superhuman coding agents.
But it's a coin flip.
We might also be living in a world where, yes,
you do have... yeah.
Exactly, so 50% chance we do have autonomous,
fully autonomous artificial intelligences
that can basically do the job of the cracked engineers
by 2027. And then, okay, what's the next milestone after that?
After that comes automating the full AI research process instead of just
the coding, because AI research is more than just coding.
And how long does it take to get to that?
Well, we have our guesses, and
in our scenario it happens like six months later, you know.
So in our story, you get the superhuman coders, use them to go even faster to
get to the superhuman AI researchers
that are able to do the whole loop.
That really kicks things off
and now you're going much faster.
How much faster?
We say 25 times faster for the algorithmic progress
at least, of course your compute scale up
is not going any faster at all
because you still have the same amount of compute
but you're able to do the algorithmic progress
20 times faster, 25 times faster.
Then you start getting to the superhuman regime.
So you start getting systems that are just like
qualitatively superior to the best humans at stuff.
And they're also probably discovering new paradigms.
So we depict them going through multiple paradigm shifts
over the course of the second half of 2027,
ending up with something that's just
vastly superior to humans in every dimension by the end.
Yeah, let me just sort of pause
and maybe underline a couple of things there.
I think, you know, most people might not understand
why the big AI labs are obsessed with automating coding,
right, most people are not software engineers,
so they kind of don't care how much of it is automated.
But by the time you get to software
that is mostly writing itself,
it unlocks this other world of possibilities.
And you just sort of sketch out a vision
where once we get to a
point where the sort of AI coding systems are better than
almost every human engineer or maybe every human engineer,
then this other thing becomes possible, which is now you can
just set this thing to work trying to figure out how to build
AI itself, right? Is that what I'm hearing you say?
Basically, I'd break it down into two stages.
So I think the coding is separate from the complete automation,
as I previously mentioned.
I think that I expect to see systems that
are able to do all the coding extremely well,
but might lack research taste.
For example, they might lack good judgment about what
types of experiments to run.
And so that's why they can't completely automate
the research process.
And then you have to make a new system
or continually train the old system
so that it gets that taste, it gets that judgment.
Similarly, they might lack coordination ability.
They might be not so good at working together
in large organizations of thousands of copies,
at least initially, but then you fix that
and you come up with new methods
and you do additional training environments
and get them good at that sort of thing.
And that's what we depict happening over the first half
of 2027.
And we depict it happening in only half a year
because it goes faster because they've
got all the coding down pat.
And so even though humans are still
directing the whole process, they just
give orders to the coding agents and they quickly
make everything actually work.
And then halfway through the year, they've succeeded in making
new training runs that train the skills that the AIs were missing. So now they're not just coding
agents. They are able to do the research taste as well. They're able to come up with the new ideas.
They're able to come up with hypotheses and test them. And they're able to work together in big,
sort of like hive mind clusters of thousands and thousands of them. And that's when things really kick off.
That's when it really starts to accelerate.
In your scenario, you have this sort of choose your own adventure ending,
where after this thing you call the intelligence explosion where
the superhuman AI coders get into AI R&D and they start
automating the process of building better and better AIs,
you sort of have two buttons that you can click and one of them
unspools the good place ending where we
decide to slow down AI development and really get
these things under control and solve alignment.
Then the red button, you push that and it goes into
this very dark dystopian scenario
where we lose control of AI,
they start deceiving and scheming against us,
and ultimately maybe we all die.
Why did you decide to give people the option
of choosing one of those two endings
rather than just sketching what you believe to be
the most probable outcome?
So we did start by sketching what we believe
to be the most probable outcome,
and it's the race ending,
the one that ends with the misaligned AIs
in control of everything.
So we did that first.
And then we were like, well, this is kind of depressing and sad.
And there's a whole bunch of stuff that we didn't get to talk about because of that.
And so we wanted to then have a different ending that ended differently.
In fact, we wanted to have like a whole spread of different possible outcomes, but we were
limited by time and labor and we were only able to pull
together one other outcome, which is the one that we depict in the slowdown ending.
So in the slowdown ending, they solve the alignment issues, and they actually get AIs
that do what it says on the tin.
They're not faking it.
They just actually have the goals and values that were put into them, or that the company
was trying to train into them.
It takes them a couple months to sort that out.
That's why it's a slowdown.
They had to pivot a lot of their compute and energy towards figuring that stuff out, but
they succeed.
So then in that ending, we still have this crazy arms race with China and we still have
this crazy geopolitical crisis.
And in fact, it still ends in a similar sort of way with this massive arms buildup on both
sides, this massive integration into the economy, and then ultimately a peace treaty.
I'm curious, Daniel, if the events of the last week in Washington, the tariffs, this looming
trade war with China have affected your forecast at all? I mean, we've been iteratively improving
it, but like the core structure of it was basically done a few months ago. So this is all new to us and wasn't really part of the forecast.
How would it change things? Well, if the trade war continues and causes a recession and stuff like that, it might
just generally slow the pace of AI progress, but not by much.
I think, like, say it makes compute 30% more expensive, so that the companies are able to buy 30% less of it.
Maybe that would translate to like a 15% reduction in overall research velocity over the next few years,
which would mean that the milestones that we talked about happen like a few months later than they otherwise would. So the story would still be basically the same.
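To make Daniel's back-of-the-envelope math concrete, here is a minimal sketch. The square-root relationship between compute and research velocity is an assumption chosen to roughly reproduce his numbers (30% less compute, roughly 15% less velocity), and the 24-month baseline is likewise illustrative, not a figure from the AI 2027 report.

```python
# Back-of-the-envelope sketch of the tariff example above.
# All numbers are illustrative assumptions, not figures from AI 2027.

compute_cut = 0.30   # compute gets 30% more expensive -> labs buy ~30% less
elasticity = 0.5     # assumed: research velocity ~ compute ** 0.5

velocity = (1 - compute_cut) ** elasticity   # ~0.84x baseline, i.e. ~16% slower
slowdown = 1 / velocity                      # everything takes ~1.2x as long

baseline_months = 24                         # assumed distance to a milestone
delay = baseline_months * (slowdown - 1)

print(f"research velocity: {velocity:.0%} of baseline")
print(f"milestone arrives ~{delay:.1f} months later")
```

On those assumptions, the milestone slips by roughly five months, which lines up with the "a few months later, same basic story" conclusion.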
So one of the things I think is most interesting about your project is the bets and bounties
section, where you are going to pay people for finding errors in your work,
for convincing you to change your mind on key points, or for drafting some alternate scenarios.
So talk to me a little bit about how that became part of this project.
So, you know, I come from the sort of rationalist community background,
which is big into making predictions and making bets, putting your money where your mouth is.
So I have a sort of aesthetic interest in doing that sort of thing.
But then also, specifically, one of the goals of this project is to get people to think
more about this stuff and to, you know, do more scenario forecasting along the lines
of what we've done.
We're really hoping that people will counter this with their own reasonably detailed, you
know, alternative pathways that represent their vision of what's coming.
And so we're going to give out a few thousand dollars of prizes to try to mildly incentivize
that.
And then as for the bounties thing, already we've gotten dozens of people being like,
you say this, but like, isn't this a typo or like, you know, this feels wrong.
And so I have a backlog of things to process, but I'm going to get through it.
I'm going to like, you know, pay out the little payments and fix all the little bugs and stuff like that.
And I'm just quite heart-warmed to see
that level of engagement.
And have you taken any bets on different scenarios so far?
I think so far I've done one or two.
But mostly there's just a backlog I need to work through.
Got it.
Got it.
Now, Daniel, you said you've been getting some good responses from people at the AI
companies to this scenario forecast. I did a bunch of calling around when I was writing about this,
and after we spoke, I talked to a bunch of different people both in the AI research community
and outside of it. And I would say the most frequent reaction I got was just kind of disbelief. One person I talked to,
a prominent AI researcher said he thought it was
an April Fool's joke when I first showed him this scenario,
because it just sounded so outlandish.
You've got Chinese espionage and the models going rogue,
and the superhuman coders,
and it all just seemed fantastical and it
was almost like they didn't even think it was worth engaging with because it
was so far out. I'm curious if you've gotten much of that kind of reaction and
what your response is. A couple things. So first of all, well you know go write your
own damn scenario then. I would say you either will write a scenario that
doesn't seem outlandish,
which I will completely tear apart as unrealistic and just assuming basically that AI progress
hits a wall, or you'll write a scenario that does feel very outlandish, but perhaps in
different ways than ours do.
Again, are they actually going to get to AGI and superintelligence by the end of this
decade?
If so, you can't possibly write that in a way that's not outlandish.
It's just a question of like which outlandish thing
are you going to write?
And if you think maybe this is not gonna happen
and it's gonna hit a wall, yeah, that's possible too.
I think that's reasonable.
I don't think it's the most likely outcome.
Like I do actually think that probably by the end
of this decade, we're gonna have super intelligence.
But I think it's, yeah, yeah.
That's what I say.
And say more about that,
because I assume that a lot of our listeners
like think either truly think that it will hit a wall, or they're just sort of counting on it,
hitting a wall so as not to have to reckon with any of the scenarios that you describe.
So like, what is your message to the person that's just like, eh, it'll probably hit a wall?
I mean, I don't know, read the literature? Like there's...
These people are not going to read the literature. They listen to podcasts specifically,
so they don't have to read the literature.
Yeah, fair.
Well, I could point to specific parts of the literature,
like benchmarks, for example, and the trends on them.
So I would say the benchmarks used to be terrible,
but they're actually becoming a lot better.
METR in particular has these agentic coding benchmarks
where they actually give AI systems access to some GPUs
and say, have fun, you have like eight hours
to make progress on this research problem, good luck.
And then they measure how good they are
compared to human researchers given the same setup.
And, you know, line goes up on the graph.
It seems like in a year or two, they'll have AIs
that are able to just
autonomously do eight-hour long ML research tasks,
you know, on these sorts of things.
And that's not AGI, that's not superintelligence,
but that is maybe the first milestone that I was talking about, superhuman coder, right?
So I point to those sorts of trends.
And then separately, I would also just do the appeal to authority.
Like, if you're not going to read the literature, if you're not going to look at the...
If you're not going to sort of form your own opinion about this,
and you're still just deferring to what other people think, well, then I will say,
yeah, there's a bunch of naysayers out there who are saying,
this is all never going to happen, it's just fantasy.
But also, there's a bunch of extremely credible people with amazing track records,
both inside the companies and outside the companies,
who are, in fact, taking this extremely seriously.
Yeah, I also want to read you-
Including our scenario.
Like, you know, Yoshua Bengio, for example,
read an early draft of our thing and liked it
and gave us some feedback on it.
And then we put a quote from him at the top saying,
yeah, everyone should read this, it's plausible.
Right, so he's a pioneering AI researcher.
Yeah.
Another genre of criticism I've heard of this forecast
is from people who just don't,
who are just questioning the idea that if you get AIs that are superhuman at coding,
they will kind of be able to bootstrap their way to general intelligence.
And I just want to read you a quote from an email that I got from David Autor,
who is a very well-known economist at MIT. I had asked him to look at
the scenario and react to it.
With this particular eye on like,
what might this be missing as far as how it assumes
this easy and fast jump from
superhuman coding to something like AGI.
I'll just read you what he said. He said,
LLMs and their ilk are super powered incarnations of one
incredibly important and powerful part of our cognition. The reason I say we're not
on a glide path to AGI is that simply taking this capability to 11 does not substitute
for the parts that are still missing. I think that humanity will get to AGI eventually.
I'm not a dualist. I just don't believe that swimming faster and faster allows you to fly.
What is your reaction to that?
I agree.
We depict this in the course of the story.
So if you read AI 2027, they have something
that's like LLMs, but with a lot more reinforcement learning
to do long horizon tasks.
And that is what counts as the first superhuman coder.
So it's already somewhat different
from the systems of today, but it's still broadly similar.
It's still sort of maybe the same fundamental architecture,
just a lot more training, a lot more scaling up,
and in particular, a lot more training specifically
on long-horizon agentic coding tasks.
But that's not itself AGI, I agree.
That's just the superhuman coder that you get early on.
And then you have to go through several more
like paradigm shifts to get to actual super intelligence.
And we depict that happening over the course of 2027.
So a key thing that I think that everyone needs
to be thinking about is this takeoff speeds variable.
How much faster does the research go
when you've reached the first milestone?
And how much faster does the research go
when you reach the second milestone and so forth?
And we are, of course, uncertain about this,
like we are about many things.
We say in this scenario that we could easily imagine
it being five times slower than we depict
and taking sort of like five years instead of one year,
but also we could imagine it being five times faster
than we depict and taking like two months, you know.
So we wanna do a lot more research on that obviously.
If you wanna know where our numbers are coming from,
go to the website; there is a tab that you can click on that has a bunch of sort of
back-of-the-envelope calculations and little mini essays where we generated
the quantitative estimates that are the skeleton of the story.
One other piece of criticism I've seen of this project that I wanted to ask you about
was from a researcher at Anthropic named Saffron Huang who argued on X that she thought that
your approach in AI 2027 was highly counterproductive.
Basically, that you were in danger of creating
a self-fulfilling prophecy by making
these scary outcomes very legible,
by burying some assumptions,
that you were essentially making the bad scenario
that you're worried about more likely to actually happen.
What do you make of that?
I'm quite worried about that as well.
And this is something we've been like fretting about
since day one of the project.
So let me just say a little bit more about that.
So first of all, there is a long history
of this sort of thing seeming to happen
in the field of artificial general intelligence
research, most notably Eliezer Yudkowsky, who is the sort of, like, I don't know, ur-father
of like worrying about AGI, at least in this generation.
People, you know, Alan Turing also worried about it, but like Sam Altman specifically
tweeted, you know this tweet?
Yeah, Sam specifically said, like, hats off to Eliezer Yudkowsky for raising awareness about AGI. It's happening much faster now because of his
doomsaying because it's caused a bunch of people to like pay more attention to the possibility and to
like, you know, start investing in these companies and so forth. So it was sort of, like, I don't
know, twisting the knife at him, because he obviously doesn't want this to happen faster. He thinks we
need more time to prepare and make it safe and so forth. But it does seem like there's been this effect where
people talking about how powerful and scary AGI could be has maybe caused it to come a little bit
faster and cause people to like wake up and race harder towards it. And similarly, I'm worried about
causing something like that with the AI 2027. Like I, one of the like subplots in AI 2027
is this whole like concentration of power issue
of like who gets to control the army
of super intelligences, right?
And in the race ending, it's sort of a moot question
because the army of super intelligences
is just pretending to be controlled.
And so it's not actually listening to anyone when it counts.
But in the slow down ending, they do actually align the AIs.
And so they are actually
going to do what they're told. And then who gets to say that, right? And the answer in our slowdown ending is the oversight committee, which is this ad hoc group of people that is some CEOs and the
president who get together and share power over the army of superintelligences. But what I would
like to see is something more democratic than that, something where the power is more distributed.
I'm also afraid that it could be less democratic than that.
At least we get an oligarchy with this committee,
but it could very easily end up a dictatorship
where one person has absolute control over the Army
of Superintelligences.
This is yet another example of how
I'm trying to not have the self-fulfilling prophecy happen.
I don't want people to read this and be like, hmm, I'm a CEO.
I can make a lot of money by building the misaligned AI.
You know, maybe. Yeah.
So... but all that being said.
Yes. Any of our evil villain listeners out there,
steepling your fingers in your lair under a mountain:
Knock it off.
Yeah. So, all that being said, we are taking a gamble
that sunlight is the best disinfectant.
Like the best way forward is to just generally tell
the world about what we think is coming
and hope that even though many people will react to that
in exactly the wrong ways, enough people will react to that in the right ways that overall it will be good.
Because I am tired of the alternative of like,
hush, hush, keep everything secret, do backroom negotiations, and hope that we get like
the right people in the right rooms at the right time, and that they make the right decisions.
I think that that is kind of doomed. So I'm sort of placing my faith in humanity,
and telling it as I see it and hoping that
insofar as I'm correct, people will wake up in time and, you know, overall that the outcome
will be better.
Yeah.
All right.
Thank you, Daniel.
Thanks, Daniel.
Thank you so much.
When we come back, Meta decides to fake it till they make it.
We'll talk about the cheating scandal that is rocking the world of AI benchmarks.
Well, Casey, there's one other big AI story we want to talk about this week, and that is about the drama surrounding Llama.
That's right, Kevin.
Meta has a new large language model.
It was hotly anticipated, but I think it's fair to say it kind of stumbled out of the
gate.
Yeah, they had some llama llama red pajama drama.
How many times are you going to do the llama drama pun?
Well, there's a very popular children's book called Llama Llama Red Pajama.
Are you aware of this?
I am.
So let's get into it.
There has been a lot of things going on around this new language model, Llama 4, that Meta
released last weekend.
Casey, you've been writing about this in your newsletter this week.
Catch me up.
What is going on with Llama 4?
Yeah.
So look, Meta has invested billions and billions of dollars in AI, and they're taking a very
different approach from the AI labs that we most often talk about on this show.
Companies like OpenAI, Anthropic, and Google, their models are closed.
You can't sort of download, fine-tune, and re-release them under a sort of very permissive license.
But with Meta's, you can.
And when Llama 3 came out last year, developers said,
oh, this thing is actually like pretty good.
Like it's not as good as the state of the art,
which is often true of the open models,
but it's getting up there.
Right.
And so they spent all this money to develop Llama 4.
People have been talking for months about how this was gonna
sort of blow all the other open weights models
out of the water.
And then they release it,
and what happens?
Well, two things happen, Kevin.
The first is that Meta trumpets this model
in the way that companies usually do trumpet
their most recent models as being the most powerful ever,
the most efficient.
They show off a bunch of benchmarks.
They say, this thing is highly capable,
and it's the bee's knees.
They didn't actually say it was the bee's knees.
I'm not sure anyone has said that in the past 70 years,
but they said things like that.
And one of the benchmarks
that really got people's attention was LM Arena.
You know LM Arena?
I know of it, but I haven't spent much time on it.
What is it?
So it's this really interesting project.
It is a very small nonprofit that includes some researchers from UC Berkeley.
And what they do is they get people to volunteer to help, and they'll have people enter a query,
and then they'll show them the response from two different chatbots that are not labeled.
And after they get the answer, the user will say,
oh, I liked this one better.
And they collect those votes over time.
And the more that people vote for one chatbot over another,
the higher it rises on LM Arena.
I see.
So it's sort of like a crowdsourced leaderboard
for which of these models people prefer.
Exactly.
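For readers curious how blind pairwise votes like these become a ranking, here is a minimal Elo-style sketch. LM Arena's actual methodology is more statistically careful than this, and the K-factor, model names, and sample votes below are illustrative assumptions, not real data.

```python
# Minimal sketch: turning anonymous pairwise votes into an Elo-style
# leaderboard. Illustrative only; LM Arena's real ranking method differs.

K = 32  # update step size (assumed)

def expected_win(r_a: float, r_b: float) -> float:
    """Modeled probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Update both ratings after one user prefers `winner` over `loser`."""
    surprise = 1.0 - expected_win(ratings[winner], ratings[loser])
    ratings[winner] += K * surprise
    ratings[loser] -= K * surprise

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
# Hypothetical stream of (winner, loser) votes from blind comparisons.
for w, l in [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]:
    record_vote(ratings, w, l)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

The upshot is that a model rises only by winning head-to-head preference votes, which is exactly what makes the leaderboard attractive to game.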
And Kevin, you know as well as anyone else,
that whenever a new model comes out,
the question of how good is it
turns out to be weirdly hard to answer, right?
Maybe it's really good for what you need it to do.
Maybe it's really bad.
Or maybe it's about as good as something else,
but you just happen to like it better
because it has a style that matches
with what you're looking for.
So in such a world, companies are desperate
to be seen as good,
but they don't have an easy way of communicating that.
And that's when LM Arena enters the picture.
Because if you can get high enough on that leaderboard,
you can point to it and say,
aha, look at how we're doing.
Right, the people have voted.
That's right, the people have spoken
and look how well we're doing.
So do you know how well Llama 4 does on LM Arena?
No.
Llama 4 comes in at number two,
just under Gemini 2.5 Pro Experimental,
which is the latest model from Google,
which has been through a lot of testing
and which has won basically universal acclaim.
People think this is a truly great model,
not just at this little chatbot contest,
but across a bunch of other things,
including coding.
So Llama 4, sort of immediately zooming up
to number two on LM Arena,
would seem to indicate that Meta has really cooked here.
They have built this incredible model.
They are releasing it to the public
under an open weight structure.
And they are one of the leading AI labs when it
comes to creating very powerful models.
That's right, except there's an asterisk.
This version of Llama 4 is an experimental model.
Meta on its website says it has been optimized for chat.
People start to look into this.
They notice this is not the version of Llama 4 that
is actually available for download.
The one that was included in LM Arena was not the one that people could download?
That's right. It had a different name. It was named Maverick 03-26
Experimental. And people start to think, oh, wait a minute. What if what happened here isn't
what normally happens on LM Arena, which is people make a new model and submit it to
LM Arena and see how it does? What if Meta trained a special version of Llama 4 just
to be good at LM Arena? Now, I have spent the past week trying to research whether this is true. And on Monday, I got Meta to send me a statement, which I guess I should read.
We experiment with all types of custom variants, and this experimental version is,
quote, a chat-optimized version we experimented with that also performs well on LM Arena.
We have now released our final open source version
and we will see how developers customize Llama
for their own use cases.
So this was really interesting to me
because when they say,
well, it also performs well on LM Arena,
it suggests that, well, maybe they just made like,
I don't know, 15 of these models
and they were just like,
oh, look, this one happens to do well on LM Arena.
That was like one possibility.
I think another possibility is exactly
what the cynics think, which is, oh no,
they sort of reverse-engineered how LM Arena works,
and they built a bot that was just gonna beat it.
And how would you do that?
Like if your goal was to create a model
that would perform very well
on this one specific leaderboard,
what would you do?
So, LM Arena has released a lot of chats over the years that sort of show which chats are
considered preferable to other chats.
And it seems that the users of LM Arena really like it when the bot has a high degree of
what they call sycophancy.
So basically, you're just like, what should I have for breakfast today?
And the chat bot is like, oh my God, that's such a great question.
You're a genius.
I love the way you're starting the day off, right?
That is the kind of answer that people pick.
And so you can build a chatbot that essentially just flatters people constantly, and it tends
to do really well on Chatbot Arena. So anyways, in the aftermath of this confusion, LM Arena,
which is a very sort of mild-mannered organization that I think is not used to being involved in
public controversies, puts out a statement. And I have to read the statement, Kevin, because as
gentle as it is, I found it pretty damning. They don't go so far as to say Meta cheated.
But what they do say is quote,
Meta's interpretation of our policy
did not match what we expect from model providers.
Meta should have made it clear
that this experimental model was a customized model
to optimize for human preference.
As a result of that,
we are updating our leaderboard policies
to reinforce our commitment to fair reproducible evaluations
so this confusion doesn't occur in the future.
So why is that statement so interesting to me?
Well, you basically just have this tiny group of researchers
over at Berkeley, and Meta violates their policies so hard
that they have to change the rules for how the competition even works
just to get people to stop breaking the competition.
Yeah, I thought this was a really interesting
set of stories.
I'm still waiting for someone, ideally you,
to get to the bottom of like,
what actually happened inside Meta.
But I think it's worth talking about for two reasons.
One, because I think it says something about Meta
and its place in the AI race.
And the other, because I think it says something
about the state of AI and these benchmarks
and how useful they are or aren't in making sense
of the torrent of new models that are coming out constantly
from the big AI labs.
So maybe let's take those one by one.
What do you think this says about Meta's place
in the AI race if it does turn out
that they had sort of gamed this leaderboard
to make it look like their model was better than it was?
Here's what I think.
I think if you're winning the AI race,
you do not waste time trying to beat LM Arena, right?
What you do is what Google
did, which is just release a very powerful pro version of Gemini. And it just happens
to float to the top of the arena, not because it's been optimized for conversation, but
just because it's a great model that's really good at a lot of things. If you have to make
a custom version of your model just to win this rinky-dink competition, it's hard
for me to think of a more adverse indicator for the quality of Meta's AI program. And we should say there's been reporting in
The Information over the past year that the Llama 4 development process has been really frustrating
for Meta, that they delayed the release twice because they weren't getting the results that
they wanted. And when it finally did come out and people started to put it through other evaluations,
they found that it just was not hitting the mark.
In fact, Kevin, Ethan Mollick, a former guest on Hard Fork, compared the responses from the
experimental model that was winning the leaderboard to the ones produced by the final
open-weights model.
And what he found was the open-weights model was
producing really bad responses. Essentially, the optimized model was
performing so much better than the real one that it wasn't even close. So why didn't
they just release the optimized model, then? That's a great question. I don't
know the answer to that, but what I'm going to assume is that whatever fine-tuning is necessary
to increase the level of sycophancy in the bot
might be great for this sort of competition,
but maybe it's really bad for coding or creative writing
or the countless other things
that we now expect LLMs to be good at, right?
You know, fine-tuning is a very powerful process
that can take a very general purpose model that's kind of mediocre at a bunch of things and make it really good at one thing.
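To make that concrete, here is a hedged sketch of what narrow supervised fine-tuning looks like in practice: you keep training a general model on examples of the one behavior you want more of, in this case responses human voters preferred. The data and the model interface are hypothetical placeholders (the `.loss` convention follows the popular transformer libraries), not anything we know about Meta's actual pipeline.

```python
# Hedged sketch of narrow supervised fine-tuning: the same next-token
# objective used in pretraining, applied to a small, targeted dataset.
# All data below is a hypothetical placeholder, not Meta's pipeline.
from torch.optim import AdamW

def fine_tune(model, tokenizer, preferred_pairs, epochs=1, lr=1e-5):
    """Fine-tune a causal LM on (prompt, preferred_response) pairs.

    Assumes the Hugging Face-style convention that calling the model
    with `labels` returns an object carrying a cross-entropy `.loss`.
    """
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for prompt, response in preferred_pairs:
            batch = tokenizer(prompt + response, return_tensors="pt")
            out = model(**batch, labels=batch["input_ids"])
            out.loss.backward()  # gradient of next-token cross-entropy
            optimizer.step()
            optimizer.zero_grad()
    return model

# e.g., pairs mined from published arena votes, flattering answers only:
# fine_tune(model, tokenizer, [("What should I eat?", "Great question! ...")])
```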
But these days, people have a lot of options to choose from with their large language models.
And there are a lot of them that just have very high general capability.
So they're going to use those instead. Yeah. I mean, I have not done my own reporting on
the situation inside Meta with Llama 4, but
I will just say from a broad view, if you just step back from this particular scandal,
Meta is not one of the top three AI labs in America when it comes to releasing Frontier
models.
They are not in the top tier of Frontier AI research.
A lot of their key researchers have left the company.
Their models are not seen as capable as the models
from OpenAI, Anthropic, and Google DeepMind.
And I think that really frustrates them, right?
I think Mark Zuckerberg and his lieutenants,
they really want to be seen as part of the vanguard here.
And so I would not be surprised at all
if in an effort to kind of juice their numbers
and appear to be leapfrogging some of their competition,
they may have violated the terms
of one particular AI benchmark.
And that should make us question
how well their overall AI program is doing.
Absolutely, and by the way,
the next time they release a model
and come out with a bunch of wild claims,
like you think I'm gonna believe any of that?
No, it's like you're gonna have to go, you know, try to verify every single claim they make independently.
Yeah, look, I assume some people are gonna hear this
and think that I'm making a mountain out of a molehill.
But I just think about what Daniel Kokotajlo just told us
about how powerful these systems are becoming
and about how powerful they're about to become.
And you want them to be like sort of loyal to human beings,
but you also want them to like not be used for bad behavior.
And like, if there is a company out there
that is just like cheating to win benchmarks,
what else can that model do?
So even though this may seem like a small thing,
I think it matters that we have companies building AI systems
where we have some level of trust in those companies,
where we believe they have some amount of integrity
when it comes to how they operate.
And so this was a moment where I thought, wow, my trust in Meta as an AI company has just been
dramatically reduced. Yeah. So the Meta of it all aside, I think this does actually raise a really
important question about the broader AI industry, which is the value of benchmarks in general.
Because one thing that I've heard from AI researchers over the past year or two is that
these benchmarks,
these tests that are given to these models
to figure out how intelligent they are,
they all have some flaw built into them, right?
There's this issue of data contamination,
which is what if some of the answers on these tests
are being fed into these models
during their training process
so that you're really not getting a sense
of how capable the model is.
They're just kind of regurgitating these answers that they've sort of seen already. That is an issue.
There's also just the issue that all these companies are effectively grading their own homework, right?
There's no like federal program that sort of puts these things through their paces and releases like standardized benchmark scores
that we can actually verify and trust. Some of these AI companies are using different methods
to even apply these benchmark tests.
There are these things called consensus@64
and all these different ways that you can kind of cherry-pick
like the best answer that your model gives
if you give it the test a bunch of times
and use that for your score.
So I think we are just losing our ability to trust
the way that we measure these AI models in general.
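To illustrate that worry, here is a toy sketch contrasting a single-shot score with a consensus@k score, where you sample many completions and grade the majority answer. Everything in it is a hypothetical stand-in; consensus sampling is a legitimate technique in itself, and the problem described above is reporting a consensus@64 number alongside other labs' single-shot numbers.

```python
# Toy sketch: single-shot accuracy vs. consensus@k accuracy on the
# same multiple-choice benchmark. The "model" is a random stand-in.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Stand-in for one stochastic model completion."""
    return random.choice(["A", "B", "C", "D"])

def single_shot(questions: list[str], key: dict[str, str]) -> float:
    """One attempt per question: the apples-to-apples headline number."""
    return sum(sample_answer(q) == key[q] for q in questions) / len(questions)

def consensus_at_k(questions: list[str], key: dict[str, str], k: int = 64) -> float:
    """Sample k completions per question and grade the majority answer.

    Quoting this figure next to competitors' single-shot scores is the
    kind of cherry-picking described above."""
    correct = 0
    for q in questions:
        votes = Counter(sample_answer(q) for _ in range(k))
        majority = votes.most_common(1)[0][0]
        correct += majority == key[q]
    return correct / len(questions)
```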
Yeah, and it's so frustrating.
I was thinking, Kevin, imagine in the early 2010s,
and it's not just that Instagram comes out
as an app in the app store.
You have Instagram, you have Instagram o1,
you have Instagram o1 mini,
you have Instagram o1 deep research,
and it's like, download the one that's best for you.
You'd be like, why are you making me do any of this?
Right?
Like just give me the one thing that works.
And while every AI lab is trying to realize that, in the meantime, we're
living through this Cambrian explosion of large language models.
And on one hand, I think that makes it really important for there to be
benchmarks so that we can look at a glance to have a basic sense of, is this thing even worth my time?
But on the other hand, that makes the benchmarks such an attractive target for gaming and outright
cheating.
And so that's why the researcher Andrej Karpathy has said that we have what he calls an evaluation
crisis, where when a new model comes out, the question of how good it is is just very difficult to answer. I've been wondering what we can do as journalists
to try to answer those questions better. Like is this a place for journalists to
actually say, okay, new model came out, we're gonna have our own custom set of
evaluations, maybe we're gonna keep those private in some way to prevent them from
being gamed. But what solutions do you see here to this crisis?
Well, at the risk of scooping myself here,
I will disclose that I am actually starting to work
on my own benchmark, because I think that part of
how we are going to make sense of these AI models
is that people will just start developing
their own set of tests to give to new models.
Not necessarily to determine like their overall intelligence,
but to determine how good they are
at the things we care about.
Personally, I don't care much if an AI model
is getting a 97% on the graduate level physics exam
or a 93%, right?
That does not make a huge difference in my life.
Because it's still higher than you're gonna get.
Exactly, and I am not a graduate level physics researcher.
So I might care more about whether a model is good
at creative writing or not.
And I might want a battery of tests to determine that.
And so I think that as these things become more critical
in people's lives and work,
we will start seeing these more personalized tests
and evaluations that
actually measure if the models are good at the things that we care about.
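As a sketch of what that might look like in practice, a personal benchmark can be as simple as a list of prompts with your own pass/fail graders. Everything below is hypothetical: the tasks, the grading rules, and the `ask_model` stub you'd wire up to whichever model you're testing.

```python
# Minimal sketch of a personal eval harness: your own tasks, your own
# pass/fail criteria. All tasks and graders here are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompt: str
    grade: Callable[[str], bool]  # your personal pass/fail rule

def ask_model(prompt: str) -> str:
    """Stub: wire this up to whichever model API you're evaluating."""
    raise NotImplementedError

TASKS = [
    EvalTask(
        name="invoice-help",
        prompt="A subscriber asks how to download their invoice. Draft a reply.",
        grade=lambda reply: "invoice" in reply.lower() and len(reply) < 800,
    ),
    EvalTask(
        name="creative-writing",
        prompt="Write a four-line poem about hanging pictures.",
        grade=lambda reply: len(reply.strip().splitlines()) == 4,
    ),
]

def run_suite(tasks: list[EvalTask]) -> dict[str, bool]:
    """Run each task once; keeping prompts private helps prevent gaming."""
    return {task.name: task.grade(ask_model(task.prompt)) for task in tasks}
```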
What do you think?
Yeah, I think that's a great point.
And after you told me that you were going to do this, I sort of started to scheme and
thought, you know, I want my own benchmarks too, because, I don't know, I'm
sure I can come up with a list of like 10 things that I wish AI could do for me today
that it still can't.
And so maybe it's time that I should start scenario planning.
What's one of your tests that you want to give AI models
to determine if they're capable or not?
Well, like for example,
I have a newsletter that has customer service issues.
People email us, they say, oh my gosh,
can I change my email address?
People say the writing in this is so bad.
People love the writing.
That's all I hear about the writing.
People are saying, are humans even writing this?
That's insane.
But I would love to be able to automate some of that, you know,
make it easier for people.
So, oh, you need to download your invoice,
which is a question we get a lot.
It's like, okay, yes,
actually we're just gonna sort of handle that
in an automated way.
So that's like just one very easy thing.
And you know, if you're thinking,
oh, Casey, I actually have a product
that can already do that for you,
please don't email me.
It can't.
I've been through this.
Can I tell you one of the things that I wanna test AI on?
So as you know, I just moved into a new house.
As a result, I have spent like between a third
and half of my waking hours over the last few weeks
thinking about hanging pictures.
Hanging pictures is one of my least favorite tasks
in the world.
You have to do math, you have to bring out the laser level.
I mean, it's a huge process.
The golden ratio.
Yes, and I would love for an AI system
to be able to hang pictures for me.
That's beautiful.
And as soon as that happens, to me, that's AGI.
Now, would that involve a robot?
Probably, yeah.
So we gotta make some progress before we get there.
But if you're listening to this,
and you're working at one of these robotics companies,
get on it. Hard Fork is produced by Rachel Cohn and Whitney Jones.
This episode was edited by Matt Collette and fact-checked by Ina Alvarado.
Today's show is engineered by Chris Wood.
Original music by Rowan Niemisto and Dan Powell.
Our executive producer is Jen Poyant.
Video production by Sawyer Roque, Pat Gunther, and Chris Schott.
You can watch this whole episode on YouTube at youtube.com slash hardfork.
Special thanks to Paula Szuchman, Pui-Wing Tam, Dahlia
Haddad, and Jeffrey Miranda. You can email us at hardfork@nytimes.com with your AI doomsday
scenario. Thanks for watching!