Big Technology Podcast - Is AI Scaling Dead? — With Gary Marcus
Episode Date: May 7, 2025
Gary Marcus is a cognitive scientist, author, and longtime AI skeptic. Marcus joins Big Technology to discuss whether large‑language‑model scaling is running into a wall. Tune in to hear a frank debate on the limits of “just add GPUs" and what that means for the next wave of AI. We also cover data‑privacy fallout from ad‑driven assistants, open‑source bio‑risk fears, and the quest for interpretability. Hit play for a reality check on AI’s future — and the insight you need to follow where the industry heads next.
---
Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice. Want a discount for Big Technology on Substack? Here’s 25% off for the first year: https://www.bigtechnology.com/subscribe?coupon=0843016b
Questions? Feedback? Write to: bigtechnologypodcast@gmail.com
Transcript
Is the AI field reaching the limits of improving models by scaling them up?
And what happens if bigger no longer means better?
That's coming up with AI critic Gary Marcus right after this.
Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation of the tech world and beyond.
We're joined today by AI critic Gary Marcus, the author of the book Rebooting AI and Marcus on AI on Substack,
and he's here to speak with us about whether the AI industry is hitting the limits,
of scaling generative AI models up
and what it means if we're truly seeing
diminishing returns from making these models bigger.
Gary, it's great to see you.
Welcome to the show.
Thanks for having me.
So the genesis of this episode is that I did an episode
with Mark Chen from OpenAI about GPT 4.5
and you come into my DMs and you say,
listen, I want to give a rebuttal.
Scaling is basically over
and it's not exactly what Open AI has said.
Now, for those who don't know about the scaling laws,
Basically, the idea is that the more compute and data you put into these large language models, the better they're going to get, basically predictably, linearly.
Well, exponentially was the idea.
Right.
And so the context here is now we've seen almost every research house all but admit that that has hit the point of diminishing returns.
I think Mustafa Suleyman was here.
He pretty much admitted it.
Thomas Kurian, the CEO of Google Cloud, said that diminishing returns are happening.
Yann LeCun has also talked about the fact that you're just not going to see as many returns
from AI scaling as you did before.
So just describe the context of what we're seeing right now, how big of a deal is it?
And then what are the implications for the AI industry?
Because this is the big question.
I mean, how much better can these things get, right?
That is the big question with AI today.
Well, I mean, I have to laugh, because I wrote a paper in 2022 called Deep Learning Is Hitting a Wall.
And the whole point of that paper is that scaling was going to run out, that we were going to hit diminishing returns. And everybody in the field went after me, a lot of the people you mentioned. I mean, LeCun did, Elon Musk went after me by name, Altman did. They all, like, Altman said, give me the strength of a mediocre deep learning skeptic. So people were really pissed when I said that deep learning was going to run out. So it's amazing to me that a bunch of people have conceded that these scaling laws are not working the way they used to, and they're also doing a bit of backpedaling.
I think that Mark Chen interview, I can't quite remember the details, but I think it was a version of backpedaling and redefining things.
So if you go back to 2022, there were these papers by Jared Kaplan and others at OpenAI.
And they said, look, we can just mathematically predict how good a model is going to be from how much data there is.
And then there were the so-called Chinchilla scaling laws.
And everybody was super excited.
And basically, people invested half a trillion dollars assuming that these things were true.
You know, they made arguments to their investors or whatever.
They said, if we put in this much data, we're going to get here.
And they all thought that here in particular was going to mean AGI eventually.
And what happened last year is everybody was disappointed by their results.
So we got one more iteration of scaling after 2022 that worked really well, and we call that GPT-4 and all of these models that are sort of like that. So I wrote that paper around GPT-3, and we got another iteration of scaling. So GPT-3 was scaling compared to GPT-2, and it was much better. GPT-2 was scaling compared to GPT-1, and it was much better. So much better meant, sorry, much more data meant much better. But what is much better? Well, I mean, one way to think about it is you didn't need a magnifying glass to see the difference between GPT-2 and, well, we didn't call it GPT-1, but the original GPT. And you didn't need a magnifying glass for GPT-4 as opposed to GPT-3. It was just obviously better. A lot of people thought that we would pretty quickly see GPT-5, and a lot of people raced to build it. So OpenAI tried to build GPT-5, and they had a thing called Project Orion, and it actually failed and eventually got released as GPT-4.5. So what they thought was going to be GPT-5 just didn't meet expectations. Now, they could slap any name on any model they want, and in fact lately nobody understands how they're naming their models. But they haven't felt like any of the models that they've worked on since GPT-4 actually deserve the name GPT-5. And it didn't meet the performance that
these so-called mathematical laws required. And what I said in that paper is they're not really
mathematical laws. They're not physical laws of the universe like gravity. They're just generalizations
that held for a little while. Like a baby may double in weight every couple of months early in its life. That doesn't mean that by the time you're 18 years old you're going to be 30,000
pounds. And so we had this doubling for a while, and then it stopped, and we can talk about
why. But the reality is it's not really operative anymore. So there's been efforts to kind of
misdirect and shift direction. So I think everybody in the industry quietly or otherwise
acknowledged that, hey, we're not getting the returns that we thought anymore. And nobody's
been able to build a so-called GPT-5-level model. That's a big deal, right? I'm a scientist, or I was originally a scientist. And as a scientist, we have to pay attention to negative results as well as positive results. So when 30 people try the same experiment and it doesn't work, nature is telling you something. And everybody tried the experiment of building models that would be 10x the size of GPT-4, hoping to get to something they could call GPT-5, something that would be a quantum leap better than GPT-4. They didn't get there. So now they're talking about scaling inference time compute. That's a different thing.
Before we get there, I just want to talk to you about, I want to test your theory here. So it's not that scaling is over, right? I don't think anyone that we're talking about says scaling is over. Basically, what they're saying is if you want to make
the model better and I think that means more intelligent, more conversational, even more personable,
you can still do it by scaling. I think what they admit, the thing that they admit,
though, is that it takes much more compute and much more data to get the same results that you
would in the previous loops. So let's clarify two things. One is that what people talked about with scaling originally was a mathematically predictable relationship between performance and amount of data. You can go back and look at the Chinchilla paper, the Jared Kaplan paper, and lots of things that were posted on the internet. There were papers saying, or t-shirts saying, scale is all you need. You looked at that t-shirt, and it had equations from the Jared Kaplan paper, and it said, you know, here's the
exponent, you can fit the equation. If you have this much data, this is the performance you're
going to get. And there were a bunch of papers, a bunch of models that actually seemed to fit
that curve, but it was an exponential curve. And what's happening now is, yeah, you add more data,
you get a little bit better, but you're not fitting that curve anymore. We've fallen off the
curve. That's what it really means to say that scaling isn't working anymore. You know, if I drew a curve for you, it was going up and up and up really fast, and now it's not going up as a function of how much data you had, or how much compute you had.
So we added a bunch of compute, and you got this much better performance.
And this is how people justified running these experiments that cost a billion dollars,
is they're like, I know what I'm going to get for the billion dollars.
And then they ran the billion dollar experiments, and they didn't get what they thought
they would.
Yeah, you get a little bit better, but that's what diminishing returns means.
Diminishing returns means you're not getting the same bang for your buck as you used to.
That's where we are now.
So anytime you add a little piece of data, the model is going to do better on that piece of data.
But the question is, does it generalize and give you significant gains across the board?
And we were seeing that, and we just aren't anymore.
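For reference, the scaling fits being discussed here (the Jared Kaplan paper, the Chinchilla paper) are parametric power-law curves. A sketch of the Chinchilla-style functional form, with the fitted constants left abstract, looks roughly like this:

```latex
% Chinchilla-style parametric loss fit (a sketch; the constants E, A, B, \alpha, \beta
% are fitted to experiments and omitted here).
% N = number of model parameters, D = number of training tokens.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% "Falling off the curve" means newer, bigger runs stop tracking the loss this
% fit predicts as N and D keep growing.
```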
So is there still a path for these models to become much more performant?
I mean, let's say you do supersize these clusters to the point that they are insanely bigger than they were previously.
Let's talk about, like, Elon Musk's one million GPU cluster.
Well, and let's look at what Elon got for his money, right?
So he built Grok 3, and by his own testimony, it was 10 times the size of Grok 2.
It's a little better, but it's not night and day, right?
Grok 2 was night and day better than the original Grok.
GPT-4 was night and day better than GPT-3.
GPT-3 was night and day better than GPT-2.
Grok 3 is like, yeah, you can measure it, you can see that there's some performance gain.
But for 10x the investment of data and compute, not to mention the cost of energy to the environment,
it's not 10 times smarter by any reasonable measure.
It just isn't.
Okay.
And so this would be the point where I say, well, then this entire AI moment is done.
However...
Well, that's this moment.
There will be other AI moments, but this one...
I'm setting it up to say that it's not.
Not because, like you mentioned, you're talking about test time compute.
That's another way to say reasoning, I think, which is these models.
Well, I'm going to give you a hard time about that.
But I mean, people do do that.
But with reasoning or test time compute, you'll help me figure out the finer details.
What these models are doing is they're coming to try to find an answer and they're checking their progress
and deciding whether it's a good step or not and then taking another step and another step.
And we've seen that they have been able to perform much better when you put those reasoning capabilities on top of these large models, which has enabled these
research houses to continue the progress in some way.
I mean, let me give you, but it's not really you, it's these companies, some pushback on that.
So it is true that you can build a model that will do better if you put more compute on it.
But it's only true to some degree.
So I won't get into whether it's actually reasoning or not. But it turns out that on some problems, you can generate a lot of data in advance. And for those problems, adding more test time compute seems
helpful. There was a paper this weekend that's calling some of this into question.
By the way, just to explain to folks, test time is when the model is giving an answer.
That's what test time is. So you have these models now, like o3 and o4, that will sometimes
take like 30 seconds or five minutes or whatever to answer a question. And sometimes it's
absurd, because you ask it, like, what's 37 times 11, and it takes, you know, 30 seconds, and you're like, my calculator could have done it faster. But we'll put aside that absurdity. In some cases it seems like time well spent, sometimes not. But if you look carefully, the best results for these models are almost always on the same things, which are math and programming. And so when you look at math and programming, you're looking at domains where it's possible to generate what we call synthetic data, and to generate synthetic data that you know is correct. So for example, on multiplication,
you can train the model on a bunch of multiplication problems and you can figure out the answer
in advance. You can train the model on what it is supposed to predict. And so on these problems, in what I would
call closed domains where we can do verification as we create the synthetic data, we can verify that
the answer we're teaching the model is correct. The models do better. But if you go back and you look
at the o3, sorry, the o1 paper, even then you could already see that the gains were there but not across the board. They reported that on some problems, o1 was not better than GPT-4. It's only on other problems, these cut-and-dried problems with the synthetic data, that you actually got better performance. And I've now seen like 10 models and it always seems to be that way. We're still waiting
for all the empirical data to come in, but it looks to me like it's a narrow trick that works in some
cases. The amazing thing about GPT-4 is that it was just better than GPT-3 on almost anything you could imagine. And GPT-3, the amazing thing is it was better than GPT-2 on almost anything you can imagine. Models like o1 are not systematically better than GPT-4. They're better in certain use
cases, especially ones where you can create data in advance. Now, the reason I wouldn't call them
reasoning models, though you're right that many people do, is what I think they're doing
is basically copying patterns of human reasoning. They're getting data about how humans reason
certain things. But the depth of reasoning there is not that great. They still make lots of
stupid mistakes all the time. I don't think that they have the abstractions that we think,
for example, a logician has when they're reasoning. So it has the appearance of reasoning, but it's really just mimicry. And there are limits to how far that mimicry goes. I'll give you just one more example, which is that o3 apparently hallucinates more than the models that came before it.
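To make the closed-domain point concrete, here is a minimal sketch of what verifiable synthetic data generation looks like for multiplication, the kind of domain where every training example can be checked at generation time; the function and field names are illustrative, not taken from any lab's pipeline:

```python
import random

def make_multiplication_examples(n_examples: int, max_value: int = 999):
    """Generate training pairs for a closed domain with a built-in verifier.

    Because multiplication has a ground-truth checker, every synthetic example
    emitted here is guaranteed correct, which is the property being described
    for math and programming domains.
    """
    examples = []
    for _ in range(n_examples):
        a, b = random.randint(2, max_value), random.randint(2, max_value)
        prompt = f"What is {a} times {b}?"
        answer = str(a * b)  # the generator doubles as the verifier
        examples.append({"prompt": prompt, "answer": answer})
    return examples

# Open-ended tasks (essays, research summaries) have no such checker, which is
# why gains from this kind of training don't automatically spread across the board.
if __name__ == "__main__":
    for example in make_multiplication_examples(3):
        print(example)
```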
Which is stunning. How does that happen?
I mean, that's a good, broader question, which is, our understanding of these models is still remarkably limited. So the technical term, or one technical term. Interpretability. Well, I was going to give you a different one, which is black box. Okay. But they're closely related, those two terms. You need interpretability to figure out what's going on in the black box. If you can at all. I mean,
I'd almost put it another way, which is that black box, the thing in the plane that tells you what actually happened.
Well, that's a different thing, right? So a black box in a plane is actually a flight recorder that records a lot of data.
But what we mean in machine learning by black box is, you have a model where you have the inputs and you have the outputs.
You know how you calculate them, but you don't really understand how the system gets there.
So in this case, you're doing all this matrix multiplication.
Nobody really understands it.
And so nobody can actually give you a straightforward answer for why o3 hallucinates more than GPT-4.
We can just observe it.
That's what happens with black boxes, is you empirically observe things, and you say,
well, it does that, but you don't really know why, and you don't really know how to fix it either.
Another example, just in the last couple days, is apparently Sam Altman reported, I forget,
the new model is stubborn, or what was it, I forget?
No, it's not stubborn, it's a bro.
It's a bro.
But that's GPT-4o.
It's just like, it became very fratty.
And like, you would be like, what's going on, like, help me with this.
And it's like, yo, that's a hell of a good question, bro.
And they're like, we don't know why this happened.
And they rolled it back completely.
Yeah, exactly.
Or I thought they had partly rolled it back or whatever.
No, no.
Sam said it's now the latest iteration's been completely rolled back.
So, right, that was what I would call again empirical.
Like, they tried it out.
And it didn't work, or it worked in a way that irritated people, right?
And so we don't know in advance.
Like, there's a lot of just, like, try it, because that's how black boxes work.
And we have some things, but those things are not very strong.
So the scaling, quote, laws were empirical guesses about how these models work.
And they were true for a little while, which was amazing.
And they're not true anymore, which is also amazing in a way.
So we don't know what's going to happen from the black boxes.
Right.
Okay.
So let me now sort of.
And sorry, let me come back to one other thing quick, which is interpretability.
So that's a very closely related notion.
So let's say you look at a GPS navigation system.
That's a piece of AI that's very interpretable.
So you can say it is plotting this route.
It says, you know, you can go this way, you can go that way.
This is the function that it's maximizing.
This is the database it's using.
This is how it looks up the data.
We don't have any of that in these so-called black box models.
We don't really know what the database is that it's consulting.
It isn't exactly consulting a database at all.
And we don't know how to fix it.
And so, you know, Dario Amodei, who's the CEO of Anthropic.
We just talked about this on the show.
You actually praised his interpretability post.
That's right.
For interpretability.
I'll be honest.
I haven't read the paper yet.
I just read the title, so bad on me.
But the title of his paper was something like on the desperate need for interpretability.
That captures it.
And I think he's right.
I've said this too myself.
Like in my last book, I talked about interpretability being really important.
The only difference between Dario and me on this point is, we both think that we're screwed as a society if we stick with uninterpretable models.
He just thinks that LLMs will eventually be interpretable.
And his company, to be fair,
has done the best work on interpretability of LLMs that I'm aware of.
Chris Olah, I think, is brilliant.
But they haven't got that far.
They've gotten further than anybody else,
but I don't think we're ever going to get very far into the black box.
And so I think we need to start over
and find different approaches to AI altogether.
Right.
So, Gary, if I'm listening to what you're saying on this show so far,
it is basically after GPT4, we haven't made a lot of progress.
However, but let me just do the pushback here, which is, I mean, if you think about what
it's like using these models after GPT4, they are significantly better.
I'll give you one example.
I was using o3, this new reasoning model or test time model, whatever you want to call it.
And I just, I'm in it, and I'm doing crazy things, and it's exceptionally helpful.
So I put a photo of myself on a rock climbing wall and said, what's going on. And it, like, was able to look at the form, where my body was, what my posture was, and, like, analyze all these things and give actually helpful coaching tips, which you never would have had with GPT-4. Then you think about what Claude is doing,
the Anthropic bot. I was with some friends last night and this is what we do for fun.
I vibe coded a retirement calculator directly in Claude. It took like 10 minutes. We went from,
we took a bank statement. We got a line graph of the person's balances,
bar graph of their expenses, financial plan, and then we coded a retirement calculator based
off of the data that we had there.
And then you also have PhDs that are now adding their unique insights into these models
for training.
They just basically are sitting and writing down what they know and the model is absorbing it.
So we are seeing, I would call it, vast improvement over the GPT4 models.
So, I mean, there's a couple different ways to think about that.
So one is on a lot of benchmarks, there is improvements, but there's also issues of data contamination.
Alex Reisner wrote an excellent piece in The Atlantic about the issues of data contamination.
And we've seen a lot of studies where people are like, well, we tried it at my company, and it's not really that much better.
So they're better on the benchmarks.
Are they better in general?
Not so clear.
There was a new benchmark released by a company called Vals AI or something like that, that the Washington Post talked about yesterday, where they looked at things like, can you pull out a chart based on a series of financial statements, SEC statements, from a bunch of companies? And these systems all claimed to do it, but accuracy was under 10%. And overall, on this new benchmark, accuracy was at 50%. Were these new models better than GPT-4? Maybe, but they
weren't that good. So I think people tend to notice when they do well. They don't notice as much
when they do poorly. And although I think there's been some improvement, there has not been the
quantum leap that people are expecting. We have not moved past hallucinations. We have not moved past
stupid reasoning errors. If you go back to my 2022 paper, Deep Learning Is Hitting a Wall,
I didn't say there'd be no progress at all. What I said is we're going to have problems with
hallucinations. We're going to have problems with reasoning, planning until we have a different
architecture in some sense. And I think that that's still true. We're still stuck on the same kinds of things. So if you have your deep research write a paper, it's going to make up references. Okay. It's probably going to make up numbers. Like, you know, did you actually go back and check? So for example, what I think it's called, they all have similar names now, whatever Grok's version is, deep search, deep research.
Yeah, some, I don't know. Deep research mini 06. I won't be convinced that we have AGI
until these companies learn how to call deep research something other than deep research.
They all use the same exact name. It's really bizarre. So whichever version Grok has, I asked it, for example, to list all of the major cities that were west of Denver, and to somebody who wasn't paying attention it would be super impressive. But because I really wanted to know how well it was working, I checked, and it left out Billings, Montana, right? So you get a list that looks really good, and then there are errors. This often happens. And then I had a crazy conversation with it after that. I said, what happened to Billings? And it said, well, there was an earthquake there on February 10th or whatever. And I looked it up in the, you know, the seismological data. I used Google because I wanted to have a real source, or DuckDuckGo. And there was no earthquake then. And I pushed it on it and it said, well, I'm sorry
for the error or whatever. So we're still seeing those kinds of things. We may see them less,
but they are still there. We still have those kinds of problems. So I don't doubt that there's
been some improvement, but the quantum across the board that people were hoping for is not there.
The reliability is still not there. And there's still lots of subtle errors that people don't
notice. And then, you know, if you want to talk to me about retirement calculators, there are a lot of
those on the web. So the easy cases for these systems are the ones where the source code is actually
already there on the web. Like Kevin Roose talked about this example where he, quote, vibe coded a system to look in a refrigerator and tell him what recipe to make. But it turns out that app is already there on the web, and there are demos of that with source code. And so, like, if you ask
a system to do something that's already been done, that's always been true with all of these
systems.
That's their sweet spot, is regurgitation.
And so, yeah, they can build the stuff that's out there.
But if you want to code things in the real world, you usually want to code something
that's new.
And these systems have a lot of problems with that.
Another recent study, excuse me, showed that they're good at coding, but they're not good
at debugging.
And, like, coding is just the tiniest part of the battle, right?
The real battle is debugging things and maintaining the code over time.
And these systems don't really do that yet.
But, you know, search has made them more reliable.
When these bots are able to search the web and they are now starting to give you lots of links in the actual answers.
I still like get daily people sending me examples of, you know, it hallucinated these references.
I'm not saying hallucinations have been solved.
But for me, like, I will use it.
It's an incredible research assistant.
And then when it links out to things and I'm not sure of those figures, I'll then go to the primary sources and start reading.
I mean, good on you that you go to the primary source.
I worry the most about people who don't.
And we've seen countless lawyers, for example, get in trouble using these systems.
Has it been countless?
I just heard of one.
Oh, no, no, no.
There's many more than that.
There's some in the U.S., there's some in Canada.
I think there was just one in Europe.
I mean, it's not really countless, one could sit there and count them, but it's got to be at least a dozen by now.
And whether this is going to be, all right, I think we can both agree on this,
that whether this is the end of progress or towards the end of progress or whether there's a lot more progress,
there's a real problem of people outsourcing their thinking to these bots.
Well, Microsoft did a study, in fact, suggesting that critical thinking was getting worse as a function of them.
And that wouldn't be too surprising.
We have a whole generation of kids who basically rely on these bots and who don't really know how to look at them critically.
You know, in previous years, we were starting to get too many kids relying on whatever garbage they found on the web, basically.
And I mean, chatbots are basically synthesizing the garbage that they find on the web.
And so we're not really teaching kids critical thinking skills.
And nowadays, like, the idea for many kids of writing a term paper is, I typed in a prompt in ChatGPT and then maybe I made a couple edits and I turn it in.
You're obviously not learning how to actually think or write in that fashion.
A lot of these tools, I think, are best used in the hands of sophisticated people who understand their limits.
So, you know, coding has actually been, I think, one of the biggest applications. And that's because coders understand how to debug code. And so they can take
the system. Basically, it's just typing for them and looking stuff up. And if it doesn't work,
then they can fix it, right? The really dangerous applications are like when somebody asks for
medical advice and they can't debug it themselves and, you know, something goes wrong.
Okay. So I'm going to take into consideration all the things that you've said so far and see if I can
get a sense as to where you think we're heading. It seems like there was a
push to just make these models better based off of scale. That could be things like the 300,000 GPU cluster I think Meta used for Llama 4, or it could be the million-GPU cluster that Elon's built for Grok. And what you're saying is that's been maxed out pretty much.
Like no one's, hold on. I'll be more careful. It's not maxed out, but it's just diminishing returns.
There's diminishing returns. So the point that I'm trying to make here is you don't believe that
there's going to be anyone that's going to build a bigger GPU data center than that,
because if you're seeing diminishing returns from something that costs billions of dollars,
it doesn't make sense to invest.
Well, wait a second.
I'm not saying people are rational.
I think that people will probably try at least one more time.
They'll build things, you know, probably Elon will build something that's 10 times the size of Grok 3, which will be huge, and it will, you know, it will have a serious impact on the environment and so forth.
I just don't...
It's not just GPUs, it's also data, right? Like how much more data is there?
Let's come to the data separately in a second. So I think people will actually try. Right. I think Masa has just bankrolled Sam to try. I just don't think they're going to get that much for it. I don't think they'll
get zero. I mean, there will be tangibly better performance on certain benchmarks and so forth.
But I don't think that it's going to be wildly impressive. And I don't think it's going to knock
down the problems of hallucinations, bone-headed errors. So here's what I'm getting at. That's not going
to feel much better than what we have today. It doesn't seem like you believe that reasoning is
going to make the bot feel much better than we have today.
Not the kind of reasoning we're doing.
There's no emergent coding.
So are you basically saying that what we have in AI today, this is it?
For a while.
For a while, I guess.
I mean, look, I put out some predictions last year in March, that people can look up, that I had on Twitter. And those predictions included, I said there'd be no GPT-5 this year, or if it came out,
it would be disappointing.
It's supposed to come in summer.
Well, this was last year.
So I said in 2024, we won't see this.
And that was a very contrarian prediction at that point, right?
This was a few weeks after people had said, oh, I bet GPT-5 is going to drop at the Super Bowl, like right after the Super Bowl.
Won't that be amazing?
So people really thought it was going to come last year if you go back and look at, you know, what they said on Twitter, et cetera.
And it didn't.
And I correctly anticipated that it wouldn't.
And I said, we're going to have a kind of pile up where we're going to have a lot of similar models from a lot of companies.
I think I said seven to ten, which was sort of roughly right.
And I said we were going to have no moat because everybody is doing the same thing.
And the prices were going to go down, we were going to have a price war.
All of that stuff happened.
Now, maybe we get to so-called GPT-5 level this year, keeps getting pushed back.
I don't know if we'll get much further than that without some kind of genuine innovation.
And I think genuine innovation will come.
But what I think is we're going down the wrong path.
Yann LeCun used this notion of, you know, we're on the exit ramp. How does he say it, large language models are the off-ramp to AGI? You know, they're not really
the right path to AGI. And I agree with him. Or you could argue he agrees with me because I said
it, you know, for years before he did, but we won't go there. The broader notion is sometimes
we make mistakes in science. I think one of the most interesting ones was people thought
the genes were made of protein for a long time. So in the early 20th century, lots of people tried to figure out what protein a gene is made of. It turns out, it's not made of a protein. It's made of a nucleic acid that everybody now knows, called DNA.
So people spent 15 years or 20 years, like really looking at the wrong hypothesis.
I think that giant black box LLMs are the wrong hypothesis.
But science is self-correcting.
In the end, if people put another $300 billion into this and it doesn't get the results they want,
They'll eventually do something different.
Right.
But what you're forecasting is basically an enormous financial collapse because...
That's right. I don't think LLMs will disappear. I think they're useful, but the valuations don't make sense.
I mean, I don't see Open AI being worth $300 billion. And you have to remember that venture capitalists have to like 10x to be happy or whatever.
Like, I don't see them, you know, IPOing at $3 trillion. I just don't.
No, it's interesting because I almost see the Open AI valuation as the one that makes the most sense because they have a consumer app.
The place that I start to get, if what you're saying is correct that we're not going to see any more, if we're seeing real diminishing results from scaling and this is basically where we are, then there's real worry for companies like NVIDIA, which has basically risen on the idea of scaling.
I mean, they're down a third, a third this year or something.
Two point something, two point five trillion last.
They're a genuinely good company.
They have a wonderful ecosystem.
They're worth a lot of money.
I mean, I don't want to put in an exact figure, but I'm not surprised that they fell,
and I'm not surprised that they're still worth a lot.
No, but this is a thing.
If we end up seeing the fact that this next iteration, the $10 billion that Sam is going to spend seemingly on the next set of GPUs,
if that doesn't produce serious results, that's going to hurt.
That will cause a crash in Nvidia because so much of the company's demand is coming based up with this idea that scaling is going to work.
So they have multiple problems, both OpenAI and Nvidia.
So one is it does look to me like we're hitting diminishing returns.
It does not look to me like this inference time compute trick is really a general solution.
It doesn't look like hallucinations are going away.
And it does look like everybody has the same magic formula.
So everybody is basically doing the same thing.
They're building bigger and bigger LLMs.
And what happens when everybody's doing the same thing?
You get a price war.
So DeepSeek came out and OpenAI dropped its prices quite a bit.
Right.
And so, because everybody, I mean, not literally everybody, but, you know, 10, 20 different companies all basically have the same idea or are trying the same thing, you have to have a price war.
Nobody has a technical moat.
OpenAI has a user moat.
They have more users, and that's...
That's the most valuable thing they have.
Like, for them, that is the most valuable thing.
I would say the API is close to worthless.
I don't know if worthless is the right word, but it's worth, it's not worth very much.
It's that it's not a unique product.
The thing that they really have,
it's the brand name that is most valuable.
I also think it's the best bot right now.
It might be.
I mean, I think people go back and forth.
Some people some days say it's Claude.
I've been on the Claude train for a long time.
And now you're on the ChatGPT...
And I'm on ChatGPT.
What I think is going to happen is you'll have leapfrogging.
Right.
But the leaps aren't going to be as big as they were.
So four was a huge leap.
I mean, this is a different way of saying what I said before.
It was a huge leap over three.
You know, let's say I can't even keep up with the naming scheme.
GPT-4.1.
Let's say it's better than Grok 3.7, or Claude 3.7, let's just say, hypothetically.
And so people run to this side of the room.
And then, you know, Claude, whatever, 3.8.1 or whatever, will be a little better than some people will run to that side of the room.
But nobody's able to charge that much money because the advances are going to be smaller.
And people start to say, well, you know, I use this one for coding and this one for brainstorming and whatever.
but nobody anymore says this is just like dominant.
Like GPT4 was just dominant.
When it came out, there was nothing as good as it.
For anything, if you wanted this kind of system, you used it, right?
I mean, that's my memory of it.
I don't hear any of the ChatGPT or whatever.
I can't even keep up with the names anymore.
Any of those products, any of the OpenAI products, being referred to
in the same kind of hushed tones, like they're just better.
And like, you know, Google's still in this race. They may undercut on price. Meta's giving stuff away. People are building on it. DeepSeek, I hear, has something new that's going to, you know, be better than ChatGPT. And, you know,
maybe it's true. Maybe it's not. But we were in this era where the differences between
the models are just getting really small. I was, I want to ask you when you're going to admit that
you were wrong about things or if you ever will. Which things? Which things? I think that,
But I also realize that the question doesn't really hit, because I just want to say, we spoke the last time you were here, I think you've been on the show two times, once with Blake Lemoine and once one-on-one.
Yeah.
And, because it's interesting, I think you're one of the most outspoken AI critics.
And you say a lot of the things that we say here on the show, which is that AGI is marketing.
And even if we don't hit AGI, there's still a lot to be concerned about, whether that's the BS that people are talking about, or being able to use these models for, you know, for nefarious purposes by churning out, like, content.
Like, I don't know if you saw, there was this study where this University of Zurich tried to fool people on Reddit, or tried to convince people on Reddit based off answers by a GPT, and it still convinced more people than, than humans, the persuasion study.
I'm aware of it, but I haven't read it yet.
So I guess, like, to me, it does seem like it's kind of tough to be a critic of LLMs right now
because they have been getting so much better.
But I don't know.
Just sort of like...
I mean, people say, Gary, you're wrong.
And I say, well, here are the predictions I actually made.
Like, I've actually reviewed them in print.
And I asked people who say that I'm wrong to, like, point,
what did I say that was wrong?
I think that sometimes people confuse my skepticism with other people's skepticism.
But I think if you look at the things that I have said in print,
they're mostly right.
And, you know, like Tyler Cowen said, you're wrong about everything.
You're always wrong.
And I said, Tyler, can you point to something?
And he said, well, you've written too much.
I can't do it.
Well, I look through some of your stuff.
And I do think that sometimes it seems like you might have put, like, this enormous burden of proof on the AI industry.
Like you do pick out sometimes like everyone that says like AGI is coming this year.
And you're like, these people are liars.
But that being said, like I think your core arguments about scaling.
I've offered to put up money.
I offered Elon Musk a million dollars.
And I offered criteria.
And I'll tell you about that.
In 2022, in May, I offered him a $100,000 bet.
Later, I upped it to a million dollars.
And I put out criteria on Twitter.
I said, I'm going to offer these.
Do these make sense to you?
And everybody on Twitter, not everybody,
nearly everybody on Twitter at the time said those were fine.
People accused me of goalpost shifting.
But my goalposts are the same, right?
With my 2014 paper in the New Yorker, my article in the New Yorker, where I talk about a comprehension challenge,
I've stuck by that.
That is part of my AGI criteria.
I made a bet with Miles Brundage on the same criteria, and he actually took the bet, to his credit.
But when I put them out in 2022, this is the important part.
Everybody was more or less in agreement that those were reasonable criteria.
And I said, if you could beat my comprehension challenge, which is to say, you know, watch movies, know when to laugh, understand what's going on, if you could do the same thing for novels, if you could translate math from English into stuff you could formally verify, if you could go into a random kitchen, you know, teleoperating a robot, and, you know, make a dinner. If you could, what was the other criterion? Oh, you could write, I think it was 10,000 lines of bug-free code. I mean, you could do debugging to get there, whatever, you know. Okay, if you could do like three out of five, we'll call that AGI. And at the time, everybody said
that's fine. Now people are backtracking. Like Tyler Cowen said o3 is AGI. Right. By what measure? I felt that that was kind of a stretch. That was cheesy. And he said, he said the measure was him. It looked like AGI to him. He invoked the, you know, classic line about pornography, I know it when I see it. But people have pointed out lots of problems with o3. I think it's absurd to call o3 AGI. I wouldn't call it AGI. So, you know, you, a minute ago, said, Gary, you're
wrong, but then you ticked off a bunch of things I'm actually right about. I didn't say,
Gary, you're wrong. I said, is there a point where you'll admit you're wrong? Like, what I'm...
Yes, there is. It's the point at which I'm wrong.
So let me clarify one other thing.
But let me just say, I didn't say that you're wrong.
I just said like, what is the point of advance that you would say, okay, I've been wrong
about this stuff?
Because I have listened to some of your...
Let me clarify something.
But I also, right after I said that, I was like, you know, it's kind of like a tough question.
And then I explained where I agreed with you.
Yeah.
Yeah, that's what happened.
So some people take me as saying that,
AI is impossible.
And that's not me, right?
I actually love AI.
I want it to work.
I just want us to take a different approach, right?
I want us to take a neurosymbolic approach where we have some classical elements of classical AI, like explicit knowledge, formal reasoning and so forth, that people like Hinton have kind of thumbed their nose at, but that, say, Demis Hassabis has used very effectively in AlphaFold.
So we can get into that if you want.
If we get to AI, the question about whether I'm right or not depends on how we get there.
So I've made some pretty particular guesses about it. And I have guessed that pure LLMs will not get us there, pure large language models.
So will I concede I'm wrong when we get to AI that actually works?
Depends on how it works.
Okay.
Yeah.
And I think it's clear that, I mean, I don't know, we could watch this back in a couple
years.
If we get there with pure LLMs, if it's another round of scaling that, you know, gets us to AGI by the criteria that I laid out, then I will have to concede that I was wrong.
Okay.
All right.
I'm going to take a quick break and then let's come back and talk a little bit more about the current risks and maybe read some of your tweets and have you expand upon them.
We'll be back right after this.
And we're back here on Big Technology podcast with AI skeptic, Gary Marcus.
Gary, let me ask you this.
So, you know, one of the things we talked about last time you were here was that AI doesn't
have to reach the AGI threshold to be something that we should be concerned about.
Absolutely not.
And a lot of the focus was on hallucinations.
You and I both, I think we have a little bit of a diverging opinion on hallucinations.
I think they've gotten much better.
You'd think it's still a big problem.
Those could both be true, by the way.
That could both be true.
All right.
So let's put a pin in that for now.
I think where I'm seeing the most concern is virology.
We just had a study that came out that showed that AI is now at PhD level in terms of virology.
We had Dan Hendricks from the Center for AI Safety who was here.
We talked about the fact that like AI can now walk
virologists through how to create or enhance the function of viruses.
And we're starting to see some of these AI programs, like you mentioned, Deepseek,
be available to everybody, be pretty smart, and be released without guardrails,
or not enough guardrails, especially if they're open source.
So what are you worried about here?
Is that the core concern or is there other stuff?
I think there's actually multiple worries.
And the different worries from different architectures and architectures used in
different ways and so forth. So dumb AI can be dangerous. So if dumb AI is empowered to control things like
the electrical grid and it makes a bad decision, that's a risk, right? If you put a bad driverless
car system in, you know, a million cars, a lot of people would die, right? The main thing that is
saved a lot of people from dying in driverless cars is there aren't that many of them. And so,
you know, even though they're not actually super safe at the moment, you know, restrict where we use
them and so forth. We don't put them in situations where they wouldn't be very bright.
So dumb AI can cause problems. Super smart AI could, you know, maybe lock us all in cages
if it wanted to. I mean, we have to talk about the likelihood of it wanting to, but there
definitely worries there and we need to take them seriously. And then you have things that are
in between. So, for example, the virology stuff is AI that's not generally all that smart,
But it can do certain things.
And in the hands of bad actors, it can do those things.
And I think it is true either now or will be soon enough that these tools can be used
to help bad actors create viruses that cause problems.
And so I think that's a legitimate worry, even if we don't get to AGI.
So we have dumb AI right now is a problem.
Smarter AI, even if it's not AGI, can cause a different set of problems.
And if we ever got to superintelligence, that might open a different can of worms. I mean, you can think, like, you know, human beings of different
degrees of brightness and with different skills, if they choose to do bad things, can, you know,
cause different kinds of harm. And so what's your view on open source then? I worry about it.
I do worry about it because bad actors are using these things already. They're mostly using them
for misinformation, not sure how much biology they're doing, but they will, and they're going to
be interested in that. You know, state actors that want to do terrorist kinds of things will do
that. I am worried about open sourcing at all, and I think the fact that meta could
make that decision for the whole world is not good. Like, I think there should have been much
more government oversight, scientists should have contributed more of the discussion. But now
those kinds of models are open source. They've been released. We can't put that genie back
in the bottle. And over time, just like people, I should have said this earlier,
even if the models don't get any better, we will still find new uses for them.
And some of those new uses are positive and some of them will be negative, right?
We're still exploring what these technologies can do, and people are finding, you know,
ways to make money in dubious ways and to cause harm for various reasons and so forth.
And so, you know, giving those tools very broadly has problems.
On the other hand, I think what we've learned in the last three years is that the closed companies
are not the ethical actors that they once were.
So, you know, Google famously said don't be evil, and they took that out of their platform.
You know, Microsoft was all about AI ethics.
And then, you know, when Sydney came out, they're like, we're not taking this away.
We're going to stick with it.
Oh, they did kill Sydney, right?
Sydney was this very, I don't know, raunchy AI that tried to steal Kevin Roose's wife.
Yeah, I mean, they reduced what it could do.
But they stuck with it in some sense.
But, you know, and like OpenAI said that we're, you know, a nonprofit for public benefit.
Now they're desperately trying to become a for-profit
that is really not particularly interested in public benefit.
It's interested in money.
And they may become a surveillance company,
which I don't think is...
Because what you're talking about with the advertising side?
So basically they have a lot of private data
because they have a lot of users and people type in all kinds of stuff.
And they may have no choice but to monetize that.
And, you know, they've been showing signs of it.
They hired Nakasone, who used to be at the NSA.
They bought a share in a webcam company.
And they recently announced they're trying to build a social media company.
They want, you know, they look like they're on a path to sell your data, your very private data, to, you know, whoever they care to.
It's concerning because whatever data I gave to Facebook, like I always used to think that this conversation around Facebook data was a little ridiculous because I didn't think I was giving that much information to Facebook.
But I am giving Open AI a lot of information.
I mean, there's a lot of people that treat it as a therapist.
Well, that's the number one use, as therapist companion. I don't use it as a therapist, but I'm, like, putting a lot of my work
information in there. I read a great book called Privacy and Power, I'm blanking slightly on the title, by Carissa Véliz. And she had examples in there. Like, people were taking data from Grindr and extorting people. Right. Grindr is an app for gay people, if you don't know. And, you know, that's still, in our society, like, in some places it's acceptable, in other places not. You know, people don't necessarily want to come out if they're gay or whatever. And so people have been extorting people with
data from Grindr. Imagine what they're going to do. You know, people type into ChatGPT, like, their very specific sexual desires, maybe crimes they've committed, like, people type in a lot
of times. They want to commit. Crimes they want to commit. You know, we have a political climate
where, you know, conspiracy might be treated in a different way than it was. And so just typing it into ChatGPT might, you know, get somebody deported.
Who knows?
Now I'm freaked out.
It's, I wouldn't personally use the system because the writing is on the wall.
And I think that they, they make some promises to their business customers, but not to their, you know, consumer customers.
And that stuff is available for them to do what they want with it.
And they probably will because that's how they're going to make money.
Here's another way to put it. Suppose I'm right about the things I've been arguing, and they can't really get to, you know, the GPT-7-level model that everybody dreamed of. They can't really build AGI.
But they're sitting on this incredible treasure chest of data.
What are they going to do?
Well, if they can't make AGI, they're going to sell that data.
This is why I always thought, like, when you take in a lot of money,
it's always, you always have to pay that money back in some way,
and that changes the way you operate.
That's right.
I mean, look at 23andMe.
They're out of business, and now that data is for sale, and who knows what's going to happen with the 23andMe data.
I hope you're wrong about this one, but the history of the internet is...
I'm not saying you are.
I'm just saying I hope you are, because that would do that.
I hope I'm wrong, too.
But there is a level of...
There's a lot of things I hope I'm wrong about.
Gary, if people got freaked out about what Facebook was doing with your data,
if they overstepped, there's going to be a major societal backlash.
Maybe.
I mean, sometimes people just accommodate to these things.
I've been amazed at how willing people are to give away all that information to Facebook.
I don't use it anymore, but...
Let me ask you this.
You quote tweeted one of these...
So we'll get into a tweet here.
You quote tweeted one of these tweets:
is the push to optimize AI for user engagement
just metric-chasing Silicon Valley brain,
or an actual pivot in business model
from create a post-scarcity society God
to create a worse TikTok?
This is what basically we're talking about,
is that that might be the pivot.
Yeah, that's right.
I think that was someone else's tweet that I quote...
Yeah, Daniel Litt, and you said,
I've been basically telling you about this.
Yeah, exactly.
So that's what it is.
You also wrote this, saying the quiet part out loud,
the business model of GenAI will be surveillance and hyper-targeted ads,
just like it has been for social media.
We were just talking about that.
And what I was quote tweeting there
was something from Aravind Srinivas, if I pronounce his name correctly,
who's the CEO of Perplexity.
And he basically, I said he's saying the quiet part out loud.
He basically said, we're going to use this stuff
to hyper-target ads.
You also said that companies like Johnson and Johnson
will finally realize that GenAI was not going to deliver on its promises.
Have there been companies that have pulled back?
Are you just using Johnson and Johnson as an example?
That was based on a Wall Street Journal thing, and I may have failed to include the link because of Elon Musk's crazy notions around links.
Elon, you've got to put the links in the...
Elon, you've got to put the links in Twitter.
This is unacceptable, yes.
That's right.
So, anyway, that was...
I was alluding to a Wall Street Journal report that had just come out, which showed that J&J basically said, in so many words, I'll paraphrase it, they tried GenAI in a lot of different things, generative AI, and a few of them worked, and a lot of them
didn't, and they were going to, like, stick to the ones that did, like, customer service
and maybe not do some of the others. You have to go back, you know, a year and a half in
history to when people thought GenAI was going to do everything that an employee was
able to do, basically. And I think what J&J and a bunch of companies have found out is that it's
not really true. You know, they can do a bunch of things that employees do, but they can't
typically do everything that a single employee does. And, you know, they're reasonably good at
triaging customer service. And they're not necessarily good at creating, say, a careful financial projection. Okay. So, Gary, we have like five minutes left. I want, you said something in the,
I think in the first half about the path that you think needs to be taken to AGI. Can you explain what
that is in, like, as basic of a way as you can, to, like, you know, make it as simple to understand for anyone who's not caught up with the systems that you spoke about?
Sure. So a lot of people will have read Danny Kahneman's book Thinking, Fast and Slow, and there he talked about System 1 and System 2 cognition. So System 1 was fast and automatic, reflexive. System 2 was more deliberate, more like reasoning. I would argue that the neural networks that power generative AI are basically like System 1 cognition.
They're fast, they're automatic, they're statistically driven, but they're also error-prone.
They're not really deliberative.
They can't sanity check their own work.
And I would say we've done that pretty well, but System 2 is more like classical AI,
where you can explicitly represent knowledge, reason over it.
It looks more like computer programming.
And these two schools have both been around since the 1940s, but they've been very separate
for what I think is sociological and economic reasons.
Either you work on one or you work on the other,
people argue or fight for graduate students
and fight for grants and stuff like that.
So there's been a great deal of hostility between the two.
But the reality is they kind of complement each other.
Neither of them has worked on its own.
So the classical AI failed, right?
People build all these expert systems,
but there were always these exceptions
and they weren't really robust.
You'd pay graduate students to patch up the exceptions.
Now we have these new systems.
They're not really robust either, which is why OpenAI is paying Kenyans and PhD students and so forth to kind of fix the errors.
The advantage of System 1 is it learns very well from data.
The disadvantage is it's not very accurate. Sorry, not very abstract.
So, I should have said that slightly differently.
The large language models and that kind of approach, transformers, are very good at learning, but they're not very good at abstraction.
You can give them billions of examples, and they still never really understand what multiplication is.
And they certainly never get any other abstract concept well.
The classical approach is great at things like multiplication.
You write a calculator, and it never makes a mistake.
But it doesn't have the same broad coverage, and it can't learn new things.
You can wire multiplication in, but how do you learn something new?
The classical approaches have had trouble with that.
And so I think we need to bring them together.
And this is what I call neurosymbolic AI.
And it's really what I've been lobbying for for decades.
And I think it was hard to raise money to do that in the last few years because everybody was
obsessed with generative AI.
But now that they're seeing the diminishing returns, I think investors are more open to trying
alternatives.
And also AlphaFold is actually a neurosymbolic model.
And it's probably the best thing that AI ever did.
And so decoding proteins, protein folding.
Yeah, figuring out the three-dimensional structure of a protein from a list of its nucleotides.
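As a toy illustration of the hybrid being described, here is a minimal neurosymbolic sketch: a statistical, System 1-style front end that defers to a symbolic, System 2-style solver whenever a question falls in a closed, verifiable domain. The function names are hypothetical, not any production architecture:

```python
import re

def neural_guess(question: str) -> str:
    """Stand-in for a System 1 component: fast, pattern-matching, fallible.
    (A real system would call a large language model here.)"""
    return "It is probably around 400."  # plausible-sounding, unverified

def symbolic_solver(question: str):
    """Stand-in for a System 2 component: explicit rules that never guess."""
    match = re.search(r"what is (\d+) times (\d+)", question.lower())
    if match:
        a, b = int(match.group(1)), int(match.group(2))
        return str(a * b)  # exact, verifiable answer
    return None  # outside the symbolic component's domain

def answer(question: str) -> str:
    """Route to the symbolic solver for closed, verifiable questions;
    otherwise fall back to the statistical guess."""
    exact = symbolic_solver(question)
    return exact if exact is not None else neural_guess(question)

if __name__ == "__main__":
    print(answer("What is 37 times 11?"))    # symbolic path: 407
    print(answer("Summarize this podcast"))  # neural path: unverified guess
```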
Are you going to raise money to try to do this?
I'm very interested in that. Let's put it that way.
Masa Son, if you want to make use of your money.
No, I'm kidding.
You talking to him?
Not at this particular moment.
Okay. Masa, if you're watching.
I don't know. Trying to help.
Okay, great. Well, Gary, can you shout out where to find your substack?
So if anybody wants to read your longer work on the state of AI, where should they go?
Sure. So people might want to read my last two books, by the way, Taming Silicon Valley,
which is really about how to regulate AI. And rebooting AI, which was 2019, is a little bit old,
but still I think anticipates a lot of the problems around common sense and world models that we're still facing today.
And then for kind of almost daily updates, I write a Substack, which is free, although you can pay if you like to support me. And that's at garymarcus.substack.com.
Okay, well, I'm a subscriber, Gary. Great to have you on the program. Thanks so much
for coming. Thanks a lot for having me again, yet again. Yet again. Well, we'll keep doing it.
It's always nice to hear your perspective on the world of AI. So I always enjoy our conversations.
Thanks for having me. Yes, same here. All right, everybody, thank you for listening. We'll be back
on Friday breaking down the week's news. Until then, we'll see you next time on Big Technology Podcast.