Big Technology Podcast - Is AI Scaling Dead? — With Gary Marcus
Episode Date: May 7, 2025
Gary Marcus is a cognitive scientist, author, and longtime AI skeptic. Marcus joins Big Technology to discuss whether large‑language‑model scaling is running into a wall. Tune in to hear a frank debate on the limits of “just add GPUs" and what that means for the next wave of AI. We also cover data‑privacy fallout from ad‑driven assistants, open‑source bio‑risk fears, and the quest for interpretability. Hit play for a reality check on AI’s future — and the insight you need to follow where the industry heads next.
---
Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice. Want a discount for Big Technology on Substack? Here’s 25% off for the first year: https://www.bigtechnology.com/subscribe?coupon=0843016b
Questions? Feedback? Write to: bigtechnologypodcast@gmail.com
Transcript
Is the AI field reaching the limits of improving models by scaling them up?
And what happens if bigger no longer means better?
That's coming up with AI critic Gary Marcus right after this.
Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation of the tech world and beyond.
We're joined today by AI critic Gary Marcus, the author of the book Rebooting AI and Marcus on AI on Substack,
and he's here to speak with us about whether the AI industry is hitting the limits,
of scaling generative AI models up
and what it means if we're truly seeing
diminishing returns from making these models bigger.
Gary, it's great to see you.
Welcome to the show.
Thanks for having me.
So the genesis of this episode is that I did an episode
with Mark Chen from OpenAI about GPT 4.5
and you come into my DMs and you say,
listen, I want to give a rebuttal.
Scaling is basically over
and it's not exactly what Open AI has said.
Now, for those who don't know about the scaling laws,
Basically, the idea is that the more compute and data you put into these large language models, the better they're going to get, basically predictably, linearly.
Well, exponentially was the idea.
Right.
And so the context here is now we've seen almost every research house all but admit that that has hit the point of diminishing returns.
I think Mustafa Suleyman was here.
He pretty much admitted it.
Thomas Kurian, the CEO of Google Cloud, said that diminishing returns are happening.
Yann LeCun has also talked about the fact that you're just not going to see as many returns
from AI scaling as you did before.
So just describe the context of what we're seeing right now, how big of a deal is it?
And then what are the implications for the AI industry?
Because this is the big question.
I mean, how much better can these things get, right?
That is the big question with AI today.
Well, I mean, I have to laugh, because I wrote a paper in 2022 called Deep Learning Is Hitting a Wall.
And the whole point of that paper is that scaling was going to run out, that we were going to hit diminishing returns. And everybody in the field went after me, a lot of the people you mentioned. I mean, LeCun did, Elon Musk went after me by name, Altman did. They all, like, Altman said, give me the strength of a mediocre deep learning skeptic. So people were really pissed when I said that deep learning was going to run out. So it's amazing to me that a bunch of people have conceded that these scaling laws are not working the way they used to, and they're also doing a bit of backpedaling.
I think that Mark Chen interview, I can't quite remember the details, but I think it was a version of backpedaling and redefining things.
So if you go back to 2022, there were these papers by Jared Kaplan and others at OpenAI.
And they said, look, we can just mathematically predict how good a model is going to be from how much data there is.
And then there were the so-called Chinchilla scaling laws.
And everybody was super excited.
And basically, people invested half a trillion dollars assuming that these things were true.
You know, they made arguments to their investors or whatever.
They said, if we put in this much data, we're going to get here.
And they all thought that here in particular was going to mean AGI eventually.
And what happened last year is everybody was disappointed by their results.
So we got one more iteration of scaling after 2022 that worked really well, and we call that GPT-4 and all of these models that are sort of like that. So I wrote that paper around GPT-3, and we got another iteration of scaling. So GPT-3 was scaling compared to GPT-2, and it was much better. GPT-2 was scaling compared to GPT-1, and it was much better. So much better meant, sorry, much more data meant much better. But what is much better? Well, I mean, one way to think about it is you didn't need a magnifying glass to see the difference between GPT-2 and, well, we didn't call it GPT-1, but the original GPT. And you didn't need a magnifying glass for GPT-4 as opposed to GPT-3. It was just obviously better. A lot of people thought that we would pretty quickly see GPT-5, and a lot of people raced to build it. So OpenAI tried to build GPT-5, and they had a thing called Project Orion, and it actually failed and eventually got released as GPT-4.5. So what they thought was going to be GPT-5 just didn't meet expectations. Now, they could slap any name on any model they want, and in fact lately nobody understands how they're naming their models. But they haven't felt like any of the models that they've worked on since GPT-4 actually deserve the name GPT-5. And it didn't meet the performance that
these so-called mathematical laws required. And what I said in that paper is they're not really
mathematical laws. They're not physical laws of the universe like gravity. They're just generalizations
that held for a little while. Like a baby may double in weight every couple of months early in its life. That doesn't mean that by the time you're 18 years old you're going to be 30,000
pounds. And so we had this doubling for a while, and then it stopped, and we can talk about
why. But the reality is it's not really operative anymore. So there's been efforts to kind of
misdirect and shift direction. So I think everybody in the industry quietly or otherwise
acknowledged that, hey, we're not getting the returns that we thought anymore. And nobody's
been able to build a so-called GPT-5-level model. That's a big deal, right? I'm a scientist, or I was originally a scientist. And as a scientist, we have to pay attention to negative results as well as positive results. So when 30 people try the same experiment and it doesn't work, nature is telling you something. And everybody tried the experiment of building models that would be 10x the size of GPT-4, hoping to get to something they could call GPT-5, something that would be a quantum leap better than GPT-4. They didn't get there. So now they're talking about scaling inference time compute. That's a different thing.
Before we get there, I just want to talk to you about, I want to test your theory here. So it's not that scaling is over, right? I don't think anyone that we're talking about says scaling is over. Basically, what they're saying is if you want to make
the model better and I think that means more intelligent, more conversational, even more personable,
you can still do it by scaling. I think what they admit, the thing that they admit,
though, is that it takes much more compute and much more data to get the same results that you
would in the previous loops. So let's clarify two things. One is that what people talked about with scaling originally was a mathematically predictable relationship between performance and amount of data. You can go back and look at the Chinchilla paper, the Jared Kaplan paper, and lots of things that were posted on the internet. There were papers saying, or t-shirts saying, scale is all you need. You looked at that t-shirt, and it had equations from the Jared Kaplan paper, and it said, you know, here's the
exponent, you can fit the equation. If you have this much data, this is the performance you're
going to get. And there were a bunch of papers, a bunch of models that actually seemed to fit
that curve, but it was an exponential curve. And what's happening now is, yeah, you add more data,
you get a little bit better, but you're not fitting that curve anymore. We've fallen off the
curve. That's what it really means to say that scaling isn't working anymore. You know, if I drew a curve for you, it was going up and up and up really fast, and now it's not going up as a function of how much data you had, or how much compute you had.
So we added a bunch of compute, and you got this much better performance.
And this is how people justified running these experiments that cost a billion dollars,
is they're like, I know what I'm going to get for the billion dollars.
And then they ran the billion dollar experiments, and they didn't get what they thought
they would.
Yeah, you get a little bit better, but that's what diminishing returns means.
Diminishing returns means you're not getting the same bang for your buck as you used to.
That's where we are now.
So anytime you add a little piece of data, the model is going to do better on that piece of data.
But the question is, does it generalize and give you significant gains across the board?
And we were seeing that, and we just aren't anymore.
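For reference, the scaling fits being discussed here (the Jared Kaplan paper, the Chinchilla paper) are parametric power-law curves. A sketch of the Chinchilla-style functional form, with the fitted constants left abstract, looks roughly like this:

```latex
% Chinchilla-style parametric loss fit (a sketch; the constants E, A, B, \alpha, \beta
% are fitted to experiments and omitted here).
% N = number of model parameters, D = number of training tokens.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% "Falling off the curve" means newer, bigger runs stop tracking the loss this
% fit predicts as N and D keep growing.
```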
So is there still a path for these models to become much more performant?
I mean, let's say you do supersize these clusters to the point that they are insanely bigger than they were previously.
Let's talk about, like, Elon Musk's one million GPU cluster.
Well, and let's look at what Elon got for his money, right?
So he built Grok 3, and by his own testimony, it was 10 times the size of Grok 2.
It's a little better, but it's not night and day, right?
Grok 2 was night and day better than the original Grok.
GPT-4 was night and day better than GPT-3.
GPT-3 was night and day better than GPT-2.
Grok 3 is like, yeah, you can measure it, you can see that there's some performance gain.
But for 10x the investment of data and compute, not to mention the cost of energy to the environment,
it's not 10 times smarter by any reasonable measure.
It just isn't.
Okay.
And so this would be the point where I say, well, then this entire AI moment is done.
However...
Well, that's this moment.
There will be other AI moments, but this one...
I'm setting it up to say that it's not.
Not because, like you mentioned, you're talking about test time compute.
That's another way to say reasoning, I think, which is these models.
Well, I'm going to give you a hard time about that.
But I mean, people do do that.
But with reasoning or test time compute, you'll help me figure out the finer details.
What these models are doing is they're coming to try to find an answer and they're checking their progress
and deciding whether it's a good step or not and then taking another step and another step.
And we've seen that they have been able to perform much better when you put those reasoning capabilities on top of these large models, which has enabled these
research houses to continue the progress in some way.
I mean, let me give you, but it's not really you, it's these companies, some pushback on that.
So it is true that you can build a model that will do better if you put more compute on it.
But it's only true to some degree.
So I won't get into whether it's actually reasoning or not. But it turns out that on some problems, you can generate a lot of data in advance. And for those problems, adding more test time compute seems
helpful. There was a paper this weekend that's calling some of this into question.
By the way, just to explain to folks, test time is when the model is giving an answer.
That's what test time is. So you have these models now, like o3 and o4, that will sometimes
take like 30 seconds or five minutes or whatever to answer a question. And sometimes it's
absurd, because you ask it, like, what's 37 times 11, and it takes, you know, 30 seconds, and you're like, my calculator could have done it faster. But we'll put aside that absurdity. In some cases it seems like time well spent, sometimes not. But if you look carefully, the best results for these models are almost always on the same things, which are math and programming. And so when you look at math and programming, you're looking at domains where it's possible to generate what we call synthetic data, and to generate synthetic data that you know is correct. So for example, on multiplication,
you can train the model on a bunch of multiplication problems and you can figure out the answer
in advance. You can train the model on what it is supposed to predict. And so on these problems, in what I would
call closed domains where we can do verification as we create the synthetic data, we can verify that
the answer we're teaching the model is correct. The models do better. But if you go back and you look
at the o3, sorry, the o1 paper, even then you could already see that the gains were there but not across the board. They reported that on some problems, o1 was not better than GPT-4. It's only on other problems, these cut-and-dried problems with the synthetic data, that you actually got better performance. And I've now seen like 10 models and it always seems to be that way. We're still waiting
for all the empirical data to come in, but it looks to me like it's a narrow trick that works in some
cases. The amazing thing about GPT-4 is that it was just better than GPT-3 on almost anything you could imagine. And GPT-3, the amazing thing is it was better than GPT-2 on almost anything you can imagine. Models like o1 are not systematically better than GPT-4. They're better in certain use
cases, especially ones where you can create data in advance. Now, the reason I wouldn't call them
reasoning models, though you're right that many people do, is what I think they're doing
is basically copying patterns of human reasoning. They're getting data about how humans reason
certain things. But the depth of reasoning there is not that great. They still make lots of
stupid mistakes all the time. I don't think that they have the abstractions that we think,
for example, a logician has when they're reasoning. So it has the appearance of reasoning, but it's really just mimicry. And there are limits to how far that mimicry goes. I'll give you just one more example, which is that o3 apparently hallucinates more than the models that came before it.
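To make the closed-domain point concrete, here is a minimal sketch of what verifiable synthetic data generation looks like for multiplication, the kind of domain where every training example can be checked at generation time; the function and field names are illustrative, not taken from any lab's pipeline:

```python
import random

def make_multiplication_examples(n_examples: int, max_value: int = 999):
    """Generate training pairs for a closed domain with a built-in verifier.

    Because multiplication has a ground-truth checker, every synthetic example
    emitted here is guaranteed correct, which is the property being described
    for math and programming domains.
    """
    examples = []
    for _ in range(n_examples):
        a, b = random.randint(2, max_value), random.randint(2, max_value)
        prompt = f"What is {a} times {b}?"
        answer = str(a * b)  # the generator doubles as the verifier
        examples.append({"prompt": prompt, "answer": answer})
    return examples

# Open-ended tasks (essays, research summaries) have no such checker, which is
# why gains from this kind of training don't automatically spread across the board.
if __name__ == "__main__":
    for example in make_multiplication_examples(3):
        print(example)
```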
Which is stunning. How does that happen?
I mean, that's a good, broader question, which is, our understanding of these models is still remarkably limited. So the technical term, or one technical term. Interpretability. Well, I was going to give you a different one, which is black box. Okay. But they're closely related, those two terms. You need interpretability to figure out what's going on in the black box. If you can at all. I mean,
I'd almost put it another way, which is that black box, the thing in the plane that tells you what actually happened.
Well, that's a different thing, right? So a black box in a plane is actually a flight recorder that records a lot of data.
But what we mean in machine learning by black box is, you have a model where you have the inputs and you have the outputs.
You know how you calculate them, but you don't really understand how the system gets there.
So in this case, you're doing all this matrix multiplication.
Nobody really understands it.
And so nobody can actually give you a straightforward answer for why o3 hallucinates more than GPT-4.
We can just observe it.
That's what happens with black boxes, is you empirically observe things, and you say,
well, it does that, but you don't really know why, and you don't really know how to fix it either.
Another example, just in the last couple days, is apparently Sam Altman reported, I forget,
the new model is stubborn, or what was it, I forget?
No, it's not stubborn, it's a bro.
It's a bro.
But that's GPT-4o.
It's just like, it became very fratty.
And like, you would be like, what's going on, like, help me with this.
And it's like, yo, that's a hell of a good question, bro.
And they're like, we don't know why this happened.
And they rolled it back completely.
Yeah, exactly.
Or I thought they had partly rolled it back or whatever.
No, no.
Sam said it's now the latest iteration's been completely rolled back.
So, right, that was what I would call again empirical.
Like, they tried it out.
And it didn't work, or it worked in a way that irritated people, right?
And so we don't know in advance.
Like, there's a lot of just, like, try it, because that's how black boxes work.
And we have some things, but those things are not very strong.
So the scaling, quote, laws were empirical guesses about how these models work.
And they were true for a little while, which was amazing.
And they're not true anymore, which is also amazing in a way.
So we don't know what's going to happen from the black boxes.
Right.
Okay.
So let me now sort of.
And sorry, let me come back to one other thing quick, which is interpretability.
So that's a very closely related notion.
So let's say you look at a GPS navigation system.
That's a piece of AI that's very interpretable.
So you can say it is plotting this route.
It says, you know, you can go this way, you can go that way.
This is the function that it's maximizing.
This is the database it's using.
This is how it looks up the data.
We don't have any of that in these so-called black box models.
We don't really know what the database is that it's consulting.
It isn't exactly consulting a database at all.
And we don't know how to fix it.
And so, you know, Dario Amodei, who's the CEO of Anthropic.
We just talked about this on the show.
You actually praised his interpretability post.
That's right.
For interpretability.
I'll be honest.
I haven't read the paper yet.
I just read the title, so bad on me.
But the title of his paper was something like on the desperate need for interpretability.
That captures it.
And I think he's right.
I've said this too myself.
Like in my last book, I talked about interpretability being really important.
The only difference between Dario and me on this point is, we both think that we're screwed as a society if we stick with uninterpretable models.
He just thinks that LLMs will eventually be interpretable.
And his company, to be fair,
has done the best work on interpretability of LLMs that I'm aware of.
Chris Olah, I think, is brilliant.
But they haven't got that far.
They've gotten further than anybody else,
but I don't think we're ever going to get very far into the black box.
And so I think we need to start over
and find different approaches to AI altogether.
Right.
So, Gary, if I'm listening to what you're saying on this show so far,
it is basically after GPT4, we haven't made a lot of progress.
However, but let me just do the pushback here, which is, I mean, if you think about what
it's like using these models after GPT4, they are significantly better.
I'll give you one example.
I was using o3, this new reasoning model or test time model, whatever you want to call it.
And I just, I'm in it, and I'm doing crazy things, and it's exceptionally helpful.
So I put a photo of myself on a rock climbing wall and said, what's going on. And it, like, was able to look at the form, where my body was, what my posture was, and, like, analyze all these things and give actually helpful coaching tips, which you never would have had with GPT-4. Then you think about what Claude is doing,
the Anthropic bot. I was with some friends last night and this is what we do for fun.
I vibe coded a retirement calculator directly in Claude. It took like 10 minutes. We went from,
we took a bank statement. We got a line graph of the person's balances,
bar graph of their expenses, financial plan, and then we coded a retirement calculator based
off of the data that we had there.
And then you also have PhDs that are now adding their unique insights into these models
for training.
They just basically are sitting and writing down what they know and the model is absorbing it.
So we are seeing, I would call it, vast improvement over the GPT4 models.
So, I mean, there's a couple different ways to think about that.
So one is on a lot of benchmarks, there is improvements, but there's also issues of data contamination.
Alex Reisner wrote an excellent piece in The Atlantic about the issues of data contamination.
And we've seen a lot of studies where people are like, well, we tried it at my company, and it's not really that much better.
So they're better on the benchmarks.
Are they better in general?
Not so clear.
There was a new benchmark released by a company called Vals AI or something like that, that the Washington Post talked about yesterday, where they looked at things like, can you pull out a chart based on a series of financial statements, SEC statements, from a bunch of companies? And these systems all claimed to do it, but accuracy was under 10%. And overall, on this new benchmark, accuracy was at 50%. Were these new models better than GPT-4? Maybe, but they
weren't that good. So I think people tend to notice when they do well. They don't notice as much
when they do poorly. And although I think there's been some improvement, there has not been the
quantum leap that people are expecting. We have not moved past hallucinations. We have not moved past
stupid reasoning errors. If you go back to my 2022 paper, Deep Learning Is Hitting a Wall,
I didn't say there'd be no progress at all. What I said is we're going to have problems with
hallucinations. We're going to have problems with reasoning, planning until we have a different
architecture in some sense. And I think that that's still true. We're still stuck on the same kinds of things. So if you have your deep research write a paper, it's going to make up references. Okay. It's probably going to make up numbers. Like, you know, did you actually go back and check? So for example, what I think it's called, they all have similar names now, whatever Grok's version is, deep search, deep research.
Yeah, some, I don't know. Deep research mini 06. I won't be convinced that we have AGI
until these companies learn how to call deep research something other than deep research.
They all use the same exact name. It's really bizarre. So whichever version Grok has, I asked it, for example, to list all of the major cities that were west of Denver, and to somebody who wasn't paying attention it would be super impressive. But because I really wanted to know how well it was working, I checked, and it left out Billings, Montana, right? So you get a list that looks really good, and then there are errors. This often happens. And then I had a crazy conversation with it after that. I said, what happened to Billings? And it said, well, there was an earthquake there on February 10th or whatever. And I looked it up in the, you know, the seismological data. I used Google because I wanted to have a real source, or DuckDuckGo. And there was no earthquake then. And I pushed it on it and it said, well, I'm sorry
for the error or whatever. So we're still seeing those kinds of things. We may see them less,
but they are still there. We still have those kinds of problems. So I don't doubt that there's
been some improvement, but the quantum across the board that people were hoping for is not there.
The reliability is still not there. And there's still lots of subtle errors that people don't
notice. And then, you know, if you want to talk to me about retirement calculators, there are a lot of
those on the web. So the easy cases for these systems are the ones where the source code is actually
already there on the web. Like Kevin Roose talked about this example where he, quote, vibe coded a system to look in a refrigerator and tell him what recipe to make. But it turns out that app is already there on the web, and there are demos of that with source code. And so, like, if you ask
a system to do something that's already been done, that's always been true with all of these
systems.
That's their sweet spot, is regurgitation.
And so, yeah, they can build the stuff that's out there.
But if you want to code things in the real world, you usually want to code something
that's new.
And these systems have a lot of problems with that.
Another recent study, excuse me, showed that they're good at coding, but they're not good
at debugging.
And, like, coding is just the tiniest part of the battle, right?
The real battle is debugging things and maintaining the code over time.
And these systems don't really do that yet.
But, you know, search has made them more reliable.
When these bots are able to search the web and they are now starting to give you lots of links in the actual answers.
I still like get daily people sending me examples of, you know, it hallucinated these references.
I'm not saying hallucinations have been solved.
But for me, like, I will use it.
It's an incredible research assistant.
And then when it links out to things and I'm not sure of those figures, I'll then go to the primary sources and start reading.
I mean, good on you that you go to the primary source.
I worry the most about people who don't.
And we've seen countless lawyers, for example, get in trouble using these systems.
Has it been countless?
I just heard of one.
Oh, no, no, no.
There's many more than that.
There's some in the U.S., there's some in Canada.
I think there was just one in Europe.
I mean, it's not really countless, one could sit there and count them, but it's got to be at least a dozen by now.
And whether this is going to be, all right, I think we can both agree on this,
that whether this is the end of progress or towards the end of progress or whether there's a lot more progress,
there's a real problem of people outsourcing their thinking to these bots.
Well, Microsoft did a study, in fact, suggesting that critical thinking was getting worse as a function of them.
And that wouldn't be too surprising.
We have a whole generation of kids who basically rely on these bots and who don't really know how to look at them critically.
You know, in previous years, we were starting to get too many kids relying on whatever garbage they found on the web, basically.
And I mean, chatbots are basically synthesizing the garbage that they find on the web.
And so we're not really teaching kids critical thinking skills.
And nowadays, like, the idea for many kids of writing a term paper is, I typed in a prompt in ChatGPT and then maybe I made a couple edits and I turn it in.
You're obviously not learning how to actually think or write in that fashion.
A lot of these tools, I think, are best used in the hands of sophisticated people who understand their limits.
So, you know, coding has actually been, I think, one of the biggest applications. And that's because coders understand how to debug code. And so they can take
the system. Basically, it's just typing for them and looking stuff up. And if it doesn't work,
then they can fix it, right? The really dangerous applications are like when somebody asks for
medical advice and they can't debug it themselves and, you know, something goes wrong.
Okay. So I'm going to take into consideration all the things that you've said so far and see if I can
get a sense as to where you think we're heading. It seems like there was a
push to just make these models better based off of scale. That could be things like the 300,000 GPU cluster I think Meta used for Llama 4, or it could be the million-GPU cluster that Elon's built for Grok. And what you're saying is that's been maxed out pretty much.
Like no one's, hold on. I'll be more careful. It's not maxed out, but it's just diminishing returns.
There's diminishing returns. So the point that I'm trying to make here is you don't believe that
there's going to be anyone that's going to build a bigger GPU data center than that,
because if you're seeing diminishing returns from something that costs billions of dollars,
it doesn't make sense to invest.
Well, wait a second.
I'm not saying people are rational.
I think that people will probably try at least one more time.
They'll build things, you know, probably Elon will build something that's 10 times the size of Grok 3, which will be huge, and it will, you know, it will have a serious impact on the environment and so forth.
I just don't...
It's not just GPUs, it's also data, right? Like how much more data is there?
Let's come to the data separately in a second. So I think people will actually try. Right. I think Masa has just bankrolled Sam to try. I just don't think they're going to get that much for it. I don't think they'll
get zero. I mean, there will be tangibly better performance on certain benchmarks and so forth.
But I don't think that it's going to be wildly impressive. And I don't think it's going to knock
down the problems of hallucinations, bone-headed errors. So here's what I'm getting at. That's not going
to feel much better than what we have today. It doesn't seem like you believe that reasoning is
going to make the bot feel much better than we have today.
Not the kind of reasoning we're doing.
There's no emergent coding.
So are you basically saying that what we have in AI today, this is it?
For a while.
For a while, I guess.
I mean, look, I put out some predictions last year in March, that people can look up, that I had on Twitter. And those predictions included, I said there'd be no GPT-5 this year, or if it came out,
it would be disappointing.
It's supposed to come in summer.
Well, this was last year.
So I said in 2024, we won't see this.
And that was a very contrarian prediction at that point, right?
This was a few weeks after people had said, oh, I bet GPT-5 is going to drop at the Super Bowl, like right after the Super Bowl.
Won't that be amazing?
So people really thought it was going to come last year if you go back and look at, you know, what they said on Twitter, et cetera.
And it didn't.
And I correctly anticipated that it wouldn't.
And I said, we're going to have a kind of pile up where we're going to have a lot of similar models from a lot of companies.
I think I said seven to ten, which was sort of roughly right.
And I said we were going to have no moat because everybody is doing the same thing.
And the prices were going to go down, we were going to have a price war.
All of that stuff happened.
Now, maybe we get to so-called GPT-5 level this year, keeps getting pushed back.
I don't know if we'll get much further than that without some kind of genuine innovation.
And I think genuine innovation will come.
But what I think is we're going down the wrong path.
Yann LeCun used this notion of, you know, we're on the exit ramp. How does he say it, large language models are the off-ramp to AGI? You know, they're not really
the right path to AGI. And I agree with him. Or you could argue he agrees with me because I said
it, you know, for years before he did, but we won't go there. The broader notion is sometimes
we make mistakes in science. I think one of the most interesting ones was people thought
the genes were made of protein for a long time. So in the early 20th century, lots of people tried to figure out what protein a gene is made of. It turns out, it's not made of a protein. It's made of a nucleic acid that everybody now knows, called DNA.
So people spent 15 years or 20 years, like really looking at the wrong hypothesis.
I think that giant black box LLMs are the wrong hypothesis.
But science is self-correcting.
In the end, if people put another $300 billion into this and it doesn't get the results they want,
They'll eventually do something different.
Right.
But what you're forecasting is basically an enormous financial collapse because...
That's right. I don't think LLMs will disappear. I think they're useful, but the valuations don't make sense.
I mean, I don't see Open AI being worth $300 billion. And you have to remember that venture capitalists have to like 10x to be happy or whatever.
Like, I don't see them, you know, IPOing at $3 trillion. I just don't.
No, it's interesting because I almost see the Open AI valuation as the one that makes the most sense because they have a consumer app.
The place that I start to get, if what you're saying is correct that we're not going to see any more, if we're seeing real diminishing results from scaling and this is basically where we are, then there's real worry for companies like NVIDIA, which has basically risen on the idea of scaling.
I mean, they're down a third, a third this year or something.
Two point something, two point five trillion last.
They're a genuinely good company.
They have a wonderful ecosystem.
They're worth a lot of money.
I mean, I don't want to put in an exact figure, but I'm not surprised that they fell,
and I'm not surprised that they're still worth a lot.
No, but this is a thing.
If we end up seeing the fact that this next iteration, the $10 billion that Sam is going to spend seemingly on the next set of GPUs,
if that doesn't produce serious results, that's going to hurt.
That will cause a crash in Nvidia because so much of the company's demand is coming based up with this idea that scaling is going to work.
So they have multiple problems, both OpenAI and Nvidia.
So one is it does look to me like we're hitting diminishing returns.
It does not look to me like this inference time compute trick is really a general solution.
It doesn't look like hallucinations are going away.
And it does look like everybody has the same magic formula.
So everybody is basically doing the same thing.
They're building bigger and bigger LLMs.
And what happens when everybody's doing the same thing?
You get a price war.
So DeepSeek came out and OpenAI dropped its prices quite a bit.
Right.
And so, because everybody, I mean, not literally everybody, but, you know, 10, 20 different companies all basically have the same idea or are trying the same thing, you have to have a price war.
Nobody has a technical moat.
OpenAI has a user moat.
They have more users, and that's...
That's the most valuable thing they have.
Like, for them, that is the most valuable thing.
I would say the API is close to worthless.
I don't know if worthless is the right word, but it's worth, it's not worth very much.
It's that it's not a unique product.
The thing that they really have,
it's the brand name that is most valuable.
I also think it's the best bot right now.
It might be.
I mean, I think people go back and forth.
Some people some days say it's Claude.
I've been on the Claude train for a long time.
And now you're on the ChatGPT...
And I'm on ChatGPT.
What I think is going to happen is you'll have leapfrogging.
Right.
But the leaps aren't going to be as big as they were.
So four was a huge leap.
I mean, this is a different way of saying what I said before.
It was a huge leap over three.
You know, let's say I can't even keep up with the naming scheme.
GPT-4.1.
Let's say it's better than Grok 3.7, or Claude 3.7, let's just say, hypothetically.
And so people run to this side of the room.
And then, you know, Claude, whatever, 3.8.1 or whatever, will be a little better than some people will run to that side of the room.
But nobody's able to charge that much money because the advances are going to be smaller.
And people start to say, well, you know, I use this one for coding and this one for brainstorming and whatever.
but nobody anymore says this is just like dominant.
Like GPT4 was just dominant.
When it came out, there was nothing as good as it.
For anything, if you wanted this kind of system, you used it, right?
I mean, that's my memory of it.
I don't hear any of the ChatGPT or whatever.
I can't even keep up with the names anymore.
Any of those products, any of the OpenAI products, being referred to
in the same kind of hushed tones, like they're just better.
And like, you know, Google's still in this race. They may undercut on price. Meta's giving stuff away. People are building on it. DeepSeek, I hear, has something new that's going to, you know, be better than ChatGPT. And, you know,
maybe it's true. Maybe it's not. But we were in this era where the differences between
the models are just getting really small. I was, I want to ask you when you're going to admit that
you were wrong about things or if you ever will. Which things? Which things? I think that,
But I also realize that the question doesn't really hit, because I just want to say, we spoke the last time you were here, I think you've been on the show two times, once with Blake Lemoine and once one-on-one.
Yeah.
And, because it's interesting, I think you're one of the most outspoken AI critics.
And you say a lot of the things that we say here on the show, which is that AGI is marketing.
And even if we don't hit AGI, there's still a lot to be concerned about, whether that's the BS that people are talking about, or being able to use these models for, you know, for nefarious purposes by churning out, like, content.
Like, I don't know if you saw, there was this study where this University of Zurich tried to fool people on Reddit, or tried to convince people on Reddit based off answers by a GPT, and it still convinced more people than, than humans, the persuasion study.
I'm aware of it, but I haven't read it yet.
So I guess, like, to me, it does seem like it's kind of tough to be a critic of LLMs right now
because they have been getting so much better.
But I don't know.
Just sort of like...
I mean, people say, Gary, you're wrong.
And I say, well, here are the predictions I actually made.
Like, I've actually reviewed them in print.
And I asked people who say that I'm wrong to, like, point,
what did I say that was wrong?
I think that sometimes people confuse my skepticism with other people's skepticism.
But I think if you look at the things that I have said in print,
they're mostly right.
And, you know, like Tyler Cowen said, you're wrong about everything.
You're always wrong.
And I said, Tyler, can you point to something?
And he said, well, you've written too much.
I can't do it.
Well, I look through some of your stuff.
And I do think that sometimes it seems like you might have put, like, this enormous burden of proof on the AI industry.
Like you do pick out sometimes like everyone that says like AGI is coming this year.
And you're like, these people are liars.
But that being said, like I think your core arguments about scaling.
I've offered to put up money.
I offered Elon Musk a million dollars.
And I offered criteria.
And I'll tell you about that.
In 2022, in May, I offered him a $100,000 bet.
Later, I upped it to a million dollars.
And I put out criteria on Twitter.
I said, I'm going to offer these.
Do these make sense to you?
And everybody on Twitter, not everybody,
nearly everybody on Twitter at the time said those were fine.
People accused me of goalpost shifting.
But my goalposts are the same, right?
With my 2014 paper in the New Yorker, my article in the New Yorker, where I talk about a comprehension challenge,
I've stuck by that.
That is part of my AGI criteria.
I made a bet with Miles Brundage on the same criteria, and he actually took the bet, to his credit.
But when I put them out in 2022, this is the important part.
Everybody was more or less in agreement that those were reasonable criteria.
And I said, if you could beat my comprehension challenge, which is to say, you know, watch movies, know when to laugh, understand what's going on, if you could do the same thing for novels, if you could translate math from English into stuff you could formally verify, if you could go into a random kitchen, you know, teleoperating a robot, and, you know, make a dinner. If you could, what was the other criterion? Oh, you could write, I think it was 10,000 lines of bug-free code. I mean, you could do debugging to get there, whatever, you know. Okay, if you could do like three out of five, we'll call that AGI. And at the time, everybody said
that's fine. Now people are backtracking. Like Tyler Cowen said o3 is AGI. Right. By what measure? I felt that that was kind of a stretch. That was cheesy. And he said, he said the measure was him. It looked like AGI to him. He invoked the, you know, classic line about pornography, I know it when I see it. But people have pointed out lots of problems with o3. I think it's absurd to call o3 AGI. I wouldn't call it AGI. So, you know, you, a minute ago, said, Gary, you're
wrong, but then you ticked off a bunch of things I'm actually right about. I didn't say,
Gary, you're wrong. I said, is there a point where you'll admit you're wrong? Like, what I'm...
Yes, there is. It's the point at which I'm wrong.
So let me clarify one other thing.
But let me just say, I didn't say that you're wrong.
I just said like, what is the point of advance that you would say, okay, I've been wrong
about this stuff?
Because I have listened to some of your...
Let me clarify something.
But I also, right after I said that, I was like, you know, it's kind of like a tough question.
And then I explained where I agreed with you.
Yeah.
Yeah, that's what happened.
So some people take me as saying that,
AI is impossible.
And that's not me, right?
I actually love AI.
I want it to work.
I just want us to take a different approach, right?
I want us to take a neurosymbolic approach where we have some classical elements of classical AI, like explicit knowledge, formal reasoning and so forth, that people like Hinton have kind of thumbed their nose at, but that, say, Demis Hassabis has used very effectively in AlphaFold.
So we can get into that if you want.
If we get to AI, the question about whether I'm right or not depends on how we get there.
So I've made some pretty particular guesses about it. And I have guessed that pure LLMs will not get us there, pure large language models.
So will I concede I'm wrong when we get to AI that actually works?
Depends on how it works.
Okay.
Yeah.
And I think it's clear that, I mean, I don't know, we could watch this back in a couple
years.
If we get there with pure LLMs, if it's another round of scaling that, you know, gets us to AGI by the criteria that I laid out, then I will have to concede that I was wrong.
Okay.
All right.
I'm going to take a quick break and then let's come back and talk a little bit more about the current risks and maybe read some of your tweets and have you expand upon them.
We'll be back right after this.
And we're back here on Big Technology podcast with AI skeptic, Gary Marcus.
Gary, let me ask you this.
So, you know, one of the things we talked about last time you were here was that AI doesn't
have to reach the AGI threshold to be something that we should be concerned about.
Absolutely not.
And a lot of the focus was on hallucinations.
You and I both, I think we have a little bit of a diverging opinion on hallucinations.
I think they've gotten much better.
You'd think it's still a big problem.
Those could both be true, by the way.
That could both be true.
All right.
So let's put a pin in that for now.
I think where I'm seeing the most concern is virology.
We just had a study that came out that showed that AI is now at PhD level in terms of virology.
We had Dan Hendricks from the Center for AI Safety who was here.
We talked about the fact that like AI can now walk
virologists through how to create or enhance the function of viruses.
And we're starting to see some of these AI programs, like you mentioned, Deepseek,
be available to everybody, be pretty smart, and be released without guardrails,
or not enough guardrails, especially if they're open source.
So what are you worried about here?
Is that the core concern or is there other stuff?
I think there's actually multiple worries.
And the different worries from different architectures and architectures used in
different ways and so forth. So dumb AI can be dangerous. So if dumb AI is empowered to control things like
the electrical grid and it makes a bad decision, that's a risk, right? If you put a bad driverless
car system in, you know, a million cars, a lot of people would die, right? The main thing that is
saved a lot of people from dying in driverless cars is there aren't that many of them. And so,
you know, even though they're not actually super safe at the moment, you know, restrict where we use
them and so forth. We don't put them in situations where they wouldn't be very bright.
So dumb AI can cause problems. Super smart AI could, you know, maybe lock us all in cages
if it wanted to. I mean, we have to talk about the likelihood of it wanting to, but there
definitely worries there and we need to take them seriously. And then you have things that are
in between. So, for example, the virology stuff is AI that's not generally all that smart,
But it can do certain things.
And in the hands of bad actors, it can do those things.
And I think it is true either now or will be soon enough that these tools can be used
to help bad actors create viruses that cause problems.
And so I think that's a legitimate worry, even if we don't get to AGI.
So we have dumb AI right now is a problem.
Smarter AI, even if it's not AGI, can cause a different set of problems.
And if we ever got to superintelligence, that might open a different can of worms. I mean, you can think, like, you know, human beings of different
degrees of brightness and with different skills, if they choose to do bad things, can, you know,
cause different kinds of harm. And so what's your view on open source then? I worry about it.
I do worry about it because bad actors are using these things already. They're mostly using them
for misinformation, not sure how much biology they're doing, but they will, and they're going to
be interested in that. You know, state actors that want to do terrorist kinds of things will do
that. I am worried about open sourcing at all, and I think the fact that meta could
make that decision for the whole world is not good. Like, I think there should have been much
more government oversight, scientists should have contributed more of the discussion. But now
those kinds of models are open source. They've been released. We can't put that genie back
in the bottle. And over time, just like people, I should have said this earlier,
even if the models don't get any better, we will still find new uses for them.
And some of those new uses are positive and some of them will be negative, right?
We're still exploring what these technologies can do, and people are finding, you know,
ways to make money in dubious ways and to cause harm for various reasons and so forth.
And so, you know, giving those tools very broadly has problems.
On the other hand, I think what we've learned in the last three years is that the closed companies
are not the ethical actors that they once were.
So, you know, Google famously said don't be evil, and they took that out of their platform.
You know, Microsoft was all about AI ethics.
And then, you know, when Sydney came out, they're like, we're not taking this away.
We're going to stick with it.
Oh, they did kill Sydney, right?
Sydney was this very, I don't know, raunchy AI that tried to steal Kevin Roose's wife.
Yeah, I mean, they reduced what it could do.
But they stuck with it in some sense.
But, you know, and like OpenAI said that we're, you know, a nonprofit for public benefit.
Now they're desperately trying to become a for-profit
that is really not particularly interested in public benefit.
It's interested in money.
And they may become a surveillance company,
which I don't think is...
Because what you're talking about with the advertising side?
So basically they have a lot of private data
because they have a lot of users and people type in all kinds of stuff.
And they may have no choice but to monetize that.
And, you know, they've been showing signs of it.
They hired Nakasone, who used to be at the NSA.
They bought a share in a webcam company.
And they recently announced they're trying to build a social media company.
They want, you know, they look like they're on a path to sell your data, your very private data, to, you know, whoever they care to.
It's concerning because whatever data I gave to Facebook, like I always used to think that this conversation around Facebook data was a little ridiculous because I didn't think I was giving that much information to Facebook.
But I am giving Open AI a lot of information.
I mean, there's a lot of people that treat it as a therapist.
Well, that's the number one use, as therapist companion. I don't use it as a therapist, but I'm, like, putting a lot of my work
information in there. I read a great book called Privacy and Power, I'm blanking slightly on the title, by Carissa Véliz. And she had examples in there. Like, people were taking data from Grindr and extorting people. Right. Grindr is an app for gay people, if you don't know. And, you know, that's still, in our society, like, in some places it's acceptable, in other places not. You know, people don't necessarily want to come out if they're gay or whatever. And so people have been extorting people with
data from Grindr. Imagine what they're going to do. You know, people type into ChatGPT, like, their very specific sexual desires, maybe crimes they've committed, like, people type in a lot
of times. They want to commit. Crimes they want to commit. You know, we have a political climate
where, you know, conspiracy might be treated in a different way than it was. And so just typing it into ChatGPT might, you know, get somebody deported.
Who knows?
Now I'm freaked out.
It's, I wouldn't personally use the system because the writing is on the wall.
And I think that they, they make some promises to their business customers, but not to their, you know, consumer customers.
And that stuff is available for them to do what they want with it.
And they probably will because that's how they're going to make money.
Here's another way to put it. Suppose I'm right about the things I've been arguing, and they can't really get to, you know, the GPT-7-level model that everybody dreamed of. They can't really build AGI.
But they're sitting on this incredible treasure chest of data.
What are they going to do?
Well, if they can't make AGI, they're going to sell that data.
This is why I always thought, like, when you take in a lot of money,
it's always, you always have to pay that money back in some way,
and that changes the way you operate.
That's right.
I mean, look at 23andMe.
They're out of business, and now that data is for sale, and who knows what's going to happen with the 23andMe data.
I hope you're wrong about this one, but the history of the internet is...
I'm not saying you are.
I'm just saying I hope you are, because that would do that.
I hope I'm wrong, too.
But there is a level of...
There's a lot of things I hope I'm wrong about.
Gary, if people got freaked out about what Facebook was doing with your data,
if they overstepped, there's going to be a major societal backlash.
Maybe.
I mean, sometimes people just accommodate to these things.
I've been amazed at how willing people are to give away all that information to Facebook.
I don't use it anymore, but...
Let me ask you this.
You quote tweeted one of these...
So we'll get into a tweet here.
You quote tweeted one of these tweets:
is the push to optimize AI for user engagement
just metric-chasing Silicon Valley brain,
or an actual pivot in business model
from create a post-scarcity society God
to create a worse TikTok?
This is what basically we're talking about,
is that that might be the pivot.
Yeah, that's right.
I think that was someone else's tweet that I quote...
Yeah, Daniel Litt, and you said,
I've been basically telling you about this.
Yeah, exactly.
So that's what it is.
You also wrote this, saying the quiet part out loud,
the business model of GenAI will be surveillance and hyper-targeted ads,
just like it has been for social media.
We were just talking about that.
And what I was quote tweeting there
was something from Aravind Srinivas, if I pronounce his name correctly,
who's the CEO of Perplexity.
And he basically, I said he's saying the quiet part out loud.
He basically said, we're going to use this stuff
to hyper-target ads.
You also said that companies like Johnson and Johnson
will finally realize that GenAI was not going to deliver on its promises.
Have there been companies that have pulled back?
Are you just using Johnson and Johnson as an example?
That was based on a Wall Street Journal thing, and I may have failed to include the link because of Elon Musk's crazy notions around links.
Elon, you've got to put the links in the...
Elon, you've got to put the links in Twitter.
This is unacceptable, yes.
That's right.
So, anyway, that was...
I was alluding to a Wall Street Journal report that had just come out, which showed that J&J basically said, in so many words, I'll paraphrase it, they tried GenAI in a lot of different things, generative AI, and a few of them worked, and a lot of them
didn't, and they were going to, like, stick to the ones that did, like, customer service
and maybe not do some of the others. You have to go back, you know, a year and a half in
history to when people thought GenAI was going to do everything that an employee was
able to do, basically. And I think what J&J and a bunch of companies have found out is that it's
not really true. You know, they can do a bunch of things that employees do, but they can't
typically do everything that a single employee does. And, you know, they're reasonably good at
triaging customer service. And they're not necessarily good at creating, say, a careful financial projection. Okay. So, Gary, we have like five minutes left. I want, you said something in the,
I think in the first half about the path that you think needs to be taken to AGI. Can you explain what
that is in, like, as basic of a way as you can, to, like, you know, make it as simple to understand for anyone who's not caught up with the systems that you spoke about?
Sure. So a lot of people will have read Danny Kahneman's book Thinking, Fast and Slow, and there he talked about System 1 and System 2 cognition. So System 1 was fast and automatic, reflexive. System 2 was more deliberate, more like reasoning. I would argue that the neural networks that power generative AI are basically like System 1 cognition.
They're fast, they're automatic, they're statistically driven, but they're also error-prone.
They're not really deliberative.
They can't sanity check their own work.
And I would say we've done that pretty well, but System 2 is more like classical AI,
where you can explicitly represent knowledge, reason over it.
It looks more like computer programming.
And these two schools have both been around since the 1940s, but they've been very separate
for what I think is sociological and economic reasons.
Either you work on one or you work on the other,
people argue or fight for graduate students
and fight for grants and stuff like that.
So there's been a great deal of hostility between the two.
But the reality is they kind of complement each other.
Neither of them has worked on its own.
So the classical AI failed, right?
People build all these expert systems,
but there were always these exceptions
and they weren't really robust.
You'd pay graduate students to patch up the exceptions.
Now we have these new systems.
They're not really robust either, which is why OpenAI is paying Kenyans and PhD students and so forth to kind of fix the errors.
The advantage of System 1 is it learns very well from data.
The disadvantage is it's not very accurate. Sorry, not very abstract.
So, I should have said that slightly differently.
The large language models and that kind of approach, transformers, are very good at learning, but they're not very good at abstraction.
You can give them billions of examples, and they still never really understand what multiplication is.
And they certainly never get any other abstract concept well.
The classical approach is great at things like multiplication.
You write a calculator, and it never makes a mistake.
But it doesn't have the same broad coverage, and it can't learn new things.
You can wire multiplication in, but how do you learn something new?
The classical approaches have had trouble with that.
And so I think we need to bring them together.
And this is what I call neurosymbolic AI.
And it's really what I've been lobbying for for decades.
And I think it was hard to raise money to do that in the last few years because everybody was
obsessed with generative AI.
But now that they're seeing the diminishing returns, I think investors are more open to trying
alternatives.
And also AlphaFold is actually a neurosymbolic model.
And it's probably the best thing that AI ever did.
And so decoding proteins, protein folding.
Yeah, figuring out the three-dimensional structure of a protein from a list of its nucleotides.
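As a toy illustration of the hybrid being described, here is a minimal neurosymbolic sketch: a statistical, System 1-style front end that defers to a symbolic, System 2-style solver whenever a question falls in a closed, verifiable domain. The function names are hypothetical, not any production architecture:

```python
import re

def neural_guess(question: str) -> str:
    """Stand-in for a System 1 component: fast, pattern-matching, fallible.
    (A real system would call a large language model here.)"""
    return "It is probably around 400."  # plausible-sounding, unverified

def symbolic_solver(question: str):
    """Stand-in for a System 2 component: explicit rules that never guess."""
    match = re.search(r"what is (\d+) times (\d+)", question.lower())
    if match:
        a, b = int(match.group(1)), int(match.group(2))
        return str(a * b)  # exact, verifiable answer
    return None  # outside the symbolic component's domain

def answer(question: str) -> str:
    """Route to the symbolic solver for closed, verifiable questions;
    otherwise fall back to the statistical guess."""
    exact = symbolic_solver(question)
    return exact if exact is not None else neural_guess(question)

if __name__ == "__main__":
    print(answer("What is 37 times 11?"))    # symbolic path: 407
    print(answer("Summarize this podcast"))  # neural path: unverified guess
```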
Are you going to raise money to try to do this?
I'm very interested in that. Let's put it that way.
Masa Son, if you want to make use of your money.
No, I'm kidding.
You talking to him?
Not at this particular moment.
Okay. Masa, if you're watching.
I don't know. Trying to help.
Okay, great. Well, Gary, can you shout out where to find your substack?
So if anybody wants to read your longer work on the state of AI, where should they go?
Sure. So people might want to read my last two books, by the way, Taming Silicon Valley,
which is really about how to regulate AI. And rebooting AI, which was 2019, is a little bit old,
but still I think anticipates a lot of the problems around common sense and world models that we're still facing today.
And then for kind of almost daily updates, I write a Substack, which is free, although you can pay if you like to support me. And that's at garymarcus.substack.com.
Okay, well, I'm a subscriber, Gary. Great to have you on the program. Thanks so much
for coming. Thanks a lot for having me again, yet again. Yet again. Well, we'll keep doing it.
It's always nice to hear your perspective on the world of AI. So I always enjoy our conversations.
Thanks for having me. Yes, same here. All right, everybody, thank you for listening. We'll be back
on Friday breaking down the week's news. Until then, we'll see you next time on Big Technology Podcast.