Microsoft Research Podcast - AI Frontiers: The Physics of AI with Sébastien Bubeck
Episode Date: March 23, 2023

Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this new Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.

The first episode features Sébastien Bubeck, who leads the Machine Learning Foundations group at Microsoft Research in Redmond. He and his collaborators conducted an extensive evaluation of GPT-4 while it was in development, and have published their findings in a paper that explores its capabilities and limitations—noting that it shows “sparks” of artificial general intelligence.

https://www.microsoft.com/research
Transcript
I'm Ashley Llorens with Microsoft Research.
I spent the last 20 years working in AI and machine learning,
but I've never felt more fortunate to work in the field than at this moment.
Just this month, March 2023,
OpenAI announced GPT-4,
a powerful new large-scale AI model with dramatic improvements in reasoning, problem solving, and much more.
This model and the models that will come after it represent a phase change in the decades-long pursuit of artificial intelligence. In this podcast series, I'll share conversations with fellow researchers about our initial impressions of GPT-4, the nature of intelligence, and ultimately how innovations like these
can have the greatest benefit for humanity.
Today I'm sitting down with Sébastien Bubeck,
who leads the Machine Learning Foundations group at Microsoft Research.
In recent months, some of us at Microsoft had the extraordinary privilege
of early access to GPT-4.
We took the opportunity to dive deep
into its remarkable reasoning, problem solving,
and the many other abilities that emerged
from the massive scale of GPT-4.
Sébastien and his team took this opportunity
to probe the model in new ways,
to gain insight into the nature of its intelligence.
Sébastien and his collaborators have shared
some of their observations in the new paper called Sparks of Artificial General Intelligence, Experiments with an Early Version of GPT-4.
Welcome to AI Frontiers.
Sébastien, I'm excited for this discussion.
So the place that I want to start is with what I call the moment.
Okay.
Or the AI moment. All right. So what do I mean by
that? All right. So in my experience, everyone that's picked up and played with the latest wave
of large scale AI models, whether it's ChatGPT or the more powerful models coming after,
has a moment, right? They have a moment where they're genuinely surprised by what the
models are capable of, by the experience of the model, the apparent intelligence of the model.
And in my observation, the intensity of the reaction is more or less universal. Although
everyone comes at it from their own perspective, it triggers its own unique range of emotions from
awe to skepticism. So now, I'd love from your perspective,
right, the perspective of a machine learning theorist, what was that moment like for you?
That's a great question to start. So when we started playing with the model, of course,
you know, we did what I think anyone would do. We started to ask it mathematical questions,
mathematical puzzles; you know, we asked it to give some poetry analysis on a poem.
Peter Lee did one on black salt, which was very intriguing.
But every time we were left wondering,
okay, but maybe it's out there on the internet.
You know, maybe it's just doing some kind of pattern matching
and it's finding a little bit of structure,
but this is not real intelligence.
You know, it cannot be.
How could it be real intelligence
when it's such simple components coming together?
So for me, I think the really awestruck moment was one night when I woke up
and I turned on my laptop and fired up the playground.
And, you know, I have a three-year-old at home, a daughter,
who is a huge fan of unicorns.
And I was just wondering, you know what, let's ask
GPT-4 if it can draw a unicorn. And you know, in my professional life, I play a lot with LaTeX,
this programming language for mathematical equations. And in LaTeX, there is this
sub-programming language called TikZ to draw images using code. And so I just asked it, can you draw a unicorn in TikZ?
And it did it so beautifully.
It was really amazing.
And, you know, it was really this very visual
because it's an image, you can render it
and you can see the unicorn.
And no, it wasn't a perfect unicorn.
Really what was amazing is that
it drew a unicorn which was quite abstract.
It was really the concept of a unicorn.
You know, all the bits and pieces of what makes a unicorn.
The horn, the tail, you know, the fur, etc.
And this is what really struck me at that moment.
First of all, there is no unicorn in TikZ online.
I mean, who, you know, would draw a unicorn in a mathematical language?
I mean, this doesn't make any sense.
So there is no unicorn online.
I was pretty sure of that. And then we did further experiments to confirm that. And we're sure that
it really drew the unicorn by itself. But really what struck me is this getting into at what is
the concept of a unicorn, that there is, you know, a head, a horn, the legs, etc. This has been a
long standing challenge for AI research. This has always been the problem with all those, you know, AI systems that came before,
like the convolutional neural networks that were trained on ImageNet and, you know, image
data set, and that can recognize, you know, whether there is a cat or a dog in the image,
et cetera.
Those neural networks, it was always hard to interpret them.
And it was not clear how they were detecting exactly whether there is a cat or a dog.
In particular, they were susceptible to these, you know,
adversarial examples like small perturbation to the input
that would completely change the output.
And it was understood that the big issue is that they didn't really get
the concept of a cat or a dog.
And there, suddenly, with GPT-4, it was kind of clear to me at that moment
that it really
understood something. It really understands what is a unicorn. So that was the moment for me.
What did you feel in that moment? Does that change your concept of your field of study,
your relationship to the field? What did you feel like in that moment?
It really changed a lot of things to me. So first of all, I never thought
that I would live to see what I would call a real artificial intelligence, like an intelligence
which is artificial. Of course, you know, we've been talking about AI for, you know, many decades
now. And, you know, the AI revolution in some sense has been happening for a decade already.
But I would argue that all those systems before, they were really this narrow intelligence,
which does not really rise to the level
of what I would call intelligence.
Here, we're really facing something
which is much more general
and really feels like intelligence.
So at that moment, I felt honestly lucky.
I felt lucky that I had early access to this system,
that I could be one of the first human beings
to play with it.
And I saw that this is really going to change the world dramatically. And it is going to be,
you know, selfishly, it's going to change my field of study, as you were saying.
Now, suddenly, we can start to attack what is intelligence, really, we can start to
approach this question, which seemed completely out of reach before.
Really deep down inside me, incredible excitement.
That's really what I felt.
Then upon reflection, you know, in the next few days, etc., there is also some worry.
Of course, clearly things are accelerating dramatically.
Not only did I never think that I would live to see a real artificial intelligence,
but the timeline that I had in mind, say, you know, 10 years ago or 15 years ago when I was a PhD student, I thought maybe by the end of the decade, the 2010s, maybe at that time we will have a system that can play Go better than humans. That was my target.
And maybe 20 years after that, we will have a system that can do language.
And maybe somewhere in between, we will have a system that can play multiplayer games like StarCraft 2 or Dota 2.
All of those things got compressed into the 2010s. And by the end of the 2010s, we had basically solved language in a way with GPT-3. And now we enter the 2020s and now suddenly something totally unexpected,
which wasn't in the cards for, you know,
the 70 years of my life and professional career,
intelligence in our hands.
So yeah, it's just changing everything.
And this compressed timeline,
I do worry, where is this going?
You know, there are still fundamental limitations that I'm sure we're going to talk about, and it's not clear whether the acceleration is going to keep going.
But if it does keep going, yeah, it's going to challenge a lot of things for us as human beings.
As someone that's been in the field for a while myself, I had almost a very similar reaction where I felt like I was interacting with a real intelligence,
like something deserving of the name artificial intelligence, AI.
What does that mean to you?
What does it mean to have real intelligence?
It's a tough question, you know, because, of course, intelligence has been studied for many decades.
And, you know, psychologists have developed tests
of your level of intelligence, etc. But in a way, I feel intelligence is still something
very mysterious. We recognize it when we see it, but it's very hard to define. And what I'm hoping,
what I want to argue, is that with this system, basically, it was very hard before to study what is intelligence
because we had only one example of intelligence. What is this one example? I'm not necessarily
talking about human beings, but more about natural intelligence. By that, I mean intelligence that
happened on planet Earth through billions of years of evolution. This is one type of intelligence,
and this was the only example of intelligence
that we had access to, and so all our theories were kind of fine-tuned to that example of
intelligence. Now I feel, now that we have a new system, which I believe rises to the
level of being called an intelligent system, we suddenly have two examples which are very
different. GPT-4's intelligence is comparable to human in some ways, but it's also very, very different.
It makes, you know, it can both solve Olympiad-level mathematical problems and also make elementary school mistakes when adding two numbers.
So it's clearly not human-like intelligence.
It's a different type of intelligence.
And of course, because it came about through a very different process than natural evolution.
You could argue that it came about through a process which you could call artificial evolution.
That's how I would call it.
And so now I'm hoping that now that we have those two different examples of intelligence,
maybe we can start to make progress on defining it and understanding what it is.
So that was a long-winded answer to your question,
but I don't know how to put it differently.
Basically, the way for me to test intelligence is to really ask creative questions,
difficult questions that you do not find online
and to search, because in a way you could ask,
is Bing, is Google, are search engines intelligent?
I mean, they can answer tough questions.
Are these intelligent systems?
Of course not.
Everybody would say no.
So you have to distinguish, you know,
what is it that makes us say that GPT-4 is an intelligent system?
Is it just the fact that it can answer many questions?
No, it's more that it can inspect its answers.
It can explain itself.
It can, you know, interact with you.
You can have a discussion.
This interaction is really of the essence of intelligence to me.
It certainly is a provocative and unsolved, you know, kind of question of what is intelligence. And perhaps equally mysterious is how we actually measure intelligence, which is a challenge even for humans.
Yes.
Which I'm reminded of with young kids in the school system, as you are or will be soon here as a father. You've had to think differently
as you've tried to measure the intelligence of GPT-4. And you alluded to that. I'd say the
prevailing way that we've gone about measuring the intelligence of AI systems or intelligence systems is through this process of benchmarking. And you and your team
have taken a very different approach. Can you maybe contrast those? Of course. Yeah. So maybe
let me start with an example. So we used GPT-4 to pass mock interviews for software engineer
positions at Amazon, at Google, at Meta, etc.
It passes all of those interviews very easily.
Not only does it pass those interviews,
but it also ranks at the very top of human beings.
In fact, for the Amazon interview, not only did it pass all the questions,
but it scored better than 100% of all the human users on that website.
So this is really incredible.
And the headlines would be,
GPT-4 can be hired as a software engineer at Amazon.
But this is a little bit misleading to view it that way,
because those tests, they were designed for human beings.
They make a lot of hidden assumptions
about what is going to be the person that they are interviewing.
In particular, they will not test whether that person has a memory from one day to the next.
This is baked in. Of course, human beings remember the next day what they did, unless there is some
very terrible problem. So all those benchmarks, all those benchmarks of intelligence
at least, they face this issue that they were designed to test human beings.
So we have to find new ways to test intelligence when we're talking about the intelligence
of AI systems.
So that's point number one.
Point number two is so far in the machine learning tradition, you know, we have developed
lots of benchmarks to test AI systems, narrow AI systems.
This is how the machine learning community has
made progress over the decades, by beating benchmarks, by having systems that keep improving
percentage by percentage over those target benchmarks. Now, all of those become kind of
irrelevant in the era of GPT-4 for two reasons. Number one is GPT-4, we don't know exactly what it has been trained on.
And in particular,
it might have seen all of these data sets.
So really you cannot separate anymore
the training data and the test data.
This is not really a meaningful way
to test something like GPT-4
because it might have seen everything.
For example, Google came out with a suite of benchmarks,
which they call Big Bench.
And in there, they hid a code to make sure that if you don't know the code,
then you haven't seen this data.
And of course, GPT-4 knows this code.
So it has seen all of BigBench, so you just cannot benchmark it against BigBench.
So that's problem number one for the classical ML benchmark.
Problem number two is that all those benchmarks are just too easy.
It's just too easy for GPT-4.
It crushes all of them hands down very, very easily.
In fact, it also does the same thing for the medical license exam,
for the multi-state bar exam, all of those things.
It just passes it very, very easily.
So the reason why we have to go beyond this,
really beyond the classical ML benchmarks,
is we really have to test the generative abilities, the interaction abilities, you know, how is it able to
interact with human beings? How is it able to interact with tools? How creative can it be at
the task? So all those questions, it's very hard to benchmark them, you know, around hard benchmark
where there is one right solution. Now, of course, the ML community has grappled with this problem recently
because generative AI has been in the works for a few years now,
but the answers are still very tentative.
Just to give you an example, imagine that you want to have a benchmark
where you describe a movie and you want to write a movie review.
Let's say, for example, you want to tell the system,
write a positive movie review about this movie.
The problem is in your benchmark, you will have some in the data.
You will have examples of those reviews.
And then you ask your AI system to write its own review,
which might be very different from what you have in your training data.
So the question is, is it better to write something different?
Or is it worse?
Do you have to match what was in the training data?
Maybe, you know, GPT-4 is so good that it's going to write something better than what the humans wrote. And in fact, we have seen that many, many times: the training data was
crafted by humans, and GPT-4 just does a better job at it. So it gives better labels, if you want,
than what the humans did. So it cannot even compare to humans anymore. So this is a problem that we are facing as we are writing our paper, trying to assess
GPT-4's intelligence.
Give me an example where the model is actually better than the human.
Sure.
I mean, let me think of a good one.
I mean, coding.
It is absolutely superhuman at coding.
You know, we already alluded to this,
and this is going to have tremendous implications,
but really, coding is incredible.
So for example, you know, again,
going back to the example of movie reviews,
there is this IMDB dataset,
which is very popular in machine learning,
where, you know, there are many basic questions that you want to ask.
But now in the era of GPT-4,
you can give it the IMDB dataset
and you can just ask GPT-4, hey, can you explore the dataset?
And it's going to come up with suggestions of data analysis ideas.
Maybe it will say, maybe we want to do some clustering.
Maybe you want to, you know, cluster by the movie, you know, directors
and you will see, you know, which movies were the most popular and why, etc.
It can come up creatively with its own analysis.
So that's one aspect: definitely coding, data analysis, it can very easily be superhuman.
I think in terms of writing, I mean, its writing capabilities are just astounding.
For example, in the paper, we asked it many times to rewrite parts of what we wrote.
And it rewrites it in this much more lyrical,
poetic way. You can ask for any kind of style that you want. In my, you know, novice eyes, I would say it's at the level of some of the best authors
out there in its style, and this is really native; you know, you don't have to do anything.
It does remind me a little bit of the AlphaGo moment, more specifically the AlphaZero moment, where all of a sudden you leave the human training data behind and you're entering into a realm where it's its only real competition. You talked about kind of the evolution that we need to have of how we measure
intelligence from ways of measuring narrow or specialized intelligence to measuring more
general kinds of intelligence. You know, we've had these narrow benchmarks. You see a lot of
this pass the bar exam, these kinds of human intelligence measures. But what happens when
all of those are also too easy? Yes. How do we think about measurement and assessment
in that regime? So, of course, you know, I want to say maybe this is a good point to bring up the
limitations of the system also. Right now, a very clear frontier that GPT-4 is not stepping over
is to produce new knowledge, to discover new things. For example, let's say in mathematics,
to prove mathematical theorems that humans do not know how to prove. Right now, the systems cannot do it. And this, I think, would be a very clean
and clear demonstration, where there is just no ambiguity, once it can start to produce this new
knowledge. Now, whether it's going to happen or not is an open question. I personally believe
it's plausible. I am not 100% sure what's going to
happen, but I believe it is plausible that it will happen. But then there might be another question,
which is what happens if the proof that it produces becomes inscrutable to human beings,
which is another option. I mean, you know, mathematics is not only this abstract thing,
but it's also a language between humans. Of course, at the end of the day, you can come back to the axioms, but that's not the way we humans do mathematics. So what happens
if, let's say, GPT-5 proves the Riemann hypothesis, and it is formally proved? So maybe it gives the
proof in the lean language, which is a formalization of mathematics, and you can formally verify that
the proof is correct, but no human being is able to
understand the concepts that were introduced. What does it mean? Is the Riemann hypothesis
really proven? I guess it is proven, but is that really what we human beings wanted? So this kind
of question might be on the horizon. And that, I think, ultimately might be the real test of
intelligence. Let's stick with this category of the limitations of the model.
And you kind of drew a line here
in terms of producing new knowledge.
You offered one example of that
as proving mathematical theorems.
What are some of the other limitations that you've discovered?
So, you know, GPT-4 is a large language model
which was trained on the next word prediction objective function.
So what does it mean?
It just means you give it a partial text
and you're trying to predict what is going to be the next word in that partial text.
And then at test time, once you want to generate content,
you just keep doing that on the text that you're producing.
So you're producing words one by one.
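To make that decoding loop concrete, here is a minimal Python sketch. The `predict_next_word` helper is a hypothetical stand-in for the trained model, not a real API; the loop only illustrates the word-by-word generation described above.

```python
# Minimal sketch of next-word-prediction decoding (illustrative only).
def predict_next_word(context: str) -> str:
    """Hypothetical stand-in for the trained model: return the most likely next word."""
    raise NotImplementedError  # a real system would call the language model here

def generate(prompt: str, max_words: int = 50, stop: str = "<end>") -> str:
    text = prompt
    for _ in range(max_words):
        word = predict_next_word(text)   # training objective: guess this next word
        if word == stop:
            break
        text = text + " " + word         # generation: append the guess and repeat, one word at a time
    return text
```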
Now, of course, it's a question,
and I have been reflecting upon myself,
you know, once I saw GPT-4,
it's a question whether human beings
are machine like this.
I mean, it doesn't feel like it, you know?
It feels like we're thinking
a little bit more deeply.
We're thinking a little bit more in advance
of what we want to say.
But somehow, as I reflect,
I'm not so sure,
at least when I speak verbally, orally,
maybe I am just coming up every time with the next word.
So this is a very interesting aspect.
But the key point is,
suddenly when I'm doing mathematics,
I think I am thinking a little bit more deeply.
I'm not just trying to see what is the next step,
but I'm trying to come up with a whole plan
of what I want to achieve.
And right now, the system is not able to do this
kind of long term planning. And we can give very simple experiments that show this. Maybe my
favorite one is, you know, let's say you have a very simple arithmetic
equality, I don't know, three times seven plus 21 times 27 equals something. So this is part of the prompt that you give to GPT-4.
And now you just ask, OK, you're allowed to modify one digit in this
so that the end result is modified in a certain way.
Which one do you choose?
So, you know, the way to solve this problem is that you have to think,
you have to, you know, try, OK, if I were to modify the first digit,
what would happen?
If I were to modify the second digit, what would happen?
And GPT-4 is not able to do that. GPT-4 is not able to think ahead
in this way. What it will say is just, oh, you know what, I think if you
modify the third digit, just randomly, the third digit is going to work. And it just tries and it
fails. And the really funny aspect is that once it starts failing,
GPT-4, this becomes part of its context,
which in a way becomes part of its truth.
So the failure becomes part of its truth
and then it will do anything to justify it.
It will keep making mistakes to keep justifying it.
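For concreteness, here is a minimal Python sketch of the kind of look-ahead the puzzle calls for: try every single-digit edit and simulate it before committing. The expression is the one from the example; the target condition is a made-up assumption for illustration.

```python
# Illustrative sketch: enumerate every single-digit edit of the expression and
# simulate each one before choosing -- the planning step described as missing.
expression = "3*7 + 21*27"   # the arithmetic from the example; evaluates to 588
target = 1000                # hypothetical goal: push the result above 1000

def single_digit_edits(expr: str):
    """Yield every expression obtained by changing exactly one digit."""
    for i, ch in enumerate(expr):
        if ch.isdigit():
            for d in "0123456789":
                if d != ch:
                    yield expr[:i] + d + expr[i + 1:]

def value(expr: str):
    """Evaluate the arithmetic, or return None for malformed edits (e.g. a leading zero)."""
    try:
        return eval(expr)
    except SyntaxError:
        return None

solutions = [e for e in single_digit_edits(expression)
             if (v := value(e)) is not None and v > target]
print(solutions)   # every edit that satisfies the goal, found by checking each option in advance
```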
So these two aspects, the fact that it cannot really plan ahead
and that once it
makes mistakes, it just becomes part of its truth. These are very, very serious limitations,
in particular for mathematics. I mean, this makes it a very uneven system once you approach
mathematics. You mentioned something that's different about machine learning, the way it's
conceptualized in this kind of generative
AI regime, which is fundamentally different than what we've typically thought about as machine
learning, where you're optimizing an objective function with a fairly narrow objective versus
when you're trying to actually learn something about the structure of the data, albeit through
this next word prediction or some other way. What do you think about that learning mechanism? Are there any limitations of that?
Yeah, so this is a very interesting question. So, you know, maybe I just want to backtrack for a
second and just acknowledge that what happened there is kind of a miracle. Nobody, I think,
nobody in the world, perhaps except OpenAI, expected that intelligence would emerge from this next-word prediction framework just on a lot of data.
I mean, this is really crazy if you think about it.
Now, the way I have justified it to myself recently is like this.
It is, you know, agreed that deep learning, which is what powers, you know, the GPT-4 training, you have a big neural network
that you're training with gradient descent, just trying to fiddle with the parameters.
So it is agreed that deep learning is this hammer that if you give it a data set,
it will be able to extract the latent structure of that data set. So for example, the first
breakthrough that happened in deep learning a little bit more than 10 years ago was the AlexNet moment where they trained a neural network to basically classify, you know, cats,
dogs, you know, cars, et cetera, with images. And when you train this network, what happens is that
you have these edge detectors that emerge on the first few layers of the neural network.
And nothing in the objective function told you that you have to come up with edge detectors.
This was an emergent property.
Why?
Because it makes sense.
That is the structure of an image: combining those edges to create geometrical shapes.
Now, I think what's happening, and we have seen this more and more with the large language models,
is that there are more and more emerging properties that happen as you scale up, you know, the size of the network and the size of the data. Now, what I believe is happening is that in the case of GPT-4, they gave it such a big data set,
so diverse, with so many complex parameters in it, that the only way to make sense of it,
the only latent structure that unifies all of this data is intelligence. The only way to make sense of
the data was for the system to become intelligent. This is kind of a crazy sentence. And, you know,
I expect the next few years, maybe even the next few decades, we'll try to make sense of whether
this sentence is correct or not. And hopefully we can, you know, human beings are intelligent
enough to make sense of that sentence. I don't know. Right now, I just feel like it's a reasonable hypothesis
that this is what happened there.
In a way, you can say maybe there is no limitation
to the next word prediction framework.
So that's one perspective.
The other perspective is, no, no, no.
Actually, the next word prediction token framework
is very limiting, at least at generation time.
At least once you start to generate new sentences, you should go beyond a little bit if you want to have the planning aspect,
if you want to be able to revisit mistakes that you made. So there we believe that at least at
generation time, you need to have a slightly different system. But maybe in terms of training,
in terms of coming up with intelligence in the first place, maybe this is a fine way to do it.
One aspect of our previous notion of intelligence, and maybe still the current notion of intelligence
for some, is this aspect of compression. The ability to take something complex and make it
simple, maybe thinking grounded in Occam's razor, where we want to generate the simplest explanation
of the data.
And some of the things you're saying and some of the things we're seeing in the model
kind of go against that intuition.
So talk to me a little bit about that.
So I think this is really exemplified well
in a project that we did here at Microsoft Research a year ago,
which we called Lego.
So let me tell you about this very briefly
because it will really get to the point of what you're trying to say.
So let's say you want to train an AI system that can solve middle school
systems of linear equations. So maybe it's x plus y equals z, 3x minus 2y equals 1, and
so on and so forth. You have three equations with three variables. And you want to train
a neural network that does that. It, you know, takes in the system of equations
and outputs the answer for it.
The classical perspective, the Occam's razor perspective,
would be collect a data set with lots of equations like this,
train a system to solve those linear equations,
and there you go.
You know, this way you have IID data,
you know, the same kind of distribution
at training time and at test
time.
Now, what this new paradigm of deep learning and in particular of large language models
would say is instead, even though your goal is to solve systems of linear equations for
middle school students, instead of just having training data of middle school systems of linear
equations, we're going to collect a hugely diverse set of data.
Maybe we're going to do next-word prediction,
not only on the systems of linear equations,
but also on all of Wikipedia.
Okay, so this is now, you know, a very concrete experiment.
You have two neural networks.
Neural network A, only trained on equations.
Neural network B, trained on equations, plus Wikipedia.
And any kind of classical thinking
would tell you that Neural Network B is going to do worse
because it has to do more things,
it's going to get more confused,
it's not the simplest way to solve the problem, et cetera.
But lo and behold,
if you actually run the experiment for real,
Network B is much, much, much better than Network A.
Now I need to quantify this a little bit.
Network A, if it was trained with systems of linear equations
with three variables,
is going to be fine on systems of linear equations
with three variables.
But as soon as you ask it four variables or five variables,
it's not able to do it.
It didn't really get the essence
of what it means to solve linear equations.
Whereas network B,
it not only solves systems of equations with three variables,
but it also does four, it also does five,
and so on and so forth.
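To make the setup concrete, here is an illustrative Python sketch of what equations-only training data might look like; the exact data format used in the Lego project may well differ, and the variable names and coefficient ranges here are assumptions.

```python
# Illustrative sketch (the real LEGO data format may differ): render small integer
# linear systems as text -- the kind of examples network A is trained on; network B
# would see these interleaved with generic text such as Wikipedia.
import numpy as np

def random_linear_system(num_vars: int = 3, seed=None) -> str:
    """Return a consistent integer linear system and its solution as plain text."""
    rng = np.random.default_rng(seed)
    solution = rng.integers(-5, 6, size=num_vars)             # hidden integer solution
    coeffs = rng.integers(-5, 6, size=(num_vars, num_vars))   # random integer coefficients
    rhs = coeffs @ solution                                    # right-hand side chosen to be consistent
    names = [f"x{i + 1}" for i in range(num_vars)]
    lines = [" + ".join(f"{a}*{n}" for a, n in zip(row, names)) + f" = {b}"
             for row, b in zip(coeffs, rhs)]
    answer = ", ".join(f"{n} = {v}" for n, v in zip(names, solution))
    return "\n".join(lines) + f"\nSolution: {answer}"

print(random_linear_system(3, seed=0))  # a 3-variable example; raise num_vars to probe generalization
```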
Now, the question is, why?
What's going on?
Why is it that making the thing more complicated,
going against Occam's razor,
why is that a good idea?
And, you know, the extremely naive perspective,
which, in fact, some people said because, you know,
it was so mysterious,
would be,
maybe it read the Wikipedia page
on solving systems
of linear equations, right?
But of course,
that's not what happened.
And this is another, you know,
aspect of this whole story,
which is anthropomorphization
of the system is a big danger,
but let's not get into that right now.
But the point is,
that's not at all the reason
why it became good at solving systems of linear equations.
It's rather that it had this very diverse data,
and it forced it to come up with unifying principles,
like more canonical components of intelligence,
and then it's able to compose these canonical components of intelligence
to solve the task at hand.
I want to go back to something you said much earlier around natural evolution versus this
kind of notion of artificial evolution. And I think that starts to allude to where I think
you want to take this field next, at least in terms of your study and your group. And that is
focusing on the aspect
of emergence and how intelligence emerges. So what do you see as the way forward from this point,
from your work with Lego that you just described for you and for the field?
Yes, absolutely. So I think I would argue that maybe we need a new name for machine learning.
In a way, GPT-4 and GPT-3 and, you know, all those other large language models,
in some ways, it's not machine learning anymore.
And by that, I mean, you know, machine learning,
it's all about how do you teach a machine a very well-defined task, recognize cats and dogs,
you know, something like that.
But here, that's not what we're doing.
We're not trying to teach it a narrow task.
We're trying to teach it everything.
And we're not trying to mimic
how a human would learn. You know, this is another point of confusion. Some people say,
you know, oh, but it's learning language, you know, but using more text than any human
would ever see. But that's kind of missing the point. The point is we're not trying to
mimic human learning. And that's why maybe learning is not the right word anymore. We're
really trying to mimic something which is more akin to evolution.
We're trying to mimic the experience
of millions, billions of entities
that interact with the world.
In this case, the world is, you know,
the data that humans produced.
So it's a very different style.
And I believe the reason why all the tools
that we have introduced in machine learning
are kind of useless and almost irrelevant
in light of GPT-4 is because it's a new field.
It's something that needs new tools to be defined.
So we hope to be at the forefront of that
and we want to introduce those new tools.
And of course, we don't know what it's going to look like,
but the avenue that we're taking to try to study this
is to try to understand emergence.
So emergence, again, is this phenomenon that
as you scale up the network and the data, suddenly there are new properties that emerge
at every scale. And Google had this experiment where they scaled up their large language
models from 8 billion to 60 billion to 500 billion parameters. And at 8 billion, it's able to understand
language and it's able to do a little bit of arithmetic. At 60 billion, suddenly it's able to translate between languages. You know,
before it couldn't translate, at 60 billion parameters, suddenly it translates. At 500
billion, suddenly it can explain jokes. You know, why can it suddenly explain jokes? So we really
would like to understand this. And of course, from our perspective, the way we want to do it is,
let me say it like this. There is another field out there that has been grappling with emergence for a long time, trying to study systems of very complex, you know, particles
interacting with each other and leading to some emergent behaviors. What is this field? It's
physics. So what we would like to propose is let's study the physics of AI or the
physics of AGI, because in a way, you know, we're really seeing this general intelligence now. So
what would it mean to study the physics of AGI? What it would mean is let's try to borrow from
the methodology that physicists, you know, have used for the last few centuries to make sense of
reality. And what were those tools? Well, one of them was to run
very controlled experiment. You know, if you look at a waterfall and you have the water which is
flowing and it's going in all kinds of ways, et cetera, and you go look at it in the winter and
it's frozen, I mean, good luck to try to make sense of the phases of water by just staring at
a waterfall. And GPT-4 or LaMDA or all those large language models, these are our waterfalls.
What we need are much more small-scale,
controlled experiments where we know we have pure water.
It's not being tainted by the stones, by the algae, etc.
We need those controlled experiments to make sense of it.
And Lego is one example.
So that's one direction that we want to take.
But in physics, there is another direction that you can take,
which is to build toy mathematical models
of the real world.
You try to abstract away
lots of things
and you're left with
a very simple mathematical equation
that you can study.
And then you have to go back
to real experiment
and see whether the prediction
from the toy mathematical model
tells you something
about the real experiment.
So that's another avenue
that we want to take.
And there we made some progress
recently also
with interns at MSR.
So we have a paper which is called Learning Threshold Units.
And here really we're able to understand how does the most basic element,
I don't want to say intelligence,
but the most basic element of reasoning emerges in those neural networks.
And what is this most basic element of reasoning?
It's a threshold unit.
It's something
that takes as input, you know, some value. And if the value is too small, then it just turns it to
zero. And this emergence already, it's a very, very complicated phenomenon. And we were able to
understand the non-convex dynamics at play and connect it to something which is called the edge
of stability, which is all very exciting. But the key point is that it's really,
we have a toy mathematical model.
And there, in essence, what we were able to do
is to say that emergence is related
to the instability in training,
which is very surprising
because usually in classical machine learning,
instability is something that you do not want.
You want to erase all the instabilities.
And here, somehow, through this physics of AI approach, where we have a toy mathematical model,
we're able to say, ah, actually the instability in training that you're seeing, that everybody
has seen for decades now, it actually matters for learning and for emergence. So this is the
first step that we took. I want to come back to this aspect of interaction. And I want to ask
you if you see fundamental limitations with this whole methodology around certain kinds of
interactions. So right now we've been talking mostly about these models sort of interacting
with information in information environments, with information that people produce,
and then producing new information.
Behind the source of that information is actual humans. And so I want to know if you see any
limitations or if this is an aspect of your study, how we make these models better at
interacting with humans, understanding the person behind the information produced.
And after you do that, I'm going to come back and we'll ask the same question of the natural world in which we as humans reside. Absolutely. So this is one of the
emergent properties of GPT-4, to put it very simply, that not only can it interact with
information, but it can actually interact with humans too. It can really, you know,
you can communicate with it. You can discuss, and you're going to have very interesting discussions.
In fact, some of my most interesting discussions in the last few months were with GPT-4.
This is surprising.
Not at all something we would have expected, but it's there.
Not only that, but it also has a theory of mind.
GPT-4 is able to reason about what somebody is thinking, what somebody is thinking about what somebody else is thinking, and so on and so forth. So it's really a very sophisticated theory of mind. There was recently
a paper saying that ChatGPT is roughly at the level of, you know, a seven-year-old in terms of
its theory of mind. For GPT-4, I cannot really distinguish it from an adult. Just to give you an
anecdote, I don't know if I should say this, but, you know, one day in the last few months, I had an argument with my wife
and she was telling me something and I just didn't understand what she wanted from me.
And I just talked with GPT-4.
I explained the situation.
I asked what's going on.
What should I be doing?
And the answer was so detailed, so thoughtful.
I mean, I'm really not making this up.
This is absolutely real.
I learned something from GPT-4 about the human interaction with my wife. This is as real as it gets. I can't see
any limitation right now in terms of interaction. And not only can it interact with humans,
but it can also interact with tools. And this is the premise, in a way, of the new Bing that was,
you know, recently introduced,
which is that this new model, you can tell it,
hey, you know what, you have access to a search engine.
You can use Bing.
If there is some information that you're missing and you need to find it out,
please make a Bing search.
And somehow, natively, this is again an emergent property,
it's able to use a search engine and make searches when it needs to,
which is really, really incredible.
And not only can it use those tools which are well-known,
but you can also make up tools.
You can say, hey, I invented some API.
Here is what the API does.
Now, please solve me problem XYZ using that API.
And it's able to do it natively.
It's able to understand your description in natural language
of what the API that you built is doing, and it's able to leverage its power and use it. This is really incredible
and opens, you know, so many directions. Yeah, we certainly see some, I mean, super impressive
capabilities like the new integration with Bing, for example. We also see some of those limitations
come into play. Tell me about your exploration of those in this context.
Right. So one keyword that didn't come up yet
and which is going to, you know, drive probably the conversation,
at least online and on Twitter, is hallucinations.
So those models, you know, they still, GPT-4 still does hallucinate a lot.
And in a way, for good reason, you know.
Hallucination, it's on a spectrum where on the one hand, you have bad hallucination,
completely making up facts which are contrary to the real facts in the real world.
But on the other hand, you have creativity.
I mean, when you create, when you generate new things, you are in a way hallucinating.
It's good hallucinations, but still, these are hallucinations.
So having a system which can both be creative but does not hallucinate at all, it's a very delicate balance. And GPT-4 did not solve
that problem yet. It made a lot of progress, but it didn't solve it yet. That's still a big
limitation, you know, which the world is going to have to grapple with. And, you know, I think in
the new being, it's very clearly explained that it is still making mistakes from time to time and
that you need to double check the result, etc.
I still think the rough contour of what GPT-4 says and the new Bing says is really correct.
It's a very good first draft most of the time and you can get started with that.
But then, yeah, you need to do your research and it cannot be used for critical missions yet.
Now, what's interesting is that GPT-4
is also intelligent enough to look over itself.
So once it produces a transcript,
you can ask another instance of GPT-4
to look over what the first instance did
and to check whether there is any hallucination.
This works particularly well
for what I would call in-context hallucinations.
So what would be an in-context hallucination is, let's say, you have a text that you're asking it to summarize.
And maybe in the summary, it invents something that was not out there.
Then the other instance of GPT-4 will immediately spot it.
So that's, you know, basically in-context hallucinations.
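As an illustration of that "second instance checks the first" idea, here is a minimal Python sketch. The `chat` helper is a hypothetical placeholder for a call to the model, not a real API, and the prompts are assumptions for illustration only.

```python
# Illustrative sketch of checking a summary for in-context hallucinations
# with a second model instance.  `chat` is a hypothetical placeholder.
def chat(prompt: str) -> str:
    """Stand-in for a call to a large language model (assumption, not a real API)."""
    raise NotImplementedError

def summarize_with_check(document: str) -> tuple[str, str]:
    # First instance produces the summary.
    summary = chat(f"Summarize the following text:\n\n{document}")
    # Second instance audits the summary against the source text.
    verdict = chat(
        "Here is a source text and a summary of it.\n\n"
        f"Source:\n{document}\n\nSummary:\n{summary}\n\n"
        "Does the summary state anything not supported by the source? "
        "Answer yes or no, and list any unsupported claims."
    )
    return summary, verdict
```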
We believe they can be maybe fully solved soon. The open world type of hallucination,
when you ask anything, for example, in our paper,
we ask where is the McDonald's at SeaTac,
at the airport in Seattle, and it responds gate C2.
And the answer is not C2, the answer is B3.
So this type of open world hallucination,
it's much more difficult to resolve, and we don't
know yet how to do that exactly.
Do you see a difference between a hallucination and a factual error?
I would say that no, I do not really see a difference between a hallucination and a factual
error.
In fact, I would go as far as saying that when it's making arithmetic mistakes,
which again, it still does, you know, when it adds two numbers,
you can also view it as some kind of hallucination.
And by that, I mean, it's kind of a hallucination by omission.
And let me explain what I mean.
So when it does a calculation, an arithmetic calculation,
you can actually ask it to print all of its steps.
And that improves the accuracy.
It does a little bit better if it has to go through all the steps.
And this makes sense from the next-word prediction framework.
Now, what happens is,
very often, it will skip a step.
It will kind of forget something.
This can be viewed as a kind of hallucination.
It just thought, it hallucinated
that this step is not necessary
and that it can move on
to the next stage immediately.
And so this kind of factual error,
or in this case, reasoning error, if you want,
they are all related to the same concept of hallucination.
There could be many ways to resolve those hallucinations.
Maybe we want to look inside the model a little bit more.
Maybe we want to change the training pipeline a little bit.
You know, maybe the reinforcement learning
with human feedback can help.
All of these are small patches.
And I want to make it clear to the audience that it's still an academic
open problem, whether any of those directions can eventually fix it, or is it a fatal error
for large language models that will never be fixed?
We do not know the answer to that question.
I want to come back to this notion of interaction with
the natural world. Yes. As human beings, we learn about the natural world through interaction with
it. We start to develop intuitions about things like gravity, for example. And there's an argument
or a debate right now in the community as to how much of that knowledge of how to interact with
the natural world is encoded and learnable from language and the kinds of information inputs
that we put into the model versus how much actually needs to be explicitly encoded in an
architecture or just learned through interaction with the world. What do you see here? Do you see
a fundamental limitation with this kind of architecture for that purpose? So I do think
that there is a fundamental limitation in terms of the current structure of the pipeline.
And I do believe it's going to be a big limitation
once you ask the system to discover new facts.
So what I think is the next stage of evolution
for the systems would be to hook it up
with a simulator of sorts.
So that the system at training time,
when it's going through all of the web,
it's going through all of the data produced by humanity,
suddenly it realizes,
oh, maybe I need more data of a certain type.
Then we want to give it access to a simulator
so that it can produce its own data.
It can run experiments,
which is really what babies are doing.
You know, infants, they run experiments when they play with a ball,
you know, when they look at their hand in front of their face.
This is an experiment.
I believe we do need to give the system access to a way to do experiments.
Now, the problem with this is you get into a little bit of a dystopian discussion
of whether we really want to give this to these systems,
which are super intelligent in some way.
Aren't we afraid that they will become superhuman in every way
if some of the experiments that they can run are to run code,
to access the Internet?
I mean, there are lots of questions about what could happen,
and it's not hard to imagine what could go wrong there.
It's a good segue into maybe a last question or topic to explore, which comes back to this phrase, AGI.
Yes.
Artificial General Intelligence.
In some ways, there's kind of a lowercase version of that where we talk about towards more generalizable kinds of intelligence.
That's the regime that we've kind of been exploring. Then there's a kind of a capital letter version of that, which is this almost like a sacred cow
or a kind of dogmatic pursuit within the AI community.
So what does that capital letter phrase AGI mean to you?
And what, you know, maybe the part B of that is,
is our classic notion of AGI the right goal
for us to be aiming for?
Excellent.
So I would say before interacting with GPT-4, to me, AGI was this unachievable dream.
Something that, you know, it's not even clear whether it's doable, you know, what does it even mean, etc.
And really by interacting with GPT-4, I suddenly had the realization that actually general intelligence
is something very concrete.
It's this general intelligence.
It's able to understand any kind
of topic that you bring up. It's going to be
able to reason about any of
the things that you want to discuss.
It can bring up information.
It can use tools. It can interact with
humans. It can interact with an environment,
etc. This is general intelligence.
Now, you're totally right in calling it, you know, lowercase AGI.
Why is it not uppercase AGI?
Because it's still lacking some of the fundamental aspects,
two of them which are really, really important.
One is memory.
Every new session with GPT-4 is a completely fresh tabula rasa session.
It's not, you know, remembering what you did yesterday with it.
And this I want to say that it's something which is emotionally hard to take
because you kind of develop a relationship with the system.
As crazy as it sounds, that's really what happens.
And so you're kind of disappointed that it doesn't remember, you know,
all the good times that you guys had together.
So this is one aspect.
The other one is the learning.
So right now, you cannot teach it new concepts very easily.
You know, you can turn the big, you know, crank of, you know, retraining the model.
Sure, you can do that.
But you cannot explain; you know, I gave you this example of using a new API.
Tomorrow, you have to explain it again.
It's not able to learn.
So, of course, learning and memory, those two things are very, very related, you know, as I just explained.
So this is one huge limitation.
To me, if it had that, I think it would qualify as uppercase AGI.
Now, not everybody would agree even with that because many people will say,
no, it needs to be embodied,
it needs to have real world experience, etc.
This becomes a philosophical question.
Is it possible to have something
that you would call a generally intelligent being
that only lives in digital world?
I don't see any problem with that, honestly.
I cannot see any issue with this.
Now, there is another aspect once you get into this philosophical territory, which is right now the system, they have
no intrinsic motivation. All they want to do is to generate the next token. So is that also an
obstruction to having something which is a general intelligence? Again, to me, this becomes more
philosophical than really technical, but maybe there is some aspect which is technical there.
Again, if you start to hook up the systems to simulators or to run their own experiments, then suddenly maybe they have some intrinsic motivation to just improve themselves.
So maybe, you know, that's one technical way to resolve the question.
I don't know.
There's a word around that phrase in the community, agent, or seeing agentic or goal-oriented behaviors.
And that is really where you start to get into the need for serious sandboxing or alignment or other kinds of guardrails for the system.
You know, that actually starts to exhibit goal-oriented behavior.
Absolutely. Maybe one other point that I want to bring up about AGI, which I
think is confusing a lot of people. And, you know, when you were talking about the sacred cow,
somehow when people hear general intelligence, they want something which is truly general,
that could grapple with any kind of environment. And not only that, but maybe that grapples with
any kind of environment and does so in a sort of optimal way. This universality and optimality,
I think are completely irrelevant to intelligence.
Intelligence has nothing to do
with universality or optimality.
We as human beings are notoriously not universal.
I mean, we change a little bit
the condition of your environment
and you're going to be very confused for a week.
It's going to take you months to adapt, et cetera. So we are very, very far from universal.
And I think I don't need to tell anybody that we're very far from being optimal. I mean,
the number of crazy decisions that we make, you know, every second basically is astounding. So
we're not optimal in any way. It is not realistic to try to have an AGI that would be universal and
optimal. And it's not even, you know even desirable in any way, in my opinion.
So that's maybe the sacred cow version,
which is not achievable and not even realistic, in my opinion.
Is there an aspect of complementarity
that we should be striving for in, say,
a refreshed version of AGI or this kind of long-term goal for AI?
Yeah, absolutely.
But, you know, I don't want to be here
in this podcast today and try to say,
you know, what I view for,
in terms of this question,
because I think it's really the community
that should come together
and discuss this in the coming weeks,
months, years,
and come up together with
where do we want to go?
Where does society want to go?
And so on and so forth. I think it's a terribly important question, and we should not dissociate
our futuristic goals from the technical innovation that we're trying to do day to day. We have to
take both into account. But I imagine that this discussion will happen and, you know,
we will know a lot more a year from now, hopefully.
Thanks, Sébastien. Just a really fun and fascinating discussion. Appreciate your time today.
Yeah, thanks, Ashley. It was super fun.