Microsoft Research Podcast - AI Frontiers: The Physics of AI with Sébastien Bubeck
Episode Date: March 23, 2023

Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this new Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.

The first episode features Sébastien Bubeck, who leads the Machine Learning Foundations group at Microsoft Research in Redmond. He and his collaborators conducted an extensive evaluation of GPT-4 while it was in development, and have published their findings in a paper that explores its capabilities and limitations—noting that it shows “sparks” of artificial general intelligence.

https://www.microsoft.com/research
Transcript
I'm Ashley Llorens with Microsoft Research.
I spent the last 20 years working in AI and machine learning,
but I've never felt more fortunate to work in the field than at this moment.
Just this month, March 2023,
OpenAI announced GPT-4,
a powerful new large-scale AI model with dramatic improvements in reasoning, problem solving, and much more.
This model and the models that will come after it represent a phase change in the decades-long pursuit of artificial intelligence. In this podcast series, I'll share conversations with fellow researchers about our initial impressions of GPT-4, the nature of intelligence, and ultimately how innovations like these
can have the greatest benefit for humanity.
Today I'm sitting down with Sébastien Bubeck,
who leads the Machine Learning Foundations group at Microsoft Research.
In recent months, some of us at Microsoft had the extraordinary privilege
of early access to GPT-4.
We took the opportunity to dive deep
into its remarkable reasoning, problem solving,
and the many other abilities that emerged
from the massive scale of GPT-4.
Sébastien and his team took this opportunity
to probe the model in new ways,
to gain insight into the nature of its intelligence.
Sébastien and his collaborators have shared
some of their observations in the new paper called Sparks of Artificial General Intelligence, Experiments with an Early Version of GPT-4.
Welcome to AI Frontiers.
Sébastien, I'm excited for this discussion.
So the place that I want to start is with what I call the moment.
Okay.
Or the AI moment. All right. So what do I mean by
that? All right. So in my experience, everyone that's picked up and played with the latest wave
of large scale AI models, whether it's ChatGPT or the more powerful models coming after,
has a moment, right? They have a moment where they're genuinely surprised by what the
models are capable of, by the experience of the model, the apparent intelligence of the model.
And in my observation, the intensity of the reaction is more or less universal. Although
everyone comes at it from their own perspective, it triggers its own unique range of emotions from
awe to skepticism. So now, I'd love from your perspective,
right, the perspective of a machine learning theorist, what was that moment like for you?
That's a great question to start. So when we started playing with the model, of course,
you know, we did what I think anyone would do. We started to ask it mathematical questions,
mathematical puzzles; you know, we asked it to give some poetry analysis on a poem.
Peter Lee did one on black salt, which was very intriguing.
But every time we were left wondering,
okay, but maybe it's out there on the internet.
You know, maybe it's just doing some kind of pattern matching
and it's finding a little bit of structure,
but this is not real intelligence.
You know, it cannot be.
How could it be real intelligence
when it's such simple components coming together?
So for me, I think the really awestruck moment was one night when I woke up
and I turned on my laptop and fired up the playground.
And, you know, I have a three-year-old at home, a daughter,
who is a huge fan of unicorns.
And I was just wondering, you know what, let's ask
GPT-4 if it can draw a unicorn. And you know, in my professional life, I play a lot with LaTeX,
this programming language for mathematical equations. And in LaTeX, there is this
sub-programming language called TikZ to draw images using code. And so I just asked it, can you draw a unicorn in TikZ?
And it did it so beautifully.
It was really amazing.
And, you know, it was really this very visual
because it's an image, you can render it
and you can see the unicorn.
And no, it wasn't a perfect unicorn.
Really what was amazing is that
it drew a unicorn which was quite abstract.
It was really the concept of a unicorn.
You know, all the bits and pieces of what makes a unicorn.
The horn, the tail, you know, the fur, etc.
And this is what really struck me at that moment.
First of all, there is no unicorn in TikZ online.
I mean, who, you know, would draw a unicorn in a mathematical language?
I mean, this doesn't make any sense.
So there is no unicorn online.
I was pretty sure of that. And then we did further experiments to confirm that. And we're sure that
it really drew the unicorn by itself. But really what struck me is this getting into at what is
the concept of a unicorn, that there is, you know, a head, a horn, the legs, etc. This has been a
long standing challenge for AI research. This has always been the problem with all those, you know, AI systems that came before,
like the convolutional neural networks that were trained on ImageNet and, you know, image
data set, and that can recognize, you know, whether there is a cat or a dog in the image,
et cetera.
Those neural networks, it was always hard to interpret them.
And it was not clear how they were detecting exactly whether there is a cat or a dog.
In particular, they were susceptible to these, you know,
adversarial examples like small perturbation to the input
that would completely change the output.
And it was understood that the big issue is that they didn't really get
the concept of a cat or a dog.
And there, suddenly, with GPT-4, it was kind of clear to me at that moment
that it really
understood something. It really understands what is a unicorn. So that was the moment for me.
What did you feel in that moment? Does that change your concept of your field of study,
your relationship to the field? What did you feel like in that moment?
It really changed a lot of things to me. So first of all, I never thought
that I would live to see what I would call a real artificial intelligence, like an intelligence
which is artificial. Of course, you know, we've been talking about AI for, you know, many decades
now. And, you know, the AI revolution in some sense has been happening for a decade already.
But I would argue that all those systems before, they were really this narrow intelligence,
which does not really rise to the level
of what I would call intelligence.
Here, we're really facing something
which is much more general
and really feels like intelligence.
So at that moment, I felt honestly lucky.
I felt lucky that I had early access to this system,
that I could be one of the first human beings
to play with it.
And I saw that this is really going to change the world dramatically. And it is going to be,
you know, selfishly, it's going to change my field of study, as you were saying.
Now, suddenly, we can start to attack what is intelligence, really, we can start to
approach this question, which seemed completely out of reach before.
Really deep down inside me, incredible excitement.
That's really what I felt.
Then upon reflection, you know, in the next few days, etc., there is also some worry.
Of course, clearly things are accelerating dramatically.
Not only did I never think that I would live to see a real artificial intelligence,
but the timeline that I had in mind, say, you know, 10 years ago or 15 years ago when I was a PhD student, I thought maybe by the end of the decade, the 2010s, maybe at that time we will have a system that can play Go better than humans. That was my target.
And maybe 20 years after that, we will have a system that can do language.
And maybe somewhere in between, we will have a system that can play multiplayer games like StarCraft 2 or Dota 2.
All of those things got compressed into the 2010s. And by the end of the 2010s, we had basically solved language in a way with GPT-3. And now we enter the 2020s and now suddenly something totally unexpected,
which wasn't in the cards for, you know,
the 70 years of my life and professional career,
intelligence in our hands.
So yeah, it's just changing everything.
And this compressed timeline,
I do worry, where is this going?
You know, there are still fundamental limitations that I'm sure we're going to talk about, and it's not clear whether the acceleration is going to keep going.
But if it does keep going, yeah, it's going to challenge a lot of things for us as human beings.
As someone that's been in the field for a while myself, I had almost a very similar reaction where I felt like I was interacting with a real intelligence,
like something deserving of the name artificial intelligence, AI.
What does that mean to you?
What does it mean to have real intelligence?
It's a tough question, you know, because, of course, intelligence has been studied for many decades.
And, you know, psychologists have developed tests
of your level of intelligence, etc. But in a way, I feel intelligence is still something
very mysterious. We recognize it when we see it, but it's very hard to define. And what I'm hoping,
what I want to argue, is that with this system, basically, it was very hard before to study what is intelligence
because we had only one example of intelligence. What is this one example? I'm not necessarily
talking about human beings, but more about natural intelligence. By that, I mean intelligence that
happened on planet Earth through billions of years of evolution. This is one type of intelligence,
and this was the only example of intelligence
that we had access to, and so all our theories were kind of fine-tuned to that example of
intelligence. Now I feel, now that we have a new system, which I believe rises to the
level of being called an intelligent system, we suddenly have two examples which are very
different. GPT-4's intelligence is comparable to human in some ways, but it's also very, very different.
It makes, you know, it can both solve Olympiad-level mathematical problems and also make elementary school mistakes when adding two numbers.
So it's clearly not human-like intelligence.
It's a different type of intelligence.
And of course, because it came about through a very different process than natural evolution.
You could argue that it came about through a process which you could call artificial evolution.
That's how I would call it.
And so now I'm hoping that now that we have those two different examples of intelligence,
maybe we can start to make progress on defining it and understanding what it is.
So that was a long-winded answer to your question,
but I don't know how to put it differently.
Basically, the way for me to test intelligence is to really ask creative questions,
difficult questions that you do not find online
and to search, because in a way you could ask,
is Bing, is Google, are search engines intelligent?
I mean, they can answer tough questions.
Are these intelligent systems?
Of course not.
Everybody would say no.
So you have to distinguish, you know,
what is it that makes us say that GPT-4 is an intelligent system?
Is it just the fact that it can answer many questions?
No, it's more that it can inspect its answers.
It can explain itself.
It can, you know, interact with you.
You can have a discussion.
This interaction is really of the essence of intelligence to me.
It certainly is a provocative and unsolved, you know, kind of question of what is intelligence. And perhaps equally mysterious is how we actually measure intelligence, which is a challenge even for humans.
Yes.
Which I'm reminded of with young kids in the school system, as you are or will be soon here as a father. You've had to think differently
as you've tried to measure the intelligence of GPT-4. And you alluded to that. I'd say the
prevailing way that we've gone about measuring the intelligence of AI systems or intelligence systems is through this process of benchmarking. And you and your team
have taken a very different approach. Can you maybe contrast those? Of course. Yeah. So maybe
let me start with an example. So we used GPT-4 to pass mock interviews for software engineer
positions at Amazon, at Google, at Meta, etc.
It passes all of those interviews very easily.
Not only does it pass those interviews,
but it also ranks at the very top of human beings.
In fact, for the Amazon interview, not only did it pass all the questions,
but it scored better than 100% of all the human users on that website.
So this is really incredible.
And the headlines would be,
GPT-4 can be hired as a software engineer at Amazon.
But this is a little bit misleading to view it that way,
because those tests, they were designed for human beings.
They make a lot of hidden assumptions
about what is going to be the person that they are interviewing.
In particular, they will not test whether that person has a memory from one day to the next.
This is baked in. Of course, human beings remember the next day what they did, unless there is some
very terrible problem. So all those benchmarks, all those benchmarks of intelligence
at least, they face this issue that they were designed to test human beings.
So we have to find new ways to test intelligence when we're talking about the intelligence
of AI systems.
So that's point number one.
Point number two is so far in the machine learning tradition, you know, we have developed
lots of benchmarks to test AI systems, narrow AI systems.
This is how the machine learning community has
made progress over the decades, by beating benchmarks, by having systems that keep improving
percentage by percentage over those target benchmarks. Now, all of those become kind of
irrelevant in the era of GPT-4 for two reasons. Number one is GPT-4, we don't know exactly what it has been trained on.
And in particular,
it might have seen all of these data sets.
So really you cannot separate anymore
the training data and the test data.
This is not really a meaningful way
to test something like GPT-4
because it might have seen everything.
For example, Google came out with a suite of benchmarks,
which they call Big Bench.
And in there, they hid a code to make sure that if you don't know the code,
then you haven't seen this data.
And of course, GPT-4 knows this code.
So it has seen all of BigBench, so you just cannot benchmark it against BigBench.
So that's problem number one for the classical ML benchmark.
Problem number two is that all those benchmarks are just too easy.
It's just too easy for GPT-4.
It crushes all of them hands down very, very easily.
In fact, it also does the same thing for the medical license exam,
for the multi-state bar exam, all of those things.
It just passes it very, very easily.
So the reason why we have to go beyond this,
really beyond the classical ML benchmarks,
is we really have to test the generative abilities, the interaction abilities, you know, how is it able to
interact with human beings? How is it able to interact with tools? How creative can it be at
the task? So all those questions, it's very hard to benchmark them, you know, around hard benchmark
where there is one right solution. Now, of course, the ML community has grappled with this problem recently
because generative AI has been in the works for a few years now,
but the answers are still very tentative.
Just to give you an example, imagine that you want to have a benchmark
where you describe a movie and you want to write a movie review.
Let's say, for example, you want to tell the system,
write a positive movie review about this movie.
The problem is in your benchmark, you will have some in the data.
You will have examples of those reviews.
And then you ask your AI system to write its own review,
which might be very different from what you have in your training data.
So the question is, is it better to write something different?
Or is it worse?
Do you have to match what was in the training data?
Maybe, you know, GPT-4 is so good that it's going to write something better than what the humans wrote. And in fact, we have seen that many, many times: the training data was
crafted by humans, and GPT-4 just does a better job at it. So it gives better labels, if you want,
than what the humans did. So it cannot even compare to humans anymore. So this is a problem that we are facing as we are writing our paper, trying to assess
GPT-4's intelligence.
Give me an example where the model is actually better than the human.
Sure.
I mean, let me think of a good one.
I mean, coding.
It is absolutely superhuman at coding.
You know, we already alluded to this,
and this is going to have tremendous implications,
but really, coding is incredible.
So for example, you know, again,
going back to the example of movie reviews,
there is this IMDB dataset,
which is very popular in machine learning,
where, you know, there are many basic questions that you want to ask.
But now in the era of GPT-4,
you can give it the IMDB dataset
and you can just ask GPT-4, hey, can you explore the dataset?
And it's going to come up with suggestions of data analysis ideas.
Maybe it will say, maybe we want to do some clustering.
Maybe you want to, you know, cluster by the movie, you know, directors
and you will see, you know, which movies were the most popular and why, etc.
It can come up creatively with its own analysis.
So that's one aspect: definitely coding, data analysis, it can very easily be superhuman.
I think in terms of writing, I mean, its writing capabilities are just astounding.
For example, in the paper, we asked it many times to rewrite parts of what we wrote.
And it rewrites it in this much more lyrical,
poetic way. You can ask for any kind of style that you want. In my, you know, novice eyes, I would say it's at the level of some of the best authors
out there in its style, and this is really native; you know, you don't have to do anything.
It does remind me a little bit of the AlphaGo moment, more specifically the AlphaZero moment, where all of a sudden you leave the human training data behind and you're entering into a realm where it's its only real competition. You talked about kind of the evolution that we need to have of how we measure
intelligence from ways of measuring narrow or specialized intelligence to measuring more
general kinds of intelligence. You know, we've had these narrow benchmarks. You see a lot of
this pass the bar exam, these kinds of human intelligence measures. But what happens when
all of those are also too easy? Yes. How do we think about measurement and assessment
in that regime? So, of course, you know, I want to say maybe this is a good point to bring up the
limitations of the system also. Right now, a very clear frontier that GPT-4 is not stepping over
is to produce new knowledge, to discover new things. For example, let's say in mathematics,
to prove mathematical theorems that humans do not know how to prove. Right now, the systems cannot do it. And this, I think, would be a very clean
and clear demonstration, where there is just no ambiguity, once it can start to produce this new
knowledge. Now, whether it's going to happen or not is an open question. I personally believe
it's plausible. I am not 100% sure what's going to
happen, but I believe it is plausible that it will happen. But then there might be another question,
which is what happens if the proof that it produces becomes inscrutable to human beings,
which is another option. I mean, you know, mathematics is not only this abstract thing,
but it's also a language between humans. Of course, at the end of the day, you can come back to the axioms, but that's not the way we humans do mathematics. So what happens
if, let's say, GPT-5 proves the Riemann hypothesis, and it is formally proved? So maybe it gives the
proof in the lean language, which is a formalization of mathematics, and you can formally verify that
the proof is correct, but no human being is able to
understand the concepts that were introduced. What does it mean? Is the Riemann hypothesis
really proven? I guess it is proven, but is that really what we human beings wanted? So this kind
of question might be on the horizon. And that, I think, ultimately might be the real test of
intelligence. Let's stick with this category of the limitations of the model.
And you kind of drew a line here
in terms of producing new knowledge.
You offered one example of that
as proving mathematical theorems.
What are some of the other limitations that you've discovered?
So, you know, GPT-4 is a large language model
which was trained on the next word prediction objective function.
So what does it mean?
It just means you give it a partial text
and you're trying to predict what is going to be the next word in that partial text.
And then at test time, once you want to generate content,
you just keep doing that on the text that you're producing.
So you're producing words one by one.
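To make that decoding loop concrete, here is a minimal Python sketch. The `predict_next_word` helper is a hypothetical stand-in for the trained model, not a real API; the loop only illustrates the word-by-word generation described above.

```python
# Minimal sketch of next-word-prediction decoding (illustrative only).
def predict_next_word(context: str) -> str:
    """Hypothetical stand-in for the trained model: return the most likely next word."""
    raise NotImplementedError  # a real system would call the language model here

def generate(prompt: str, max_words: int = 50, stop: str = "<end>") -> str:
    text = prompt
    for _ in range(max_words):
        word = predict_next_word(text)   # training objective: guess this next word
        if word == stop:
            break
        text = text + " " + word         # generation: append the guess and repeat, one word at a time
    return text
```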
Now, of course, it's a question,
and I have been reflecting upon myself,
you know, once I saw GPT-4,
it's a question whether human beings
are machine like this.
I mean, it doesn't feel like it, you know?
It feels like we're thinking
a little bit more deeply.
We're thinking a little bit more in advance
of what we want to say.
But somehow, as I reflect,
I'm not so sure,
at least when I speak verbally, orally,
maybe I am just coming up every time with the next word.
So this is a very interesting aspect.
But the key point is,
suddenly when I'm doing mathematics,
I think I am thinking a little bit more deeply.
I'm not just trying to see what is the next step,
but I'm trying to come up with a whole plan
of what I want to achieve.
And right now, the system is not able to do this
kind of long term planning. And we can give very simple experiments that show this. Maybe my
favorite one is, you know, let's say you have a very simple arithmetic
equality, I don't know, three times seven plus 21 times 27 equals something. So this is part of the prompt that you give to GPT-4.
And now you just ask, OK, you're allowed to modify one digit in this
so that the end result is modified in a certain way.
Which one do you choose?
So, you know, the way to solve this problem is that you have to think,
you have to, you know, try, OK, if I were to modify the first digit,
what would happen?
If I were to modify the second digit, what would happen?
And GPT-4 is not able to do that. GPT-4 is not able to think ahead
in this way. What it will say is just, oh, you know what, I think if you
modify the third digit, just randomly, the third digit is going to work. And it just tries and it
fails. And the really funny aspect is that once it starts failing,
GPT-4, this becomes part of its context,
which in a way becomes part of its truth.
So the failure becomes part of its truth
and then it will do anything to justify it.
It will keep making mistakes to keep justifying it.
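For concreteness, here is a minimal Python sketch of the kind of look-ahead the puzzle calls for: try every single-digit edit and simulate it before committing. The expression is the one from the example; the target condition is a made-up assumption for illustration.

```python
# Illustrative sketch: enumerate every single-digit edit of the expression and
# simulate each one before choosing -- the planning step described as missing.
expression = "3*7 + 21*27"   # the arithmetic from the example; evaluates to 588
target = 1000                # hypothetical goal: push the result above 1000

def single_digit_edits(expr: str):
    """Yield every expression obtained by changing exactly one digit."""
    for i, ch in enumerate(expr):
        if ch.isdigit():
            for d in "0123456789":
                if d != ch:
                    yield expr[:i] + d + expr[i + 1:]

def value(expr: str):
    """Evaluate the arithmetic, or return None for malformed edits (e.g. a leading zero)."""
    try:
        return eval(expr)
    except SyntaxError:
        return None

solutions = [e for e in single_digit_edits(expression)
             if (v := value(e)) is not None and v > target]
print(solutions)   # every edit that satisfies the goal, found by checking each option in advance
```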
So these two aspects, the fact that it cannot really plan ahead
and that once it
makes mistakes, it just becomes part of its truth. These are very, very serious limitations,
in particular for mathematics. I mean, this makes it a very uneven system once you approach
mathematics. You mentioned something that's different about machine learning, the way it's
conceptualized in this kind of generative
AI regime, which is fundamentally different than what we've typically thought about as machine
learning, where you're optimizing an objective function with a fairly narrow objective versus
when you're trying to actually learn something about the structure of the data, albeit through
this next word prediction or some other way. What do you think about that learning mechanism? Are there any limitations of that?
Yeah, so this is a very interesting question. So, you know, maybe I just want to backtrack for a
second and just acknowledge that what happened there is kind of a miracle. Nobody, I think,
nobody in the world, perhaps except OpenAI, expected that intelligence would emerge from this next-word prediction framework just on a lot of data.
I mean, this is really crazy if you think about it.
Now, the way I have justified it to myself recently is like this.
It is, you know, agreed that deep learning, which is what powers, you know, the GPT-4 training, you have a big neural network
that you're training with gradient descent, just trying to fiddle with the parameters.
So it is agreed that deep learning is this hammer that if you give it a data set,
it will be able to extract the latent structure of that data set. So for example, the first
breakthrough that happened in deep learning a little bit more than 10 years ago was the AlexNet moment where they trained a neural network to basically classify, you know, cats,
dogs, you know, cars, et cetera, with images. And when you train this network, what happens is that
you have these edge detectors that emerge on the first few layers of the neural network.
And nothing in the objective function told you that you have to come up with edge detectors.
This was an emergent property.
Why?
Because it makes sense.
That is the structure of an image: combining those edges to create geometrical shapes.
Now, I think what's happening, and we have seen this more and more with the large language models,
is that there are more and more emerging properties that happen as you scale up, you know, the size of the network and the size of the data. Now, what I believe is happening is that in the case of GPT-4, they gave it such a big data set,
so diverse, with so many complex parameters in it, that the only way to make sense of it,
the only latent structure that unifies all of this data is intelligence. The only way to make sense of
the data was for the system to become intelligent. This is kind of a crazy sentence. And, you know,
I expect the next few years, maybe even the next few decades, we'll try to make sense of whether
this sentence is correct or not. And hopefully we can, you know, human beings are intelligent
enough to make sense of that sentence. I don't know. Right now, I just feel like it's a reasonable hypothesis
that this is what happened there.
In a way, you can say maybe there is no limitation
to the next word prediction framework.
So that's one perspective.
The other perspective is, no, no, no.
Actually, the next word prediction token framework
is very limiting, at least at generation time.
At least once you start to generate new sentences, you should go beyond a little bit if you want to have the planning aspect,
if you want to be able to revisit mistakes that you made. So there we believe that at least at
generation time, you need to have a slightly different system. But maybe in terms of training,
in terms of coming up with intelligence in the first place, maybe this is a fine way to do it.
One aspect of our previous notion of intelligence, and maybe still the current notion of intelligence
for some, is this aspect of compression. The ability to take something complex and make it
simple, maybe thinking grounded in Occam's razor, where we want to generate the simplest explanation
of the data.
And some of the things you're saying and some of the things we're seeing in the model
kind of go against that intuition.
So talk to me a little bit about that.
So I think this is really exemplified well
in a project that we did here at Microsoft Research a year ago,
which we called Lego.
So let me tell you about this very briefly
because it will really get to the point of what you're trying to say.
So let's say you want to train an AI system that can solve middle school
systems of linear equations. So maybe it's x plus y equals z, 3x minus 2y equals 1, and
so on and so forth. You have three equations with three variables. And you want to train
a neural network that does that. It, you know, takes in the system of equations
and outputs the answer for it.
The classical perspective, the Occam's razor perspective,
would be collect a data set with lots of equations like this,
train a system to solve those linear equations,
and there you go.
You know, this way you have IID data,
you know, the same kind of distribution
at training time and at test
time.
Now, what this new paradigm of deep learning and in particular of large language models
would say is instead, even though your goal is to solve systems of linear equations for
middle school students, instead of just having training data of middle school systems of linear
equations, we're going to collect a hugely diverse set of data.
Maybe we're going to do next-word prediction,
not only on the systems of linear equations,
but also on all of Wikipedia.
Okay, so this is now, you know, a very concrete experiment.
You have two neural networks.
Neural network A, only trained on equations.
Neural network B, trained on equations, plus Wikipedia.
And any kind of classical thinking
would tell you that Neural Network B is going to do worse
because it has to do more things,
it's going to get more confused,
it's not the simplest way to solve the problem, et cetera.
But lo and behold,
if you actually run the experiment for real,
Network B is much, much, much better than Network A.
Now I need to quantify this a little bit.
Network A, if it was trained with systems of linear equations
with three variables,
is going to be fine on systems of linear equations
with three variables.
But as soon as you ask it four variables or five variables,
it's not able to do it.
It didn't really get the essence
of what it means to solve linear equations.
Whereas network B,
it not only solves systems of equations with three variables,
but it also does four, it also does five,
and so on and so forth.
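To make the setup concrete, here is an illustrative Python sketch of what equations-only training data might look like; the exact data format used in the Lego project may well differ, and the variable names and coefficient ranges here are assumptions.

```python
# Illustrative sketch (the real LEGO data format may differ): render small integer
# linear systems as text -- the kind of examples network A is trained on; network B
# would see these interleaved with generic text such as Wikipedia.
import numpy as np

def random_linear_system(num_vars: int = 3, seed=None) -> str:
    """Return a consistent integer linear system and its solution as plain text."""
    rng = np.random.default_rng(seed)
    solution = rng.integers(-5, 6, size=num_vars)             # hidden integer solution
    coeffs = rng.integers(-5, 6, size=(num_vars, num_vars))   # random integer coefficients
    rhs = coeffs @ solution                                    # right-hand side chosen to be consistent
    names = [f"x{i + 1}" for i in range(num_vars)]
    lines = [" + ".join(f"{a}*{n}" for a, n in zip(row, names)) + f" = {b}"
             for row, b in zip(coeffs, rhs)]
    answer = ", ".join(f"{n} = {v}" for n, v in zip(names, solution))
    return "\n".join(lines) + f"\nSolution: {answer}"

print(random_linear_system(3, seed=0))  # a 3-variable example; raise num_vars to probe generalization
```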
Now, the question is, why?
What's going on?
Why is it that making the thing more complicated,
going against Occam's razor,
why is that a good idea?
And, you know, the extremely naive perspective,
which, in fact, some people said because, you know,
it was so mysterious,
would be,
maybe it read the Wikipedia page
on solving systems
of linear equations, right?
But of course,
that's not what happened.
And this is another, you know,
aspect of this whole story,
which is anthropomorphization
of the system is a big danger,
but let's not get into that right now.
But the point is,
that's not at all the reason
why it became good at solving systems of linear equations.
It's rather that it had this very diverse data,
and it forced it to come up with unifying principles,
like more canonical components of intelligence,
and then it's able to compose these canonical components of intelligence
to solve the task at hand.
I want to go back to something you said much earlier around natural evolution versus this
kind of notion of artificial evolution. And I think that starts to allude to where I think
you want to take this field next, at least in terms of your study and your group. And that is
focusing on the aspect
of emergence and how intelligence emerges. So what do you see as the way forward from this point,
from your work with Lego that you just described for you and for the field?
Yes, absolutely. So I think I would argue that maybe we need a new name for machine learning.
In a way, GPT-4 and GPT-3 and, you know, all those other large language models,
in some ways, it's not machine learning anymore.
And by that, I mean, you know, machine learning,
it's all about how do you teach a machine a very well-defined task, recognize cats and dogs,
you know, something like that.
But here, that's not what we're doing.
We're not trying to teach it a narrow task.
We're trying to teach it everything.
And we're not trying to mimic
how a human would learn. You know, this is another point of confusion. Some people say,
you know, oh, but it's learning language, you know, but using more text than any human
would ever see. But that's kind of missing the point. The point is we're not trying to
mimic human learning. And that's why maybe learning is not the right word anymore. We're
really trying to mimic something which is more akin to evolution.
We're trying to mimic the experience
of millions, billions of entities
that interact with the world.
In this case, the world is, you know,
the data that humans produced.
So it's a very different style.
And I believe the reason why all the tools
that we have introduced in machine learning
are kind of useless and almost irrelevant
in light of GPT-4 is because it's a new field.
It's something that needs new tools to be defined.
So we hope to be at the forefront of that
and we want to introduce those new tools.
And of course, we don't know what it's going to look like,
but the avenue that we're taking to try to study this
is to try to understand emergence.
So emergence, again, is this phenomenon that
as you scale up the network and the data, suddenly there are new properties that emerge
at every scale. And Google had this experiment where they scaled up their large language
models from 8 billion to 60 billion to 500 billion parameters. And at 8 billion, it's able to understand
language and it's able to do a little bit of arithmetic. At 60 billion, suddenly it's able to translate between languages. You know,
before it couldn't translate, at 60 billion parameters, suddenly it translates. At 500
billion, suddenly it can explain jokes. You know, why can it suddenly explain jokes? So we really
would like to understand this. And of course, from our perspective, the way we want to do it is,
let me say it like this. There is another field out there that has been grappling with emergence for a long time, trying to study systems of very complex, you know, particles
interacting with each other and leading to some emergent behaviors. What is this field? It's
physics. So what we would like to propose is let's study the physics of AI or the
physics of AGI, because in a way, you know, we're really seeing this general intelligence now. So
what would it mean to study the physics of AGI? What it would mean is let's try to borrow from
the methodology that physicists, you know, have used for the last few centuries to make sense of
reality. And what were those tools? Well, one of them was to run
very controlled experiment. You know, if you look at a waterfall and you have the water which is
flowing and it's going in all kinds of ways, et cetera, and you go look at it in the winter and
it's frozen, I mean, good luck to try to make sense of the phases of water by just staring at
a waterfall. And GPT-4 or LaMDA or all those large language models, these are our waterfalls.
What we need are much more small-scale,
controlled experiments where we know we have pure water.
It's not being tainted by the stones, by the algae, etc.
We need those controlled experiments to make sense of it.
And Lego is one example.
So that's one direction that we want to take.
But in physics, there is another direction that you can take,
which is to build toy mathematical models
of the real world.
You try to abstract away
lots of things
and you're left with
a very simple mathematical equation
that you can study.
And then you have to go back
to real experiment
and see whether the prediction
from the toy mathematical model
tells you something
about the real experiment.
So that's another avenue
that we want to take.
And there we made some progress
recently also
with interns at MSR.
So we have a paper which is called Learning Threshold Units.
And here really we're able to understand how does the most basic element,
I don't want to say intelligence,
but the most basic element of reasoning emerges in those neural networks.
And what is this most basic element of reasoning?
It's a threshold unit.
It's something
that takes as input, you know, some value. And if the value is too small, then it just turns it to
zero. And this emergence already, it's a very, very complicated phenomenon. And we were able to
understand the non-convex dynamics at play and connect it to something which is called the edge
of stability, which is all very exciting. But the key point is that it's really,
we have a toy mathematical model.
And there, in essence, what we were able to do
is to say that emergence is related
to the instability in training,
which is very surprising
because usually in classical machine learning,
instability is something that you do not want.
You want to erase all the instabilities.
And here, somehow, through this physics of AI approach, where we have a toy mathematical model,
we're able to say, ah, actually the instability in training that you're seeing, that everybody
has seen for decades now, it actually matters for learning and for emergence. So this is the
first step that we took. I want to come back to this aspect of interaction. And I want to ask
you if you see fundamental limitations with this whole methodology around certain kinds of
interactions. So right now we've been talking mostly about these models sort of interacting
with information in information environments, with information that people produce,
and then producing new information.
Behind the source of that information is actual humans. And so I want to know if you see any
limitations or if this is an aspect of your study, how we make these models better at
interacting with humans, understanding the person behind the information produced.
And after you do that, I'm going to come back and we'll ask the same question of the natural world in which we as humans reside. Absolutely. So this is one of the
emergent properties of GPT-4, to put it very simply, that not only can it interact with
information, but it can actually interact with humans too. It can really, you know,
you can communicate with it. You can discuss, and you're going to have very interesting discussions.
In fact, some of my most interesting discussions in the last few months were with GPT-4.
This is surprising.
Not at all something we would have expected, but it's there.
Not only that, but it also has a theory of mind.
GPT-4 is able to reason about what somebody is thinking, what somebody is thinking about what somebody else is thinking, and so on and so forth. So it's really a very sophisticated theory of mind. There was recently
a paper saying that ChatGPT is roughly at the level of, you know, a seven-year-old in terms of
its theory of mind. For GPT-4, I cannot really distinguish it from an adult. Just to give you an
anecdote, I don't know if I should say this, but, you know, one day in the last few months, I had an argument with my wife
and she was telling me something and I just didn't understand what she wanted from me.
And I just talked with GPT-4.
I explained the situation.
I asked what's going on.
What should I be doing?
And the answer was so detailed, so thoughtful.
I mean, I'm really not making this up.
This is absolutely real.
I learned something from GPT-4 about the human interaction with my wife. This is as real as it gets. I can't see
any limitation right now in terms of interaction. And not only can it interact with humans,
but it can also interact with tools. And this is the premise, in a way, of the new Bing that was,
you know, recently introduced,
which is that this new model, you can tell it,
hey, you know what, you have access to a search engine.
You can use Bing.
If there is some information that you're missing and you need to find it out,
please make a Bing search.
And somehow, natively, this is again an emergent property,
it's able to use a search engine and make searches when it needs to,
which is really, really incredible.
And not only can it use those tools which are well-known,
but you can also make up tools.
You can say, hey, I invented some API.
Here is what the API does.
Now, please solve me problem XYZ using that API.
And it's able to do it natively.
It's able to understand your description in natural language
of what the API that you built is doing, and it's able to leverage its power and use it. This is really incredible
and opens, you know, so many directions. Yeah, we certainly see some, I mean, super impressive
capabilities like the new integration with Bing, for example. We also see some of those limitations
come into play. Tell me about your exploration of those in this context.
Right. So one keyword that didn't come up yet
and which is going to, you know, drive probably the conversation,
at least online and on Twitter, is hallucinations.
So those models, you know, they still, GPT-4 still does hallucinate a lot.
And in a way, for good reason, you know.
Hallucination, it's on a spectrum where on the one hand, you have bad hallucination,
completely making up facts which are contrary to the real facts in the real world.
But on the other hand, you have creativity.
I mean, when you create, when you generate new things, you are in a way hallucinating.
It's good hallucinations, but still, these are hallucinations.
So having a system which can both be creative but does not hallucinate at all, it's a very delicate balance. And GPT-4 did not solve
that problem yet. It made a lot of progress, but it didn't solve it yet. That's still a big
limitation, you know, which the world is going to have to grapple with. And, you know, I think in
the new being, it's very clearly explained that it is still making mistakes from time to time and
that you need to double check the result, etc.
I still think the rough contour of what GPT-4 says and the new Bing says is really correct.
It's a very good first draft most of the time and you can get started with that.
But then, yeah, you need to do your research and it cannot be used for critical missions yet.
Now, what's interesting is that GPT-4
is also intelligent enough to look over itself.
So once it produces a transcript,
you can ask another instance of GPT-4
to look over what the first instance did
and to check whether there is any hallucination.
This works particularly well
for what I would call in-context hallucinations.
So what would be an in-context hallucination is, let's say, you have a text that you're asking it to summarize.
And maybe in the summary, it invents something that was not out there.
Then the other instance of GPT-4 will immediately spot it.
So that's, you know, basically in-context hallucinations.
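As an illustration of that "second instance checks the first" idea, here is a minimal Python sketch. The `chat` helper is a hypothetical placeholder for a call to the model, not a real API, and the prompts are assumptions for illustration only.

```python
# Illustrative sketch of checking a summary for in-context hallucinations
# with a second model instance.  `chat` is a hypothetical placeholder.
def chat(prompt: str) -> str:
    """Stand-in for a call to a large language model (assumption, not a real API)."""
    raise NotImplementedError

def summarize_with_check(document: str) -> tuple[str, str]:
    # First instance produces the summary.
    summary = chat(f"Summarize the following text:\n\n{document}")
    # Second instance audits the summary against the source text.
    verdict = chat(
        "Here is a source text and a summary of it.\n\n"
        f"Source:\n{document}\n\nSummary:\n{summary}\n\n"
        "Does the summary state anything not supported by the source? "
        "Answer yes or no, and list any unsupported claims."
    )
    return summary, verdict
```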
We believe they can be maybe fully solved soon. The open world type of hallucination,
when you ask anything, for example, in our paper,
we ask where is the McDonald's at SeaTac,
at the airport in Seattle, and it responds gate C2.
And the answer is not C2, the answer is B3.
So this type of open world hallucination,
it's much more difficult to resolve, and we don't
know yet how to do that exactly.
Do you see a difference between a hallucination and a factual error?
I would say that no, I do not really see a difference between a hallucination and a factual
error.
In fact, I would go as far as saying that when it's making arithmetic mistakes,
which again, it still does, you know, when it adds two numbers,
you can also view it as some kind of hallucination.
And by that, I mean, it's kind of a hallucination by omission.
And let me explain what I mean.
So when it does a calculation, an arithmetic calculation,
you can actually ask it to print all of its steps.
And that improves the accuracy.
It does a little bit better if it has to go through all the steps.
And this makes sense from the next-word prediction framework.
Now, what happens is,
very often, it will skip a step.
It will kind of forget something.
This can be viewed as a kind of hallucination.
It just thought, it hallucinated
that this step is not necessary
and that it can move on
to the next stage immediately.
And so this kind of factual error,
or in this case, reasoning error, if you want,
they are all related to the same concept of hallucination.
There could be many ways to resolve those hallucinations.
Maybe we want to look inside the model a little bit more.
Maybe we want to change the training pipeline a little bit.
You know, maybe the reinforcement learning
with human feedback can help.
All of these are small patches.
And I want to make it clear to the audience that it's still an academic
open problem, whether any of those directions can eventually fix it, or is it a fatal error
for large language models that will never be fixed?
We do not know the answer to that question.
I want to come back to this notion of interaction with
the natural world. Yes. As human beings, we learn about the natural world through interaction with
it. We start to develop intuitions about things like gravity, for example. And there's an argument
or a debate right now in the community as to how much of that knowledge of how to interact with
the natural world is encoded and learnable from language and the kinds of information inputs
that we put into the model versus how much actually needs to be explicitly encoded in an
architecture or just learned through interaction with the world. What do you see here? Do you see
a fundamental limitation with this kind of architecture for that purpose? So I do think
that there is a fundamental limitation in terms of the current structure of the pipeline.
And I do believe it's going to be a big limitation
once you ask the system to discover new facts.
So what I think is the next stage of evolution
for the systems would be to hook it up
with a simulator of sorts.
So that the system at training time,
when it's going through all of the web,
it's going through all of the data produced by humanity,
suddenly it realizes,
oh, maybe I need more data of a certain type.
Then we want to give it access to a simulator
so that it can produce its own data.
It can run experiments,
which is really what babies are doing.
You know, infants, they run experiments when they play with a ball,
you know, when they look at their hand in front of their face.
This is an experiment.
I believe we do need to give the system access to a way to do experiments.
Now, the problem with this is you get into a little bit of a dystopian discussion
of whether we really want to give this to these systems,
which are super intelligent in some way.
Aren't we afraid that they will become superhuman in every way
if some of the experiments that they can run are to run code,
to access the Internet?
I mean, there are lots of questions about what could happen,
and it's not hard to imagine what could go wrong there.
It's a good segue into maybe a last question or topic to explore, which comes back to this phrase, AGI.
Yes.
Artificial General Intelligence.
In some ways, there's kind of a lowercase version of that where we talk about towards more generalizable kinds of intelligence.
That's the regime that we've kind of been exploring. Then there's a kind of a capital letter version of that, which is this almost like a sacred cow
or a kind of dogmatic pursuit within the AI community.
So what does that capital letter phrase AGI mean to you?
And what, you know, maybe the part B of that is,
is our classic notion of AGI the right goal
for us to be aiming for?
Excellent.
So I would say before interacting with GPT-4, to me, AGI was this unachievable dream.
Something that, you know, it's not even clear whether it's doable, you know, what does it even mean, etc.
And really by interacting with GPT-4, I suddenly had the realization that actually general intelligence
is something very concrete.
It's this general intelligence.
It's able to understand any kind
of topic that you bring up. It's going to be
able to reason about any of
the things that you want to discuss.
It can bring up information.
It can use tools. It can interact with
humans. It can interact with an environment,
etc. This is general intelligence.
Now, you're totally right in calling it, you know, lowercase AGI.
Why is it not uppercase AGI?
Because it's still lacking some of the fundamental aspects,
two of them which are really, really important.
One is memory.
Every new session with GPT-4 is a completely fresh tabula rasa session.
It's not, you know, remembering what you did yesterday with it.
And this I want to say that it's something which is emotionally hard to take
because you kind of develop a relationship with the system.
As crazy as it sounds, that's really what happens.
And so you're kind of disappointed that it doesn't remember, you know,
all the good times that you guys had together.
So this is one aspect.
The other one is the learning.
So right now, you cannot teach it new concepts very easily.
You know, you can turn the big, you know, crank of, you know, retraining the model.
Sure, you can do that.
But you cannot explain; you know, I gave you this example of using a new API.
Tomorrow, you have to explain it again.
It's not able to learn.
So, of course, learning and memory, those two things are very, very related, you know, as I just explained.
So this is one huge limitation.
To me, if it had that, I think it would qualify as uppercase AGI.
Now, not everybody would agree even with that because many people will say,
no, it needs to be embodied,
it needs to have real world experience, etc.
This becomes a philosophical question.
Is it possible to have something
that you would call a generally intelligent being
that only lives in digital world?
I don't see any problem with that, honestly.
I cannot see any issue with this.
Now, there is another aspect once you get into this philosophical territory, which is right now the system, they have
no intrinsic motivation. All they want to do is to generate the next token. So is that also an
obstruction to having something which is a general intelligence? Again, to me, this becomes more
philosophical than really technical, but maybe there is some aspect which is technical there.
Again, if you start to hook up the systems to simulators or to run their own experiments, then suddenly maybe they have some intrinsic motivation to just improve themselves.
So maybe, you know, that's one technical way to resolve the question.
I don't know.
There's a word around that phrase in the community, agent, or seeing agentic or goal-oriented behaviors.
And that is really where you start to get into the need for serious sandboxing or alignment or other kinds of guardrails for the system.
You know, that actually starts to exhibit goal-oriented behavior.
Absolutely. Maybe one other point that I want to bring up about AGI, which I
think is confusing a lot of people. And, you know, when you were talking about the sacred cow,
somehow when people hear general intelligence, they want something which is truly general,
that could grapple with any kind of environment. And not only that, but maybe that grapples with
any kind of environment and does so in a sort of optimal way. This universality and optimality,
I think are completely irrelevant to intelligence.
Intelligence has nothing to do
with universality or optimality.
We as human beings are notoriously not universal.
I mean, we change a little bit
the condition of your environment
and you're going to be very confused for a week.
It's going to take you months to adapt, et cetera. So we are very, very far from universal.
And I think I don't need to tell anybody that we're very far from being optimal. I mean,
the number of crazy decisions that we make, you know, every second basically is astounding. So
we're not optimal in any way. It is not realistic to try to have an AGI that would be universal and
optimal. And it's not even, you know even desirable in any way, in my opinion.
So that's maybe the sacred cow version,
which is not achievable and not even realistic, in my opinion.
Is there an aspect of complementarity
that we should be striving for in, say,
a refreshed version of AGI or this kind of long-term goal for AI?
Yeah, absolutely.
But, you know, I don't want to be here
in this podcast today and try to say,
you know, what I view for,
in terms of this question,
because I think it's really the community
that should come together
and discuss this in the coming weeks,
months, years,
and come up together with
where do we want to go?
Where does society want to go?
And so on and so forth. I think it's a terribly important question, and we should not dissociate
our futuristic goals from the technical innovation that we're trying to do day to day. We have to
take both into account. But I imagine that this discussion will happen and, you know,
we will know a lot more a year from now, hopefully.
Thanks, Sébastien. Just a really fun and fascinating discussion. Appreciate your time today.
Yeah, thanks, Ashley. It was super fun.