Lex Fridman Podcast - #94 – Ilya Sutskever: Deep Learning
Episode Date: May 9, 2020
Ilya Sutskever is the co-founder of OpenAI and one of the most cited computer scientists in history, with over 165,000 citations. To me, he is one of the most brilliant and insightful minds ever in the field of deep learning. There are very few people in this world who I would rather talk to and brainstorm with about deep learning, intelligence, and life than Ilya, on and off the mic.
Support this podcast by signing up with these sponsors:
– Cash App – use code "LexPodcast" and download:
– Cash App (App Store): https://apple.co/2sPrUHe
– Cash App (Google Play): https://bit.ly/2MlvP5w
EPISODE LINKS:
Ilya's Twitter: https://twitter.com/ilyasut
Ilya's Website: https://www.cs.toronto.edu/~ilya/
This conversation is part of the Artificial Intelligence podcast. If you would like to get more information about this podcast, go to https://lexfridman.com/ai or connect with @lexfridman on Twitter, LinkedIn, Facebook, Medium, or YouTube, where you can watch the video versions of these conversations. If you enjoy the podcast, please rate it 5 stars on Apple Podcasts, follow on Spotify, or support it on Patreon.
Here's the outline of the episode. On some podcast players you should be able to click the timestamp to jump to that time.
OUTLINE:
00:00 - Introduction
02:23 - AlexNet paper and the ImageNet moment
08:33 - Cost functions
13:39 - Recurrent neural networks
16:19 - Key ideas that led to success of deep learning
19:57 - What's harder to solve: language or vision?
29:35 - We're massively underestimating deep learning
36:04 - Deep double descent
41:20 - Backpropagation
42:42 - Can neural networks be made to reason?
50:35 - Long-term memory
56:37 - Language models
1:00:35 - GPT-2
1:07:14 - Active learning
1:08:52 - Staged release of AI systems
1:13:41 - How to build AGI?
1:25:00 - Question to AGI
1:32:07 - Meaning of life
Transcript
The following is a conversation with Ilya Sutskever.
Co-founder and chief scientist of OpenAI,
one of the most cited computer scientists in history
with over 165,000 citations.
And to me, one of the most brilliant and insightful minds ever
in the field of deep learning.
There are very few people in this world
who I would rather talk to and brainstorm with
about deep learning, intelligence, and life in general than Ilya, on and off the mic.
This was an honor and a pleasure.
This conversation was recorded before the outbreak of the pandemic.
For everyone feeling the medical, psychological, and financial burden of this crisis, I'm sending
love your way.
Stay strong, we're in this together,
we'll beat this thing.
This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at lexfridman, spelled F-R-I-D-M-A-N. As usual, I'll do a few minutes of ads now and
never any ads in the middle that can break the flow of the conversation.
I hope that works for you and doesn't hurt the listening experience.
This show is presented by Cash App, the number one finance app in the App Store.
When you get it, use code LexPodcast.
Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as $1.
Since CashApp allows you to buy Bitcoin, let me mention that cryptocurrency in the context
of the history of money is fascinating.
I recommend The Ascent of Money as a great book on this history. Both the book and audiobook are great.
Debits and credits on ledgers started around 30,000 years ago. The US dollar was created over 200 years
ago. And Bitcoin, the first decentralized cryptocurrency released just over 10 years
ago. So given that history, cryptocurrency is still very much in its early days of development.
But it's aiming to, and just might, redefine the nature of money. So again, if you get Cash App from the App Store or Google Play and use the code LexPodcast, you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping advance robotics and STEM education for young people around the world.
And now, here's my conversation with Ilya Sutskever.
You were one of the three authors, with Alex Krizhevsky and Geoff Hinton, of the famed AlexNet paper that is arguably
the paper that marked the big catalytic moment that launched the deep learning revolution. Take us back to that time. What was your intuition about neural networks, about the representational power of neural networks? And maybe you could mention how that evolved over the next few years, up to today, over the 10 years.
Yeah, I can answer that question.
At some point in about 2010 or 2011,
I connected two facts in my mind.
Basically, the realization was this.
At some point we realized that we can train very large,
I shouldn't say very tiny by today's standards,
but large and deep neural networks
end to end with backward propagation.
At some point, different people obtained this result. I obtained this result. The first moment in which I realized that deep neural networks are powerful was when James Martens invented the Hessian-free optimizer in 2010, and he trained a 10-layer neural network end to end without pre-training, from scratch.
And when that happened, I thought this is it because if you can train a big
neural network, a big neural network can represent very complicated function.
Because if you have a neural network with 10 layers,
it's as though you allow the human brain
to run for some number of milliseconds,
neuron firings are slow.
And so in maybe 100 milliseconds, your neurons
only fire 10 times.
So it's also kind of like 10 layers.
And in 100 milliseconds, you can perfectly recognize
any object.
So I thought, so I already had the idea
then that we need to train a very big neural network
on lots of supervised data.
And then it must succeed because we can find
the best neural network.
And then there's also the theory that if you have more data than parameters, you won't overfit.
Today we know that actually this theory is very incomplete and you won't overfit even if you have less data than parameters. But definitely, if you have more data than parameters, you won't overfit.
So the fact that neural networks were heavily overparameterized wasn't discouraging to you.
So you were thinking about the theory, that the huge number of parameters, the factor of the huge number of parameters, is okay. It's going to be okay.
I mean, there was some evidence before that it was okay. But the theory was, the theory was mostly that if you had a big dataset and a big neural net, it was going to work. The overparametrization just didn't really figure much as a problem.
I thought, well, with images, you're just going to add some data augmentation and it's gonna be okay.
So where was any doubt coming from?
The main doubt was: can we train a bigger, if you really have enough compute to train a big enough neural net?
With backpropagation?
Backpropagation I thought would work. What wasn't clear was whether there would be enough compute to get a very convincing result.
Then at some point, Alex Krizhevsky wrote these insanely fast CUDA kernels for training convolutional neural nets, and that was, bam, let's do this. Let's get ImageNet, and it's going to be the greatest thing.
Was your intuition, most of your intuition from empirical results by you and by others?
So, like, just actually demonstrating that a piece of code can train a 10-layer neural network? Or was there some pen-and-paper, or marker-and-whiteboard, thinking intuition? Because you just connected large 10-layer neural networks to the brain.
So you just mentioned the brain. So in your intuition about neural networks,
does the human brain come into play as an intuition builder?
Definitely.
I mean, you know, you've got to be precise with these analogies
between your artificial neural networks and the brain.
But there is no question that the brain is a huge source
of intuition and inspiration for deep learning researchers
since all the
way from Rosenblatt in the 60s.
Like if you look at the whole idea of a neural network is directly inspired by the brain.
You had people like McCulloch and Pitts who were saying, hey, you got these neurons in the
brain.
And hey, we recently learned about the computer and automata.
Can we use some ideas from the computer and automata to design some kind of computational object that's going to be simple computational and kind of like the brain
and they invented the neuron. So they were inspired by it back then. Then you had the convolutional neural network from Fukushima, and then later you had LeCun, who said, hey, if you limit the receptive fields of a neural network, it's going to be especially suitable for images, as it turned
out to be true. So there was a very small number of examples where analogies to the brain were successful.
And I thought, well, probably an artificial neuron is not that different from the brain if you squint hard enough.
So let's just assume it is and roll with it.
So we're now at a time where deep learning is very successful. So let's squint less, let's open our eyes and ask: what to you is an interesting difference between the human brain and artificial neural networks? Now, I know you're probably not an expert, neither a neuroscientist nor a biologist, but loosely speaking, what's the difference between the human brain and artificial neural networks that's interesting to you for the next decade or two?
That's a good question to ask. What is an interesting difference between the brain
and our artificial neural networks?
So I feel like today, artificial neural networks,
so we all agree that there are certain dimensions
in which the human brain vastly outperforms our models.
But I also think that there are some ways
in which our artificial neural networks
have a number of very important advantages over the brain.
Looking at the advantages versus disadvantages is a good way to figure out what is the important
difference.
So the brain uses spikes which may or may not be important.
Yes, that's a really interesting question.
Do you think it's important or not?
That's one big architectural difference between artificial neural networks and the brain.
It's hard to tell, but my prior is not very high.
I can say why.
There are people who are interested in spiking neural networks.
Basically, what they figured out is that they need to simulate the non-spiking neural networks in spikes, and that's how they're going to make them work.
If you don't simulate the non-spiking neural networks in spikes, it's not going to work, because the question is, why should it work?
And that connects to questions around back propagation and questions around deep learning.
You got this giant neural network.
Why should it work at all?
Why should the learning rule work at all?
It's not a self-evident question, especially if you, let's say, if you were just starting
in the field and you read the very early papers.
You can say, hey, people are saying, let's build neural networks. That's a great idea because the brain is a neural network, so it would be useful to build neural networks.
Now, let's figure out how to train them. It should be possible to train them probably, but how?
And so the big idea is the cost function. That's the big idea.
The cost function is a way of measuring the performance of the system according to some measure.
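To make that concrete, here is a minimal sketch, not from the conversation itself, of the pattern being described: pick a cost function that measures performance, then let an algorithm like gradient descent drive it down. The linear model, the synthetic data, and the step size are all illustrative assumptions.

```python
import numpy as np

def cost(w, X, y):
    # One possible "measure of performance": mean squared error of a linear predictor.
    return np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    # Gradient of that cost with respect to the weights.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
for _ in range(500):
    w -= 0.1 * grad(w, X, y)   # gradient descent: follow the negative gradient

print(cost(w, X, y))  # the cost is driven down close to the noise floor
```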
By the way, that is a big...
Actually, let me think.
Is that one a difficult idea to arrive at and how big of an idea is that?
That there's a single cost function.
Sorry, let me take a pause.
Is supervised learning a difficult concept to come to?
I don't know.
All concepts are very easy in retrospect.
Yeah, that's why it seems trivial now. But the reason I asked that, and we'll talk about it, is: are there other things? Are there things that don't necessarily have a cost function, maybe have many cost functions, or maybe have dynamic cost functions, or maybe a totally different kind of architecture?
Because we have to think like that in order to arrive at something new, right?
So the good example of something which doesn't have a clear cost function is a GAN. In a GAN, you have a game. So instead of thinking of a cost function where you want to optimize, where you know that you have an algorithm, gradient descent, which will optimize the cost function, and then you can reason about the behavior of your system in terms of what it optimizes, with a GAN you say: I have a game, and I'll reason about the behavior of the system in terms of the equilibrium of the game.
But it's all about coming up with these mathematical objects that help us reason about the behavior of the system.
Right, that's interesting. So is the GAN the only one? It's kind of, the cost function is emergent from the competition.
I don't know if it has a cost function. I don't know if it's meaningful to talk about the cost function of a GAN. It's kind of like the cost function of biological evolution or the cost function of the economy. You can talk about regions to which it goes towards, but I don't think the cost function analogy is the most useful.
So evolution doesn't... that's really interesting. So if evolution doesn't really have a cost function, like a cost function based on something akin to our mathematical conception of a cost function...
Then do you think cost functions in deep learning are holding us back?
You just kind of mentioned that cost function is a nice first profound idea.
Do you think that's a good idea?
Do you think it's an idea we'll go past?
So self play starts to touch on that a little bit in reinforcement learning systems. That's right
Self-play, and also ideas around exploration, where you're trying to take actions that surprise a predictor.
I'm a big fan of cost functions. I think cost functions are great and they serve us really well. And I think that whenever we can do things with cost functions, we should. And, you know, maybe there is a chance that we will come up with some yet another profound way of looking at things that will involve cost functions in a less central way. But I don't know, I think cost functions are, I mean, I would not bet against cost functions.
Is there other things about the brain that pop into your mind that might be different and interesting
for us to consider in designing artificial neural networks? So we talk about spiking a little bit.
I mean, one thing which may potentially be useful: I think people, neuroscientists, have figured out something about the learning rule of the brain. I'm talking about spike-timing-dependent plasticity, and it would be nice if some people would just study that in simulation.
Wait, sorry, spike-timing-dependent plasticity?
Yeah, STDP.
It's a particular learning rule that uses spike timing to determine how to update the synapses. It's kind of like: if the synapse fires into the neuron before the neuron fires, then it strengthens the synapse, and if the synapse fires into the neuron shortly after the neuron fired, then it weakens the synapse.
Something along these lines. I'm 90% sure it's right.
So if I said something wrong here, don't get too angry.
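For concreteness, here is a toy sketch of the STDP rule as just described: strengthen the synapse when the pre-synaptic spike precedes the post-synaptic one, weaken it otherwise. The exponential time window and the constants are standard textbook-style assumptions of mine, not anything from the conversation.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Update one synaptic weight from a pair of spike times (in milliseconds)."""
    dt = t_post - t_pre
    if dt > 0:
        # Pre-synaptic spike arrives before the post-synaptic neuron fires: strengthen.
        return w + a_plus * np.exp(-dt / tau)
    # Pre-synaptic spike arrives after the neuron fired: weaken.
    return w - a_minus * np.exp(dt / tau)

w = 0.5
print(stdp_update(w, t_pre=10.0, t_post=15.0))  # potentiation
print(stdp_update(w, t_pre=15.0, t_post=10.0))  # depression
```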
But you sounded really well saying it.
But the timing, that's one thing that's missing.
The temporal dynamics is not captured.
I think that's like a fundamental property of the brain
is the timing of the signals.
Well, we do have recurrent neural networks. But you think of that as, I mean, that's a very crude, simplified, what's that called? There's a clock, I guess, to recurrent neural networks. It seems like the brain is the continuous version of that, the generalization, where all possible timings are possible, and then within those timings is contained some information.
Do you think the recurrence in recurrent neural networks can capture the same kind of phenomena as the timing that seems to be important for the firing of neurons in the brain?
I mean, I think recurrent neural networks are amazing, and they can do anything we'd want a system to do. Right now, recurrent neural networks have been superseded by transformers, but maybe one day they'll make a comeback. Maybe they'll be back. We'll see.
Let me, in a small tangent, ask: do you think they'll be back? So much of the breakthroughs recently that we'll talk about, in natural language processing and language modeling, have been with transformers that don't emphasize recurrence. Do you think recurrence will make a comeback?
Well, some kind of recurrence, I think, very likely. Recurrent neural networks as they're typically thought of, for processing sequences, I think it's also possible.
Generally speaking, I guess, what is a recurrent neural network to you?
You have a neural network which maintains a high dimensional hidden state.
And then when an observation arrives, it updates
its high dimensional hidden state through its connections in some way.
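As a minimal sketch of that definition, here is a vanilla RNN cell in plain numpy, my illustration rather than any particular system's implementation: a high-dimensional hidden state that gets updated through the connections each time an observation arrives.

```python
import numpy as np

class VanillaRNNCell:
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        self.W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.b = np.zeros(hidden_dim)
        self.h = np.zeros(hidden_dim)  # the high-dimensional hidden state

    def step(self, x):
        # A new observation updates the hidden state through the connections.
        self.h = np.tanh(self.W_xh @ x + self.W_hh @ self.h + self.b)
        return self.h

cell = VanillaRNNCell(input_dim=4, hidden_dim=16)
for x in np.random.default_rng(1).normal(size=(10, 4)):  # a sequence of observations
    h = cell.step(x)
print(h.shape)  # (16,)
```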
So do you think, you know, that's what expert systems did, right? Symbolic AI, the knowledge bases. Growing a knowledge base is maintaining a hidden state, which is its knowledge base, and it's growing it by some query processing. Do you think of it more generally in that way, or is it simply, is it the more constrained form of a hidden state, with certain kinds of gating units, that we think of today with LSTMs?
I mean, the hidden state is technically what you described there, the hidden state that goes inside the LSTM or the RNN or something like this. But then what should be contained, you know, if you want to make the expert system analogy, I mean, you could say that the knowledge is stored in the connections, and then the short-term processing is done in the hidden state.
Yes. Could you say that? So sort of, do you think there's a future of building large-scale knowledge bases within the neural networks?
Definitely.
So we're going to pause on that confidence,
because I want to explore that. Well, let me zoom back out and ask back to the history
of ImageNet. Neural networks have been around for many decades as you mentioned. What do
you think were the key ideas that led to their success, that ImageNet moment, and beyond, the success in the past 10 years?
Okay, so the question is to make sure I didn't miss anything.
The key ideas that led to the success of deep learning over the past 10 years.
Exactly.
Even though the fundamental thing behind deep learning has been around for much longer. So the key idea about deep learning,
or rather the key fact about deep learning before
deep learning started to be successful
is that it was underestimated.
People who worked in machine learning
simply didn't think that neural networks could do much. People didn't believe that large neural networks could be trained.
People thought that, well, there was a lot of debate going on
in machine learning about what are the right methods and so on.
And people were arguing because there was no way
to get hard facts.
And by that, I mean, there were no benchmarks
which were truly hard.
That if you do really well on them, then you can say, look, here is my system.
That's when you switch from, that's when this field becomes a little bit more of an engineering field.
So in terms of deep learning to answer the question directly, the ideas were all there.
The thing that was missing was a lot of supervised data and a lot of compute. Once you have a lot of supervised data and a lot of compute, then there is a third thing that is needed as well, and that is conviction. Conviction that if you take the right stuff, which already exists, and apply it and mix it with a lot of data and a lot of compute, that it will in fact work.
And so that was the missing piece.
It was you had the, you needed the data,
you needed the compute, which showed up in terms of GPUs,
and you needed the conviction to realize that you need to mix them together.
So that's really interesting.
So I guess the presence of compute and the presence of supervised data allowed the empirical evidence to do the convincing of the majority of the computer science community. So I guess there's a key moment with Jitendra Malik and Alyosha Efros, who were very skeptical, right? And then there's Geoffrey Hinton, who was the opposite of skeptical. And there was a convincing moment.
And I think ImageNet served as that moment.
That's right.
And they represented this kind of, or the big pillars of computer vision community, kind
of the wizards got together.
And then all of a sudden there was a shift.
And it's not enough for the ideas to all be there and the compute to be there for it to convince the cynicism that existed.
That's interesting, the people just didn't believe for a couple of decades.
Yeah, well, but it was more than that.
It's kind of, when you put it this way, it sounds like, well, you know, those silly people who didn't believe, what were they missing? But in reality, things were confusing, because neural networks really did not work on anything, and they were not the best method on pretty much anything as well.
And it was pretty rational to say, yeah,
this stuff doesn't have any traction.
And that's why you need to have these very hard tasks, which
produce undeniable evidence.
And that's how we make progress.
And that's why the field is making progress today because we have these hard benchmarks
which represent true progress.
And so, and this is why we are able to avoid endless debate.
So incredibly, you've contributed some of the biggest recent ideas in AI in computer
vision, language, natural language processing, reinforcement learning,
sort of everything in between, maybe not GANs.
There may not be a topic you haven't touched, and of course the fundamental science of deep learning.
What is the difference to you between vision, language, and, as in reinforcement learning, action, as learning problems? What are the commonalities? Do you see them as all interconnected, or are they fundamentally different domains that require different approaches?
Okay, that's a good question. Machine learning is a field with a lot of unity, a huge amount of unity.
What do you mean by unity? Like overlap of ideas?
Overlap of ideas, overlap of principles.
In fact, there is only one or two or three principles,
which are very, very simple.
And then they apply in almost the same way,
in almost the same way to the different modalities
to the different problems.
And that's why today, when someone writes a paper
on improving optimization of deep learning and vision,
it improves the different NLP applications,
and it improves the different reinforcement learning
applications.
Reinforcement learning.
So I would say that computer vision and NLP
are very similar to each other.
Today, they differ in that they have slightly different architectures.
We use transformers in NLP and use convolutional neural networks
in vision.
But it's also possible that one day
this will change and everything will be unified with a single architecture. Because if you go back a
few years ago in natural language processing, there were a huge number of architectures for every
different tiny problem had its own architecture. Today, there's just one transformer for all those
different tasks. And if you go back in time even more,
you had even more and more fragmentation and every little problem in AI,
had its own little sub-specialization and sub- you know,
little set of collection of skills,
people who would know how to engineer the features.
Now it's all been subsumed by deep learning.
We have this unification.
So I expect vision to be unified with natural language as well. Or rather, I think it's possible. I don't want to be too sure, because I think the convolutional neural net is very computationally efficient. RL is different. RL does require slightly different techniques, because you really do need to take action, you really need to do something about exploration, and your variance is much higher. But I think there is a lot of unity even there.
And I would expect, for example, that at some point there will be some broader unification between RL and supervised learning, where somehow the RL will be making decisions to make the supervised learning go better. And it will be, I imagine, one big black box, and you just shovel things into it and it just figures out what to do
with whatever you shovel at it. I mean, reinforcement learning has some aspects of language and vision combined almost.
There's elements of a long term memory that you should be utilizing and there's elements of a
really rich sensory space. So it seems like the, it's like the union of the two or something like that.
I'd say something slightly differently.
I'd say that reinforcement learning is neither,
but it naturally interfaces and integrates with the two of them.
You think action is fundamentally different?
So, yeah, what is interesting about, what is unique about, the problem of learning to act?
Well, so one example, for instance, is that when you learn to act, you are fundamentally
in a non-stationary world.
Because as your actions change, the things you see start changing.
You experience the world in a different way.
And this is not the case for the more traditional static problem,
where you have a distribution and you just apply a model to that distribution.
You think it's a fundamentally different problem,
or is it just a more difficult,
it's a generalization of the problem of understanding?
I mean, it's a question of definitions almost.
There is a huge amount of commonality for sure.
We take gradients,
we try to approximate gradients in both cases,
in some case, in the case of reinforcement learning, you have some tools to reduce
the variance of the gradients. You do that. There's lots of commonality, you use the
same neural net in both cases. You compute the gradient, you apply Adam in both
cases. So I mean, there's lots in common for sure, but there are some small
differences which are not completely insignificant.
It's really just a matter of your point of view, what frame of reference,
how much do you want to zoom in or out as you look at these problems?
Which problem do you think is harder?
So people like Noam Chomsky believe that language is fundamental to everything.
So it underlies everything.
Do you think language understanding is harder than
visual scene understanding or vice versa? I think that asking if a problem is hard is likely wrong.
I think the question is a little bit wrong and I want to explain why. So what does it mean for a
problem to be hard? Okay, the non-interesting dumb answer to that is there's a benchmark and there's a human
level performance on that benchmark and how is the effort required to reach the human
level benchmark.
So from the perspective of how much until we get to human level on a very good benchmark,
yeah, like some, I understand what you mean by that.
So what I was going to say, that a lot of it depends on, you know, once you solve a problem
it stops being hard.
And that's always true.
And so whether something is hard or not depends on what our tools can do today.
So you know, you say today, true human level, language understanding and visual perception,
are hard in the sense that there is no way of solving the problem completely in the next three months.
So I agree with that statement.
Beyond that, I'm just, my guess would be as good as yours, I don't know.
Okay, so you don't have a fundamental intuition about how hard language understanding is.
I think, you know, I'll change my mind. I'd say language is probably going to be hard. I mean, it depends on how you define it.
Like, if you mean absolute top notch, 100% language understanding, I'll go with language.
And so, but then if I show you a piece of paper with letters on it, is that, you see what
I mean?
So you have a vision system. You say it's the best human level vision system.
I show you, I open a book and I show you letters.
Will it understand how these letters form into words and sentences and meaning? Is this part of the vision problem? Where does vision end and language begin?
Yeah, so Chomsky would say it starts at language, so vision is just a little example of the kind of
a structure and fundamental hierarchy of ideas that's already represented in our brains somehow, that's
represented through language.
But where does vision stop and language begin?
That's a really interesting question. So one possibility is that it's impossible to achieve really deep understanding in either
images or language without basically using the same kind of system.
So you're going to get the other for free.
I think it's pretty likely that yes, if we can get one, we probably, our machine learning
is probably that good that we can get the other, but it's not 100, I'm not 100% sure.
And also, I think a lot of it really does depend on your definitions.
Definitions of like perfect vision, because reading is vision, but should it count?
Yeah, to me, my definition is: if a system looked at an image, or a system looked at a piece of text, and then told me something about that, and I was really impressed.
That's relative. You will be impressed for half an hour and then you're going to say, well, I mean all the systems do that, but here is the thing they don't do.
Yeah, but I don't have that with humans. Humans continue to impress me.
Is that true? Well, the ones, okay, so I'm a fan of monogamy, so I like the idea of
marrying somebody being with them for several decades. So I believe in the fact that yes,
it's possible to have somebody continuously giving you pleasurable, interesting, witty, new ideas, friends.
Yeah, I think so.
They continue to surprise you.
The surprise, it's that injection of randomness seems to be a nice source of continued inspiration,
like the wit, the humor.
I think, yeah, that would be, it's a very subjective test, but I think if you have enough
humans in the room.
Yeah, I understand what you mean.
Yeah, I feel like I misunderstood what you meant by impressing you.
I thought you meant to impress you with its intelligence, with how well it understands an image. I thought you meant something like, I'm going to show you a really complicated image and it's going to get it right, and you're going to say, wow, that's really cool. Systems of, you know, January 2020 have not been doing that.
The reason people click like on stuff on the internet, which is like it makes them laugh
So it's like humor or wit or insight.
I'm sure we'll get that as well.
So forgive the romanticized question, but looking back to you, what is the most beautiful
or surprising idea in deep learning or AI in general you've come across?
So I think the most beautiful thing about deep learning is that it actually works.
And I mean it, because you got these ideas, you got the neural network, you got the back
propagation algorithm, and then you got some theories as to, you know, this is kind of
like the brain, so maybe if you make it large, if you make the neural network large and you train it on a lot of data, then it will do the same function the brain does. And it turns out to be true. That's crazy.
And now we just train these neural networks and you make them larger and they keep getting better and I find it unbelievable
I find it unbelievable that this whole AI stuff with neural networks works.
Have you built up an intuition of why? Are there little bits and pieces of intuitions, of insights, of why this whole thing works?
I mean, some, definitely. We know that optimization, we now have good, you know, we've had lots of empirical, huge amounts of empirical reasons to believe that optimization should work on most of the problems we care about.
Do you have insights of why? So you just said empirical evidence. Is it mostly empirical evidence that kind of convinces you?
It's like evolution is empirical.
It shows you that look, this evolutionary process seems to be a good way to design organisms that survive in their environment.
But it doesn't really get you to the insights of how the whole thing works.
I think a good analogy is physics. You know how you say, hey, let's do some physics calculation, let's come up with some new physics theory and make some predictions. But then you've got to run the experiment. You know, you've got to run the experiment. It's important.
So it's a bit the same here, except that maybe sometimes
the experiment came before the theory, but it still is the case.
You know, you have some data and you come up with some prediction.
So yeah, let's make a big neural network, let's train it,
and it's going to work much better than anything before it,
and it will, in fact, continue to get better as you make it larger.
And it turns out to be true.
That's amazing, when a theory is validated like this.
You know, it's not a mathematical theory, it's more of a biological theory almost. So I think there are not terrible
analogies between deep learning and biology. I would say it's like the geometric mean of
biology and physics, that's deep learning. The geometric mean of biology and physics.
I think I'm going to need a few hours to wrap my head around that. Just to find the geometric mean, just to find the set that biology represents.
In biology, things are really complicated, and it's really, really hard to have good predictive theories. In physics, the theories are too good. In physics, people make these super precise theories which make these amazing predictions. And in machine learning, we're kind of in between.
Kind of in between, but it'd be nice if machine learning somehow helped us discover the unification of the two, as opposed to serving as the in-between.
But you're right, you're kind of trying to juggle both.
So do you think there are still beautiful and mysterious properties in neural networks that are yet to be discovered?
Definitely. I think that we are still massively
underestimating deep learning.
What do you think it will look like? Like what?
If I knew, I would have done it. But if you look at all the progress from the past 10 years, I would
say most of it, I would say there have been a few cases where some things that felt like really new ideas
showed up.
But by and large, it was every year, it was like, okay, deep learning goes this far.
Nope, it actually goes further.
And then the next year, okay, now this is peak deep learning.
We are really done.
Nope, goes further.
It just keeps going further each year.
So that means that we keep underestimating it.
We keep not understanding it.
It has surprising properties all the time.
Do you think it's getting harder and harder to make progress?
To make progress?
It depends on what we mean.
I think the field will continue to make very robust progress
for quite a while.
I think for individual researchers, especially people
who are doing research, it can be harder
because there is a very large number
of researchers right now.
I think that if you have a lot of compute,
then you can make a lot of very interesting discoveries,
but then you have to deal with the challenge
of managing a huge cluster to run your experiments.
It's a little bit harder.
So I'm asking all these questions
that nobody knows the answer to,
but you're one of the smartest people
I know, so I'm going to keep asking.
So let's imagine all the breakthroughs that happen in the next 30 years in deep learning.
Do you think most of those breakthroughs can be done by one person with one computer?
Sort of, in the space of breakthroughs, do you think big compute and large efforts will be necessary?
I mean, I can't be sure. When you say one computer, you mean how large?
You're clever. I mean, one GPU.
I see. I think it's pretty unlikely.
I think that there are many, the stack of deep learning is starting to be quite deep.
If you look at it, you've got all the way from the ideas, the systems to build the datasets, the distributed programming, building the actual cluster, the GPU programming, putting it all together.
So now the stack is getting really deep and I think it can be quite hard for a single
person to become, to be world class in every single layer of the stack.
What about what like Vladimir Vapnik really insists on is taking MNIST and trying to learn
from very few examples.
So being able to learn more efficiently.
Do you think there'll be breakthroughs in that space that may not need the huge compute?
I think there will be a large number of breakthroughs in general that will not need a huge amount of compute.
So maybe I should clarify that.
I think that some breakthroughs will require a lot of compute.
And I think building systems which actually do things
will require a huge amount of compute.
That one is pretty obvious.
If you want to do X, and X requires a huge neural net,
you've got to get a huge neural net.
But I think there will be lots of,
I think there is lots of room for very important work
being done by small groups and individuals.
Can you maybe, sort of on the topic of the science of deep learning, talk about one of the recent papers that you've released, Deep Double Descent: Where Bigger Models and More Data Hurt?
I think it's a really interesting paper.
Can you describe the main idea?
Yeah, definitely.
So what happened is that, over the years, some small number of researchers noticed that it is kind of weird that when you make the neural network larger, it works better, and it seems to go in contradiction with statistical ideas.
And then some people made an analysis showing that actually you got this double descent bump.
And what we've done was to show that double descent occurs for pretty much all practical deep learning systems, and that it will also...
So, can you step back: what's the x-axis and the y-axis of a double descent plot?
Okay, great.
So you can look, you can do things like,
you can take your neural network
and you can start increasing its size slowly
while keeping your
data set fixed.
So if you increase the size of the neural network slowly and if you don't do early stopping,
that's a pretty important detail.
Then when the neural network is really small, you make it larger.
You get a very rapid increase in performance.
Then you continue to make it larger, and at some point performance will get worse.
And it gets the worst exactly at the point at which it achieves zero training error,
precisely zero training loss.
Then as you make it larger,
it starts to get better again.
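Here is a hedged sketch of the kind of sweep being described, using random-feature linear regression as a stand-in for "making the network larger" (my illustrative setup, not the paper's): the dataset stays fixed, there is no early stopping or regularization, and test error typically dips, spikes near the point where the model can first fit the training set exactly (here around width 40), and then improves again. How pronounced the bump is depends on the noise level and the random draw.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 40, 1000, 0.5

def featurize(x, width):
    # Random ReLU features stand in for a network of a given size.
    frng = np.random.default_rng(123)
    w = frng.normal(size=(width, 1))
    b = frng.normal(size=width)
    return np.maximum(x @ w.T + b, 0.0)

x_tr = rng.uniform(-1, 1, size=(n_train, 1))
y_tr = np.sin(3 * x_tr[:, 0]) + noise * rng.normal(size=n_train)
x_te = rng.uniform(-1, 1, size=(n_test, 1))
y_te = np.sin(3 * x_te[:, 0])

for width in [5, 10, 20, 40, 80, 160, 640, 2560]:
    Phi_tr, Phi_te = featurize(x_tr, width), featurize(x_te, width)
    # Minimum-norm least-squares fit: no early stopping, no explicit regularization.
    coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    print(f"width={width:5d}  test_mse={np.mean((Phi_te @ coef - y_te) ** 2):.3f}")
```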
It's counterintuitive because you'd expect
deep learning phenomena to be monotonic.
It's hard to be sure what it means,
but it also occurs in the case of linear classifiers
and the intuition basically boils down to the following.
When you have a lot,
when you have a large data set and a small model,
then small tiny random, so basically what is overfitting?
Overfitting is when your model is somehow very sensitive to the small random, unimportant
stuff in your dataset, in the training dataset, precisely.
So if you have a small model and you have a big dataset, there may be some training cases that are randomly in the dataset, and others that may not be there. But the small model is kind of insensitive to this randomness, because there is pretty much no uncertainty about the model when the dataset is large.
So, okay, at the very basic level, to me it is the most surprising thing that neural networks don't overfit every time, very quickly, before ever being able to learn anything, given the huge number of parameters.
So here is, there is one way, okay, so maybe, let me try to give the explanation, and maybe that will work. So you've got a huge neural network. Let's suppose you have a huge neural network, you have a huge number of parameters. And now let's pretend everything is linear, which it is not, but let's just pretend. Then there is this big subspace where the neural network achieves zero error, and SGD is going to find approximately the point, that's right, approximately the point with the smallest norm in that subspace. And that can also be proven to be insensitive to the small randomness in the data when the dimensionality is high.
But when the dimensionality of the data is equal to the dimensionality of the model, then
there is a one-to-one correspondence between all the datasets and the models.
So small changes in the dataset actually lead to large changes in the model and that's
why performance gets worse.
So this is the best explanation, more or less.
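A small numerical illustration of that argument, my sketch with arbitrary sizes rather than anything from the paper: perturb one training label slightly and measure how far the minimum-norm interpolating solution moves, once at the interpolation point where the number of parameters equals the number of data points, and once far past it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_data = 50

def solution_shift(n_params):
    X = rng.normal(size=(n_data, n_params))
    y = rng.normal(size=n_data)
    y_perturbed = y.copy()
    y_perturbed[0] += 0.1                          # a tiny change to the dataset
    w1, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimum-norm interpolating solution
    w2, *_ = np.linalg.lstsq(X, y_perturbed, rcond=None)
    return np.linalg.norm(w1 - w2)                 # how far the model moved

print("params == data :", solution_shift(n_params=50))    # typically large: very sensitive
print("params >> data :", solution_shift(n_params=2000))  # typically tiny: insensitive
```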
So then it would be good for the model to have more parameters, so to be bigger than the data.
That's right, but only if you don't early stop. If you introduce early stop and your regularization, you can make the double descent bump almost completely disappear.
What is early stopping? Early stopping is when you train your model and you monitor your validation, your evaluation performance.
And then if at some point,
validation performance starts to get worse,
you say, OK, let's stop training.
You're good.
You're good enough.
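A minimal sketch of the early-stopping loop just described; `train_one_epoch` and `validation_loss` are placeholders for whatever your training setup provides, and the patience value is an arbitrary choice of mine.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5):
    """Train until validation performance stops improving, then return the best model."""
    best_loss, best_model, bad_epochs = float("inf"), copy.deepcopy(model), 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            # Validation improved: remember this model and keep going.
            best_loss, best_model, bad_epochs = loss, copy.deepcopy(model), 0
        else:
            # Validation got worse; stop once it has been worse for `patience` epochs.
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_model
```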
So the magic happens after that moment, so you don't want to do the early stopping?
Well, if you don't do the early stopping, you get the very pronounced double descent.
Do you have any intuition why this happens?
Double descent?
Or are you stopping?
No, the double descent.
So the...
Oh, yeah, so I try, let's see.
The intuition is basically this, that when the data set has as many degrees of freedom as the model,
then there is a one-to-one correspondence
between them, and so small changes to the data set lead to noticeable changes in the
model. So your model is very sensitive to all the randomness. It is unable to discard it,
whereas it turns out that when you have a lot more data than parameters, or a lot more
parameters than data, the result in solution will be insensitive to small changes in the dataset.
So it's able to, nicely put, discard the small changes.
Exactly.
The spurious correlations, if you want.
Geoff Hinton suggested we need to throw away backpropagation.
We already talked about this a little bit, but he suggested we need to throw away back
propagation and start over. I mean, of course, some of that is a little bit
wit and humor, but what do you think? What could be an alternative method of training
neural networks? Well, the thing that he said precisely is that, to the extent that you can't
find back propagation in the brain, it's worth seeing if we can learn something from how the brain learns, but back propagation is very useful and we should keep using it.
Oh, you're saying that once we discover the mechanism of learning in the brain, or any aspects of that mechanism, we should also try to implement that in neural networks?
If it turns out that we can't find backpropagation in the brain.
Well, so I guess your answer to that is back propagation is pretty damn useful. So why are we complaining?
I mean, I personally am a big fan of back propagation.
I think it's a great algorithm because it solves an extremely
fundamental problem, which is finding a neural circuit subject
to some constraints.
And I don't see that problem going away. So that's why I really, I think it's pretty unlikely
that you'll have anything which is going to be
dramatically different.
It could happen, but I wouldn't bet on it right now.
So let me ask a sort of big picture question.
Do you think neural networks can be made to reason?
Why not?
Well, if you look, for example, at AlphaGo or AlphaZero, the neural network of AlphaZero plays Go, which we all agree is a game that requires reasoning.
Better than 99.9% of all humans.
Just the neural network without this search,
just the neural network itself.
Doesn't that give us an existence proof
that neural networks can reason?
To push back and disagree a little bit: do we all agree that Go is reasoning? I don't think that's trivial. So obviously, reasoning, like intelligence, is a loose, gray-area term a little bit. Maybe you disagree with that. But yes, I think it has some of the stepwise consideration of possibilities, and sort of building on top of those possibilities in a sequential manner, until you arrive at some insight.
Sort of, yeah, I guess playing Go is kind of like that.
And when you have a single neural network doing that without search, that's kind of like
that.
So there's an existence proof in a particular constrained environment that a process akin to what many people call reasoning exists, but more general kinds of reasoning, so off the board...
There is one other existence proof.
Which one?
Us humans?
Yes.
Okay.
All right. So, do you think the architecture that will allow neural networks to reason will look similar
to the neural network architectures we have today?
I think it will.
I think, well, I don't want to make two overly definitive statements.
I think it's definitely possible that the neural networks that you'll produce
the reasoning breakthroughs of the future will be very similar to the architectures that
exist today. Maybe a little bit more recurrent, maybe a little bit deeper. But these neural
nets are so insanely powerful. Why wouldn't they be able to learn to reason? Humans can
reason. So why can't neural networks?
Do you think the kind of stuff we've seen neural networks do
is a kind of just weak reasoning?
So it's not a fundamentally different process.
Again, this is stuff nobody knows the answer to.
So when it comes to our neural networks, the thing which I would say is that neural networks are capable of reasoning.
But if you train a neural network on a task which doesn't require reasoning,
it's not going to reason.
This is a well-known effect: the neural network will solve exactly the problem that you pose in front of it, in the easiest way possible.
Right.
That takes us to one of the brilliant ways you've described neural networks, which is: you've referred to neural networks as the search for small circuits, and maybe general intelligence as the search for small programs, which I found to be a very compelling metaphor. Can you elaborate on that difference?
Yeah, so the thing which I said precisely was that
if you can find the shortest program that outputs the data at your disposal, then you will be able to use it
to make the best prediction possible.
And that's a theoretical statement
which can be proven mathematically. Now, you can also prove mathematically that finding the shortest program that generates some data is not a computable operation. No finite amount of compute can do this. So then, with neural networks, neural networks
are the next best thing that actually works
in practice.
We are not able to find the best, the shortest program which generates our data, but we
are able to find a small, but now that statement should be amended, even a large circuit, which
fits our data in some way.
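For reference, the standard formal statement behind that claim, in Kolmogorov-complexity terms. This is my gloss of the idea, not Ilya's exact formulation:

```latex
% For a fixed universal machine U, the length of the shortest program that outputs the data x:
K_U(x) = \min \{\, |p| \;:\; U(p) = x \,\}
% K_U is not computable: no algorithm returns K_U(x) for every x, so no finite amount of
% compute can find the shortest program in general; trainable circuits are the practical stand-in.
```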
Well, I think what you meant by the small circuit is the smallest needed circuit.
Well, the thing which I would change now, back then I really hadn't fully internalized the overparametrized results, the things we know about overparametrized neural nets. Now I would phrase it as a large circuit whose weights contain a small amount of information, which I think is what's going on.
If you imagine the training process of a neural network as you slowly transmit entropy from the
data set to the parameters, then somehow the amount of information in the weights ends up being
not very large, which would explain why they generalize so well. So the large circuit might be one that's helpful
for the generalization.
Yeah, something like this.
But do you see it important to be able to try to learn
something like programs?
I mean, if we can, definitely.
I think it's kind of, the answer is kind of yes, if we can do it, we should do it. It's the reason we are pushing on deep learning, the fundamental
reason, the root cause is that we are able to train them. So in other words, training
comes first. We've got our pillar, which is the training pillar. And now we are trying to contort our neural networks around the training pillar.
We've got to stay trainable.
This is an invariant we cannot violate.
And so being trainable means starting from scratch, knowing nothing, you can actually
pretty quickly converge towards knowing a lot, or even slowly.
But it means that given the resources at your disposal,
you can train the neural net and get it to achieve useful performance.
Yeah, that's a pillar we can't move away from. That's right. Because if you can, and
whereas if you say, hey, let's find the shortest program, well, we can't do that. So it doesn't matter
how useful that would be, we can't do it. So we won't.
So do you think you kind of mentioned that neural networks are good at finding small circuits
or large circuits?
Do you think then the matter of finding small programs is just the data?
No.
Sorry, not the size, the type of data. Sort of, giving it programs.
Well, I think the thing is that right now,
finding there are no good precedents of people
successfully finding programs really well.
And so the way you'd find programs
is you'd train a deep neural network to do it basically.
Right.
Which is the right way to go about it, but there aren't good illustrations of it; it hasn't been done yet. But in principle, it should be
possible.
Can you elaborate a little bit? What's your intuition in principle? Put another way, you don't see why it's not possible?
Well, it's more a statement of: I think that it's unwise to bet against deep learning. And if it's a cognitive function that humans seem to be able to do, then it doesn't take too long for some deep neural net to pop up that
can do it too. Yeah, I'm there with you. I've stopped betting against neural networks at this point,
because they continue to surprise us. What about long-term memory? Can neural networks have long-term memory, or something like knowledge bases? So, being able to aggregate important information over long periods of time that will then serve as useful
representations of state that you can make decisions by.
So you have a long term context based on which you make into the decision.
So in some sense, the parameters already do that.
The parameters are an aggregation of the entirety of the neural net's experience, and so they count as the long-term knowledge. And people have trained various neural nets to act as knowledge bases, and, you know, people have investigated language models as knowledge bases. So there is work, there is work there.
Yeah, but in some sense, do you think it's all just a matter of coming up with a better mechanism of forgetting the useless stuff and remembering the useful stuff? Because right now, I mean, there haven't been mechanisms that remember really long-term information.
What do you mean by that precisely?
Precisely, I like the word precisely.
So I'm thinking of the kind of compression of information
that knowledge bases represent.
Sort of creating a, now I apologize for my sort of human
centric thinking about what knowledge is, because neural
networks aren't interpretable necessarily with the kind of knowledge they have discovered.
But a good example for me of knowledge bases is being able to build up over time something
like the knowledge that Wikipedia represents.
It's a really compressed, structured knowledge base.
Obviously, not the actual Wikipedia or the language, but like a semantic web.
The dream that semantic web represented.
So, it's a really nice compressed knowledge base or something akin to that in a not
interpretable sense as neural networks would have.
Well, the neural networks would be not interpretable if you look at their weights,
but their outputs should be very interpretable.
Okay, so how do you make very smart neural networks like language models interpretable?
Well, you asked them to generate some text and the text would generally be interpretable.
Do you find that the epitome of interpretability? Like, can you do better? Because, you know, I'd like to know what it knows and what it doesn't know.
I would like the neural network to come up with examples
where it's completely dumb and examples
where it's completely brilliant.
And the only way I know how to do that now
is to generate a lot of examples and use my human judgment.
But it would be nice if a neural net had some self-awareness about this.
Yeah, one hundred percent. I'm a big believer in self-awareness, and I think that neural net self-awareness will allow for things like the capabilities you describe, like for them to know what they know and what they don't know, and for them to know where to invest to increase their skills most optimally.
And to your question of interpretability, there are actually two answers to that question.
One answer is, you know, we have the neural net so we can analyze the neurons and we can
try to understand what the different neurons and different layers mean.
And you can actually do that and openly I have done some work on that.
But there is a different answer which is that, I would say that's the human centric answer where you say, you know, you look at
a human being, you can't read... you know, how do you know what a human being is thinking? You ask them. You say, hey, what do you think about this? What do you think about that?
And you get some answers. The answers you get are sticky. In the sense, you already have a mental model. You already have a mental model of that human being.
You already have an understanding of a big conception of what that human being, how they think,
what they know, how they see the world, and then everything you ask, you're adding onto
that.
That stickiness seems to be,
that's one of the really interesting qualities
of the human being is that information is sticky.
You don't, you seem to remember the useful stuff,
aggregate it well and forget most of the information
that's not useful.
That process, but that's also pretty similar
to the process that neural networks do,
is just that neural networks are much crappier at this time.
It doesn't seem to be fundamentally that different. But let's just stick on reasoning for a little longer. You said, why not? Why can't it reason? What's a good, impressive feat, a benchmark to you of reasoning, that you'd be impressed by if a neural network was able to do it? Is that something you already have in mind?
Well, I think writing, writing really good code,
I think proving really hard theorems, solving open-ended problems with out-of-the-box solutions.
And sort of theorem type mathematical problems.
Yeah, I think those ones are a very natural example as well.
You know, if you can prove an unproven theorem then it's hard to argue, you don't reason.
And so by the way, and this comes back to the point about the hard results, you know.
If you've got a hard... if you have machine learning, deep learning as a field is very fortunate, because we have the ability to sometimes produce these unambiguous results, and when they happen, the debate changes, the conversation changes. We have the ability to produce conversation-changing results.
And then of course, just like you said, people kind of take that for granted,
that wasn't actually a hard problem. Well, I mean, at some point, you've probably run out of hard problems. Yeah, that whole mortality thing is kind of a sticky problem
that we haven't quite figured out yet. Maybe we'll solve that one. I think one of the fascinating
things in your entire body of work, but also the work at OpenAI recently, one of
the conversation changes has been in the world of language models. Can you briefly try to describe the recent history of using neural networks in the domain of language and text?
Well, there's been lots of history. I think the Elman network was a small tiny recurrent neural network applied to language back in the 80s.
So the history is really fairly long, at least.
And the thing that changed the trajectory of neural networks and language is the thing that changed the trajectory of all deep learning, and that's data and compute.
So suddenly you move from small language models, which learn a little bit.
And with language models in particular, there's a very clear explanation for why they need to be large to be good: because they're trying to predict the next word.
When you don't know anything, you'll notice very broad-strokes, surface-level patterns,
like sometimes there are characters and there is a space between those characters. You'll notice
that pattern. And you'll notice that sometimes there is a comma and then the next character is a capital letter.
You'll notice that pattern. Eventually you may start to notice that there are certain words
that occur often. You may notice that spellings are a thing. You may notice syntax.
And when you get really good at all of these, you start to notice the semantics.
You start to notice the facts. But for that to
happen, the language model needs to be larger.
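To make that next-word, next-character objective concrete, here is a minimal sketch of a character-level language model trained with cross-entropy next-token prediction; the tiny PyTorch model, data, and sizes are illustrative assumptions, not anything from OpenAI's actual systems.

```python
# Minimal sketch of the next-character prediction objective described above.
# The model, sizes, and data are illustrative assumptions.
import torch
import torch.nn as nn

text = "hello world. hello again."
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

class TinyCharLM(nn.Module):
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)
    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)          # logits over the next character

model = TinyCharLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)   # predict character t+1 from t
for step in range(200):
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, len(vocab)), y.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```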
So let's linger on that. That's where you and Noam Chomsky disagree. So you think we're actually taking
incremental steps, sort of larger networks, larger compute, will be able to get to the semantics, to be able to
understand language, without what Noam likes to think of as a fundamental
understanding of the structure of language, like imposing your theory of language onto the
learning mechanism. So you're saying you can learn, from raw data,
the mechanism that underlies language.
Well, I think it's pretty likely, but I also want to say
that I don't really know precisely what Chomsky means when he talks about it. You said
something about imposing your structure on language.
I'm not 100% sure what he means, but empirically it seems that when you inspect those larger
language models, they exhibit signs of understanding the semantics, whereas the smaller language
models do not. We saw that a few years ago when we did the work on the sentiment neuron. We trained
a smallish LSTM to predict the next character in Amazon reviews.
And we noticed that when you increase the size of the LSTM from 500 LSTM cells to 4000 LSTM cells,
then one of the neurons starts to represent the sentiment of the review.
Now why is that? Sentiment is a pretty semantic attribute, it's not a syntactic attribute.
And for people who might not know, I don't know if that's a standard term, but sentiment
is whether it's a positive or negative review.
That's right.
Like, is the person happy with something or is the person unhappy with something?
And so, here we had very clear evidence that a small neural net does not capture sentiment
while a large neural net does.
And why is that?
Well, our theory is that at some point you run out of syntax
to model, and then you start to focus on something else.
So basically, you quickly run out of
syntax to model, and then you really start to focus on the semantics; that would be the
idea. That's right. And so I don't want to imply that our models have complete semantic
understanding, because that's not true, but they definitely are showing signs of semantic understanding, partial semantic understanding; the smaller models just do not show those signs.
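For readers who want to see the sentiment-neuron idea concretely, here is a small sketch of the probing step: after training a character-level LSTM on reviews, check whether any single hidden unit separates positive from negative examples. The toy model, data, and correlation probe are illustrative assumptions, not the original Amazon-reviews experiment.

```python
# Illustrative sketch of probing for a "sentiment neuron" in a char-level LSTM.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 64
embed = nn.Embedding(256, 32)          # byte-level inputs
lstm = nn.LSTM(32, HIDDEN, batch_first=True)

def final_hidden(text: str) -> np.ndarray:
    x = torch.tensor([[b for b in text.encode()]])
    _, (h, _) = lstm(embed(x))
    return h[0, 0].detach().numpy()     # shape (HIDDEN,)

reviews = [("great product, loved it", 1), ("terrible, broke after a day", 0),
           ("works perfectly, very happy", 1), ("waste of money", 0)]
H = np.stack([final_hidden(t) for t, _ in reviews])
y = np.array([label for _, label in reviews], dtype=float)

# Correlate each unit with the label; a real sentiment neuron would show up as
# a single unit with very high correlation once the model is large and trained.
corr = np.array([abs(np.corrcoef(H[:, j], y)[0, 1]) for j in range(HIDDEN)])
print("most sentiment-correlated unit:", int(corr.argmax()), float(corr.max()))
```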
Can you take a step back and say, what is GPT-2?
GPT-2 is a transformer with one and a half billion parameters that
was trained on about 40 billion tokens of text, which were obtained from webpages that
were linked to from Reddit articles with more than three upvotes.
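As a side note for readers, here is a small sketch of sampling from a publicly released GPT-2 checkpoint via the Hugging Face transformers library; the library and model name are assumptions about the reader's setup, not OpenAI's original training code.

```python
# Sampling from a released GPT-2 model (sketch; assumes `transformers` is installed).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Deep learning is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=40, do_sample=True, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```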
And what's the transformer?
The transformer is the most important advance in neural network architectures
in recent history.
What is attention, maybe, too? Because I think that's the interesting idea,
not necessarily technically speaking, but the idea of attention versus what
recurrent neural networks represent.
Yeah. So the thing is, the transformer is a combination of multiple ideas simultaneously,
of which attention is one.
Do you think attention is the key?
No, it's a key, but it's not the key. The
transformer is successful because it is the simultaneous combination of
multiple ideas, and if you were to remove either idea, it would be much less
successful. So the transformer uses a lot of attention, but attention
has existed for a few years, so that can't be the main innovation.
The transformer is designed in such a way that it runs really fast on the GPU, and that
makes a huge amount of difference.
This is one thing.
The second thing is that the transformer is not recurrent, and that is really important
too, because it is more shallow and therefore much easier to optimize.
So in other words, it uses attention.
It is a really great fit to the GPU, and it is not recurrent, so therefore, less deep
and easier to optimize.
And the combination of those factors makes it successful.
So now it makes great use of your GPU.
It allows you to achieve better results for the same amount of compute.
And that's why it's successful.
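To make the attention ingredient concrete, here is a minimal sketch of single-head scaled dot-product self-attention, the core operation inside a transformer; the shapes and sizes are illustrative, and masking, multiple heads, and the rest of the block are omitted.

```python
# Minimal single-head self-attention (illustrative sketch).
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
    def forward(self, x):                  # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = scores.softmax(dim=-1)   # every position attends to every other
        return weights @ v                 # no recurrence, so it parallelizes well on a GPU

x = torch.randn(2, 10, 64)
print(SelfAttention(64)(x).shape)          # torch.Size([2, 10, 64])
```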
Were you surprised how well Transformers worked and GPT-2 worked?
So you worked on language, you've had a lot of great ideas before Transformers came about
in language.
So you got to see the whole set of revolutions before and after.
Were you surprised?
Yeah, a little.
A little?
Yeah.
I mean, it's hard to remember because you adapt really quickly.
But it definitely was surprising.
It definitely was, in fact, I'll retract my statement.
It was pretty amazing.
It was just amazing to see it generate this text.
And you know, you've got to keep in mind that at that time we had
seen all this progress in GANs; the samples produced by GANs
were just amazing. You have these realistic faces. But text hadn't really moved that much,
and suddenly we moved from, you know, whatever GANs were in 2015 to the best, most amazing GANs,
in one step. And that was really stunning. Even though theory predicted that if
you train a big language model, of course, you should get this.
But then to see it with your own eyes, it's something else.
And yet, we adapt really quickly,
and now there are some cognitive scientists
writing articles saying that GPT-2 models
don't truly understand language. So we adapt quickly
to how amazing it is that they're able to model language so well. So what do
you think is the bar? For impressing us, that is. I don't know. Do you think that
bar will continuously be moved?
Definitely. I think when you start to see really dramatic economic impact, that's, I think, in some sense the next barrier.
Because right now, if you think about the work in AI, it's really confusing.
It's really hard to know what to make of all these advances.
It's kind of like, okay, you got an advance, and now you can do more things, and you got another improvement,
and you got another cool demo. At some point, I think people who are outside of AI
they can no longer distinguish this progress anymore. So we were talking offline about translating
Russian to English and how there's a lot of brilliant work in Russian that the rest of the world
doesn't know about. That's true for Chinese, that's true for a lot of scientific and artistic work in general. Do you think translation
is the place where we're going to see sort of economic big impact? I don't know. I think there
is a huge number of... I mean, first of all, I want to point out that translation already
today is huge. I think billions of people interact with big chunks of the internet
primarily through translation. So translation is already huge and it's hugely, hugely positive too.
I think self-driving is going to be hugely impactful and that's, you know, it's unknown exactly
when it happens, but again I would not bet against deep learning. So I so this deep learning in general, but you think deep learning for self driving.
Yes, deep learning for self driving, but I was talking about sort of language models.
I see. Just to just to check your dear, dear, Duff a little bit, just to check,
you're not seeing a connection with your driving and language.
No, no. Okay. All right. Both using neural nets.
That would be a poetic connection. I think there might be some, like you said, there might be some kind of unification towards a kind of multitask transformer that can take on
both language and vision tasks. That would be an interesting unification. Now let's see, what more can I ask
about GPT-2? It's simple, so there's not much to ask. You take a transformer, make it bigger, give
it more data, and suddenly it does all those amazing things.
Yeah, one of the beautiful things is that GPT, the transformers, are fundamentally simple
to explain and to train. Do you think bigger will continue to show better results in language?
Probably.
Sort of like, what are the next steps
for GPT-2, do you think?
I mean, I think for sure seeing what a large version
can do is one direction.
Also, I mean, there are many questions.
There's one question which I'm curious about,
and that's the following.
So right now, with GPT-2, we feed it all this data
from the internet, which means that it needs to memorize all those random facts about everything on the internet. And it would be nice if
the model could somehow use its own intelligence to decide what data it wants to
accept and what data it wants to reject. Just like people: people don't learn all data
indiscriminately, we are super selective about what we learn.
And I think this kind of active learning
I think would be very nice to have.
Yeah, listen, I love active learning.
So let me ask, the selection of data, can you just elaborate
on that a little bit more?
I have this kind of sense that the optimization
of how you select data,
so the active learning process,
is going to be a place for a lot of breakthroughs,
even in the near future,
because there haven't been many breakthroughs there
that are public.
I feel like there might be private breakthroughs
that companies keep to themselves because the fundamental problem has to be solved if you
want to solve self-driving, if you want to solve a particular task. What do you think about
the space in general? Yeah, so I think that for something like active learning or in fact,
for any kind of capability, like active learning, the thing that it really needs is a problem.
It needs a problem that requires it.
It's very hard to do research about a capability if you don't have a task.
Because then what's going to happen is you will come up with an artificial task,
get good results, but not really convince anyone.
Right. We're now past the stage where getting a result on MNIST,
some clever formulation of MNIST, will convince people.
That's right.
In fact, you could quite easily come up with a simple active learning scheme on MNIST
and get a 10x speedup, but then, so what?
And I think that with active learning, active learning will naturally arise as problems
that require it pop up.
That's my take on it.
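For concreteness, here is a sketch of the kind of simple pool-based, uncertainty-driven active learning scheme alluded to above: repeatedly train, then query labels for the samples the model is least confident about. The dataset, model, and budget are illustrative assumptions.

```python
# Uncertainty-based active learning on a small digits dataset (illustrative sketch).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))   # tiny seed set
pool = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)          # least-confident sampling
    picks = np.argsort(uncertainty)[-20:]          # query the 20 most uncertain
    labeled += [pool[i] for i in picks]
    pool = [i for j, i in enumerate(pool) if j not in set(picks)]
    print(f"round {round_}: accuracy {clf.score(X, y):.3f}")
```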
There's another interesting thing that OpenAI has brought up
with GPT-2, which is, when you create a powerful artificial
intelligence system, it wasn't clear what kind of detrimental effect,
once you release GPT-2, it would have. Because if you have a
model that can generate pretty realistic text, you can start to imagine that, you know, it would be used by bots in
some way that we can't even imagine. So there's this nervousness about what it's possible to do.
So you did a really kind of brave and I think profound thing,
which just started a conversation about this. Like, how do we release powerful artificial
intelligence models to the public, if we do it at all? How do we privately discuss with others,
even competitors, about how we manage the use of these systems, and so on? So from this whole
experience, you released a report on it,
but in general, are there any insights
that you've gathered from just thinking about this,
about how you release models like this?
I mean, I think that my take on this
is that the field of AI has been in a state of childhood.
And now it's exiting that state
and it's entering a state of maturity.
What that means is that AI is very successful
and also very impactful.
And its impact is not only large, but it's also growing.
And so for that reason, it seems wise to start thinking
about the impact of our systems before releasing them,
maybe a little bit too soon, rather than a little bit too late.
And with the case of GPT-2, like I mentioned earlier,
the results really were stunning.
And it seemed plausible.
It didn't seem certain.
It seemed plausible that something like GPT-2
could easily be used to reduce the cost of disinformation.
And so there was a question of what's the best way to release it, and a staged release
seemed logical.
A small model was released, and there was time to see the...
Many people used these models in lots of cool ways; there have been lots of really cool
applications.
There haven't been any negative applications we know of, and so eventually it was released.
But also other people replicated similar models.
That's an interesting question, though: that we know of.
So in your view, staged release
is at least part of the answer to the question of
what do we do once we create a system like this?
It's part of the answer, yes. Is there any
other insights? Like, say you don't want to release the model at all because it's useful to you
for whatever the business is.
Well, there are plenty of people who don't release models already.
Right, of course. But is there some moral, ethical responsibility, when you have a very powerful
model, to sort of communicate? Like, just as you said, when you had GPT-2, it was unclear how much it could be used for
misinformation.
It's an open question.
And getting an answer to that might require that you talk to other really smart people
that are outside of your particular group.
Have you... please tell me there's some optimistic pathway for people across
the world to collaborate on these kinds of cases? Or is it still really difficult for one
company to talk to another company?
So it's definitely possible. It's definitely possible
to discuss these kinds of models with colleagues elsewhere and to get their take on what to do.
How hard is it, though? I mean, do you see that happening?
I think that's a place where it's important to gradually build trust between companies.
Because ultimately, all the AI developers are building technology, which is becoming increasingly
more powerful.
And so, the way to think about it is that ultimately we're all in it together.
Yeah. I tend to believe in the better angels of our nature, but I do hope that when you build a really powerful
AI system in a particular domain, you also think about the potential negative consequences
of it. It's an interesting and scary possibility that there would be a race for AI development that would push people to close that development and not share ideas with others.
I don't love this.
I've been a pure academic for 10 years.
I really like sharing ideas, and it's fun and it's exciting.
What do you think it takes to, let's talk about AGI a little bit.
What do you think it takes to build a system of human level intelligence?
We talked about reasoning.
We talked about long-term memory, but in general, what does it take, do you think?
Well, I can't be sure, but I think the deep learning, plus maybe another small idea.
Do you think self-play will be involved?
So, you've spoken about the powerful mechanism of self-play,
where systems learn by exploring the world in a competitive setting
against other entities that are similarly skilled to them,
and so incrementally improve in this way.
Do you think self-play will be a component of building an AGI system?
Yeah. So what I would say to build AGI, I think it's going to be
deep learning plus some ideas and I think self-play will be one of those ideas.
I think that that is a very...
Self-play has this amazing property that it can surprise us in truly novel ways. For example, pretty much every self-play system, both our Dota bot, and, I don't know if
you've seen it, OpenAI had a release about multi-agent where you had two little agents who were playing
hide and seek, and of course also AlphaZero,
they all produced surprising behaviors.
They all produce behaviors that we didn't expect.
They are creative solutions to problems.
And that seems like an important part of AGI
that our systems don't exhibit routinely right now.
And so that's why I like this area,
I like this direction, because of its ability to surprise us.
To surprise us.
And an AGI system would surprise us fundamentally.
Yes.
But to be precise, not just a random surprise, but to find the surprising solution to a problem
that's also useful.
Right.
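For concreteness, here is a minimal sketch of the self-play training loop: an agent improves by playing against frozen copies of its past selves. The game, policy, and update rule are hypothetical placeholders, not any of the systems mentioned in the conversation.

```python
# Self-play training loop (illustrative sketch with placeholder game and update).
import copy
import random

class Policy:
    def __init__(self):
        self.skill = 0.0
    def act(self, state):
        # Placeholder: a real policy would map the game state to an action.
        return random.random() + self.skill

def play_game(p1, p2):
    """Return 1 if p1 wins, 0 otherwise (toy stand-in for a real game)."""
    return 1 if p1.act(None) > p2.act(None) else 0

learner = Policy()
opponents = [copy.deepcopy(learner)]          # pool of past selves
for step in range(1000):
    opponent = random.choice(opponents)
    result = play_game(learner, opponent)
    learner.skill += 0.01 * (result - 0.5)    # stand-in for an RL policy update
    if step % 100 == 0:
        opponents.append(copy.deepcopy(learner))   # freeze a new opponent snapshot
```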
Now, a lot of the self-play mechanisms have been used in the game context, or at least in the
simulation context. How far along the path to AGI
do you think this will be done in simulation? How much faith or promise do you have in simulation versus having to have a system
that operates in the real world, whether it's the real world of digital data or
the real world, like the actual physical world of robotics?
I don't think it's an either/or.
I think simulation is a tool and it helps.
It has certain strengths and certain weaknesses and we should use it.
Yeah, but okay.
I understand that.
That's true, but one of the criticisms of self-play, one of the criticisms
of reinforcement learning, is that its current results,
while amazing, have been demonstrated in simulated environments, or very constrained physical
environments.
Do you think it's possible to escape them,
escape the simulated environments and be able to learn in non-simulated environments?
Or do you think it's possible to also just simulate in a photo realistic
and physics realistic way, the real world in a way that we can solve real problems
with self-play in simulation?
So I think that transfer from simulation
to the real world is definitely possible
and has been exhibited many times
in many different groups.
It's been especially successful in vision.
Also, OpenAI in the summer demonstrated a robot hand
which was trained entirely in simulation
in a certain way that allowed for sim-to-real transfer
to occur.
This is for the Rubik's Cube?
Yes, that's right.
I wasn't aware that was trained in simulation. It was trained in simulation entirely?
Really? So the hand wasn't trained in the physical world?
No, 100% of the training was done in simulation.
And the policy that was learned in simulation was trained to be very adaptive, so adaptive that when you transfer it, it could very quickly adapt to the physical world.
So the kind of perturbations, with the giraffe or whatever the heck it was,
were those part of the simulation?
Well, the simulation was trained to be robust to many different things, but not to the kind of perturbations we had in the video.
So it was never trained with a glove; it was never trained with a stuffed giraffe.
So in theory, these are novel perturbations.
Correct. It's not in theory; in practice, those are novel perturbations.
Well, that's okay.
That's a clean, small scale, but clean example of a transfer from the simulated world to the
physical world.
And I will also say that I expect the transfer capabilities of deep learning to increase
in general, and the better the transfer capabilities are, the more useful simulation will become.
Because then you could experience something in simulation and learn a moral of the story,
which you could then carry with you to the real world,
as humans do all the time when they play computer games.
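For readers who want to see the general idea behind training entirely in simulation and still transferring to the real world, here is a sketch of domain randomization: vary the simulated physics so aggressively that a policy robust to all of it is also robust to reality. The simulator interface and parameters here are hypothetical placeholders, not the actual robot-hand setup.

```python
# Domain randomization training loop (illustrative sketch with placeholder simulator).
import random

def randomized_sim():
    """One simulated episode with physics parameters drawn at random, so the
    policy never sees the same 'world' twice and must become robust."""
    return {
        "friction": random.uniform(0.5, 1.5),
        "object_mass": random.uniform(0.8, 1.2),
        "motor_delay_ms": random.uniform(0.0, 40.0),
        "camera_jitter_px": random.uniform(0.0, 5.0),
    }

def train(policy_update, episodes=10_000):
    for _ in range(episodes):
        params = randomized_sim()
        # Roll out the current policy in a simulator configured with `params`
        # and apply a reinforcement learning update (details omitted).
        policy_update(params)

train(lambda params: None)   # placeholder update, just to show the loop shape
```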
So let me ask sort of an embodied question, staying on AGI for a moment.
Do you think an AGI system needs to have a body? Do we need to have some of those human elements of self-awareness, consciousness, sort of fear of mortality, sort of self-preservation in the physical space, which comes with having a body?
I think having a body will be useful. I don't think it's necessary.
But I think it's very useful to have a body for sure, because you can learn things which cannot be learned without a body. But at the same time,
I think that if you don't have a body, you could compensate for it and still succeed.
You think so? Yes. Well, there is evidence for this. For example, there are many people who were born
deaf and blind and they were able to compensate for the lack of modalities.
I'm thinking about Helen Keller specifically.
So even if you're not able to physically interact
with the world, and if you're not able to...
I mean, I actually was getting at that.
Maybe let me ask a more particular question.
I'm not sure if it's connected to having a body or not,
but the idea of consciousness,
and a more
constrained version of that, self-awareness.
Do you think an AGI system should have consciousness?
We can't really define it,
whatever the heck you think consciousness is.
Yeah.
A hard question to answer, given how hard it is to define it.
Do you think it's useful to think about?
I mean, it's definitely interesting.
It's fascinating.
I think it's definitely possible that our systems will be conscious.
Do you think that's an emergent thing that just comes from...
do you think consciousness could emerge from the representations that are stored within
neural networks? So that it naturally just emerges when you become more and more...
when you're able to represent more and more of the world?
Well, I'd make the following argument, which is: humans are conscious, and if you
believe that artificial neural nets are sufficiently similar to the brain, then there should at
least exist artificial neural nets that are conscious too.
You're leaning on that existence proof pretty heavily. Okay.
But that's the best answer I can give.
No, I know, I know, I know. There's still an open question whether
there's not some magic in the brain that we're not seeing. I mean, I don't mean non-materialistic
magic, but the brain might be a lot more complicated and interesting than we give
it credit for.
If that's the case, then it should show up.
At some point, we'll find out that we can't continue to make progress. But I think it's
unlikely.
So, we talk about consciousness, but let me talk about another poorly defined concept
of intelligence.
Again, we've talked about reasoning, we've talked about memory.
What do you think is a good test of intelligence for you?
Are you impressed by the test that Alan Turing formulated with the imitation game of natural
language?
Is there something in your mind that you would be deeply impressed by if a system were
able to do it?
I mean, lots of things.
There is a certain frontier of capabilities today.
And there exist things outside of that frontier,
and I would be impressed by any such thing.
For example, I would be impressed by a deep learning system
which solves a very pedestrian task,
like machine translation or a computer vision task,
and which never makes mistakes a human wouldn't make under
any circumstances. I think that is something which has not yet been demonstrated, and I would
find it very impressive.
Yeah, so right now they make mistakes differently; they might be more accurate
than a human being, but they still make a different set of mistakes.
So I would guess a lot of the skepticism that some people have about deep learning is
when they look at the mistakes and they say, well, those mistakes make no sense.
Like, if you understood the concept, you wouldn't make that mistake.
And I think that changing that would inspire me; that would be, yes, this
is progress.
Yeah, that's a really nice way to put it. But I also just don't like that human instinct
to criticize the model as not intelligent.
That's the same instinct as when we criticize any group of creatures as the other. Because it's very possible that
GPT-2 is much smarter than human beings at many things. That's definitely true. It has
a lot more breadth of knowledge. Yes.
breadth of knowledge and even perhaps depth on certain topics. It's kind of hard to judge what depth means, but there's
definitely a sense in which humans don't make mistakes that these models do.
The same is applied to autonomous vehicles. The same is probably going to continue
being applied to a lot of artificial intelligence systems. We find this annoying.
This is the process of, in the 21st century, analyzing
the progress of AI: the search for one case where the system fails in a big
way where humans would not,
and then many people writing articles about it, and then broadly the public generally
gets convinced that the system is not intelligent.
And we pacify ourselves by thinking it's not intelligent because of this one anecdotal case.
This seems to continue happening.
Yeah, I mean, there is truth to that.
Although I'm sure that plenty of people are also extremely impressed by the system that
exists today.
But I think this connects to the earlier point we discussed:
it's just confusing to judge progress in AI. Yeah. And you know, you have a new robot demonstrating something.
How impressed should you be?
And I think that people will start to be impressed once AI
starts to really move the needle on the GDP.
So you're one of the people that might be able to create
an AGI system here. Not you alone, but you and OpenAI.
If you do create an AGI system and you get to
spend sort of the evening with it, him, her, what would you talk about, do you think?
The very first time?
The first time.
Well, the first time I would just ask all kinds of questions and try to get it
to make a mistake, and I would be amazed that it doesn't make mistakes, and just keep asking broad questions.
What kind of questions do you think, would they be factual or would they be personal, emotional,
psychological?
What do you think?
All of the above.
Would you ask for advice?
Definitely. I mean, why would I limit myself? Yeah,
talking to a system like this. Now, again, let me emphasize the fact that you truly are
one of the people that might be in the room where this happens. So let me ask a sort
of a profound question. I've just been talking about Stalin's story;
I've been talking to a lot of people who are studying power.
Abraham Lincoln said, nearly all men can stand adversity, but if you want to test a man's
character, give him power.
I would say the power of the 21st century, maybe the 22nd, but hopefully the 21st, would be the creation of an AGI system
and the people who have control, direct possession, and control of the AGI system.
So what do you think, after spending that evening, having a discussion with the AGI system,
what do you think you would do?
Well, the ideal world that I'd like to
imagine is one where humanity are like the board members of a company, where
AGI is the CEO. The picture which I would imagine is you have some kind of different
entities, different countries or cities, and the people that live there vote for what the
AGI that represents them should do, and the AGI that represents them goes and does it. I think a
picture like that I find very appealing. And you could have multiple AGIs; you would have an AGI for a city, for a country, and it would be trying to, in effect, take
the democratic process to the next level.
And the board can always fire the CEO.
Essentially, press the reset button, say.
Press the reset button.
Re-randomize the parameters.
Well, let me sort of, that's actually, okay, that's a beautiful vision, I think, as long as it's possible to press the reset button.
Do you think it will always be possible to press the reset button?
So I think that it's definitely possible to build...
So the question, as I understand it from you, is:
will humans, you know,
people, have control over the AI systems they build?
Yes.
And my answer is, it's definitely possible to build
AI systems which will want to be controlled by their humans.
Wow, that's part of their...
so it's not that they can't help
but be controlled, but being controlled is
one of the objectives of their existence? In the same way that human parents generally want to help their children, they want their children to succeed.
It's not a burden for them. They are excited to help the children
to feed them and to dress them and to take care of them. And I believe with high conviction
that the same will be possible for an AGI. It will be possible to program an AGI to design
it in such a way that it will have a similar deep drive that it will be delighted to fulfill and the drive will be to help humans flourish.
But let me take a step back to that moment where you create the AGI system.
I think this is a really crucial moment.
And between that moment and the democratic board members with the AGI at the head,
there has to be a relinquishing of power.
So as George Washington, despite all the bad things
he did, one of the big things he did
is he relinquished power.
He, first of all, didn't want to be president
and even when he became president,
he didn't keep serving
indefinitely, as most dictators do.
Do you see yourself being able to relinquish control over an AGI system, given how much
power you could have over the world? At first financial, just make a lot of money, right,
and then control by having possession of the AGI system.
I find it trivial to do that.
I'd find it trivial to relinquish that kind of power.
I mean, you know, the kind of scenario
you are describing sounds terrifying to me.
That's all.
I would absolutely not want to be in that position.
Do you think you represent the majority
or the minority of people in the AI community?
Well, I mean, it's an open question, and an important one.
Are most people good? is another way to ask it.
So I don't know if most people are good, but I think that when it really counts, people
can be better than we think.
That's beautifully put.
Yeah.
Are there specific mechanisms you can think of for aligning AGI
values to human values?
Is that, do you think about these problems of continued alignment
as we develop the AI systems?
Yeah, definitely.
In some sense, the kind of question which you are asking is,
if I translate
that question into today's terms, a question about how to get an RL agent that's
optimizing a value function which itself is learned.
If you look at humans, humans are like that because the reward function, the value function
of humans is not external, it is internal.
That's right.
And there are definite ideas of how to train a value function, basically an objective,
as-objective-as-possible perception system that will be trained separately, to recognize,
to internalize human judgments on different situations.
And then that component would be integrated as the base value function for some more
capable RL system.
You could imagine a process like this.
I'm not saying this is the process.
I'm saying this is an example of the kind of thing you could do.
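To make that idea concrete, here is a small sketch of the general pattern described above: train a separate model to internalize human judgments, then use it as the reward or base value signal for a more capable RL system. The data format, sizes, and pairwise-preference loss here are illustrative assumptions, not a specific system mentioned in the conversation.

```python
# Learning a reward model from human judgments (illustrative sketch).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, obs):                  # scalar "how good is this situation"
        return self.net(obs).squeeze(-1)

reward_model = RewardModel(obs_dim=16)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each training example: two situations plus a human judgment of which is better.
better = torch.randn(256, 16)               # situations humans preferred
worse = torch.randn(256, 16)                # situations humans rejected
for _ in range(100):
    margin = reward_model(better) - reward_model(worse)
    loss = -torch.log(torch.sigmoid(margin)).mean()   # pairwise preference loss
    opt.zero_grad(); loss.backward(); opt.step()

# The trained reward_model(obs) would then serve as the base value/reward
# signal for a more capable RL system, instead of a hand-written reward.
```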
So on that topic of the objective functions of human existence,
what do you think is the objective function that's
implicit in human existence?
What's the meaning of life?
Oh.
I think the question is wrong in some way. I think that the question implies that there is an objective answer, which is an external answer: you know, your meaning of life is X.
I think what's going on is that we exist and that's amazing and we should try to make
the most of it and try to maximize our own value and
enjoyment of our very short time while we do exist.
It's funny, because acting does require an objective function. It's definitely there
in some form, but it's difficult to make it explicit, and maybe impossible to make it
explicit, I guess is what you're getting at. And that's an interesting fact of an RL
environment.
Well, what I was making is a slightly different point,
which is that humans want things, and their wants
create the drives that cause them to act.
Our wants are our objective functions,
our individual objective functions.
We can later decide that we want to change,
that what we wanted before is no longer good,
and we want something else.
Yeah, but they're so dynamic.
There's got to be some underlying sort of Freudian...
there's sexual stuff,
there's people who think it's the fear of death, and there's also the desire for knowledge,
and, you know, all these kinds of things, procreation, sort of all the evolutionary arguments.
It seems like there might be some kind of fundamental
objective function from which everything else emerges. But it seems that it's very difficult
to make it explicit. I think there probably is an evolutionary objective function, which
is to survive and procreate and make sure your kids succeed. That would be my guess.
But it doesn't give an answer to the question of what's the meaning of life.
I think you can see how humans are part of this big process. This ancient process, we are,
we exist on a small planet, and that's it. So given that we exist, try to make the most of it and
try to enjoy more and suffer less as much as we can.
Let me ask two silly questions about life. One, do you have regrets, moments that, if you went back, you would do differently? And two, are there moments that you're especially proud of,
that made you truly happy?
So I can answer both questions. Of course, there's a huge
number of choices and decisions that I have made that, with the benefit of hindsight, I wouldn't have
made. And I do experience some regret, but, you know, I try to take solace in the knowledge
that at the time I did the best I could. And in terms of things that I'm proud of, I am
very fortunate to have done things I'm proud of.
And they made me happy for some time, but I don't think that that is the source of happiness.
So your academic accomplishments, all the papers,
you're one of the most cited people in the world,
all the breakthroughs I mentioned in computer vision
and language and so on,
what is the source of happiness and pride for you?
I mean, all those things are a source of pride for sure.
I'm very grateful for having done all those things.
And it was very fun to do them.
But happiness comes... well,
my current view is that happiness comes,
to a very large degree, from the way we look at things.
You know, you can have a simple meal and be quite happy as a result,
or you can talk to someone and be happy as a result as well.
Or conversely, you can have a meal and be disappointed
that the meal wasn't a better meal.
So I think a lot of happiness comes from that,
but I'm not sure.
I don't want to be too confident.
Being humble in the face of the answer seems to be also part of this whole happiness
thing.
Well, I don't think there's a better way to end it than meaning of life and discussions
of happiness.
So, Ilya, thank you so much.
You've given me a few incredible ideas.
You've given the world many incredible ideas.
I really appreciate it.
And thanks for talking today.
You know, thanks for stopping by. I really enjoyed it.
Thanks for listening to this conversation with Ilya Sutskever, and thank you to our
presenting sponsor, Cash App. Please consider supporting the podcast by downloading Cash App
and using the code LexPodcast. If you enjoyed this podcast, subscribe on YouTube,
review it with five stars on Apple Podcasts, support it
on Patreon, or simply connect with me on Twitter at Lex Fridman.
And now let me leave you with some words from Alan Turing on Machine Learning.
Instead of trying to produce a program to simulate the adult mind, why not rather try
to produce one which simulates the child's? If this were then
subjected to an appropriate course of education, one would obtain the adult brain.
Thank you.