Lex Fridman Podcast - Juergen Schmidhuber: Godel Machines, Meta-Learning, and LSTMs
Episode Date: December 23, 2018

Juergen Schmidhuber is the co-creator of long short-term memory networks (LSTMs), which are used in billions of devices today for speech recognition, translation, and much more. Over 30 years, he has proposed a lot of interesting, out-of-the-box ideas in artificial intelligence, including a formal theory of creativity. Video version is available on YouTube. If you would like to get more information about this podcast, go to https://lexfridman.com/ai or connect with @lexfridman on Twitter, LinkedIn, Facebook, or YouTube, where you can watch the video versions of these conversations.
Transcript
The following is a conversation with Juergen Schmidhuber.
He's the co-director of the Swiss AI lab IDSIA
and the co-creator of long short-term memory networks.
LSTMs are used in billions of devices today
for speech recognition, translation, and much more.
Over 30 years, he has proposed a lot of interesting
out-of-the-box ideas on meta-learning, adversarial networks,
computer vision,
and even a formal theory of, quote, creativity, curiosity, and fun.
This conversation is part of the MIT course on Artificial General Intelligence and the Artificial
Intelligence Podcast.
If you enjoy it, subscribe on YouTube, iTunes, or simply connect with me on Twitter at Lex Fridman, spelled
F-R-I-D.
And now here's my conversation with Juergen Schmidhuber. Early on, you dreamed of AI systems that self-improve recursively.
When was that dream born?
When I was a baby?
No, that's not true.
When I was a teenager.
And what was the catalyst for that birth?
What was the thing that first inspired you?
When I was a boy, I was thinking about what to do in my life, and then I thought the most
exciting thing is to solve the riddles of the universe, and that means you have to become a physicist.
However, then I realized that there's something even
grander: you can try to build a machine
that isn't really a machine any longer,
that learns to become a much better physicist
than I could ever hope to be.
And that's how I thought maybe I can multiply my tiny little bit
of creativity into infinity. But ultimately that creativity will be multiplied to understand the
universe around us. That's the curiosity for that mystery that drove you. Yes, so if you can build a machine that learns to solve more and more complex problems,
and more and more general problems over time, then you basically have solved all the problems, at least all
the solvable problems. So how do you think, what does the mechanism for that kind of general
solver look like? Obviously we don't quite yet have one, or know how to build one.
We have ideas, and you have had throughout your career several ideas about it.
So how do you think about that mechanism? So in the 80s I thought about how to
build this machine that learns to solve all these problems,
that I cannot solve myself.
And I thought it is clear, it has to be a machine that not only learns to solve this problem
here and this problem here, but it also has to learn to improve the learning algorithm itself. So it has to have the learning algorithm in a representation
that allows it to inspect it and modify it, so that it can come up with a better learning algorithm.
So I call that meta-learning, learning to learn, and recursive self-improvement.
That is really the pinnacle of that, where you
not only learn how to improve on this problem and on that, but you also improve the way the machine improves, and you also improve the way it improves the way it improves itself.
And that was my 1987 diploma thesis, which was all about that hierarchy of meta-learners that have no computational limits except for the well-known limits that Gödel
identified in 1931 and for the limits of physics. In recent years, meta-learning has gained popularity in a specific
kind of form. You've talked about how that's not really meta learning with neural networks,
that's more basic transfer learning. Can you talk about the difference between the big general
meta-learning and the more narrow sense of meta-learning the way it's used today, the way it's talked about today?
Let's take the example of a deep neural network that has learned to classify images.
And maybe you have trained that network on 100 different databases of images.
And now a new database comes along and you want to quickly learn the new thing as well.
So, one simple way of doing that is you take the network, which already knows 100 types
of databases, and then you just take the top layer of that and you retrain that using the new labeled data that you have in the
new image database. And then it turns out that it really, really quickly can learn that too,
one-shot basically, because from the first 100 datasets, it already has learned so much about
computer vision that it can reuse that and that is then
almost good enough to solve the new task except you need a little bit of adjustment on
the top.
So that is transfer learning and it has been done in principle for many decades.
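A minimal sketch of that retrain-the-top-layer recipe, in PyTorch; the backbone, the tiny random dataset, and all layer sizes below are invented placeholders standing in for a network pretrained on the earlier 100 image databases:

```python
# Hedged sketch of "retrain only the top layer" on a new labeled dataset.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in for a network already trained on many image databases.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False          # freeze everything learned so far

head = nn.Linear(256, 10)            # fresh top layer for the new 10-class database
model = nn.Sequential(backbone, head)

# Tiny fake "new database" just so the sketch runs end to end.
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=16)

opt = torch.optim.Adam(head.parameters(), lr=1e-3)   # only the head is trained
loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```

Only the new head's parameters go to the optimizer, which is the "little bit of adjustment on the top" described above.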
People have done similar things for decades. Meta-learning, true meta-learning, is about
having the learning algorithm itself
open to introspection by the system that is using it.
And also open to modification,
such that the learning system has an opportunity
to modify any part of the learning algorithm and then evaluate the consequences of that modification and then learn from that to create a better learning algorithm and so on recursively.
So that's a very different animal where you are opening the space of possible learning algorithms to the learning system itself. Right. So, like in the 2004 paper, you describe Gödel machines, programs that rewrite themselves.
Yeah. Philosophically, and even in your paper mathematically, these are really compelling ideas. But practically, do you see these self-referential programs being
successful in the near term, in having an impact, where sort of it demonstrates to
the world that this direction is a good one to pursue in the near term?
Yes.
We had these two different types of fundamental research,
how to build a universal problem solver.
One, basically exploiting proof search,
and things like that that you need to come up with asymptotically
optimal, theoretically optimal, self-improvers and
problem solvers.
However, one has to admit that through this proof search comes an additive constant,
an overhead, an additive overhead that vanishes in comparison to what you have to do to solve
large problems.
However, for many of the small problems that we want to solve in our everyday life, we
cannot ignore this constant overhead.
And that's why we also have been doing other things, non-universal things such as recurrent neural networks which are
trained by gradient descent and local search techniques which aren't universal at all, which aren't
provably optimal at all, like the other stuff that we did, but which are much more practical as
long as we only want to solve the small problems that we are
typically trying to solve in this environment here. So the universal problem solvers,
like the Gödel machine, but also Marcus Hutter's fastest way of solving all possible problems,
which he developed around 2002 in my lab, they are associated with these constant overheads
for proof search, which guarantees
that the thing that you're doing is optimal.
For example, there is this fastest way
of solving all problems with a computable solution,
which is due to Marcus, Marcus Hutter.
And to explain what's going on there,
let's take traveling salesman problems.
With traveling salesman problems, you
have a number of cities,
and you try to find the shortest path through all these cities
without visiting any city twice. And nobody knows the fastest way of solving
traveling salesman problems, TSPs. But let's assume there is a method of solving them
within n to the five operations, where n is the number of cities. Then the universal method of
Marcus Hutter is going to solve the same traveling salesman problem also within n to the
five steps, plus a number of steps that you need for the proof search, which you need to show that
this particular class of problems, the traveling salesman problems, can be solved within a certain
time bound, within order n to the five steps, basically. And this additive constant doesn't care about n, which means as n is getting larger and larger,
as you have more and more cities,
the constant overhead
pales in comparison,
and that means that almost all large problems are solved
in the best possible way already today.
We already have a universal problem solver like that.
However, it's not practical because the overhead, the constant overhead, is so large
that for the small kinds of problems that you want to solve in this little biosphere.
By the way, when you say small, you're talking about things that fall within the constraints
of our computational systems.
They can seem quite large to us mere humans.
That's right, yeah.
So they seem large and even unsolvable in a practical sense today, but they are still
small compared to almost all problems, because almost all problems are large problems, which
are much larger than
any constant.
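A quick numerical illustration of that point; the constant below is made up, and only the asymptotic behavior matters:

```python
# An additive overhead that dwarfs n^5 for small n becomes negligible as n grows,
# which is the sense in which the proof-search overhead "pales in comparison".
C = 10**30                     # hypothetical constant proof-search overhead
for n in [10, 100, 10_000, 1_000_000, 100_000_000]:
    base = n**5                # assumed cost of the best conventional method
    print(f"n={n:>11,}  overhead/base = {C / base:.3e}")
```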
Do you find it useful, as a person who has dreamed of creating a general learning system,
has worked on creating one, has done a lot of interesting ideas there, to think about
P versus NP, this formalization of how hard problems are, how they scale, this kind of worst case
analysis type of thinking. Do you find that useful? Or is it only just a mathematical,
it's a set of mathematical techniques to give you intuition about what's good and bad?
So P versus NP, that's super interesting from a theoretical point of view.
And in fact, as you are thinking about that problem, you can also get
inspiration for better practical problem solvers.
On the other hand, we have to admit that at the moment, the best
practical problem solvers for all kinds of problems that we are now
solving through what is called AI at the moment, they are not of the kind that
is inspired by these questions. You know, there we are using general purpose
computers such as recurrent neural networks, but we have a search technique which
is just local search, gradient descent, to try to find a program that is running on these recurrent networks such that it can solve some interesting problems such as speech recognition or machine translation and something like that.
And there is very little theory behind the best solutions that we have at the moment that can do that.
Do you think that needs to change? Do you think that will change? Or can we create general intelligence systems
without ever really proving that that system is intelligent in a kind of mathematical way,
solving machine translation perfectly or something like that within some kind of syntactic definition of a language, or can we just be super impressed by the thing working extremely well and that's sufficient?
There's an old saying and I don't know who brought it up first, which says there's nothing more
practical than a good theory. And a good theory of problem solving under limited resources, like here in this universe
or on this little planet has to take into account these limited resources.
And so probably what is lacking is a theory which is related to what we already have, these asymptotically optimal problem
solvers, which tells us what we need in addition to that to come up with a practically optimal
problem solver.
So I believe we will have something like that.
And maybe just a few little tiny twists are necessary to change what we already have to
come up with that as well.
As long as we don't have that, we admit that we are taking sub-optimal ways, and
convolutional neural networks and long short-term memory networks equipped with local search techniques, and we are happy
that it works better than any competing method, but that doesn't mean that we think we are
done. You said that an AGI system will ultimately be a simple one, a general intelligence system
will ultimately be a simple one, maybe a pseudocode of a few lines
will be able to describe it. Can you talk through your intuition behind this idea? Why you feel that
at its core intelligence is a simple algorithm?
Algorithm. Experience tells us that the stuff that works best is really simple. So the asymptotically optimal ways of solving problems, if you look at them, are just a few lines of code,
it's really true. Although they have these amazing properties, just a few lines of code. Then the most promising and most useful practical things,
maybe don't have this proof of optimality associated with them.
However, they are also just a few lines of code.
The most successful recurrent neural networks,
you can write them down in five lines of pseudocode.
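For illustration, here is the forward dynamics of a plain recurrent network in a handful of lines of Python with numpy; the sizes and random weights are arbitrary, and this is a generic sketch rather than anyone's particular pseudocode:

```python
import numpy as np

rng = np.random.default_rng(0)
W_in, W_rec, W_out = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(4, 16))

def rnn_forward(inputs):
    h, outputs = np.zeros(16), []
    for x in inputs:                          # one update per time step
        h = np.tanh(W_in @ x + W_rec @ h)     # new hidden state from input and old state
        outputs.append(W_out @ h)             # read out a prediction
    return outputs

ys = rnn_forward([rng.normal(size=8) for _ in range(5)])
```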
That's a beautiful, almost poetic idea.
But what you're describing there is the lines of pseudocode are sitting on top of layers
and layers of abstractions, in a sense.
So you're saying at the very top, it'll be a beautifully written sort of algorithm, but do you think that there's many layers of
abstractions that it has to first learn to construct?
Yeah, of course.
We are building on all these great abstractions that people have invented over the millennia, such as matrix multiplications,
and real numbers, and basic arithmetic
and calculus and derivatives of error functions and stuff like that.
So without that language that greatly simplifies our way of thinking about these problems,
we couldn't do anything. So, in that sense, as always, we are standing on the shoulders
of the giants who, in the past, simplified the problem of problem solving so much that
now we have a chance to do the final step.
So the final step will be a simple one. If we take a step back through all of human civilization, just the universe in general, how do you think about evolution? And what if creating
a universe is required to achieve this final step. What if going through the very painful and
inefficient process of evolution is needed to come up with this set of abstractions that
ultimately lead to intelligence? Do you think there's a shortcut, or do you think we have to create
something like our universe in order to create something like human-level intelligence?
So far, the only example we have is this one, this universe.
And you can live you better.
Maybe not, but we are part of this whole process. So apparently, so it might be the case that the code that runs the universe is really,
really simple. Everything points to that possibility because gravity and other basic forces are really
simple laws that can be easily described also in just a few lines of code, basically.
And then there are these other events, the apparently random events in the history of the universe, which as far as we know at
the moment don't have a compact code, but who knows, maybe somebody in the near future
is going to figure out the pseudo-random generator which is computing whether the measurement of that spin-up-or-down
thing here is going to be positive or negative.
Underlying quantum mechanics.
Yes.
So you ultimately think quantum mechanics is a pseudo-random number generator.
So it's all deterministic.
There's no randomness in our universe.
God does not play dice. So a couple of years ago, a famous physicist, quantum physicist, Anton Zeilinger, he wrote an essay in Nature, and it started more or less like that:
One of the fundamental insights of the 20th century was that the universe is fundamentally random on the quantum level. And that whenever you measure spin up or
down or something like that, a new bit of information enters the history of the universe.
And while I was reading that, I was already typing the response
and they had to publish it because I was right,
that there is no evidence, no physical evidence for that.
So there's an alternative explanation
where everything that we consider random
is actually pseudo-random,
such as the decimal expansion of pi,
3.141, and so on, which looks random, but isn't.
So pi is interesting because every sequence of three digits
appears roughly one in a thousand times, and every sequence of five
digits appears roughly one in a hundred thousand times, which is what you would expect
if it was random. But there's a very short algorithm, a short program that computes
all of that. So it's extremely compressible. And who knows, maybe tomorrow somebody, some
grad student at CERN goes back over all these data points, beta decay, and whatever,
and figures out, oh, it's the second billion digits of pi or something like that. We don't have any fundamental reason at the moment to believe that this is truly random
and not just a deterministic video game. If it was a deterministic video game,
it would be much more beautiful, because beauty is simplicity, and many of the basic laws of the universe, like gravity and the other
basic forces are very simple, so very short programs can explain what these are doing.
And it would be awful and ugly.
The universe would be ugly, the history of the universe would be ugly, if for the extra things, the seemingly
random data points that we get all the time, we really needed a huge number of extra bits to
describe all these extra bits of information. So as long as we don't have evidence that there is no short program that computes the
entire history of the entire universe, we are as scientists compelled to look further
for that shortest program.
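The pi claim is easy to make concrete: a very short program really does stream the digits, and the digit frequencies come out roughly uniform. The sketch below uses the well-known unbounded spigot algorithm; it is a generic illustration, not something from the conversation:

```python
# Stream the decimal digits of pi with the classic unbounded spigot algorithm,
# then check that each digit shows up roughly one time in ten.
from collections import Counter

def pi_digits():
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < m * t:
            yield m
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, m, x, k = q * k, (2 * q + r) * x, t * x, (q * (7 * k + 2) + r * x) // (t * x), x + 2, k + 1

gen = pi_digits()
digits = [next(gen) for _ in range(2000)]
print(digits[:8])          # [3, 1, 4, 1, 5, 9, 2, 6]
print(Counter(digits))     # each digit appears roughly 200 times out of 2,000
```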
Your intuition says there exists a short program that can backtrack to the creation of the universe.
So it brings you the shortest path to the creation of the universe.
Including all the entanglement things and all the spin-up-and-down measurements that have taken place since 13.8 billion years ago. So yeah, so we don't have a proof that it is random,
we don't have a proof that it is compressible to a short program, but as long as we don't have
that proof, we are obliged as scientists to keep looking for that simple explanation.
Absolutely. So you said simplicity is beautiful or beauty is simple, either one works.
But you also work on curiosity, discovery, you know, the romantic notion of randomness,
of serendipity, of being surprised by things that are around you, kind of, in our poetic notion of reality,
we think humans require randomness. So you don't find randomness beautiful. You find
simple determinism beautiful. Yeah. Okay. So why? Why? Because the explanation becomes shorter. A universe that is compressible
to a short program is much more elegant and much more beautiful than another one which needs
an almost infinite number of bits to be described. As far as we know,
many things that are happening in this universe
are really simple in terms of short programs
that compute gravity and the interaction
between elementary particles and so on.
So all of that seems to be very, very simple.
Every electron seems to reuse the same sub-program all the time
as it is interacting with other elementary particles. If we now require an extra oracle
injecting new bits of information all the time for these extra things which are
currently not understood, such as
beta decay, then the whole description length of the data that we can observe
of the history of the universe would become much longer.
And therefore, uglier.
And uglier.
Again, simplicity is elegant and beautiful.
All the history of science is a history of compression progress.
Yes, so you've described sort of as we build up abstractions and you talk about the idea of compression.
How do you see this, the history of science, the history of humanity, our civilization,
our life on earth as some kind of path towards greater and greater compression?
What do you mean by that?
How do you think about that?
Indeed.
The history of science is a history of compression
progress.
What does that mean?
Hundreds of years ago, there was an astronomer
whose name was Kepler.
And he looked at the data points that he got
by watching planets move.
And then he had all these data points, and suddenly it
turned out that he can greatly compress the data by
predicting it through an ellipse law. So it turns out that all these data points are more or less on
ellipses around the Sun,
and another guy came along whose name was Newton, and before him, Hooke.
And they said the same thing that is making these planets move like that is what makes
the apples fall down.
And it also holds for stones and for all kinds of other objects. And suddenly, many
of these observations became much more compressible, because as
long as you can predict the next thing, given what you have seen so far, you can
compress it; you don't have to store that data extra. This is called
predictive coding. And then there was still something wrong with that
theory of the universe and you had deviations from these predictions of the theory. And 300 years
later another guy came along whose name was Einstein. And he was able to explain away all these deviations from the predictions,
of the old theory, through a new theory,
which was called the general theory of relativity,
which at first glance looks a little bit more complicated
and you have to warp space and time,
but you can phrase it within one single sentence,
which is no matter how fast you accelerate
and how fast or hard you
decelerate and no matter what is the gravity in your local framework, light speed always
looks the same. And from that you can calculate all the consequences. So it's a very simple thing
and it allows you to further compress all the observations
because suddenly there are hardly any deviations any longer that you can measure from the predictions
of this new theory.
So all of science is a history of compression progress.
You never arrive immediately at the shortest explanation of the data, but you're making progress.
Whenever you are making progress, you have an insight.
You see, oh, first I needed so many bits of information to describe the data, to describe my falling apples, my video of falling apples.
I need so many data, so many pixels have to be stored, but then suddenly I realize, no,
there is a very simple way of predicting the third frame in the video from the first two.
And maybe not every little detail can be predicted, but more or less, most of these orange
blobs that are coming down, they accelerate in the same way, which means that I can greatly compress the video. And the amount of compression progress, that is the depth of the insight that you have at that moment.
That's the fun that you have, the scientific fun, the fun in that discovery.
And we can build artificial systems that do the same thing.
They measure the depth of their insights as they are looking
at the data, which is coming in through their own experiments, and we give them a reward,
an intrinsic reward, in proportion to this depth of insight. And since they are trying
to maximize the rewards they get, they are certainly motivated to come up with new action sequences,
with new experiments that have the property that the data that is coming in as a consequence
of these experiments has the property that they can learn something new, see a pattern
in there which they hadn't seen yet before. So there's an idea of PowerPlay. You've described a way of training a general problem solver
by looking for the unsolved problems. Yeah. Can you describe that idea a little further?
It's another very simple idea. So normally what you do in computer science, you have some guy who gives you a problem, and then there is a huge search space of potential solution candidates, and you somehow try them out, and you have
more or less sophisticated ways of moving around in that search space, until you finally find
a solution which you consider
a solution which you consider
satisfactory. That's what most of computer science is about. PowerPlay just goes one little
step further and says, let's not only search for solutions to a given problem, but let's
search for pairs of problems and their solutions, where the system itself
has the opportunity to phrase its own problem.
So we are looking suddenly at pairs of problems and their solutions, or modifications of the
problem solver that is supposed to generate a solution to that new problem.
And this additional degree of freedom allows us to build curious systems that are like scientists
in the sense that they not only try to solve, try to find answers to existing questions,
no, they are also free to pose their own questions.
So if you want to build an artificial scientist,
we have to give it that freedom,
and PowerPlay is exactly doing that.
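A toy sketch of that outer loop: search over pairs of a new task and a solver modification, and accept a pair only if the modified solver handles the new task while still solving everything already in its repertoire. The task family (points on a hidden line), the two-weight solver, and "first acceptable pair found" in place of a time-minimizing search are all invented simplifications, not the actual PowerPlay implementation:

```python
import random

random.seed(0)

weights = [0.0, 0.0]              # the toy "solver": predicts y = w0 + w1 * x
repertoire = []                   # tasks (x, y) the solver is known to solve

def solves(w, task, tol=0.1):
    x, y = task
    return abs(w[0] + w[1] * x - y) <= tol

for step in range(50_000):        # bounded search over (new task, solver modification) pairs
    x = random.uniform(-2.0, 2.0)
    task = (x, 3.0 * x + 1.0)     # hidden regularity the solver can grow into
    if solves(weights, task):
        continue                  # not novel: the current solver already handles it
    candidate = [w + random.gauss(0.0, 0.3) for w in weights]   # proposed self-modification
    if solves(candidate, task) and all(solves(candidate, t) for t in repertoire):
        weights = candidate       # accept: new task solved, old repertoire preserved
        repertoire.append(task)

print(len(repertoire), "tasks added to the repertoire; weights ~", [round(w, 2) for w in weights])
```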
So that's a dimension of freedom that's important to have,
but how hard do you think it is, how multi-dimensional and difficult is the space of
coming up with your own questions?
So it's one of the things that, as human beings, we consider to be the thing that makes
us special, the intelligence that makes us special is that brilliant insight that can create
something totally new.
Yes. So, now let's look at the extreme case.
Let's look at the set of all possible problems that you can formally describe, which is infinite,
Which should be the next problem that a scientist or PowerPlay is going to solve? Well, it should be the easiest problem
that goes beyond what you already know. So it should be the simplest problem
that the current problem solver that you have, which can already solve 100 problems,
cannot solve yet, just by generalizing.
So it has to be new.
So it has to require a modification of the problem solver, such that the new problem solver can solve this new thing, but the old problem solver cannot do it.
And in addition to that, we have to make sure that the problem solver doesn't forget any of the previous solutions.
Right.
And so, by definition, PowerPlay is now always trying to search
in the set of pairs of problems and problem solver modifications for a combination
that minimizes the time to achieve these criteria. So it's always trying to find the problem
which is easiest to add to the repertoire. So just like grad students and academics and
researchers can spend their whole career in a local minimum, stuck trying to come up with
interesting questions, but ultimately doing very little. Do you think it's easy,
in this approach of looking for the simplest unsolved problem, to get stuck in a local minimum, never really discovering something new, you know, really jumping outside of the 100 problems
that you've already solved, in a genuinely creative way? No, because that's the nature of PowerPlay, that it's always trying to break its current
generalization abilities by coming up with a new problem, which is beyond the current
horizon, just shifting the horizon of knowledge a little bit out there, breaking the existing
rules, such that the new thing becomes solvable, but wasn't solvable
by the old thing. So like adding a new axiom, like what Gödel did when he came up with these
new sentences, new theorems that didn't have a proof in the formal system, which means you can add
them to the repertoire, hoping that they are not going to damage the consistency
of the whole thing.
So in the paper with the amazing title, Formal Theory of Creativity, Fun, and Intrinsic Motivation,
you talk about discovery as an intrinsic reward.
So if you view human as intelligent agents,
what do you think is the purpose and meaning of life for us humans?
You've talked about this discovery.
Do you see humans as an instance of power play agents?
Yeah, so humans are curious.
And that means they behave like scientists, not only the official
scientists, but even the babies behave like scientists and they play around with their toys
to figure out how the world works and how it is responding to their actions.
And that's how they learn about gravity and everything. And yeah, in 1990, we had the first systems like that,
which just tried to play around with the environment
and come up with situations that go beyond
what they knew at that time.
And then get a reward for creating these situations
and then becoming more general problem solvers
and being able to understand more of the world.
So, yeah, I think, in principle, that curiosity strategy, or
more sophisticated versions of what I just described, they are what we have built in as well,
because evolution discovered that's a good way of
exploring the unknown world and a guy who explores the unknown world has a higher
chance of solving problems that he needs to survive in this world. On the other
hand, those guys who were too curious, they were weeded out as well. So you have
to find this trade-off. Evolution found a certain trade-off,
apparently in our society there is a certain percentage of extremely
explorative guys and it doesn't matter if they die because many of the others are more conservative.
And so yeah, it would be surprising to me if that principle of artificial curiosity
wouldn't be present in almost exactly the same form here in our brains.
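A small sketch of that intrinsic reward in the spirit of the idea, not the 1990 systems themselves: a simple predictor watches two invented data streams, and reward is the drop in average prediction error from one block of experience to the next. The learnable stream yields progress that fades once the pattern is absorbed, the boredom effect, while the unlearnable noise stream typically yields much less:

```python
import random

random.seed(0)

def block_errors(stream, blocks=6, block_size=50, lr=0.02):
    """Train a one-number predictor online; return its average error per block."""
    pred, errs = 0.0, []
    for _ in range(blocks):
        total = 0.0
        for _ in range(block_size):
            y = stream()
            total += abs(y - pred)      # how surprising was this observation?
            pred += lr * (y - pred)     # keep improving the predictor
        errs.append(total / block_size)
    return errs

def intrinsic_reward(errs):
    # Sum of error reductions between consecutive blocks: the "depth of insight".
    return sum(max(a - b, 0.0) for a, b in zip(errs, errs[1:]))

learnable = lambda: 0.8                          # hidden regularity waiting to be discovered
noise = lambda: random.uniform(-1.0, 1.0)        # nothing to discover

for name, stream in [("learnable", learnable), ("noise", noise)]:
    errs = block_errors(stream)
    print(name, [round(e, 2) for e in errs], "curiosity reward:", round(intrinsic_reward(errs), 2))
```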
So you're a bit of a musician and an artist.
So continuing on this topic of creativity, what do you think is the role of creativity and intelligence?
So you've kind of implied that it's essential for intelligence, if you think of intelligence
as a problem solving system, as ability to solve problems.
But do you think it's essential, this idea of creativity.
We never have a program, a sub-program, that is called creativity or something. It's just a side effect of what our problem solvers do.
They are searching a space of problems, or a space of candidates,
of solution candidates, until they hopefully find a solution to a given problem.
But then there are these two types of creativity.
And both of them are now present in our machines.
The first one has been around for a long time,
which is human gives problem to machine,
machine tries to find a solution to that.
And this has been happening for many decades,
and for many decades, machines have found creative solutions
to interesting problems where humans were not aware of these particularly creative solutions,
but then appreciated that the machine found that. The second type is the pure creativity. What I just
mentioned, I would call the applied creativity, like applied art,
where somebody tells you, now make a nice picture of this Pope, and you will get money
for that.
Okay, so here is the artist, and he makes a convincing picture of the Pope, and the Pope likes it,
and gives him the money.
And then there is the pure creativity, which is more like the power play and the artificial curiosity thing,
where you have the freedom to select your own problem, as opposed to the applied creativity, which serves another.
In that distinction, there's almost echoes of narrow AI versus general AI.
This constrained painting of a pope seems like the approaches of what people are calling narrow AI and pure creativity seems to be, maybe
I'm just biased as a human, but it seems to be an essential element of human level intelligence.
Is that what you're implying to a degree?
If you zoom back a little bit and you just look at a general problem solving machine, which
is trying to solve arbitrary problems, then this machine will figure out in the course
of solving problems that it's good to be curious.
So all of what I said just now about this pre-wired curiosity and this will to invent new problems
that the system doesn't know how to solve
yet, should be just a byproduct of the general search. However, apparently evolution has
built it into us because it turned out to be so successful, a pre-wiring, a bias, a very successful exploratory bias that
we are born with.
And you've also said that consciousness in the same kind of way
may be a byproduct of problem solving.
Do you think, do you find this an interesting byproduct?
Do you think it's a useful byproduct?
What are your thoughts on consciousness in general? Or is it simply a byproduct of greater and greater capabilities of problem-solving
that's similar to creativity in that sense?
Yeah, we never have a procedure called consciousness in our machines. However, we get side effects of what these machines are doing, things that
seem to be closely related to what people call consciousness. So, for example, already in 1990,
we had simple systems which were basically recurrent networks and therefore universal computers trying to map incoming data into actions that lead to success.
Maximizing reward in a given environment, always finding the charging station in time,
whenever the battery is low and negative signals are coming from the battery,
always finding the charging station in time without bumping against painful obstacles on the way.
So complicated things, but very easily motivated. And then we gave these little guys a separate
recurrent network, which is just predicting what's happening if I do that and that. What will happen
as a consequence of these actions that I'm executing? And it's just
trained on the long history of interactions with the world. So it becomes a predictive model of
the world, basically. And therefore also a compressor of the observations of the world, because
whatever you can predict, you don't have to store extra. So compression is a side effect of
prediction. And how does this recurrent network compress? Well, it's inventing little
subprograms, little subnetworks that stand for everything that
frequently appears in the environment, like bottles and microphones and
faces, maybe lots of faces in my environment. So I'm learning to create something like a prototype face.
And the new face comes along and all I have to encode
are the deviations from the prototype.
So it's compressing all the time with the stuff that frequently appears.
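A tiny numerical illustration of the prototype-plus-deviations idea, with invented numbers: store the recurring structure once as a prototype, and each new instance only needs its small deviation encoded:

```python
import numpy as np

rng = np.random.default_rng(0)
structure = rng.normal(scale=10.0, size=64)                  # shared "face" structure
faces = structure + rng.normal(scale=0.1, size=(1000, 64))   # individuals deviate only slightly

prototype = faces.mean(axis=0)         # the reusable prototype stored once
deviations = faces - prototype         # all that remains to encode per individual face

print("spread of raw face values:", round(float(np.ptp(faces)), 2))
print("spread of deviations only:", round(float(np.ptp(deviations)), 2), "(far smaller, so cheaper to encode)")
```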
There's one thing that appears all the time that is present all the time
when the agent is interacting with its environment,
which is the agent itself.
So just for data compression reasons, it is extremely natural for this recurrent network
to come up with little subnetworks that stand for the properties of the agent,
the hands, the other actuators, and all the stuff that you need to better encode the data
which is influenced by the actions of the agent.
So there, just as a side effect of data compression during problem solving,
you have internal self models.
Now you can use this model of the world to plan your future.
And that's what we have done since 1990.
So the recurrent network, which is the controller, which is trying to maximize reward, can use
this model network of the world, this predictive
model of the world, to plan ahead and say, let's not do this action sequence, let's do this action sequence instead because it leads to more predicted reward.
And whenever it's waking up these little subnetworks that stand for itself,
then it's thinking about itself,
it's exploring mentally the consequences of its own actions, and now you tell me what is
still missing.
Missing the gap to consciousness.
There isn't. That's a really beautiful idea, that if life is a collection of data, and life is a process of compressing that data to act
efficiently, in that data, you yourself appear very often.
So it's useful to form compressions of yourself.
And it's a really beautiful formulation of what consciousness is, is a necessary side effect.
It's actually quite compelling to me. You've developed
LSTMs, long short-term memory networks, that are a type of recurrent neural network.
They have gotten a lot of success recently. So these are networks that model the temporal aspects in the data,
temporal patterns in the data, and you've called them the deepest of the neural
networks, right? So what do you think is the value of depth in the
models that we use to learn?
Yeah, since you mentioned the long short-term memory, the LSTM, I have to
mention the names of the
brilliant students who made that possible. First of all, my first student ever,
Sepp Hochreiter, who had fundamental insights already in his diploma thesis,
then Felix Gers, who had additional important contributions, Alex Graves,
a guy from Scotland, who is mostly responsible for this CTC algorithm, which is now often used to train the LSTM to do speech recognition on all the Google Android phones and whatever, and Siri and so on.
So these guys, without these guys, I would be nothing.
It's a lot of incredible work. What is now the depth?
What is the importance of depth?
Well, most problems in the real world are deep in the sense that the current input doesn't
tell you all you need to know about the environment.
So instead, you have to have a memory of what happened in the past and often, important
parts of that memory are dated.
They are pretty old.
So when you're doing speech recognition, for example, and somebody says eleven, then that's
about half a second or something like that, which means it's already 50 time steps.
And another guy, or the same guy, says seven. So the ending is the same, "even". But now the system
has to see the distinction between seven and eleven, and the only way it can
see the difference is, it has to store that 50 steps ago there was an "s" or an "el", seven or eleven.
So there you have already a problem of depth 50, because for each time step you have something like a virtual layer in the expanded, unrolled version of this recurrent network, which is doing the speech recognition. So these long time lags, they translate into problem depth.
And most problems in this world are such that you really have to look far back in time
to understand what is the problem and to solve it.
But just like with LSTMs, you don't necessarily need to, when you look back
in time, remember every aspect, you just need to remember the important aspects.
That's right. The network has to learn to put the important stuff into memory and to
ignore the unimportant noise.
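For concreteness, here is a minimal LSTM cell step in numpy, an illustrative sketch with arbitrary sizes rather than the original implementation; the input and forget gates are what let the network write important bits into the cell memory, keep them there across many time steps, and ignore the rest:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W = {k: rng.normal(scale=0.1, size=(d_hid, d_in + d_hid)) for k in "ifog"}
b = {k: np.zeros(d_hid) for k in "ifog"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    z = np.concatenate([x, h])
    i = sigmoid(W["i"] @ z + b["i"])     # input gate: what to write into memory
    f = sigmoid(W["f"] @ z + b["f"])     # forget gate: what to keep from old memory
    o = sigmoid(W["o"] @ z + b["o"])     # output gate: what to reveal right now
    g = np.tanh(W["g"] @ z + b["g"])     # candidate content
    c = f * c + i * g                    # protected cell state carries information across time
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_hid), np.zeros(d_hid)
for t in range(50):                      # e.g., 50 time steps of half a second of speech
    h, c = lstm_step(rng.normal(size=d_in), h, c)
```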
So, but in that sense, deeper and deeper is better, or is there a limitation?
Is there, I mean, LSTM is one of the great examples of architectures that do something
beyond just deeper and deeper networks.
There's clever mechanisms for filtering data for remembering and forgetting.
So do you think that kind of
thinking is necessary? If you think about LSTM is a leap, a big leap forward
over traditional vanilla RNNs, what do you think is the next leap within this
context? So LSTM is a very clever improvement, but LSTMs still don't have the same kind of ability to see far back
in the past as we humans do, the credit assignment problem across, way back, not just 50
time steps or 100 or 1,000, but millions and billions.
It's not clear what are the practical limits of the LSTM when it comes to looking back.
Already in 2006, I think, we had examples where it not only looked back tens or thousands of steps,
but really millions of steps.
And Juan Pérez-Ortiz, in my lab, I think, was the first author of a paper where we really, it was 2006 or something,
had examples where it learned to look back for more than 10 million steps.
So for most problems of speech recognition, it's not necessary to look that far back, but
there are examples where it does. Now, the looking back thing, that's rather
easy because there is only one past, but there are many possible futures. And so a reinforcement
learning system, which is trying to maximize its future, expected reward, and doesn't
know yet which of these many possible futures should I select, given this one single past.
It's facing problems that the LSTM by itself cannot solve.
So the LSTM is good for coming up with a compact representation of the history so far,
of the history and of observations and actions so far,
but now how do you plan in an efficient and good way among
all these? How do you select one of these many possible action sequences that a reinforcement
learning system has to consider to maximize reward in this unknown future? So again, we have this basic setup where you have one recurrent network which
gets in the video and the speech and whatever and is executing the actions and is trying to maximize
reward. So there is no teacher who tells it what to do at which point in time. And then there's the
other network which is just predicting what's going to happen if I do that and that.
And that could be an LSTM network.
And it allows it to look back all the way to make better predictions of the next time step.
So essentially, although it's predicting only the next time step, it is motivated to learn to put into memory something that happened maybe a million
steps ago because it's important to memorize that if you want to predict that at the next time
step, the next event. Now, how can a model of the world, like that, a predictive model of the
world be used by the first guy, let's call it the controller and the model,
the controller and the model. How can the model be used by the controller to efficiently select
among these many possible futures? So the idea we had about 30 years ago was, let's just use
the model of the world as a stand-in, as a simulation of the world.
And millisecond by millisecond we plan the future and that means we have to roll it out
really in detail and it will work only if the model is really good and it will still be
inefficient because we have to look at all these possible futures and there are so many of them.
So instead, what we do now, since 2015, in our CM systems, controller model systems,
we give the controller the opportunity to learn by itself how to use the potentially relevant
parts of the model network to solve new problems more quickly.
And if it wants to, it can learn to ignore the M. And sometimes
it's a good idea to ignore the M because it's really bad. It's a bad predictor in this
particular situation of life where the controller is currently trying to maximize
reward. However, it can also learn to address and exploit some of the sub-programs that came about in the model
network through compressing the data by predicting it. So it now has an opportunity to reuse
that code, the algorithmic information in the model network, to reduce its own search space, such that it can solve a new problem more quickly than without the model.
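Two ways for the controller to use the model come up here: the older idea of rolling the model forward step by step to score candidate action sequences, and the 2015 controller-model idea of letting the controller learn what to ask the model. The sketch below illustrates only the first, simpler idea, with an invented one-dimensional environment and a memorized transition model:

```python
import random

random.seed(0)

def world(state, action):               # the real environment (hidden from the planner)
    return state + action, 1.0 if state + action == 3 else 0.0

model = {}                              # learned predictive model: (state, action) -> (next, reward)

# 1. Gather experience and fit the model (here: simply memorize observed transitions).
for _ in range(200):
    s = random.randint(-5, 5)
    a = random.choice([-1, 0, 1])
    model[(s, a)] = world(s, a)

# 2. Plan by rolling out candidate action sequences inside the model, not the world.
def plan(state, horizon=4, n_candidates=50):
    best, best_return = None, float("-inf")
    for _ in range(n_candidates):
        seq = [random.choice([-1, 0, 1]) for _ in range(horizon)]
        s, ret = state, 0.0
        for a in seq:
            s, r = model.get((s, a), (s, 0.0))   # the model's predicted consequence
            ret += r
        if ret > best_return:
            best, best_return = seq, ret
    return best

print("planned actions from state 0:", plan(0))
```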
Compression. So you're ultimately optimistic and excited about the power of RL, of reinforcement learning, in the context of real systems. Absolutely, yeah. So you see RL as potentially having
a huge impact beyond just the sort of problems that
are often developed with supervised learning methods.
You see RL, for problems of self-driving cars
or any kind of applied robotics, as the correct,
interesting direction for research, in your view?
I do think so.
We have a company called NNAISENSE, which has applied reinforcement learning to little
Audis, which learn to park without a teacher. The same principles were used, of course.
So these little Audis, they are small, maybe like that.
So much smaller than the real Audis.
But they have all the sensors that you find in the real Audis,
you find the cameras, the lidar sensors.
They go up to 120 km an hour if they want to.
And they have pain sensors, basically,
and they don't want to bump against obstacles and other Audis. And so they must learn,
like little babies to park. Take the raw vision input and translate that into actions that
lead to successful parking behavior, which is a rewarding thing.
And yes, they learn that. So we have examples like that, and it's only the beginning.
This is just a tip of the iceberg, and I believe the next wave of AI is going to be all about that.
So at the moment, the current wave of AI is about passive pattern observation
and prediction, and that's what you have on your smartphone and what the major companies
on the Pacific Rim are using to sell you ads, to do marketing. That's the current sort of profit
in AI. And that's only one or two percent of the world economy, which is big enough
to make these companies pretty much the most valuable companies in the world, but there's
a much, much bigger fraction of the economy going to be affected by the next wave, which
is really about machines that shape the data through their own actions. Do you think simulation is ultimately the biggest way that those methods will be successful
in the next 10, 20 years?
We're not talking about a hundred years from now.
We're talking about the near term impact of RL.
Do you think really good simulation is required, or are there other techniques like imitation
learning, observing other
humans operating in the real world? What do you think this success will come from?
So at the moment, we have a tendency of using
physics simulations to learn behavior for machines that learn to solve problems that humans also do not know how to solve.
However, this is not the future because the future is in what little babies do. They don't use
a physics engine to simulate the world. No, they learn a predictive model of the world,
which maybe sometimes is wrong in many ways, but captures all kinds
of important abstract high-level predictions, which are really important to be successful.
And that's what was the future 30 years ago when we started that type of research, but
it's still the future, and now we know much better how to move forward and to really make
working systems based on that where you have a learning model of the world, a model of the
world that learns to predict what's going to happen if I do that and that, and then the
controller uses that model to more quickly learn successful action sequences.
And then of course, always this curiosity thing: in the beginning, the model is stupid,
so the controller should be motivated to come up with experiments with action sequences
that lead to data that improve the model.
Do you think improving the model, constructing an understanding of the world in this connectionist
way, is the now popular approach that has been successful, you know, grounded
in ideas of neural networks?
But in the 80s, with expert systems, there were symbolic AI approaches which, to us humans,
are more intuitive, in the sense that it makes sense that you build up knowledge
and knowledge representation.
What kind of lessons can we draw into our current approaches from expert systems, from symbolic
AI?
So I became aware of all of that in the 80s, and back then, logic programming was a huge
thing.
Was it inspiring to you? Did you find it compelling?
Because most, a lot of your work was not so much in that realm, right?
It was more in the learning systems.
Yes, I know, but we did all of that.
So, my first publication ever, actually, in 1987,
was the implementation of a genetic algorithm, of a genetic programming system, in
Prolog. So Prolog, that's what you learned back then, which is a logic programming language,
and the Japanese, they had this huge fifth generation AI project, which was mostly about logic programming back then, although neural networks existed
and were well known back then, and deep learning has existed since 1965, since this guy in the
Ukraine, Ivakhnenko, started it. But the Japanese and many other people, they focused really on this
logic programming, and I was influenced to the extent that I said, okay, let's take these
biologically inspired algorithms like evolution programs,
and implement that in the language which I knew, which was Prolog, for example,
back then. And then in many ways, that came back later, because the Gödel machine, for example,
has a proof search on board. And without that, it would not be optimal. Or Marcus Hutter's
universal algorithm for solving all well-defined problems has a proof search on board. So that's
very much logic programming. Without that, it would not be asymptotically optimal. But then on the
other hand, because we are very pragmatic guys also, we focused on recurrent neural networks
and suboptimal stuff such as gradient-based search in program space, rather than provably optimal things. So logic programming, it certainly has a usefulness when you're trying to construct
something provably optimal, or provably good, or something like that, but is it useful
for practical problems?
It's really useful for theorem proving.
The best theorem provers today are not neural networks.
Right.
No, they are logic programming systems, and they are much better
theorem provers than most math students in the first or
second semester.
But for reasoning, for playing games of Go or chess, or for
robots, autonomous vehicles that operate in the real world, or
object manipulation, you think
learning is the way to go?
Yeah, as long as the problems have little to do with theorem proving themselves,
then, as long as that is not the case, you would just want to have better pattern
recognition.
So to build a self-driving car, you want to have better pattern recognition and
pedestrian recognition and all these things, and you want to minimize the number of false positives,
which is currently slowing down self-driving cars in many ways, and all of that has very little to
do with logic programming. What are you most excited about in terms of directions of artificial intelligence at this
moment in the next few years, in your own research and in the broader community?
So I think in the not so distant future we will have for the first time little robots that learn like kids.
And I will be able to say to the robot,
look here robot, we are going to assemble a smartphone. Let's take a slab of plastic
and the screwdriver and let's screw in the screw like that. No, not like that, like that.
Not like that, like that.
And I don't have a data glove or something.
He will see me and he will hear me
and he will try to do something with his own actuators
which will be really different from mine,
but he will understand the difference
and will learn to
imitate me, but not in the supervised way, where a teacher is giving target signals for all his
muscles all the time. No, by doing this high-level imitation where he first has to learn to imitate me,
and then to interpret these additional noises coming from my mouth
as helpful signals to do that.
And then it will, by itself, come up with faster ways and more efficient ways of doing
the same thing.
And finally, I stop his learning algorithm and make a million copies and sell it.
And so at the moment, this is not possible, but we already see how we are going to get
there.
And you can imagine, to the extent that this works economically and cheaply, it's going
to change everything.
Almost all of production is going to be affected by that.
And a much bigger wave, a much bigger AI wave is coming than the one that we are currently witnessing,
which is mostly about passive pattern recognition on your smartphone.
This is about active machines that shapes the data through the actions they are executing
and they learn to do that in a good way.
So many of the traditional industries are going to be affected by that. All the
companies that are building machines will equip these machines with cameras and other sensors
and they are going to learn to solve all kinds of problems through interaction
with humans, but also a lot on their own
to improve what they already can do.
And lots of old economy is going to be affected by that.
And in recent years, I have seen that old economy
is actually waking up and realizing that
those are the gains.
And are you optimistic about that future?
Are you concerned?
There's a lot of people concerned in the near term about the transformation of the nature
of work.
The kind of ideas that you just suggested would have a significant impact on what kind of
things could be automated.
Are you optimistic about that future?
Are you nervous about that future?
And looking a little bit farther into the future, there's people like Elon Musk and
Stuart Russell concerned about the existential threats of that future.
So in the near term, job loss; in the long term, existential threat:
are these concerns to you or are you ultimately optimistic?
So let's first address the near future.
We have had predictions of job losses for many decades, for example, when industrial robots came along,
many people predicted lots of jobs are going to get lost.
And in a sense, they were right, because back then there were car factories, and hundreds of people in these factories assembled cars, and
today the same car factories have hundreds of robots and maybe three guys
watching the robots. On the other hand those countries that have lots of robots
per capita, Japan, Korea, Germany, Switzerland, a couple of other countries, they have really low unemployment
rates.
Somehow all kinds of new jobs were created.
Back then nobody anticipated those jobs.
And decades ago, I already said it's really easy to say which jobs are going to get
lost, but it's really hard to predict the new ones. 30 years ago, who would have
predicted all these people making money as YouTube bloggers, for example. 200 years ago, 60% of all people used to work in agriculture.
Today maybe 1%.
But still, only 5% unemployment.
Lots of new jobs were created, and Homo Ludens, the playing man, is inventing new jobs all the time.
Most of these jobs are not existentially necessary
for the survival of our species.
There are only very few existentially necessary jobs,
such as farming and building houses and warming up the houses,
but less than 10% of the population is doing that.
And most of these newly invented jobs are about interacting with other people in new ways, through new media and so on,
getting new types of kudos and forms of likes and whatever, and even making money through that. So, Homo Ludens, the playing man, doesn't want to be unemployed,
and that's why he's inventing new jobs all the time.
And he keeps considering these jobs as really important
and is investing a lot of energy and hours of work
into those new jobs.
It's quite beautifully put.
We're really nervous about the future
because we can't predict what kind of new jobs will be created.
But you are ultimately optimistic
that we humans are so restless that we create
and give meaning to newer and newer jobs,
telling you things that get likes on Facebook or whatever the social
platform is. So what about long-term existential threat of AI, where our whole
civilization may be swallowed up by these ultra-superintelligent systems?
Maybe it's not going to be swallowed up, but I'd be surprised if we were, the humans were
the last step in the evolution of the universe.
You've actually had this beautiful comment somewhere that I've seen, which I found quite insightful, saying that artificial
general intelligence systems,
just like us humans, will likely not want to interact with humans.
They'll just interact amongst themselves,
just like ants interact amongst themselves,
and only tangentially interact with humans.
And it's quite an interesting idea that once we create AGI,
they will lose interest
in humans and compete for their own Facebook likes on their own social platforms.
So within that quite elegant idea, how do we know in a hypothetical sense that there's
not already intelligent systems out there? How do you think broadly of general intelligence greater than us?
How do we know it's out there?
How do we know it's around us and could it already be?
I'd be surprised if, within the next few decades or something like that,
we won't have AIs that are truly smart in every single way and better problem solvers in almost every single important way.
And I'd be surprised if they wouldn't realize
what we have realized a long time ago,
which is that almost all physical resources are not here in this biosphere, but
throughout the rest of the solar system, which gets 2 billion times more solar energy than our
little planet.
There's lots of material out there that you can use to build robots and self-replicating robot factories and all
this stuff. And they are going to do that. And they will be scientists and curious and they will
explore what they can do. And in the beginning, they will be fascinated by life and by their own
origins and our civilization. They will want to understand that completely,
just like people today would like to understand
how life works and also the history of our own existence
and civilization, but then also the physical laws
that created all of that.
So in the beginning they will be fascinated, but once they understand it,
they lose interest, like anybody who loses interest in things he understands. And then, as you said,
the most interesting sources of information for them will be others of their own kind.
So, at least in the long run, there seems to be some sort of protection
through lack of interest on the other side.
And now it seems also clear, as far as we understand physics,
you need matter and energy to compute and to build more robots
and infrastructure and more AI civilizations
and AI ecologies, consisting of trillions of different types of AIs.
And so it seems inconceivable to me that this thing is not going to expand.
Some AI ecology, not controlled by one AI, but trillions of different types of AI's
competing in all kinds of quickly evolving and disappearing ecological niches, in ways that
we cannot even fathom at the moment. But it's going to expand, limited by light speed and physics,
but it's going to expand. And they will realize that the universe is still young. It's only
13.8 billion years old, and it's going to be a thousand times older than that.
So there's plenty of time to conquer the entire universe and to fill it with intelligence
and senders and receivers such that AIs can travel the way they are traveling in our labs today, which is by radio, from sender to receiver.
And let's call the current age of the universe, one eon, one eon.
Now it will take just a few eons from now and the entire visible universe is going to be full of that stuff.
And let's look ahead to a time when the universe is going to be
1000 times older than it is now. They will look back and they will say, look, almost immediately
after the Big Bang, only a few eons later, the entire universe started to become intelligent.
Now to your question, how do we see whether anything like that has already happened, or is already in a more advanced stage, in some other part of the universe, of the visible universe?
We are trying to look out there and nothing like that has happened so far.
Or is that true?
Do you think we would recognize it?
How do we know it's not among us?
How do we know planets aren't in themselves intelligent beings?
How do we know ants, seen as a collective, are not a much greater intelligence than our own?
These kinds of ideas. When I was a boy, I was thinking about these things, and I thought
maybe it has already happened, because back then I knew,
I learned from popular physics books, that the structure, the large-scale structure of the
universe is not homogeneous. And you have these clusters of galaxies. And then in between
there are these huge empty spaces. And I thought, hmm, maybe they aren't really empty. It's just that
in the middle of that, some AI civilization already has expanded and then has covered a bubble
of a billion light years diameter and is using all the energy of all the stars within that bubble
for its own unfathomable purposes. And so maybe it has already happened, and we just failed to
interpret the signs. But then I learned that gravity by itself
explains the large-scale structure of the universe, and so
this is not a convincing explanation. And then I thought,
maybe, maybe it's the dark matter because as far as we know today 80% of the
measurable matter is invisible. And we know that because otherwise our galaxy or other galaxies
would fall apart. They are rotating too quickly. And then the idea was, maybe all of these AI civilizations that are already out there, they
are just invisible because they are really efficient in using the energies of their own
local systems.
And that's why they appear dark to us.
But this is also not a convincing explanation, because then the question becomes,
why are there still any visible stars left in our own galaxy,
which also must have a lot of dark matter?
So that is also not a convincing thing.
And today, I like to think it's quite plausible that maybe we are the first, at least in our local light cone,
within the few hundreds of millions of light years
that we can reliably observe. Is that exciting to you, that we might be the first?
And it would make us much more important, because if we mess it up through a nuclear war,
then maybe this will have an effect on the development of the entire universe.
So let's not mess it up. Let's not mess it up.
Jürgen, thank you so much for talking today. I really appreciate it.
It's my pleasure. Thank you.