Big Technology Podcast - Getting AI To Think And Learn Like Humans — With Daniel Kahneman and Yann LeCun
Episode Date: December 8, 2021. Daniel Kahneman is a Nobel Prize-winning psychologist and economist and the author of Thinking, Fast and Slow, a landmark book that decodes human decision-making. Yann LeCun is the chief AI scientist at Meta (Facebook) and a pioneer in the field of deep learning, on which the cutting edge of AI is based today. The two come together on Big Technology Podcast this week to discuss how machines and humans learn, whether there are parallels, and what each field can learn from the other.
Transcript
Hello and welcome to the Big Technology Podcast, a show for cool-headed, nuanced conversations of the tech world and beyond.
Let me tell you, I'm pinching myself that we're actually about to have this conversation because I think it's going to be the most fascinating we've had on the podcast, bar none.
And let me tell you why.
So in 2017, I got a chance to spend some time with Yann LeCun.
And Yann was the head of Facebook AI Research.
Now you might know the company as Meta.
He's the chief AI scientist there.
And we spoke a lot about how you can actually take the human mind and make it artificial,
turn it into something that's an AI.
And I think that making a thinking machine is Yann's long-term goal.
Then this summer, I, you know, had some time off and decided that I was going to read a book I wanted to get around to for a long time.
It was Thinking, Fast and Slow by the Nobel Prize-winning Professor Daniel Kahneman.
And as I'm reading this book, I start thinking, wow, Professor Kahneman describes the way the mind works in so many interesting ways that I hadn't thought about, and ways that you'd imagine Yann would have to think about if he wants to build a thinking machine.
And so why don't we just get the two of them together?
Obviously, it was a pipe dream, right?
Anyway, about a month ago, I'm in the same room as both of them by some stroke of luck.
And I propose, you know, why don't we have this conversation?
So, of course, I go up to Professor Kahneman, we'll call him Danny for, you know, this conversation, and say, hey, have you ever spoken with Yann?
And it turns out that they've spoken many times.
And these are discussions they've had in private a handful of times, but they've never had them in public.
And so we'll have it in public today.
So first of all, I want to say Yann and Danny, welcome to the show.
It's great to have you both.
So I think we should just do the foundational stuff about what Yann's mission is, a little bit on Danny's research.
And then actually, stay tuned for the second half, because Danny brought a list of questions that he wants to ask Yann.
And I'll just take a back seat for that.
But why don't we start here?
Yann, I'd love to hear a little bit more about, if you could briefly share, your ambition to build a thinking machine. Like, what does that actually look like? Do you want to replicate the human mind one for one?
No, I don't want to replicate the human mind. I want to understand intelligence. I think one of the most fascinating scientific questions of all time is: what is intelligence? How does the brain work?
You know, the other two fascinating questions of all time are what is life all about and how does the universe work, right?
So, you know, these are like, you know, big questions.
And as a scientist and an engineer, I consider that I don't really understand how something works unless I build it myself.
Or have some understanding of how to build it.
And so the purpose of AI is a dual purpose.
One is understand intelligence, perhaps approach models of human intelligence if we want.
And the other one, of course, is to construct interesting artifacts that can help people in their daily lives and make the world better and everything.
So I think it's really two purposes, the scientific one and a technological one.
Right. And just a follow-up on that. I remember sitting with you in your office, and you were talking about, you know, how you want to teach computers to predict. And if they can predict, they can plan. And if they can plan, they can start to have some of the functionality that, you know, we have as human beings. And there is this rush towards general intelligence, where folks like you and researchers in your stratosphere are attempting to build thinking machines.
That's right. Well, yeah. So, I mean, the first thing you can notice in the animal world is that there is no intelligence without learning.
And in the engineering world it's almost true as well. And it's probably because I'm either lazy or not smart enough, but I think that as human engineers we cannot actually directly conceive and construct an intelligent machine. We have to build a machine that can make itself intelligent through learning, right? So that's what got me interested in learning very early on, when I was an undergrad or something. And then the next question is how to get machines
to learn. And of course, you know, there's a long history of machine learning, you know,
supervised learning, starting with the perceptron and things like that. And what has become really
clear over the last couple of decades (it was even clear for people like Geoff Hinton before that, but maybe just not for me) is that the types of learning that we are currently able
to reproduce in machines, which are supervised learning and reinforcement learning, do not seem
to reflect what we observe in humans and animals. There is another type of learning, another
paradigm of learning that seems to take place in humans and animals that allows humans and
animals to learn how the world works, you know, mostly by observation, a little bit by interaction,
but mostly by observation. And we accumulate
enormous amounts of background knowledge about how the world works.
And that connects with, you know, what Danny talks about.
If we have a model of the world, this model of the world, we can use it to plan
because we can imagine the consequences of the actions we're taking.
That allows us to plan.
So this is what Danny calls system two.
But currently what we can do in machine learning is more like System 1, the stuff
where, you know, here is an input, here is an output, and it does not require reasoning, if you
will. So we're sort of, you know, I'm interested in trying to get machines to learn models
of the world so that we can get them to reason, essentially. Yeah, it's so amazing to have you both
on the same show because now we get to go to Danny to talk a little bit about these concepts
in terms of the way that the human mind thinks. And then we can go back to you, Yann, and
think about how we might be able to get that into AI.
So, Danny, at the risk of having you go through, you know, the free bird speech that you make often,
but I think it's important for listeners, especially those who haven't read the book.
Are you able to just describe a little bit more what Yann is talking about in terms of the way human beings think?
I know it's sort of a short cut, but talk a little bit about System 1 and System 2 and give us an overview there?
I would say that when we describe human intelligence,
we speak of a representation of the world.
And it's the representation that leads to prediction,
that there is no shortcut to the prediction from the data.
You go through a representation,
which includes how the system works.
It includes the causal relations.
This is what enables you to predict.
And it turns out that we do have such a model of the world, and it's not so much that we have specific expectations about what's going to happen next.
What is happening most of the time is that things happen, and then we make sense of them, as we actually go back and fit them into what happened before.
And quite often, you know, right now you're not predicting what I'm going to say next,
but I'm not going to surprise you by what I say.
Or, you know, if I said "lumber" all of a sudden, I would surprise you, because it doesn't fit.
But so this is the way the most interesting part of this works.
And it is remarkable from the point of view of the psychologists.
There are many, many things that touch on what Yann is doing.
What's remarkable is how little it takes, how quickly people learn,
and that, I think, is a fundamental puzzle.
And it turns out that the representation of the world that we have,
it's hard to imagine it completely without symbols.
That is, we do think symbolically.
And how you represent symbols without using symbols is sort of a puzzle.
So, you know, I would ask Yann the question I was asking before,
which is whether solving a generic learning problem is enough to get all of that,
that is, to get a system that learns quickly
and to get a system that will have the kind of logic
that ensures there is a certain kind of mistake
that people just don't make.
And so we have it that an object is not in two places at once,
or basic facts about the world.
And there is,
sort of a certainty about it that seems to be difficult to achieve with just a system that learns
approximately. So I was wondering, you know, how Yann is going to deal with that.
Yeah. And before we get into the answer, just a quick bit of table setting. So System 1 is the automatic stuff, the things that we do without, you know, thinking, or what we feel like we're doing without thinking, even though we might actually be in System 2. And System 2 is more complex.
More complex. So far, everything I was saying was, you know, System 1, basically, because I think the representation of the world that we have, and our ability to anticipate, or to feel unsurprised by what happens, which is, I think, more than anticipating, that is all System 1. That's all automatic, it's effortless, and it's very quick.
System 2 is... we'll get to System 2, but solving System 1 seems to be a big one.
Yeah, yeah, so yeah, we'll start on System 1.
Yann, your thoughts?
Well, I mean, I agree with Danny that current AI systems are, you know, very specialized,
and that makes them very brittle because they're trained for one task or maybe a collection of tasks.
There's a sort of motion towards training relatively large systems for multiple tasks at once, not just one, because they tend to work better and require less data.
But it is astonishing how fast humans and animals can learn new tasks when faced with a sort of a new situation.
How is it that a teenager can learn to drive a car in about 10 or 20 hours of practice or learn to fly an airplane in 20 hours of practice?
It's incredibly fast.
If we were to use, let's say, reinforcement learning to train a self-driving car to drive itself,
it would have to drive itself for millions of hours and cause, you know, thousands of accidents
and destroy itself multiple times before it learns to drive, and probably not nearly as reliably as a human.
So what's the difference?
Now, of course, you know, obviously we can say humans rely on their background
knowledge about the world. And that basically is the answer to that. We learn enormous amounts
of background knowledge about how the world works. So when we learn to drive,
we don't have to learn that if we drive next to a cliff and we turn the wheel to the right,
the car will run off the cliff and nothing good will come out of it. We don't need to try
because we know that from our model of how the world works, of intuitive physics and things
like that. And this type of model is what gives us, in my opinion, some sort of common sense.
I mean, what we call common sense comes out of this that prevents us from making the really
stupid mistakes that Danny was talking about that AI systems are doing currently.
So how do we get machines to learn that? You know, you look at how, at what stage in their life
human babies learn basic concepts, like what is the difference between an animate and an inanimate
object, or, I'm going to put this object on the table, is it going to stay stable or is it going
to fall?
That's around the age of three months or so, three, four months.
The difference between animate and inanimate also comes pretty early.
Object permanence comes very early as well.
Some people claim it's innate, not clear.
And then there are concepts that, you know, we take for granted the fact that objects that
are not supported fall, they're subject to gravity.
Babies learn this around the age of eight months, eight to nine months.
It takes a long time to understand that an object that is not supported will fall.
It takes a long time to understand momentum and things like that.
But then, you know, by the age of nine months,
and so in the first few months, you know, babies have very little ability to act on the world, right?
Pretty much everything they learn is through observation only.
And, you know, by the time they are eight or nine months, they've pretty much understood, you know, the physics of the world.
And babies are relatively slow.
You take a baby cat, they understand intuitive physics incredibly quickly, and the dynamics of their own body and everything.
So, of course, you know, it's not clear how much of this is hardwired, but there is clearly a lot of learning taking place.
This type of learning is the one that we don't yet know how to reproduce with machines.
And I think, I mean, that's where I focus my research, you know, what is this type of learning that allows us to learn representations of the world and predictive models of the world that, you know, then allows us to learn new tasks very quickly with very few trials or very few examples.
And it's a very hot topic in machine learning at the moment, actually.
Well, I'm curious about one thing.
I mean, and I'm quite ignorant about this,
but there are two tentative answers that I know about
to solve the problem of the speed of learning,
and one is that a lot is built in.
And that looks very appealing because actually some animals
really come out of the womb and they're ready to go.
I mean, goats, mountain goats, they know how not to fall as soon as they are born.
So there it's difficult to think of learning.
So nativism is part of the answer.
Then there is something that I've been learning about, the hierarchical Bayesian models,
which is sort of a different idea, that there are logical structures that we are prepared for to organize the world
that we see and that this is part of what enables us to learn things very quickly.
Now, I think you reject both of these.
Well, yes.
The first one only partially, because, you know, it's obvious that at least in certain species,
including in humans, you know, a lot of things are not necessarily completely hardwired,
but at least we have sort of intrinsic motivation to learn them quickly and it drives us to learn them
quickly, essentially. And it's certainly true for, you know, baby Ibexes and mountain goats and
whatever. And certainly for simpler animals, you know, like spiders and things like that. But
it's, you know, even for an Ibex or a mountain goat, they still need to kind of learn a lot about,
you know, the dynamics of their own body. They are not hardwired with that. They are hardwired
maybe with the geometry, but not with the details of it. And, you know, you can think of this
at just being a few parameters to adjust,
but I think what goes on in the brain of the animal
is much more complex than that.
So I think there is a lot of learning in there,
even if in the end we can observe that the learning takes place
really quickly.
So then, you know, there is some evidence
that the things
a lot of people in cognitive science think, or have thought in the past, perhaps, were hardwired can actually be learned, because they can be learned really quickly.
So, for example, it would be relatively simple for genetics to encode the fact that the visual cortex needs to be built with neurons
that can detect oriented edges, for example, the kind of stuff that we find in the primary visual cortex area.
But in fact, we have now a dozen different learning algorithms that if we were to run them in real time in sort of a simulated animal, if you want, would learn those oriented feature detectors from almost random images within minutes.
And so there is no point in hardwiring this because it can be learned within minutes.
Same for face detection.
So, you know, you look at the brain and there are areas in the brain that light up when people are shown faces.
And so an easy conclusion of that is, oh, you know, this area of the brain is specialized
in face detection and it's hardwired and, you know, face detection is innate or face recognition
is innate.
But again, face detection can be learned in minutes.
If you're a baby, your vergence is bad.
Your focus is basically fixed at a relatively short range.
So the only thing you see during the first weeks of your life are faces and nipples, essentially.
So, you know, and then you have a hardwired thing to pay attention to motion.
And so within minutes, you're going to have a face detector in your visual cortex.
You know, any reasonable learning algorithm will learn this extremely quickly.
So you don't need this to be innate if it can be learned.
So that's the first thing about innateness.
You know, is it the case that the fact that the world is three-dimensional is innate, for example?
It would make sense, right?
because a world is three-dimensional,
all of evolution has always taken place in the three-dimensional world.
So it would make sense to kind of hardwire this in the cortex.
Then if you're an AI scientist or engineer,
it's like, how would I even define an architecture that has that hardwired?
I don't even know how to do it.
And so it's not clear you can actually encode this in the genome.
On the other hand, you can learn it really quickly
because the fact that every point in the world
has a depth is the best explanation for how your view of the world changes when you move your head
or if you correlate the two views from your left eye and right eye. And so learning things like,
very basic things like this, is very simple for whatever learning algorithm our brain uses. So that's
the first part about the sort of nature and nurture debate, if you want. And then there is the second
question, about symbols.
So is it necessary to have explicitly hardwired mechanisms in our brain that allow us to do
things like symbolic manipulation or reasoning?
And I find that hard to believe that they need to be hardwired.
First of all, the first question to you, Danny, perhaps, which I've asked to many people who've come up with that question, is this:
Do you view the type of reasoning that, let's say, great apes do as symbolic or do you view what monkeys do as symbolic or dogs or cats or octopus?
Do they do symbolic reasoning?
And can we also, just for table setting, define what symbolic reasoning is?
I don't know how to define it.
I'm not a big, you know, advocate of the very existence of symbolic reasoning.
Okay, we'll toss it over to Danny, yeah.
I think that when we're talking about symbols,
we're really talking about logical relations.
That's not an easy one for us to understand each other on,
and it's hard for me to explain.
But I think a characteristic of symbolic reasoning is a certain discreteness.
And it's not, so it's not completely compatible with what you have talked about.
I mean, this is interesting, what is happening here, because I would push you on your idea that everything has to be learned or can be learned very easily.
I would push you down the animal chain
where I think it becomes highly plausible
and you are doing the same thing to me with symbols
by asking me how far down do symbols go
and it's clear that the answer
that sort of the standard answer
that it's connected with language
and that the symbolic system and the language system
are related, and therefore other animals,
most of them, don't have symbols in the way that we do.
Whether that is really compelling, I'm not sure.
That is, in the sense, for example,
that in the hierarchical Bayesian models
that I'd like to get your evaluation of,
those are supposed to take care of perceptual learning,
and the symbolic or logical
element is sort of upstream of that, in what categories of things there are,
you know, what categories of structure there are that you're about to learn.
Right.
So I think there are a number of very interesting questions there.
So the first one is what do we mean by symbols, really?
And if by symbol we mean the ability to form sort of discrete categories,
represent sort of discrete categories in our mental representation system,
then I think pretty much every brain has some sort of symbolic representation.
And the reason I think this is that discretizing concepts,
like having identified categories, makes memory more efficient,
because it makes representations have the ability to be error-corrected.
So if I write a zip code or a credit card number and you make a mistake on one digit,
that mistake would be detected because when you make a single error on a credit card number,
there is some self-consistency, right?
And that's because credit card numbers not only are discrete because there are numbers,
but they are actually some distance apart,
like two credit card numbers or some distance apart from each other.
So if you make a small error,
you can snap it back to the correct one.
You can detect the mistake and you can,
you know, that's the basics of error correction.
So just for the purpose of being able to do error correction
and for the purpose of being able to store concepts in memory,
discrete concepts in memory efficiently,
you need discrete representations.
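A minimal sketch of the kind of self-consistency he is alluding to, using the Luhn checksum that real card numbers actually carry, which catches any single-digit error:

```python
# Luhn check: every valid card number satisfies this digit-sum rule, so
# changing any single digit breaks it and the error can be detected.
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # True  (a standard Luhn test number)
print(luhn_valid("79927398714"))  # False (one digit changed, error detected)
```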
But now the big conundrum is,
so in a way, you need symbols.
I think if that's the definition of symbols, then I think dogs have symbols.
I don't know if they can reason logically.
I guess they have some logic, certainly.
I mean, I think cats and dogs and probably mice even have these kind of symbols.
And what's a discrete representation?
You know, it's the difference between a real number and a natural number, right?
So if I...
Ah, yeah, yeah, go ahead.
Right. So let's say I want to send a number to you. And the only thing I have is an electric wire and I can send a voltage, right? So I can send a voltage, you know, three volts, let's say. But then at your end, it may not be exactly three volts because there is resistance, impedance, you know, parasites, noise, whatever. So what you're going to observe is some sort of fluctuating thing that may be around three volts, but maybe a little shifted. But it's not going to be
shifted all the way to four volts or two volts; it's going to be around three volts.
So because you know that the signals I want to transmit to you are either 0, 1, 2, 3, or 4,
you know it's going to be 3, right?
So despite the fact that there is noise in the transmission, and despite the fact that the signal is continuous,
you snap it back to its correct value.
Okay?
Yeah.
So, and then to store that value somewhere, because there are only, you know, five different
possible values, 0, 1, 2, 3, 4,
you can store it with a small number of bits, whereas if you had to store a precise voltage value, it would require many bits, right?
So, discretization makes memory more efficient, and it makes transmission, either inside the brain or between agents, reliable.
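A minimal sketch of that snapping-and-storage argument, assuming a toy alphabet of the five voltage levels 0 through 4:

```python
import math
import random

LEVELS = [0, 1, 2, 3, 4]            # the agreed-upon discrete symbols (volts)

def transmit(level: float, noise_std: float = 0.2) -> float:
    """Simulate a noisy analog wire."""
    return level + random.gauss(0, noise_std)

def decode(received: float) -> int:
    """Snap the noisy reading back to the nearest allowed level."""
    return min(LEVELS, key=lambda v: abs(v - received))

sent = 3
received = transmit(sent)                  # e.g. 3.17 volts after noise
print(decode(received))                    # -> 3, as long as the noise stays under 0.5 volts
print(math.ceil(math.log2(len(LEVELS))))   # -> 3 bits are enough to store one symbol
```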
It also, you know, gives the possibility of associative memory.
So in the brain, you know, the brain represents things with voltages, right?
To, you know, a first-order approximation.
Now, you know, there may be noise again in this representation and, you know, you may see a partial view of an image and you can reconstruct the whole image because of your, you know, knowledge of, you know, what the system is supposed to look like.
You may never have seen my left side, okay, the left side of my face.
But even if you've never seen it, you probably would have a pretty good idea of what it looks like
because of your general model of a human face and the fact that they are mostly symmetrical.
So that's an example of kind of snapping what's a noisy signal to, you know, a perfect signal
because of the maybe discrete, maybe not discrete, but at least the
internal structure of what you're looking at, which your brain has captured.
So that's why discrete symbols are interesting.
It's the same reason, by the way, why all of modern communication is digital.
It used to be analog.
It used to be that to communicate with each other, we would call each other on the phone
and it would just transmit a voltage directly, you know, from
my phone to your phone.
But now we replace this with digital communication with bits.
And the reason is it's more efficient and more robust to noise and there's all kinds
of advantages.
Our brains actually use digital communication internally.
A neuron, you know, neurons use spikes to communicate with each other.
And the reason is it's easier to regenerate a binary signal than it is to regenerate an analog
signal.
And it's more efficient energetically.
I mean, there's all kinds of good reasons for this.
So, you know, that may explain why symbols may emerge in all representations of the world,
just for reasons of efficiency, and including for animals that do not have language.
Now, of course, if you do want language, language is, you know, basically a sort of approximate way of representing the sort of complex data structures that we have, you know,
in our mind, and serializing them so that we can put them out on a single one-dimensional signal,
which is sound, sound pressure or sequences of words, which are symbols.
And again, if I'm talking to you through a microphone, it goes through the internet,
and there's some noise attached to that process.
It's pretty high quality, but it's still some noise.
But because our language uses discrete words,
we can have error-correcting communication.
So even if you don't completely understand
all of the syllables or phones that I'm pronouncing,
you can still kind of recover what I meant
because there is only a finite number of words, right?
And you may not have heard all the syllables I pronounced,
but you kind of snap it back to whatever makes sense in the context.
In fact, every speech recognition system works this way.
They have sort of a language model,
and even if they don't understand every sound,
they can sort of reconstruct really what was meant.
In my case, it's even harder because of my accent.
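A minimal sketch of that snap-back idea, using nothing fancier than fuzzy string matching against an assumed toy vocabulary (real recognizers use much richer language models):

```python
import difflib

vocabulary = ["lion", "chases", "the", "wildebeest", "in", "savannah", "kitchen", "mouse"]

def snap(word: str) -> str:
    """Return the closest known vocabulary word to a noisy or garbled token."""
    match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.0)
    return match[0] if match else word

print([snap(w) for w in ["lajon", "chazes", "teh", "wildebist"]])
# -> something like ['lion', 'chases', 'the', 'wildebeest']
```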
So language has to be discrete because it needs to be noise-resistant, essentially.
And perhaps, you know, language was, the appearance of language, of symbolic language, was facilitated by the fact that, you know, we need to form the equivalent of symbols or discrete categories in our brain for efficient storage and sort of recovery, associative memory recovery, possibly.
And so it's quite possible that, you know, in the right condition, animals would have kind of more linguistic capabilities than we give them.
And there's a number of experiments, of course, on this that have been done with monkeys and parrots and whatever.
So in some sense, the discretization itself is built in.
That has to be.
I mean, but what you're saying is no content,
or very little content, that is built in.
To the human brain at inception.
Is that the... yeah, okay.
Well, I mean, the big question is,
where do the meaning of those discrete entities come from?
Right.
And I certainly do not believe that the meaning of those discrete entities
is predetermined in the human mind,
certainly, or in the animal mind either.
Those are completely learned.
And so here now comes the conundrum.
So you're talking about hierarchical Bayesian systems.
In a sense, multi-layer deep neural nets are hierarchical Bayesian systems if you kind of view them the right way.
And there are certain forms of them that are actually explicitly Bayesian.
The main issue in sort of the classical approaches to Bayesian modeling is that the concepts in a hierarchical
Bayesian graph, you know, a Bayesian network or a graphical model, you know, people give
them different names, those concepts have to be designed by the human designer when they
build those systems. And it clearly is not happening that way in the brain. Those concepts are
learned, right? So the basic entities that are manipulated, those kind of,
you know, discrete concepts, are learned.
Yeah, and can you, sorry to ask, but can you, just for the general audience,
describe hierarchical Bayesian systems?
Well, okay.
Yeah.
I mean, it means, it means different things to different people, but there is a classical
view of it called Bayesian networks where you have, basically it's a graph.
So a graph, in the mathematical sense, is a collection of nodes that are linked
by edges.
And the structure of this graph,
so each node represents a variable.
For example, so here's a classical example
that people cite in courses on AI.
There's a node that represents whether your house is jumping,
okay, is moving.
There's another node that indicates whether a truck has hit your house.
And then there is another node that indicates
whether you're in California and whether there was an earthquake.
Okay, so you could establish a causal relationship or at least a dependency between those nodes.
You can say, well, if my house just jolted, it's either because a truck hit it or because there was an earthquake and there's probably not many other reasons for that to happen.
Now, if I look out the window and I see that a truck just hit the house, it immediately
lowers my estimate of the likelihood that there was a simultaneous earthquake, right?
Because the likelihood of those two happening at the same time is very low.
So that's called the explaining away thing.
So you can imagine sort of building a network of those nodes where each node has a particular meaning.
For example, some nodes would be symptoms that you can observe in a person and then underlying nodes
that causally influence them
would be, sort of, you know,
infection by a particular bug or whatever, right?
And you could imagine building those things
and then doing inference.
So you know the value of some nodes
and you can infer the probability
that all the other nodes that you don't observe
have a particular value.
I'm observing this and this and this symptom
and I have this graph.
And now I've measured the blood pressure
and, you know,
whether the person's tummy hurts
or in particular way, whatever.
And I conclude that, you know,
it's probably appendicitis or whatever, you know.
So you can imagine, and in fact, you know,
systems in the 1970s were built exactly on that model
for, you know, so-called expert systems
or probabilistic expert systems.
And what we now call Bayesian networks
and hierarchical Bayesian models and whatever
are kind of the descendant of this, if you want.
Now, this does not involve any learning.
All those things would be built entirely by hand,
completely engineered.
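A minimal sketch of such a hand-built network, with made-up probabilities chosen only so the explaining-away effect is computable:

```python
# Toy hand-built Bayesian network: an Earthquake (E) or a Truck hitting the
# house (T) can both cause the house to Shake (S). Every number below is a
# made-up prior or conditional probability, specified by hand.
P_E, P_T = 0.001, 0.01

def p_shake(e: bool, t: bool) -> float:       # hand-written conditional table
    if e and t:
        return 0.99
    if e or t:
        return 0.9
    return 0.001

def joint(e: bool, t: bool) -> float:         # P(E=e, T=t, S=shaking)
    pe = P_E if e else 1 - P_E
    pt = P_T if t else 1 - P_T
    return pe * pt * p_shake(e, t)

# Inference by brute-force enumeration over the unobserved variables.
p_shaking = sum(joint(e, t) for e in (True, False) for t in (True, False))
p_quake_given_shaking = sum(joint(True, t) for t in (True, False)) / p_shaking
p_quake_given_shaking_and_truck = joint(True, True) / sum(joint(e, True) for e in (True, False))

print(round(p_quake_given_shaking, 4))            # ~0.083: shaking alone makes a quake plausible
print(round(p_quake_given_shaking_and_truck, 4))  # ~0.0011: seeing the truck explains the shaking away
```

Every number in that sketch had to be written down by hand, which is exactly the point made next.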
And that necessity was, essentially, one of the reasons for the kind of decrease in
interest in sort of good old-fashioned AI.
Yeah, these were your competitors.
Yeah, the Bayesian, yeah, the Bayesian folks were like.
They could be completely logical, with hard decisions, but the fact that you had to hand-engineer
those entire systems from scratch is kind of what killed them, a little bit.
So the alternative is learning, but now you have the question, you know, how do you learn those concepts?
How do you learn that, you know, you want to do computer vision, the nodes at the bottom level are pixel values, okay?
What would be the right node just above that represent good combinations of pixel values?
How do you learn that?
Turns out those combinations would be things that, you know, detect oriented edges, which is what you observe in the brain and what, you know, convolutional neural nets actually learn.
But that has to be learned.
So then the big conundrum is
if a system is to learn and manipulate discrete symbols
and learning essentially requires things to be kind of continuous,
how do you kind of make those two things compatible with each other?
And the answer?
I don't know.
Maybe the answer is that we are giving perhaps a little too much importance to logical reasoning.
So it's one thing that Geoff Hinton is saying, which I agree with to a large extent,
which is a lot of reasoning, certainly in animals and in humans, is not logical reasoning.
It's basically simulation or analogical reasoning, which is kind of similar.
So, you know, you are a lion in the savannah and you're kind of chasing a wildebeest or something.
And you have to do some prediction about the trajectory of the wildebeest and, you know, work with the other lion or lionesses usually to kind of chase the animal in the right way.
And, you know, that requires kind of, you know, a simulation of the animal you're chasing.
The best way to predict how the animal you're chasing is going to act is to have an internal model of it that you can simulate.
And it's the same, like, so in a more kind of human situation of, you know, you want to build a widget, like a box out of planks or whatever.
You have to have sort of a model in your mind of what would be the result of assembling those planks and, you know, how solid it is going to be if you use glue or nails or screws or kind of more complex carpentry.
And so the key element in this form of intelligence is your ability to build models of the world, predictive models of the world.
And that's what we're missing.
Okay, that's what we need to figure out how to do with machines.
And you can do this, and I find this very convincing, actually.
I mean, so you can do this without symbols, you can do this without an explicit representation of causality.
You just observe, and from the observation, you can go directly to prediction.
Right. So there is going to be some causality involved. The question is whether you again need
an explicit mechanism for causal inference. But if your model of the world includes the
prediction of the next state of the world, given the previous state and given your action,
then you're going to build a causal model, right? Because you know that your action,
or the action you observe from other people or other agents will tell you, if I take that action,
I will get this result from this state, right?
So you will be able to establish causal models if you can act or if you can determine
that another agent has acted and you have observed the result.
I mean, mechanisms of that kind certainly exist, you know, in human physiology.
So any movement of the eye involves a prediction of what the world
will look like when your eye moves.
And there are those beautiful old experiments
where people, in fact, used curare to paralyze their eyes,
so they can't move their eyes.
But when they intend to move their eyes, the world moves.
And so they're anticipating.
And clearly nobody would claim
that there is any symbol or any causality required to do that.
This is the kind of thing.
So that would be your model for...
In fact, your perception of the world is not the world as it is.
It's the world as it's going to be because there is, you know,
about 100 milliseconds delay between what you see and how your brain interprets what it is.
So in fact, your brain predicts 100 milliseconds in the future.
Your estimate of the world that you are consciously aware of was predicted from your
perception from a tenth of a second ago.
And that takes place everywhere.
So, you know, one of the current main theoretical concepts in systems neuroscience or
computational neuroscience is this idea of predictive coding, where basically everything in
the brain is trying to predict everything else in the brain and, you know, predict future
states of other parts of the brain and, you know, predict
things like that. Now, this doesn't mean that people who are, you know, pushing for this kind
of theory know how the brain works because between a general concept of this type and sort of
reduction to a practical algorithm, if you want, that you could sort of implement on the machine
is a huge distance that has not been, has not been bridged yet. So that's kind of a big program,
I think, for science for the next years.
I was curious earlier, and it's a question I neglected to ask, but that was about face detection and the existence of a face system. It seems to me that seeing your mother's face a lot doesn't really equip you to distinguish between faces.
That's right.
That's what the face system does. I mean, it doesn't only recognize this as a face. It is actually specialized in identifying faces. And that ability to learn a distinctive face very quickly, that seems to be, I mean, you know, it's the same problem. Again, I don't quite see how that gets done.
I don't know. You know, historically, also in computer vision,
face detection preceded reliable face recognition by a decade or two.
So we had reliable face detectors in the early 2000s,
and reliable face recognition didn't pop up until 15 years later, roughly,
with deep learning methods, actually.
And it works surprisingly well.
So you don't need a big neural net to be able to recognize, you know,
identify faces.
Identify faces and have a system that recognizes more faces than any human can
with, you know, the same level of accuracy.
You know, maybe not with the same sort of robustness to, you know, different changes in pose and facial hair and things like that.
But in terms of a number of different people being able to put a name on, they're incredibly, incredibly good, incredibly reliable.
So it may not be that hard of a task as we thought, perhaps.
I kind of thought that this was going to be a discussion more of how we could transpose human learning
from the human brain and turn that into AI.
But it seems like we're actually talking a little bit
about how we can learn about the human brain
from the construction of AI.
I think that this is actually happening.
But there is an interesting question to me as a psychologist.
It's a version of the Turing test.
So what would it take for the computer to fool you into thinking that it's human?
And I was thinking that one characteristic of it is mistakes that strike us as absurd, that is, they seem to violate some basic constraints. I mentioned earlier an object not being in two places at once. And there is, I think, a finite set of absurd mistakes. Or am I wrong?
And would it be part of, I mean, would it be of any interest to have a system that avoids the mistakes that people consider absurd,
and that makes mistakes of the kind that people tolerate?
I mean, is there such a category of absurd mistakes in your field?
Well, well, there's different kinds of mistakes, right?
I mean, there are computer vision systems that make stupid mistakes. So in the past, maybe 15 years ago, computer vision systems were okay at picking out and recognizing objects in images, but they were sometimes confused by the context.
Well, actually, they were not confused by the context, they just were not using the context at all.
And so, you know, a system would make a mistake of recognizing a face that wasn't really a face, and from the context you could tell it wasn't a face.
Or it would recognize an object that had the shape of, I don't know, a cow or something.
You can tell from the context that, you know, it cannot possibly be a cow if you are human.
So that was a big criticism towards computer vision system.
They don't take context into account.
And now the criticism you see is that they take too much context into account.
So there is this famous example of, you know, a modern vision system.
I mean, it's a system of a few years ago.
You train it to recognize, you know, things like cows and trains and airplanes and cars and stuff like that
by showing it thousands of examples of each category.
And then you show it a cow on a beach.
And it can't recognize it as a cow because every example of a cow it's seen was on, you know, green pasture.
And so the system actually learns to use the context and it uses it too much.
And the context, you know, had a spurious correlation with the category.
And the system doesn't have the common sense to say, well, that's a cow regardless of, you know,
because a cow could be very well sitting on the beach.
And so then how do you fix that, you know?
And how do you get that common sense?
I mean, that seems to be.
The common sense, exactly.
So I think this is not going to be solved, in my opinion, by, you know, more tweaks on the architectures and more
training data and things like this.
I mean, it may be mitigated, but I think it's not going to be fixed by that.
I think it's going to be fixed by systems that basically learn models of the world
and then, you know, have the ability to tell that it's perfectly possible that, you know,
it can be on the beach.
That's the kind of compositional nature of the world that, you know, even combination of things
you've never seen can very well happen.
And is that already
happening? Not really. So there's a thing, a very interesting thing, that is happening,
which is one of the hottest topic, as I was mentioning earlier in AI today is, or in machine
learning at least, is this idea of self-supervised learning, which I've been a big advocate of for
many years. And it's the idea that you train a system not to solve a particular task,
but to, you know, basically learn to represent the world in sort of a, you know, semi-generic way.
You train it by basically, you train it to predict, essentially.
So this works wonderfully for natural language understanding system.
The standard way nowadays, and it has become standard over the last three years, of training a natural language understanding system is you take a sequence of words, a sentence or something longer than that, maybe a few hundred words or a thousand words,
and you mask about 10 or 15% of those words.
You just replace them by a blank marker.
And you train some giant neural net to predict the words that are missing.
And in the process of doing so,
the network learns the nature of language, if you want.
So, you know, it learns that if you say the lion chases the blank in the savannah,
the blank, you know, is likely to be something like a wildebeest or an antelope or whatever, right?
But if you say the cat chases the blank in the kitchen, that's probably a mouse.
So a lot of semantics about the world, certainly the syntax, but also the semantics about the world,
ends up being represented by this network that is just trying to predict missing words, right?
Now, of course, the system can never predict exactly which word is missing.
You know, it could be antelope or wildebeest or zebra or whatever.
But it can easily represent a distribution over words,
like a probability distribution over words.
Because there is only a finite number of words in the dictionary in English.
So you just have a number, which is a score for how likely it is for this particular word to appear.
And that's how the systems handle the uncertainty in the prediction.
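A minimal sketch of that masked-word objective, with a toy vocabulary, random stand-in data, and an arbitrarily small encoder (roughly the shape of a PyTorch training step, not any particular production system):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID = 1000, 1          # toy vocabulary size and a reserved [MASK] id

class TinyMaskedLM(nn.Module):
    def __init__(self, vocab=VOCAB, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_vocab = nn.Linear(dim, vocab)        # a score for every word

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        return self.to_vocab(self.encoder(self.embed(tokens)))

def mask_tokens(tokens, p=0.15):
    """Hide ~15% of the words; the loss is computed only at the hidden positions."""
    corrupted, targets = tokens.clone(), torch.full_like(tokens, -100)
    hide = torch.rand(tokens.shape) < p
    corrupted[hide] = MASK_ID
    targets[hide] = tokens[hide]
    return corrupted, targets

model = TinyMaskedLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
tokens = torch.randint(2, VOCAB, (8, 64))            # stand-in for real sentences
corrupted, targets = mask_tokens(tokens)
logits = model(corrupted)                            # (8, 64, VOCAB)
# Softmax over the vocabulary is the "distribution over possible words";
# cross-entropy trains it to put mass on the word that was actually there.
loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1), ignore_index=-100)
loss.backward()
opt.step()
```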
Yeah, I mean, and the results are sort of amazing.
You get text that sounds like text in a particular style,
and it's coherent, and it's grammatical.
Right.
But it makes stupid mistakes.
It makes very stupid mistakes.
It makes absurd mistakes.
Absurd mistakes, yeah.
Yeah.
And I was wondering, what does it say that it makes absurd mistakes?
It's a question that we talked about before,
whether it knows what it's talking about.
And the sense is that if you just merely predict words, there is no content there, in that purely predictive system.
Well, I don't know.
There is a little bit of content, but it's very superficial, right?
So the understanding of the world about those systems is superficial.
There is some understanding, but it's very superficial.
So, yeah, if you ask questions like, you know, is the fifth leg
of a dog longer than the other four,
you know, the thing will say yes or no, right?
I mean, it won't tell you that dogs have four legs, right?
And, you know, and things of that type.
They can't count, you know.
I mean, there's a lot of things that they can't do,
and it's because they don't have any common sense,
and it's because all of this learning is not grounded in an underlying reality.
It's basically just from text,
and the amount of knowledge about the world that is encoded in all the text in the world,
that those systems have been trained on, and it's billions of words, is limited.
Like, most of human knowledge is not represented in any text in existence.
For example, you know, I take an object, I take my phone, and I put it on the table next to me, and I push the table.
You know, because of your physical intuition, that the object will move with the table.
When I push the table, the object that is on top of it will move with it, right?
There is nothing in any text in any part of the world that explains this.
And so a machine that is purely trained to predict missing words will not learn about this,
which is a sort of basic, really basic fact about the world. So I'm one of those people who
believe that truly intelligent systems will need to acquire knowledge, will need to be grounded
in some reality. It could be a simulated reality, it could be a virtual world,
but it has to be an environment that has its own logic and its own constraints and its own physics.
So you think that, in principle, if you took those transformers and you put them together with,
if you put them to work on videos, if you put language and videos together,
is that going to be a qualitative advance and is that beginning to happen?
Well, yeah, so yes and no.
So I think you should start with just video.
Can you build a neural net, or some learning system,
that will watch video all day and basically learn basic concepts about the world just by watching video?
The world is three-dimensional.
There are objects in front of others.
There are objects that are animate and inanimate.
There are objects whose trajectory is completely predictable, inanimate objects.
There are objects whose trajectory is not completely predictable, like, you know, the leaves on the tree, but, you know, qualitatively you can.
So there's some level of representation where you can predict what those things do.
And then objects that are very difficult to predict, they are animate objects, humans and things like that, right?
Or chaotic systems and whatever.
And so, you know, a lot of research goes into this, essentially.
None of them work at the moment.
There's a lot of systems that attempt to do video prediction,
and basically attempt to learn representations of the world that, in an abstract way,
can learn to predict what's going to happen in the video in the long term.
They all work within a few frames of a video, like a fraction of a second,
but then the predictions go really bad,
and the representations that are learned by those systems are not very good.
We can test them by using the representation as input to, let's say, an object classification system,
for example, and measure how many samples it takes for the system to learn
the concept of elephant, right? Does it still need 3,000 samples, 3,000 training examples,
or would it work okay with 3 or 4? And for those things, the answer is, the systems trained from video,
they don't work very well. There is one thing that is starting to work, and that gives us a lot
of hope. So it's a particular form of self-supervised learning, where you can simulate the video,
if you want. So you take an image and you transform this image in some way.
You change the colors a little bit, you change the scale, the orientation, you translate it a little bit, you do some, you know, manipulation to it.
It's called data augmentation.
And then you train a neural net, or two neural nets, rather, to basically map those two images to the same representation, the same vector.
Typically, it's a vector, right, a list of numbers, typically a couple thousand.
And so that's easy enough.
You have two images that basically represent the same content, two views
of the same object, for example.
And you train the system to, well,
learn a representation that tells you the content
of the image and not the details.
The difficulty is how do you make sure that when you show two
different objects, you will produce different representations?
And you have to avoid this problem.
This problem is called the collapse problem,
and the big question is how you avoid this.
So there's a lot of work, a really interesting work,
on what's called contrastive methods,
and also non-contrastive methods,
to do this.
And I'm really excited about this area of research because I think it's the germ.
I think it's our best shot at, you know, a path towards learning, you know, getting
machines to learn abstract representations, you know, and predictive models, where those
predictive models will not take place in the space of pixels,
or the space of whatever it is that your input is,
but in the space of abstract representations.
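A minimal sketch of that two-branch idea, with a toy encoder, crude stand-in augmentations, and an illustrative variance term in the spirit of the non-contrastive methods he mentions (VICReg-style); all sizes and weights here are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                      # toy encoder: 32x32 image -> 128-d vector
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
    nn.Linear(512, 128),
)

def augment(x):
    """Cheap stand-in for real augmentations (crops, color jitter, flips...)."""
    x = x + 0.05 * torch.randn_like(x)        # small noise, a proxy for color jitter
    if torch.rand(()) < 0.5:
        x = torch.flip(x, dims=[-1])          # random horizontal flip
    return x

images = torch.rand(64, 3, 32, 32)            # a batch of fake images
z1, z2 = encoder(augment(images)), encoder(augment(images))

invariance = F.mse_loss(z1, z2)               # two views of one image -> same vector
# Without a second term, the trivial solution is to map every image to the same
# constant vector: that is the "collapse". A variance hinge keeps each embedding
# dimension spread out across the batch, which is one way to rule the collapse out.
std = torch.cat([z1, z2], dim=0).std(dim=0)
anti_collapse = torch.relu(1.0 - std).mean()
loss = invariance + 25.0 * anti_collapse      # 25.0 is an arbitrary illustrative weight
loss.backward()
```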
But you think that the architectures that exist now will eventually,
that it's not going to take a different architecture?
Well, I think the architecture is not the crucial question here.
I think the architectural components are already here.
What I think is not here is the whole learning paradigm.
So let me take an example, right?
We can build a giant neural net where you feed it a few frames of a video
and you train it to predict the next frame or the next few frames.
You can do this with least squares, right?
So you just measure the sum of the squares of
the differences between the values of the predicted pixels
and the pixels that actually occur in the video.
And when you do this, the predictions you get from your neural net are blurry.
You get very blurry images as the prediction.
And the more you let the system predict far in the future,
the more blurry the predictions are.
And the reason for this is that you ask the system to make one prediction,
and the system cannot predict, a priori, if it's a video of us talking,
it cannot predict if I'm going to move my hands this way,
or I'm going to move my head to the left or to the right.
And so the only thing you can do is predict some sort of average
of all the possible things that can happen,
and that would be a very blurry version of us.
And so that's not a good model of the world.
Now, how do you represent the uncertainty in the prediction?
So in the case of text, it's easy because it's discrete.
You can just have a distribution over which words are possible,
but we don't have a good way of representing distributions over images, for example.
That's why those techniques don't work currently for video prediction,
and whatever representations they learn actually are not very good.
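A tiny numeric illustration of that blurriness argument, with two hand-made 8-by-8 "futures" standing in for the possible continuations of a clip (not a real video model):

```python
import torch

# Two equally likely "next frames": a bright bar on the left or on the right.
future_a = torch.zeros(8, 8); future_a[:, 2] = 1.0
future_b = torch.zeros(8, 8); future_b[:, 5] = 1.0

pred = torch.zeros(8, 8, requires_grad=True)   # the single frame we are allowed to predict
opt = torch.optim.SGD([pred], lr=1.0)
for _ in range(300):
    opt.zero_grad()
    # Expected squared error when either future is equally likely.
    loss = 0.5 * ((pred - future_a) ** 2).mean() + 0.5 * ((pred - future_b) ** 2).mean()
    loss.backward()
    opt.step()

print(pred[0].detach())
# Columns 2 and 5 both end up near 0.5: a washed-out blend of the two futures,
# not either one of them, which is exactly the blur described above.
```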
And that's because of the sheer complexity.
Is it a quantitative problem or is it a point where quantity becomes quality?
No, no, it's not a question of, you know, we don't have enough data or computers are not powerful enough or neural nets are not big enough.
It's a question of principle.
It's a question, it's a question of like, what is the right objective function?
What is the right way of representing uncertainty?
And then technical questions like what, how do you prevent this collapse I was talking about, which I didn't explain very well.
but which is a more technical issue.
But it has to do with basically representing uncertainty in the prediction.
The fact that there are, you know,
when you have an initial segment of a video,
there are many, many ways that are plausible to continue it.
And the machine has to basically represent all of those
or a good chunk of those possibilities.
And does it have to be predictive,
or could it work backward?
It could work both ways.
It doesn't even have to be controlled.
It's the way that I think the mind
works.
Oh, yeah, sure.
Really, you get predictions very short term, but by and large,
a lot of it is simply making sense after the fact.
You know, you can't predict everything.
Yeah.
I mean, if you're a police inspector, you get to a scene and you have to basically figure out
how it got there, right?
Yeah.
What's the sequence of events that, you know, led to this?
So, yeah, I mean, I think, you know, I used the example of predicting
a future event, but it could very well be, you know,
prediction of things you do not currently perceive, right?
So you do not currently perceive the back of my head.
But you have a pretty good idea what it looks like.
And if I show it to you, then you know, you can correct your internal model.
You're not very surprised, but, you know, maybe if I had a ponytail, you'd be surprised, slightly more surprised.
So, you know, the, it's not just prediction in time.
it can be prediction in space,
predicting a piece of an image from another piece,
predicting a piece of a percept that is currently occluded
or, you know, cannot be obtained,
and retrodiction, you know, predicting the past,
which you have not observed, from the present and maybe an evolution.
So, yeah, I'm not insisting that it has to be a forward prediction.
Forward prediction is useful for planning.
So if we're talking about a form of reasoning that would be planning a sequence of actions
to arrive at a particular result, then what you need is a forward prediction
that will predict the state of the world when we take an action.
Well, I mean, you know, the amount of progress that is occurring in your field is just amazing.
I have a hard time following.
I'm very envious.
I'm very...
I don't know, because
now, a few years ago,
I think I heard you say,
about a research
program, that we have the cherry
but we don't have the cake.
But you recognize that?
oh absolutely
Yeah, this has become a bit of a running joke
in the community now.
I use the analogy of the cake
to basically make very concrete
the fact that most of what we learn
as humans and animals
and in the future that machine will learn
is learned in this kind of self-supervised manner
basically by watching the world go by
and by taking an action once in a while
but in a non-task specific way
learning how the world works through self-supervision, basically.
That's the bulk of the cake.
So if intelligence is a cake,
the bulk of the cake is this type of learning.
That's where we learn everything.
And then there is a thin layer of, you know,
well, there is supervised learning, right?
You're taught at school or your parents, you know, teach you something
or you read a book or whatever.
Or you, you know, you show a young child, a picture book,
and you say, that's an elephant.
And with three examples, the baby has figured out
what an elephant is, a toddler, what an elephant is.
So that's supervised learning.
And then there is reinforcement learning where you learn a new skill by trial and error,
and you know, you get rewarded by your own success or not.
And that was the cherry on the cake.
The supervised learning was the icing, if you want.
And this was sort of a metaphor to tell people, like, you know,
we're currently focusing on the icing and the cherry,
but we haven't figured out how to bake the cake yet.
And so I used this joke saying that, you know, this was the dark matter of intelligence because, you know, it's kind of like physicists, right?
They tell you dark matter exists and it's, you know, most of the mass in the universe is dark matter and they have no idea what it is.
It's very embarrassing.
So we're in the same situation.
For a physicist, it's even worse, because there's dark matter and dark energy, and the combination of the two represents something like 95% of the mass in the universe.
So that's really, really embarrassing.
And we have no idea what it is.
So we're in the same situation, which means, you know, we have a lot of work to do.
I mean, is the progress in your field exponential?
Yeah, before you answer, I'm just watching the clock.
You guys can go as long as you want, but I'm just saying that whenever you're ready.
Oh, yeah.
Yeah.
I'm past my hard stop.
Just the entire question.
Okay, sounds good.
I think it's increasing, because there
are more and more people joining the party and more governments and companies investing money in it.
So you see a growth.
You see a growth in applications.
You don't necessarily see a growth, an exponential growth, in sort of new concepts.
But there are new concepts that are coming up at a surprisingly high speed.
And so I guess one big question is, so, you know, one question is, are we going to have another AI winter like there were in the past?
And my answer to this is probably no, or at least not to the same extent that we had in the past,
because there's a big industry now behind AI.
It's useful for a lot of things.
You take deep learning out of Google and Meta and a few other companies, and they crumble.
I mean, they're completely built around it now.
And certainly there's a lot of other areas also.
So I think it's not going to collapse as it used to.
The question is, are we going to make it,
make this next step towards, you know, machine common sense, self-supervised learning, etc., before
people funding all of this get tired? And to that, I don't know the answer. I'm just hoping that,
you know, we make progress fast.
Fascinating question. Thank you. Thank you. Thank you, Danny. It's always a pleasure chatting with you.
And thanks to both of you. It's amazing to hear you talk about this. I feel like I was just in an ultra-graduate-level course. We should have more psychology and AI together, because I think it's just amazing to hear you guys bounce these concepts around. And in terms of energy, Yann, clearly there's energy in the field. It's amazing hearing you, you know, no drop-off from the five years we've known each other, maybe even more so. Thank you, Danny. Thank you, Yann. So great having you here. Yeah, let's do it again.
Yes, I'd love to do it again. For me, thank you. Okay, all right, thank you.
Take care, Danny. Thank you to Nate Gwattany for doing the editing, Red Circle for hosting, and all of you, the listeners. We'll be back next Wednesday for another episode of the Big Technology Podcast. Until then, we'll see you.