Lex Fridman Podcast - Ian Goodfellow: Generative Adversarial Networks (GANs)
Episode Date: April 18, 2019
Ian Goodfellow is the author of the popular textbook on deep learning (simply titled "Deep Learning"). He coined the term Generative Adversarial Networks (GANs) and with his 2014 paper is responsible for launching the incredible growth of research on GANs. Video version is available on YouTube. If you would like to get more information about this podcast go to https://lexfridman.com/ai or connect with @lexfridman on Twitter, LinkedIn, Facebook, Medium, or YouTube where you can watch the video versions of these conversations.
Transcript
The following is a conversation with Ian Goodfellow.
He's the author of the popular textbook on deep learning, simply titled Deep Learning.
He coined the term generative adversarial networks, otherwise known as GANs.
And with his 2014 paper is responsible for launching the incredible growth of research
and innovation in this subfield of deep learning. He got his BS and MS at Stanford, his PhD at the University of Montreal with Yoshua Bengio and Aaron Courville.
He held several research positions, including at OpenAI, Google Brain, and now at Apple
as the director of machine learning.
This recording happened while Ian was still at Google Brain, but we don't
talk about anything specific to Google or any other organization. This conversation is
part of the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, iTunes,
or simply connect with me on Twitter at Lex Fridman spelled F-R-I-D. And now, here's my conversation with Ian Goodfellow.
You open your popular deep learning book with a Russian doll type diagram that shows deep learning as a subset of representation learning, which in turn is a subset of machine learning
and finally a subset of AI.
So this kind of implies that there may be limits to deep learning in the context of AI.
So what do you think are the current limits of deep learning?
And are those limits something that we can overcome with time?
Yeah, I think one of the biggest limitations of deep learning is that right now it requires
really a lot of data, especially labeled data.
There are some unsupervised and semi-supervised learning algorithms that can reduce the amount
of labeled data you need, but they still require a lot of unlabeled data.
Reinforcement learning algorithms, they don't need labels, but they need really a lot of
experiences.
As human beings, we don't learn to play pong by failing at pong two million times.
So just getting the generalization ability better is one of the most important bottlenecks
in the capability of the technology today.
And then I guess I'd also say deep learning
is like a component of a bigger system.
So far nobody is really proposing to have
only what you'd call deep learning
as the entire ingredient of intelligence.
You use deep learning as submodules of other systems.
Like, AlphaGo has a deep learning model that estimates the value function.
Most reinforcement learning algorithms have a deep learning module that
estimates which action to take next, but you might have other components.
You're basically building a function estimator.
Do you think it's possible?
You said nobody's kind of been thinking about this so far,
but do you think neural networks can be made to reason
in the way symbolic systems did in the 80s and 90s
to do more, create more like programs as opposed to functions?
Yeah, I think we already see that a little bit.
I already kind of think of neural nets as a kind of program. I think of deep
learning as basically learning programs that have more than one step. So if you draw a flow chart
or if you draw a TensorFlow graph describing your machine learning model, I think of the depth of
that graph as describing the number of steps that run in sequence, and then the width of that graph as the number of steps that run in parallel.
Now it's been long enough that we've had deep learning working that it's a little bit
silly to even discuss shallow learning anymore.
Back when I first got involved in AI, when we used machine learning, we were usually
learning things like support vector machines.
You could have a lot of input features to the model, and you could multiply each feature
by a different weight.
All those multiplications were done in parallel to each other.
There wasn't a lot done in series.
I think what we got with deep learning was really the ability to have steps of a program
that run in sequence.
I think that we've actually started to see that what's important with deep learning
is more the fact that we have a multi-step program rather than the fact that we've learned a representation.
If you look at things like ResNets, for example, they take one particular kind of representation
and they update it several times.
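As a rough illustration of that view, here is a toy sketch in Python (with made-up sizes and random weights) where the depth is the number of steps run in sequence, the width is the number of units computed in parallel, and each step refines the same representation rather than replacing it:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_step(h, W):
    # One "step of the program": refine the representation h, don't replace it.
    return h + relu(h @ W)

width = 8      # units computed in parallel at each step
depth = 150    # steps run in sequence
weights = [0.01 * rng.standard_normal((width, width)) for _ in range(depth)]

h = rng.standard_normal(width)   # initial representation of the input
for W in weights:                # layer 150 is just a later update of the same
    h = residual_step(h, W)      # representation, not a "grandmother cell"
```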
Back when deep learning first really took off in the academic world in 2006, when Jeff
Hinton showed that you could train
deep belief networks. Everybody who was interested in the idea thought of it as each layer learns
a different level of abstraction, that the first layer trained on images learns something like edges
and the second layer learns corners and eventually you get these kind of grandmother cell units
that recognize specific objects. Today I think most people think of it more as a computer
program where as you add more layers, you can do more updates before you output your final number.
But I don't think anybody believes that layer 150 of the ResNet is a grandmother cell and
layer 100 is contours or something like that. Okay, so you think, you're not thinking of it as a singular representation that keeps
building. You think of it as a program, sort of almost like a state. Representation is
a state of understanding.
Yeah. I think of it as a program that makes several updates and arrives at better and
better understandings, but it's not replacing the representation at each step. It's refining it.
And in some sense, that's a little bit like reasoning.
It's not reasoning in the form of deduction, but it's reasoning in the form of taking
a thought and refining it and refining it carefully until it's good enough to use.
So, do you think, and I hope you don't mind, we'll jump philosophical every once in a
while, do you think of cognition, human cognition, or even consciousness as simply a result of
this kind of sequential representation learning?
Do you think that can emerge?
Cognition, yes, I think so.
Consciousness, it's really hard to even define what we mean by that.
I guess consciousness is often defined as things like having self-awareness.
And that's relatively easy to turn into something actionable for a computer scientist
to reason about.
People also define consciousness in terms of having qualitative states of experience,
like qualia.
And there's all these philosophical problems, like could you imagine a zombie who does all the same information processing as a human, but
doesn't really have the qualitative experiences that we have?
That sort of thing, I have no idea how to formalize or turn into a scientific question.
I don't know how you could run an experiment to tell
whether a person is a zombie or not. And similarly, I don't know how you could run an experiment to tell
whether an advanced AI system had become conscious in the sense of qualia or not.
But in the more practical sense, like almost like self-attention,
you think consciousness and cognition can,
in an impressive way, emerge from current types of architectures
that we use for learning?
Or if you think of consciousness in terms of self-awareness and just making plans
based on the fact that the agent itself exists in the world, reinforcement learning algorithms
are already more or less forced to model the agent's effect on the environment.
So that more limited version of consciousness is already something that we get limited versions
of with reinforcement learning algorithms if they're trained well.
But you say limited.
So the big question really is how you jump from limited to human level.
Yeah.
Right.
And whether it's possible, you know, even just building common sense reasoning
seems to be exceptionally difficult.
So if we scale things up,
if we get much better on supervised learning,
if we get better at labeling,
if we get bigger data sets,
more compute, do you think we'll start to see
really impressive things that go from limited to
something, echoes of human-level cognition?
I think so, yeah. I'm optimistic about what can happen just with more computation and more data.
I do think it'll be important to get the right kind of data. Today, most of the machine learning
systems we train are mostly trained on one type of data for each model. But the human brain, we get all of our different senses,
and we have many different experiences,
like riding a bike, driving a car,
talking to people, reading.
I think when we get that integrated data set,
working with a machine learning model
that can actually close the loop and interact,
we may find that algorithms not so different from what we have today learn really interesting
things when you scale them up a lot and train them on a large amount of multimodal data.
So multimodal data is really interesting, but within one mode of data, could we select better which are the
difficult cases that we should learn from most?
Oh, yeah.
Like, could we get a whole lot of mileage out of designing a model that's resistant
to adversarial examples or something like that?
Right.
Yeah.
My thinking has evolved a lot over the last few years.
Interesting.
When I first started to really invest in studying adversarial examples, I was thinking
of it mostly as adversarial examples reveal a big problem with machine learning.
And we would like to close the gap between how machine learning models respond to adversarial
examples and how humans respond. After studying the problem more, I still think that adversarial examples are important.
I think of them now more as a security liability than as an issue that necessarily shows
there's something uniquely wrong with machine learning as opposed to humans.
Also do you see them as a tool to improve the performance of the system, not on the security side,
but literally just accuracy?
I do see them as a kind of tool on that side, but maybe not quite as much as I used to think.
We've started to find that there's a trade-off between accuracy on adversarial examples
and accuracy on clean examples.
Back in 2014, when I did the first adversarially trained classifier that showed
resistance to some kinds of adversarial examples, it also got better at the clean data on
MNIST. And that's something we've replicated several times in MNIST, that when we train
against weak adversarial examples, MNIST classifiers get more accurate. So far, that hasn't
really held up on other data sets and hasn't held up when
we train against stronger adversaries.
It seems like when you confront a really strong adversary, you tend to have to give something
up.
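A minimal sketch of that kind of adversarial training, using the one-step fast gradient sign method as the weak adversary; the `model`, `loss_fn`, and `optimizer` here are assumed PyTorch-style stand-ins rather than any particular published setup:

```python
import torch

def fgsm_example(model, loss_fn, x, y, eps=0.1):
    # Weak, one-step adversary: nudge the input in the direction that
    # most increases the loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def adversarial_training_step(model, loss_fn, optimizer, x, y, eps=0.1):
    # Train on clean and adversarial examples together; the trade-off
    # mentioned above is between accuracy on these two kinds of batches.
    x_adv = fgsm_example(model, loss_fn, x, y, eps)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```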
Interesting.
But it's such a compelling idea because it feels like that's how humans learn the difficult
cases.
We try to think of what would we screw up
and then we make sure we fix that.
Yeah.
It's also in a lot of branches of engineering,
you do a worst case analysis and make sure
that your system will work in the worst case.
And then that guarantees that it'll work
in all of the messy average cases that happen
when you go out into a really randomized world.
Yeah, with autonomous vehicles, there seems to be a desire to just look for,
to think adversarially, to try to figure out how to mess up the system, and if you can be robust
to all those difficult cases, then it's a hand-wavy, empirical way to show your
system is safe.
Yeah, yeah.
Today, most adversarial example research
isn't really focused on a particular use case,
but there are a lot of different use cases
where you'd like to make sure that the adversary
can't interfere with the operation of your system.
Like in finance, if you have an algorithm
making trades for you, people go to a lot of effort
to obfuscate their algorithm.
That's both to protect their IP, because you don't want to research and develop a profitable
trading algorithm and then have somebody else capture the gains. But it's at least
partly because you don't want people to make adversarial examples that fool
your algorithm into making bad trades. Or I guess one area that's been popular in the academic literature is speech recognition.
If you use speech recognition to hear an audio waveform and then turn that into
a command that a phone executes for you, you don't want a malicious adversary to be
able to produce audio that gets interpreted as malicious commands, especially if a human
in the room doesn't realize
that something like that is happening.
And in speech recognition, there has been much success
in being able to create adversarial examples
that fool the system.
Yeah, actually, I guess the first work that I'm aware of
is a paper called Hidden Voice Commands
that came out in 2016, I believe. And they were able to show that
they could make sounds that are not understandable by a human, but are recognized as the target
phrase that the attacker wants the phone to recognize it as. Since then, things have gotten a little bit better on the attacker side and worse on the defender side.
It's become possible to make sounds that sound like normal speech,
but are actually interpreted as a different
sentence than the human hears.
The level of perceptibility of the adversarial perturbation
is still kind of high.
When you listen to the recording,
it sounds like there's some noise in the background, just like rustling sounds. But those
rustling sounds are actually the adversarial perturbation that makes the phone hear a completely
different sentence.
Yeah, that's so fascinating. Peter Norvig mentioned that you're writing the deep learning
chapter for the fourth edition of the Artificial Intelligence: A Modern Approach book. So how do you even begin summarizing
the field of deep learning in a chapter?
Well, in my case, I waited like a year
before I actually wrote anything.
Even having written a full-length textbook
before, it's still pretty intimidating
to try to start writing just one chapter that covers everything.
One thing that helped me make that plan was actually the experience I've had in having
written the full book before and then watching how the field changed after the book came
out.
I've realized there's a lot of topics that were maybe extraneous in the first book and
just seeing what stood the test of a few years of
being published and what seems a little bit less important to have included now
helped me pare down the topics I wanted to cover for the book. It's also really
nice now that the field is kind of stabilized to the point where some core ideas
from the 1980s are still used today. When I first started studying machine
learning almost everything from the 1980s had been rejected and now some of it has come back. So that stuff that's
really stood the test of time is what I focused on putting into the book. There's also, I
guess, two different philosophies about how you might write a book. One philosophy is
you try to write a reference that covers everything. The other philosophy is you try to
provide a high-level summary
that gives people the language to understand a field
and tells them what the most important concepts are.
The first deep learning book that I wrote with Yoshua and Aaron was
somewhere between the two philosophies that it's trying to be
both a reference and an introductory guide.
Writing this chapter for Russell and Norvig's book,
I was able to focus more on just a concise introduction
of the key concepts and the language
you need to read about them more.
In a lot of cases, I actually just wrote paragraphs that said,
here's a rapidly evolving area that you should pay attention to.
It's pointless to try to tell you what the latest and best
version of a learn to learn model is.
I can point you to a paper that's recent right now, but there isn't a whole lot of a reason
to delve into exactly what's going on with the latest learning to learn approach or the
latest module produced by a learning to learn algorithm.
You should know that learning to learn is a thing in that it may very well be the source of the latest and greatest convolutional net or recurrent net
module that you would want to use in your latest project. But there isn't a lot of point in trying
to summarize exactly which architecture and which learning approach got to which level of performance.
So you may be focused more on the basics of the methodology.
So from backpropagation to feedforward to recurrent to convolutional networks,
that kind of thing.
Yeah.
Yeah.
So if I were to ask, I remember I took an algorithms and data structures course.
Of course.
And I remember the professor asked, what is an algorithm?
And yelled at everybody in a good way that nobody was answering it correctly.
Everybody knew what the algorithm was.
It was a graduate course.
Everybody knew what an algorithm was, but they weren't able to answer it well.
Let me ask you in that same spirit, what is deep learning?
I would say deep learning is any kind of machine learning
that involves learning parameters of more than one consecutive step.
So by that I mean, shallow learning is things
where you learn a lot of operations that happen in parallel.
You might have a system that makes multiple steps,
like you might have hand-designed feature
extractors, but really only one step is learned.
Deep learning is anything where you have multiple operations in sequence, and that includes
the things that are really popular today, like convolutional networks and recurrent networks.
But it also includes some of the things that have died out, like Boltzmann machines, where
we weren't using backpropagation.
Today I hear a lot of people define deep learning as gradient descent applied to these
differentiable functions.
And I think that's a legitimate usage of the term.
It's just different from the way that I use the term myself. So what's an example of deep learning that is not gradient descent and differentiable functions?
In your, I mean, not specifically perhaps, but more even looking into the future. What's your thought about that space of approaches?
Yeah. So I tend to think of machine learning algorithms as decomposed into really three different pieces.
There's the model, which can be something like a neural net
or a Boltzmann machine or a recurrent model.
And that basically just describes how do you take data
and how do you take parameters and what function
do you use to make a prediction, given the data
and the parameters.
Another piece of the learning algorithm is the optimization algorithm.
Or not every algorithm can be really described in terms of optimization, but
what's the algorithm for updating the parameters or
updating whatever the state of the network is.
And then the last part is the data set,
like how do you actually represent the world as
it comes into your machine learning system.
So I think of deep learning as telling us something about what does the model look like?
And basically to qualify as deep, I say that it just has to have multiple layers.
That can be multiple steps in a feed-forward differentiable computation, or that can be multiple layers in a graphical model.
There's a lot of ways that you could satisfy me that something has
multiple steps that are each parameterized separately.
I think of gradient descent as being all about that other piece, how do you
actually update the parameters piece?
So you can imagine having a deep model, like a convolutional net,
and training it with something like evolution or a genetic algorithm. And I would say that
still qualifies as deep learning. And then in terms of models that aren't necessarily
differentiable, I guess Boltzmann machines are probably the main example of something
where you can't really take a derivative and use that for the learning process,
but you can still argue that the model has
many steps of processing that it applies
when you run inference in the model.
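As a sketch of that point, that a model can be deep without being trained by gradient descent, here is a toy multi-step model trained with a simple evolution-strategy update instead of backpropagation; the task, sizes, and hyperparameters are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(params, x):
    h = x
    for W in params[:-1]:
        h = np.tanh(h @ W)       # several learned steps run in sequence
    return h @ params[-1]        # final linear step

def evolve(loss, params, steps=100, pop=50, sigma=0.05, lr=0.01):
    # No gradients: perturb the parameters, score each perturbation,
    # and move toward the perturbations that lowered the loss.
    for _ in range(steps):
        noises = [[sigma * rng.standard_normal(W.shape) for W in params]
                  for _ in range(pop)]
        losses = np.array([loss([W + n for W, n in zip(params, noise)])
                           for noise in noises])
        advantage = losses.mean() - losses          # lower loss is better
        for i, W in enumerate(params):
            W += lr / (pop * sigma) * sum(a * noise[i]
                                          for a, noise in zip(advantage, noises))
    return params

# Toy task: fit a fixed random target with a three-step (deep) model.
x = rng.standard_normal((32, 4))
target = rng.standard_normal((32, 2))
params = [0.1 * rng.standard_normal(s) for s in [(4, 8), (8, 8), (8, 2)]]
params = evolve(lambda p: float(((forward(p, x) - target) ** 2).mean()), params)
```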
So it's the steps of processing that's key.
So Jeff Hinton suggests that we need to throw away
backpropagation and start all over.
What do you think about that?
What could an alternative direction
of training neural networks look like?
I don't know that backpropagation
is going to go away entirely.
Most of the time, when we decide
that a machine learning algorithm
isn't on the critical path to research
for improving AI,
the algorithm doesn't die,
it just becomes used for some specialized set of things.
A lot of algorithms like logistic regression don't seem that exciting to AI researchers who
are working on things like speech recognition or autonomous cars today, but there's still
a lot of use for logistic regression in things like analyzing really noisy data in medicine
and finance or making really rapid predictions in really
time limited contexts.
So I think back propagation and gradient descent are around to stay, but they may not end up
being everything that we need to get to real human level or super human AI.
Are you optimistic about us discovering?
Back propagation has been around for a few decades.
So are you optimistic about us as a community being able to discover something better?
Yeah, I am. I think we likely will find something that works better. You could imagine
things like having stacks of models where some of the lower level models predict
parameters of the higher level models. And so at the top level you're not
learning in terms of literally calculating gradients, but just predicting how
different values will perform. You can kind of see that already in some areas
like Bayesian optimization, where you have a Gaussian process that predicts how well
different parameter values will perform. We already use those kinds of algorithms
for things like hyper parameter optimization.
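A small sketch of that kind of algorithm: a Gaussian process predicts how well different hyperparameter values will perform, and the next value to try is chosen optimistically. The `validation_error` function here is a made-up stand-in for training a model and measuring its error:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def validation_error(log_lr):
    # Stand-in for "train with this learning rate and report the error".
    return (log_lr + 3.0) ** 2 + 0.05 * rng.standard_normal()

tried = list(rng.uniform(-6.0, 0.0, size=3))           # a few random starts
scores = [validation_error(v) for v in tried]
candidates = np.linspace(-6.0, 0.0, 200).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor().fit(np.array(tried).reshape(-1, 1), scores)
    mean, std = gp.predict(candidates, return_std=True)
    nxt = float(candidates[np.argmin(mean - std)])      # optimistic pick
    tried.append(nxt)
    scores.append(validation_error(nxt))

best = tried[int(np.argmin(scores))]   # best setting found by the search
```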
And in general, we know a lot of things other than BackProp that work really well for specific problems.
The main thing we haven't found is a way of taking one of these other non-backprop-based algorithms
and having it really advance the state of the art on an AI-level problem.
Right.
But I wouldn't be surprised if eventually we find
that some of these algorithms that even the ones that
already exist, not even necessarily a new one,
we might find some way of customizing one of these algorithms
to do something really interesting at the level of cognition
or the level of, I think one system
that we really don't have working quite right yet is
like short-term memory. We have things like LSTMs, they're called long short-term memory.
They still don't do quite what a human does with short-term memory.
Like gradient descent to learn a specific fact has to do multiple steps on that fact.
Like, if I tell you the meeting today is at 3 p.m. I don't need to say over and over again.
It's at 3 p.m. It's at 3 p.m. It's at 3 p.m. It's at 3 p.m. It's at 3 p.m.
Right. For you to do a gradient step on each one. You just hear it once and you remember it.
There's been some work on things like self-attention and attention-like mechanisms like the neural Turing machine that can write to memory cells and update those
cells with facts like that right away, but I don't think we've really nailed it yet, and
That's one area where I'd imagine that
new optimization algorithms or different ways of applying existing optimization algorithms could give us a way of just
lightning-fast updating the state of a machine learning system to contain a specific fact like that
without needing to have it presented over and over and over again.
So some of the success
of symbolic systems in the 80s is they were able to assemble these kinds of facts
better, but there's a lot of expert input required
and it's very limited in that sense.
Do you ever look back to that as something
that we'll have to return to eventually
sort of dust off the book from the shelf
and think about how we build knowledge representation,
knowledge bases,
where we have to use graph searches.
Graph searches, right.
And like first-order logic and entailment
and things like that.
Yeah, exactly. In my particular line of work, which has mostly been machine learning security
and also generative modeling, I haven't usually found myself moving in that direction.
For generative models, I could see how it could be useful if you had something like a
differentiable knowledge base, or some other kind of knowledge base where it's possible for some of our
fuzzier machine learning algorithms to interact with the knowledge base.
I mean, a neural network is kind of like that. It's a differentiable knowledge base of sorts. Yeah, but
if we had a really easy way of
giving feedback to machine learning models, that would clearly
help a lot with generative models.
And so you could imagine one way of getting there would be get a lot better at natural
language processing.
But another way of getting there would be take some kind of knowledge base and figure
out a way for it to actually interact with the neural network.
Being able to have a chat with the neural network.
Yeah.
So like one thing in generative models we see a lot today
is you'll get things like faces that are not symmetrical.
Like people that have two eyes that are different colors.
I mean, there are people with eyes that
are different colors in real life,
but not nearly as many of them as you
tend to see in the machine learning generated data.
So if you had either a knowledge base that
could contain the fact, people's faces are generally approximately symmetric, and eye
color is especially likely to be the same on both sides. Being able to just inject that
hint into the machine learning model without it having to discover that itself after studying
a lot of data would be a really useful feature.
I could see a lot of ways of getting there
without bringing back some of the 1980s technology,
but I also see some ways that you could imagine
extending the 1980s technology to play nice
with neural nets and have it help get there.
Awesome.
So you talked about the story of you coming up
with the idea of GANs at a bar with some friends.
You were arguing that this, you know, GANs would work, these generative adversarial networks, and the others didn't
think so. Then you went home, and at midnight coded it up, and it worked. So if I was a friend
of yours at the bar, I would also have doubts. It's a really nice idea, but I'm very skeptical
that it would work. What was the basis of their skepticism?
What was the basis of your intuition?
Why it should work?
I don't want to be someone who goes around promoting alcohol for the purposes of science,
but in this case, I do actually think that drinking helped a little bit.
When your inhibitions are lowered, you're more willing to try out things that you wouldn't try out otherwise.
So I have noticed in general
that I'm less prone to shooting down some of my ideas
when I have had a little bit to drink.
I think if I had that idea at lunchtime,
I probably would have thought.
It's hard enough to train one neural net.
You can't train a second neural net
in the inner loop of the outer neural net.
That was basically my friend's objection, was that trying to train two neural nets at
the same time would be too hard.
It was more about the training process? Because my skepticism would be, I'm sure you could
train it, but the thing it would converge to would not be able to generate anything reasonable,
any kind of reasonable realism.
Yeah, so part of what all of us were thinking about
when we had this conversation was deep Boltzmann machines,
which a lot of us in the lab, including me,
were a big fan of deep Boltzmann machines at the time.
They involved two separate processes running at the same time.
One of them is called the positive phase,
where you load data into the model
and tell the model to make the data more likely.
The other one is called the negative phase,
where you draw samples from the model
and tell the model to make those samples less likely.
In a deep Boltzmann machine, it's not trivial to generate
a sample.
You have to actually run an iterative process
that gets better and better samples coming closer and closer to the distribution the model represents.
So during the training process, you're always running these two systems at the same time.
One that's updating the parameters of the model and another one that's trying to generate
samples from the model.
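Roughly, the two phases look like this, sketched for a restricted Boltzmann machine rather than a deep one and ignoring bias terms; it is only meant to show the positive phase pushing the training data up and the negative phase pushing model samples down:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))

def positive_negative_update(v_data, W, lr=0.1, gibbs_steps=10):
    # Positive phase: make the training data more likely.
    h_data = sigmoid(v_data @ W)
    positive_grad = np.outer(v_data, h_data)

    # Negative phase: draw a sample from the model with a short Gibbs
    # chain and make that sample less likely.
    v_model = v_data.copy()
    for _ in range(gibbs_steps):
        h_model = (sigmoid(v_model @ W) > rng.random(n_hidden)).astype(float)
        v_model = (sigmoid(h_model @ W.T) > rng.random(n_visible)).astype(float)
    negative_grad = np.outer(v_model, sigmoid(v_model @ W))

    return W + lr * (positive_grad - negative_grad)

v = rng.integers(0, 2, n_visible).astype(float)   # one toy training example
W = positive_negative_update(v, W)
```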
And they worked really well on things like MNIST, but a lot of us in the lab, including
me, had tried to get deep Boltzmann machines to scale past MNIST to things like generating color photos
and we just couldn't get the two processes to stay synchronized.
So when I had the idea for GANs, a lot of people thought that the discriminator would
have more or less the same problem as the negative phase in the Boltzmann machine, that
trying to train the discriminator in the inner loop, you just couldn't get it to keep up with the generator in the outer loop.
And that would prevent it from converging to anything useful.
Yeah, I share that intuition.
Yeah.
But it turned out to not be the case.
Oh, a lot of the time with machine learning algorithms, it's really hard to predict ahead
of time how well they'll actually perform.
You have to just run the experiment and see what happens.
And I would say I still today don't have like one factor I can put my finger on
and say this is why GANs worked for photo generation and deep Boltzmann machines
don't. There are a lot of theory papers showing that under some theoretical
settings the GAN algorithm does actually converge.
But those settings are restricted enough that they don't necessarily explain the whole
picture in terms of all the results that we see in practice.
So taking a step back, can you, in the same way as we talked about deep learning, can you
tell me what generative adversarial networks are?
Yeah, so generative adversarial networks
are a particular kind of generative model.
A generative model is a machine learning model
that can train on some set of data,
like so you have a collection of photos of cats,
and you want to generate more photos of cats,
or you want to estimate a probability distribution
over cats so you can ask how
likely it is that some new image is a photo of a cat.
GANs are one way of doing this.
Some generative models are good at creating new data.
Other generative models are good at estimating that density function and telling you how likely
particular pieces of data are to come from the same distribution as the training data.
GANs are more focused on generating samples rather than estimating the density function.
There are some kinds of GANs, like Flow-GANs, that can do both, but mostly GANs are about generating
samples, generating new photos of cats that look realistic. And they do that completely from scratch.
It's analogous to human imagination.
When a GAN creates a new image of a cat,
it's using a neural network to produce a cat that has not existed before.
It isn't doing something like compositing photos together.
You're not literally taking the eye off of one cat and the ear off of another cat.
It's more of this digestive process
where the neural net trains in a lot of data
and comes up with some representation
of the probability distribution
and generates entirely new cats.
There are a lot of different ways
of building a generative model.
What's specific to GANs is that we have a two-player game
in the game theoretic sense. And as the players
in this game compete, one of them becomes able to generate realistic data. The first player
is called the generator. It produces output data, such as just images, for example. And
at the start of the learning process, it'll just produce completely random images. The
other player is called the discriminator. The discriminator takes images as input
and guesses whether they're real or fake.
You train it both on real data, so photos
that come from your training set, actual photos of cats.
And you try to say that those are real.
You also train it on images that come from the generator
network.
And you try to say that those are fake.
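A minimal sketch of one step of that game, assuming PyTorch-style `generator` and `discriminator` modules (with the discriminator ending in a sigmoid) and their optimizers; every name here is an illustrative stand-in:

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real_images, z_dim=100):
    batch = real_images.size(0)
    real_label = torch.ones(batch, 1)
    fake_label = torch.zeros(batch, 1)

    # Discriminator: say "real" for training data, "fake" for generated data.
    fake_images = generator(torch.randn(batch, z_dim)).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_label) +
              F.binary_cross_entropy(discriminator(fake_images), fake_label))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: try to fool the discriminator into saying "real".
    g_loss = F.binary_cross_entropy(
        discriminator(generator(torch.randn(batch, z_dim))), real_label)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```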
As the two players compete in this game,
the discriminator tries to become better at recognizing whether images are real or fake,
and the generator becomes better at fooling the discriminator into thinking that its
outputs are real. And you can analyze this through the language of game theory and find
that there's a Nash equilibrium where the generator has captured
the correct probability distribution. So in the cat example, it makes perfectly realistic
cat photos, and the discriminator is unable to do better than random guessing, because
all the samples coming from both the data and the generator look equally likely to have
come from either source.
So, do you ever sit back, and does it just blow your mind that this thing works?
So from very little, it's able to estimate the density function and to generate
realistic images.
I mean, does it, yeah, do you ever sit back?
Yeah.
How does this even, why, this is quite incredible, especially where GANs have gone in terms of realism?
Yeah, and not just to flatter my own work,
but generative models, all of them have this property
that if they really did what we ask them to do,
they would do nothing but memorize the training data.
Right, exactly.
Models that are based on maximizing the likelihood,
the way that you obtain the maximum likelihood for a specific training set is you assign all of your probability mass to the training examples and nowhere else.
For GANs, the game is played using a training set, so the way that you become unbeatable in the game is you literally memorize training examples.
One of my former interns wrote a paper, his name is Vaishnavh Nagarajan, and he showed that it's actually hard for the generator to memorize the training data, hard in a statistical
learning theory sense, that you can actually create reasons for why it would require quite
a lot of learning steps
and a lot of observations of different variables
before you can memorize the training data.
That still doesn't really explain why
when you produce samples that are new,
why do you get compelling images
rather than just garbage that's different
from the training set?
And I don't think we really have a good answer for that,
especially if you think about how many
possible images are out there and how few images the generative model sees during training.
It seems just unreasonable that generative models create new images as well as they do,
especially considering that we're basically training them to memorize rather than generalize.
I think part of the answer is there's a paper called Deep Image Prior,
where they show that you can take a convolutional net
and you don't even need to learn the parameters of it at all.
You just use the model architecture.
And it's already useful for things like in-painting images.
I think that shows us that the convolutional network architecture
captures something really important about the structure of images.
And we don't need to actually use the learning to capture all the information coming out
of the convolutional net.
That would imply that it would be much harder to make generative models in other domains.
So far, we're able to make reasonable speech models and things like that.
But to be honest, we haven't actually explored a whole lot of different data sets all that
much.
We don't, for example, see a lot of deep learning models of biology data sets, where you
have lots of microarrays measuring the amount of different enzymes and things like that.
So we may find that some of the progress that we've seen for images and speech turns out
to really rely heavily on the model architecture.
And we were able to do what we did for vision by trying to reverse engineer the human visual system.
And maybe it'll turn out that we can't just use that same trick for arbitrary kinds of data.
Right. So there's an aspect of the human vision system, the hardware of it, that makes it, without
learning, without cognition, just makes it really effective at detecting the patterns we've
seen in the visual world. Yeah. That's really interesting. What, in a big, quick overview,
in your view, what types of GANs are there and what other generative models besides GANs are there?
Yeah, so it's maybe a little bit easier to start
with what kinds of generative models are there other than GANs.
So most generative models are likelihood-based,
where to train them, you have a model that tells you
how much probability it assigns to a particular
example, and you just maximize the probability assigned to all the training examples.
It turns out that it's hard to design a model that can create really complicated images
or really complicated audio waveforms, and still have it be possible to estimate the likelihood function from a computational point of view. Most
interesting models that you would just write down intuitively, it turns out that it's almost
impossible to calculate the amount of probability they assigned to a particular point. So there's
a few different schools of generative models in the likelihood family. One approach is to very carefully design
the models so that it is computationally tractable to measure the density it assigns to a particular
point. So there are things like autoregressive models, like PixelCNN, those basically break
down the probability distribution into a product over every single feature.
So for an image, you estimate the probability of each pixel, given all of the pixels that came before it.
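In symbols, that factorization over pixels $x_1, \dots, x_n$ in some fixed ordering is

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}).$$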
There's tricks where if you want to measure the density function, you can actually calculate the density for all these pixels, more or less in parallel.
Generating the image still tends to require you to go one pixel at a time, and
that can be very slow. But there are again tricks for doing this in a hierarchical pattern
where you can keep the runtime under control.
Is the quality of the images it generates, putting runtime aside, pretty good?
They're reasonable. Yeah, I would say a lot of the best results are from GANs these days, but it can be
hard to tell how much of that is based on who's studying which type of algorithm, if that makes sense.
The amount of effort invested in it. Yeah, or like the kind of expertise. So a lot of people who
have traditionally been excited about graphics or art and things like that have gotten interested
in GANs. And to some extent it's hard to tell, are GANs doing better because they
have a lot of graphics and art experts behind them, or are GANs doing better
because they're more computationally efficient, or are GANs doing better because
they prioritize the realism of samples over the accuracy of the density function.
I think all those are potentially valid explanations,
and it's hard to tell.
So can you give a brief history of GANs from 2014,
your paper, to today?
Yeah, so a few highlights in the first paper,
we just showed that GANs basically work.
If you look back at the samples we had now,
they look terrible.
On the CIFAR-10 data set,
you can't even recognize objects in them.
Your paper used CIFAR-10?
We used MNIST, which is little handwritten digits.
We used the Toronto Face Database,
which is small gray-scale photos of faces.
We did have recognizable faces.
My colleague Bing Xu put together
the first GAN face model for that paper.
We also had the CIFAR-10 data set, which is things like very small, 32 by 32 pixel images of cars
and cats and dogs.
For that, we didn't get recognizable objects, but all the deep learning people back then were
really used to looking at these failed samples and kind of reading them like tea leaves.
And people who are used to reading the tea leaves recognize that our tea leaves at least look different.
Maybe not necessarily better, but there was something unusual about them.
And that got a lot of us excited.
One of the next really big steps was LAPGAN by Emily Denton and Soumith Chintala at Facebook
AI Research, where they actually got really good high-resolution photos working with GANs
for the first time.
They had a complicated system where they generated the image starting at low res and then
scaling up to high res, but they were able to get it to work. And then, in 2015, I believe later that same year,
Alec Radford and Soumith Chintala and Luke Metz
published the DCGAN paper, which stands for deep convolutional GAN.
It's kind of a non-unique name because these days basically all GANs and even some before that
were deep and convolutional,
but they just kind of picked a name for a really great recipe
where they were able to actually using only one model
instead of a multi-step process,
actually generate realistic images of faces
and things like that.
That was sort of like the beginning of the Cambrian explosion
of GANs, like once you had animals that had a backbone,
you suddenly got lots of different versions of fish
and four-legged animals and things like that.
So DCGAN became kind of the backbone
for many different models that came out.
Used as a baseline even still.
Yeah, yeah.
And so from there, I would say some interesting things we've
seen are there's a lot you can say about how just the quality of standard image generation
GANs has increased, but what's also maybe more interesting on an intellectual level
is how the things you can use GANs for has also changed.
One thing is that you can use them to learn classifiers without having to have
class labels for every example in your training set.
So that's called semi-supervised learning.
My colleague at OpenAI, Tim Salimans, who's at Brain now, wrote a paper called Improved
Techniques for Training GANs.
I'm a co-author on this paper, but I can't claim any credit for this particular part.
One thing he showed on the paper is that you can take the GAN discriminator and use it as a classifier that actually tells you, you know, this image is a cat, this image is a dog, this image is a car, this image is a truck.
And so not just to say whether the image is real or fake, but if it is real to say specifically what kind of object it is.
And he found that you can train these classifiers with far fewer labeled examples than traditional
classifiers.
So if you supervise based on also not just your discrimination ability, but your ability
to classify, you're going to do much, you're going to converge much faster to being effective
at being a classifier.
Yeah.
So for example, for the MNIST dataset, you want to look at an image of a handwritten
digit and say whether it's a zero, a one, or a two, and so on. To get down to less than 1% error
required around 60,000 examples until maybe about 2014 or so. In 2016, with this semi-supervised GAN project, Tim was able to get below 1% error
using only 100 labeled examples.
So that was about a 600x decrease in the amount of labels
that he needed.
He's still using more images than that,
but he doesn't need to have each of them labeled as,
this one's a one, this one's a two, this one's a zero,
and so on.
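A sketch of that idea: the discriminator outputs the K real classes plus one extra "fake" class, so labeled, unlabeled, and generated images all provide training signal. This is a simplified reading of the approach, with assumed PyTorch-style stand-ins for the models:

```python
import torch
import torch.nn.functional as F

K = 10  # e.g. the ten digit classes; class index K means "fake"

def semi_supervised_d_loss(discriminator, generator,
                           labeled_x, labels, unlabeled_x, z_dim=100):
    # Labeled real data: predict the correct digit class.
    supervised = F.cross_entropy(discriminator(labeled_x), labels)

    # Unlabeled real data: should be some real class, i.e. not class K.
    p_fake = F.softmax(discriminator(unlabeled_x), dim=1)[:, K]
    unsup_real = -torch.log(1.0 - p_fake + 1e-8).mean()

    # Generated data: should be recognized as the "fake" class K.
    fake = generator(torch.randn(unlabeled_x.size(0), z_dim)).detach()
    fake_targets = torch.full((unlabeled_x.size(0),), K, dtype=torch.long)
    unsup_fake = F.cross_entropy(discriminator(fake), fake_targets)

    return supervised + unsup_real + unsup_fake
```

Here the discriminator is assumed to output K + 1 logits, so the same network doubles as the classifier that needs far fewer labels.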
Then for GANs to be able to generate recognizable objects, so objects for
a particular class, you still need labeled data, because you need to know what it means
to be a particular class, cat or dog.
How do you think we can move away from that?
Yeah, some researchers at Brain Zurich actually just released a really great paper on
semi-supervised GANs where their goal isn't to classify, it's to make recognizable objects,
despite not having a lot of labeled data. They were working off of DeepMind's BigGAN project,
and they showed that they can match the performance of BigGAN using only 10%, I believe, of the labels.
BigGAN was trained on the ImageNet data set, which is about 1.2 million images, and had all of them labeled.
This latest project from Brain Zurich shows that they're able to get away with only having about 10%
of the images labeled. And they do that essentially using a clustering algorithm
where the discriminator learns to assign the objects to groups
and then this understanding that objects can be grouped
into similar types helps it to form more realistic ideas
of what should be appearing in the image
because it knows that every image it creates
has to come from one of these archetypal groups rather than just being some arbitrary
image.
If you're training a GAN with no class labels, you tend to get things that look sort of
like grass or water or brick or dirt, but without necessarily a lot going on in them.
And I think that's partly because if you look at a large image net image,
the object doesn't necessarily occupy the whole image. And so you learn to create realistic
sets of pixels, but you don't necessarily learn that the object is the star of the show.
And you want it to be in every image you make.
Yeah, I've heard you talk about the horse-to-zebra CycleGAN mapping and how, it turns out, kind of
thought-provoking, that horses are usually on grass
and zebras are usually on drier terrain.
So when you're doing that kind of generation,
you're going to end up generating greener horses
or whatever.
So those are connected together.
It's not like you're able to segment out the object
and generate it separately from the background.
So are there other types of games you come across
in your mind that neural networks can play with each other
to be able to solve problems?
Yeah, the one that I spend most of my time on is in security,
you can model most interactions as a game
where there's attackers trying to break your system
and the defender trying to build a resilient system.
There's also domain adversarial learning,
which is an approach to domain adaptation
that looks really a lot like GANs.
The authors had the idea before the GAN paper came out. Their paper came out a little bit later.
They were very nice and cited the GAN paper,
but I know that they actually had the idea before it came out.
Domain adaptation is when you want to train a machine learning model in one setting,
called a domain,
and then deploy it in another domain later.
And you would like it to perform well in the new domain, even though the new domain is different
from how it was trained.
So for example, you might want to train on a really clean image dataset, like ImageNet,
but then deploy on user's phones, where the user is taking pictures in the dark and pictures
while moving quickly
and just pictures that aren't really centered
or composed all that well.
When you take a normal machine learning model,
it often degrades really badly when you move to the new domain
because it looks so different from what the model was trained on.
Domain adaptation algorithms try to smooth out that gap.
And the domain adversarial approach is based on training a feature extractor,
where the features have the same statistics, regardless of which domain you extracted them on.
So in the domain adversarial game, you have one player that's a feature extractor,
and another player that's a domain recognizer.
The domain recognizer wants to look at the output of the feature extractor,
and guess which of the two domains
the features came from.
So it's a lot like the real versus fake
discriminator in GANs.
And then the feature extractor,
you can think of as loosely analogous
to the generator in GANs,
except what it's trying to do here is
both fool the domain recognizer
into not knowing which domain the data came from
and also extract features that are good for classification.
So at the end of the day, you can,
in the cases where it works out,
you can actually get features that work
about the same in both domains.
Sometimes this has a drawback where,
in order to make things work the same in both domains,
it just gets worse at the first one.
But there are a lot of cases where it actually works out well on both.
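A sketch of those two players, using a gradient reversal layer so that the same backward pass trains the feature extractor to fool the domain recognizer; `features`, `label_head`, and `domain_head` are assumed stand-in modules, and for simplicity the batch shown is fully labeled:

```python
import torch
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    # Identity on the forward pass; flips the gradient on the backward pass,
    # so the feature extractor is pushed to confuse the domain recognizer.
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def domain_adversarial_loss(features, label_head, domain_head, x, y, domain):
    # domain: 0 for the source domain, 1 for the target domain (long tensor);
    # domain_head outputs two logits, label_head outputs class logits.
    f = features(x)
    class_loss = F.cross_entropy(label_head(f), y)
    domain_loss = F.cross_entropy(domain_head(GradientReversal.apply(f)), domain)
    return class_loss + domain_loss
```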
Do you think of GANs being useful in the context of data augmentation?
Yeah. One thing you could hope for with GANs is you could imagine
I've got a limited training set and I'd like to make
more training data to train something else like a classifier.
You could train the GAN on the training set and then create more data,
and then maybe the classifier would perform better on the test set
after training on this bigger GAN generated data set.
So that's the simplest version of something you might hope would work.
I've never heard of that particular approach working,
but I think there's some closely related things
that I think could work in the future
and some that actually already have worked.
So if we think a little bit about what we'd be hoping for,
if we used the GAN to make more training data,
we're hoping that the GAN will generalize
to new examples better than the classifier would have
generalized if it was trained on the same data.
And I don't know of any reason to believe that the GAN would
generalize better than the classifier would.
But what we might hope for is that the GAN could generalize
differently from a specific classifier.
So one thing I think is worth trying that I haven't personally
tried, but someone could try is, what if you trained a whole
lot of different generative models on the same training set,
create samples from all of them,
and then train a classifier on that.
Because each of the generative models
might generalize in a slightly different way,
they might capture many different axes of variation
that one individual model wouldn't.
And then the classifier can capture all of those ideas
by training in all of their data.
So it would be a little bit like making an ensemble
of classifiers.
An ensemble of GANs, yeah. In a way, I think that could generalize better. The other
thing that GANs are really good for is not necessarily generating new data that's exactly
like what you already have. But by generating new data that has different properties from
the data you already had, one thing that you can do is you can create differentially private data.
So suppose that you have something like medical records
and you don't want to train a classifier on the medical records
and then publish the classifier because someone might be able to reverse
engineer some of the medical records you trained on.
There's a paper from Casey Greene's lab that shows how you can train
a GAN using differential privacy.
And then the samples from the GAN still have the same differential privacy guarantees
as the parameters of the GAN.
So you can make fake patient data for other researchers to use,
and they can do almost anything they want with that data
because it doesn't come from real people.
And the differential privacy mechanism gives you clear guarantees
on how much the original people's data has been protected.
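The core mechanism behind that kind of guarantee, sketched very roughly as differentially private SGD: clip each example's gradient and add Gaussian noise before updating. The function and parameter names here are illustrative, not taken from the paper mentioned:

```python
import torch

def dp_sgd_step(model, loss_fn, optimizer, batch_x, batch_y,
                clip_norm=1.0, noise_multiplier=1.1):
    # Per-example gradients, each clipped to a fixed norm.
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = float(torch.sqrt(sum(g.pow(2).sum() for g in grads)))
        scale = min(1.0, clip_norm / (norm + 1e-8))
        for s, g in zip(summed, grads):
            s += scale * g

    # Add noise calibrated to the clipping norm, then take the step;
    # the noise is what buys the differential privacy guarantee.
    n = len(batch_x)
    for p, s in zip(model.parameters(), summed):
        p.grad = (s + noise_multiplier * clip_norm * torch.randn_like(s)) / n
    optimizer.step()
```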
That's really interesting.
Actually, I haven't heard you talk about that before.
In terms of fairness, I've seen from Triple AI,
your talk, how can adversarial machine learning
help models be more fair with respect
to sensitive variables?
Yeah.
So there's a paper from Amos Storkey's lab
about how to learn machine learning models
that are incapable of using specific variables.
So say, for example, you wanted to make predictions
that are not affected by gender.
It isn't enough to just leave gender
out of the input to the model.
You can often infer gender
from a lot of other characteristics.
Like, say that you have the person's name,
but you're not told their gender.
Well, if their name is Ian, they're kind of obviously a man.
So what you'd like to do is make a machine learning model
that can still take in a lot of different attributes
and make a really accurate, informed prediction,
but be confident that it isn't reverse engineering gender
or another sensitive variable internally.
You can do that using something very similar to the domain adversarial approach, where you have one player that's a feature extractor, and another player that's a feature analyzer.
And you want to make sure that the feature analyzer is not able to guess the value of the sensitive
variable that you're trying to keep private. Right. I love this approach.
Yeah. With the feature,
you're not able to infer the sensitive variables.
Yeah. It's quite brilliant and simple actually.
Another way I think that
GANs in particular could be used for fairness would be to make something like
a CycleGAN where you can take data from one domain and convert it into
another. We've seen CycleGAN turning horses into
zebras. We've seen other unsupervised GANs made by Ming-Yu
Liu doing things like turning day photos into night photos.
I think for fairness, you could imagine taking records for
people in one group and transforming
them into analogous people in another group and testing to see if they're treated equitably
across those two groups.
There's a lot of things that would be hard to get right to make sure that the conversion
process itself is fair, and I don't think it's anywhere near something that we could actually
use yet.
But if you could design that conversion process very carefully, it might give you a way of doing audits where you
say, what if we took people from this group, converted them
into equivalent people in another group, does the system actually
treat them how it ought to?
That's also really interesting.
In popular press and in general in our imagination you think well
GANs are able to generate data, and you start to think about deep fakes, or
being able to sort of maliciously generate data that fakes the identity of
other people. Is this something of a concern to you? Is this something if you look
10, 20 years into the future, is that something
that pops up in your work, in the work of the community that's working on generating
models?
I'm a lot less concerned about 20 years from now than the next few years.
I think there will be a kind of bumpy cultural transition as people encounter this idea
that there can be very realistic videos and audio that aren't real.
I think 20 years from now, people will mostly understand that you shouldn't believe something is real just
because you saw a video of it. People will expect to see that it's been cryptographically
signed or have some other mechanism to make them believe that the content is real. There's
already people working on this, like there's a startup called Truepic that provides a lot of mechanisms for authenticating that an image is real. They're maybe not
quite up to having a state actor try to evade their verification techniques, but it's
something that people are already working on and I think we'll get right eventually.
So you think authentication will eventually win out.
So being able to authenticate that this is real and this is not, as opposed to GANs just
getting better and better, or in general the models being able to get better and better
to where we can't tell the nature of what is real.
I don't think we'll ever be able to look at the pixels of a photo and tell you for sure that it's real or not real.
And I think it would actually be somewhat dangerous to rely on that approach too much.
If you make a really good fake detector and then someone's able to fool your fake detector
and your fake detector says this image is not fake, then it's even more credible than
if you've never made a fake detector in the first place. What I do think we'll get to is systems that we can kind of use behind the scenes to make estimates of what's going on and maybe
not like use them in court for a definitive analysis. I also think we will likely get better
authentication systems where, you know, I imagine that every phone cryptographically signs everything
that comes out of it.
You wouldn't be able to conclusively tell that an image was real, but you would be able
to tell somebody who knew the appropriate private key for this phone was actually able
to sign this image and upload it to this server at this time stamp.
So you could imagine, maybe you make phones that have the private keys hardware embedded
in them.
If like a state security agency really wants to infiltrate the company, they could probably
plant a private key of their choice or break open the chip and learn the private key or
something like that.
But it would make it a lot harder for an adversary with fewer resources to fake things.
For most of us it would be okay.
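A minimal sketch of the signing scheme imagined here, assuming a software-generated Ed25519 key pair via the Python `cryptography` package: a device signs the bytes it captures, and anyone holding the public key can verify them. In the scenario described, the private key would be embedded in phone hardware rather than generated like this, and the payload and filenames are purely illustrative.

```python
# Sketch: a device signs an image payload; a verifier checks the signature.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

device_key = Ed25519PrivateKey.generate()     # in the scenario above: burned into the device
public_key = device_key.public_key()          # published / registered with a server

image_bytes = b"...raw image data plus timestamp..."  # illustrative payload
signature = device_key.sign(image_bytes)

try:
    public_key.verify(signature, image_bytes)
    print("Signature valid: this device signed exactly these bytes.")
except InvalidSignature:
    print("Signature invalid: the bytes were altered or signed by another key.")
```

This doesn't prove the scene in front of the camera was real, only that a particular key signed particular bytes at a particular time, which matches the weaker guarantee described above.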
So you mentioned the beer and the bar and the new ideas.
Are you able to implement this or come up with this new idea pretty quickly and implement it pretty quickly?
Do you think there are still many such groundbreaking ideas and deep learning that could be developed so quickly?
Yeah, I do think that there are a lot of ideas that can be developed really quickly.
Gans were probably a little bit of an outlier on the whole like one-hour time scale,
but just in terms of like low resource ideas where you do something really different on the
algorithm scale and get a big payback.
I think it's not as likely that you'll see that in terms of things like core machine learning technologies,
like a better classifier or a better reinforcement learning algorithm or a better generative model.
If I had the GAN idea today, it would be a lot harder to prove that it was useful than it was back in 2014 because I would need to get it
running on something like ImageNet or CelebA at high resolution. Those take a while to train.
You couldn't train it in an hour and know that it was something really new and exciting.
Back in 2014, training on MNIST was enough. But there are other areas of machine learning where I think a new idea could actually be
developed really quickly with low resources.
What's your intuition about what areas of machine learning are ripe for this?
Yeah, so I think fairness and interpretability are areas where we just really don't have any idea
how anything should be done yet.
Like for interpretability, I don't think we even
have the right definitions.
And even just defining a really useful concept,
you don't even need to run any experiments,
could have a huge impact on the field.
We've seen that, for example, in differential privacy
that Cynthia Dwork and her collaborators
made this technical definition of privacy, where
before, a lot of things were really mushy, and then with that definition, you could actually
design randomized algorithms for accessing databases and guarantee that they preserved
individual people's privacy in a mathematical quantitative sense.
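For reference, the standard epsilon-differential-privacy definition from Dwork and collaborators is the kind of precise, measurable concept being pointed to here: a randomized mechanism M is epsilon-differentially private if, for all datasets D and D' differing in one person's record and all output sets S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].
```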
Right now, we all talk a lot about how interpretable different machine learning algorithms are,
but it's really just people's opinion.
Everybody probably has a different idea
of what interpretability means in their head.
If we could define some concept related to interpretability,
that's actually measurable,
that would be a huge leap forward,
even without a new algorithm that increases that quantity.
Also, once we had the definition of differential privacy, it was
fast to get the algorithms that guaranteed it. So you could imagine, once we have definitions
of good concepts and interpretability, we might be able to provide the algorithms that
have the interpretability guarantees quickly too.
What do you think it takes to build a system with human level intelligence as we quickly
venture into the philosophical?
So artificial general intelligence.
What do you think it takes?
I think that it definitely takes better environments than we currently have for training agents, in that we want them to have a really wide diversity of experiences.
I also think it's gonna take really a lot of computation.
It's hard to imagine exactly how much.
So you're optimistic about simulation,
simulating a variety of environments as the path forward,
as well as operating in the real world.
I think it's a necessary ingredient.
Yeah, I don't think that we're going to get
to artificial general intelligence
by training on fixed
data sets or by thinking really hard about the problem.
I think that the agent really needs to interact and have a variety of experiences within
the same lifespan.
And today we have many different models that can each do one thing and we tend to train
them on one data set or one RL environment.
Sometimes there are actually papers about
getting one set of parameters to perform well
in many different RL environments,
but we don't really have anything like an agent
that goes seamlessly from one type of experience
to another and really integrates all the different things
that it does over the course of its life.
When we do see multi-agent environments, they tend to be similar environments; all of the agents are playing the same action-based video game. We don't really have an agent that goes from playing a video game to reading the Wall Street Journal to predicting how effective a molecule will be as a drug, or something like that.
What do you think is a good test for intelligence in your view?
There have been a lot of benchmarks, starting with Alan Turing and natural conversation being a good benchmark for intelligence. What would Ian Goodfellow sit back and be really damn impressed by, if a system was able to accomplish it?
Something that doesn't take a lot of glue
from human engineers.
So imagine that instead of having to go to the CIFAR website and download CIFAR-10 and then write a Python script to parse it and all that, you could just point an agent at the CIFAR-10 problem and it downloads and extracts the data and trains a model and starts giving you predictions. I feel like something that doesn't need to have every step of the pipeline assembled for it definitely understands what it's doing.
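For contrast, here is a sketch of the kind of "glue" a human currently supplies before any learning happens: fetching CIFAR-10 and wiring it into a data loader. The point being made is that an impressive system would do this step itself from a description of the task. This uses torchvision; the download path is illustrative.

```python
# Human-written glue: download CIFAR-10 and prepare batches for training.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(
    root="./data",              # illustrative download location
    train=True,
    download=True,              # a human decided this; ideally the agent would
    transform=transforms.ToTensor(),
)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # e.g. torch.Size([128, 3, 32, 32]) torch.Size([128])
```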
Is AutoML moving into that direction?
Are you thinking way even bigger?
AutoML has mostly been moving toward, once we've built all the glue, can the machine learning
system design the architecture really well?
So, it's worth saying, like, if something knows how to pre-process the data so that it successfully
accomplishes the task, then it would be very hard to argue that it doesn't truly understand
the task in some fundamental sense.
And I don't necessarily know that that's the philosophical definition of intelligence,
but that's something that would be really cool to build, that would be really useful,
and would impress me, and would convince me that we've made a step forward in real AI. So you give it, like, the URL for Wikipedia, and then the next day expect it to be able to solve CIFAR-10. Or, like, you type in a paragraph explaining what you want it to do and it figures out what web searches it should run and downloads all the necessary ingredients.
So you have a very clear, calm way of speaking. No ums. Easy to edit. I've seen comments where both you and I have been identified as potentially being robots. If you had to prove to the world that you are indeed human, how would you do it?
Well, I can understand thinking that I'm a robot.
It's the flip side of the Turing test, I think.
Yeah, the prove your human test.
Intellectually. You'd say you have to... is there something that's truly unique, in your mind? Does it go back to just natural language again, just being able to talk about it?
So, proving that I'm not a robot with today's technology.
Yeah, that's pretty straightforward.
Like my conversation today hasn't veered off into, you know, talking about the stock market
or something because of my training data.
But I guess more generally trying to prove
that something is real from the content alone
is incredibly hard.
That's one of the main things I've gotten out
of my GAN research that you can simulate almost anything.
And so you have to really step back
to a separate channel to prove that something is real.
So I guess I should have had myself
stamped on a blockchain when I was born or something,
but I didn't do that.
So according to my own research methodology, there's just no way to know at this point.
So, last question: what problem stands out for you that you're really excited about challenging in the near future?
I think resistance to adversarial examples, figuring out how to make machine learning secure
against an adversary who wants to interfere with it and control it, that is one of the most important things researchers today could solve.
In all domains, image, language, driving, and everything.
I guess I'm most concerned about domains we haven't really encountered yet.
Like imagine 20 years from now when we're using advanced AIs to do things we haven't even thought of yet.
Like if you ask people, what are the important problems in security of phones in like
2002, I don't think we would have anticipated that we're using them for nearly as many things
as we're using them for today.
I think it's going to be like that with AI that you can kind of try to speculate about
where it's going, but really the business opportunities that end up taking off would be hard to predict
ahead of time. What you can predict ahead of time is that almost anything you can do with machine
learning, you would like to make sure that people can't get it to do what they want rather than what
you want just by showing it a funny QR code or a funny input pattern.
And you think that the set of methodologies to do that can be bigger than any one domain.
And that's the...
I think so, yeah. Yeah. Like one methodology that I think is not a specific methodology,
but like a category of solutions that I'm excited about today is making dynamic models that change
every time they make a prediction.
So right now, we tend to train models,
and then after they're trained, we freeze them,
and we just use the same rule to classify
everything that comes in from then on.
That's really a sitting duck from a security point of view.
If you always output the same answer for the same input,
then people can just run inputs through until they find a mistake that benefits them.
And then they use the same mistake over and over and over again.
I think having a model that updates its predictions so that it's
harder to predict what you're going to get will make it harder for
an adversary to really take control of the system and make it do what
they want it to do.
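One crude instantiation of that "moving target" idea, as a sketch only: a classifier that perturbs its weights with fresh random noise on every query, so the exact same input does not reliably reproduce the exact same mistake. This is an assumption about one possible form of a dynamic model, not a method endorsed in the conversation; the architecture and noise scale are arbitrary.

```python
# Toy sketch of a classifier whose effective weights change on every prediction.
import torch
import torch.nn as nn

class DynamicClassifier(nn.Module):
    def __init__(self, in_dim=784, num_classes=10, noise_std=0.01):
        super().__init__()
        self.linear = nn.Linear(in_dim, num_classes)
        self.noise_std = noise_std

    def forward(self, x):
        # Perturb a copy of the weights independently for every query.
        w = self.linear.weight + self.noise_std * torch.randn_like(self.linear.weight)
        b = self.linear.bias + self.noise_std * torch.randn_like(self.linear.bias)
        return x @ w.t() + b

model = DynamicClassifier()
x = torch.randn(1, 784)
# Two queries on the same input may disagree near the decision boundary,
# which is exactly what makes repeated exploitation of one mistake harder.
print(model(x).argmax(dim=1), model(x).argmax(dim=1))
```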
Yeah, models that maintain a bit of a sense of mystery
about them, because they always keep changing.
Yeah.
And thanks so much for talking today.
Oh, awesome.
Thank you for coming in.
That's great to see you.